Sunday, 2 September 2012

Lesson 5: Input Styles




It's time to dig a little deeper into the technical details of data reading.  Before we do that, it is important to make clear what we mean by "reading" and "writing" in a data step.  "Reading" is the process whereby SAS interprets raw data from a file or an instream cards section of the program, or when it accesses an existing data set.  Thus, "reading" means SAS is bringing data values into computer memory to process in the data step.  "Writing" is the process of saving the finished data to a SAS data set file on a hard drive (or other computer storage device).  It is NOT the same as "printing," a term by which we usually mean "use proc print to display text in the Output Window."
Imagine that you are a computer and are instructed to read a file.  What do you actually do?
Well, as a human, how do you read a book?  First you have to open it.  You find the starting place, and you begin reading, which at the most fundamental level means you read a character, interpret it, and move on to the next one, and repeat.  Upon reaching the end of a word, you interpret the word, and move on to the next one.   Upon reaching the end of a line, you move down a line and go back to the left and read the first character, and so on, until you reach the end of the book, at which point you close it and stop.  Well, something like that, anyway.  For us, so much happens automatically, we don't have to think about it.  But, a computer doesn't think at all.  It merely follows a sequence of instructions, and this sequence, in some ways, is similar to processes we humans follow when reading a book.
When SAS opens a file to read it, it creates two pointers, which are nothing more than numbers for keeping track of position in the file.  One pointer is a position in a line of text (column), the other is the line number in the file.  (We are assuming a standard text file here.  SAS can also read binary files, but that is another subject.)  The pointers are initially at column 1, line 1.  As SAS reads data, the pointer moves along the line (not literally, of course, the number just changes).  SAS needs to know when to begin reading a value and when to stop. What is obvious to us is not necessarily obvious to a computer.  Consider the following line:
12 33 42 51 24
Do you see five numbers?  How about this:
1233425124
If I told you this was a string of five two-digit numbers, you would know immediately how to interpret it.  We are now going to consider what must be done to tell SAS how to interpret lines of data in ways somewhat analogous to how a person would interpret them.
SAS has four input styles.  It is easiest to learn about them one at a time, but they can actually be used together (mixed input) on the same file in almost any order.  The input styles are list, column, formatted, and named input.
List input is perhaps the most straightforward.  The data values are simply given in a "list," one after the other, with a delimiter between them.  The delimiter is often just a space (that's the default), but may be a comma, tab, or other character.  List input, in the simplest case, can be recognized if all you see in the input statement is variable names and dollar signs following the character variables.  Typically, the data will not be aligned in columns.
To make our example easier to follow, the data is shown in a "cards" section, but we will discuss it as though it were really an external file.  Some differences involved in actually using an external file will be dealt with later.  Notice that the data values are separated by a single space and are not aligned in columns.  The input statement contains only variable names and one dollar sign.
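Here is a minimal sketch of list input along these lines.  The data set name, the names, and the values are made up for illustration.  The input statement has nothing but variable names and one dollar sign.

```sas
/* List input: values separated by single spaces, not aligned in columns. */
/* The data below are hypothetical.                                       */
data people;
   input name $ age height;
   cards;
Alice 34 65.5
Bob 27 70
Christopher 41 68.25
;
run;

proc print data=people;
run;
```

The third name is longer than eight characters on purpose, so you can see what list input does with the default character length.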

When SAS processes the input statement, it creates, in memory, an "input vector," which is temporary storage for all the variables in one observation.  At this point, the attributes of the variables (length, character vs. numeric) are fixed.  As the data step goes through its commands, it fills up the fields of the input vector, and when it gets to the end, it outputs one observation to the data set.  If there is more to do, it goes back and starts over, first clearing the input vector, then filling it up again.  One pass like this is known as a data step iteration.
The default length for character variables is eight characters, so in our example, that is the length of the "name" variable.  Age and height are numeric variables, also taking up eight bytes of memory, but that has no bearing on the number of characters to be read.  As reading begins, the pointers are set to line 1, column 1.  SAS is going to read up to eight characters into the variable "name."  However, in the list input style, it first checks to see if the current character is a space (or other blank character).  If so, it advances to the next, and keeps going until it comes to a non-blank character.  Then it starts reading characters into the input vector.  When it comes to another blank, it stops.  If the data value has more than eight characters, SAS will read the first eight and save them, but will then continue advancing the pointer until a space is encountered, then stop. (Longer character variables can be specified using the "colon modifier," ":$w." as has been shown in a previous section.  The colon causes SAS to read in list style even when there is an informat present.  There is also an "ampersand modifier" which allows use of a character informat and reads embedded spaces.  Two or more consecutive spaces then signal the end of the field.)
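The two modifiers can be sketched as follows.  The variable names, informat widths, and data are assumptions chosen for illustration.  Note the two spaces after each city, which mark the end of that field.

```sas
/* Hypothetical data illustrating the colon and ampersand modifiers.   */
data cities;
   /* :  list-style read, with the informat setting the length to 12   */
   /* &  single embedded blanks are part of the value; two or more     */
   /*    consecutive blanks end the field                              */
   input name :$12. city & $20. age;
   cards;
Christopher New York  41
Jo Sioux Falls  18
;
run;

proc print data=cities;
run;
```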
The second variable, "age," is a number.  With list input, SAS can only read numbers in "standard" format, which essentially means there can be only numerals, decimals, and minus signs present.  SAS again advances the pointer to the first non-blank character and begins reading the number until it encounters a space.  The number of columns read is immaterial this time, as SAS will simply store the number with as much precision as it can, regardless of how many numerals were given.  When it reaches a space, SAS saves the number to the input vector and stops.  The third variable, "height," is read the same way, except that we are now at the end of a line.  Upon reaching the end of the line, SAS saves the value, moves the column pointer back to 1, and advances the line pointer to the next line.  As it has come to the end of the input statement as well, it considers the observation complete and writes the contents of the input vector to the output data set.  It then clears the input vector and starts over, repeating until the end of the file is reached.
Missing values for character variables require special care with list input, since spaces are interpreted as delimiters and cannot then represent missing values.  Missing values may have to be coded using a word or symbol (e.g., "missing").  Options that change the delimiter (discussed later) can also solve this problem.  Numeric missing values are represented by a decimal point (period).  If values are missing at the end of a line (without a placeholder of some kind), the infile option missover can be used to set the remaining variables to missing and go on to the next line.  If this is not done, SAS will try to read the variables from the next line.
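A small sketch of both ideas, using made-up data: a period stands in for a missing numeric value in the middle of a line, and missover handles a value missing at the end of a line.  The "infile cards" statement is there only so the option can be applied to instream data.

```sas
/* Hypothetical data: Bob has no age, Carol has no height. */
data people;
   infile cards missover;   /* do not go to the next line for missing trailing values */
   input name $ age height;
   cards;
Alice 34 65.5
Bob . 70
Carol 29
;
run;

proc print data=people;
run;
```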
Column input requires the data to be arranged in aligned columns.  The key element of column input is that you give a range of columns after each variable name, or after the dollar sign of a character variable, in the form of start-end.  Informats cannot be used in the input statement with column input. (They can be specified in a separate informat statement, introduced in the next lesson.)   There is no need to worry about the column pointer in this case, as SAS changes it according to the ranges specified.  The ranges need not be in order.  They can even overlap or be re-read.

The next example illustrates some of the advantages of column input.  Notice that the character variable will read whatever is in the associated columns and is automatically set to the length specified by the range of columns.  Even spaces and special characters are correctly read and stored.  Missing numeric values need not be coded with a period, as spaces will be correctly interpreted as missing.
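A sketch of this kind of column input follows.  The column ranges and the data are assumptions made up for illustration.  The name occupies columns 1-12, so it gets a length of 12 and keeps its embedded spaces and punctuation.  The second line has blanks where the age would be.

```sas
/* Column input with hypothetical data aligned in columns 1-12, 14-15, 17-20. */
data staff;
   input name $ 1-12 age 14-15 height 17-20;
   cards;
Mary Ann Lee 34 65.5
O'Neil, J.      70.0
Bob          27 68
;
run;

proc print data=staff;
run;
```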

One of the best features of column input is that no delimiters are required.  Many mainframe programs produce this kind of file, which has no extra spaces or delimiters.  This can greatly reduce the size of a file.  As long as the column positions are known, there is no difficulty with reading this data in SAS.
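As a sketch, here is a made-up file layout with no delimiters at all: a four-character ID followed by three two-digit scores.  The variable names and positions are hypothetical.

```sas
/* Column input on packed data: no spaces or delimiters between fields. */
data scores;
   input id $ 1-4 q1 5-6 q2 7-8 q3 9-10;
   cards;
A001123342
A002514224
;
run;

proc print data=scores;
run;
```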

Formatted input relies on informats and pointer controls to determine where and what to read into each variable.  In the example below, the @n syntax gives the starting column for each variable.  It repositions the column pointer to that column.  The "name" variable has a default length of eight.  The numeric variables are read until a space is encountered.  It is important to note that the syntax here is the @n comes in front of the variable name, and the informat comes in its usual position after the variable name.
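A minimal sketch of the @n syntax with informats is given here.  The starting columns, informat widths, and data are my own assumptions for illustration, and every variable is given an explicit informat.

```sas
/* Formatted input: @n sets the starting column, the informat sets the width. */
/* Hypothetical layout: name in 1-10, age in 12-13, bday in 15-24.            */
data people;
   input @1 name $10. @12 age 2. @15 bday mmddyy10.;
   format bday date9.;
   cards;
Alice      34 02/14/1987
Bob        27 11/03/1994
;
run;

proc print data=people;
run;
```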

Variables need not be in left-to-right order.  Informats may be used as needed.

Even though I have shown the @n for each variable here, it is not really necessary in all cases.  The initial "@1" is not required, as the pointer starts out in column 1.  With formatted input, the column pointer will end up at the next character past the field that was just read.  If this is the start of the next variable, you are ready to go and do not need to reposition the pointer.  The syntax "+n" may be used to advance the pointer a specified number of columns.  Just as with column input, delimiters are not necessary when using formatted input.
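The "+n" syntax might be used like this, again with a made-up layout (name in columns 1-10, age in 12-13, bday in 15-24).  There is no "@1", and each "+1" skips the single blank column between fields.

```sas
/* Formatted input using relative pointer moves instead of @n. */
data people;
   input name $10. +1 age 2. +1 bday mmddyy10.;
   format bday date9.;
   cards;
Alice      34 02/14/1987
Bob        27 11/03/1994
;
run;

proc print data=people;
run;
```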

Because formatted input and column input cause SAS to read a specified number of columns, a problem occurs if the end of the line varies, as in the example of the dates given below.  When the data are instream, SAS does not have a problem.  But if the file is external, a line that is too short will cause an error.  The truncover option in the infile statement will cause SAS to treat short lines as if there were spaces in the file to fill out the remainder of the field.  If there is no data at all for the field, the variable will be set to missing.  The syntax is demonstrated here with instream data, but is only necessary for external files.
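A sketch of the truncover syntax, assuming a made-up layout with the name in columns 1-8 and a date starting in column 10.  The first date is shorter than the ten columns the informat asks for.

```sas
/* truncover: a short line is read as far as it goes instead of */
/* sending SAS to the next line to finish the field.            */
data members;
   infile cards truncover;
   input name $ 1-8 @10 bday mmddyy10.;
   format bday mmddyy10.;
   cards;
John     2/14/97
Linda    12/14/1997
;
run;

proc print data=members;
run;
```

For an external file, the only change would be to replace "cards" in the infile statement with the file reference or quoted path, and to drop the cards section.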

It is often helpful with formatted input, if there are many variables, to line them up vertically in the input statement, with the @'s all in front and the informats after.  I have not shown an example, but with a large number of variables, this makes it much easier to prevent and correct mistakes.
The log below is the result of trying to read this same data from an external file without the truncover option.  Observe that SAS is looking for the "bday" variable at the beginning of line 2, and since what is actually there is a name variable, it declares the data to be invalid.  The reason for this is that the "bday" given in line 1 is only 7 characters in length.  Since there are not enough characters, SAS goes to the next line to look for the variable.  After failing to read "bday," the end of the input statement is reached, so SAS tries to move to the next line to start a new input cycle, but finding the end of the file, it quits and writes only one observation to the data set.

Including the truncover option in the infile statement fixes this problem.

Named input is used when each variable is identified in the data with a variable name and the equals sign.  This input style is not very common.  While named input can be mixed with the other styles, it must be last on the line.  You cannot change back to another style.
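A minimal sketch of named input, with hypothetical variable names and data.  The ID is read with list input first, and everything after it on the line is in name=value form.  For a character variable, the dollar sign goes after the equals sign.

```sas
/* Named input: each value in the data is labeled with its variable name. */
data people;
   input id name=$ age= height=;
   cards;
101 name=Alice age=34 height=65.5
102 name=Bob age=27 height=70
;
run;

proc print data=people;
run;
```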

However, the idea demonstrated here is very useful.  In some data sets, it is difficult to identify where the variables are in the line.  If they are "signaled" by some identifiable text, we can use that text to determine the position of the pointer and use formatted input.  In the example below, we have given a character string in quotes with the @ sign, where we had the column number in previous examples.  SAS scans from the current column until it matches the text in quotes after the @ sign.  Note that a trailing space was included in the quotes in this example.  An alternative would be to leave out the space and increase the field width to 3.  In any case, you want to make sure the pointer ends up in the right place, which will be right after the text in quotes, and the field width matches the location of the value to be read.  In cases like this, a lot of effort is required to determine what consistent properties of the data source can be reliably used.  Careful study of the file and testing of various scenarios is required, or the result may be misread data, quite possibly with no warning of the errors.
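Here is a sketch of the technique.  The marker text ("Age: " and "Height: "), the field widths, and the data lines are all made up for illustration.  Each quoted string ends with a space so that the pointer lands on the first character of the value.

```sas
/* @'text' moves the pointer to just past the matching text on the line. */
data people;
   input @'Age: ' age 2. @'Height: ' height 4.;
   cards;
Alice (member since 1997) Age: 34 Height: 65.5
Bob Age: 27 Height: 70.0
;
run;

proc print data=people;
run;
```

This sketch depends on every age being two columns wide and every height four columns wide, which is exactly the kind of property you would have to verify in a real file.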

If the variable values are not in a consistent order in the data, we can deal with that too.  By using "@1" to reset the pointer back to the beginning of the line, we can make sure nothing is missed:
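A sketch of the "@1" reset, using the same hypothetical marker text as before.  In the second data line the height comes before the age, so without the reset the search for "Height: " would start past the point where it occurs.

```sas
/* @1 sends the pointer back to column 1 before the next text search. */
data people;
   input @'Age: ' age 2. @1 @'Height: ' height 4.;
   cards;
Alice, Age: 34, Height: 65.5
Bob, Height: 70.0, Age: 27
;
run;

proc print data=people;
run;
```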

To finish this off, here is an example of mixed input styles.  All four styles, list, column, formatted, and named, are demonstrated in one input statement.
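A sketch of what such a statement could look like, with a made-up layout: the ID is read with list input, the name with column input (columns 5-12), the birthday with formatted input (starting in column 14), and the last two variables with named input, which comes last.

```sas
/* Mixed input styles in one input statement (hypothetical data). */
data people;
   input id                      /* list      */
         name $ 5-12             /* column    */
         @14 bday mmddyy10.      /* formatted */
         age= height=;           /* named     */
   format bday date9.;
   cards;
101 Alice    02/14/1987 age=34 height=65.5
102 Bob      11/03/1994 age=27 height=70
;
run;

proc print data=people;
run;
```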


Exercises:
For each problem below, write a SAS program that creates a data set using instream data.  Copy and paste the given data segments into the SAS editor. This is important in order to keep the data in the same format as presented.  DO NOT adjust the data, such as by adding spaces or aligning columns.  Include a proc print step in the program, and turn in the editor, output, and log files.   Use appropriate titles and formats, if necessary, in your output.
1.  The variables are Name, Age, Height, Sex, and JoinDate.  Assume that no values will be longer than those included here, and use the shortest possible length for the character variables.  Use list input (with colon modifiers where needed).
John 21 70 M 2/14/97
Jo 18 62 F 3/27/99
Mark 32 68 M 6/22/98
Linda 25 65 F 12/14/97
Carey 27 59 F 8/20/98
2.  The variables and the columns they are found in are LastName (1-15), FirstName (16-30), Address (31-55), City (56-65), State (66-67), and Zip (68-72).  Assume that character variables may fill the entire field width.  Use column input.
Johnson        Michael        121 1st St S             Brookings SD57006
Big Hammer     Beatrice       45031 271st Ave S        Moorhead  MN56560
Helms-Marquart Charlotte      302 N Mason-Dixon Ave    Somewhere DC01221
Cutler         George         Rural Route 2            Zap       ND58563
3.  In this problem, the same variables are used as in the previous problem, but additional variables are included, namely the Social Security Number, Ownership Status, and Move-in Date.  You can see where these new variables are added.  The field widths of the variables that were in the previous problem are unchanged.  Use formatted input to read this data, but change the order of reading to First Name, Last Name, Social Security Number, Ownership Status, Move-in Date, Address, City, State, and Zip. Use at least one example of the "+n" syntax to move the pointer.
503118596Johnson        Michael        Own 121 1st St S             01051974Brookings SD57006
471559684Big Hammer     Beatrice       Rent45031 271st Ave S        10221987Moorhead  MN56560
362995874Helms-Marquart Charlotte      Own 302 N Mason-Dixon Ave    07091991Somewhere DC01221
474843859Cutler         George         RentRural Route 2            12161996Zap       ND58563
