Sunday, 2 September 2012

Lesson 7: Data Set Options, Set and Merge Statements




Data Set Options
"Options" are used in SAS in various places, and with various different kinds of syntax.  For this reason, the concept of options can be a bit confusing.  An earlier lesson introduced the idea of a global system options statement, where we can set things like linesize, page numbering, and so forth.  Data set options are something different.  They are commands added to a basic data step statement to refine or modify the work of the data step, or they can be used to modify how a data set is used in a proc step.  (The term, "options," is also used for any optional command in any SAS statement.  This is why we have to distinguish between system options, data set options, and other options.)
We begin with the drop and keep options.  The examples below show how these two options can be used to accomplish the same thing.  The choice between them is purely a matter of convenience.  The drop option lists the variables to be omitted from the data set, while the keep option lists those which are to remain in the data set.  There could be many reasons why you might not want all of the original variables in your data set.  In the next example, we use the original variables for calculations, but do not include them in the data set.  In other cases, we may create variables that are only used in the program and do not need to be saved.
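As a minimal sketch of the idea (the data set names sums1 and sums2 and the data values are made up for illustration), the two data steps here should produce identical data sets containing only sumxy and prodxy:

```sas
/* Version 1: list the variables to omit */
data sums1 (drop=x y);
   input x y;
   sumxy  = x + y;
   prodxy = x * y;
   cards;
1 2
3 4
5 6
;
run;

/* Version 2: list the variables to retain */
data sums2 (keep=sumxy prodxy);
   input x y;
   sumxy  = x + y;
   prodxy = x * y;
   cards;
1 2
3 4
5 6
;
run;
```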
    
Take note of the syntax used here, as it is unique for data set options.  All the options to be applied to a particular data set are enclosed in parentheses, following the name of the data set to which they apply.  An equal sign after each option is followed by the list of items involved.  Here they are part of the data step, but they are NOT called DATA STEP options, but rather DATA SET options, and they can be used any time a data set is referenced, such as in this proc print statement:
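A small sketch of a data set option in a proc step, assuming a hypothetical data set one with variables x and y:

```sas
data one;
   input x y;
   cards;
1 2
3 4
;
run;

/* The drop option applies only to this use of the data set; */
/* the stored data set one still contains both variables.    */
proc print data=one (drop=y);
run;
```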

(A var statement under proc print would be more appropriate, but this works. Var statements are covered later.)
The Set Statement
The set statement  is used to create a new data set from an existing one.  In the simplest example, it merely copies a data set. The following program will create a new data set called two that is an exact copy of one, providing one exists.
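In its simplest form the copy might look like this (a sketch that assumes a data set named one already exists in the work library):

```sas
/* Copy every variable and observation of one into two */
data two;
   set one;
run;
```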

Now suppose that a data set called one has been created, as shown in the data step below, and that you later wanted to create a new data set called two, with some new variables, but leaving the old one unchanged.  The set statement can be used to copy the original data set and add new variables.
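A sketch of this situation, with made-up values and hypothetical new variables diff and ratio:

```sas
data one;
   input x y;
   cards;
10 2
20 4
30 5
;
run;

/* two gets x and y from one, plus the two new variables */
data two;
   set one;
   diff  = x - y;
   ratio = x / y;
run;
```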

SAS will now go through all the observations in one, and place in the new data set two the original variables together with the new ones that are defined.
There are two data sets now.  Suppose we want to print both of them.  Most procs, including proc print, will use the most recently created data set by default.  Thus the second statement below would print only two.  To specify another data set, use the syntax shown in the first statement below.  Each “proc print” statement prints one data set.
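For instance, assuming one was created first and two second:

```sas
/* Names the data set explicitly, so this prints one */
proc print data=one;
run;

/* No data= given, so this prints the most recently created data set (two) */
proc print;
run;
```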

Using Data Set Options with Set Statements
Next we show how data set options can be applied to either input or output data sets when using the set statement to read another data set.  In the example below, data set one is first created to serve as the input data set.  The data sets two and three will be exactly the same, but the processes by which they are produced differ.  When two is created, all four variables from one are read into memory, then when the data are written to two the variables x and y are eliminated.  When three is created, only sumxy and prodxy are read into memory from one, and subsequently saved to three.  If there are many variables and observations involved, the second method is preferred because it is more efficient (saves system resources).  However, you cannot drop variables from an input data set if you intend to use them in calculations!
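One way this comparison could be written is sketched below; the data values are invented for illustration.

```sas
data one;
   input x y;
   sumxy  = x + y;
   prodxy = x * y;
   cards;
1 2
3 4
5 6
;
run;

/* Option on the OUTPUT data set: all four variables are read, */
/* and x and y are dropped when two is written.                */
data two (drop=x y);
   set one;
run;

/* Option on the INPUT data set: only sumxy and prodxy are read. */
data three;
   set one (keep=sumxy prodxy);
run;
```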

The firstobs and obs options specify the starting and stopping observation number to read.  They do not apply to output data sets, but are used with set statements and in procs where data sets are referenced.  In the first example, a second data set is created using these options to make a subset of the original data.  Note the observation numbers in the output.
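A sketch with six made-up observations; the data set name middle is hypothetical. Because middle is a new data set holding three observations, proc print should number them 1 through 3.

```sas
data one;
   input x y;
   cards;
1 10
2 20
3 30
4 40
5 50
6 60
;
run;

/* Read only observations 2 through 4 of one */
data middle;
   set one (firstobs=2 obs=4);
run;

proc print data=middle;
run;
```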

In the next example, the option is given in the proc print statement.  Compare the observation numbers.  In both cases, the observation numbers correspond to the observations actually stored in the data.  They are not renumbered by proc print.
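A sketch, assuming a data set named one with at least four observations already exists. Here the printed observation numbers should be 2, 3, and 4, since they refer to positions in one itself.

```sas
/* Print only observations 2 through 4 of the stored data set */
proc print data=one (firstobs=2 obs=4);
run;
```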

Finally, we discuss the rename option.  At times it will be necessary to change the names of variables.  As with the other options, this can be done when reading or writing the data.  When used with an output data set, it renames the variables when the data set is saved, and does not affect the names used within the data step, as shown below.  Also, the rename option requires another set of parentheses for the list of variables to be renamed.  The name change is specified as "old name=new name."
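A sketch of renaming on output; the new names width and height and the variable area are hypothetical.

```sas
data one;
   input x y;
   cards;
2 3
4 5
;
run;

/* The rename happens when two is written, so the */
/* data step itself still refers to x and y.      */
data two (rename=(x=width y=height));
   set one;
   area = x * y;
run;
```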

When the rename option is used with an input data set, the new names are in effect during the data step, as in this example:
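A sketch of renaming on input, again with hypothetical names width, height, and area:

```sas
data one;
   input x y;
   cards;
2 3
4 5
;
run;

/* The rename happens as one is read, so the */
/* data step must use the new names.         */
data three;
   set one (rename=(x=width y=height));
   area = width * height;
run;
```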

Using Firstobs and Obs in the Infile Statement
Firstobs and obs can also be used in the infile statement when reading from an external file or cards.  Here, the syntax is different, since they are not connected with a data set, and the options are simply typed on the line following the filename or cards keyword.  There is one other difference--in this case, the numbers refer to the starting and ending line in the data, which is not always the same as the observation (e.g., observations might take up two lines each).
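A sketch using instream data (the data set name sub is made up). Only the second and third data lines should be read:

```sas
data sub;
   /* No parentheses here: these are infile options, */
   /* and the numbers refer to lines of raw data.    */
   infile cards firstobs=2 obs=3;
   input x y;
   cards;
1 2
3 4
5 6
7 8
;
run;
```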

Concatenating Data Sets Using the Set Statement
The set statement can also be used to combine data sets in various ways.  The first way is called concatenation, and is simply combining them in order, one after the other.  The example below uses two data sets, but there can be more.  When using concatenation, all of the observations of all of the listed data sets are included in the result.  All variables from all the sets are included, with missing values assigned in cases where a variable does not occur in the original.
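A sketch of concatenation with invented data. The second data set has an extra variable z, so z should be missing for the observations that come from one.

```sas
data one;
   input x y;
   cards;
1 2
3 4
;
run;

data two;
   input x y z;
   cards;
5 6 7
8 9 10
;
run;

/* All of one, followed by all of two */
data both;
   set one two;
run;

proc print data=both;
run;
```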

Suppose the data sets have different variable names.

You could use data set options to re-align the variables in a case like this.
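As an illustration (variable names a and b are hypothetical), the rename option on the second input data set lines its variables up with x and y, so the result has two variables rather than four:

```sas
data one;
   input x y;
   cards;
1 2
3 4
;
run;

data two;
   input a b;
   cards;
5 6
7 8
;
run;

/* Without the rename, both would contain x, y, a, and b, */
/* with missing values filling the gaps.                  */
data both;
   set one two (rename=(a=x b=y));
run;
```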

Combining Observations with the Set Statement
The second method of combining data sets is called "one-to-one reading" and may be thought of as a side-by-side version of concatenation.  The programming difference is that while in concatenation the source data sets are listed in one set statement, in one-to-one reading each source data set is given in a separate set statement.  If the same variable names occur in more than one source data set, the values of the later sets overwrite the earlier ones.  If the data sets do not have the same number of observations, the result is cut off at the length of the shortest set.  The example below shows two data sets of different lengths and the same variables.  The x variable is changed using a rename option during input.  Thus the source variable is overwritten with the second data set values, but because of the name change, we now have x and y coming from the original x variables.
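A sketch under assumed names: both data sets have variables id and x, and the data set name side is made up. The result should have two observations, with id taken from two (it overwrites the value from one), x from one, and y from the renamed x in two.

```sas
data one;
   input id x;
   cards;
1 10
2 20
3 30
;
run;

data two;
   input id x;
   cards;
101 11
102 21
;
run;

/* One set statement per source data set */
data side;
   set one;
   set two (rename=(x=y));
run;
```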

Interleaving Data with the Set Statement
The third method is called interleaving.  This is like concatenation, except that the observations are combined so that they are sorted, instead of having one data set placed after the other.  The variable(s) on which the sort is done are called by variables, and the interleave is done by adding a by statement to the concatenation program.  The by variables must be sorted before the interleave is done, so if they are not in order to begin with, we use proc sort to do the job.  Proc sort will also have a by statement, listing the by variable(s), just like the data step that does the interleave.
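A sketch of interleaving with invented names and values; the data set name inter is hypothetical.

```sas
data one;
   input name $ x;
   cards;
Carl 3
Alice 1
Eve 5
;
run;

data two;
   input name $ x;
   cards;
Dan 4
Bob 2
;
run;

/* Each source must be sorted by the by variable first */
proc sort data=one;
   by name;
run;

proc sort data=two;
   by name;
run;

/* Same as concatenation, plus a by statement */
data inter;
   set one two;
   by name;
run;
```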

Merging Data Sets
Now, if you're thinking ahead, you must realize that we should be able to take this to another level--how about putting the data side-by-side while matching up variables?  That's called a merge. And that's where we're going next.  But before getting to the programming, there are some things that need to be explained.  We need to be very careful about how things match up.  For convenience, we'll visualize the data sets laid out side by side, so that we have a right and a left data set.  The simplest case occurs when there is exactly one item on the left to match exactly one item on the right.  This is called a one-to-one merge.  If there is exactly one on the left to match more than one on the right, we call it a one-to-many merge, and if the roles are reversed, a many-to-one merge.  Finally, if there are multiple instances of variable values on both sides, it is called a many-to-many merge.  The latter should usually be avoided, but an example will be provided to show what happens.
Our first example will be one-to-one reading again, but this time using a merge statement.  The difference is that the data set continues building until the end of the longest data set is reached.  (In the following examples the variables have been given unique names to avoid needing rename options.)
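A sketch with made-up data sets left and right. The result should have three observations, with c and d missing in the third.

```sas
data left;
   input a b;
   cards;
1 2
3 4
5 6
;
run;

data right;
   input c d;
   cards;
10 20
30 40
;
run;

/* No by statement: observations are paired by position */
data both;
   merge left right;
run;
```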

Next, we do a one-to-one merge.  The data sets will be matched by the name variable, therefore this variable may be called the match key.  The match key is identified using a by statement, and it must be the same in both source data sets.  Just as in interleaving, the data sets must be sorted.  To simplify the example, the data are entered in sorted order. Note that if something doesn't have a match, it is included, with missing values where the matching data should be.  Also, this is a good time to mention that SAS character variables are case sensitive, so that "john" and "John" are two different values.  If your data might have mixed case, the upcase function can be used to convert everything to upper case to ensure it matches.
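A sketch of a match merge with invented data, entered in sorted order. Bob has no match on the right and Dan has none on the left, so each should appear with a missing value.

```sas
data left;
   input name $ age;
   cards;
Alice 30
Bob 25
Carl 41
;
run;

data right;
   input name $ height;
   cards;
Alice 64
Carl 70
Dan 68
;
run;

/* The by statement identifies the match key */
data both;
   merge left right;
   by name;
run;

proc print data=both;
run;
```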

Now consider a one-to-many merge.  We will have one instance of each name on the left, and multiple instances of some names on the right.  There is no change to the program (only the data).  The result is that one observation is produced for each observation on the "many" side, but if there is no match, again the relevant observations on either side are included with missing values.
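A sketch with invented data, where score is a hypothetical variable. Alice and Carl each appear twice on the right, so their age should be repeated on both of their output observations.

```sas
data left;
   input name $ age;
   cards;
Alice 30
Bob 25
Carl 41
;
run;

data right;
   input name $ score;
   cards;
Alice 88
Alice 92
Carl 75
Carl 81
Dan 60
;
run;

/* Same program as the one-to-one match merge */
data both;
   merge left right;
   by name;
run;
```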

Here is the result of a many-to-many merge.  The technical details of what happens are a bit complicated, but we can say, in a simplified way, that SAS matches lines side-to-side until one side runs out of the matching observations, then it repeats the last observation from the short side until the long side runs out of the matching value.  (If you go back to the previous examples, you can see that they are special cases of this more general result.)  There are very few situations where this behavior is desirable.  Most merging is done with one-to-one or one-to-many relationships.
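A sketch that should reproduce this behavior, using made-up data. Group A has two observations on the left and three on the right, so the last left-hand value (2) should be carried onto the third A observation.

```sas
data left;
   input name $ x;
   cards;
A 1
A 2
B 3
;
run;

data right;
   input name $ y;
   cards;
A 10
A 20
A 30
B 40
;
run;

/* Both sides repeat the by value A; SAS is expected to */
/* write a note about this to the log.                  */
data both;
   merge left right;
   by name;
run;

proc print data=both;
run;
```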

As a final comment, note that while there is a useful purpose for having two or more set statements, there is nothing similar for merge statements.  You can put two or more merge statements in a data step, with or without a by statement, and not get an error message, but the resulting data values may be intermingled in "unexpected" ways.

Exercises
1. Refer to the data in Exercise #1 of Lesson 6.  For each of the following, include a proc print to display the results.  Use appropriate titles.  This can all be done in one program.
  1. Create a data set with only the three given variables.  Name the variables x, y, and z. (This is the only time you will use cards.)
  2. Create another data set, using a set statement, that takes the SAS data set in Part a as input, renames the variables length, width, and height during reading (on the way in), and calculates the volume and cost as in Lesson 6.
  3. Do as in Part b, except this time rename the variables when the data set is written (on the way out, this does not mean in proc print).
  4. Do as in Part b, except this time use obs and firstobs to bring in only the third and fourth observations from the original data set.
  5. Using the data set created in Part b as input, create a new data set that contains only the volume and cost variables.
2.  Refer to Exercise #2 of Lesson 5.
  1. Begin with the same input statement as was used in Lesson 5, then create a new name variable of the form "Lastname, Firstname" and another of the form "City, ST zip#".  Use an option to eliminate the original five variables used to create these new variables.
  2. Create another data set where you use obs and firstobs in the infile statement to read the second and third lines of data only.
3.  Download this data. Do not change the file or copy it into your editor.  Look at it with a text editor, such as Notepad.  There are two sections, with headings that say "1." and "2." (These lines are not to be considered observations).
  1. Write two data steps, using this one file as the source in both cases, but using firstobs and obs to control the starting and ending line so that each section is read into its own data set.  The data have a city name and state abbreviation.  Read these as two variables, "City" and "State."  Use a length statement to make the city variable long enough not to cut any of the names short and make the state abbreviation of length 2.  Create a new variable that forms a single five-character abbreviation using the first three letters of the city and the two letters of the state abbreviation.  Create another variable that concatenates the city and state with a comma and one space immediately following the city.  It should look something like this:
           Obs         city       state      shortcity           longcity
             1       Bismarck    ND        BisND        Bismarck, ND
  2. Concatenate the two data sets and print the result.  (Do not sort the data before this step.)
  3. Next, interleave the two data sets.  Remember they must be sorted first.  Use "state" as the by variable. Print the results but use a data set option within proc print to show only the "longcity" variable and observation number. 
  4. Next, merge the two data sets using "state" as the matching variable and applying the rename option as needed (within this data step).  Use an option in the data statement so that only the two original city variables and the state remain in the new data set. Print the result (with no data set options used in proc print this time).
4.  Copy the SAS code below into the editor to start with.  Assume that these data sets represent an inventory list that is being revised at each step.  The prices change each time, but the "itemno" is revised between new1 and new2 only. Write a program that does the following, and print each of the data sets you create.
  1. Merge new2 and new3 with itemno as the match key, and show old and new prices.
  2. Merge new1 and new3 with name as the match key, including the old and new values of "itemno" and price in the result.
data new1;
input itemno name $ price;
cards;
325 PrintCrd 211
276 KeyPad 37
842 PnclHldr 8
422 PaprShrd 132
523 Basket 29
;
data new2;
input itemno name $ price;
cards;
333 PrintCrd 399
277 KeyPad 25
802 PnclHldr 12
417 PaprShrd 122
515 Basket 17
;
data new3;
input itemno name $ price;
cards;
333 PrintCrd 386
277 KeyPad 25
802 PnclHldr 11
417 PaprShrd 135
515 Basket 15
;

Lesson 6: Creating Variables




Creating Variables with Assignment Statements
You can create variables that are not in the data that you are reading.  In the following program, avgrowth is calculated from the other variables in the data.  Second, dummy is assigned a constant value. Since the value assigned to it is a character string, indicated by the quotes around it, dummy will be a character  variable of length 3, because that is the length of the first value assigned to it.  Unless a length statement is used to set the length of a character variable before it is used, the length will be determined by the first value assigned to it.  Numeric constants can also be assigned, simply by putting a number on the right side of the equal sign (no quotes).
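A sketch of such a program. Only avgrowth and dummy come from the text; the data set name, the input variables, the values, and the numeric constant count are assumptions.

```sas
data growth;
   input plant $ growth1 growth2 growth3;
   /* Calculated from other variables */
   avgrowth = (growth1 + growth2 + growth3) / 3;
   /* Character constant: dummy gets length 3 */
   dummy = 'yes';
   /* Numeric constant: no quotes */
   count = 1;
   cards;
fern 2.1 2.4 3.0
ivy 1.5 1.9 2.2
;
run;
```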


Obviously, variables are created by the input statement, but they are also created if they are specified in a length, attrib, format, or informat statement (see below).  They can also be created by array definitions (a later topic), or by assignment statements, such as those in the example above.  An assignment statement is made up of a variable name, an equal sign, and an expression representing the value to be assigned to the variable.  The variable can appear in its own assigned expression, such as x=x+1, or x=log(x).  A very special form of assignment statement, called a sum statement, or an accumulator, is an exception to this syntax.  In the example below, p and q are accumulators.  Their values are incremented, starting from zero, by the amount specified, for each succeeding observation.  The accumulator p+1, below, is essentially the same as p=p+1, except that it is initialized to zero, which does not automatically happen if you use p=p+1. (See also the retain statement.)
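A sketch of two sum statements with invented data. Here p should count the observations (1, 2, 3) and q should hold the running total of x (5, 15, 30).

```sas
data running;
   input x;
   /* Sum statements: variable, plus sign, expression (no equal sign) */
   p + 1;
   q + x;
   cards;
5
10
15
;
run;

proc print data=running;
run;
```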

The arithmetic operations and mathematical functions used in assignment statements for numeric values are quite intuitive.  The syntax is similar to that used for formulas on a graphing calculator or spreadsheet.  The arithmetic operations are "+"  (add), "-" (subtract or negative), "*" (multiply), and "/" (divide).  Exponents are given with a double asterisk, such as "3**2" (three to the second power).  Parentheses are used in the usual manner for controlling order of operations.  Many functions are available, and their names can often be guessed because of their similarity to standard mathematical notation.  All functions have at least one argument enclosed in parentheses.  Some examples are sqrt(x) for square root of x (where x can be a number, variable name, or other expression that evaluates to a non-negative number), log(x) for natural log, and exp(x) for the exponential function ("e to the x").  There are also some constants, such as pi, given by the function constant('pi'), where the name of the constant is a quoted string.  For more detailed information about functions, see the SAS Documentation under "Base SAS/SAS Language Reference: Dictionary/Dictionary of Language Elements/Functions and CALL Routines."  (Note:  Some of the documented functions may not work in The Learning Edition.)  Here are a few more examples:
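A sketch using the operators and functions just described; the data set and variable names are made up.

```sas
data math;
   input x;
   root   = sqrt(x);                  /* square root            */
   lnx    = log(x);                   /* natural log            */
   ex     = exp(x);                   /* e to the x             */
   square = x**2;                     /* exponent               */
   area   = constant('pi') * x**2;    /* circle of radius x     */
   mixed  = (x + 1) / (2 * x - 1);    /* parentheses set order  */
   cards;
1
4
9
;
run;
```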

Since dates are numbers, you can do simple things like subtract two dates to find the number of days between them, without any problem.  However, for more complicated tasks, SAS has quite a few date-related functions.  For example day(x) returns the day of the month, month(x) returns the month number for a date, and qtr(x) returns the quarter number.  If you have to do any serious computations with dates, check the SAS documentation for available tools.  Remember, SAS also has date-time values, and functions to go along with them, as well.
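A sketch with hypothetical variables start and finish, read with the mmddyy10. informat:

```sas
data dates;
   input start : mmddyy10. finish : mmddyy10.;
   days    = finish - start;   /* number of days between the dates */
   dom     = day(finish);      /* day of the month                 */
   mon     = month(finish);    /* month number                     */
   quarter = qtr(finish);      /* quarter number                   */
   format start finish date9.;
   cards;
01/15/2012 09/02/2012
03/01/2011 12/25/2011
;
run;
```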

For character variables, there is an operation called concatenation, indicated by "||" (two vertical bars), that puts two character strings together.  There are many, many functions for character variables.  We will just look at a few:  substr(source, position, length) which extracts a substring, trim(source) which eliminates trailing blanks, length(source) which calculates the length of the value excluding trailing blanks, and upcase(source) which changes all the letters to upper case.
In the program below, a length statement has been used for the city variable, to allow up to 15 letters.  Note that this method would not work for city names that have more than one word, like "New York City."  The st (state) variable has been given an informat for two characters, but a colon modifier is used so that the pointer will move on to the beginning of the zip code.  Zip codes should always be character variables, otherwise those that start with zero will be shortened.
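A sketch of such a program. The names addr1 through addr4 follow the text; the data set name, the data values, and the variable names abbrev and middle are assumptions.

```sas
data address;
   length city $ 15;
   input city $ st : $2. zip $;
   addr1  = city || st || zip;                            /* plain concatenation     */
   addr2  = trim(city) || st || zip;                      /* trailing blanks removed */
   addr3  = trim(city) || ', ' || st || '  ' || zip;      /* punctuation and spaces  */
   addr4  = upcase(addr3);                                /* all upper case          */
   abbrev = substr(city, 1, 2) || st;                     /* four-letter code        */
   middle = substr(city, ceil(length(city) / 2), 1);      /* middle character        */
   cards;
Brookings SD 57006
Bismarck ND 58501
Boston MA 02108
;
run;
```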

The first assigned variable, addr1, is created by simply concatenating all three variables.  Note the (possibly undesirable) result, with the "extra" spaces between city and state, and the lack of spaces between state and zip code.  The spaces are there because the variable length is, in fact, 15, and the unused positions are filled with spaces.  Concatenation uses the whole variable, including spaces.
In addr2 we have removed the trailing spaces from city by using a trim function.  Now there are no spaces between any of the combined variables.
In addr3, we have included punctuation and spaces between the variables.  Notice that the concatenation operation works with constant expressions enclosed in quotes, as well as variables.  Spaces are preserved just as written between the quotes, including the one space after the comma and the two spaces in front of the zip code.
The upcase function is demonstrated in addr4, which converts addr3 to all uppercase characters.  Following that, the substring function is used to create a four-letter abbreviation, by extracting the first two letters of city and combining them with the state code.  Note the order of the three arguments, first the source variable, then the starting position, then the number of characters to extract.
The last assignment statement shows how we can combine various functions to perform a specialized task.  The idea here was to find the middle character of the city variable, defined to be the actual middle character for odd lengths and the letter immediately prior to the middle for even lengths.  The substring function is used to extract the character, but the starting position must be calculated.  The length function divided by two would be almost right, as it works fine for even lengths, but for odd lengths gives a half, like 4.5 for "Brookings."  Since the middle character is the next higher whole number, we can use the ceiling function, one of several rounding functions available, this being the one that always rounds any decimal value up to the next integer.
Length, informat, and attrib statements
An alternative to specifying informats in the input statement is to use an informat statement.  The informat statement has the same syntax as the format statement.  It doesn't do anything that can't be done in the input statement, but it might be convenient to keep things organized, as in this example:
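A sketch with hypothetical variables name and birth. With the informats declared up front, the input statement only needs the variable names.

```sas
data people;
   informat name $12. birth mmddyy10.;
   format birth date9.;
   input name birth;
   cards;
Christopher 03/14/1985
Ann 11/02/1990
;
run;
```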

Numeric variables have a default length of 8 bytes in SAS.  As we have seen, character variables also have a default length of 8 if they are read with a plain $; if an informat of the form "$w." is used, the length is w.  Character variables created using data step programming statements get their lengths from the first value assigned to them.  The length statement can be used to override the default lengths for both character and numeric variables.
It's not often we want to change the length of a numeric variable.  Sometimes space can be saved when the values are integers.  The allowed lengths are from 3 to 8 for PC SAS.  A length of 3 will accommodate accurate integer values from -8192 to 8192.  A length of 4 works to slightly over 2 million.  It is not recommended to use shortened numeric values when fractions (decimals) are involved.
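A sketch of a length statement, with made-up variables count (a small integer) and name:

```sas
data short;
   /* Numeric length given without a dot, character length with one */
   length count 3 name $12.;
   input name $ count;
   cards;
Christopher 12
Ann 7
;
run;
```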

In the above example, you can see that the length statement has syntax similar to the format or informat statements.  However, the "dot" is not required.  Here it has been left out for the numeric length and included for the character length, just for an example.  The dollar sign, however, is required for character variables.  The length statement must occur before the first use of the variable in the program, or it will not have any effect.
Another way to use the length statement is shown below.  This example sets the default numeric length to 3.  Unless you specify other lengths, all numeric variables in this data set will have length 3.  (This only works for numeric variables.)  The character variables will have length 8.
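A sketch of the default= form, with invented variables:

```sas
data small;
   /* All numeric variables in this step get length 3 */
   length default=3;
   input name $ a b;
   cards;
Ann 10 200
Bob 15 350
;
run;
```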

Another way to do this is with the attrib statement, which is more complicated and allows you to set the lengths, formats, informats, and labels all in one command:
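A sketch of an attrib statement; the variables, labels, and formats are chosen only for illustration.

```sas
data staff;
   attrib name   length=$12 label='Employee name'
          hired  informat=mmddyy10. format=date9. label='Date hired'
          salary format=dollar10.2 label='Annual salary';
   input name hired salary;
   cards;
Christopher 03/14/2005 52000
Ann 11/02/2010 48500
;
run;
```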


Exercises:
For each of the following exercises, copy and paste the data given in the problem into the SAS editor.  Write a data step to read the data and create the new variables described, then print the results using proc print, using appropriate titles.
1.  These numbers represent dimensions of cardboard boxes, length, width, and height, in inches.
  1. Calculate the volume of each box in cubic feet.
  2. Calculate the amount of cardboard needed to make each box in square feet, assuming that the top and bottom flaps meet in the middle (this results in a double layer of cardboard for both top and bottom). 
  3. Suppose the cost of cardboard is $.05 per square foot, and there is a fixed cost of $.25 for manufacturing each box regardless of size.  Calculate the cost of manufacturing each box.
  4. Calculate the cost per cubic foot of volume.
32 18 12
16 15 24
48 12 32
15 30 45
20 30 36
2.  This problem will provide a little practice in writing complicated formulas in SAS, paying attention to order of operations.  Use the data below, with variables a, b, and c, and apply the following formulas to create two new variables called root and trunk.  The first observation's results are -1 and 2.094, respectively.
  ,      
1 6 5
4 -20 2
12 22 -11
3 -15 -9
3.  Read the following data into three variables, making sure to get complete names.  Use the character functions and operators to extract initials from the following names so that they look like "J.F.K."  Then create an abbreviation for each name that looks like "J-n F-d K-y".
John Fitzgerald Kennedy
Martha Helena Goetz
Frederich Anthony Sailer
Albert Blake Codwell

Lesson 5: Input Styles




It's time to dig a little deeper into the technical details of data reading.  Before we do that, it is important to make clear what we mean by "reading" and "writing" in a data step.  "Reading" is the process whereby SAS interprets raw data from a file or an instream cards section of the program, or when it accesses an existing data set.  Thus, "reading" means SAS is bringing data values into computer memory to process in the data step.  "Writing" is the process of saving the finished data to a SAS data set file on a hard drive (or other computer storage device).  It is NOT the same as "printing," a term by which we usually mean "use proc print to display text in the Output Window."
Imagine that you are a computer and are instructed to read a file.  What do you actually do?
Well, as a human, how do you read a book?  First you have to open it.  You find the starting place, and you begin reading, which at the most fundamental level means you read a character, interpret it, and move on to the next one, and repeat.  Upon reaching the end of a word, you interpret the word, and move on to the next one.   Upon reaching the end of a line, you move down a line and go back to the left and read the first character, and so on, until you reach the end of the book, at which point you close it and stop.  Well, something like that, anyway.  For us, so much happens automatically, we don't have to think about it.  But, a computer doesn't think at all.  It merely follows a sequence of instructions, and this sequence, in some ways, is similar to processes we humans follow when reading a book.
When SAS opens a file to read it, it creates two pointers, which are nothing more than numbers for keeping track of position in the file.  One pointer is a position in a line of text (column), the other is the line number in the file.  (We are assuming a standard text file here.  SAS can also read binary files, but that is another subject.)  The pointers are initially at column 1, line 1.  As SAS reads data, the pointer moves along the line (not literally, of course, the number just changes).  SAS needs to know when to begin reading a value and when to stop. What is obvious to us is not necessarily obvious to a computer.  Consider the following line:
12 33 42 51 24
Do you see five numbers?  How about this:
1233425124
If I told you this was a string of five two-digit numbers, you would know immediately how to interpret it.  We are now going to consider what must be done to tell SAS how to interpret lines of data in ways somewhat analogous to how a person would interpret them.
SAS has four input styles.  It is easiest to learn about them one at a time, but they can actually be used together (mixed input) on the same file in almost any order.  The input styles are: list, column, formatted, and named input.
List input is perhaps the most straightforward.  The data values are simply given in a "list," one after the other, with a delimiter between them.  The delimiter is often just a space (that's the default), but may be a comma, tab, or other character.  List input, in the simplest case, can be recognized if all you see in the input statement is variable names and dollar signs following the character variables.  Typically, the data will not be aligned in columns.
To make our example easier to follow, the data is shown in a "cards" section, but we will discuss it as though it were really an external file.  Some differences involved in actually using an external file will be dealt with later.  Notice that the data values are separated by a single space and are not aligned in columns.  The input statement contains only variable names and one dollar sign.
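A sketch of list input using the variables name, age, and height; the data values are invented. The third name is longer than eight characters, so it should be stored as "Christop".

```sas
data people;
   input name $ age height;
   cards;
Alice 34 64.5
Bob 27 70
Christopher 41 68.25
;
run;

proc print data=people;
run;
```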

When SAS processes the input statement, it creates, in memory, an "input vector," which is temporary storage for all the variables in one observation.  At this point, the attributes of the variables (length, character vs. numeric) are fixed.  As the data step goes through its commands, it fills up the fields of the input vector, and when it gets to the end, it outputs one observation to the data set.  If there is more to do, it goes back and starts over, first clearing the input vector, then filling it up again.  One pass like this is known as a data step iteration.
The default length for character variables is eight characters, so in our example, that is the length of the "name" variable.  Age and height are numeric variables, also taking up eight bytes of memory, but that has no bearing on the number of characters to be read.  As reading begins, the pointers are set to line 1, column 1.  SAS is going to read up to eight characters into the variable "name."  However, in the list input style, it first checks to see if the current character is a space (or other blank character).  If so, it advances to the next, and keeps going until it comes to a non-blank character.  Then it starts reading characters into the input vector.  When it comes to another blank, it stops.  If the data value has more than eight characters, SAS will read the first eight and save them, but will then continue advancing the pointer until a space is encountered, then stop. (Longer character variables can be specified using the "colon modifier," ":$w." as has been shown in a previous section.  The colon causes SAS to read in list style even when there is an informat present.  There is also an "ampersand modifier" which allows use of a character informat and reads embedded spaces.  Multiple spaces then signal the end of the field.)
The second variable, "age," is a number.  With list input, SAS can only read numbers in "standard" format, which essentially means there can be only numerals, decimals, and minus signs present.  SAS again advances the pointer to the first non-blank character and begins reading the number until it encounters a space.  The number of columns read is immaterial this time, as SAS will simply store the number with as much precision as it can, regardless of how many numerals were given.  When it reaches a space, SAS saves the number to the input vector and stops.  The third variable, "height," is read the same way, except that we are now at the end of a line.  Upon reaching the end of the line, SAS saves the value, moves the column pointer back to 1, and advances the line pointer to the next line.  As it has come to the end of the input statement as well, it considers the observation complete and writes the contents of the input vector to the output data set.  It then clears the input vector and starts over, repeating until the end of the file is reached.
Missing values for character variables require special care with list input, since spaces are interpreted as delimiters and cannot then represent missing values.  Missing values may have to be coded using a word or symbol (e.g., "missing").  Options that change the delimiter (discussed later) can also solve this problem.  Numeric missing values are represented by a decimal point (period).  If values are missing at the end of a line (without a placeholder of some kind), the infile option missover can be used to set the remaining variables to missing and go on to the next line.  If this is not done, SAS will try to read the variables from the next line.
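The following sketch shows both ideas under list input: a period as a placeholder for a missing numeric value, and the missover option for a value missing at the end of a line.  The data values are invented for illustration.

```sas
data people;
   /* missover: if a line ends early, set the remaining variables to missing */
   infile datalines missover;
   input name $ age height;
   datalines;
Alfred 14 69
Barbara . 65
Carol 15
;
run;

proc print data=people;
run;
```

Without missover, SAS would go to the next line looking for Carol's height.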
Column input requires the data to be arranged in aligned columns.  The key element of column input is that you give a range of columns after each variable name, or after the dollar sign of a character variable, in the form of start-end.  Informats cannot be used in the input statement with column input. (They can be specified in a separate informat statement, introduced in the next lesson.)   There is no need to worry about the column pointer in this case, as SAS changes it according to the ranges specified.  The ranges need not be in order.  They can even overlap or be re-read.

The next example illustrates some of the advantages of column input.  Notice that the character variable will read whatever is in the associated columns and is automatically set to the length specified by the range of columns.  Even spaces and special characters are correctly read and stored.  Missing numeric values need not be coded with a period, as spaces will be correctly interpreted as missing.
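A small sketch of column input along these lines is given here.  The column ranges and data values are assumptions chosen for illustration, not taken from a real file.

```sas
data people;
   /* name in columns 1-12, age in 14-15, height in 17-20 */
   input name $ 1-12 age 14-15 height 17-20;
   datalines;
Mary Ann     14 62.5
Smith-Lee, J    64
Al           15 59
;
run;

proc print data=people;
run;
```

The name variable gets a length of 12 from its column range, the embedded space, hyphen, and comma are kept, and the blank age field in the second line is read as missing.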

One of the best features of column input is that no delimiters are required.  Many mainframe programs produce this kind of file, which has no extra spaces or delimiters.  This can greatly reduce the size of a file.  As long as the column positions are known, there is no difficulty with reading this data in SAS.
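As an illustration of reading a file with no delimiters at all, the sketch below assumes a made-up layout of a four-character id, a two-digit age, and a three-digit weight packed together.

```sas
data packed;
   /* no spaces between fields; the column ranges alone separate the values */
   input id $ 1-4 age 5-6 weight 7-9;
   datalines;
A00123150
B00245182
;
run;

proc print data=packed;
run;
```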

Formatted input relies on informats and pointer controls to determine where and what to read into each variable.  In the example below, the @n syntax gives the starting column for each variable.  It repositions the column pointer to that column.  The "name" variable has a default length of eight.  The numeric variables are read until a space is encountered.  Note the order of the syntax: the @n comes in front of the variable name, and the informat comes in its usual position after the variable name.

Variables need not be in left-to-right order.  Informats may be used as needed.

Even though I have shown the @n for each variable here, it is not really necessary in all cases.  The initial "@1" is not required, as the pointer starts out in column 1.  With formatted input, the column pointer will end up at the next character past the field that was just read.  If this is the start of the next variable, you are ready to go and do not need to reposition the pointer.  The syntax "+n" may be used to advance the pointer a specified number of columns.  Just as with column input, delimiters are not necessary when using formatted input.
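Here is a minimal sketch of formatted input that uses both kinds of pointer control.  The column positions, informat widths, and data values are assumptions made for illustration.

```sas
data people;
   /* @n moves to an absolute column; +n moves forward n columns */
   input @1  name   $8.
         @10 age    2.
         +1  height 4.;
   datalines;
Alfred   14 69.0
Barbara  13 65.3
;
run;

proc print data=people;
run;
```

After age is read from columns 10-11, the pointer sits at column 12, so "+1" moves it to column 13 where height begins.  Writing "@13" instead would do the same thing.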

Because formatted input and column input cause SAS to read a specified number of columns, a problem occurs if the end of the line varies, as in the example of the dates given below.  When the data are instream, SAS does not have a problem.  But if the file is external, a line that is too short will cause SAS to jump to the next line to finish reading the field, which usually results in misread data.  The truncover option in the infile statement will cause SAS to treat short lines as if there were spaces in the file to fill out the remainder of the field.  If there is no data at all for the field, the variable will be set to missing.  The syntax is demonstrated here with instream data, but is only necessary for external files.
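A sketch of the truncover syntax with instream data might look like the following.  The variable names and dates are illustrative assumptions; note that the first date is only 7 characters wide while the informat asks for 10.

```sas
data birthdays;
   /* truncover: read what is there when a line is shorter than the field */
   infile datalines truncover;
   input @1 name $8. @10 bday mmddyy10.;
   format bday mmddyy10.;
   datalines;
Alfred   3/27/99
Barbara  12/14/1997
;
run;

proc print data=birthdays;
run;
```

For an external file, the word "datalines" in the infile statement would be replaced by the quoted path or fileref of the file.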

It is often helpful with formatted input, if there are many variables, to line them up vertically in the input statement, with the @'s all in front and the informats after.  I have not shown an example, but with a large number of variables, this makes it much easier to prevent and correct mistakes.
The log below is the result of trying to read this same data from an external file without the truncover option.  Observe that SAS is looking for the "bday" variable at the beginning of line 2, and since what is actually there is a name variable, it declares the data to be invalid.  The reason for this is that the "bday" given in line 1 is only 7 characters in length.  Since there are not enough characters, SAS goes to the next line to look for the variable.  After failing to read "bday," the end of the input statement is reached, so SAS tries to move to the next line to start a new input cycle, but finding the end of the file, it quits and writes only one observation to the data set.

Including the truncover option in the infile statement fixes this problem.

Named input is used when each variable is identified in the data with a variable name and the equals sign.  This input style is not very common.  While named input can be mixed with the other styles, it must be last on the line.  You cannot change back to another style.
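As a rough sketch, named input mixed with one list-style variable could look like this.  The id variable and the data values are made up for illustration.

```sas
data people;
   /* id is read in list style; everything after it is named input */
   input id name= $ age= height=;
   datalines;
1 name=Alfred age=14 height=69
2 height=65.3 name=Barbara age=13
;
run;

proc print data=people;
run;
```

Because each value carries its own variable name, the named values can appear in a different order on each line.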

However, the idea demonstrated here is very useful.  In some data sets, it is difficult to identify where the variables are in the line.  If they are "signaled" by some identifiable text, we can use that text to determine the position of the pointer and use formatted input.  In the example below, we have given a character string in quotes with the @ sign, where we had the column number in previous examples.  SAS scans from the current column until it matches the text in quotes after the @ sign.  Note that a trailing space was included in the quotes in this example.  An alternative would be to leave out the space and increase the field width to 3.  In any case, you want to make sure the pointer ends up in the right place, which will be right after the text in quotes, and that the field width matches the location of the value to be read.  In cases like this, a lot of effort is required to determine what consistent properties of the data source can be reliably used.  Careful study of the file and testing of various scenarios are required, or the result may be misread data, quite possibly with no warning of the errors.
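The technique can be sketched as follows, assuming made-up lines in which each value is preceded by a label such as "Name: " or "Age: ".  The labels, widths, and data are illustrative only.

```sas
data people;
   /* @'text' moves the pointer to the first column after the matched text */
   input @'Name: ' name $8.
         @'Age: '  age  2.;
   datalines;
Record 17 Name: Alfred   Age: 14
Entry Name: Barbara  Age: 13 (verified)
;
run;

proc print data=people;
run;
```

The names are padded with blanks so that the eight columns read by $8. never run into the "Age:" label.  A real file would need to be checked for that kind of consistency.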

If the variable values are not in a consistent order in the data, we can deal with that too.  By using "@1" to reset the pointer back to the beginning of the line, we can make sure nothing is missed:
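A minimal sketch of this idea, again with invented labels and data, puts "@1" in front of each text search so that every scan starts from the beginning of the line:

```sas
data people;
   /* @1 resets the pointer so each label is found wherever it sits in the line */
   input @1 @'Name: ' name $8.
         @1 @'Age: '  age  2.;
   datalines;
Name: Alfred   Age: 14
Age: 13 Name: Barbara
;
run;

proc print data=people;
run;
```

Without the second "@1", the search for "Age: " in the second line would start after the name and never find the label earlier in the line.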

To finish this off, here is an example of mixed input styles.  All four styles, list, column, formatted, and named, are demonstrated in one input statement.
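One possible sketch of an input statement that mixes all four styles is shown here.  The layout and data values are assumptions made for illustration, with the named variable placed last as required.

```sas
data people;
   input name $            /* list input      */
         age 10-11         /* column input    */
         @13 height 4.     /* formatted input */
         sex= $;           /* named input     */
   datalines;
Alfred   14 69.0 sex=M
Barbara  13 65.3 sex=F
;
run;

proc print data=people;
run;
```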


Exercises:
For each problem below, write a SAS program that creates a data set using instream data.  Copy and paste the given data segments into the SAS editor. This is important in order to keep the data in the same format as presented.  DO NOT adjust the data, such as by adding spaces or aligning columns.  Include a proc print step in the program, and turn in the editor, output, and log files.   Use appropriate titles and formats, if necessary, in your output.
1.  The variables are Name, Age, Height, Sex, and JoinDate.  Assume that no values will be longer than those included here, and use the shortest possible length for the character variables.  Use list input (with colon modifiers where needed).
John 21 70 M 2/14/97
Jo 18 62 F 3/27/99
Mark 32 68 M 6/22/98
Linda 25 65 F 12/14/97
Carey 27 59 F 8/20/98
2.  The variables and the columns they are found in are LastName (1-15), FirstName (16-30), Address (31-55), City (56-65), State (66-67), and Zip (68-72).  Assume that character variables may fill the entire field width.  Use column input.
Johnson        Michael        121 1st St S             Brookings SD57006
Big Hammer     Beatrice       45031 271st Ave S        Moorhead  MN56560
Helms-Marquart Charlotte      302 N Mason-Dixon Ave    Somewhere DC01221
Cutler         George         Rural Route 2            Zap       ND58563
3.  In this problem, the same variables are used as in the previous problem, but additional variables are included, namely the Social Security Number, Ownership Status, and Move-in Date.  You can see where these new variables are added.  The field widths of the variables that were in the previous problem are unchanged.  Use formatted input to read this data, but change the order of reading to First Name, Last Name, Social Security Number, Ownership Status, Move-in Date, Address, City, State, and Zip. Use at least one example of the "+n" syntax to move the pointer.
503118596Johnson        Michael        Own 121 1st St S             01051974Brookings SD57006
471559684Big Hammer     Beatrice       Rent45031 271st Ave S        10221987Moorhead  MN56560
362995874Helms-Marquart Charlotte      Own 302 N Mason-Dixon Ave    07091991Somewhere DC01221
474843859Cutler         George         RentRural Route 2            12161996Zap       ND58563