What is SAS?
• Developed in the early 1970s at North Carolina State
University
• Originally intended for management and analysis of
agricultural field experiments
• Now among the most widely used statistical software packages
• Originally stood for “Statistical Analysis System”; the name is
no longer an acronym for anything
• Pronounced “sass”, not spelled out as three letters.
Overview of SAS Products
• Base SAS - data management and basic procedures
• SAS/STAT - statistical analysis
• SAS/GRAPH - presentation quality graphics
• SAS/OR - Operations research
• SAS/ETS - Econometrics and Time Series Analysis
• SAS/IML - interactive matrix language
• SAS/AF - applications facility (menus and interfaces)
• SAS/QC - quality control
There are other specialized products for spreadsheets, access to
databases, connectivity between different machines running SAS,
etc.
Resources: Introductory Books
Mastering the SAS System, 2nd Edition, by Jay A. Jaffe,
Van Nostrand Reinhold
Quick Start to Data Analysis with SAS, by Frank C. DiIorio and
Kenneth A. Hardy, Duxbury Press.
How SAS works: a comprehensive introduction to the SAS System, by
P.A. Herzberg, Springer-Verlag
Applied statistics and the SAS programming language, by R.P. Cody,
North-Holland, New York
The bulk of SAS documentation is available online, at
http://support.sas.com/documentation/onlinedoc/index.html. A
catalog of printed documentation available from SAS can be found at
http://support.sas.com/publishing/index.html.
Online Resources
Online help: Type help in the SAS display manager input window.
Sample Programs, distributed with SAS on all platforms.
SAS Institute Home Page: http://www.sas.com
SAS Institute Technical Support:
http://support.sas.com/resources/
Searchable index to SAS-L, the SAS mailing list:
http://www.listserv.uga.edu/archives/sas-l.html
Usenet Newsgroup (equivalent to SAS-L):
comp.soft-sys.sas
Michael Friendly’s Guide to SAS Resources on the Internet:
http://www.math.yorku.ca/SCS/StatResource.html#SAS
Brian Yandell’s Introduction to SAS:
http://www.stat.wisc.edu/~yandell/software/sas/intro.html
Basic Structure of SAS
There are two main components to most SAS programs - the data
step(s) and the procedure step(s).
The data step reads data from external sources, manipulates and
combines it with other data sets, and prints reports. The data step is
used to prepare your data for use by one of the procedures (often
called “procs”).
SAS is very lenient about the format of its input - statements can
be broken up across lines, multiple statements can appear on a
single line, and blank spaces and lines can be added to make the
program more readable.
The procedure steps perform analysis on the data, and produce
(often huge amounts of) output.
The most effective strategy for learning SAS is to concentrate on
the details of the data step, and learn the details of each procedure
as you have a need for them.
Accessing SAS
There are four ways to access SAS on a UNIX system:
1. Type sas . This opens the SAS “display manager”, which
consists of three windows (program, log, and output). Some
procedures must be run from the display manager.
2. Type sas -nodms . You will be prompted for each SAS
statement, and output will scroll by on the screen.
3. Type sas -stdio . SAS will act like a standard UNIX
program, expecting input from standard input, sending the log
to standard error, and the output to standard output.
4. Type sas filename.sas . This is the batch mode of SAS -
your program is read from filename.sas, the log goes to
filename.log and the output goes to filename.lst.
Some Preliminary Concepts and Rules
• SAS variable names must be 32 characters or less, constructed
of letters, digits and the underscore character. (Before version
7, the limit was 8.)
• It’s a good idea not to start variable names with an underscore,
because special system variables are named that way.
• Data set names follow similar rules as variables, but they have
a different name space.
• There are virtually no reserved keywords in SAS; it’s very good
at figuring things out by context.
• SAS is not case sensitive, except inside of quoted strings.
Starting in Version 7, SAS will remember the case of variable
names when it displays them.
• Missing values are handled consistently in SAS, and are
represented by a period (.).
• Each statement in SAS must end in a semicolon (;).
Structure of SAS programs
• Statements beginning with an asterisk (*) and ending with a
semicolon are treated as comments. Alternatively you can
enclose comments between /* and */.
• You can combine as many data and proc steps in whatever
order you want.
• Data steps begin with the word data and procedure steps
begin with the word proc.
• The run; command signals to SAS that the previous
commands can be executed.
• Terminate an interactive SAS job with the endsas; statement.
• There are global options (like linesize and pagesize) as well as
options specific to datasets and procedures.
• Informative messages are written to the SAS log - make sure
you read it!
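As a minimal sketch of these rules, the following program combines a global options statement, one data step and one proc step. The data set name scores and the variables id and score are invented for illustration.

```sas
* A statement-style comment ends at the semicolon;
/* A block comment can span
   several lines */
options linesize=80 pagesize=60;   /* global options */

data scores;                       /* data step: creates data set scores */
   input id score;
   datalines;
1 85
2 92
3 78
;
run;

proc print data=scores;            /* proc step: lists the observations */
run;
```

After running it, check the log for the number of observations and variables reported for scores.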
The Data Step
The data step provides a wide range of capabilities, among them
reading data from external sources, reshaping and manipulating
data, transforming data and producing printed reports.
The data step is actually an implied do loop whose statements will
be executed for each observation either read from an external
source, or accessed from a previously processed data set.
For each iteration, the data step starts with a vector of missing
values for all the variables to be placed in the new observation. It
then overwrites the missing value for any variables either input or
defined by the data step statements. Finally, it outputs the
observation to the newly created data set.
The true power of the data step is illustrated by the fact that all of
these defaults may be overridden if necessary.
Data Step: Basics
Each data step begins with the word data and optionally one or
more data set names (and associated options) followed by a
semicolon. The name(s) given on the data step are the names of
data sets which will be created within the data step.
If you don’t include any names on the data step, SAS will create
default data set names of the form datan, where n is an integer
which starts at 1 and is incremented so that each data set created
has a unique name within the current session. Since it becomes
difficult to keep track of the default names, it is recommended that
you always explicitly specify a data set name on the data
statement.
When you are running a data step to simply generate a report, and
don’t need to create a data set, you can use the special data set
name _null_ to eliminate the output of observations.
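A small sketch of a report-only data step, assuming two made-up variables (name and score). No data set is created; the put statement writes each line to the SAS log.

```sas
data _null_;
   input name $ score;
   /* put writes to the SAS log unless a file statement redirects it */
   put name= score=;
   datalines;
Ann 85
Bob 92
;
run;
```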
Data Step: Inputting Data
The input statement of SAS is used to read data from an external
source, or from lines contained in your SAS program.
The infile statement names an external file or fileref from which
to read the data; otherwise the cards; or datalines; statement is
used to precede the data.
data one;
infile "input.data";
input a b c;
run;
Reading data from an external file
data one;
input a b c;
datalines;
. . .
;
Reading from inline data
By default, each invocation of the input statement reads another
record. This example uses free-form input, with at least one space
between values.
A fileref is a SAS name, created by the filename statement, which refers to
an external file or other device.
Data Step: input Statement
There are three basic forms of the input statement:
1. List input (free form) - data fields must be separated by at
least one blank. List the names of the variables, follow the
name with a dollar sign ($) for character data.
2. Column input - follow the variable name (and $ for character)
with startingcolumn-endingcolumn.
3. Formatted input - Optionally precede the variable name with
@startingcolumn; follow the variable name with a SAS informat
designation. (Examples of informats: $10. (10 column
character), 6. (6 column numeric))
When mixing different input styles, note that for column and
formatted input, the next input directive reads from the column
immediately after the previous value, while for list input, the next
directive reads from the second column after the previous value.
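The three forms can be compared by reading the same two records three ways. This is only a sketch: the data set names, the variables name and age, and the column positions are chosen for illustration.

```sas
/* 1. List input: values separated by blanks */
data list_in;
   input name $ age;
   datalines;
Alice 34
Bob 27
;
run;

/* 2. Column input: name in columns 1-5, age in columns 7-8 */
data col_in;
   input name $ 1-5 age 7-8;
   datalines;
Alice 34
Bob   27
;
run;

/* 3. Formatted input: starting columns plus informats */
data fmt_in;
   input @1 name $5. @7 age 2.;
   datalines;
Alice 34
Bob   27
;
run;
```

Note that the second and third versions require the values to line up in fixed columns, while the first does not.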
Modifiers for List Input
The colon (:) modifier for list input tells SAS to use a format for
input, but to stop when the next whitespace is found. Data like:
17,244 2,500,300 600 12,003
14,120 2,300 4,232 25
could be read using an input statement like
input x1 : comma. x2 : comma. x3 : comma. x4 : comma. ;
The ampersand (&) modifier tells SAS to use two whitespace
characters to signal the end of a character variable, allowing
embedded blanks to be read using list input. Thus, the statements:
length name $ 25;
input name & $ year;
could be used to read data such as
George Washington  1789
John Adams  1797
Thomas Jefferson  1801
Other Modifiers for the Input Statement
+number advance number columns.
#number advance to line number.
/ advance to next line.
trailing @ hold the line to allow further input statements in this
iteration of the data step on the same data.
trailing @@ hold the line to allow continued reading from the line
on subsequent iterations of the data step.
Note: If SAS needs to read an additional line to input all the
variables referenced in the input statement it prints the following
message on the log:
NOTE: SAS went to a new line when INPUT statement reached past
the end of a line.
If you see this note, make sure you understand why it was printed!!
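The two trailing @ forms can be sketched as follows. The data set and variable names are hypothetical. In the first step, @@ lets one line supply several observations; in the second, a single @ holds the line so that a later input statement can decide how to read the rest of it.

```sas
/* Trailing @@: each line holds several (x, y) pairs */
data pairs;
   input x y @@;
   datalines;
1 2 3 4 5 6
7 8
;
run;

/* Trailing @: read a record type, then choose how to read the rest */
data mixed;
   input kind $ @;
   if kind = 'p' then input name $ age;
   else if kind = 'c' then input city $;
   datalines;
p Ann 34
c Paris
;
run;
```

The first step should produce four observations, one for each pair.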
The input Statement
Variable lists can be used on the input statement. For example, the
list var1 - var4 expands to var1 var2 var3 var4.
You can repeat formats for variable lists by including the names
and formats in parentheses: (var1 - var4) (5.) reads four
numeric variables from 20 consecutive columns (5 columns for each
variable).
You can also repeat formats using the notation num*format. The
previous example could be replaced with (4 * 5.).
A null input statement (no variables) can be used to free holding
caused by trailing @-signs.
The @, + and # specifications can all be followed by a variable name
instead of a number.
If you want to make sure your input data is really arranged the way
you think it is, the list; command will display your input data
with a “ruler” showing column numbers.
FTP Access
SAS provides the ability to read data directly from an FTP server,
without the need to create a local copy of the file, through the ftp
keyword of the filename statement.
Suppose there is a data file called user.dat in the directory
public on an ftp server named ftp.myserver.com. If your user
name is joe and your password is secret, the following statement
will establish a fileref for reading the data:
filename myftp ftp 'user.dat' cd='/public' user='joe'
pass='secret' host='ftp.myserver.com';
The fileref can now be used in the infile statement in the usual
way.
You can read files from http (web) servers in a similar fashion,
using the url keyword.
Options for the infile statement
For inline data, use the infile name cards or datalines.
missover Sets values to missing if an input statement would
read more than one line.
stopover Like missover, but declares an error and stops.
lrecl=num Treats the input as having a length of num characters.
Required if input records are longer than 256 characters.
dlm='chars' Uses the characters in chars instead of blanks
and tabs as separators in list (free-form) input.
dsd Read comma-separated data
expandtabs expand tabs to spaces before inputting data.
end=varname creates a SAS variable whose value is 1 when SAS
processes the last line in the file.
obs=n Limits processing of infile to n records
pad Adds blanks to lines that are shorter than the input
statement specifies.
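Several of these options are often combined. The sketch below assumes a hypothetical comma-separated file scores.csv holding a name and up to two test scores per record.

```sas
data scores;
   /* dsd:      comma-separated values, consecutive commas mean missing
      missover: short records leave the remaining variables missing
      end=:     last is 1 while the final record is being processed */
   infile "scores.csv" dsd missover end=last;
   input name $ test1 test2;
   if last then put "Records read: " _n_;
run;
```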
Variable Length Records
Consider the following file, containing the year and name of the
first three American presidents:
1789 George Washington
1797 John Adams
1801 Thomas Jefferson
If we were to use an input statement like
input year 4. @6 name $17.;
SAS would try to read past the end of the second line, since the
name only has 10 characters. The solution is the pad option of the
infile statement. Suppose the data is in a file called p.txt. The
following program correctly reads the data:
data pres;
infile 'p.txt' pad;
input year 4. @6 name $17.;
run;
Reading SAS programs from external files
The infile statement can be used to read data which is stored in
a file separate from your SAS program. When you want SAS to
read your program from an external file you can use the %include
statement, followed by a filename or fileref. After SAS processes a
%include statement, it continues to read data from its original
source (input file, keyboard or display manager.)
For example, suppose the SAS program statements to read a file
and create a data set are in the system file readit.sas. To process
those statements, and then print the data set, the following
commands can be used:
%include "readit.sas";
proc print;
run;
proc import
For certain simple data files, SAS can create a SAS data set directly
using proc import. The dbms= option informs SAS of the type of
file to be read, and choices include xls (Excel spreadsheets), csv
(Comma-separated values), dbf (Dbase files), dta (Stata files), sav
(SPSS files), and tab (Tab-separated files). For example, to read an
Excel spreadsheet called data.xls into a SAS data set named
xlsdata, the following statements can be used:
proc import dbms=xls datafile='data.xls' out=xlsdata;
run;
proc import provides no options for formatting, and may not be
successful with all types of data files.
Repetitive Processing of Variables
The array statement can be used to perform the same task on a
group of variables.
array arrayname variable list <$> <(startingvalues)>;
array arrayname{n} variable list <$> <(startingvalues)>;
You can then use the array name with curly braces ({}) and a
subscript, or in a do over loop:
array x x1-x9;
do i = 1 to dim(x);
if x{i} = 9 then x{i} = .;
end;
array x x1-x9;
do over x;
if x = 9 then x = .;
end;
Notes: 1. All the variables in an array must be of the same type.
2. An array can not have the same name as a variable.
3. You can use the keyword _temporary_ instead of a variable list.
4. The statement array x{3}; generates variables x1, x2, and x3.
5. The function dim returns the number of elements in an array.
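As a sketch of notes 3 and 5, the step below uses a temporary array of weights alongside an ordinary array. The data set grades and its variables q1-q3 are assumptions made for this example.

```sas
data weighted;
   set grades;                             /* hypothetical input data set */
   array q{3} q1-q3;                       /* existing variables */
   array w{3} _temporary_ (0.2 0.3 0.5);   /* weights, never written out */
   total = 0;
   do i = 1 to dim(q);
      total = total + w{i} * q{i};
   end;
   drop i;
run;
```

Because w is temporary, no variables w1-w3 appear in the output data set.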
Titles and Footnotes
SAS allows up to ten lines of text at the top (titles) and bottom
(footnotes) of each page of output, specified with title and
footnote statements. The form of these statements is
title<n> text; or footnote<n> text;
where n, if specified, can range from 1 to 10, and text must be
surrounded by double or single quotes. If text is omitted, the title
or footnote is deleted; otherwise it remains in effect until it is
redefined. Thus, to have no titles, use:
title;
By default SAS includes the date and page number on the top of
each piece of output. These can be suppressed with the nodate and
nonumber system options.
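A short sketch putting these statements together. The title text and the data set survey are made up for illustration.

```sas
options nodate nonumber;          /* suppress date and page number */
title "Survey Results";
title2 "Preliminary Analysis";
footnote "Source: hypothetical survey data";

proc print data=survey;
run;

title;      /* cancel all titles */
footnote;   /* cancel all footnotes */
```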
Missing Values
SAS handles missing values consistently throughout various
procedures, generally by deleting observations which contain
missing values. It is therefore very important to inspect the log and
listing output, as well as paying attention to the numbers of
observations used, when your data contains missing values.
For character variables, a missing value is represented by a blank
(" ", not a null string).
For numeric variables, a missing value is represented by a period
(with no quotes). Unlike many languages, you can test for equality
to missing in the usual fashion:
if string = " " then delete; * character variable;
if num = . then delete; * numeric variable;
if x > 10 then x = .; * set a variable to missing;
Special Missing Values
In addition to the regular missing value (.), you can specify one or
more single alphabetic characters which will be treated as missing
values when encountered in your input.
Most procedures will simply treat these special missing values in
the usual way, but others (such as freq and summary) have options
to tabulate each type of missing value separately. For example,
data one;
missing x;
input vv @@;
datalines;
12 4 5 6 x 9 . 12
;
The 5th and 7th observations will
both be missing, but internally they
are stored in different ways.
Note: When you use a special missing value, it will not be detected
by a statement like if vv = .; in the example above, you would
need to use if vv = .x to detect the special missing value, or to
use the missing function of the data step.
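The difference between the two tests can be sketched by extending the example above. The variables kind and anymiss are added here only for illustration.

```sas
data one;
   missing x;
   input vv @@;
   if vv = . then kind = "regular";        /* matches only the period */
   else if vv = .x then kind = "special";  /* matches only the special value */
   else kind = "valid";
   anymiss = missing(vv);                  /* 1 for either kind of missing */
   datalines;
12 4 5 6 x 9 . 12
;
run;
```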
Variable Lists
SAS provides several different types of variable lists, which can be
used in all procedures, and in some data step statements.
• Numbered List - When a set of variables have the same prefix,
and the rest of the name is a consecutive set of numbers, you
can use a single dash (-) to refer to an entire range:
x1 - x3 => x1, x2, x3; x01 - x03 => x01, x02, x03
• Colon list - When a set of variables all begin with the same
sequence of characters you can place a colon after the sequence
to include them all. If variables a, b, xheight, and xwidth
have been defined, then x: => xheight, xwidth.
• Special Lists - Three keywords refer to a list with the obvious
meaning: _numeric_ _character_ _all_
In a data step, special lists will only refer to variables which
were already defined when the list is encountered.
Variable Lists (cont’d)
• Name range list - When you refer to a list of variables in the
order in which they were defined in the SAS data set, you can
use a double dash (--) to refer to the range:
If the input statement
input id name $ x y z state $ salary;
was used to create a data set, then
x -- salary => x, y, z, state, salary
If you only want character or numeric variables in the name
range, insert the appropriate keyword between the dashes:
id -numeric- z => id, x, y, z
In general, variables are defined in the order they appear in the
data step. If you’re not sure about the order, you can check
using proc contents.
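The lists above can be tried out in procedure steps. This sketch builds a small data set with the same input statement; the data values are invented.

```sas
data people;
   input id name $ x y z state $ salary;
   datalines;
1 Ann 3 4 5 CA 50000
2 Bob 6 7 8 NY 62000
;
run;

proc print data=people;
   var id -numeric- z;      /* numeric variables from id through z */
run;

proc means data=people;
   var _numeric_;           /* every numeric variable */
run;

proc contents data=people;  /* shows the variables and their attributes */
run;
```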
The set statement
When you wish to process an already created SAS data set instead
of raw data, the set statement is used in place of the input and
infile or datalines statements.
Each time it encounters a set statement, SAS inputs an
observation from an existing data set, containing all the variables
in the original data set along with any newly created variables.
This example creates a data set called trans with all the variables
in the data set orig plus a new variable called logx:
data trans;
set orig;
logx = log(x);
run;
You can specify the path to a SAS data set in quotes instead of a
data set name. If you use a set statement without specifying a
data set name, SAS will use the most recently created data set.
drop= and keep= data set options
Sometimes you don’t need to use all of the variables in a data set
for further processing. To restrict the variables in an input data
set, the data set option keep= can be used with a list of variable
names. For example, to process the data set big, but only using
variables x, y, and z, the following statements could be used:
data new;
set big(keep = x y z);
. . .
Using a data set option in this way is very efficient, because only
the listed variables are read for each observation. If
you only wanted to remove a few variables from the data set, you
could use the drop= option to specify the variables in a similar
fashion.
drop and keep statements
To control the variables which will be output to a data set, drop or
keep statements can be used. (It is an error to specify both drop
and keep in the same data step). Suppose we have a data set with
variables representing savings and income. We wish to output
only those observations for which the ratio of savings to income is
greater than 0.05, but we don’t need this ratio output to our final
result.
data savers;
set all;
test = savings / income;
if test > .05 then output;
drop test;
run;
As an alternative to drop, the statement
keep income savings;
could have been used instead.
retain statement
SAS’ default behavior is to set all variables to missing each time a
new observation is read. Sometimes it is necessary to “remember”
the value of a variable from the previous observation. The retain
statement specifies variables which will retain their values from
previous observations instead of being set to missing. You can
specify an initial value for retained variables by putting that value
after the variable name on the retain statement.
Note: Make sure you understand the difference between retain
and keep.
For example, suppose we have a data set which we assume is sorted
by a variable called x. To print a message when an out-of-order
observation is encountered, we could use the following code:
retain lastx .; * retain lastx and initialize to missing;
if x < lastx then put 'Observation out of order, x=' x;
else lastx = x;
sum Statement
Many times the sum of a variable needs to be accumulated between
observations in a data set. While a retain statement could be used,
SAS provides a special way to accumulate values known as the sum
statement. The format is
variable + expression;
where variable is the variable which will hold the accumulated
value, and expression is a SAS expression which evaluates to a
numeric value. The value of variable is automatically initialized
to zero. The sum statement is equivalent to the following:
retain variable 0;
variable = variable + expression;
with one important difference. If the value of expression is
missing, the sum statement treats it as a zero, whereas the normal
computation will propagate the missing value.
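The difference can be seen by accumulating the same values both ways. In this sketch the variable names amount, total and total2 are made up, and the third value is deliberately missing.

```sas
data running;
   input amount @@;
   total + amount;               /* sum statement: a missing amount adds 0 */
   retain total2 0;
   total2 = total2 + amount;     /* ordinary assignment: missing propagates */
   datalines;
10 20 . 5
;
run;

proc print data=running;
run;
```

Here total should run 10, 30, 30, 35, while total2 becomes missing at the third observation and stays missing afterwards.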
Default Data Sets
In most situations, if you don’t specify a data set name, SAS will
use a default dataset, using the following rules:
• When creating data sets, SAS uses the names data1, data2,
etc, if no data set name is specified. This can happen because
of a data step, or if a procedure automatically outputs a data
set which you have not named.
• When processing data sets, SAS uses the most recently created
data set, which has the special name _last_. This can happen
when you use a set statement with no dataset name, or invoke
a procedure without a data= argument. To override this, you
can set the value of _last_ to a data set of your choice with the
options statement:
options _last_ = mydata;
Temporary Data Sets
By default, the data sets you create with SAS are deleted at the
end of your SAS session. During your session, they are stored in a
directory with a name like SAS_workaXXXX, where the Xs are used
to create a unique name. By default, this directory is created
within the system /tmp directory.
You can have the temporary data sets stored in some other
directory using the -work option when you invoke sas, for example:
sas -work .
to use the current directory or, for example,
sas -work /some/other/directory
to specify some other directory.
Note: If SAS terminates unexpectedly, it may leave behind a work
directory which may be very large. If so, it will need to be removed
using operating system commands.
Permanent Data Sets
You can save your SAS data sets permanently by first specifying a
directory to use with the libname statement, and then using a two
level data set name in the data step.
libname project "/some/directory";
data project.one;
Data sets created this way will have filenames of the form
datasetname.sas7bdat.
In a later session, you could refer to the data set directly, without
having to create it in a data step.
libname project "/some/directory";
proc reg data=project.one;
To search more than one directory, include the directory names in
parentheses.
libname both ("/some/directory" "/some/other/directory");
Operators in SAS
Arithmetic operators:
*    multiplication    +    addition          /    division
-    subtraction       **   exponentiation
Comparison operators:
=  or eq   equal to        ^= or ne   not equal to
>  or gt   greater than    >= or ge   greater than or equal to
<  or lt   less than       <= or le   less than or equal to
Boolean operators:
&  or and   logical and
|  or or    logical or
^  or not   negation
Other operators:
><   minimum    <>   maximum    ||   character concatenation
The in operator lets you test for equality to any of several constant
values. x in (1,2,3) is the same as x=1 | x=2 | x=3.
Comparison Operators
Use caution when testing two floating point numbers for equality,
due to the limitations of precision of their internal representations.
The round function can be used to alleviate this problem.
Two SAS comparison operators can be combined in a single
statement to test if a variable is within a given range, without
having to use any boolean operators. For example, to see if the
variable x is in the range of 1 to 5, you can use if 1 < x < 5 ....
SAS treats a numeric missing value as being less than any valid
number. Comparisons involving missing values do not return
missing values.
When comparing characters, if a colon is used after the comparison
operator, the longer argument will be truncated for the purpose of
the comparison. Thus, the expression name =: "R" will be true
for any value of name which begins with R.
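These points can be checked with a short data step that writes to the log. This is only a sketch; the rounding unit of 1e-9 is an arbitrary choice, and the variable names are invented.

```sas
data _null_;
   a = 0.1 + 0.2;
   /* an exact test may fail because of floating point representation */
   if a = 0.3 then put "exact comparison: equal";
   else put "exact comparison: not equal";
   /* rounding both sides to the same unit makes the test reliable */
   if round(a, 1e-9) = round(0.3, 1e-9) then put "rounded comparison: equal";

   x = 3;
   if 1 < x < 5 then put "x is between 1 and 5";

   name = "Rogers";
   if name =: "R" then put "name begins with R";

   y = .;
   if y < 0 then put "missing compares as less than 0";
run;
```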
Logical Variables
When you write expressions using comparison operators, they are
processed by SAS and evaluated to 1 if the comparison is true, and
0 if the comparison is false. This allows them to be used in logical
statements like an if statement as well as directly in numerical
calculations.
For example, suppose we want to count the number of observations
in a data set where the variable age is less than 25. Using an if
statement, we could write:
if age < 25 then count + 1;
(Note the use of the sum statement.)
With logical expressions, the same effect can be achieved as follows:
count + (age < 25);
Logical Variables (cont’d)
As a more complex example, suppose we want to create a
categorical variable called agegrp from the continuous variable age
where agegrp is 1 if age is 20 or less, 2 if age is over 20 and at
most 30, 3 if age is over 30 and at most 40, and 4 if age is greater
than 40. To perform
this transformation with if statements, we could use statements
like the following:
agegrp = 1;
if 20 < age <= 30 then agegrp = 2;
if 30 < age <= 40 then agegrp = 3;
if age > 40 then agegrp = 4;
Using logical variables provides the following shortcut:
agegrp = 1 + (age > 20) + (age > 30) + (age > 40);
Variable Attributes
There are four attributes common to SAS variables.
• length - the number of bytes used to store the variable in a
SAS data set
• informat - the format used to read the variable from raw data
• format - the format used to print the values of the variable
• label - a descriptive character label of up to 256 characters
You can set any one of these attributes by using the statement of
the appropriate name, or you can set all four of them using the
attrib statement.
Since name range lists depend on the order in which variables
are encountered in the data step, a common trick is to use a
length or attrib statement, listing variables in the order you
want them stored, as the first statement of your data step.
Variable Lengths: Character Values
• For character variables, SAS defaults to a length of 8
characters. If your character variables are longer than that,
you’ll need to use a length statement, an informat statement or
supply a format on the input statement.
• When specifying a length or format for a character variable,
make sure to precede the value with a dollar sign ($):
attrib string length = $ 12 format = $char12.;
• The maximum length of a SAS character variable is 32767.
• By default SAS removes leading blanks in character values. To
retain them use the $charw. informat.
• By default SAS pads character values with blanks at the end.
To remove them, use the trim function.
Variable Lengths: Numeric Values
• For numeric variables, SAS defaults to a length of 8 bytes
(double precision.) For non-integers, you should probably not
change from the default.
• For integers, the following chart shows the maximum value
which can be stored in the available lengths:
length   Max. value       length   Max. value
3        8,192            6        137,438,953,472
4        2,097,152        7        35,184,372,088,832
5        536,870,912      8        9,007,199,254,740,992
• You can use the default= option of the length statement to set
a default for all numeric variables produced:
length default = 4;
• Even if a numeric variable is stored in a length less than 8, it
will be promoted to double precision for all calculations.
Initialization and Termination
Although the default behavior of the data step is to automatically
process each observation in an input file or existing SAS data set, it
is often useful to perform specific tasks at the very beginning or
end of a data step. The automatic SAS variable _n_ counts the
number of iterations of the data step. It is always available within
the data step, but never output to a data set. This variable will be
equal to 1 only on the first iteration of the data step, so it can be
used to signal the need for initializations.
To tell when the last observation is being processed in a data step,
the end= variable of either the infile or set statement can be
used. This variable is not output to a data set, but will be equal to
1 only when the last observation of the input file or data set is
being processed, and will equal 0 otherwise; thus any actions to be
done at the very end of processing can be performed when this
variable is equal to 1.
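Both techniques can be sketched in one step. The data set scores and its variable score are assumed for illustration, and the mean is only meaningful if score has no missing values.

```sas
data _null_;
   set scores end=last;               /* last is 1 on the final observation */
   if _n_ = 1 then put "Starting to process scores";
   total + score;                     /* accumulate across observations */
   if last then do;
      mean = total / _n_;
      put "Observations: " _n_ " mean score: " mean;
   end;
run;
```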
Flow Control: if-then-else
The if-then statement (with optional else) is used to
conditionally execute SAS statements:
if x < 5 then group = "A";
It may be followed by a (separate) else statement:
if x < 5 then group = "A";
else group = "B";
To execute more than one statement (for either the then or the
else), use a do-end block:
if x < 5 then do;
group = "A";
use = 0;
end;
Flow Control: Subsetting if
Using an if statement without a corresponding then serves as a
filter; observations which do not meet the condition will not be
processed any further.
For example, the statement
if age < 60;
is equivalent to the statement
if age >= 60 then delete;
and will prevent observations where age is not less than 60 from
being output to the data set. This type of if statement is therefore
known as a subsetting if.
Note: You can not use an else statement with a subsetting if.
ifc and ifn functions
If your goal is to set a variable to a value based on some logical
expression, the ifc or ifn function may be more convenient than
using an if/else statement. For example, to set a tax rate based
on whether or not a state name is equal to california, the following
could be used:
rate = ifn(state = 'california',7.25,5.25);
ifn returns numeric values, while ifc returns character values.
result = ifc(score > 80,'pass','fail');
An optional fourth argument can be used to handle the case where
the first argument is missing.
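A self-contained sketch of both functions follows. The length statement is an addition made here so that state can hold the full word california and result is not stored with a longer default length; the data values are invented.

```sas
data graded;
   length state $ 12 result $ 4;
   input state $ score;
   rate = ifn(state = 'california', 7.25, 5.25);   /* numeric result */
   result = ifc(score > 80, 'pass', 'fail');       /* character result */
   datalines;
california 91
oregon 72
;
run;
```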
Flow Control: goto statement
You can use the goto statement to have SAS process statements in
some other part of your program, by providing a label followed by a
colon before the statements you wish to jump to. Label names
follow the same rules as variable names, but have a different name
space. When a labeled statement is encountered in normal
processing, it is ignored.
Use goto statements with caution, since they can make program
logic difficult to follow.
data two;
set one;
if x ^= . then goto out;
x = (y + z) / 2;
out: if x > 20 then output;
run;
Flow Control: stop, abort, return
Although rarely necessary, it is sometimes useful to override SAS’
default behavior of processing the entire set of data step statements
for each observation. Control within the current execution of the data
step can be achieved with the goto statement; these statements
provide more general control.
stop immediately discontinue entire execution of the data step
abort like stop, but set _error_ to 1
error set _error_ to 1 and print a message to the SAS log;
processing of the data step continues
return begin execution of next iteration of data step
For example, the following statement would flag an error for the
current observation and print an error message to the log:
if age > 99 then error "Age is too large for subject number " subjno ;
Do-loops
Do-loops are one of the main tools of SAS programming. They
exist in several forms, always terminated by an end; statement:
• do; - groups blocks of statements together
• do over arrayname; - process array elements
• do var=start to end <by inc>; - range of numeric values
• do var=list-of-values;
• do while(expression); (expression evaluated before each iteration)
• do until(expression); (expression evaluated after each iteration)
The do until loop is guaranteed to be executed at least once.
Some of these forms can be combined, for example
do i= 1 to end while (sum < 100);
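A small sketch of a combined form, assuming nothing beyond the data step itself (the upper limit of 20 is arbitrary): the loop below ends either when i passes 20 or as soon as the running total reaches 100, whichever comes first. The put statement writes the final values to the log so the exit condition can be checked.

```sas
data _null_;
   sum = 0;
   * both conditions are tested before each iteration;
   do i = 1 to 20 while (sum < 100);
      sum = sum + i;
   end;
   put i= sum=;
run;
```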
Iterative Do-loops: Example 1
Do-loops can be nested. The following example calculates how long
it would take for an investment with interest compounded monthly
to double:
data interest;
do rate = 4,4.5,5,7,9,20;
mrate = rate / 1200; * convert from percentage;
months = 0;
start = 1;
do while (start < 2);
start = start * (1 + mrate);
months + 1;
end;
years = months / 12;
output;
end;
keep rate years;
run;
Iterative Do-loops: Example 2
Suppose we have a record of the number of classes students take in
each year of college, stored in variables class1-class5. We want
to find out how long it takes students to take 10 classes:
data ten;
set classes;
array class class1-class5;
total = 0;
do i = 1 to dim(class) until(total >= 10);
total = total + class{i};
end;
year = i;
if total lt 10 then year = .;
drop i total;
run;
Getting out of Do-loops
There are two options for escaping a do-loop before its normal
termination:
You can use a goto statement to jump outside the loop:
count = 0;
do i=1 to 10;
if x{i} = . then count = count + 1;
if count > 5 then goto done;
end;
done: if count < 5 then output;
. . .
You can also force termination of a do-loop by modifying the value
of the index variable. Use with caution since it can create an
infinite loop.
do i=1 to 10;
if x{i} = . then count = count + 1;
if count > 5 then i=10;
end;
SAS Functions: Mathematical
Each function takes a single argument, and may return a missing
value (.) if the function is not defined for that argument.
Name      Function                   Name    Function
abs       absolute value             arcos   arccosine
digamma   digamma function           arsin   arcsine
erf       error function             atan    arctangent
exp       power of e (2.71828...)    cos     cosine
gamma     gamma function             cosh    hyperbolic cosine
lgamma    log of gamma               sin     sine
log       log (base e)               sinh    hyperbolic sine
log2      log (base 2)               tan     tangent
log10     log (base 10)              tanh    hyperbolic tangent
sign      returns sign or zero
sqrt      square root
SAS Functions: Statistical Summaries
The statistical summary functions accept unlimited numbers of
arguments, and ignore missing values.
Name      Function                     Name      Function
css       corrected sum of squares     range     maximum - minimum
cv        coefficient of variation     skewness  skewness
kurtosis  kurtosis                     std       standard deviation
max       maximum                      stderr    standard error of the mean
mean      mean                         sum       sum
median    median                       uss       uncorrected sum of squares
min       minimum                      var       variance
pctl      percentiles
In addition, the function ordinal(n,...) gives the nth ordered
value from its list of arguments.
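The following sketch, with made-up values, illustrates how missing arguments are skipped by a summary function. It also uses n, a standard SAS function not listed in the table above, which counts nonmissing arguments. The ordinal call is given only nonmissing arguments here, since its treatment of missing values is best checked in the SAS documentation.

```sas
data _null_;
   a = 3; b = .; c = 7; d = 1;
   avg = mean(a, b, c, d);          * missing b is ignored: mean of 3, 7 and 1;
   cnt = n(a, b, c, d);             * number of nonmissing arguments;
   second = ordinal(2, a, c, d);    * second smallest of 3, 7 and 1;
   put avg= cnt= second=;
run;
```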
Using Statistical Summary Functions
You can use variable lists in all the statistical summary functions
by preceding the list with the word “of”; for example:
xm = mean(of x1-x10);
vmean = mean(of thisvar -- thatvar);
Without the of, the single dash is interpreted in its usual way, that
is as a minus sign or the unary minus operator; thus
xm = mean(of x1-x10);
is the same as
xm = mean(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10);
but
xm1 = mean(x1-x10);
calculates the mean of x1 minus x10, and
xm2 = mean(x1--x10);
calculates the mean of x1 plus x10.
Concatenating Character Strings
SAS provides the following functions for joining together character
strings:
cat - preserve all spaces
cats - remove leading and trailing blanks
catt - remove trailing blanks
catx - remove leading and trailing blanks, and join with separator
(first argument)
Each function accepts an unlimited number of arguments. To join
together all the elements in a variable list, use the of keyword:
x1 = 'one';
x2 = 'two';
x3 = 'three';
all = catx(' ',of x1-x3); * or catx(' ',x1,x2,x3);
The variable all will have the value 'one two three'.
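To see how the four functions differ, the sketch below (variable names and values are illustrative only) joins two strings that carry both leading and trailing blanks. The comments describe what each result should contain.

```sas
data _null_;
   length a b $ 6;
   a = '  one';    * two leading blanks, padded on the right to length 6;
   b = 'two';      * padded on the right to length 6;
   r1 = cat(a, b);        * all blanks kept, including the padding;
   r2 = cats(a, b);       * leading and trailing blanks removed: onetwo;
   r3 = catt(a, b);       * only trailing blanks removed, leading blanks of a kept;
   r4 = catx('-', a, b);  * blanks stripped, separator inserted: one-two;
   put r1= / r2= / r3= / r4=;
run;
```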
SAS Functions: Character Manipulation
compress(target,<chars-to-remove>) - removes the specified
characters from target
expr = "one, two: three:";
new = compress(expr,",:"); * new => "one two three";
With no second argument, compress removes blanks.
count(string,substring) - counts how many times substring
appears in string
index(source,string) - finds position of string in source
where = "university of california";
i = index(where,"cal"); * i => 15;
indexc(source,string) - finds position of any character in
string in source
where = "berkeley, ca";
i = indexc(where,"abc"); * i=1 (b is in position 1);
index and indexc return 0 if there is no match.
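The count function has no example above, so here is a brief sketch using an invented phrase. It counts occurrences of a substring wherever they appear, not just whole words.

```sas
data _null_;
   phrase = "the cat sat on the mat";
   n_the = count(phrase, "the");   * "the" occurs twice;
   n_at  = count(phrase, "at");    * "at" occurs inside cat, sat and mat;
   put n_the= n_at=;
run;
```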
SAS Functions: Character Manipulation (cont’d)
left(string) - returns a left-justified character variable
length(string) - returns number of characters in a string
length returns 1 if string is missing, 12 if string is uninitialized.
repeat(string,n) - appends n additional copies of string to
itself, giving n+1 copies in all
reverse(string) - reverses the characters in a character variable
right(string) - returns a right-justified character variable
scan(string,n,<delims>) - returns the nth “word” in string
field = "smith, joe";
first = scan(field,2," ,"); * first will be 'joe';
Negative numbers count from right to left.
substr(string,position,<n>) - returns pieces of a variable
field = "smith, joe";
last = substr(field,1,index(field,",") - 1);
results in last equal to "smith".
SAS Functions: Character Manipulation (cont’d)
translate(string,to,from) - changes from chars to to chars
word = "eXceLLent";
new = translate(word,"xl","XL"); *new => "excellent";
tranwrd(string,old,new) - changes old to new in string
trim(string) - returns string with trailing blanks removed
upcase(string) - converts lowercase to uppercase
verify(source,string) - returns the position of the first character
in source which is not in string
check = verify(val,"0123456789.");
results in check equal to 0 if val is a character string containing
only numbers and periods.
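A short sketch of upcase, trim and the word-replacement function (spelled tranwrd in SAS) working together; the variable names and values are invented. Because city is declared with length 20, its value is padded with blanks, which is where trim makes a visible difference.

```sas
data _null_;
   length city $ 20;
   city = "new york";
   big = upcase(city);                         * NEW YORK;
   swapped = tranwrd(city, "york", "jersey");  * new jersey;
   padded = city || ", usa";                   * padding blanks remain before the comma;
   tidy = trim(city) || ", usa";               * new york, usa;
   put big= / swapped= / padded= / tidy=;
run;
```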
Regular Expressions in SAS
The prxmatch and prxchange functions allow the use of
Perl-compliant regular expressions in SAS programs. For example,
to find the location of the first digit followed by a blank in a
character string, the following code could be used:
str = '275 Main Street';
wh = prxmatch('/\d /',str); * wh will be equal to 3;
To reverse the order of two names separated by a comma, the
following could be used:
str = 'Smith, John';
newstr = prxchange('s/(\w+), (\w+)/$2 $1/',-1,str);
The second argument is the number of changes to make; -1 means
to change all occurrences.
For more efficiency, regular expressions can be precompiled using
the prxparse function.
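One possible way to precompile a pattern is sketched below; the variable name re and the inline data are illustrative. The pattern is compiled once, on the first iteration, and the retained identifier is then passed to prxmatch for every observation.

```sas
data _null_;
   retain re;
   length str $ 30;
   * compile the pattern only once and keep its identifier;
   if _n_ = 1 then re = prxparse('/\d /');
   input str $char30.;
   wh = prxmatch(re, str);   * 0 when the pattern is not found;
   put str= wh=;
   datalines;
275 Main Street
No digits here
;
run;
```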
SAS Functions for Random Number Generation
Each of the random number generators accepts a seed as its first
argument. If this value is greater than 0, the generator produces a
reproducible sequence of values; otherwise, it takes a seed from the
system clock and produces a sequence which cannot be reproduced.
The two most common random number functions are
ranuni(seed) - uniform variates in the range (0, 1), and
rannor(seed) - normal variates with mean 0 and variance 1.
Other distributions include binomial (ranbin), Cauchy (rancau),
exponential (ranexp), gamma (rangam), Poisson (ranpoi), and
tabled probability functions (rantbl).
For more control over the output of these generators, see the
documentation for the corresponding call routines, for example call
ranuni.
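As a hedged sketch of the call routine form (the seed values and data set name are arbitrary): call ranuni takes the seed as a variable rather than a constant and updates it on each call, which makes it possible to keep two separately seeded streams in one data step.

```sas
data streams;
   retain seed1 12345 seed2 67890;
   do i = 1 to 5;
      call ranuni(seed1, u1);   * seed1 is updated by each call;
      call ranuni(seed2, u2);   * second stream, driven by its own seed;
      output;
   end;
   keep u1 u2;
run;
```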
Generating Random Numbers
The following example, which uses no input data, creates a data set
containing simulated data. Note the use of ranuni and the int
function to produce a categorical variable (group) with
approximately equal numbers of observations in each category.
data sim;
do i=1 to 100;
group = int(5 * ranuni(12345)) + 1;
y = rannor(12345);
output;
end;
keep group y;
run;
Creating Multiple Data Sets
To create more than one data set in a single data step, list the
names of all the data sets you wish to create on the data statement.
When you have multiple data set names on the data statement,
observations will be automatically output to all the data sets unless
you explicitly state the name of the data set in an output
statement.
data young old;
set all;
if age < 25 then output young;
else output old;
run;
Note: If your goal is to perform identical analyses on subgroups of
the data, it is usually more efficient to use a by statement or a
where statement.
Subsetting Observations
Although the subsetting if is the simplest way to subset
observations, you can also actively remove observations using a
delete statement, or include observations using an output statement.
• delete statement
if reason = 99 then delete;
if age > 60 and sex = "F" then delete;
No further processing is performed on the current observation
when a delete statement is encountered.
• output statement
if reason ^= 99 and age < 60 then output;
if x > y then output;
Subsequent statements are carried out (but not reflected in the
current observation). When a data step contains one or more
output statements, SAS’ usual automatic outputting at the end
of each data step iteration is disabled — only observations
which are explicitly output are included in the data set.
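The timing of the output statement can be seen in a tiny sketch (the data set name demo is arbitrary). The step has no input, so it runs once; the observation is written at the output statement, and the later assignment is executed but does not change what was already written.

```sas
data demo;
   x = 1;
   output;   * the observation is written here, with x = 1;
   x = 2;    * still executed, but the stored observation keeps x = 1;
run;
```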
Random Access of Observations
In the usual case, SAS automatically processes each observation in
sequential order. If you know the position(s) of the observation(s)
you want in the data set, you can use the point= option of the set
statement to process only those observations.
The point= option of the set statement specifies the name of a
temporary variable whose value will determine which observation
will be read. When you use the point= option, SAS’ default
behavior of automatically looping through the data set is disabled,
and you must explicitly loop through the desired observations
yourself, and use the stop statement to terminate the data step.
The following example also makes use of the nobs= option of the
set statement, which creates a temporary variable containing the
number of observations contained in the data set.
Random Access of Observations: Example
The following program reads every third observation from the data
set big:
data sample;
do obsnum = 1 to total by 3;
set big point=obsnum nobs=total;
if _error_ then abort;
output;
end;
stop;
run;
Note that the set statement is inside the do-loop. If an attempt is
made to read an invalid observation, SAS will set the automatic
variable _error_ to 1. The stop statement ensures that SAS does
not go into an infinite loop.
Application: Random Sampling I
Sometimes it is desirable to analyze just a random subsample of
your data, i.e. one in which each observation is as likely to be
included as any other. If you don’t need to control the exact number
of observations in your sample, you can use the ranuni function in a
very simple fashion. Suppose we want a random sample consisting of
roughly 10% of the observations in a data set. The following program
will randomly extract the sample:
data sample;
set giant;
if ranuni(12345) < .1;
run;
Application: Random Sampling II
Now suppose we wish to randomly extract exactly n observations
from a data set. To ensure randomness, we must adjust the fraction
of observations chosen depending on how many observations we
have already chosen. This can be done using the nobs= option of
the set statement. For example, to choose exactly 15 observations
from a data set all, the following code could be used:
data some;
retain k 15 n ;
drop k n;
set all nobs=nn;
if _n_ = 1 then n = nn;
if ranuni(0) < k / n then do;
output;
k = k - 1;
end;
if k = 0 then stop;
n = n - 1;
run;
Application: Random Sampling III
The point= option of the set statement can often be used to create
many random samples efficiently. The following program creates
1000 samples of size 10 from the data set big, using the variable
sample to identify the different samples in the output data set:
data samples;
do sample=1 to 1000;
do j=1 to 10;
r = ceil(ranuni(1) * nn); * ceil gives 1 to nn; round could give 0;
set big point=r nobs=nn;
output;
end;
end;
stop;
drop j;
run;
By Processing in Procedures
In procedures, the by statement of SAS allows you to perform
identical analyses for different groups in your data. Before using a
by statement, you must make sure that the data is sorted (or at
least grouped) by the variables in the by statement.
The form of the by statement is
by <descending> variable-1 <... <descending> variable-n> <notsorted>;
By default, SAS expects the by variables to be sorted in ascending
order; the optional keyword descending specifies that they are in
descending order.
The optional keyword notsorted at the end of the by statement
informs SAS that the observations are grouped by the by variables,
but that they are not presented in a sorted order. Any time any of
the by variables change, SAS interprets it as a new by group.
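A minimal sketch of by processing, assuming a data set named all that contains the variables sex and age (both names are borrowed from other examples and may not match your data): the data is sorted first, and proc means then produces a separate summary for each value of sex.

```sas
proc sort data=all out=sorted;
   by sex;
run;

proc means data=sorted;
   by sex;       * one set of statistics per by group;
   var age;
run;
```

Writing the sorted copy to a new data set with out= leaves the original order of all untouched.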
Selective Processing in Procedures: where statement
When you wish to use only some subset of a data set in a
procedure, the where statement can be used to select only those
observations which meet some condition. There are several ways to
use the where statement.
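As a procedure statement:
proc reg data=old;
where sex eq 'M';
model y=x;
run;
As a data set option:
proc reg data=old(where = (sex eq 'M'));
model y = x;
run;
In the data step (the where statement applies only to observations
read from SAS data sets, e.g. with set or merge, not to raw data
read with input):
data survey;
set responses; * responses: a hypothetical existing data set;
where q2 is not missing and q1 < 4;
run;
data new;
set old(where = (group = 'control'));
run;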
where statement: Operators
Along with all the usual SAS operators, the following are available
in the where statement:
between/and - specify a range of values
where salary between 20000 and 50000;
contains - select based on strings contained in character variables
where city contains 'bay';
is missing - select based on regular or special missing value
where x is missing and y is not missing;
like - select based on patterns in character variables
(Use % for any number of characters, _ for exactly one)
where name like 'S%';
sounds like (=*) - select based on soundex algorithm
where name =* 'smith';
You can use the word not with all of these operators to reverse the
sense of the comparison.
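Several of these operators can be combined in one where statement. The sketch below assumes a hypothetical data set staff with variables salary, name, city and age.

```sas
proc print data=staff;
   where salary between 20000 and 50000
         and name like 'S%'
         and city contains 'bay'
         and age is not missing;
run;
```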
Basic Structure of SAS
There are two main components to most SAS programs - the data
step(s) and the procedure step(s).
The data step reads data from external sources, manipulates and
combines it with other data sets, and prints reports. The data step is
used to prepare your data for use by one of the procedures (often
called “procs”).
SAS is very lenient about the format of its input - statements can
be broken up across lines, multiple statements can appear on a
single line, and blank spaces and lines can be added to make the
program more readable.
The procedure steps perform analysis on the data, and produce
(often huge amounts of) output.
The most effective strategy for learning SAS is to concentrate on
the details of the data step, and learn the details of each procedure
as you have a need for them.
Accessing SAS
There are four ways to access SAS on a UNIX system:
1. Type sas . This opens the SAS “display manager”, which
consists of three windows (program, log, and output). Some
procedures must be run from the display manager.
2. Type sas -nodms . You will be prompted for each SAS
statement, and output will scroll by on the screen.
3. Type sas -stdio . SAS will act like a standard UNIX
program, expecting input from standard input, sending the log
to standard error, and the output to standard output.
4. Type sas filename.sas . This is the batch mode of SAS -
your program is read from filename.sas, the log goes to
filename.log and the output goes to filename.lst.
Some Preliminary Concepts and Rules
• SAS variable names must be 32 characters or less, constructed
of letters, digits and the underscore character. (Before version
7, the limit was 8.)
• It’s a good idea not to start variable names with an underscore,
because special system variables are named that way.
• Data set names follow rules similar to those for variable names,
but they have a different name space.
• There are virtually no reserved keywords in SAS; it’s very good
at figuring things out by context.
• SAS is not case sensitive, except inside of quoted strings.
Starting in Version 7, SAS will remember the case of variable
names when it displays them.
• Missing values are handled consistently in SAS, and are
represented by a period (.).
• Each statement in SAS must end in a semicolon (;).
Structure of SAS programs
• Statements beginning with an asterisk (*) and ending with a
semicolon are treated as comments. Alternatively you can enclose
comments between /* and */.
• You can combine as many data and proc steps in whatever
order you want.
• Data steps begin with the word data and procedure steps
begin with the word proc.
• The run; command signals to SAS that the previous
commands can be executed.
• Terminate an interactive SAS job with the endsas; statement.
• There are global options (like linesize and pagesize) as well as
options specific to datasets and procedures.
• Informative messages are written to the SAS log - make sure
you read it!
The Data Step
The data step provides a wide range of capabilities, among them
reading data from external sources, reshaping and manipulating
data, transforming data and producing printed reports.
The data step is actually an implied do loop whose statements will
be executed for each observation either read from an external
source, or accessed from a previously processed data set.
For each iteration, the data step starts with a vector of missing
values for all the variables to be placed in the new observation. It
then overwrites the missing value for any variables either input or
defined by the data step statements. Finally, it outputs the
observation to the newly created data set.
The true power of the data step is illustrated by the fact that all of
these defaults may be overridden if necessary.
Data Step: Basics
Each data step begins with the word data and optionally one or
more data set names (and associated options) followed by a
semicolon. The name(s) given on the data step are the names of
data sets which will be created within the data step.
If you don’t include any names on the data step, SAS will create
default data set names of the form datan, where n is an integer
which starts at 1 and is incremented so that each data set created
has a unique name within the current session. Since it becomes
difficult to keep track of the default names, it is recommended that
you always explicitly specify a data set name on the data
statement.
When you are running a data step to simply generate a report, and
don’t need to create a data set, you can use the special data set
name _null_ to eliminate the output of observations.
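A brief sketch of a report-only step, assuming an existing data set one with numeric variables a, b and c: nothing is written to a data set, and the put statement sends one line per observation to the SAS log.

```sas
data _null_;
   set one;
   total = a + b + c;
   put 'Row ' _n_ 'has total ' total;
run;
```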
Data Step: Inputting Data
The input statement of SAS is used to read data from an external
source, or from lines contained in your SAS program.
The infile statement names an external file or fileref from which
to read the data; otherwise the cards; or datalines; statement is
used to precede the data.
Reading data from an external file:
data one;
infile "input.data";
input a b c;
run;
Reading from inline data:
data one;
input a b c;
datalines;
. . .
;
By default, each invocation of the input statement reads another
record. This example uses free-form input, with at least one space
between values.
A fileref is a SAS name, created by the filename statement, which refers to
an external file or other device.
Data Step: input Statement
There are three basic forms of the input statement:
1. List input (free form) - data fields must be separated by at
least one blank. List the names of the variables, following each
name with a dollar sign ($) for character data.
2. Column input - follow the variable name (and $ for character)
with startingcolumn-endingcolumn.
3. Formatted input - Optionally precede the variable name with
@startingcolumn; follow the variable name with a SAS informat
designation. (Examples of informats: $10. (10 column
character), 6. (6 column numeric))
When mixing different input styles, note that for column and
formatted input, the next input directive reads from the column
immediately after the previous value, while for list input, the next
directive reads from the second column after the previous value.
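The three forms can be compared by reading the same two records in each style. This is only a sketch with invented data and data set names.

```sas
* list input: fields separated by blanks;
data list_in;
   input name $ age;
   datalines;
alice 34
bob 27
;
run;

* column input: name in columns 1-5, age in columns 7-8;
data col_in;
   input name $ 1-5 age 7-8;
   datalines;
alice 34
bob   27
;
run;

* formatted input: column pointers and informats;
data fmt_in;
   input @1 name $5. @7 age 2.;
   datalines;
alice 34
bob   27
;
run;
```

For column and formatted input the values must line up in fixed columns, which is why the second record is padded with blanks.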
Modifiers for List Input
The colon (:) modifier for list input tells SAS to use a format for
input, but to stop when the next whitespace is found. Data like:
17,244 2,500,300 600 12,003
14,120 2,300 4,232 25
could be read using an input statement like
input x1 : comma. x2 : comma. x3 : comma. x4 : comma. ;
The ampersand (&) modifier tells SAS to use two whitespace
characters to signal the end of a character variable, allowing
embedded blanks to be read using list input. Thus, the statements:
length name $ 25;
input name & $ year;
could be used to read data such as
George Washington  1789
John Adams  1797
Thomas Jefferson  1801
Other Modifiers for the Input Statement
+number - advance number columns.
#number - advance to line number.
/ - advance to next line.
trailing @ - hold the line to allow further input statements in this
iteration of the data step on the same data.
trailing @@ - hold the line to allow continued reading from the line
on subsequent iterations of the data step.
Note: If SAS needs to read an additional line to input all the
variables referenced in the input statement it prints the following
message on the log:
NOTE: SAS went to a new line when INPUT statement reached past
the end of a line.
If you see this note, make sure you understand why it was printed!
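The two line-holding modifiers are sketched below with invented data. In the first step the trailing @@ lets several observations be read from each line; in the second, the trailing @ holds the line so that a later input statement in the same iteration can decide how to read the rest of it.

```sas
* trailing @@: each x y pair becomes one observation;
data pairs;
   input x y @@;
   datalines;
1 2 3 4 5 6
7 8
;
run;

* trailing @: read the record type first, then the rest of the line;
data mixed;
   input type $ @;
   if type = 'A' then input score;
   else input grade $;
   datalines;
A 91
B pass
;
run;
```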
The input Statement
Variable lists can be used on the input statement. For example, the
list var1 - var4 expands to var1 var2 var3 var4.
You can repeat formats for variable lists by including the names
and formats in parentheses: (var1 - var4) (5.) reads four
numeric variables from 20 consecutive columns (5 columns for each
variable).
You can also repeat formats using the notation num*format. The
previous example could be replaced with (4 * 5.).
A null input statement (no variables) can be used to free holding
caused by trailing @-signs.
The @, + and # specifications can all be followed by a variable name
instead of a number.
If you want to make sure your input data is really arranged the way
you think it is, the list; command will display your input data
with a “ruler” showing column numbers.
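A short sketch of a format list together with the list statement, using invented data laid out in four 5-column fields:

```sas
data wide;
   input (var1 - var4) (5.);   * four numeric fields, 5 columns each;
   list;                       * echo each input line to the log with a column ruler;
   datalines;
   12   34   56   78
    1    2    3    4
;
run;
```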
FTP Access
SAS provides the ability to read data directly from an FTP server,
without the need to create a local copy of the file, through the ftp
keyword of the filename statement.
Suppose there is a data file called user.dat in the directory
public on an ftp server named ftp.myserver.com. If your user
name is joe and your password is secret, the following statement
will establish a fileref for reading the data:
filename myftp ftp 'user.dat' cd='/public' user='joe'
pass='secret' host='ftp.myserver.com';
The fileref can now be used in the infile statement in the usual
way.
You can read files from http (web) servers in a similar fashion,
using the url keyword.
Options for the infile statement
For inline data, use the infile name cards or datalines.
missover - sets values to missing if an input statement would
read more than one line.
stopover - like missover, but declares an error and stops.
lrecl=num - treats the input as having a length of num characters.
Required if input records are longer than 256 characters.
dlm='chars' - uses the characters in chars instead of blanks
and tabs as separators in list (free-form) input.
dsd - reads comma-separated data.
expandtabs - expands tabs to spaces before inputting data.
end=varname - creates a SAS variable whose value is 1 when SAS
processes the last line in the file.
obs=n - limits processing of infile to n records.
pad - adds blanks to lines that are shorter than the input
statement specifies.
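As an illustration of combining options on inline data (the data and names are made up), the step below reads comma-separated records with dsd, so that two adjacent commas are read as a missing value, and uses missover so that a short record leaves the remaining variables missing instead of pulling values from the next line.

```sas
data csv;
   infile datalines dsd missover;
   input id name $ score;
   datalines;
1,alice,90
2,,85
3,carol
;
run;
```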
Variable Length Records
Consider the following file, containing the year and name of the
first three American presidents:
1789 George Washington
1797 John Adams
1801 Thomas Jefferson
If we were to use an input statement like
input year 4. @6 name $17.;
SAS would try to read past the end of the second line, since the
name only has 10 characters. The solution is the pad option of the
infile statement. Suppose the data is in a file called p.txt. The
following program correctly reads the data:
data pres;
infile 'p.txt' pad;
input year 4. @6 name $17.;
run;
Reading SAS programs from external files
The infile statement can be used to read data which is stored in
a file separate from your SAS program. When you want SAS to
read your program from an external file you can use the %include
statement, followed by a filename or fileref. After SAS processes a
%include statement, it continues to read statements from its original
source (input file, keyboard or display manager).
For example, suppose the SAS program statements to read a file
and create a data set are in the system file readit.sas. To process
those statements, and then print the data set, the following
commands can be used:
%include "readit.sas";
proc print;
run;
proc import
For certain simple data files, SAS can create a SAS data set directly
using proc import. The dbms= option informs SAS of the type of
file to be read, and choices include xls (Excel spreadsheets), csv
(Comma-separated values), dbf (Dbase files), dta (Stata files), sav
(SPSS files), and tab (Tab-separated files). For example, to read an
Excel spreadsheet called data.xls into a SAS data set named
xlsdata, the following statements can be used:
proc import dbms=xls datafile='data.xls' out=xlsdata;
run;
proc import provides no options for formatting, and may not be
successful with all types of data files.
Repetitive Processing of Variables
The array statement can be used to perform the same task on a
group of variables.
array arrayname <$> variable-list <(startingvalues)>;
array arrayname{n} <$> variable-list <(startingvalues)>;
You can then use the array name with curly braces ({}) and a
subscript, or in a do over loop:
array x x1-x9;
do i = 1 to dim(x);
if x{i} = 9 then x{i} = .;
end;
or, equivalently:
array x x1-x9;
do over x;
if x = 9 then x = .;
end;
Notes:
1. All the variables in an array must be of the same type.
2. An array cannot have the same name as a variable.
3. You can use the keyword _temporary_ instead of a variable list.
4. The statement array x{3}; generates variables x1, x2, and x3.
5. The function dim returns the number of elements in an array.
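A sketch of a _temporary_ array holding an answer key, assuming a hypothetical data set answers with variables q1-q5: the key values exist only during the data step and are not written to the output data set.

```sas
data scored;
   set answers;
   array resp{5} q1-q5;
   array key{5} _temporary_ (1 3 2 4 2);   * made-up answer key;
   ncorrect = 0;
   do i = 1 to dim(resp);
      if resp{i} = key{i} then ncorrect = ncorrect + 1;
   end;
   drop i;
run;
```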
Titles and Footnotes
SAS allows up to ten lines of text at the top (titles) and bottom
(footnotes) of each page of output, specified with title and
footnote statements. The form of these statements is
title<n> text; or footnote<n> text;
where n, if specified, can range from 1 to 10, and text must be
surrounded by double or single quotes. If text is omitted, the title
or footnote is deleted; otherwise it remains in effect until it is
redefined. Thus, to have no titles, use:
title;
By default SAS includes the date and page number on the top of
each piece of output. These can be suppressed with the nodate and
nonumber system options.
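For instance, the statements below (the title text and the data set survey are illustrative) set two title lines and a footnote for one listing, suppress the date and page number, and then clear the titles and footnotes again.

```sas
options nodate nonumber;
title 'Survey Results';
title2 'Preliminary';
footnote 'Source: hypothetical survey data';
proc print data=survey;
run;
title;      * remove all titles;
footnote;   * remove all footnotes;
```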
Missing Values
SAS handles missing values consistently throughout various
procedures, generally by deleting observations which contain
missing values. It is therefore very important to inspect the log and
listing output, as well as paying attention to the numbers of
observations used, when your data contains missing values.
For character variables, a missing value is represented by a blank
(" ", not a null string).
For numeric variables, a missing value is represented by a period
(with no quotes). Unlike many languages, you can test for equality
to missing in the usual fashion:
if string = " " then delete; * character variable;
if num = . then delete; * numeric variable;
if x > 10 then x = .; * set a variable to missing;
Special Missing Values
In addition to the regular missing value (.), you can specify one or
more single alphabetic characters which will be treated as missing
values when encountered in your input.
Most procedures will simply treat these special missing values in
the usual way, but others (such as freq and summary) have options
to tabulate each type of missing value separately. For example,
data one;
missing x;
input vv @@;
datalines;
12 4 5 6 x 9 . 12
;
The 5th and 7th observations will
both be missing, but internally they
are stored in different ways.
Note: When you use a special missing value, it will not be detected
by a statement like if vv = .; in the example above, you would
need to use if vv = .x to detect the special missing value, or to
use the missing function of the data step.
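The distinction can be sketched as follows, assuming the data set one and variable vv from the example above; the names two, kind and anymiss are invented.

```sas
data two;
   set one;
   if vv = . then kind = 'regular';        * ordinary missing value only;
   else if vv = .x then kind = 'special';  * the special missing value x;
   else kind = 'present';
   anymiss = missing(vv);                  * 1 for either kind of missing value;
run;
```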
Variable Lists
SAS provides several different types of variable lists, which can be
used in all procedures, and in some data step statements.
• Numbered List - When a set of variables have the same prefix,
and the rest of the name is a consecutive set of numbers, you
can use a single dash (-) to refer to an entire range:
x1 - x3 => x1, x2, x3; x01 - x03 => x01, x02, x03
• Colon list - When a set of variables all begin with the same
sequence of characters you can place a colon after the sequence
to include them all. If variables a, b, xheight, and xwidth
have been defined, then x: => xheight, xwidth.
• Special Lists - Three keywords refer to a list with the obvious
meaning: _numeric_ _character_ _all_
In a data step, special lists will only refer to variables which
were already defined when the list is encountered.
Variable Lists (cont’d)
• Name range list - When you refer to a list of variables in the
order in which they were defined in the SAS data set, you can
use a double dash (--) to refer to the range:
If the input statement
input id name $ x y z state $ salary;
was used to create a data set, then
x -- salary => x, y, z, state, salary
If you only want character or numeric variables in the name
range, insert the appropriate keyword between the dashes:
id -numeric- z => id, x, y, z
In general, variables are defined in the order they appear in the
data step. If you’re not sure about the order, you can check
using proc contents.
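A small sketch of these lists in use, with an invented data set emp built from the input statement above:

```sas
data emp;
   input id name $ x y z state $ salary;
   datalines;
1 ann 1 2 3 CA 50000
2 bob 4 5 6 NY 60000
;
run;

proc print data=emp;
   var name x -- z;      * x, y and z, in data set order;
run;

proc means data=emp;
   var _numeric_;        * id, x, y, z and salary;
run;
```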
The set statement
When you wish to process an already created SAS data set instead
of raw data, the set statement is used in place of the input and
infile or lines statements.
Each time it encounters a set statement, SAS inputs an
observation from an existing data set, containing all the variables
in the original data set along with any newly created variables.
This example creates a data set called trans with all the variables
in the data set orig plus a new variable called logx:
data trans;
set orig;
logx = log(x);
run;
You can specify the path to a SAS data set in quotes instead of a
data set name. If you use a set statement without specifying a
data set name, SAS will use the most recently created data set.
drop= and keep= data set options
Sometimes you don’t need to use all of the variables in a data set
for further processing. To restrict the variables in an input data
set, the data set option keep= can be used with a list of variable
names. For example, to process the data set big, but only using
variables x, y, and z, the following statements could be used:
data new;
set big(keep = x y z);
. . .
Using a data set option in this way is very efficient, because it
prevents all the variables from being read for each observation. If
you only wanted to remove a few variables from the data set, you
could use the drop= option to specify the variables in a similar
fashion.
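For comparison, a sketch of the drop= form; the data set name small and the variables u and v are hypothetical stand-ins for the few variables you want to leave out.

```sas
data small;
  set big(drop = u v);   * read every variable in big except u and v;
run;
```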
drop and keep statements
To control the variables which will be output to a data set, drop or
keep statements can be used. (It is an error to specify both drop
and keep in the same data step). Suppose we have a data set with
variables representing savings and income. We wish to output
only those observations for which the ratio of savings to income is
greater than 0.05, but we don’t need this ratio output to our final
result.
data savers;
set all;
test = savings / income;
if test > .05 then output;
drop test;
run;
As an alternative to drop, the statement
keep income savings;
could have been used instead.
retain statement
SAS’ default behavior is to set all variables to missing each time a
new observation is read. Sometimes it is necessary to “remember”
the value of a variable from the previous observation. The retain
statement specifies variables which will retain their values from
previous observations instead of being set to missing. You can
specify an initial value for retained variables by putting that value
after the variable name on the retain statement.
Note: Make sure you understand the difference between retain
and keep.
For example, suppose we have a data set which we assume is sorted
by a variable called x. To print a message when an out-of-order
observation is encountered, we could use the following code:
retain lastx .; * retain lastx and initialize to missing;
if x < lastx then put 'Observation out of order, x=' x;
else lastx = x;
sum Statement
Many times the sum of a variable needs to be accumulated between
observations in a data set. While a retain statement could be used,
SAS provides a special way to accumulate values known as the sum
statement. The format is
variable + expression;
where variable is the variable which will hold the accumulated
value, and expression is a SAS expression which evaluates to a
numeric value. The value of variable is automatically initialized
to zero. The sum statement is equivalent to the following:
retain variable 0;
variable = variable + expression;
with one important difference. If the value of expression is
missing, the sum statement treats it as a zero, whereas the normal
computation will propagate the missing value.
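A small sketch of that difference, using made-up data (the names totals, amount, total1 and total2 are illustrative only):

```sas
data totals;
  input amount @@;
  retain total2 0;
  total1 + amount;            * sum statement: a missing amount counts as 0;
  total2 = total2 + amount;   * assignment: a missing amount propagates;
  datalines;
10 20 . 5
;
run;

proc print data=totals;
run;
```

Here total1 should run 10, 30, 30, 35, while total2 becomes missing at the third observation and stays missing afterwards, because the retained missing value is carried into every later addition.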
Default Data Sets
In most situations, if you don’t specify a data set name, SAS will
use a default data set, using the following rules:
• When creating data sets, SAS uses the names data1, data2,
etc, if no data set name is specified. This can happen because
of a data step, or if a procedure automatically outputs a data
set which you have not named.
• When processing data sets, SAS uses the most recently created
data set, which has the special name _last_. This can happen
when you use a set statement with no data set name, or invoke
a procedure without a data= argument. To override this, you
can set the value of _last_ to a data set of your choice with the
options statement:
options _last_ = mydata;
Temporary Data Sets
By default, the data sets you create with SAS are deleted at the
end of your SAS session. During your session, they are stored in a
directory with a name like SAS_workaXXXX, where the Xs are used
to create a unique name. By default, this directory is created
within the system /tmp directory.
You can have the temporary data sets stored in some other
directory using the -work option when you invoke sas, for example:
sas -work .
to use the current directory or, for example,
sas -work /some/other/directory
to specify some other directory.
Note: If SAS terminates unexpectedly, it may leave behind a work
directory which may be very large. If so, it will need to be removed
using operating system commands.
Permanent Data Sets
You can save your SAS data sets permanently by first specifying a
directory to use with the libname statement, and then using a two
level data set name in the data step.
libname project "/some/directory";
data project.one;
Data sets created this way will have filenames of the form
datasetname.sas7bdat.
In a later session, you could refer to the data set directly, without
having to create it in a data step.
libname project "/some/directory";
proc reg data=project.one;
To search more than one directory, include the directory names in
parentheses.
libname both ("/some/directory" "/some/other/directory");
Operators in SAS
Arithmetic operators:
*   multiplication     +   addition          /   division
-   subtraction        **  exponentiation
Comparison Operators:
= or eq    equal to           ^= or ne   not equal to
> or gt    greater than       >= or ge   greater than or equal to
< or lt    less than          <= or le   less than or equal to
Boolean Operators:
& or and   and        | or or   or        ^ or not   negation
Other Operators:
><   minimum          <>   maximum        ||   char. concatenation
The in operator lets you test for equality to any of several constant
values. x in (1,2,3) is the same as x=1 | x=2 | x=3.
Comparison Operators
Use caution when testing two floating point numbers for equality,
due to the limitations of precision of their internal representations.
The round function can be used to alleviate this problem.
Two SAS comparison operators can be combined in a single
statement to test if a variable is within a given range, without
having to use any boolean operators. For example, to see if the
variable x is in the range of 1 to 5, you can use if 1 < x < 5 ....
SAS treats a numeric missing value as being less than any valid
number. Comparisons involving missing values do not return
missing values.
When comparing characters, if a colon is used after the comparison
operator, the longer argument will be truncated for the purpose of
the comparison. Thus, the expression name =: "R" will be true
for any value of name which begins with R.
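The points above can be sketched in a short data _null_ step. The tolerance 1e-9 passed to round and the value 'Roberts' are arbitrary choices for illustration.

```sas
data _null_;
  x = 0.1 + 0.2;
  * exact test: may fail because of binary floating point representation;
  if x = 0.3 then put 'exact comparison: equal';
  else put 'exact comparison: not equal';
  * rounding both sides to the same unit removes the tiny difference;
  if round(x, 1e-9) = round(0.3, 1e-9) then put 'rounded comparison: equal';
  * two comparison operators combined into a range test;
  if 0 < x < 1 then put 'x is between 0 and 1';
  * colon modifier: compare only as many characters as the shorter operand;
  name = 'Roberts';
  if name =: 'R' then put 'name begins with R';
run;
```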
Logical Variables
When you write expressions using comparison operators, they are
processed by SAS and evaluated to 1 if the comparison is true, and
0 if the comparison is false. This allows them to be used in logical
statements like an if statement as well as directly in numerical
calculations.
For example, suppose we want to count the number of observations
in a data set where the variable age is less than 25. Using an if
statement, we could write:
if age < 25 then count + 1;
(Note the use of the sum statement.)
With logical expressions, the same effect can be achieved as follows:
count + (age < 25);
Logical Variables (cont’d)
As a more complex example, suppose we want to create a
categorical variable called agegrp from the continuous variable age
where agegrp is 1 if age is 20 or less, 2 if age is over 20 and up
to 30, 3 if age is over 30 and up to 40, and 4 if age is greater
than 40. To perform
this transformation with if statements, we could use statements
like the following:
agegrp = 1;
if 20 < age <= 30 then agegrp = 2;
if 30 < age <= 40 then agegrp = 3;
if age > 40 then agegrp = 4;
Using logical variables provides the following shortcut:
agegrp = 1 + (age > 20) + (age > 30) + (age > 40);
Variable Attributes
There are four attributes common to SAS variables.
• length - the number of bytes used to store the variable in a
SAS data set
• informat - the format used to read the variable from raw data
• format - the format used to print the values of the variable
• label - a descriptive character label of up to 256 characters
You can set any one of these attributes by using the statement of
the appropriate name, or you can set all four of them using the
attrib statement.
Since named variable lists depend on the order in which variables
are encountered in the data step, a common trick is to use a
length or attrib statement, listing variables in the order you
want them stored, as the first statement of your data step.
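A sketch of that trick, with invented variable names and data: the length statement comes first, so the variables are stored as id, name, score even though the input statement reads them in a different order.

```sas
data ordered;
  length id 8 name $ 20 score 8;   * sets both storage order and lengths;
  input score name $ id;
  datalines;
91.5 anderson 1
;
run;

proc contents data=ordered varnum; * confirm the order: id name score;
run;
```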
Variable Lengths: Character Values
• For character variables, SAS defaults to a length of 8
characters. If your character variables are longer than that,
you’ll need to use a length statement, an informat statement or
supply a format on the input statement.
• When specifying a length or format for a character variable,
make sure to precede the value with a dollar sign ($):
attrib string length = $ 12 format = $char12.;
• The maximum length of a SAS character variable is 32767.
• By default SAS removes leading blanks in character values. To
retain them use the $charw. informat.
• By default SAS pads character values with blanks at the end.
To remove them, use the trim function.
Variable Lengths: Numeric Values
• For numeric variables, SAS defaults to a length of 8 bytes
(double precision). For non-integers, you should probably not
change from the default.
• For integers, the following chart shows the largest value
which can be stored exactly in the available lengths:
length   Max. value        length   Max. value
3        8,192             6        137,438,953,472
4        2,097,152         7        35,184,372,088,832
5        536,870,912       8        9,007,199,254,740,992
• You can use the default= option of the length statement to set
a default for all numeric variables produced:
length default = 4;
• Even if a numeric variable is stored in a length less than 8, it
will be promoted to double precision for all calculations.
Initialization and Termination
Although the default behavior of the data step is to automatically
process each observation in an input file or existing SAS data set, it
is often useful to perform specific tasks at the very beginning or
end of a data step. The automatic SAS variable _n_ counts the
number of iterations of the data step. It is always available within
the data step, but never output to a data set. This variable will be
equal to 1 only on the first iteration of the data step, so it can be
used to signal the need for initializations.
To tell when the last observation is being processed in a data step,
the end= variable of either the infile or set statement can be
used. This variable is not output to a data set, but will be equal to
1 only when the last observation of the input file or data set is
being processed, and will equal 0 otherwise; thus any actions to be
done at the very end of processing can be performed when this
variable is equal to 1.
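As an illustrative sketch, the step below prints one message on the first iteration and a total on the last. It assumes the data set orig from the earlier set statement example contains a numeric variable x; the names eof and total are arbitrary.

```sas
data _null_;
  set orig end=eof;                  * eof is 1 only on the last observation;
  if _n_ = 1 then put 'Starting to read orig';
  total + x;                         * sum statement accumulates x;
  if eof then put 'Total of x: ' total;
run;
```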
Flow Control: if-then-else
The if-then statement (with optional else) is used to
conditionally execute SAS statements:
if x < 5 then group = "A";
It may be followed by a (separate) else statement:
if x < 5 then group = "A";
else group = "B";
To execute more than one statement (for either the then or the
else), use a do-end block:
if x < 5 then do;
group = "A";
use = 0;
end;
Flow Control: Subsetting if
Using an if statement without a corresponding then serves as a
filter; observations which do not meet the condition will not be
processed any further.
For example, the statement
if age < 60;
is equivalent to the statement
if age >= 60 then delete;
and will prevent observations where age is not less than 60 from
being output to the data set. This type of if statement is therefore
known as a subsetting if.
Note: You can not use an else statement with a subsetting if.
ifc and ifn functions
If your goal is to set a variable to a value based on some logical
expression, the ifc or ifn function may be more convenient than
using an if/else statement. For example, to set a tax rate based
on whether or not a state name is equal to california, the following
could be used:
rate = ifn(state = 'california',7.25,5.25);
ifn returns numeric values, while ifc returns character values.
result = ifc(score > 80,'pass','fail');
An optional fourth argument can be used to handle the case where
the first argument is missing.
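Because a comparison such as score > 80 always evaluates to 0 or 1, the fourth argument only comes into play when the first argument is itself a value that can be missing. A minimal sketch, with made-up values:

```sas
data _null_;
  do x = 3, 0, .;
    * nonzero -> second argument, zero -> third, missing -> fourth;
    kind = ifc(x, 'nonzero', 'zero', 'missing');
    put x= kind=;
  end;
run;
```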
Flow Control: goto statement
You can use the goto statement to have SAS process statements in
some other part of your program, by providing a label followed by a
colon before the statements you wish to jump to. Label names
follow the same rules as variable names, but have a different name
space. When a labeled statement is encountered in normal
processing, it is ignored.
Use goto statements with caution, since they can make program
logic difficult to follow.
data two;
set one;
if x ^= . then goto out;
x = (y + z) / 2;
out: if x > 20 then output;
run;
Flow Control: stop, abort, return
Although rarely necessary, it is sometimes useful to override SAS’
default behavior of processing an entire set of data statements for
each observation. Control within the current execution of the data
step can be achieved with the goto statement; these statements
provide more general control.
stop     immediately discontinue entire execution of the data step
abort    like stop, but set _error_ to 1
error    set _error_ to 1 and print a message to the SAS log
         (processing of the data step continues)
return   begin execution of next iteration of data step
For example, the following statement would flag the current
observation as an error and print a message to the log:
if age > 99 then error "Age is too large for subject number " subjno ;
Do-loops
Do-loops are one of the main tools of SAS programming. They
exist in several forms, always terminated by an end; statement:
• do; - groups blocks of statements together
• do over arrayname; - process array elements
• do var=start to end <by inc>; - range of numeric values
• do var=list-of-values;
• do while(expression); (expression evaluated before loop)
• do until(expression); (expression evaluated after loop)
The do until loop is guaranteed to be executed at least once.
Some of these forms can be combined, for example
do i= 1 to end while (sum < 100);
Iterative Do-loops: Example 1
Do-loops can be nested. The following example calculates how long
it would take for an investment with interest compounded monthly
to double:
data interest;
do rate = 4,4.5,5,7,9,20;
mrate = rate / 1200; * convert from percentage;
months = 0;
start = 1;
do while (start < 2);
start = start * (1 + mrate);
months + 1;
end;
years = months / 12;
output;
end;
keep rate years;
run;
Iterative Do-loops: Example 2
Suppose we have a record of the number of classes students take in
each year of college, stored in variables class1-class5. We want
to find out how long it takes students to take 10 classes:
data ten;
set classes;
array class class1-class5;
total = 0;
do i = 1 to dim(class) until(total >= 10);
total = total + class{i};
end;
year = i;
if total lt 10 then year = .;
drop i total;
run;
Getting out of Do-loops
There are two options for escaping a do-loop before its normal
termination:
You can use a goto statement to jump outside the loop:
count = 0;
do i=1 to 10;
if x{i} = . then count = count + 1;
if count > 5 then goto done;
end;
done: if count < 5 then output;
. . .
You can also force termination of a do-loop by modifying the value
of the index variable. Use with caution since it can create an
infinite loop.
do i=1 to 10;
if x{i} = . then count = count + 1;
if count > 5 then i=10;
end;
SAS Functions: Mathematical
Each function takes a single argument, and may return a missing
value (.) if the function is not defined for that argument.
Name     Function                  Name    Function
abs      absolute value            arcos   arccosine
digamma  digamma function          arsin   arcsine
erf      error function            atan    arctangent
exp      power of e (2.71828...)   cos     cosine
gamma    gamma function            cosh    hyperbolic cosine
lgamma   log of gamma              sin     sine
log      log (base e)              sinh    hyperbolic sine
log2     log (base 2)              tan     tangent
log10    log (base 10)             tanh    hyperbolic tangent
sign     returns sign or zero
sqrt     square root
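A quick sketch of the missing-value behaviour mentioned above; the variable names are arbitrary.

```sas
data _null_;
  a = sqrt(16);   * defined: 4;
  b = log(-1);    * not defined for a negative argument: missing;
  put a= b=;
run;
```

SAS is also expected to write a note to the log about the invalid argument to log.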
SAS Functions: Statistical Summaries
The statistical summary functions accept unlimited numbers of
arguments, and ignore missing values.
Name      Function                   Name      Function
css       corrected sum of squares   range     maximum - minimum
cv        coefficient of variation   skewness  skewness
kurtosis  kurtosis                   std       standard deviation
max       maximum                    stderr    standard error of the mean
mean      mean                       sum       sum
median    median                     uss       uncorrected sum of squares
min       minimum                    var       variance
pctl      percentiles
In addition, the function ordinal(n,...) gives the nth ordered
value from its list of arguments.
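For instance, with literal arguments chosen only for illustration:

```sas
data _null_;
  m  = mean(2, ., 4);        * missing value ignored: (2 + 4) / 2 = 3;
  s  = sum(2, ., 4);         * missing value ignored: 6;
  n2 = ordinal(2, 5, 1, 9);  * second smallest of 5, 1, 9 is 5;
  put m= s= n2=;
run;
```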
Using Statistical Summary Functions
You can use variable lists in all the statistical summary functions
by preceding the list with the word “of”; for example:
xm = mean(of x1-x10);
vmean = mean(of thisvar -- thatvar);
Without the of, the single dash is interpreted in its usual way, that
is as a minus sign or the unary minus operator; thus
xm = mean(of x1-x10);
is the same as
xm = mean(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10);
but
xm1 = mean(x1-x10);
calculates the mean of x1 minus x10, and
xm2 = mean(x1--x10);
calculates the mean of x1 plus x10.
Concatenating Character Strings
SAS provides the following functions for joining together character
strings:
cat - preserve all leading and trailing blanks
cats - remove leading and trailing blanks
catt - remove trailing blanks
catx - remove leading and trailing blanks, join with separator
(first argument)
Each function accepts an unlimited number of arguments. To join
together all the elements in a variable list, use the of keyword:
x1 = 'one';
x2 = 'two';
x3 = 'three';
all = catx(' ',of x1-x3); * or catx(' ',x1,x2,x3);
The variable all will have the value 'one two three'.
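To see how the four functions treat blanks, here is a small sketch with two made-up strings, one with a leading blank and one with a trailing blank:

```sas
data _null_;
  a = ' left';               /* one leading blank  */
  b = 'right ';              /* one trailing blank */
  r1 = cat(a, b);            /* keeps every blank                   */
  r2 = cats(a, b);           /* strips leading and trailing blanks  */
  r3 = catt(a, b);           /* strips trailing blanks only         */
  r4 = catx('-', a, b);      /* strips both, then inserts separator */
  /* brackets make the leading blanks visible in the log */
  put '[' r1 +(-1) ']' / '[' r2 +(-1) ']' / '[' r3 +(-1) ']' / '[' r4 +(-1) ']';
run;
```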
SAS Functions: Character Manipulation
compress(target,<chars-to-remove>)
expr = "one, two: three:";
new = compress(expr,",:"); * new => "one two three";
With no second argument compress removes blanks.
count(string,substring) - counts how many times substring
appears in string
index(source,string) - finds position of string in source
where = "university of california";
i = index(where,"cal"); * i => 15;
indexc(source,string) - finds position of any character in
string in source
where = "berkeley, ca";
i = indexc(where,"abc"); * i=1 (b is in position 1);
index and indexc return 0 if there is no match
SAS Functions: Character Manipulation (cont’d)
left(string) - returns a left-justified character variable
length(string) - returns number of characters in a string
length returns 1 if string is missing, 12 if string is uninitialized
repeat(string,n) - repeats a character value n times
reverse(string) - reverses the characters in a character variable
right(string) - returns a right-justified character variable
scan(string,n,<delims>) - returns the nth “word” in string
field = "smith, joe";
first = scan(field,2," ,"); * first will be 'joe';
Negative values of n count from right to left.
substr(string,position,<n>) - returns pieces of a variable
field = "smith, joe";
last = substr(field,1,index(field,",") - 1);
results in last equal to "smith".
SAS Functions: Character Manipulation (cont’d)
translate(string,to,from) - changes from chars to to chars
word = "eXceLLent";
new = translate(word,"xl","XL"); *new => "excellent";
transwrd(string,old,new) - changes old to new in string
trim(string) - returns string with trailing blanks removed
upcase(string) - converts lowercase to uppercase
verify(source,string) - return position of first char. in source
which is not in string
check = verify(val,"0123456789.");
results in check equal to 0 if val is a character string containing
only numbers and periods.
Regular Expressions in SAS
The prxmatch and prxchange functions allow the use of
Perl-compliant regular expressions in SAS programs. For example,
to find the location of the first digit followed by a blank in a
character string, the following code could be used:
str = '275 Main Street';
wh = prxmatch('/\d /',str); * wh will be equal to 3;
To reverse the order of two names separated by commas, the
following could be used:
str = 'Smith, John';
newstr = prxchange('s/(\w+), (\w+)/$2 $1/',-1,str);
The second argument is the number of changes to make; -1 means
to change all occurrences.
For more efficiency, regular expressions can be precompiled using
the prxparse function.
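A minimal sketch of precompiling, assuming the pattern is compiled once on the first iteration and its identifier kept in a retained variable (the name re is arbitrary):

```sas
data _null_;
  retain re;
  if _n_ = 1 then re = prxparse('/\d /');  * compile the pattern once;
  str = '275 Main Street';
  wh = prxmatch(re, str);                  * pass the identifier, not the text;
  put wh=;
run;
```

With a single iteration this gains nothing, but in a step that reads many observations the pattern is not recompiled for each one.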
SAS Functions for Random Number Generation
Each of the random number generators accepts a seed as its first
argument. If this value is greater than 0, the generator produces a
reproducible sequence of values; otherwise, it takes a seed from the
system clock and produces a sequence which can not be reproduced.
The two most common random number functions are
ranuni(seed) - uniform variates in the range (0, 1), and
rannor(seed) - normal variates with mean 0 and variance 1.
Other distributions include binomial (ranbin), Cauchy (rancau),
exponential (ranexp), gamma (rangam), Poisson (ranpoi), and
tabled probability functions (rantbl).
For more control over the output of these generators, see the
documentation for the corresponding call routines, for example call
ranuni.
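As a sketch of the call routine form: call ranuni takes a seed variable, which it updates on each call, and a variable to receive the random value. The names seed and u and the starting seed are illustrative.

```sas
data _null_;
  seed = 12345;              * the seed variable must be initialized first;
  do i = 1 to 3;
    call ranuni(seed, u);    * updates seed and stores a uniform variate in u;
    put seed= u=;
  end;
run;
```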
Generating Random Numbers
The following example, which uses no input data, creates a data set
containing simulated data. Note the use of ranuni and the int
function to produce a categorical variable (group) with
approximately equal numbers of observations in each category.
data sim;
do i=1 to 100;
group = int(5 * ranuni(12345)) + 1;
y = rannor(12345);
output;
end;
keep group y;
run;
Creating Multiple Data Sets
To create more than one data set in a single data step, list the
names of all the data sets you wish to create on the data statement.
When you have multiple data set names on the data statement,
observations will be automatically output to all the data sets unless
you explicitly state the name of the data set in an output
statement.
data young old;
set all;
if age < 25 then output young;
else output old;
run;
Note: If your goal is to perform identical analyses on subgroups of
the data, it is usually more efficient to use a by statement or a
where statement.
Subsetting Observations
Although the subsetting if is the simplest way to subset
observations, you can actively remove observations using a delete
statement, or include observations using an output statement.
• delete statement
if reason = 99 then delete;
if age > 60 and sex = "F" then delete;
No further processing is performed on the current observation
when a delete statement is encountered.
• output statement
if reason ^= 99 and age < 60 then output;
if x > y then output;
Subsequent statements are carried out (but not reflected in the
current observation). When a data step contains one or more
output statements, SAS’ usual automatic outputting at the end
of each data step iteration is disabled — only observations
which are explicitly output are included in the data set.
Random Access of Observations
In the usual case, SAS automatically processes each observation in
sequential order. If you know the position(s) of the observation(s)
you want in the data set, you can use the point= option of the set
statement to process only those observations.
The point= option of the set statement specifies the name of a
temporary variable whose value will determine which observation
will be read. When you use the point= option, SAS’ default
behavior of automatically looping through the data set is disabled,
and you must explicitly loop through the desired observations
yourself, and use the stop statement to terminate the data step.
The following example also makes use of the nobs= option of the
set statement, which creates a temporary variable containing the
number of observations contained in the data set.
Random Access of Observations: Example
The following program reads every third observation from the data
set big:
data sample;
do obsnum = 1 to total by 3;
set big point=obsnum nobs=total;
if _error_ then abort;
output;
end;
stop;
run;
Note that the set statement is inside the do-loop. If an attempt is
made to read an invalid observation, SAS will set the automatic
variable _error_ to 1. The stop statement ensures that SAS does
not go into an infinite loop.
Application: Random Sampling I
Sometimes it is desirable to analyze just a random subsample of
your data, i.e. one in which each observation is just as likely to
be included as any other
observation. If you want a random sample where you don’t control
the exact number of observations in your sample, you can use the
ranuni function in a very simple fashion. Suppose we want a
random sample consisting of roughly 10% of the observations in a
data set. The following program will randomly extract the sample:
data sample;
set giant;
if ranuni(12345) < .1;
run;
Application: Random Sampling II
Now suppose we wish to randomly extract exactly n observations
from a data set. To ensure randomness, we must adjust the fraction
of observations chosen depending on how many observations we
have already chosen. This can be done using the nobs= option of
the set statement. For example, to choose exactly 15 observations
from a data set all, the following code could be used:
data some;
retain k 15 n ;
drop k n;
set all nobs=nn;
if _n_ = 1 then n = nn;
if ranuni(0) < k / n then do;
output;
k = k - 1;
end;
if k = 0 then stop;
n = n - 1;
run;
Application: Random Sampling III
The point= option of the set statement can often be used to create
many random samples efficiently. The following program creates
1000 samples of size 10 from the data set big, using the variable
sample to identify the different samples in the output data set:
data samples;
do sample=1 to 1000;
do j=1 to 10;
r = ceil(ranuni(1) * nn); * ceil gives 1 to nn; round could give 0;
set big point=r nobs=nn;
output;
end;
end;
stop;
drop j;
run;
By Processing in Procedures
In procedures, the by statement of SAS allows you to perform
identical analyses for different groups in your data. Before using a
by statement, you must make sure that the data is sorted (or at
least grouped) by the variables in the by statement.
The form of the by statement is
by <descending> variable-1 ... <<descending> variable-n> <notsorted>;
By default, SAS expects the by variables to be sorted in ascending
order; the optional keyword descending specifies that they are in
descending order.
The optional keyword notsorted at the end of the by statement
informs SAS that the observations are grouped by the by variables,
but that they are not presented in a sorted order. Any time any of
the by variables change, SAS interprets it as a new by group.
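A typical pattern, sketched here with the sim data set from the random number example (variables group and y); the output data set name sorted is arbitrary:

```sas
proc sort data=sim out=sorted;
  by group;                 * sort first so the by groups are contiguous;
run;

proc means data=sorted;
  by group;                 * one set of summary statistics per group;
  var y;
run;
```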
Selective Processing in Procedures: where statement
When you wish to use only some subset of a data set in a
procedure, the where statement can be used to select only those
observations which meet some condition. There are several ways to
use the where statement.
As a procedure statement:
proc reg data=old;
where sex eq 'M';
model y=x;
run;
As a data set option:
proc reg data=old(where = (sex eq 'M'));
model y = x;
run;
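In the data step (the where statement applies only to observations
read from existing SAS data sets, not to raw data read with input;
responses is an assumed existing data set):
data survey;
set responses;
where q2 is not missing and q1 < 4;
data new;
set old(where = (group = 'control'));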
where statement: Operators
Along with all the usual SAS operators, the following are available
in the where statement:
between/and - specify a range of values
where salary between 20000 and 50000;
contains - select based on strings contained in character variables
where city contains 'bay';
is missing - select based on regular or special missing value
where x is missing and y is not missing;
like - select based on patterns in character variables
(Use % for any number of characters, _ for exactly one)
where name like 'S%';
sounds like (=*) - select based on soundex algorithm
where name =* 'smith';
You can use the word not with all of these operators to reverse the
sense of the comparison.
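For example, the following sketch prints observations whose name does not begin with S and whose salary falls outside the given range. It assumes a data set old containing the variables name and salary, which is hypothetical.

```sas
proc print data=old;
  where name not like 'S%'
    and salary not between 20000 and 50000;
run;
```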