In this blog, I will introduce you to some of the important concepts of SAS programming. Before we get started, it is important that you get familiar with SAS. My previous blog, SAS Tutorial, will help you understand SAS and its applications, and will help you install SAS University Edition, which we will be using here as the programming environment. Also, if you’ve been planning to step into Data Analytics, SAS Certification Training is one of the best ways to get started.
So without any further delay, let’s get started with SAS programming, shall we?
This blog will help you understand the following topics:
Large organisations and training institutes prefer using SAS Windows. SAS Windows has a lot of utilities that help reduce the time required to write code.
The following image shows the different parts of SAS Windows.
A few organisations use Linux; however, with no graphical user interface, you have to write code for every query. Hence, it is inconvenient to use.
SAS data sets are also called data files. Data files consist of rows and columns. Rows hold observations and columns hold variables.
SAS has two types of variables: numeric and character.
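A quick way to check the type of each variable in a data set is PROC CONTENTS. The sketch below assumes the sample table SASHELP.CLASS is available, as it normally is in a standard installation; it is used here only for illustration.

```sas
/* List every variable in the data set with its type (Num or Char) and length. */
/* SASHELP.CLASS is only a convenient sample table for this sketch.            */
proc contents data=sashelp.class;
run;
```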
SAS library is a collection of SAS files that are stored in the same folder or directory on your computer.
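As a minimal sketch, a library is assigned with a LIBNAME statement. The library name MYLIB, the data set name CLASS_COPY and the folder path are assumptions made for this example; replace the path with a folder that exists on your system.

```sas
/* Point the library reference MYLIB at a folder.                   */
/* The path below is a placeholder, not a path from this blog.      */
libname mylib '/folders/myfolders';

/* Copy a sample table into the library; it is saved in that folder. */
data mylib.class_copy;
  set sashelp.class;
run;
```

Any data set written as mylib.name is stored as a file in that folder and stays there after the SAS session ends.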
SAS programming is based on two building blocks: the DATA step and the PROC step.
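To see how the two kinds of steps fit together, here is a small sketch. It assumes the sample table SASHELP.CLASS; the data set name TALL_STUDENTS and the cut-off of 60 are made up for the example.

```sas
/* DATA step: build a new data set from an existing one. */
data work.tall_students;
  set sashelp.class;
  where height > 60;   /* keep only the rows that satisfy the condition */
run;

/* PROC step: run a ready-made procedure on that data set. */
proc print data=work.tall_students;
run;
```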
A SAS program should follow below mentioned rules:
Now that we have seen a few basic terminologies, let us get started with SAS programming with this basic code:
DATA Employee_Info;
input Emp_ID Emp_Name$ Emp_Vertical$;
datalines;
101 Mak SQL
102 Rama SAS
103 Priya Java
104 Karthik Excel
105 Mandeep SAS
;
Run;
In the above code, we created a data set called Employee_Info. It has three variables: one numeric variable, Emp_ID, and two character variables, Emp_Name and Emp_Vertical. The Run statement ends the DATA step and executes it, which creates the data set; in SAS University Edition you can then view the new table on the Output Data tab.
The image below shows the output of the above mentioned code.
Suppose you want to see the result in print view. You can do that by using the PROC PRINT procedure; the rest of the code remains the same.
DATA Employee_Info;
input Emp_ID Emp_Name$ Emp_Vertical$;
datalines;
101 Mak SQL
102 Rama SAS
103 Priya Java
104 Karthik Excel
105 Mandeep SAS
;
Run;
PROC PRINT DATA=Employee_Info;
Run;
The image below shows the output of the above code.
We just created a data set and understood how the PRINT procedure works. Now, let us take the above data set and use it for further programming. Let’s say we want to add each employee’s date of joining to the data set. So we create a variable called DOJ, read it as input and print the result.
DATA Employee_Info;
input Emp_ID Emp_Name$ Emp_Vertical$ DOJ;
datalines;
101 Mak SQL 18/08/2013
102 Rama SAS 25/06/2015
103 Priya Java 21/02/2010
104 Karthik Excel 19/05/2007
105 Mandeep SAS 11/09/2016
;
Run;
PROC PRINT DATA=Employee_Info;
Run;
The below image shows the output of the above code. It is visible that a variable was created, but the value of DOJ wasn’t printed. Instead, we see dots have replaced the date values.
Why did this happen? Well, the DOJ variable has no ‘$’ suffix, which means SAS reads it as a numeric variable by default. But the data we entered contains the special character ‘/’, so SAS cannot read the value as a number and stores a missing value instead, which is printed as a dot. If you check the log window, you will see a note such as ‘Invalid data for DOJ’.
Now how do we solve this problem? Well, one way to solve it is to read DOJ as a character variable by adding the ‘$’ suffix. The date values are then stored as text and can be printed. Let us make the changes to the code and see the output.
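DATA Employee_Info;
/* :$10. reads DOJ as character with room for all 10 characters; */
/* a bare $ would default to a length of 8 and cut off the year. */
input Emp_ID Emp_Name$ Emp_Vertical$ DOJ :$10.;
datalines;
101 Mak SQL 18/08/2013
102 Rama SAS 25/06/2015
103 Priya Java 21/02/2010
104 Karthik Excel 19/05/2007
105 Mandeep SAS 11/09/2016
;
Run;
PROC PRINT DATA=Employee_Info;
Run;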
The output screen will display the following output.
Now that DOJ is a character variable, the date values are displayed as they were entered. However, this is only a temporary solution. Let me explain why.
Well, imagine a bank has a similar data set. The data set has account holder details like loan amount, installments, and due date for loan installment. Imagine, the holder has missed his deadline to pay an installment and bank wants to calculate the delay. The bank will have to calculate the difference between the deadline date and the current date.
But, if the bank’s data set has dates in character format, then the bank won’t be able to perform mathematical operations on it. This issue may affect our data set too. So how do we solve this problem?
The next concept will help you overcome this issue.
It is important that you understand this topic well if you want to be good at SAS programming. If you can recall, I mentioned earlier that SAS has two standard variable types: numeric and character.
When SAS comes across non-standard data, such as dates, it will either report a problem in the log or you won’t get the desired output. To overcome this problem, SAS uses Informats and Formats.
Informats are typically used to read or input data from external files or flat files (like text files or sequential files). The informat instructs SAS on how to read data into SAS variables. SAS has three types of Informats: character, numeric, and date/time. Informats are named according to the following syntax structure:
$INFORMATw.d
The ‘$’ indicates a character informat. INFORMAT refers to the sometimes optional SAS informat name. The ‘w’ indicates the width (bytes or number of columns) of the variable. The ‘d’ is used for numeric data to specify the number of digits to the right of the decimal place. All informats must contain a period (.) so that SAS can differentiate an informat from a SAS variable.
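Here is a small sketch of the three informat types at work in a single INPUT statement. The data set name, variable names and values are made up for illustration; the colon modifier is used so that list input applies the informat while still treating blanks as separators.

```sas
data Informat_Demo;
  /* $12.      : character informat, up to 12 characters         */
  /* comma10.  : numeric informat that accepts embedded commas   */
  /* ddmmyy10. : date informat for values such as 18/08/2013     */
  input Item :$12. Amount :comma10. Paid_On :ddmmyy10.;
  datalines;
Installment 1,250 18/08/2013
Installment 12,500 25/06/2015
;
run;

proc print data=Informat_Demo;
run;
```

Amount is stored as a plain number without the comma, and Paid_On is stored as a SAS date value, so it prints as a number unless a format is attached.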
Let us go back to our previous code and see if Date/ Time Informat helps us. So let’s change the code accordingly and add a Date Informat to it as follows:
DATA Employee_Info;
input Emp_ID Emp_Name$ Emp_Vertical$ DOJ;
INFORMAT DOJ ddmmyy10.;
datalines;
101 Mak SQL 18/08/2013
102 Rama SAS 25/06/2015
103 Priya Java 21/02/2010
104 Karthik Excel 19/05/2007
105 Mandeep SAS 11/09/2016
;
Run;
PROC PRINT DATA=Employee_Info;
Run;
Line number 3 in the code instructs SAS to read in the variable ‘date of joining’ (DOJ) using the date informat DDMMYYw. Because each date field occupies 10 characters, the ‘w.’ qualifier is set to 10.
The output of the code would look as follows.
The result shows we still don’t have the desired result, instead the DOJ column is holding some numeric values and not the dates we specified. Now, why is that? Well, once a date is read with a date informat, SAS stores the date as a number. That means, it is read as the number of days between the date and January 1, 1960 (For example: 3/15/1994 is stored as 12492).
The reason behind this is that SAS has three separate counters which keep track of dates and time. These date counters started at zero on January 1, 1960. Hence dates before 1/1/1960 have negative values, and any date after has a positive value. Every day at midnight, the date counter is incremented by one.
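Because a date is stored as a count of days, date arithmetic is plain subtraction. The sketch below checks the value quoted above and then works out a delay like the one in the bank example; the due date and payment date are made-up values, and the results are written to the log.

```sas
data _null_;
  d = '15MAR1994'd;      /* date literal for 15 March 1994 */
  put d=;                /* log: d=12492 */

  due   = '01JAN2020'd;  /* hypothetical installment due date */
  paid  = '15JAN2020'd;  /* hypothetical payment date */
  delay = paid - due;    /* difference in days */
  put delay=;            /* log: delay=14 */
run;
```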
One story has it that the founders of SAS wanted to use the approximate birth date of the IBM 370 system, and they chose January 1, 1960 as an easy to remember approximation.
Now that you know the reason why the column DOJ displayed those numbers, let us try to solve this problem. To overcome this problem we use Format.
Informats are the instructions for reading data, whereas formats are the instructions used to display or output data. Defining a format for a variable is how you tell SAS to display the values in the variable. Formats are grouped into the same three classes as informats (character, numeric, and date-time) and also always contain a dot.
The general form of a format statement is:
FORMAT variable-name format-name.;
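As a small sketch of the idea, the same stored date value can be written out in several ways just by changing the format. The variable name and date are taken from the Employee_Info example; the values are written to the log.

```sas
data _null_;
  doj = '18AUG2013'd;
  put doj=;              /* no format: the raw number of days since 01JAN1960 */
  put doj= ddmmyy10.;    /* doj=18/08/2013 */
  put doj= date9.;       /* doj=18AUG2013 */
  put doj= worddate.;    /* month name, day and year written out */
run;
```

The stored value never changes; only the way it is displayed does.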
Let us go back to our code with the data set Employee_Info to see if we can display the date correctly using the FORMAT statement.
DATA Employee_Info;
input Emp_ID Emp_Name$ Emp_Vertical$ DOJ;
INFORMAT DOJ ddmmyy10.;
FORMAT DOJ ddmmyy10.;
datalines;
101 Mak SQL 18/08/2013
102 Rama SAS 25/06/2015
103 Priya Java 21/02/2010
104 Karthik Excel 19/05/2007
105 Mandeep SAS 11/09/2016
;
Run;
PROC PRINT DATA=Employee_Info;
Run;
We have used the FORMAT statement in line number 4 of the above code. The following output screen shows the desired output.
We have successfully displayed the data set using Date format command. I hope you have understood how to use format and informat. Let us move ahead with our SAS programming blog and take a look at another important concept.
While doing SAS programming, we may encounter situations where we need to execute a block of code several times. It is inconvenient to write the same set of statements again and again. This is where loops come into the picture. In SAS, the Do statement is used to implement loops. It is also known as the Do Loop. The image below shows the general form of the Do loop statements in SAS.
Following are the types of DO loops in SAS: Do Index, Do While and Do Until.
The Do Index loop uses an index variable that runs from a start value to a stop value. The SAS statements get executed repeatedly until the index variable passes its final value.
Do indexvariable = initialvalue to finalvalue;
SAS statements;
End;
Let us take a look at sample code to understand Do Index Loop. In the below code, VAR is the index variable.
DATA SampleLoop;
SUM=0;
Do VAR = 1 to 10;
SUM = SUM + VAR;
END;
Run;
PROC PRINT DATA = SampleLoop;
Run;
When you execute the above code, you will get the following output.
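To watch the loop work pass by pass, an OUTPUT statement can be placed inside it. This is a minimal sketch; the data set name Running_Total and the step size of 3 are choices made for the example.

```sas
data Running_Total;
  SUM = 0;
  do VAR = 1 to 10 by 3;   /* VAR takes the values 1, 4, 7 and 10 */
    SUM = SUM + VAR;
    output;                /* write one observation on every pass */
  end;
run;

proc print data=Running_Total;
run;
```

The result has four observations, with SUM equal to 1, 5, 12 and 22. Without the OUTPUT statement, only one observation is written when the DATA step ends.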
The Do While loop uses a WHILE condition. The condition is checked before each pass: the loop executes the block of code as long as the condition is true and terminates once the condition becomes false.
Do While (condition);
SAS statements;
End;
Following sample code will help you understand DO WHILE loop.
DATA SampleLoop;
SUM=0;
VAR=1;
Do While(VAR<15);
SUM = SUM + VAR;
VAR+1;
END;
Run;
PROC PRINT DATA = SampleLoop;
Run;
The above code will give you following output.
The Do Until loop uses an UNTIL condition. The loop keeps executing the block of code while the condition is false and terminates once the condition becomes true. Because the condition is checked at the end of each pass, the block always runs at least once.
Do Until (condition);
SAS statements;
END;
Let us take a look at sample program.
DATA SampleLoop;
SUM=0;
VAR=1;
Do Until(VAR>15);
SUM=SUM+VAR;
VAR+1;
END;
Run;
PROC PRINT DATA = SampleLoop;
Run;
The code has the following output.
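The practical difference between the two conditional loops is when the condition is tested. The sketch below starts both loops at a value for which the stopping condition already holds; the data set and variable names are made up for the example.

```sas
data Loop_Compare;
  N_While = 0;
  N_Until = 0;

  X = 20;
  do while (X < 15);   /* tested at the top: false at once, so the body is skipped */
    N_While + 1;
    X + 1;
  end;

  X = 20;
  do until (X > 15);   /* tested at the bottom: the body runs once before the test */
    N_Until + 1;
    X + 1;
  end;
run;

proc print data=Loop_Compare;
run;
```

N_While stays at 0 while N_Until ends at 1, which shows that a Do Until loop always runs its body at least once.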
Now let us take a look at some statistical procedures. These procedures will form a base for advanced analytical procedures.
PROC MEANS is used to calculate the arithmetic mean and standard deviation. People who are new to statistics may find these terms difficult to understand, so before we start coding and use this procedure, I will try to explain what they mean.
Let’s start with arithmetic mean and see how PROC MEANS is used in SAS programming to calculate it.
The sum of the values of a numeric variable, divided by the number of values, gives you the arithmetic mean. It is also known as the mean and is a measure of central tendency. A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.
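As a quick worked example, the sketch below computes the mean of three made-up values by hand and compares it with the built-in MEAN function. The results are written to the log.

```sas
data _null_;
  total   = sum(10, 20, 60);   /* sum of the values */
  count   = n(10, 20, 60);     /* number of non-missing values */
  average = total / count;     /* arithmetic mean computed by hand */
  check   = mean(10, 20, 60);  /* the same result from the MEAN function */
  put total= count= average= check=;   /* total=90 count=3 average=30 check=30 */
run;
```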
In SAS programming, you use PROC MEANS to calculate the arithmetic mean. This procedure lets you find mean of all variables or few variables of a data set. You can also form groups and calculate mean of variables specific to that group.
PROC MEANS DATA = DATASET;
Class Variables;
Var Variables;
Mean Of A Dataset
If you supply only the data set name without any variables, you can calculate the mean of all the numeric variables in a data set.
Let us take a look at a sample code. I have considered a predefined SAS data set called ‘cars’. The following code will display the data set.
PROC PRINT data=sashelp.CARS; Run;
The image below shows the output of above code.
Now let us use this data set and calculate the mean of each numeric variable in the data set ‘cars’.
PROC MEANS DATA = sashelp.CARS Mean SUM MAXDEC=2; Run;
The image below shows the mean and sum of all the numeric variables in the data set, up to two decimal places.
Mean Of Selected Variables
By supplying the names in the VAR statement, you can get the mean of the specified variables. Please refer to the code below.
PROC MEANS DATA = sashelp.CARS mean SUM MAXDEC=2; var horsepower cylinders; Run;
Mean By Class
You can find the mean of the numeric variables for separate groups by naming the grouping variables in a CLASS statement. Consider the following sample code. Let’s find out the mean of horsepower for different groups categorized by the classes ‘make’ and ‘type’ of different cars.
PROC MEANS DATA = sashelp.CARS MEAN SUM MAXDEC=2;
class make type;
var horsepower;
Run;
The image below shows the output of the above code.
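If you want to keep the group means for later steps instead of only viewing them, PROC MEANS can write them to a data set. This is a sketch under a few assumptions: the output data set name HP_BY_GROUP and the variable name AVG_HP are made up for the example.

```sas
/* NOPRINT suppresses the printed report; NWAY keeps only the rows */
/* for the full make-by-type combinations.                         */
proc means data=sashelp.cars noprint nway;
  class make type;
  var horsepower;
  output out=work.hp_by_group mean=avg_hp;
run;

/* Show the first ten groups of the saved result. */
proc print data=work.hp_by_group(obs=10);
run;
```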
Let us continue with our SAS Programming blog and take a look at another important statistical concept.
Standard deviation (SD) is a measure of how varied the data in a given data set is. Mathematically, it tells you how close the data points are, on average, to the mean value of the data set. A standard deviation close to 0 indicates that the data points are very close to the mean of the data set, and a high standard deviation indicates that the data points are spread out over a wide range of values.
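To make the idea concrete, the sketch below compares two made-up sets of values that share the same mean of 50. It uses the STD function, which computes the sample standard deviation (dividing by the number of values minus one); the results are written to the log.

```sas
data _null_;
  sd_tight  = std(49, 50, 51);   /* values close to the mean */
  sd_spread = std(10, 50, 90);   /* same mean, values far apart */
  put sd_tight= sd_spread=;      /* sd_tight=1 sd_spread=40 */
run;
```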
In SAS, you can calculate the value of Standard Deviation using two procedures. They are PROC MEANS and PROC SURVEYMEANS.
Standard Deviation Using PROC MEANS
You can measure the Standard Deviation using PROC MEANS by choosing the STD option in the PROC step. It will display the Standard Deviation values for each numeric variable in the data set.
PROC MEANS DATA = dataset STD;
Consider this sample code. Let us create another data set, CARS1, from the CARS data set in the SASHELP library using PROC SQL. We keep a few columns for the makes ‘Audi’ and ‘BMW’ and then calculate the standard deviation of the numeric variables using the STD option in the PROC MEANS step.
PROC SQL;
create table CARS1 as
SELECT make,type,horsepower,cylinders,weight
FROM SASHELP.CARS
WHERE make in ('Audi','BMW');
QUIT;
PROC MEANS DATA=CARS1 STD;
Run;
The above code will give Standard deviation for selected variables. Following image displays the output.
PROC SURVEYMEANS can also report a standard deviation, along with some advanced features like statistics for categorical variables and the variance. Note that its STD keyword reports the standard deviation of the estimated total, whereas PROC MEANS reports the sample standard deviation.
PROC SURVEYMEANS options statistic-keywords;
By variables;
Class variables;
Var variables;
Following is the description of the parameters used:
Let us take a look at this sample code, which describes the use of the CLASS statement; it creates the statistics for each of the values in the class variable.
PROC SURVEYMEANS DATA=CARS1 STD;
Class type;
Var type horsepower;
ods output statistics=rectangle;
Run;
PROC PRINT DATA=rectangle;
Run;
The images below show the output of the code above. It shows the distribution of data for the variable ‘Horsepower’ with a 95% confidence interval. (A confidence interval is a range of values so defined that there is a specified probability that the value of a parameter lies within it.)
So, that brings us to the end of this SAS programming blog. If you have any doubts or issues with the content of the blog, please leave them in the comments section; I will respond at the earliest.
If you wish to learn SAS and build a career in the analytics domain, then check out our SAS Training & Certification which comes with instructor-led live training and real-life project experience. This training will help you understand SAS in depth and help you master various concepts of SAS programming language.