831 likes | 1.04k Views
Chapter 5. Creating SAS Data Sets from Raw Files and Excel Work-sheets. Overview. Raw Data in External Files. Excel and Other Types of Data. Data Entry . DATA STEP. Examine the Raw Data File and File Layout.
E N D
Chapter 5 Creating SAS Data Sets from Raw Files and Excel Work-sheets
Overview Raw Data in External Files Excel and Other Types of Data Data Entry DATA STEP
Examine the Raw Data File and File Layout Each field represents a variable. To read the data set, one needs to define a variable name for each variable: Partial listing of Sales.dat raw data file: >----+----10---+----20---+----30- SMITH JAN 140097.98 440356.70DAVIS JAN 385873.00 178234.00JOHNSON JAN 98654.32 339485.00SMITH FEB 225983.09 12250.00DAVIS FEB 88456.23 55564.00
Steps to Create a SAS Data Set from a Raw Data File This is accomplished in the DATA Step, which requires program statements for conducting the tasks: • Provide a physical location for the new SAS data set to be store. • Identify the location, name of the external file • Define a name for the new SAS data set • Provide a reference to identify the external file • Define and describe the variables and data values to be read • Conduct any additional data manipulations for processing the data
A Summary of SAS Statements for accomplishing the required tasks
The DATA Step to read external data: Libname libref ‘__________________’; filename fileref ‘ ‘;data _______________ ; infile fileref; input _________ ; . . . run; NOTE: If you copy a program from PPT or from Word File to SAS system, you MUST retype the quotation marks in SAS system. They are defined differently. Or: data _______________ ; infile ‘ ‘ ; input _________ ; . . . run;
Example Our objective is to read the raw data: salesdata.dat stored in folder of the C-drive with the path: C:\math707\RawData\RawData_dat And create a SAS data set Sales_sasdata, then store this SAS data in the Sales folder, a folder needed to be created prior to writing your program, in C-drive with the path: C:\math707\Sales Two SAS program statements are required in your SAS program before reading the file: • A statement to reference the folder to the SAS data set. • A statement to reference the external data set.
Reference SAS Data Library LIBNAME saleslib ‘C:\math707\Sales’; This statement defines a SAS data library saleslib referring to the folder Sales, which will be used to store the new SAS data set to be created.
Reference the External Raw Data File FILENAMEsal_dat ‘C:\math707\RawData\RawData_dat\ salesdata.dat’ ; NOTE: we define the external raw data file reference name is sal_dat, and data set is located in the HD described in the path above. NOTE: The rules of external file reference name are the same as Library reference name. • 1-8 characters, starting with alphabet or underscore, contains only letters, numbers or underscores.
More on FILENAME statement • It is a global statement. • It can reference to ONE external raw data file or a folder of external data files. NOTE: LIBNAME references to a folder of SAS data set, not to ONE SAS data set. Syntax to reference to ONE external data file: FILENAME fileref ‘path-to-the_external_datafile_Name’; NOTE: The fileref will be use in the INFILE statement later to inform SAS to locate the exact external raw data set. Ex: FILENAME sal_dat ‘C:\math707\RawData\RawData_dat\ salesdata.dat’ ;
FILENAME for Referencing a GROUP of External Data files Syntax to reference to a GROUP of external raw data files: FILENAME fileref ‘path-to the_external_datafile_Folder’; Ex: FILENAME EXT_DAT ‘C:\math707\RawData\RawData_dat’ ;
Define the SAS data set name in DATA Step to read the external data set DATA sas_data_setName; A SAS Data Step is to read data files into SAS system for further processing and creating a new SAS data set. Once the external raw data is read and processed, it requires a new SAS data set name. This is defined at the DATA statement. For the example of reading salesdata.dat, we can call the new SAS data set: Sales_sasdata. Ex: DATASALESLIB.Sales_Sasdata; This creates a SAS data set sales_sasdata, which is stored in the SAS library SALESLIB.
Identify the External Data Set to be INPUT into SAS system In order to read the external raw data set, SAS will need two statements to accomplish this: • One is to inform SAS system where to find the External raw data set. The statement is INFILE statement. • One is to read variables in each record correctly. The Statement is INPUT statement.
INFILE Statement General syntax: INFILE file-specification <options>; The file-specification depends on how the FILENAME statement defines the fileref. • If the fileref references to exactly ONE external raw data set , then, file-specification is the fileref. Ex: FILENAME sal_dat‘C:\math707\RawData\RawData_dat\salesdata.dat ‘; INFILE sal_dat;
IINFLE statement Continued If fileref references to a folder of external raw data sets (an aggregated group of raw data sets), then, file-specification needs to be specifically pointing to the exact data set using: INFILE fileref(data-set-name.file_extension)
Example for INFILE statement When FILEREF references to an aggregated group of Raw data sets Ex: FILENAME EXT_DAT ‘C:\math707\RawData\RawData_dat’ ; INFILE ext_dat(salesdata.dat); The fileref is EXT_DAT, which references to the entire folder of external raw data sets. The raw data set in the folder to be input is salesdata.dat
Describe and Read the Raw External Data: Fixed Column INPUT usingThe INPUT Statement Now, we have informed SAS where to get the raw data and where to store the new SAS data. The next is to describe the variables and read the data values of the variables from the raw data set. SAS uses the INPUT statement to accomplish. SAS needs to know exactly the formats of variables in the data set. Different INPUT statements are needed to handle different types of formats in the data set. In this chapter, we will focus on the variables with STANDARD and FIXED format.
Determine Variable Type:Numeric Vs. Character Data Types Based on examining the raw data file or the file layout, every SAS variable can be one of two types: • character • numeric.
Character data Type A variable is considered to be character if the data values of the variable contains any combination of the following: • letters (A - Z, a - z) • numbers (0-9) • special characters (!, @, #, %, and so on). NOTE: characters are case-sensitive. ‘Tom’ is different from ‘tom’ or ‘TOM’. NOTE: Character data is displayed left-adjusted. Examples: Mr. John Doe 126 Apt. A $34,540 583
Numeric Data Type A variable is considered to be numericif it contains • numbers (0-9). It may also contain • a decimal point (.) • a minus sign (-) • a letter E to indicate scientific notation. NOTE: Numeric data is displayed right-adjusted Examples: 25.6 543 -5.7 4.12E5 [This is 4.12 x 105]
Standard Vs. Nonstandard Numeric Data Standard numeric data can contain only • Numbers • Decimal places • Numbers in scientific or E-notation (ex, 4.2E3) • Plus or minus signs Nonstandard numeric data includes • Values contain special characters, such as %, $, comma (,), etc. • Date and time values • Data in fractions, integer binary, real binary,, hexadecimal forms, etc.
Determine if each of the following numeric data standard or nonstandard data 345.12 Standard $345.12 Nonstandard 3,456.12 Nonstandard 20DEC2010 Nonstandard date 12/20/2010 Nonstandard date
Fixed Format Vs. Free format Fixed format means a variable occupies in a fixed range of columns from observation to observation. Free format means the data values are not in a fixed range of columns. Ex:Fixed format Free format 12345678901234567890 12345678901234567890 -------------------- -------------------- HIGH 340 12.5 F HIGH 340 12.5 F LOW 5630 7.5 F LOW 5630 7.5 F MEDIAN 674 26.73 M MEDIAN 674 26.73 M
Column INPUT SAS can read a variety of different and complicate standard and nonstandard data values. This chapter focuses on reading raw data set with FIXED columns and in STANDARD format. The Column INPUT statement describes the columns in each observation of the raw data set to SAS. Each variable defined in the INPUT statement • provides a name to represent each variable in the data set • indicates a type of character or numeric • indicates the starting column and ending column.
The Column INPUT Statement General form of the Column INPUT statement: variable is a valid SAS variable name. $ indicates a character variable. start identifies the starting position. end identifies the ending position. INPUTvariable $ start - end . . . ;
The Column INPUT StatementThere are various ways to read data in the INPUT statement. The following is ‘column Input’. For the Salesdata Example: input last_name $ 1 - 7 month $ 9 - 11 residential 13 -21 commercial 23 – 31 ;
The INPUT Statement This way of describing the input raw data record to SAS is called column input because it defines the starting and ending positions of each field. This implies that each field in a raw data record is in the same position in every record of the file.
Data Step to read external raw data WITH FILENAME Statement When reading a raw data set, one must inform SAS system where to read the raw data. The approach we discussed is to use the following statements (WITH FILENAME statement): FILENAME sal_dat ‘C:\math707\RawData\RawData_dat\salesdata.dat’ ; DATA saleslib.sales_sasdata; INFILE sal_dat; INPUT last_name $ 1-7 month $ 9-11 residential 13 -21 commercial 23 – 31; NOTE: You can refer to the same raw data set in other DATA steps by using the fileref.
Data Step to read external raw data WITHOUT FILENAME Statement To inform SAS where the raw data set is located, we can ignore the FILEMANE statement and combine the path into the INFLE statement: DATA saleslib.Sales_sasdata; INFILE ‘C:\math707\RawData\RawData_dat\salesdata.dat’ ; INPUT last_name $ 1-7 month $ 9-11 residential 13 -21 commercial 23 – 31; NOTE: No Fileref is defined in the above statements to read the salesdata.dat.
The DATA Step General form for the complete DATA step without FILEMANE statement: DATA SAS_data_set_name; INFILE ‘path-to-input-raw-data-file’; INPUT variable $ start - end . . .; RUN; General form for the complete DATA step with Filename statement: DATA SAS_data_set_name; FILENAMEFileref ‘path-to-input-raw-data-file’; INFILE Fileref ; INPUT variable $ start - end . . .; RUN;
The order of the variables in the INPUT statement Using INPUT to read fixed column data set, it does not need to be in sequential order of variables in the raw data set. For example: INPUT residential 13-21 commercial 23-31 Last_name $ 1-7 Month $ 9-11; will read the variables residential from col. 13 to 21, commercial from 23-31, then, move line pointer back to column #1 to read 1-7 for Last_Name and then 9-11 for month. The output SAS data set will have the variables in the order of Residential, commercial, Last_name, Month
Use the option ‘OBS= ‘ in INFILE statement When the data set is very large, it is not a good idea to run the draft program using the entire data set. It is important to make sure there is no syntax error and reduce potential data error before processing the entire data set. There are two ways to do this: • Use SYSTTEM OPTIONS introduced in previous chapter: OPTIONS firstobs = n1, obs=n2; (2) Use OBS= as an option inside INFILE statement: INFILE fileref obs = n;
Use _NULL_ as the SAS Data Set Name in Data Step Similar to INFLIE fileref OBS = n; for preventing processing the entire data set until the program is correct, one does not need to create any SAS data set (including temporary SAS data set). This can be done by using: DATA _NULL_ ; /* data set name _NULL_ means ‘Do not create any SAS data set in this data step’ */ FILENAME fileref ‘ ‘ ; INFILE fileref OBS = n ; INPUT ; RUN;
Assignment statements in DATA Step Assignment statement is to modify, transform, redefine existing variables or create new variables. The general syntax is Variable = expression ; /*For the Sales data example, the following assignment statement computes the total sales in the Salesdata: */ Data work.sales; INFILE ‘ ‘; INPUT ; Totalsale = residential + commercial; /*The following assign statements compute the average sales per month:*/ AvgSale1 = (residential+commercial)/Month; AvgSale2 = totalsale/Month; /* AvgSale2 statement must appear after the Totalsale statement */ RUN;
Exercise The following data admitfix.dat is posted on the class website.
Write a SAS program to perform the following tasks: • Read the data set admitfix.dat using column format, and create the SAS data set admit_sasdata in the Work library • Compute BMI using the formula: • Use PROC CONTENTS to see the variable attributes • Use PROC PRINT to print the admit_sasdata, and use the date9. to display the date variable. Save the program as c5_colInp to your SASEx folder
Solution filename adm_fix 'C:\math707\RawData\RawData_dat\admitfix.dat'; data admit_sasdata; infile adm_fix; input ID 1-4 name $ 5-20 sex $ 22 age 28-29 date 32-36 height 42-43 weight 49-51 acelevel $ 52-55 fee 59-63; bmi=weight*703/(height**2); proccontents; run; procprint; format date date9.; run;
Date Constants As discussed in Chapter 4, SAS treats date as numeric value. SAS defines the date 01/01/1960 as the date value 0, and sequentially adding the # of dates for the later date, subtracting # of dates for the earlier date. Here are some examples: SAS also provides various formats to display date(as discussed in Chapter 4: DATE9. and MMDDYY10. are two common date display formats). Besides how to handle dates, SAS also provides several formats to represent a date Constant: ‘ddmmmyy’d , ‘ddmmmyyyy’d or with double quotation marks.
Date Constant (continued) Here are some examples: NOTE: TODAY() is a SAS function, which provides today’s date. One can request today’s date by an assignment statement: Today_date = today(); The result will be a numeric date value for today’s date counting from 01/01/1960. To properly print (display) the date, refer to Chapter 4: using Date Format such as DATE9. , MMDDYY10.
Time constant, Date-Time constant in SAS In addition to Date constant, SAS also provides TIME constant for any given date: ‘hour:minutes’t for up to minute. ‘hour:minutes:second’t for up to second. Example: Duetime = ’23:59’t ; TIME constant for a SPECIFIC date: ‘ddmmmyyyy:hour:minute:second’dt Example: DueDate = ’09sep2009:23:59:59’dt
Exercise Write a program to practice the following: • Find out today’s date using TODAY() SAS function. • Define the July 4th, 2011 as a date constant. • Define the begin time and end time for Math 707 using time constant. Begin time is 17:00:00 , end time is 19:45:00 • Define the first second of the year of 2011 using datetime constant. • Print the date constant using DATE9. , print time constant using TIME10. , print the datetime constant using DATETIME25. NOTE: DATE9. display date, TIMEw. displays time, and DATETIMEw. display datetime. W is the width needed to display the values. It should be large enough as needed.
Solution data datetime; today_D=today(); d_july4='04jul2011'd; bg_time_S575 = '17:00:00't; en_time_s575 = '19:59:59't; dt_jan_2011 = '01jan2011:00:00:01'dt; procprint; format today_d d_july4 date9. bg_time_s575 en_time_s575 time10. dt_jan_2011 datetime25.; run;
Subsetting data cases using conditional IF statement In Chapter 4, we use WHERE statement in PROC step, such as PROC PRINT; to select cases. In this chapter, we introduce how to use the conditional IF statement for case selection in the DATA Step. In the later chapters, we will discuss further the difference between WHERE and IF. In DATA Step, we can use the statement: IF expression; To select cases that only satisfy the IF expression statement. NOTE: For cases which do not satisfy the IF condition will not be kept in the output SAS data set.
Conditional IF to select observations General Syntax: IF condition; NOTE: When the condition is true, the observation is selected, otherwise, not selected.. For example: IF sex = ‘M’; Will only select subjects whose sex is ‘M’. NOTE: if the data value is ‘m’, it is not selected, since data value is case sensitive.
Example: Using IF for selecting only Month in Jan, Feb, March for the Salesdata Data work.sale; INFLIE ‘C:\math707\RawData\RawData_dat \Salesdata.dat’; input last_name $ 1 – 7 month $ 9 - 11 residential $ 13 -21 commercial $ 23–31; If Month in (‘JAN’, ‘FEB’, ‘MAR’); run; Proc print; Run;
Can we read data from within the SAS program? Answer is YES. Here is the general steps: DATA …. ; INPUT ………….; ………. DATALINES; /*CARDS; also works */ Actual data values that is entered based on the format stated in the INPUT statement. ……… ; RUN;