SAS System
Important Note:
• Please submit your SAS code in one file. Use your name as the filename of the SAS file. Use /* */
comments to separate each question. See format below.
• Please work alone on the final assignment. Do not discuss the exam contents with your classmates
or anyone else. Evidence revealing identical solutions will be considered cheating and students will
receive an F for the term grade.
• All the exam contents are related to the lecture notes. There is more than one way to solve
each problem. However, you must use what you’ve learned from this class to solve the
problems; otherwise, you will receive 0 credit.
• I will not provide any hints for this final exam. If you are unclear about the exam problems, please
email me directly.
/*
Name: Your name
*/
/*
Question 1
*/
…
Your SAS code
…
/*
Question 2
*/
…
Your SAS code
…
…
…
…
/*
Question 5
*/
…
Your SAS code
…
Final Assignment
Problem 1 (7 points)
You will use two data sets: geocode.sas7bdat and households.sas7bdat. These data sets were originally
downloaded from the US Census Bureau. The description of these two data sets is listed below:
geocode.sas7bdat:
VARIABLE   TYPE  DESCRIPTION                         EXAMPLE
GEOID      CHAR  9-digit Geography Code              04000US34
STATE      CHAR  State Name                          New Jersey
households.sas7bdat:
VARIABLE   TYPE  DESCRIPTION                         EXAMPLE
GEOID      NUM   1- or 2-digit Geography Code        34
TOTHOUSE   NUM   Total Households                    3064645
UNMARRIED  NUM   Total Unmarried-Partner Households  151318
The data set geocode.sas7bdat contains 51 observations, and the data set households.sas7bdat contains
52 observations. For this problem, create a single data set that contains the variables STATE,
GEOID (in 1- or 2-digit form), TOTHOUSE, and UNMARRIED, and that contains only the observations that
appear in both data sets. Notice that the last two digits of the 9-digit geography code are the same as the
2-digit geography code. When you combine the two data sets, be careful about the variable type: GEOID is
character in one data set and numeric in the other. The first five observations of your final data set
should look similar to the ones below:
The SAS System
Obs  state       id  tothouse  unmarried
  1  Alabama      1   1737080      58537
  2  Alaska       2    221600      16568
  3  Arizona      4   1901327     118196
  4  Arkansas     5   1042696      40543
  5  California   6  11502870     683516
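The type mismatch described above can be handled in more than one way. The following is a minimal sketch of one possibility, not a model answer. It assumes that copies of both data sets are available in the WORK library, and the names geo2, house2, and combined are hypothetical.

```sas
/* Sketch only: assumes WORK.GEOCODE and WORK.HOUSEHOLDS exist. */

/* Read the last two characters of the 9-character code as a number. */
data geo2;
    set geocode;
    id = input(substr(geoid, 8, 2), 2.);
    drop geoid;
run;

proc sort data=geo2;
    by id;
run;

/* Give the numeric code the same name so it can serve as the BY variable. */
data house2;
    set households (rename=(geoid=id));
run;

proc sort data=house2;
    by id;
run;

/* Keep only the observations that are present in both data sets. */
data combined;
    merge geo2 (in=in_geo) house2 (in=in_house);
    by id;
    if in_geo and in_house;
run;
```

Converting the character code to numeric, rather than the reverse, keeps the merged key in the 1- or 2-digit form shown in the expected output.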
Problem 2 (8 points)
You will use base.sas7bdat for this problem. The complete data set is shown below:
Obs  ID  SBP  visit_time  trtmt_time
  1   1  140  02/13/2013  05/15/2013
  2   1  130  09/30/2013  05/15/2013
  3   1  132  07/13/2013  05/15/2013
  4   1  138  05/15/2013  05/15/2013
  5   2  122  04/05/2013  06/05/2013
  6   2  128  06/05/2013  06/05/2013
  7   2  130  07/09/2013  06/05/2013
  8   2  125  04/30/2013  06/05/2013
In the BASE data set, the variable VISIT_TIME is the visit time. Keep in mind that the visits are not
in chronological order in the data set. TRTMT_TIME is the treatment time, which is also the baseline
measurement time. SBP is the systolic blood pressure measured at each visit. Based on this data set,
create the following two variables:
• B_SBP: the SBP value measured at the treatment time. For visits before the treatment time,
B_SBP is set to missing.
• C_SBP: the difference between the current SBP measurement and the baseline SBP measurement.
For visits before or on the treatment date, C_SBP is set to missing.
The final data set should look similar to the one below:
Obs  ID  SBP  visit_time  trtmt_time  b_sbp  c_sbp
  1   1  140  02/13/2013  05/15/2013      .      .
  2   1  138  05/15/2013  05/15/2013    138      .
  3   1  132  07/13/2013  05/15/2013    138     -6
  4   1  130  09/30/2013  05/15/2013    138     -8
  5   2  122  04/05/2013  06/05/2013      .      .
  6   2  125  04/30/2013  06/05/2013      .      .
  7   2  128  06/05/2013  06/05/2013    128      .
  8   2  130  07/09/2013  06/05/2013    128      2
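Carrying a baseline value forward within each ID is a typical use of the RETAIN statement with BY-group processing. The sketch below shows one way this might look. It assumes that BASE is in the WORK library and that VISIT_TIME and TRTMT_TIME are stored as numeric SAS dates. The names base_sorted and final are hypothetical.

```sas
/* Sketch only: assumes WORK.BASE exists with numeric SAS date values. */

/* Put each patient's visits in chronological order. */
proc sort data=base out=base_sorted;
    by id visit_time;
run;

data final;
    set base_sorted;
    by id;
    retain b_sbp;                        /* keep the baseline across rows   */
    if first.id then b_sbp = .;          /* reset for each new patient      */
    if visit_time = trtmt_time then b_sbp = sbp;
    /* c_sbp is not retained, so it stays missing unless assigned here. */
    if visit_time > trtmt_time then c_sbp = sbp - b_sbp;
run;
```

If a patient has no visit on the treatment date, b_sbp stays missing and the subtraction yields a missing c_sbp.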
Problem 3 (6 points)
Write a macro named impute_num that replaces the missing values of a numeric variable with
either the mean or the median of that variable. The macro takes four arguments:
dat: the name of the data set.
var_name: the name of the numeric variable that you want to impute.
method: either mean or median. If you specify mean, the macro replaces the missing values with the
mean. Similarly, if you specify median, the macro uses the median. Set the default value to mean.
result: either var_only or all. Using var_only means you keep only the newly imputed variable in
the result data set. Using all means you keep the newly imputed variable in addition to all the
variables from the input data set. Set the default value to var_only.
You also need to add new as a suffix to the name of the newly imputed variable. For example, if you are
imputing the variable HDL, the newly imputed variable will be named HDLnew.
The following example imputes the HDL variable by replacing the missing values with the mean of HDL.
Only the newly imputed variable HDLnew is kept in the output data.
%impute_num(dat=patients, var_name=HDL)
The SAS System
Obs HDLnew
1 32.0000
2 60.0000
3 55.6667
4 65.0000
5 55.6667
6 32.0000
7 55.6667
8 70.0000
9 55.6667
10 75.0000
The following example imputes the TGL variable by replacing the missing values with the median of TGL.
The output data contains all the variables from the original data plus the newly imputed variable TGLnew.
%impute_num(dat=patients, var_name=TGL, method=median, result=all)
The SAS System
Obs  ID  GLUC  TGL  HDL  LDL  HRT  MAMM  SMOKE  TGLnew
  1  A     88    .   32   99  Y          ever      180
  2  B      .  150   60    .       no    never     150
  3  C    110    .    .  120  N                    180
  4  D      .  200   65  165       yes   never     200
  5  E     90  210    .  150  Y          never     210
  6  F     88    .   32  210       yes   ever      180
  7  G    120  164    .    .  Y    yes             164
  8  H    110  170   70  188             ever      170
  9  I      .  190    .  190  N    no              190
 10  J     90    .   75    .       yes   never     180
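A macro with keyword parameters and defaults, as described above, could be laid out along the following lines. This is only a sketch of one possible structure. The output data set name (the input name followed by _imputed) and the temporary names _stat and _fill are assumptions made for illustration and are not part of the problem statement.

```sas
/* Sketch only: output and temporary names are hypothetical. */
%macro impute_num(dat=, var_name=, method=mean, result=var_only);

    /* Compute the requested statistic; &method resolves to MEAN or MEDIAN. */
    proc means data=&dat noprint;
        var &var_name;
        output out=_stat &method=_fill;
    run;

    data &dat._imputed;
        /* Read the single statistic once; SET variables are retained. */
        if _n_ = 1 then set _stat (keep=_fill);
        set &dat;
        /* COALESCE returns the first nonmissing numeric argument. */
        &var_name.new = coalesce(&var_name, _fill);
        %if %upcase(&result) = VAR_ONLY %then %do;
            keep &var_name.new;
        %end;
        %else %do;
            drop _fill;
        %end;
    run;

%mend impute_num;
```

The period in &var_name.new marks the end of the macro variable reference, so HDL resolves to HDLnew.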
Problem 4 (6 points)
Write a macro named impute_freq that replaces the missing values of a variable (numeric or
character) with the most frequent value of that variable. The macro takes three arguments:
dat: the name of the data set.
var_name: the name of the variable that you want to impute.
result: either var_only or all. Using var_only means you keep only the newly imputed variable in
the result data set. Using all means you keep the newly imputed variable in addition to all the
variables from the input data set. Set the default value to var_only.
You also need to add new as a suffix to the name of the newly imputed variable. For example, if you are
imputing the variable smoke, the newly imputed variable will be named smokenew.
The following example imputes the smoke variable by replacing the missing values with the most frequent
value of smoke. Only the newly imputed variable smokenew is kept in the output data.
%impute_freq(dat=patients, var_name=smoke)
The SAS System
Obs smokenew
1 ever
2 never
3 never
4 never
5 never
6 ever
7 never
8 ever
9 never
10 never
The following example imputes the HRT variable by replacing the missing values with the most frequent
value of HRT. The output data contains all the variables from the original data plus the newly imputed
variable HRTnew.
%impute_freq(dat=patients, var_name=HRT, result=all)
The SAS System
Obs  ID  GLUC  TGL  HDL  LDL  HRT  MAMM  SMOKE  HRTnew
  1  A     88    .   32   99  Y          ever   Y
  2  B      .  150   60    .       no    never  Y
  3  C    110    .    .  120  N                 N
  4  D      .  200   65  165       yes   never  Y
  5  E     90  210    .  150  Y          never  Y
  6  F     88    .   32  210       yes   ever   Y
  7  G    120  164    .    .  Y    yes          Y
  8  H    110  170   70  188             ever   Y
  9  I      .  190    .  190  N    no           N
 10  J     90    .   75    .       yes   never  Y
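Finding the most frequent level can be done by counting the nonmissing levels and sorting the counts. The sketch below shows one way to arrange this so that it works for both numeric and character variables. The output data set name and the temporary names _freq and _mode are assumptions for illustration, and ties between equally frequent levels are not handled specially.

```sas
/* Sketch only: output and temporary names are hypothetical. */
%macro impute_freq(dat=, var_name=, result=var_only);

    /* Count each nonmissing level; MISSING() accepts either variable type. */
    proc freq data=&dat (where=(not missing(&var_name))) noprint;
        tables &var_name / out=_freq;
    run;

    /* Put the most frequent level in the first observation. */
    proc sort data=_freq;
        by descending count;
    run;

    data &dat._imputed;
        /* Read only the top level and carry it along as _mode. */
        if _n_ = 1 then
            set _freq (obs=1 keep=&var_name rename=(&var_name=_mode));
        set &dat;
        &var_name.new = &var_name;
        if missing(&var_name.new) then &var_name.new = _mode;
        %if %upcase(&result) = VAR_ONLY %then %do;
            keep &var_name.new;
        %end;
        %else %do;
            drop _mode;
        %end;
    run;

%mend impute_freq;
```

Filtering missing values with a WHERE= data set option keeps them out of the frequency table regardless of how PROC FREQ would otherwise report them.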
Problem 5 (3 points)
Write a macro named impute that imputes the missing values of one or more numeric variables
with either the mean or the median of each variable, and/or imputes one or more variables
with their most frequent value. The macro takes four arguments:
dat: the name of the data set.
num_vars: the name(s) of one or more numeric variables. For this group of variables, you need to replace
the missing values with either the mean or the median.
method: either mean or median. If you specify mean, the macro replaces the missing values with the
mean. Similarly, if you specify median, the macro uses the median. Set the default value to mean.
freq_vars: the name(s) of one or more variables. For this group of variables, you need to replace the
missing values with the most frequent value.
Please make sure the result data contains all the newly imputed variables in addition to all the variables
from the input data. Please test all the macro calls below to ensure your macro works properly.
The following macro call imputes the HRT variable with the most frequent value of HRT.
%impute(dat=patients, freq_vars=HRT)
The SAS System
Obs  ID  GLUC  TGL  HDL  LDL  HRT  MAMM  SMOKE  HRTnew
  1  A     88    .   32   99  Y          ever   Y
  2  B      .  150   60    .       no    never  Y
  3  C    110    .    .  120  N                 N
  4  D      .  200   65  165       yes   never  Y
  5  E     90  210    .  150  Y          never  Y
  6  F     88    .   32  210       yes   ever   Y
  7  G    120  164    .    .  Y    yes          Y
  8  H    110  170   70  188             ever   Y
  9  I      .  190    .  190  N    no           N
 10  J     90    .   75    .       yes   never  Y
The following macro call imputes the HRT, MAMM, and SMOKE variables, each with its own most frequent
value.
%impute(dat=patients, freq_vars=HRT MAMM SMOKE)
The SAS System
Obs  ID  GLUC  TGL  HDL  LDL  HRT  MAMM  SMOKE  HRTnew  MAMMnew  SMOKEnew
  1  A     88    .   32   99  Y          ever   Y       yes      ever
  2  B      .  150   60    .       no    never  Y       no       never
  3  C    110    .    .  120  N                 N       yes      never
  4  D      .  200   65  165       yes   never  Y       yes      never
  5  E     90  210    .  150  Y          never  Y       yes      never
  6  F     88    .   32  210       yes   ever   Y       yes      ever
  7  G    120  164    .    .  Y    yes          Y       yes      never
  8  H    110  170   70  188             ever   Y       yes      ever
  9  I      .  190    .  190  N    no           N       no       never
 10  J     90    .   75    .       yes   never  Y       yes      never
The following macro call imputes GLUC with the mean of this variable.
%impute(dat=patients, num_vars=GLUC)
The SAS System
Obs  ID  GLUC  TGL  HDL  LDL  HRT  MAMM  SMOKE  GLUCnew
  1  A     88    .   32   99  Y          ever    88.000
  2  B      .  150   60    .       no    never   99.429
  3  C    110    .    .  120  N                 110.000
  4  D      .  200   65  165       yes   never   99.429
  5  E     90  210    .  150  Y          never   90.000
  6  F     88    .   32  210       yes   ever    88.000
  7  G    120  164    .    .  Y    yes          120.000
  8  H    110  170   70  188             ever   110.000
  9  I      .  190    .  190  N    no            99.429
 10  J     90    .   75    .       yes   never   90.000
The following macro call imputes the GLUC, TGL, HDL, and LDL variables with their median values, and
imputes the HRT and SMOKE variables with their most frequent values.
%impute(dat=patients, num_vars=GLUC TGL HDL LDL, method=median, freq_vars=HRT SMOKE)
The SAS System
Obs  ID  GLUC  TGL  HDL  LDL  HRT
  1  A     88    .   32   99  Y
  2  B      .  150   60    .
  3  C    110    .    .  120  N
  4  D      .  200   65  165
  5  E     90  210    .  150  Y
  6  F     88    .   32  210
  7  G    120  164    .    .  Y
  8  H    110  170   70  188
  9  I      .  190    .  190  N
 10  J     90    .   75    .
Obs  MAMM  SMOKE  GLUCnew  TGLnew  HDLnew  LDLnew  HRTnew  SMOKEnew
  1        ever        88     180    32.0      99  Y       ever
  2  no    never       90     150    60.0     165  Y       never
  3                   110     180    62.5     120  N       never
  4  yes   never       90     200    65.0     165  Y       never
  5        never       90     210    62.5     150  Y       never
  6  yes   ever        88     180    32.0     210  Y       ever
  7  yes              120     164    62.5     165  Y       never
  8        ever       110     170    70.0     188  Y       ever
  9  no                90     190    62.5     190  N       never
 10  yes   never       90     180    75.0     165  Y       never
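Because num_vars and freq_vars may each hold several names, or none at all, a macro of this kind needs some way to walk through a space-separated list. The sketch below illustrates only that looping mechanism, using a hypothetical helper named each_var. The %PUT line marks the place where a per-variable step would go.

```sas
/* Sketch only: EACH_VAR is a hypothetical helper, not part of the assignment. */
%macro each_var(vars=);
    %local i v;
    %let i = 1;
    %let v = %scan(&vars, &i, %str( ));
    /* %SCAN returns an empty string once the list is exhausted, */
    /* so an empty list simply skips the loop.                   */
    %do %while (%length(&v) > 0);
        %put NOTE: variable &i is &v;
        %let i = %eval(&i + 1);
        %let v = %scan(&vars, &i, %str( ));
    %end;
%mend each_var;

%each_var(vars=HRT MAMM SMOKE)
```

Testing the length of the scanned word, rather than counting words up front, avoids special handling when an argument is left blank.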