Determine the amount of missing data in lipid2018_b.
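One possible way to count missing values is sketched below. It assumes lipid2018_b sits in the WORK library and covers numeric variables only; character variables would need a separate PROC FREQ. The percent missing for a variable is NMISS / (N + NMISS).

```sas
/* N = non-missing count, NMISS = missing count, for every numeric variable */
proc means data=lipid2018_b n nmiss;
  var _numeric_;
run;
```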
For variables with more than 5% missing,
create a missing value indicator variable
with the prefix mi_ followed by the variable name.
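A minimal sketch of the indicator step follows. The variable list (ldl, chol) is only a placeholder; substitute whichever variables turn out to exceed 5% missing. The indicators must be created before any imputation, otherwise they would all be 0.

```sas
data lipid2018_b;
  set lipid2018_b;
  /* Placeholder list: replace with the variables that are more than 5% missing */
  array v{*}  ldl chol;
  array mi{*} mi_ldl mi_chol;
  do i = 1 to dim(v);
    mi{i} = missing(v{i});   /* 1 if the value is missing, 0 otherwise */
  end;
  drop i;
run;
```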
For all variables with missing values,
do median imputation,
stratified by gender.
Check to make sure that you now have a new version of lipid2018_b that has no missing values.
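One way to do the stratified median imputation and the follow-up check is PROC STDIZE with a BY statement, sketched here. It assumes the variables to impute are numeric and that SAS/STAT is available. With no VAR statement, every numeric variable not named in the BY statement is processed; add a VAR statement to restrict the list.

```sas
/* BY-group processing requires the data to be sorted by gender */
proc sort data=lipid2018_b;
  by gender;
run;

/* REPONLY replaces missing values only; observed values are left unchanged */
proc stdize data=lipid2018_b out=lipid2018_b method=median reponly;
  by gender;
run;

/* Check: NMISS should now be 0 for every imputed variable */
proc means data=lipid2018_b n nmiss;
  var _numeric_;
run;
```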
Sometimes one needs to combine data for either subject matter or statistical reasons.
Examine smoking by gender in the data set lipid2018_b.
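A cross-tabulation along these lines would show the smoking distribution within each gender. The MISSING option is deliberately left off, so row percentages are computed among non-missing observations and the number of missing values is reported beneath the table.

```sas
/* Frequencies and row percentages of smoking within each gender */
proc freq data=lipid2018_b;
  tables gender*smoking / nocol nopercent;
run;
```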
Create a new version of lipid2018_b with no missing values for smoking.
For observations with missing data for smoking, randomly assign a smoking category within each gender.
Use the percentages from the above examination to generate a random category.
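The random assignment could be sketched with the RAND function's TABLE distribution, which returns 1, 2, 3, ... with the probabilities supplied. Everything specific here is an assumption: smoking is taken to be numeric with three categories coded 1 to 3, gender is taken to be coded 1 and 2, and the probabilities are placeholders to be replaced by the row percentages actually observed.

```sas
data lipid2018_b;
  set lipid2018_b;
  call streaminit(2018);   /* fixed seed so the assignment is reproducible */
  if missing(smoking) then do;
    /* Placeholder probabilities: replace with the observed proportions for each gender */
    if gender = 1 then smoking = rand('table', 0.50, 0.30, 0.20);
    else if gender = 2 then smoking = rand('table', 0.60, 0.25, 0.15);
  end;
run;
```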
Use Greenacre's method to justify combining smoking into a single 0/1 variable,
currsmok = 1 if current smoker, 0 otherwise.
Create a new version of lipid2018_b that contains the variable currsmok and does not contain the variable smoking.
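The recode itself is a short DATA step. The sketch below assumes current smokers are coded smoking = 3; check the codebook and adjust the value before using it.

```sas
data lipid2018_b;
  set lipid2018_b;
  /* Assumption: smoking = 3 denotes current smokers */
  currsmok = (smoking = 3);   /* logical expression evaluates to 1 or 0 */
  drop smoking;
run;
```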
Although I'm not going to use principal components in my course examples, they are often used to reduce the dimensionality of data. The following is a short introduction to SAS's PROC PRINCOMP.
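As a rough illustration, a basic PROC PRINCOMP call looks like the sketch below. The variable list and the output data set name lipid_pc are placeholders chosen for the example. By default the components are computed from the correlation matrix, and observations with missing values are left out of the analysis.

```sas
/* Keep the first 3 components; scores are written to lipid_pc as Prin1-Prin3 */
proc princomp data=lipid2018_b out=lipid_pc n=3;
  var chol ldl dbp;   /* placeholder list: replace with the variables of interest */
run;
```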
Use the RANK option in PROC CORR to
examine multicollinearity among the variables in lipid2018_b.
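A sketch of the call follows. With RANK, the correlations for each variable are listed in order from largest to smallest absolute value, so strongly related pairs appear at the front of each row.

```sas
/* Ordered correlations for all numeric variables */
proc corr data=lipid2018_b rank;
  var _numeric_;
run;
```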
Create a new version of lipid2018_b that does not contain the variables dbp and ldl.
Note: in a more complete analysis, I would probably do separate model developments, one using ldl and one using chol. The decision to drop ldl is a bit arbitrary but fits reasonably into common practice.