Pre-processing, Part 2

Dealing with missing data.

By default, most software, including SAS, treats missing data by doing a complete case analysis.

The main problem with a complete case analysis is that it makes scoring of new cases difficult.

Missing data for categorical variables requires special treatment.

Dealing with missing values on the chd2018_a data set, 1.

Dealing with missing values on the chd2018_a data set, 2.

Dealing with missing values on the chd2018_a data set, 3.

Determine the amount of missing data on lipid2018_b.
For variables with more than 5% missing, create a missing value indicator variable with the prefix mi_ followed by the variable name.
For all variables with missing values, do median imputation, with the imputation stratified by gender.
Check to make sure that you now have a new version of lipid2018_b that has no missing values.
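
The steps of this exercise can be sketched in a few lines. The following is a minimal Python illustration of the logic only, not the SAS code used in the course. The rows are made-up toy values, None stands in for a SAS missing value, and the function names are hypothetical.

```python
from statistics import median

# Made-up toy rows; None stands in for a SAS missing value.
rows = [
    {"gender": "F", "chol": 180, "ldl": 100},
    {"gender": "F", "chol": None, "ldl": 110},
    {"gender": "F", "chol": 200, "ldl": 120},
    {"gender": "M", "chol": 210, "ldl": None},
    {"gender": "M", "chol": 190, "ldl": 130},
    {"gender": "M", "chol": 230, "ldl": 150},
]

def missing_fraction(rows, var):
    """Share of rows in which var is missing."""
    return sum(r[var] is None for r in rows) / len(rows)

def indicate_and_impute(rows, variables, by="gender", threshold=0.05):
    """Return copies of the rows with mi_<var> flags added for variables
    above the missingness threshold, and with every missing value replaced
    by the median of the observed values in the same `by` group."""
    out = [dict(r) for r in rows]
    for var in variables:
        frac = missing_fraction(rows, var)
        if frac == 0:
            continue
        if frac > threshold:
            # The flag is created before imputation, while the gaps are still visible.
            for r in out:
                r["mi_" + var] = int(r[var] is None)
        medians = {}
        for g in {r[by] for r in rows}:
            observed = [r[var] for r in rows if r[by] == g and r[var] is not None]
            medians[g] = median(observed)  # raises an error if a group has no observed values
        for r in out:
            if r[var] is None:
                r[var] = medians[r[by]]
    return out

clean = indicate_and_impute(rows, ["chol", "ldl"])
# Final check from the exercise: count the missing values that remain.
remaining = sum(v is None for r in clean for v in r.values())
```

The indicator has to be created before the imputation step. Once the medians are filled in, the information about which values were originally missing is gone.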

The problem of complete and quasi-complete separation.

You should be familiar with the concepts of complete and quasi-complete separation in order to recognize when they occur and to understand why they are problematic.
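
For a single categorical predictor and a binary outcome, both conditions can be read off the cross-tabulation. The following is a minimal Python sketch of that check, assuming the simplest one-predictor case. The function name and the toy values are hypothetical, and the check does not cover separation that arises from combinations of several predictors.

```python
from collections import Counter

def separation_status(x, y):
    """Classify a categorical predictor x against a binary outcome y.
    'complete': every level of x occurs with only one outcome value.
    'quasi-complete': some, but not all, levels occur with only one outcome value.
    'none': every level occurs with both outcome values."""
    counts = Counter(zip(x, y))
    levels = sorted(set(x))
    outcomes = sorted(set(y))
    if len(outcomes) < 2:
        return "no variation in outcome"
    # A level is "pure" when exactly one outcome value has a nonzero cell.
    pure = [lv for lv in levels
            if sum(counts[(lv, o)] > 0 for o in outcomes) == 1]
    if len(pure) == len(levels):
        return "complete"
    if pure:
        return "quasi-complete"
    return "none"

# Made-up examples: the outcome is 0/1 and the predictor has two or three levels.
complete = separation_status(["a", "a", "b", "b"], [0, 0, 1, 1])
quasi = separation_status(["a", "a", "b", "b", "c", "c"], [0, 0, 0, 1, 0, 1])
fine = separation_status(["a", "a", "b", "b"], [0, 1, 0, 1])
```

A zero cell in the cross-tabulation is the warning sign: the maximum likelihood estimate for that level does not exist as a finite number.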

Combining cells in categorical variables.

Sometimes one needs to combine data for either subject matter or statistical reasons.

An example of collapsing cells for subject matter reasons is the smoking categorical variable.

Examine smoking by gender on the data set lipid2018_b.
Create a new version of lipid2018_b with no missing values for smoking. For observations with missing data for smoking, randomly assign them to a smoking category, separately by gender.
Use the percentages from the above examination to generate a random category.
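
The random assignment described above can be sketched as follows. This is a minimal Python illustration of the idea, not the SAS code used in the course. The toy rows, the category labels, the seed, and the function name are all made up.

```python
import random
from collections import Counter

# Made-up toy rows; None stands in for a missing smoking value.
rows = [
    {"gender": "F", "smoking": "never"},
    {"gender": "F", "smoking": "never"},
    {"gender": "F", "smoking": "former"},
    {"gender": "F", "smoking": None},
    {"gender": "M", "smoking": "current"},
    {"gender": "M", "smoking": "never"},
    {"gender": "M", "smoking": None},
]

def impute_category(rows, var="smoking", by="gender", seed=2018):
    """Fill missing categories with a random draw from the observed
    distribution of var within the same `by` group."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    observed = {}
    for r in rows:
        if r[var] is not None:
            observed.setdefault(r[by], Counter())[r[var]] += 1
    out = [dict(r) for r in rows]
    for r in out:
        if r[var] is None:
            dist = observed[r[by]]
            levels = sorted(dist)
            # Observed counts act as weights, so draws follow the observed percentages.
            r[var] = rng.choices(levels, weights=[dist[lv] for lv in levels], k=1)[0]
    return out

filled = impute_category(rows)
```

Because the weights come from each gender's own counts, a missing value can only be assigned a category that was actually observed for that gender.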

A data-driven method for combining cells is Greenacre's method.

Use Greenacre's method to justify combining smoking into a single 0/1 variable: currsmok=1 if current smoker, 0 otherwise.
Create a new version of lipid2018_b that contains the variable currsmok and does not contain the variable smoking.
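
As I understand Greenacre's method, levels are merged one pair at a time, each time choosing the merge that loses the least of the Pearson chi-square between the categorical variable and the outcome. The following minimal Python sketch shows that idea on a made-up 3 x 2 table of smoking by a binary outcome. The counts, the level names, and the function names are hypothetical, and this is not the SAS code used in the course.

```python
from itertools import combinations

def chi_square(table):
    """Pearson chi-square for a table {level: [count_outcome_0, count_outcome_1]}.
    Assumes every row total and column total is positive."""
    col = [sum(row[j] for row in table.values()) for j in range(2)]
    n = sum(col)
    stat = 0.0
    for row in table.values():
        row_total = sum(row)
        for j in range(2):
            expected = row_total * col[j] / n
            stat += (row[j] - expected) ** 2 / expected
    return stat

def greenacre_steps(table):
    """Merge levels pairwise until two remain. At each step keep the merge
    with the largest remaining chi-square, and record the merged label and
    the share of the original chi-square that is retained."""
    table = {k: list(v) for k, v in table.items()}
    full = chi_square(table)
    steps = []
    while len(table) > 2:
        best = None
        for a, b in combinations(list(table), 2):
            merged = {k: v for k, v in table.items() if k not in (a, b)}
            merged[a + "+" + b] = [table[a][j] + table[b][j] for j in range(2)]
            stat = chi_square(merged)
            if best is None or stat > best[0]:
                best = (stat, merged, a + "+" + b)
        stat, table, label = best
        steps.append((label, stat / full))
    return steps

# Made-up counts: [outcome = 0, outcome = 1] for each smoking level.
smoking_table = {"never": [400, 40], "former": [190, 20], "current": [150, 50]}
steps = greenacre_steps(smoking_table)
```

With these toy counts the never and former smokers have almost the same outcome rate, so merging them retains nearly all of the chi-square. That is the kind of evidence that would justify a single current-smoker indicator.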

Quasi-complete separation is a case where, when modeling, cells should be combined.

Dealing with multicollinearity.

Multicollinearity, a simple case

Principal Components

Although I'm not going to use principal components in my course examples, they are often used to reduce the dimensionality of data. The following is a short introduction to SAS's PROC PRINCOMP.

Introduction, a data set with a large number of highly correlated variables

An initial look involves examining correlations. PROC CORR has an option RANK that we will use later.

Use the RANK option in PROC CORR to examine multicollinearity among the variables in lipid2018_b.
Create a new version of lipid2018_b that does not contain the variables dbp and ldl.
Note: in a more complete analysis, I would probably do separate model developments, one using ldl and one using chol. The decision to drop ldl is a bit arbitrary but fits reasonably into common practice.
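
The RANK option orders each variable's correlations from largest to smallest in absolute value, which makes highly correlated pairs easy to spot. The following minimal Python sketch imitates that ordering. It is an illustration only, not PROC CORR output. The values are made up, and sbp is a hypothetical variable added to give dbp a partner.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranked_correlations(data):
    """For each variable, list the other variables with their correlations,
    ordered by absolute value, largest first."""
    result = {}
    for v in data:
        others = [(w, pearson(data[v], data[w])) for w in data if w != v]
        result[v] = sorted(others, key=lambda t: abs(t[1]), reverse=True)
    return result

# Made-up toy values for six subjects.
data = {
    "chol": [180, 200, 220, 240, 260, 210],
    "ldl": [100, 118, 141, 160, 182, 128],
    "sbp": [118, 135, 122, 140, 128, 150],
    "dbp": [76, 88, 80, 90, 83, 97],
}
ranked = ranked_correlations(data)
top_partner = {v: ranked[v][0][0] for v in ranked}
```

In this toy data each variable's strongest partner is its obvious near-duplicate, which is the pattern that motivates dropping one variable from each pair.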

A bit of formalization of principal components.

PROC PRINCOMP, Introduction

Scoring Principal Components and using them in a model
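
For two standardized variables with correlation r, the principal components have a closed form: the eigenvalues of the correlation matrix are 1 + |r| and 1 - |r|, and the components are the scaled sum and difference of the standardized variables. The following minimal Python sketch uses that special case to show what scoring means, namely applying the training means, standard deviations, and weights to a new case. It is an illustration under these assumptions, not PROC PRINCOMP itself. The values and function names are made up.

```python
from math import sqrt
from statistics import mean, stdev, variance

def fit_two_variable_pca(x, y):
    """Principal components of the 2 x 2 correlation matrix [[1, r], [r, 1]]."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    zx = [(a - mx) / sx for a in x]
    zy = [(b - my) / sy for b in y]
    r = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)
    sign = 1 if r >= 0 else -1
    return {"means": (mx, my), "sds": (sx, sy), "r": r, "sign": sign,
            "eigenvalues": (1 + abs(r), 1 - abs(r))}

def score(fit, x_new, y_new):
    """Score one case using the means, standard deviations, and weights
    stored at fit time. Nothing is re-estimated from the new case."""
    (mx, my), (sx, sy), s = fit["means"], fit["sds"], fit["sign"]
    zx, zy = (x_new - mx) / sx, (y_new - my) / sy
    return ((zx + s * zy) / sqrt(2), (zx - s * zy) / sqrt(2))

# Made-up training values for two highly correlated variables.
chol = [180, 200, 220, 240, 260, 210]
ldl = [100, 118, 141, 160, 182, 128]
fit = fit_two_variable_pca(chol, ldl)
train_scores = [score(fit, c, l) for c, l in zip(chol, ldl)]
pc1_variance = variance([s[0] for s in train_scores])  # equals the first eigenvalue
new_case = score(fit, 225, 145)  # a hypothetical new subject
```

When the two variables are highly correlated, the first component carries almost all of the variance, so it could stand in for both variables in a model.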

The slides used in the videos are found here