Pre-processing

Pre-processing (data munging or wrangling) is a first step in data analytics.

There are a few simple things that I usually start with.

The first thing is to make an exact copy of the file containing the data to be analyzed.
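A minimal sketch of making the copy with a DATA step. The folder paths and the libref a for the original file are assumptions for illustration; only lipid2018, lipid2018_b, and the libref b come from this page.

```sas
/* Hypothetical paths: point both at folders that exist on your computer. */
libname a 'C:\mydata\original';   /* assumed location of lipid2018 */
libname b 'C:\mydata\working';    /* library for the working copy  */

/* Copy every observation and variable into the new data set. */
data b.lipid2018_b;
  set a.lipid2018;
run;
```

Working on the copy leaves the original file untouched, so any later step can be rerun from a clean starting point.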

Homework
Create a copy of the lipid2018 data set named lipid2018_b in a library with libname b on your computer.

Examine the contents of the file.
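One way to do this is PROC CONTENTS with the VARNUM option, sketched here on the assumption that the libref b is already assigned.

```sas
/* VARNUM lists the variables by their position in the data set
   instead of the default alphabetical order. */
proc contents data=b.lipid2018_b varnum;
run;
```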

Homework
Run a proc contents on the data set lipid2018_b. Include the option that lists the variables in the order they appear on each record.

Examine the number of unique values for each variable.
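A short sketch of one common way to get these counts: the NLEVELS option of PROC FREQ. This may or may not be the approach shown in the video.

```sas
/* NLEVELS reports the number of distinct values of each variable.
   NOPRINT on the TABLES statement suppresses the individual
   frequency tables, so only the level counts are printed. */
proc freq data=b.lipid2018_b nlevels;
  tables _all_ / noprint;
run;
```

A variable with as many levels as there are observations is probably an identifier, and a variable with only a handful of levels is probably categorical.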

Homework
Examine the number of unique values of all of the variables on the data set lipid2018_b.

Examine the numeric variables on the file.
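A minimal sketch using PROC MEANS. The choice of statistics is my own assumption; add or remove keywords as needed.

```sas
/* _NUMERIC_ selects every numeric variable on the data set.
   NMISS, MIN, and MAX help spot missing values and impossible values. */
proc means data=b.lipid2018_b n nmiss mean std min max;
  var _numeric_;
run;
```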

Homework
Examine the numeric variables on the file lipid2018_b.

Macros are handy for repetitive tasks.
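As an illustration, the checks described above can be wrapped in a macro so that one call runs them on any data set. The macro name peek and its parameter ds are hypothetical, not taken from the course materials.

```sas
/* Hypothetical macro: a quick first look at one data set. */
%macro peek(ds);
  title "Quick look at &ds";

  proc contents data=&ds varnum;      /* variables in file order    */
  run;

  proc freq data=&ds nlevels;         /* distinct values per column */
    tables _all_ / noprint;
  run;

  proc means data=&ds n nmiss mean std min max;
    var _numeric_;                    /* summary of numeric columns */
  run;

  title;                              /* clear the title            */
%mend peek;

%peek(b.lipid2018_b)
```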

An autoexec.sas file can also save lots of coding time.
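A sketch of what such a file might contain. SAS runs the statements in autoexec.sas at startup when it finds the file in one of its default locations or is pointed to it with the AUTOEXEC system option. The path and the choice of options here are assumptions.

```sas
/* autoexec.sas: hypothetical contents */
libname b 'C:\mydata\working';   /* hypothetical path; assigns b in every session */
options nodate nonumber;         /* no date or page number on listing output      */
```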

Examine the coding of character variables using PROC FREQ.
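A minimal sketch, assuming the libref b is assigned. The MISSING option is my addition so that blank values show up as their own level.

```sas
/* _CHARACTER_ selects every character variable on the data set. */
proc freq data=b.lipid2018_b;
  tables _character_ / missing;
run;
```

Look for inconsistent spellings or letter case, such as 'M', 'm', and 'Male' all standing for the same category.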

Homework
Use proc freq to examine the coding of the character variables on the data set lipid2018_b.

I usually change character variables to numeric.
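The pattern can be sketched for a single variable. The character values 'M' and 'F' and the numeric codes 1 and 0 are placeholders only; the real values are whatever PROC FREQ shows, and the real codes are the ones given in the video. The temporary name gender_n is also hypothetical.

```sas
data b.lipid2018_b;
  set b.lipid2018_b;

  /* HYPOTHETICAL values and codes: replace them with the values
     found on the file and the coding given in the video. */
  if      gender = 'M' then gender_n = 1;
  else if gender = 'F' then gender_n = 0;
  /* any other value, including blank, leaves gender_n missing */

  drop gender;                 /* remove the character version         */
  rename gender_n = gender;    /* reuse the original name, now numeric */
run;
```

SAS applies DROP before RENAME when it writes the output data set, which is why the numeric variable can take over the original name in the same step.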

Homework
Change the character variables on lipid2018_b (smkcat, chd, gender, and diabetes) to numeric variables. Use the coding in the video for the numeric variables. Drop the character variables from lipid2018_b.

To reduce intra-individual variability, replace repeated measurements with their mean.
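A minimal sketch for the blood pressure readings, using the variable names from the homework below.

```sas
data b.lipid2018_b;
  set b.lipid2018_b;

  /* MEAN ignores missing arguments, so a person with only two
     readings still gets the mean of those two. */
  sbp = mean(of sbp1-sbp3);
  dbp = mean(of dbp1-dbp3);

  drop sbp1-sbp3 dbp1-dbp3;    /* keep only the averaged values */
run;
```

Writing (sbp1 + sbp2 + sbp3) / 3 instead would return a missing value whenever any one reading is missing.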

Homework
Create two new variables on lipid2018_b: sbp = mean of sbp1, sbp2, and sbp3, and dbp = mean of dbp1, dbp2, and dbp3. Drop the variables sbp1, sbp2, sbp3, dbp1, dbp2, and dbp3 from lipid2018_b.

Compute body mass index.
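A sketch using the usual definition, weight in kilograms divided by the square of height in meters. It assumes that weight is recorded in kilograms and height in centimeters; check the units and the definition in the course materials before using it.

```sas
data b.lipid2018_b;
  set b.lipid2018_b;

  /* ASSUMPTION: weight is in kg and height is in cm.
     Dividing by 100 converts height to meters. */
  bmi = weight / (height / 100)**2;
run;
```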

Standardize vital capacity for height.

Homework
Add the variable bmi, as defined in the course materials, to lipid2018_b.
Add the variable fvcht, as defined in the course materials, to lipid2018_b.
Drop the variables fvc, weight and height from lipid2018_b.

What is left to do in Preprocessing, Part 2

The slides used in the videos are found here