0:00

This lecture is about preprocessing predictor variables. As we saw in previous lectures, you need to plot the variables up front so you can see if there's any sort of weird behavior in them. Sometimes predictors, or their distributions, will look very strange, and you might need to transform them in order to make them more useful for prediction algorithms. This is particularly true when you're using model-based algorithms, like linear discriminant analysis, naive Bayes, linear regression, and things like that. We'll talk about all of those methods later in the class, but keep in mind that preprocessing is often more useful when you're using model-based approaches than when you're using more non-parametric approaches.

0:37

So, why preprocess? Here again I'm loading the caret package and the kernlab package, and then I'm attaching the spam data. Just like I talked about previously, when you're deciding how to preprocess or explore data, we only look at the training set. So we split the data right away into training and testing sets, and we set the testing data aside for later. Now if I look at one of the variables (again, this is spam data, so we're trying to predict whether an email is spam or a good email, ham), the variables are things like: how many capital letters do we see in a row? What's the run length of capitals in a row in an email? If you make a histogram of those values, you see that almost all of the run lengths are very small, but there are a few that are much, much larger. This is an example of a variable that is very skewed, so it's very hard to deal with in model-based predictors, and you might want to preprocess it. If you take the mean of this variable, it's about 4.7, but the standard deviation is huge, much, much larger, so it's a highly variable variable. What you might want to do is some sort of preprocessing, so the machine learning algorithms don't get tricked by the fact that it's skewed and highly variable.

1:51

So, one way that you can do this is by standardizing variables. The usual way to standardize a variable is to take its values, subtract their mean, and then divide that whole quantity by the standard deviation. When you do that, the mean of the standardized variable will be zero and its standard deviation will be one, so that will reduce a lot of the variability that we saw previously and standardize the variable somewhat.
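As a small sketch in base R, using a toy vector rather than the spam data, the standardization just described looks like this:

```r
# Toy vector standing in for a skewed predictor such as capitalAve
x <- c(1, 1, 2, 2, 3, 50)

# Standardize: subtract the mean, then divide by the standard deviation
z <- (x - mean(x)) / sd(x)

mean(z)  # essentially 0 (up to floating point)
sd(z)    # exactly 1
```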

2:21

One thing to keep in mind is that when we apply a prediction algorithm to the test set, we can only use parameters that we estimated in the training set. In other words, when we apply this same standardization to the test set, we have to use the mean from the training set, and the standard deviation from the training set, to standardize the testing set values. What does this mean? It means that when you do the standardization, the mean will not be exactly zero in the test set, and the standard deviation will not be exactly one, because we've standardized by parameters estimated in the training set. Hopefully they'll be close to those values even though we're not using the exact values from the test set. You can also use the preProcess function to do a lot of this standardization for you. The preProcess function is built into the caret package. Here I'm passing it all of the training variables except one, the 58th in the data set, which is the actual outcome that we care about, and I'm telling it to center and scale every variable. That will do the same transformation we talked about previously, where you subtract the mean and divide by the standard deviation. You can see that by looking at the mean of the variable capitalAve, just like we did before: after using the preProcess function, the mean is zero and the standard deviation is one. So preProcess can be used to perform a lot of the preprocessing techniques that you used to have to do by hand.
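The code being described is roughly the following sketch. It assumes the caret and kernlab packages are installed; the split proportion and variable names are taken from the discussion above.

```r
library(caret)
library(kernlab)
data(spam)

# Split the data; column 58 (type) is the outcome
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

# Estimate centering/scaling parameters from the training predictors only
preObj <- preProcess(training[, -58], method = c("center", "scale"))

# Apply them to the training set itself
trainCapAveS <- predict(preObj, training[, -58])$capitalAve
mean(trainCapAveS)  # 0, up to floating point
sd(trainCapAveS)    # 1
```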

The other thing that you can do is use the object created by the preprocessing to apply that same preprocessing to the test set. Here, preObj is the object from the previous slide, the one we created by preprocessing the training set.

4:09

So, looking at that, if we pass the testing set to the predict function, it will take the values calculated on the preprocessing object and apply them to the test set. So again, in the preprocessed test set data, the mean won't be exactly equal to zero for any variable, and the standard deviation won't be exactly equal to one, but they'll be close, because we used the training set values to normalize.
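A sketch of that step, repeating the setup so it stands alone (caret and kernlab assumed installed):

```r
library(caret)
library(kernlab)
data(spam)

inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

# Parameters (means, SDs) are estimated from the training set only
preObj <- preProcess(training[, -58], method = c("center", "scale"))

# predict() applies the training-set parameters to the test predictors
testCapAveS <- predict(preObj, testing[, -58])$capitalAve
mean(testCapAveS)  # close to, but not exactly, 0
sd(testCapAveS)    # close to, but not exactly, 1
```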

4:36

You can also pass preprocessing commands directly to the train function in caret, as an argument. For example, we can pass the preProcess argument of the train function the parameters center and scale, and that will center and scale all of the predictors before using those predictors in the prediction model.

The other thing that you can do is other kinds of transformations. Centering and scaling is one approach, and it will take care of some of the problems that we see in these data, such as very strongly skewed predictors or predictors that have super high variability. You can also use different kinds of transformations; one example is the Box-Cox transform. Box-Cox transforms are a set of transformations that take continuous data and try to make it look like normal data, and they do that by estimating a specific set of parameters using maximum likelihood. So I use the preProcess function and tell it to perform Box-Cox transformations on each of the variables, and then I predict

5:37

each of the different variables using that preProcess object on the training set. I can look at the capitalAve values and make a histogram of them. Remember, in the original histogram they looked like a huge pile at zero with a few large values. Now you see something that looks a little bit more like a normal distribution, a little bit more like a bell curve.
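A sketch of the Box-Cox step, again with the setup repeated so it runs on its own (caret and kernlab assumed installed):

```r
library(caret)
library(kernlab)
data(spam)

inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]

# Estimate a Box-Cox transformation for each continuous predictor
# by maximum likelihood
preObj <- preProcess(training[, -58], method = c("BoxCox"))
trainCapAveS <- predict(preObj, training[, -58])$capitalAve

# Inspect how close to normal the transformed variable looks
par(mfrow = c(1, 2))
hist(trainCapAveS)
qqnorm(trainCapAveS)
```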

You will notice that it doesn't take care of all of the problems. For example, there's still a stacked set of values at zero, and in the Q-Q plot, which shows the theoretical quantiles of the normal distribution versus the sample quantiles that we calculated for our preprocessed data, you can see that they don't perfectly line up. In particular, there's a chunk down at the bottom that doesn't lie on a perfect 45-degree line. That's because if you have a bunch of values that are exactly equal, this is a continuous transform, and it doesn't take care of values that are repeated. So it doesn't take care of all of the problems that would occur with using a variable that's highly skewed.

6:38

Another thing that we can do is impute data for these data sets. It's very common to have missing data, and when there is missing data in the data sets, the prediction algorithms often fail; prediction algorithms are built not to handle missing data in most cases. So, if you have some missing data, you can impute it using something called k-nearest neighbors imputation. Here we set the seed, because this is a randomized algorithm and we want to get reproducible results, and I take just the capital average values and create a new variable called capAve, in which some values have been set to missing.

7:29

So, now I want to know: how would I handle those missing values? I did this so that we could see how we could handle missing values in a data set. One thing you can do is use the preProcess function and tell it to do k-nearest neighbors imputation. K-nearest neighbors imputation finds the k nearest data vectors, so if k is equal to ten, the 10 data vectors that look most like the data vector with the missing value, averages the values of the variable that's missing, and imputes them at that position. So, if we do that, then we can predict on our training set all of the new values, including the ones that have been imputed with the k-nearest neighbors imputation algorithm.
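The imputation step sketched in full (assuming caret, kernlab, and RANN, which caret's knnImpute relies on, are installed; the 5% missingness rate is illustrative):

```r
library(caret)
library(kernlab)
library(RANN)   # caret's knnImpute uses RANN for the neighbor search
data(spam)

set.seed(13343)  # randomized algorithm, so fix the seed
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]

# Copy capitalAve and artificially set ~5% of its values to NA
training$capAve <- training$capitalAve
selectNA <- rbinom(nrow(training), size = 1, prob = 0.05) == 1
training$capAve[selectNA] <- NA

# Impute (and standardize) with k-nearest neighbors; column 58 is
# still the outcome, so it is excluded
preObj <- preProcess(training[, -58], method = "knnImpute")
capAve <- predict(preObj, training[, -58])$capAve

sum(is.na(capAve))  # the missing values have been filled in
```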

8:17

One thing you can do is compare the imputed values with the actual values. In this case, since we set some of the values to be equal to NA in advance, we can look at the values that were imputed and the values that were truly there before we removed them and made them NAs, and see how close those two sets of values are to each other. You can see that they are relatively close: most of the differences are very close to zero, so the imputation worked relatively well. You can also look at just the values that were imputed. Here I'm looking at a quantile of the same difference between the imputed values and the true values, but only for the ones that were missing. Again, most of the values are close to zero, but since we're only looking at the ones that were missing, clearly some of them are more variable than previously. Then you can look at the ones that we did not select to be NA, and you can see that they're even closer to each other. So the ones that got imputed are a little bit further apart, but not that much further apart.
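A sketch of that comparison, with the setup repeated so it stands alone (the variable name capAveTruth is a plausible guess at the slide's naming):

```r
library(caret)
library(kernlab)
library(RANN)
data(spam)

set.seed(13343)
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]

# Knock out ~5% of capitalAve values, keeping the truth for comparison
training$capAve <- training$capitalAve
selectNA <- rbinom(nrow(training), size = 1, prob = 0.05) == 1
training$capAve[selectNA] <- NA

preObj <- preProcess(training[, -58], method = "knnImpute")
capAve <- predict(preObj, training[, -58])$capAve

# Standardize the true values so they are on the same scale
capAveTruth <- training$capitalAve
capAveTruth <- (capAveTruth - mean(capAveTruth)) / sd(capAveTruth)

quantile(capAve - capAveTruth)               # all values
quantile((capAve - capAveTruth)[selectNA])   # only the imputed ones
quantile((capAve - capAveTruth)[!selectNA])  # only the untouched ones
```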

9:26

So, there is a lot more to learn about transformations of training and testing sets, but the thing to keep in mind is that training and test sets must be processed in the same way. The caret package handles a lot of this under the hood, in the sense that if you train a data set using the preProcess option

9:44

built into the train function in caret, it applies that preprocessing to the test set and handles all of the preprocessing in the correct way for you. In general, you need to pay attention to the fact that anything you do to the training set will create a set of parameters, and you must use only those parameters when you apply the transformation to the test set. You can't estimate new transformations on the test set alone. That means that the test set transformations will likely be imperfect, especially if the test and training sets are collected at different times or in different ways; some of the transformations won't necessarily work as well.

10:21

All of the transformations I've talked about so far, other than imputation, are based on continuous variables. When dealing with factor variables, it's a little bit more difficult to know what the right transformation is. Most machine learning algorithms are built to deal with either binary predictors, in which case the binary predictors are usually not preprocessed, or continuous predictors, in which case it's sometimes expected that the data are preprocessed to look more normal. You can go to the link here for more information about how to preprocess data for prediction using the caret package.
