0:09

With associational analyses, the basic goal is to determine whether a key predictor and an outcome are associated with each other in the presence of many other potentially confounding factors. But sometimes you want to be able to predict the outcome with all of the available information that you have.

0:25

So you don't necessarily have to distinguish between, say, a key predictor and a set of other predictors; you just want to use all the information. Furthermore, it doesn't matter whether the variables are related to the outcome in some sort of causal or mechanistic way: if they carry any information at all about the outcome, they may be useful in a prediction setting, and you might want to use them. That's because you're usually not interested in developing a detailed understanding of how the variables are related to each other, or how they're related to the outcome. You just want to be able to produce solid, high-quality predictions, and so any variable that could play a role in that might be useful to you.

So, what are the expectations that we have about prediction problems? When we build a prediction model, the thing we want is to find a feature, or a set of features, that can produce good separation in the outcome. Typically the outcome is going to be some binary or multi-class outcome that can take either two values or just a handful of values. That is the typical setup of a prediction problem. And you want to be able to separate the two classes using a set of features that you collect and a model that you develop, okay?

So here's a very simple example of a single predictor on a binary outcome that produces very good separation. This data is simulated; I just want to show it to you so you can see what it looks like. On the y-axis I have the values of the outcome, which are just 0 and 1, so it's a binary outcome. You can think of this as not having the disease versus having the disease, or any sort of binary class outcome like that.

2:00

On the x-axis I've got a simulated predictor that ranges between -2 and 2, roughly. It's continuous, so it takes all the different values in between. And you can see that there's a gray area that I've highlighted in the plot, near the middle. In that gray area, the outcome takes values of both 0 and 1, depending on the value of the predictor. So the outcome can take either value in that gray area. To the right of the gray area you'll notice that the outcome is always 1, and to the left of the gray area you'll see that the outcome is always 0. So the goal of most prediction algorithms is essentially to minimize the size of that gray area, because the gray area is where you have the most uncertainty: it's the range of the predictor where the outcome could actually take both values. Once you're outside the gray area, you have almost absolute certainty, because the outcome will be either always 0 or always 1. So the goal is to minimize the size of this gray area using some set of features that you can collect.
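The kind of picture described here can be sketched with a small simulation. This is a hypothetical illustration, not the lecture's actual simulated data; the sample size, noise level, and seed are all assumptions.

```python
import random

random.seed(1)

# Simulate a continuous predictor and a binary outcome where the two
# classes overlap only in a narrow "gray area" near the middle.
n = 200
x = [random.uniform(-2, 2) for _ in range(n)]
# Outcome is 1 when the predictor plus a little noise exceeds zero.
y = [1 if xi + random.gauss(0, 0.3) > 0 else 0 for xi in x]

# The gray area is the range of x where both outcome values occur:
# from the smallest x labeled 1 to the largest x labeled 0.
lo = min(xi for xi, yi in zip(x, y) if yi == 1)
hi = max(xi for xi, yi in zip(x, y) if yi == 0)
print(f"gray area runs roughly from {lo:.2f} to {hi:.2f}")
```

Shrinking the noise term shrinks the gray area, which is exactly what a better feature set would do.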

3:05

So, for this example, I'm gonna use a dataset on the creditworthiness

Â of a group of individuals.

Â And the dataset is taken from the UCI Machine Learning Repository,

Â which is an excellent repository for all kinds of machine learning and

Â prediction types of data sets.

Â So the data set classifies individuals into good credit or bad creditworthiness.

Â And they include a variety of variables to help you to

Â predict the creditworthiness of these people.

Â So the basic process that we'll go through here is we'll first split

Â the data set into a training and test set.

Â Then we'll fit the model to the training data.

Â Then we'll make predictions based on the model, but on the test data.

Â And then we'll compare the predictions on the test data to the truth

Â that we know from the test data to see how well we did.
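To make the four steps concrete, here is a toy sketch of the split/fit/predict/compare workflow. Everything in it is hypothetical: a one-feature stand-in for the credit data and a simple threshold "model", not the actual model or dataset used in the lecture.

```python
import random

random.seed(2)

# Toy stand-in for a credit data set: one feature per person plus a
# good/bad label (the real UCI data has many more features).
data = [(random.uniform(-2, 2),) for _ in range(300)]
labels = [1 if row[0] + random.gauss(0, 0.5) > 0 else 0 for row in data]

# 1. Split into training and test sets (70% / 30%).
idx = list(range(len(data)))
random.shuffle(idx)
cut = int(0.7 * len(idx))
train_idx, test_idx = idx[:cut], idx[cut:]

# 2. "Fit" a model on the training data: here, just pick the cutoff
#    on the feature that maximizes training accuracy.
def accuracy(threshold, indices):
    correct = sum((data[i][0] > threshold) == (labels[i] == 1)
                  for i in indices)
    return correct / len(indices)

candidates = [t / 10 for t in range(-20, 21)]
best = max(candidates, key=lambda t: accuracy(t, train_idx))

# 3. Predict on the test data, and 4. compare with the known truth.
test_acc = accuracy(best, test_idx)
print(f"chosen threshold {best:.1f}, test accuracy {test_acc:.2f}")
```

The point of holding out the test set is that `test_acc` estimates performance on data the model never saw during fitting.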

So here are some of the results from the model that we fit. I won't get into the details of the model, because it doesn't really matter at this point. But here's a plot that you might make. The basic idea is that on the x-axis we have our predicted probability of having good credit quality, and on the y-axis we have the actual truth of whether you have good or bad credit. Because this is the test data set, we actually know the truth, and so we can compare the truth with what our prediction says.

A couple of things you'll notice here. First of all, it doesn't quite look like that picture where I simulated the data, so it doesn't quite match our expectations of having very good separation. All along the range of the x-axis you'll see both bad and good values, so the separation isn't necessarily so good. Now, you will notice there is a big clump of points above about 0.6 on the x-axis, and they are mostly in the good credit quality category. So as the predicted probability goes up, the number of people who actually have good credit quality increases. There is at least an association there, so that's good. But one thing you'll notice is that the prediction scores were all kind of on the high end; they're all basically greater than a half, and so there isn't a lot of range there.

5:10

So ultimately, it's not clear that this prediction algorithm is particularly good. It seems to be having some difficulty finding a good combination of features that can separate people with good and bad credit risk. Something that can also be helpful is to compute a set of summary statistics about the prediction algorithm, and you can see that here.

5:28

So here at the very top is what's called a confusion matrix. It shows the number of predictions broken down by the truth, bad or good (that's called the reference), against what we predict to be bad or good. You'll notice immediately that most of the predictions are simply "good": the algorithm basically classifies everyone as having good credit quality. I guess that makes sense, because most of the individuals in this data set have good credit quality. So if you were to make a prediction, it would be easiest just to say everyone has good credit, and then your prediction would be right about 70% of the time.

6:04

You can see that the accuracy of the algorithm is about 70%. That's okay, but the problem is that the algorithm's specificity is very poor. That means that if you actually have bad credit, the probability that the algorithm will classify you as such is only about 2.6%, which is very low. So if you truly have bad credit, the algorithm will have a difficult time picking that up.
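These metrics fall directly out of the confusion matrix. The counts below are made up to roughly reproduce the numbers reported here (about 70% accuracy and 2.6% specificity, treating "good" as the positive class); they are not the actual model's output.

```python
# Hypothetical confusion-matrix counts chosen to mimic the reported
# results (~70% accuracy, ~2.6% specificity); not the real output.
#            (truly bad, truly good)
pred_bad  = (1, 23)    # row of "bad" predictions
pred_good = (37, 139)  # row of "good" predictions

tn = pred_bad[0]    # truly bad,  predicted bad
fn = pred_bad[1]    # truly good, predicted bad
fp = pred_good[0]   # truly bad,  predicted good
tp = pred_good[1]   # truly good, predicted good

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # P(predicted good | truly good)
specificity = tn / (tn + fp)   # P(predicted bad  | truly bad)
print(f"accuracy={accuracy:.3f} "
      f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```

Notice how an algorithm that predicts "good" for nearly everyone can still post a decent accuracy while its specificity collapses.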

6:28

So there are a couple of things to think about when you see the results of a prediction model like this one. The first thing to think about is prediction quality: you have to ask yourself whether the model's accuracy is good enough for your purposes. You saw the summary statistics from this particular model, which was okay: it had 70% accuracy and 2.6% specificity. Is that good enough for your purposes? Well, it depends on what your purposes are.

For example, in many medical applications where the outcome is the presence of a disease, you may want the test or the algorithm to have a very high sensitivity: if someone truly has the disease, you'll want the algorithm to pick it up, because then you can send them into treatment and get them on the road to recovery. However, if the treatment for that disease is very painful or has a lot of bad side effects, you may want to be careful about exactly who you send to treatment. In particular, you wouldn't want to send someone who didn't have the disease to a treatment that's going to be very painful and have a lot of side effects. In that case, you would want to make sure that if someone does not have the disease, the algorithm picks that up. So there are different metrics that you may want to favor over each other, depending on the kind of decision that will be made and the consequences of those decisions.

And so, for example, in a financial application like the dataset we just looked at, with good and bad credit quality, there may be asymmetric costs associated with mistaking good credit for bad versus bad credit for good. One scenario might have very little cost and another might have a very high cost. So given the outcomes of your decision, you want to think about which types of metrics, whether sensitivity, specificity, or any of the other kinds of metrics, are going to be most important to you in your setting.
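One concrete way this trade-off plays out: for a model that outputs probabilities, moving the classification cutoff trades sensitivity against specificity. The scores and labels below are made-up illustrative values, not the credit model's output.

```python
# Made-up predicted probabilities and true labels (1 = good credit).
scores = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5]
truth  = [1,    1,   1,    0,   1,    0,   1,    0,   0,    0]

def sens_spec(cutoff):
    # Classify as "good" when the score is at or above the cutoff.
    tp = sum(s >= cutoff and t == 1 for s, t in zip(scores, truth))
    fn = sum(s < cutoff and t == 1 for s, t in zip(scores, truth))
    tn = sum(s < cutoff and t == 0 for s, t in zip(scores, truth))
    fp = sum(s >= cutoff and t == 0 for s, t in zip(scores, truth))
    return tp / (tp + fn), tn / (tn + fp)

for cutoff in (0.55, 0.7, 0.85):
    sens, spec = sens_spec(cutoff)
    print(f"cutoff={cutoff:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Raising the cutoff makes it harder to be labeled "good", so specificity rises while sensitivity falls; where you sit on that curve should reflect the asymmetric costs just described.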

8:15

Every setting is going to be a little bit different, so you're not always going to focus on the same metric for every single application. Now, a hallmark of almost all prediction algorithms is tuning parameters; most algorithms have lots and lots of them. And even though they're called tuning parameters, they can often have a very big impact on the prediction quality of the algorithm, depending on how you set them.

8:41

And so you should be careful about how you set them and how you change them around. There's no prediction algorithm that I'm aware of where a single set of tuning parameters works well for all problems. So for any new data set that you bring into a prediction algorithm, you'll probably have to tune it a little bit, and that's okay. But the most important thing is to make sure you keep track of all the tuning parameters you set and the process through which you set them, because ultimately you're going to want this model to be reproducible. If you can't remember or can't reproduce the tuning parameters, you'll never be able to reproduce the algorithm itself.
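A minimal sketch of keeping that kind of record: log every parameter combination you try along with its score, so the whole tuning process can be replayed later. The `fit_and_score` function here is a hypothetical stand-in for whatever model fitting and cross-validation you actually use.

```python
import itertools
import json

# Hypothetical stand-in for fitting a model with the given tuning
# parameters and scoring it (e.g. by cross-validation).
def fit_and_score(depth, learning_rate):
    return 1.0 - abs(depth - 3) * 0.05 - abs(learning_rate - 0.1)

# The full grid of tuning-parameter settings to try.
grid = {"depth": [2, 3, 4], "learning_rate": [0.01, 0.1, 0.5]}

# Record every combination and its score, not just the winner.
log = []
for depth, lr in itertools.product(grid["depth"], grid["learning_rate"]):
    log.append({"depth": depth, "learning_rate": lr,
                "score": fit_and_score(depth, lr)})

best = max(log, key=lambda rec: rec["score"])
# Persisting the log (e.g. as JSON) is what makes the tuning reproducible.
print(json.dumps(best))
```

Saving the complete log, rather than only the final settings, also documents the process through which the parameters were chosen.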

9:21

So one last thing I want to mention is the availability of other data. Many prediction algorithms these days are very good at exploring the structure of complex data and making very good predictions, especially once you get the tuning parameters right. Now, if your model is not working very well, you may have to change to another algorithm or another procedure, because different procedures can work well in different settings, with different types of data structures and different types of data setups. So it's not necessarily true that the algorithms are exchangeable; you may want to change the algorithm. However, if you try a few algorithms and they all seem to be producing a similar quality of prediction, regardless of how well you tune them, it may be time to get more data, or other data, to help you predict the outcome. It could just be that the dataset you have only carries a limited intrinsic amount of information about the outcome you're interested in, and you have to get other data that will have better predictive power. So think about that as you're building prediction algorithms and seeing the results: there is always the possibility that you'll have to bring in additional data to improve your predictions.
