0:00

A lot of the action in machine learning has focused on which algorithms are best for extracting information and using it to predict. But it's important to step back and look at the entire prediction problem. This is a little diagram that I made to illustrate some of the key issues in building a predictor.

So suppose I want to predict for these dots whether they're red or blue. What you might do is take a big group of dots that you want to predict about, and then use probability and sampling to pick a training set. The training set will consist of some red dots and some blue dots, and you'll measure a whole bunch of characteristics of those dots. Then you'll use those characteristics to build what's called a prediction function. The prediction function will take a new dot, whose color you don't know, and use those measured characteristics to predict whether it's red or whether it's blue. Then you can go off and try to evaluate whether that prediction function works well or not.

1:03

A required component of building every machine learning algorithm is deciding which samples you're going to use to build it. But sometimes that's overlooked, because all of the action you hear about in machine learning happens later, when you're building the actual machine learning function itself.

1:19

One very high-profile example of how this can cause problems is the recent discussion about Google Flu Trends. Google Flu Trends tried to use the terms that people were typing into Google, terms like "I have a cough," to predict how often people would get the flu. In other words, what was the rate of flu in a particular part of the United States at a particular time?

1:41

They compared their algorithm to the approach taken by the United States government, which went out and actually measured how many people were getting the flu in different places in the US. They found in their original paper that the Google Flu Trends algorithm was able to very accurately represent the number of flu cases that would appear in various places in the US at any given time, and it was quite a bit faster and quite a bit less expensive to measure using Google search terms.

The problem they didn't realize at the time was that the search terms people used would change over time. People might use different terms when they were searching, and that would affect the algorithm's performance. Also, the way those terms were actually being used in the algorithm wasn't very well understood, so when the function of a particular search term changed, it could cause problems. This led to highly inaccurate results for the Google Flu Trends algorithm over time as people's internet usage changed. So this gives you an idea that choosing the right dataset, and knowing what the specific question is, are again paramount, just as they have been in other classes of the Data Science Specialization.

So here are the components of a predictor. You need to start off, as in any data science problem, with a very specific and well-defined question: what are you trying to predict, and what are you trying to predict it with?

2:56

Then you go out and collect the best input data that you can for prediction. From that data you might either use measured characteristics that you have, or use computations to build features that you think might be useful for predicting the outcome you care about. At this stage you can actually start to use the machine learning algorithms you may have read about, such as random forests or decision trees. Then you estimate the parameters of those algorithms, use those parameters to apply the algorithm to a new dataset, and finally evaluate the algorithm on that new data.

So I'm going to show you one quick little example of how this process works. This is obviously a trivialized version of what would happen with a real machine learning algorithm, but it gives you a flavor of what's going on.

You start off with a question, and in general people usually start with a quite general one. Here it is: can I automatically detect emails that are spam from those that are not? Spam emails are emails that come from companies, get sent out to thousands of people at the same time, and that you might not be interested in.

4:02

So you might want to make your question a little more concrete; you often need to when doing machine learning. The question might be: can I use quantitative characteristics of those emails to classify them as spam, or as what we're going to call ham, which is email that people would like to receive?

4:19

Once you have your question, you need to find input data. In this case there's actually a dataset that's available and already pre-processed for us in R: it's the spam dataset in the kernlab package. So we can load that dataset into R directly, and it has some information that's been collected about spam and ham emails already available to us. Now we should keep in mind that this might not necessarily be the perfect data. We don't have all of the emails that have been collected over time, and we don't have the emails that are being sent to you personally. So we need to be aware of the potential limitations of this data when we're using it to build a prediction algorithm.

4:58

Then we want to calculate some features. So imagine that you have a bunch of emails, and here's an example email that's been sent to me: "Dear Jeff, can you send me the address so I can send you the invitation? Thanks, Ben." If we want to build a prediction algorithm, we need to calculate some characteristics of these emails that we can use. One example: we can calculate the frequency with which a particular word appears. Here we're looking at the frequency of the word "you." In this case it appears twice in this email, so 2 out of 17 words, about 12% of the words in this email, are "you." We could calculate that same percentage for every single email we have, and now we have a quantitative characteristic that we can try to use to predict.

5:43

The data in the kernlab package that I've shown here are information exactly like that: for every email we have the frequency with which certain words appear. So, for example, if "credit" appears very often in an email, or "money" appears very often, you might imagine that email might be spam. As one example, we looked at the frequency of the word "your" and how often it appears in each email. I've got a density plot of that data here. On the x-axis is the frequency with which "your" appeared in the email, and on the y-axis is the density, how often that frequency appears among the emails. What you can see is that most of the emails that are spam, the ones in red, tend to have more appearances of the word "your," whereas the emails that are ham, the ones we actually want to receive, have a much higher peak down near 0. So there are very few ham emails that use "your" a large number of times.

6:49

So we can build an algorithm; in this case let's build a very, very simple one. We want to just find a cutoff, a constant C, where if the frequency of "your" is above C we predict spam, and otherwise we predict ham.

7:05

Going back to our data, we can try to figure out what the best cutoff is. Here's an example of a cutoff you could choose: if the frequency is above 0.5 we say it's spam, and if it's below 0.5 we say it's ham. We think this might work because you can see that the large spike of blue ham messages is below that cutoff, whereas one of the big spikes of the spam messages is above it. So you might imagine that will catch quite a bit of the spam.
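The cutoff rule itself is a one-liner. A minimal Python sketch (again, the course does this in R; this just illustrates the rule):

```python
def classify(freq_your, cutoff=0.5):
    # Predict spam when the frequency of "your" exceeds the cutoff;
    # otherwise predict ham
    return "spam" if freq_your > cutoff else "ham"

print(classify(1.2))  # spam
print(classify(0.1))  # ham
```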

Then we evaluate that. What we would do is calculate predictions for each of the emails: the prediction says that if the frequency of "your" is above 0.5 then you're spam, and if it's below then you're nonspam. Then we make a table of those predictions against the true labels and divide by the total number of observations. What we can say is that when you're nonspam, about 45%, 46% of the time we get you right; when you're spam, about 29% of the time we get you right. So in total we get you right about 45% plus 29%, about 75% of the time. Our prediction algorithm is about 75% accurate in this particular case.
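The evaluation step can be sketched the same way. The frequencies and labels below are made up for illustration; they are not the kernlab spam data, and the actual percentages in the lecture come from the full dataset:

```python
# Hypothetical frequencies of "your" and the true labels
freqs  = [0.0, 0.1, 0.6, 1.2, 0.3, 2.0, 0.0, 0.9]
labels = ["ham", "ham", "spam", "spam", "ham", "spam", "ham", "ham"]

# Apply the cutoff rule, then compare predictions to the truth
preds = ["spam" if f > 0.5 else "ham" for f in freqs]
accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)
print(accuracy)  # 7 of 8 correct -> 0.875
```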

So that's how we would evaluate the algorithm. Of course, this is the same dataset on which we actually estimated the prediction function, so, as we will see in later lectures, this will be an optimistic estimate of the overall error rate. So that's an overview of the basic steps in building a prediction algorithm.
