0:00
Welcome to the course Practical Machine Learning.
 This is one of the most fun classes in the Data Science specialization.
 I'm very excited about all the ways you can use data to predict, and I think it's
 one of the areas that's probably the most
 well known when you think about Data Science.
 This first lecture is going to cover the
 motivation and pre-requisites for the course and
 give you a little bit of an idea about where we're going to be going.
 So this course will cover the basic
 ideas of machine learning and prediction, and our
 goal is to be very practical and very
 hands on with our understanding of machine learning.
 And so the idea here is that we're going to cover the
 main techniques that lots of people use and that you've maybe heard about,
 linear regression and random forests and things
 like that, but we're also going to cover
 the nitty-gritty details and the
 practicalities of doing machine learning in real examples.
 And so we're going to start off with ideas like study design, so training versus
 test sets, and deciding how you actually
 build up a predictor in a real data set.
 Then we'll talk about conceptual issues like out-of-sample error
 and overfitting, so you might have heard of the fact
 that some models are maybe a little too tuned to the noise
 in the data, so they won't predict well on a new sample.
 And so we'll talk about how you prevent those sorts of problems.
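To make the training-versus-test idea concrete, here is a minimal sketch in plain Python (not the R and caret code used in this course). The simulated data, the function names, and the 70/30 split are all assumptions chosen for illustration. It compares a straight-line fit with a predictor that simply memorizes the training set, and reports the error on the training data and on held-out test data.

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_line(train):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(train):
    """1-nearest-neighbour: predict the y of the closest training x."""
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]


def mean_squared_error(predict, rows):
    """Average squared difference between predictions and observed y."""
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)


# Simulated data (an assumption for illustration): y = 2x plus Gaussian noise.
rng = random.Random(42)
data = []
for _ in range(100):
    x = rng.uniform(0, 10)
    data.append((x, 2 * x + rng.gauss(0, 1)))

train, test = split_train_test(data)
line = fit_line(train)
memorizer = fit_memorizer(train)

for name, model in [("line", line), ("memorizer", memorizer)]:
    in_sample = mean_squared_error(model, train)
    out_of_sample = mean_squared_error(model, test)
    print(f"{name}: in-sample MSE {in_sample:.3f}, out-of-sample MSE {out_of_sample:.3f}")
```

The memorizing predictor has zero error on the data it was trained on but a nonzero error on the held-out test set, which is the pattern you see when a model is tuned to the noise.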
 We'll also talk about things like ROC curves, or methods for
 evaluating predictors and deciding whether a predictor's any good or not.
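As a rough illustration of what an ROC curve measures, here is a small sketch in plain Python (again, not the course's R tooling). The labels and scores are made-up example values, and `roc_points` and `area_under_curve` are hypothetical helper names. The idea is to sweep a threshold over the predicted scores, record the false positive rate and true positive rate at each step, and summarize the curve by its area.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    labels are 1 for positive and 0 for negative; higher scores mean
    the predictor is more confident the example is positive.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example sharing this score,
        # so tied scores are handled as a single threshold.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up example: 1 = positive outcome, 0 = negative outcome.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc = area_under_curve(points)
print(points)
print(round(auc, 3))  # about 0.889: 8 of the 9 positive/negative pairs are ranked correctly
```

An area of 1.0 means every positive is scored above every negative, while an area near 0.5 is roughly what random guessing gives.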
 1:16
And we're going to be focusing a lot on the practical implementation of
 these machine learning algorithms and also these more conceptual issues in R.
 And we're going to be using the caret package for a large majority of that.
 The caret package is a nice unifying framework for
 a lot of machine learning packages that exist in R.
 Those packages were built by a lot of different people, and they
 have different parameters and different choices
 that have been made by their developers.
 And the caret package is sort of a nice unifying framework for that.
 This course does depend quite heavily on the tools
 that you've learned in The Data Scientist's Toolbox and
 in R Programming, so if you haven't taken those
 classes already, they're highly encouraged before taking this class.
 It'll also be useful if you've taken Exploratory Data
 Analysis, Reporting Data and Reproducible Research, and Regression Models.
 Those classes aren't required, but a lot of the material that
 we'll cover in this class will be related to those classes, so
 if you've seen it before, it might be a little bit
 easier on you when you go through the material in this class.
 2:14
So who predicts things?
 This is an important question,
 and I think an important motivator for this class.
 Basically, most organizations now use machine learning in some simple
 form at minimum, and often in much more complicated forms.
 So here are a couple of examples.
 Local governments might try to predict pension
 payments in the future so that they
 know whether their revenue generation mechanisms
 generate sufficient funds to cover those pension payments.
 Google might want to predict whether you're going
 to click on an ad so that they can
 show you only the ads that are most
 likely to get clicks, and so it'll increase revenue.
 Amazon and Netflix and other companies like that will show you
 one movie and they want you to buy the next movie.
 In order to do that they want to show you what you may be interested in.
 You have seen this one movie, so you might be interested in
 these other movies, so they can kind
 of keep you watching and again increase revenue.
 Insurance companies employ large groups of actuaries and statisticians
 to try to predict your risk of all sorts
 of different things, including death, so they can know
 what's the right price to set insurance premiums at.
 And then places like Johns Hopkins, where I work, will
 also want to predict who's going to succeed in their programs.
 So which students that have applied to our
 program will be most likely to be successful.
 All of these different prediction tasks are performed by a
 variety of different organizations, and they're performed at different levels.
 So some of them are very complicated.
 Predicting which ad you might click on might have a whole bunch of
 predictors, and it might be based on quite a complicated machine learning algorithm.
 In some cases, it might be a lot simpler in terms of what you're trying to predict.
 And so, either way, it's an important
 component of basically every major organization these days.
 So why would you predict things?
 Well, one is glory.
 Here's a picture of Chris Volinsky.
 He's a member of the team that won the Netflix Prize.
 The Netflix Prize was a million dollar prize that
 was given out to a team that could reduce the
 error that Netflix was making when they were trying to
 predict which new movie somebody might be interested in seeing.
 So Chris was a member of a large organization of multiple teams that
 blended their models together, and they predicted the best and won the prize.
 It's actually kind of a fascinating story about how that happened.
 And so, of course, they all got a lot of
 nerd credit and a lot of glory for winning these competitions.
 And so that's one reason you might
 be excited about being good at machine learning.
 4:38
You might also be excited because there's money in it.
 So not only through organizations where you can
 earn a lot of money if you know how
 to best predict which ads people will click
 on and so forth, but even in these competitions.
 So, for example, this is the Heritage Health Prize.
 And so this was a $3 million prize to the team that
 could best predict who would be admitted to the hospital in a year.
 5:00
And when you were trying to do this prediction,
 you would use information about the hospitalizations from
 previous years. Nobody actually won three million dollars,
 but people did win quite a bit of money
 from this prize through the
 interim prizes, and so people both made money
 through the competitions and also spun off analytics
 companies and organizations based on their performance in these competitions.
 In general, it's now kind of a sport.
 Data science is a sport, particularly in terms of prediction.
 And so this organization, Kaggle, is
 one of many organizations that host these competitions,
 where you can try to predict the outcome
 of a particular experiment, or you try to predict
 all sorts of different things. These competitions
 often run for a certain fixed period of time
 and often have a little bit of money on the line as well.
 So, it can be a lot of fun, and there's a ranking and
 a leaderboard, so you can kind of get into the fun of the competition.
 5:56
This is a little closer to my area of research, so you might also predict for
 the purposes of doing better medical
 decision making. Oncotype DX is a
 prognostic gene expression signature that can be measured
 in women who have breast cancer, and it
 can be used to predict how long they'll
 survive, given a set of conditions that they have.
 So that can be useful for medical doctors
 when making decisions about patients with breast cancer.
 6:26
This is a book that I find very useful.
 It's a little bit advanced for this class, although
 a lot of the tools are incredibly useful still.
 It's called The Elements of Statistical Learning, and
 you can get a free copy of the PDF
 from the authors' website, so that's very nice.
 But if you do really like the book, I encourage you to buy it as well.
 The authors put a lot of effort into it, and it's a great book, and so it
 could be very useful in terms of having a
 lot of information that we'll cover in this class.
 7:18
If you want some more advanced material, one place
 to go would be the Machine Learning class from Coursera.
 And what I mean by advanced material is sometimes you might be interested
 in a lot more of the mathematical detail behind how these
 algorithms work, or you might be interested in a lot more of the
 high-level machine learning algorithms
 that are on the very cutting edge.
 And I think that class would be a great place
 for you to start learning about that kind of material.
 This class, like I said, will cover the basics and will focus
 on getting you to the point from sort of zero to 60.
 And, in other words, it'll get you to the
 point where you can use machine learning tools in
 your day to day, but it won't necessarily cover
 all the top level details of machine learning algorithms.
 7:58
There is actually a huge amount of
 information available out there on machine learning.
 It's a very hot topic right now, so I listed
 here a bunch of links that send you to information at Quora,
 from Science, from MIT, CMU, and Kaggle, which will give you
 a lot of information about how to do machine learning in a variety
 of different ways, and so, if this class whets your appetite and
 gets you excited about some of these other things, that'd be great, too.