Welcome to the course Practical Machine Learning. This is one of the funnest classes in the Data Science specialization. I'm very excited about all the ways you can use date to predict, and I think it's one of the areas that's probably the most sort of well known when you think about Data Science. This first lecture is going to cover the motivation and pre-requisites for the course and give you a little bit of idea about where we're going to be going. So this course will cover the basic ideas of machine learning and prediction, and our goal is to be very practical and very hands on with our understanding of machine learning. And so the idea here is that we're going to cover the main techniques that lots of people use and that you've maybe heard about. Linear regression and random forests and things like that, but we're also going to cover sort of the nitty-gritty details and the practicalities of doing machine learning and real examples. And, so, we've got to start off with ideas of, like, study design, so training versus test sets, and deciding how do you actually build up a predictor in a real data set. Then we'll talk about conceptual issues like out of sample error and over fitting, so you might have heard of the fact that some models are maybe a little tuned to the noise and that so it won't predict well on a new sample. And so we'll talk about how do you sort of prevent those sorts of problems. We'll also talk about things like ROC curves or methods for evaluating predictors for deciding whether a predictor's any good or not. And we're going to be focusing a lot on the practical implementation of these machine learning algorithms and also these more conceptual issues in R. And we're going to be using the caret package for a large majority of that. The caret package is a nice unifying framework for a lot of machine learning packages that exist in R. Those packages were built by a lot of different people, and they have different parameters and different choices that have been made by their developers. And the caret package is sort of a nice unifying framework for that. This course does depend quite heavily on the tools that you've learned in the data scientist's toolbox and in R Programming, so if you haven't taken those classes already, they're highly encouraged before taking this class. It'll also be useful if you've taken exploratory data analysis, reporting data and reproducible research, and regression models. Those classes aren't required, but, a lot of the material that we'll cover in this class will, be related to that picture, so if you've seen it before, it might be a little bit, easier on you if you go through the material in this class. So who predicts things? This is an important question. I think an important motivator for this class. Basically, most organizations now use machine learning in some simple form or minimum and often in much more complicated forms. So here are a couple of examples. Local governments might try to predict pension payments in the future so that they know whether their revenue generation mechanisms have sufficient, funds, generated to cover those pension payments. Google might want to predict whether you're going to click on an ad so that they can show you only the ads that, are most likely to get clicks, and so it'll increase revenue. Amazon and Netflix and other companies like that will show you one movie and they want you to buy the next movie. In order to do that they want to show you what you may be interested in. Movies that you have seen this one movie, so you might be interested in these other movies, so they can kind of keep you watching and again increase revenue. Insurance companies employ large groups of actuary and statisticians to try to predict your risk of all sorts of different things, including death so they can know what's the right price to set insurance premiums at. And then please select Johns Hopkins where I work will also want to predict who's going to succeed in their programs. So which students that have applied to our program will be most likely to be successful. All of these different prediction, tasks are preformed by a variety of different organizations, and they're preformed at different levels. So some of them are very complicated. Predicting which ad you might click on might have a whole bunch of predictors, and it might be based on quiet a complicated machine learning algorithm. In some cases, it might be a lot simpler in terms of what you're trying to predict. And so, either way, it's an important component of basically every major organization these days. So why would you predict things? Well, one is glory. Here's a picture of Chris Volinsky. He's a member of the team that won the Netflix Prize. The Netflix Prize was a million dollar prize that was given out to a teen that could reduce the error that Netflix was making when they were trying to predict which new movie somebody might be interested in seeing. So Chris was a member of a large organization of multiple teams that blended their models together, and they predicted the best and won the prize. It's actually kind of a, a fascinating story about how that happened. And so, of course, they all got a lot of sort of nerd credit and a lot of glory for winning these competitions. And so that's one way, reason you might be excited about being good at machine learning. You might also be excited because you can, there's money in it. So not only through organizations where you can earn a lot of money if you know how to best predict which ads people will click on and so forth but even in these competitions. So, for example, this is the heritage health prize. And so this was a $3 million prize to the team that could best predict who would be admitted to the hospital in a year. And when you were trying to do this prediction, you would use information about the previous hospitalizations from previous years, and nobody actually won three million dollars but people did win quite a bit of money from this prize over in the sort of the interim prizes, and so people actually both make money through the competitions, but they also spun off analytics companies and organizations based on their performance in these competitions. In general, it's, it's now kind of a sport. Data science is a sport, particularly in terms of prediction. And so these are, this organization Kaggle is one of, many organizations that can host these competitions where you can try to predict, the outcome of a particular experiment, or you try to predict all sorts of different things, and these competitions often run for a certain fixed period of time and often have a lot, of, of a little bit of money on the line as well. So, it can be a lot of fun, and there's a ranking and a leaderboard, so you can kind of get into the fun of the competition. This is a little closer to my area of research, so you might also predict for, the purposes of doing sort of better medical decision making and so Oncotype DX is a prognostic gene expression signature that can be measured in women who have breast cancer, and it can be used to predict how long they'll survive, given a set of conditions that they have. So that can be useful for medical doctors when making decisions about patients with breast cancer. This is a book that I find very useful. Its a little bit advanced for this class, although a lot of the tools are incredibly useful still. It's called the Elements of Statistical Learning, and so this is a book that's actually, you can get a free copy of the PDF from the author's website, so that's very nice. But if you do really like the book, I encourage you to buy it as well. The author's put a lot of effort on, into it, and it's a great book, and so it could be very useful in terms of having a lot of information that we'll cover in this class. And then the package that I think we're going to be using by far the most in this class is the caret package. So the caret package combines a very large number of predictors that have been built in R, and it allows you to sort of set up training and test sets in a kind of a unified framework that prevents a lot of the problems that come up when building prediction models. If you want some more advanced materials, so one place to go would be the Machine Learning class from Coursera. And what I mean by advanced material is sometimes you might be interested in a lot more of the sort of mathematical detail behind how these algorithms work, or you might be interested in a lot more of the sort of high level machine learning algorithms that are on the very cutting edge. And I think this class would be a great place for you to start learning about those with some material. This class, like I said, will cover the basics and will focus on getting you to the point from sort of zero to 60. And, in other words, it'll get you to the point where you can use machine learning tools in your day to day, but it won't necessarily cover all the top level details of machine learning algorithms. There is actually a huge amount of information available out there on machine learning. It's a very hot topic right now so so I listed her a bunch of links that sends you to information at Quora. It's from Science, from MIT, CMU, and Kaggle, which will give you a lot of information about how to do machine learning in a variety of different ways, and so, if this class whets your appetite and gets you excited about some of these other things, that'd be great, too.