In this lecture, we're going to describe the various components and tasks that make up the capstone project. The purpose of the capstone project in this course is to harness the knowledge you've developed so far on machine learning, evaluation, feature design, and so on, and to apply it to four practical tasks around building a recommender system on a real-world dataset. The dataset we're going to use is the one I've used in the last few lectures: the relatively large dataset of Amazon musical instrument reviews. I've already provided a code stub which separates the data into training and test portions. Your task is to implement techniques for the various functions and try to get good performance on the test set, but you're only allowed to use the training data for training. In other words, you have to correctly implement your own training/validation/test pipeline: I've already separated the data into training and test fractions, and you have to implement a solution that only looks at the training component of that data in order to make the best possible predictions on the test set.

The four tasks you'll implement on this dataset are the following. The first is basic data processing, going all the way back to things we covered in course one, followed by a classification task and a regression task, and finally a recommender systems task, which I'll describe below.

For the data processing task, you have to implement a variety of simple functions to compute basic statistics about the data. For example, given this dataset, how many unique users are there? What is the average rating? What fraction of reviews have the verified flag set? This is simple data processing, just like we saw in course one, on this JSON-structured dataset, and each part of this task requires you to fill in a simple function stub, which has been provided, to compute each of these quantities (a rough sketch of this kind of processing appears after this overview).

The second task is a classification task. Specifically, you'll take the dataset and, for each review, estimate whether that review corresponds to a verified purchase or not. This is a binary classification task, and the main challenge we'll have to overcome is that the data are very imbalanced: the vast majority of reviews correspond to verified purchases, and only a small number are unverified. Given an imbalanced dataset, and you can refer back to the previous lectures to see why, we should use a balanced evaluation metric, and the balanced error rate is the one we'll use. That's to stop you from settling on a trivial solution which just predicts that every single review corresponds to a verified purchase; that would have high accuracy, but it would not be a useful predictor, which is why we use a balanced evaluation metric. You have to use classification techniques that lead to good performance under this metric, and in all of these tasks a simple baseline is given to you. You have to beat this simple baseline, which just performs logistic regression based on the length and the rating of the review, and nothing else. Presumably you can do something better than that baseline, or adapt the baseline to use more complex features, different features, or better regularization strategies (see the sketch below).
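To make the data processing task a little more concrete, here is a minimal sketch of the kind of statistics you might compute. The field names ('reviewerID', 'overall', 'verified') are assumptions based on the usual Amazon review JSON format; match them to whatever the provided code stub actually uses.

```python
import json

def basic_stats(path):
    """Compute a few simple statistics over the review dataset.
    Field names follow the usual Amazon review JSON format (assumed here);
    adjust them to match the provided stub."""
    users = set()
    ratings = []
    n_verified = 0
    with open(path) as f:
        for line in f:
            d = json.loads(line)              # one JSON object per line (assumed)
            users.add(d['reviewerID'])        # user identifier
            ratings.append(d['overall'])      # star rating
            n_verified += int(d.get('verified', False))
    return {
        'unique_users': len(users),
        'average_rating': sum(ratings) / len(ratings),
        'fraction_verified': n_verified / len(ratings),
    }
```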
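And for the classification task, here is a rough sketch of a baseline-style pipeline: logistic regression on the review length and rating, evaluated with the balanced error rate. It uses scikit-learn for illustration; the feature function and field names are assumptions, and this is roughly the kind of baseline you're asked to beat, not a finished solution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

def feature(d):
    # Baseline-style features: review length and star rating only.
    # Field names ('reviewText', 'overall') are assumed; match them to the stub.
    return [len(d.get('reviewText', '')), d['overall']]

def train_and_evaluate(train_data, test_data):
    X_train = np.array([feature(d) for d in train_data])
    y_train = np.array([d['verified'] for d in train_data])
    X_test = np.array([feature(d) for d in test_data])
    y_test = np.array([d['verified'] for d in test_data])

    # class_weight='balanced' re-weights the rare (unverified) class so the
    # model isn't rewarded for always predicting "verified".
    clf = LogisticRegression(class_weight='balanced', C=1.0, max_iter=1000)
    clf.fit(X_train, y_train)

    # Balanced error rate = 1 - balanced accuracy = average of FPR and FNR.
    ber = 1 - balanced_accuracy_score(y_test, clf.predict(X_test))
    return clf, ber
```

A trivial "always verified" classifier would score a balanced error rate near 0.5 under this metric, which is exactly why the balanced error rate is used instead of plain accuracy.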
In each case, your goal is really just to come up with a solution, based on the knowledge you've developed, that outperforms the simple baseline, on the test data of course, not just on the training data.

The third task is a regression task. For this task, we'll use word features, and other features if you would like, to perform sentiment analysis. So this is a regression task where we're trying to predict a rating, or the sentiment, based on the text of a review. The main challenge here is going to be avoiding overfitting, since we're now using high-dimensional features, namely features based on different words. You'll have to carefully implement that training/validation/test pipeline to get your regularization parameters working well. You'll also have to do some careful feature engineering: what choices do you make when preprocessing the dataset, do you keep or remove punctuation and capitalization, and how many words do you use? Your goal, again, is to beat a simple baseline. The baseline here is just a straightforward model which uses only the 100 most popular words within a regression framework (a sketch of this kind of pipeline appears at the end of this lecture).

Finally, for the recommendation task, task four, you have to predict the ratings that users will give to items. This is a recommender systems task: given a user and an item, predict the rating. The main challenge is going to be correctly implementing one of these more complex models. You can roughly follow the code I've given in the previous lectures, but you'll have to modify it a little to improve its performance. The real challenges here will be being careful about overfitting and intelligently initializing your model, and again you'll have to outperform a simple baseline, which is the bias-only model I presented a few lectures ago (also sketched at the end of this lecture).

Evaluating this capstone project is fairly straightforward: you're given a training set and a test set, you're only allowed to use the training set to tune your model, but you have to get good performance on the test set in terms of the balanced error rate, the mean squared error, or something like that. In each of these tasks, you'll have to avoid overfitting. You'll have to build a model that actually translates well from the training set to the test set, and really you'll just have to come up with a somewhat sophisticated model that beats the simple baselines that have been provided. This will require leveraging various ideas we've covered throughout the specialization: how do you correctly build training, validation, and test pipelines? Have you regularized properly to avoid overfitting? How do you engineer features in an effective way, and can you use the appropriate technique for each task?

So that's about it. This lecture was just to introduce the capstone project for course four, which consists of solving various real-world recommender system tasks on a relatively large dataset.
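To illustrate task three, here is a minimal sketch of a bag-of-words regression pipeline along the lines of the baseline described above: a vocabulary of the most popular training words, ridge regression, and a validation split to pick the regularization strength. The field names and the preprocessing choices (lower-casing, stripping punctuation) are assumptions you should revisit as part of your feature engineering.

```python
from collections import Counter
import numpy as np
from sklearn.linear_model import Ridge

def tokenize(d):
    # Assumed preprocessing: lower-case and strip punctuation; both are choices to revisit.
    text = ''.join(c for c in d.get('reviewText', '').lower() if c.isalnum() or c.isspace())
    return text.split()

def build_vocabulary(train_data, size=100):
    # Count word frequencies on the *training* text only.
    counts = Counter()
    for d in train_data:
        counts.update(tokenize(d))
    return [w for w, _ in counts.most_common(size)]

def bag_of_words(data, vocab):
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(data), len(vocab) + 1))
    X[:, -1] = 1  # constant offset term
    for row, d in enumerate(data):
        for w in tokenize(d):
            if w in index:
                X[row, index[w]] += 1
    return X

def fit_sentiment_model(train_data, valid_data, vocab_size=100):
    # Choose the regularization strength on a validation split, never on the test set.
    vocab = build_vocabulary(train_data, vocab_size)
    X_tr, y_tr = bag_of_words(train_data, vocab), [d['overall'] for d in train_data]
    X_va, y_va = bag_of_words(valid_data, vocab), [d['overall'] for d in valid_data]
    best = None
    for alpha in [0.01, 0.1, 1, 10, 100]:
        model = Ridge(alpha=alpha).fit(X_tr, y_tr)
        mse = np.mean((model.predict(X_va) - np.array(y_va)) ** 2)
        if best is None or mse < best[0]:
            best = (mse, alpha, model, vocab)
    return best  # (validation MSE, chosen alpha, fitted model, vocabulary)
```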
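And for task four, here is a rough sketch of the kind of bias-only model that serves as the baseline, fit by alternating closed-form updates with a regularization strength lam. The field names ('reviewerID', 'asin', 'overall'), the zero initialization of the biases, and the fixed iteration count are all assumptions and tuning choices, not a definitive implementation.

```python
from collections import defaultdict

def fit_bias_model(train_data, lam=1.0, iterations=50):
    """Predict ratings as alpha + beta_u + beta_i, fit by alternating
    regularized updates; a sketch of the bias-only baseline."""
    ratings_per_user = defaultdict(list)
    ratings_per_item = defaultdict(list)
    for d in train_data:
        ratings_per_user[d['reviewerID']].append((d['asin'], d['overall']))
        ratings_per_item[d['asin']].append((d['reviewerID'], d['overall']))

    # Initialize alpha to the global mean and all biases to zero (a choice to tune).
    alpha = sum(d['overall'] for d in train_data) / len(train_data)
    beta_u = {u: 0.0 for u in ratings_per_user}
    beta_i = {i: 0.0 for i in ratings_per_item}

    for _ in range(iterations):
        alpha = sum(d['overall'] - beta_u[d['reviewerID']] - beta_i[d['asin']]
                    for d in train_data) / len(train_data)
        for u, pairs in ratings_per_user.items():
            beta_u[u] = sum(r - alpha - beta_i[i] for i, r in pairs) / (lam + len(pairs))
        for i, pairs in ratings_per_item.items():
            beta_i[i] = sum(r - alpha - beta_u[u] for u, r in pairs) / (lam + len(pairs))

    def predict(u, i):
        # Unseen users or items fall back to a bias of zero.
        return alpha + beta_u.get(u, 0.0) + beta_i.get(i, 0.0)
    return predict
```

Beating this baseline typically comes down to tuning lam on a validation split and extending the model carefully, rather than making it more complicated than the data can support.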