Hello, my name is Pavel, and now we will talk about how to use Spark MLlib. We will take our first step in studying machine learning on Big Data. Let's go.

To begin with, we will build a linear regression. In this lesson you will learn how to prepare data for Spark MLlib tasks, how to make predictions using linear regression, and how to estimate the accuracy of those predictions.

We need some kind of dataset in order to build our prediction. I generally like to ride a bike. Every day I go to work and back by bike. So I was glad to find a dataset about bike rentals, and let's try to analyze it. I put it in HDFS, so you can load it into a Spark DataFrame and see what is there. There are 16 columns in it, which store the number of rented bicycles on different days.

You can divide this dataset into three parts. Firstly, there is the date and everything that can be extracted from it. Some good data engineers did a good thing for us: they created a lot of features from the date, such as the season, the year, the month, whether the day was a working day or a weekend, and which day of the week it was.

Secondly, there is the data about the weather. Weathersit describes what the weather was like that day. One means sunny, clear, good. And four means rain, snow, snow storm, typhoon, tsunami, and apocalypse. Temp is the temperature. Anticipating your question about the temperature scale, Celsius or Kelvin, I want to ask you what rental service we're dealing with. The one on the moon? Of course not. The temperature is normalized. Zero means -8 degrees and one means +39 degrees Celsius. All the other readings fall in between. Next is the so-called apparent (feels-like) temperature, which is a very interesting thing. This value is a combination of temperature, humidity, and wind, that is, how people perceive the temperature. It varies from -16 at zero to +50 at one. The next value is humidity. Zero is usually somewhere in the Sahara Desert, and one is 100% humidity after the rain.
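The normalization of the temperature columns described above is ordinary min-max scaling, so it can be inverted to recover degrees. The sketch below is plain Python rather than Spark code. The function name `denormalize` is hypothetical, and the bounds (-8 and +39 for temperature, -16 and +50 for feels-like temperature, in degrees Celsius) come from the public description of the UCI Bike Sharing dataset, on the assumption that this is the dataset being used here.

```python
def denormalize(value, lower, upper):
    """Invert min-max scaling: map a value in [0, 1] back to [lower, upper]."""
    return lower + value * (upper - lower)


# Assumed bounds in degrees Celsius (see the note above).
TEMP_BOUNDS = (-8.0, 39.0)
FEELS_LIKE_BOUNDS = (-16.0, 50.0)

for normalized in (0.0, 0.5, 1.0):
    temp_c = denormalize(normalized, *TEMP_BOUNDS)
    feels_c = denormalize(normalized, *FEELS_LIKE_BOUNDS)
    print(f"{normalized:.1f} -> temp {temp_c:.1f} C, feels like {feels_c:.1f} C")
```

Passing the bounds as arguments keeps the function usable if the actual limits of your copy of the data differ.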
And the wind speed, where zero means calm, when no boat can go under sail, and one is the strongest wind that was recorded during those two years of measurements.

And three columns of the objective function hold the rental statistics. How many casual passers-by rented bicycles, like this: wow, bicycles, I should rent one of them. How many already registered customers used bikes. And how many people rode the bikes that day in total. I think you shouldn't bother with the separation of registered and unregistered users. Let's predict only the last column.

Let's look at the schema of the data we have loaded. We took it from a CSV file. A CSV file doesn't contain information about the format of each column, so they all come into Spark and are stored in the form of strings. So integers, floating-point numbers, dates, everything is kept in the form of strings. Well, let's fix it. First, you should throw away all unnecessary fields, such as the record number, the date, and the number of bicycles rented by casual users. Secondly, you need to convert all the integer fields to integers. Thirdly, convert all the floating-point numbers to double. And fourthly, I renamed the cnt column, the number of bicycles, to the standard name, label. Let's see what we have got. As you can see, the number of columns has become much smaller. I put the column we want to learn to predict last.

We will predict it with a linear regression. Here is its simplest form. It looks like this: a function of one variable. In our case, the linear regression will depend on 12 variables, but the principle is the same. The prediction at the output is simply a linear combination of the input features with some weights.

Before we begin to train the regression, we need to divide the data into train and test sets, so that all our quality measurements are done on the test set. Otherwise we could build an overly complicated function whose predictions on the training examples look too good to be true.
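Going back to the form of the model for a moment: to make the idea of a linear combination concrete, here is a minimal sketch in plain Python (not Spark MLlib). It fits a one-variable regression by ordinary least squares and then evaluates a multi-feature prediction as weights times features plus an intercept. The function names, the toy data, and the weights are made up for illustration.

```python
def fit_one_variable(xs, ys):
    """Ordinary least squares for y = weight * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    weight = covariance / variance
    intercept = mean_y - weight * mean_x
    return weight, intercept


def predict(weights, intercept, features):
    """Linear combination of the input features with the given weights."""
    return intercept + sum(w * f for w, f in zip(weights, features))


# Toy data lying exactly on the line y = 2x + 1.
weight, intercept = fit_one_variable([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(weight, intercept)

# The same principle with several features (hypothetical weights).
print(predict([2.0, -1.0, 0.5], 10.0, [1.0, 4.0, 2.0]))
```

With more features, only the length of the weight vector changes; the prediction is still one weighted sum.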
That way we could train it to 100% quality on the training data and still not notice its mistakes. We divide our dataset into train and test sets in a proportion of 70 to 30: 70% goes to train and 30% goes to test.

And now we train our linear regression. At the input, it should receive a certain vector of values. The output is a number. Where do we get this vector? The vector that is fed to the input must be in a special column of the DataFrame that has a vector type. To assemble it from the individual column values, we need to use VectorAssembler, in which we specify the input columns and the output column, and then apply it to our train sample. And that's what we have got. We have those 12 columns that were there at the beginning, and we have added to them one column with our feature vector, and now we can use it. Let's throw away all the other columns because they are no longer needed, and leave only the features and the label. Create a linear regression object and train it on your training sample.

So, now you need to apply it to your sample. Let's see what happens. By default, the linear regression predictions are written to the same DataFrame as the input, in a column called prediction. If you compare the labels to the predictions, you will see that for many labels you are not far off. For the first example, we are off by about 50, and for a later one we are also off, by 15. That is, sometimes we predict the labels very well. Now, to maintain experimental integrity, let's see how accurately we predict on the test sample. You can see that there are values that are close, and there are values where you are off by as much as 500 bicycles. It's not good.

We can calculate the quality of the regression's predictions using a special evaluator. First, with the help of this evaluator we can calculate R squared, the so-called coefficient of determination, which describes how well our linear regression fits the examples.
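For reference, the coefficient of determination compares the model's squared errors with those of a baseline that always predicts the mean label. The plain-Python sketch below computes it as 1 - SS_res / SS_tot. The function name and the toy numbers are illustrative, and this is not the Spark evaluator itself.

```python
def r_squared(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    # Squared errors of the model.
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    # Squared errors of the baseline that always predicts the mean.
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


labels = [100.0, 200.0, 300.0, 400.0]
print(r_squared(labels, labels))                         # perfect predictions
print(r_squared(labels, [250.0] * 4))                    # always the mean label
print(r_squared(labels, [120.0, 180.0, 330.0, 370.0]))   # somewhere in between
```

A value near one means the predictions explain almost all of the variation in the labels; a value near zero means the model is no better than predicting the mean.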
So the coefficient of determination is 0.76. Is it good or bad? Well, let's take a look. In this picture, there are different distributions fitted by a linear regression, with a different R squared value given for each. The leftmost has an R squared of 0.99, almost one, where nearly all points fall directly on our straight line. Our R squared is 0.76, so most likely our situation is similar to the picture in the middle.

Another way to estimate the error is to measure the mean error, for example, the root mean square error. It is the square root of the sum of squares of all the deviations divided by N. The root mean square error is 910 bicycles, that is, you are off by almost a whole 1,000 bikes. Wow, that's a lot.

Well, let's try to construct a useful metric that your potential customers will understand. For example, suppose they are ready to keep an extra 300 bikes in their warehouse just in case. In how many of your predictions was the error less than 300 bicycles? For this, you should subtract the real number of bicycles from the predicted one and take the absolute value of the residual. In all cases where the error was less than 300 bikes, put one; in the remaining cases, zero. And now just calculate the average of these ones and zeros. We have predicted correctly in 32% of cases, where a correct prediction means being within 300 bikes. This is a business-friendly metric, and customers can decide whether to use this prediction in their work or not.

So let's sum up. In this lesson, you have learned how to prepare data for Spark MLlib tasks, make predictions using linear regression, and evaluate the quality of the predictions. In the next lesson, let's pay more attention to the architecture of the Spark MLlib library: how it's arranged, what parts it has, and how it can be used. Stay with us. It will be interesting.
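As a recap of the two error measures from this lesson, here is a minimal plain-Python sketch of the root mean square error and of the share of predictions that land within a given tolerance. The function names and the toy numbers are hypothetical; in the lesson the same quantities are computed on a Spark DataFrame.

```python
import math


def rmse(labels, predictions):
    """Square root of the mean of the squared deviations."""
    n = len(labels)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, predictions)) / n)


def share_within(labels, predictions, tolerance):
    """Average of ones (absolute error below tolerance) and zeros (otherwise)."""
    hits = [1 if abs(p - y) < tolerance else 0 for y, p in zip(labels, predictions)]
    return sum(hits) / len(hits)


labels = [1000.0, 2000.0, 3000.0, 4000.0]
predictions = [1100.0, 2500.0, 2900.0, 3000.0]

print(rmse(labels, predictions))
print(share_within(labels, predictions, 300))
```

The tolerance is a parameter, so the same function answers the question for a customer who keeps 300 spare bikes or any other number.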