0:00

Hello, my name is Pavel, and today we will talk about how to use Spark MLlib.

We will take our first step in studying machine learning on Big Data.

Let's go.

To begin with, we will build a linear regression.

In this lesson you will learn how to prepare data for Spark MLlib tasks, how to make predictions using linear regression, and how to estimate the accuracy of those predictions.

You need some kind of dataset in order to build a prediction.

I really like riding my bike: every day I ride to work and back.

So I was glad to find a dataset about bike rentals; let's try to analyze it.

I put it in HDFS, so you can load it into a Spark DataFrame and see what is inside.

It has 16 columns describing how many bicycles were rented on different days.

You can divide these columns into three groups.

First, there is the date and everything that can be extracted from it. Some good data engineers have done us a favor: they derived a lot of features from the date, such as the season, the year, the month, whether the day was a weekday or a weekend, and the day of the month.

Second, there is the data about the weather.

Weathersit describes what the weather was like that day: one means sunny, clear, good, and four means rain, snow, snowstorm, typhoon, tsunami, and apocalypse.

Temperature means the temperature.

Anticipating your question about the scale, Celsius or Kelvin: what rental service are we dealing with, one on the Moon? Of course not. The temperature is simply normalized: zero corresponds to -8 degrees Celsius, one corresponds to +39 degrees, and all the other values are intermediate.

The next one is the so-called feels-like temperature, which is a very interesting thing. This value combines temperature, humidity, and wind into how people actually perceive the weather. It ranges from -16 degrees at zero to +50 degrees at one.

Then comes humidity: zero is somewhere in the Sahara Desert, and one is 100% humidity right after rain.

And finally the wind speed: zero means a dead calm in which no boat could sail, and one is the most terrible windstorm ever recorded during those two years of measurements.

The last three columns are the target values, the rental statistics: how many casual passers-by rented bicycles (wow, bicycles, I should rent one of them), how many registered customers used bikes, and how many people rode bikes that day in total.

I think you shouldn't bother with separating registered and unregistered users, so let's predict only the last column.

Let's look at the schema of the data we have loaded.

We took it from a CSV file, and a CSV file carries no information about the type of each column, so everything arrives as strings: integers, floating-point numbers, dates, all of it is stored as strings.

Well, let's fix that.

First, throw away all the unnecessary fields: the record number, the date, and the separate count of casual rentals.

Second, cast all the integer fields to integers.

Third, convert all the floating-point numbers to double.

And fourth, rename the cnt column, the total number of rented bicycles, to the standard name label.

Let's see what we've got.

As you can see, the number of columns has become much smaller.

The last column, label, is the one we want to learn to predict.

We will predict it with a linear regression. Here is its simplest form: a function of one variable. In our case the linear regression will depend on 12 variables, but the principle is the same: the prediction at the output is simply a linear combination of the input features with some weights.
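That rule fits in a few lines of plain Python (the weights, bias, and features below are made-up numbers for illustration, not a trained model):

```python
# A linear regression prediction is just a weighted sum of the features plus a bias.
def predict(weights, bias, features):
    return bias + sum(w * x for w, x in zip(weights, features))

# Made-up weights and a made-up feature vector, purely for illustration.
weights = [2.0, -1.0, 0.5]
bias = 10.0
features = [3.0, 4.0, 2.0]
print(predict(weights, bias, features))  # 10 + 6 - 4 + 1 = 13.0
```

Training a regression means finding the weights and bias that make such predictions as close as possible to the labels.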

Before we begin to train the regression, we need to divide the data into a train set and a test set, so that all our quality measurements are done on the test set.

Otherwise we could fit some overly complicated function that predicts the training examples too well to be true: it could reach 100% quality on the data it was trained on and still miss the real error.

We divide our dataset into train and test in the proportion 70 to 30: 70% goes to the train set and 30% to the test set.

Now let's train our linear regression.

At the input it receives a certain vector of values; the output is a number.

Where do we get this vector? The vector fed to the input must sit in a special column of the DataFrame, a column of vector type. To assemble it from the individual feature values, we use a VectorAssembler, specifying the input columns and the output column, and then apply it to our train sample.

And here is what we've got: the same 12 columns that were there at the beginning, plus one new column holding our feature vector, ready to use.

Let's drop all the other columns, because we no longer need them, and keep only the features and the label.

Create a linear regression object and train it on your training sample.

Now apply it to the sample and let's see what happens.

By default, the linear regression writes its predictions into the same DataFrame as the input, in a column called prediction.

If you compare the labels to the predictions, you will see that for many rows we are not far off: in the first example the error is about 50, and in a later one only 15. That is, sometimes we predict the labels very well.

Now, to maintain experimental integrity, let's see how accurately we predict on the test sample.

Some values are close, but there are also values where we are off by as many as 500 bicycles. That's not good.

We can measure the quality of the regression's predictions using a special evaluator.

First, with its help we can calculate R squared, the so-called coefficient of determination, which describes how well our linear regression passes through all of the examples.

Our coefficient of determination is 0.76. Is that good or bad?

Well, let's take a look.

In this picture there are different distributions fitted by a linear regression, each with its own estimate of R squared.

The leftmost has an R squared of 0.99, almost one: nearly all the points fall directly on the straight line.

Our value is 0.76, so most likely our situation resembles the picture in the middle.

Another way to estimate the error is to measure a mean error, for example the root mean square error: the square root of the sum of the squared deviations divided by N.

Our root mean square error is 910 bicycles; that is, we can be off by almost a whole thousand bikes. Wow, that's a lot.

Well, let's try to construct a metric that your potential customers will actually understand.

For example, suppose they are ready to keep 300 extra bikes in the warehouse just in case. In what share of predictions was the error less than 300 bicycles?

To find out, subtract the real number of bicycles from the predicted one and take the absolute value of the residual. In all cases where the mistake was less than 300 bikes, put a one; in the remaining cases, a zero. Now just calculate the average of these ones and zeros.

We predicted correctly, meaning within 300 bikes, in 72% of cases.

This is a business-friendly metric, and customers can decide whether to use this prediction in their work or not.

So let's sum up. In this lesson you have learned how to prepare data for Spark MLlib tasks, make predictions using linear regression, and evaluate the quality of those predictions.

In the next lesson we will pay more attention to the architecture of the Spark MLlib library: how it is arranged, what parts it has, and how it can be used.

Stay with us. It will be interesting.
