Let's continue learning. In this lesson, you'll see how a data scientist builds a model in Python using a library called scikit-learn. In the next lesson, you will be able to invoke the machine learning model your data scientist built using Spark SQL. By the end of this lesson, you will be able to identify the key steps in a data scientist's workflow.

I hope you're excited to start building a machine learning model now that you've learned the fundamentals of machine learning and linear regression. In this notebook, we will build a machine learning model using scikit-learn, a popular machine learning library in Python, to predict the response time to an incident given a number of features. Let's run our classroom setup.

Next, we're going to load some data in. We're going to use a table called fireCallsClean, which contains the subset of our data that we'll use to train our model.

Now we need to convert our response and received date-times into timestamp type. Because we want to predict the time delay between the response time and the received time, we have to get these columns out of string type and into timestamp type so we can compute the difference down below. So we're going to create a view called time, which contains all of the original columns from fireCallsClean plus two additional columns, ResponseTime and ReceivedTime, holding the Unix timestamps of the response and received date-times, respectively.

Now that we have those two fields as timestamps, we can do some arithmetic on them. We create a view called timeDelay, which contains all of the columns from the time view created above, plus an additional column called timeDelay: the difference between ResponseTime and ReceivedTime divided by 60, because those timestamps are in seconds and we want minutes. Let's take a look at what this looks like.

We also want to see if we have any corrupt records, so we check whether there are any records where the time delay is less than zero. That's weird: we have a time delay of almost minus 60 minutes for a structure fire. We think there could be some data corruption issues that the fire department might want to look into. For our model, we're going to filter out any records with a time delay less than zero. In addition, we're only going to keep records with a time delay of less than 15 minutes, because we don't want outliers influencing our model too much. The discussion of how to handle outliers is out of the scope of this course.

Here we can see that a call of type traffic collision in fire prevention district 10, in Bayview-Hunters Point, with one alarm, original priority two, and unit type medic has a time delay of 2.3 minutes, whereas this medical incident in Portola has a time delay of 7.75 minutes. We want to be able to predict this time delay given the other features. To do that, we're going to convert our Spark SQL table into a Pandas DataFrame. We're also going to enable Apache Arrow execution for a faster transfer from our Spark DataFrame to our Pandas DataFrame. As you can see here, we can execute a Spark SQL query and then convert the result to a Pandas DataFrame by calling .toPandas(). This will pull all of the data distributed across our cluster into the driver. A sketch of these steps follows below.
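Here is a minimal sketch of what those two views might look like. The exact source column names (Response_DtTm, Received_DtTm) and the timestamp format string are assumptions about the fireCallsClean schema, not confirmed by the lesson:

```python
# Sketch only: the column names `Response_DtTm` / `Received_DtTm` and the
# date format string are assumed, not taken from the actual table schema.
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW time AS
    SELECT *,
           unix_timestamp(Response_DtTm, 'MM/dd/yyyy hh:mm:ss a') AS ResponseTime,
           unix_timestamp(Received_DtTm, 'MM/dd/yyyy hh:mm:ss a') AS ReceivedTime
    FROM fireCallsClean
""")

spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW timeDelay AS
    SELECT *, (ResponseTime - ReceivedTime) / 60 AS timeDelay  -- seconds -> minutes
    FROM time
""")
```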
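And a sketch of the corrupt-record check, the filtering, and the transfer to Pandas. The Arrow configuration key shown is the Spark 2.x name; newer Spark versions use spark.sql.execution.arrow.pyspark.enabled instead:

```python
# Inspect the suspicious records before filtering them out
# (display() is the Databricks notebook helper for rendering results).
display(spark.sql("SELECT * FROM timeDelay WHERE timeDelay < 0"))

# Enable Arrow for a faster Spark -> Pandas transfer.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Keep only plausible delays (0 to 15 minutes) and collect to the driver.
pdf = spark.sql("""
    SELECT * FROM timeDelay
    WHERE timeDelay >= 0 AND timeDelay < 15
""").toPandas()
```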
You want to be very careful when you call this .toPandas() command, because if you have a hundred gigabytes of data, you can't pull all of it into the driver. In this case, we know we have a relatively small dataset, so this is safe.

Now, let's visualize the distribution of our time delay by plotting a histogram. You'll notice that most of our time delays are between roughly two and six minutes, which is pretty good; you don't want to be waiting much longer than 10 minutes for the fire department to respond to your incident.

For this model, we're going to use an 80/20 train-test split: we train our model on 80% of our data and test it on the held-out 20%. Scikit-learn's train_test_split expects the features X and the label y as two separate objects. So we take our Pandas DataFrame and drop the timeDelay column, the label we're trying to predict, to form X; our y then contains just the timeDelay values. We can split X and y into X_train, X_test, y_train, and y_test, with a test size of 0.2 and a random state of 42. The random state sets a seed for reproducibility: if we rerun this notebook on our cluster, the same data points will go into the training and test sets, giving us reproducible machine learning models. Let's go ahead and run this code.

Before we get started building our linear regression model, let's establish our baseline RMSE on the test dataset by always predicting the average value of y in our training dataset. Here we create a NumPy array with the same dimensions as y_test, but filled with the average y value from our training dataset. Then we can use scikit-learn's mean_squared_error to calculate the MSE between y_test and these average-delay predictions, and take the square root to get the corresponding RMSE.

Great. Now that we have established a baseline, we can use scikit-learn's Pipeline API to build a linear regression model. Our pipeline will have two steps. The first step is a one-hot encoder, which converts all of our categorical features into numeric features by creating a dummy column for each value in that category. What that means is: if I have a categorical column called animal, say, with the values dog, cat, and bear, I can't pass those directly into a machine learning algorithm; it doesn't understand the string data type. Instead, I have to convert them into numeric form, and the best way to do that for linear regression is to create a column for each distinct value in the categorical feature. In this case, dog gets column one, cat gets column two, and bear gets column three; each row has a one in the column for its own category and zeros in the others. This is a very standard technique known as one-hot encoding, and if you've used Pandas before, you might also have seen it called creating dummy variables.

The next step in our pipeline is a linear regression model that finds the line of best fit for our training data. So we create a one-hot encoder that handles any unseen categories we encounter in our test dataset by simply ignoring them, and then a linear regression model that adds an intercept and normalizes all of our data so it's on the same scale. Sketches of both pieces follow below.
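First, a minimal sketch of the split and the baseline, assuming the Pandas DataFrame from the previous step is named pdf and the label column is timeDelay:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Separate the features from the label we want to predict.
X = pdf.drop("timeDelay", axis=1)
y = pdf["timeDelay"]

# 80/20 split; random_state=42 seeds the shuffle so reruns are reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: always predict the training-set average delay.
avg_delay = np.full(y_test.shape, y_train.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_test, avg_delay))
print(f"Baseline RMSE: {baseline_rmse:.2f}")
```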
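And a sketch of the two-step pipeline. Note that normalize= was a LinearRegression parameter in older scikit-learn releases (it was removed in version 1.2, where you would add a scaling step to the pipeline instead), so this follows the older API the lesson appears to be using:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

steps = [
    # Ignore categories that appear only in the test set.
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
    # fit_intercept adds the intercept; normalize puts features on one scale.
    ("lr", LinearRegression(fit_intercept=True, normalize=True)),
]
pipeline = Pipeline(steps)
```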
We can pass those steps into our pipeline, fit it to X_train and y_train, and then generate predictions on X_test.

If you're still a little unsure about what one-hot encoding actually did, we can grab the first step of our pipeline (step zero, since Python uses zero-based indexing), which is our one-hot encoder, and ask it for its feature names. We can see that it created a separate column for each of the different call types, and likewise for the other categorical variables.

Now let's evaluate the RMSE of the linear regression model we just built. Here we can see our RMSE is 1.72, a little better than our baseline of always predicting the average.

Next, we're going to save this model using an open-source library called MLflow. We're using MLflow because it has very nice functionality for automatically generating user-defined functions for us, which we will use in the next notebook. Here we give it a model path to save our model to, clear out anything in that directory if it already exists, and save our model there. A recap of these final steps in code follows at the end of this section.

In the next lesson, Conor will show us how we can load in this model and apply it in SQL to generate very interesting predictions, such as which neighborhood is predicted to have the longest time delay.
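To recap the modeling steps, here is a minimal sketch of fitting the pipeline, inspecting the encoder, and computing the RMSE. The get_feature_names() method shown here was renamed get_feature_names_out() in newer scikit-learn releases:

```python
# Fit the full pipeline and predict on the held-out test set.
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Step 0 of the pipeline is the one-hot encoder; list its generated columns.
ohe = pipeline.steps[0][1]
print(ohe.get_feature_names())

lr_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Linear regression RMSE: {lr_rmse:.2f}")
```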
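And a sketch of the MLflow save, where model_path is a hypothetical location, not the path used in the actual notebook:

```python
import shutil
import mlflow.sklearn

# Hypothetical save location; clear out any previous copy before saving.
model_path = "/tmp/fire_calls_lr_pipeline"
shutil.rmtree(model_path, ignore_errors=True)

# Persist the fitted pipeline so the next notebook can load it back
# (e.g., via mlflow.pyfunc) and apply it as a UDF in Spark SQL.
mlflow.sklearn.save_model(pipeline, model_path)
```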