Up until this point, we've focused only on using Spark with SQL. However, there are many other applications of Spark, including graph processing, machine learning, and streaming. Now, we want to wrap up this course by giving you a sense of how you can continue to deepen your understanding of data. To this end, we'll take a look at some of the techniques in data science. In particular, we'll look at machine learning: as our understanding of data becomes more mature, we need more sophisticated statistical methods to make use of all of that information, and that's where machine learning comes in. I hope that at the end of this lesson, you'll be motivated to continue your data journey by incorporating machine learning.

Let's start by looking at some of the applications of machine learning. Machine learning represents a more nuanced understanding of data that goes beyond simple summary statistics and reporting to answer more complex and prescriptive questions. Let's look at some of the machine learning use cases we've seen using Spark. A very common application is fraud detection, or distinguishing real users from bad actors on, let's say, a website. Many major financial institutions use Spark's big data processing capabilities to predict financial fraud in real time. One use case that I've spent quite a bit of time working on is natural language processing with Spark, or analyzing natural language using statistical models. I've used this for classifying medical records, chatbots, sentiment analysis, and many other applications. Machine learning, and deep learning in particular, can also be used for image recognition and applied to domains such as medical image processing and self-driving cars. If you're running a business, you might be interested in financial forecasting to predict next year's revenue based on past trends.

Customer churn is when a customer leaves your platform. Perhaps you're an online retailer and you're concerned about customers not returning to purchase widgets from your online store. If you know a customer is likely to churn, you might want to intervene by sending them an offer that could attract them back to your company. How can we approach this churn problem as a data problem? Well, first we would want to define what churn is. We might call it churn when a customer doesn't buy anything from your website in a given month or a given year, or when a customer doesn't even visit your website for a given period of time. The choice of definition depends largely on the specifics of your business. Once you've created a definition for churn, you can start to ask some other questions. First, what do you think is a good way of predicting churn, given some records of customers churning and not churning? You might expect that the more often a user visits your website, the less likely they are to churn. In that case, you'd want to include this information as you try to predict whether or not the user will churn. You might also expect that the longer they've been a customer, the less likely they are to churn; you can include this in your model as well. The work you've done manipulating data with SQL up to this point gives you the skills necessary to create these variables, which will be the inputs to our machine learning model. Suffice it to say for the moment that the first step in machine learning is to translate business problems into data problems. In the case of churn analysis, the data problem we're asking is whether we can predict future user activity based on past user activity.
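To make that concrete, here is a minimal sketch of how you might build those churn variables with Spark SQL. This is not course code: the `purchases` table and its columns are hypothetical, and the churn definition used here (no purchase in the past month) is just one of the choices discussed above.

```python
# A minimal sketch, not course code: deriving churn features with Spark SQL.
# Assumes a hypothetical `purchases` table with customer_id and purchase_date.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

churn_features = spark.sql("""
    SELECT customer_id,
           -- how often the customer bought something in the past month
           SUM(CASE WHEN purchase_date >= ADD_MONTHS(CURRENT_DATE(), -1)
                    THEN 1 ELSE 0 END)                  AS purchases_last_month,
           -- how long they have been a customer
           DATEDIFF(CURRENT_DATE(), MIN(purchase_date)) AS days_as_customer,
           -- label: churned if no purchase at all in the past month
           CASE WHEN MAX(purchase_date) < ADD_MONTHS(CURRENT_DATE(), -1)
                THEN 1 ELSE 0 END                       AS churned
    FROM purchases
    GROUP BY customer_id
""")
```

Each row of the result is one customer, with the candidate predictors and the churn label side by side, which is exactly the shape a machine learning model expects.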
Once we've translated our business problem into a data problem, we can start asking more advanced questions, such as whether there is a statistically significant correlation between the number of purchases a customer has made in a given month and whether they will still be a customer in the next month. After analyzing the data, we might also learn that a customer who has made fewer than two purchases in a given month is, say, 20 percent more likely to churn in the following month. More and more businesses are making decisions based on data like this. Translating business problems into data problems is therefore one of the most important 21st-century skills.

What is machine learning? I'm going to offer two definitions, one that's more general and the other that's a little more technical. The general definition is as follows: machine learning refers to a broad array of techniques that learn patterns in data without being explicitly programmed. In our churn example, a machine learning algorithm would use past user data to find patterns between variables, such as the number of purchases in the past month and whether a customer has left your platform. A more technical definition is as follows: machine learning is a function that maps features to an output. In other words, given a number of different input features, such as customer purchase history, machine learning maps those inputs to an output, in our case, the probability that the customer will churn.

Now that we know that machine learning refers to tools for finding patterns within data, let's talk about a few different types of machine learning. The two types of machine learning that cover the vast majority of use cases I see are supervised and unsupervised machine learning. There are also reinforcement learning and semi-supervised learning, but we won't cover those in this course. Supervised machine learning is a type of machine learning where you have labeled data points and your task is to predict that label. Our churn analysis example is a supervised learning problem because we know the output we're trying to predict, namely whether or not a customer has left your platform. Supervised learning can be further broken down into two groups: classification and regression. For classification tasks, the goal is to predict a discrete set of categories. In churn analysis, we're predicting whether a customer churns or does not churn, so we're sorting our customers into two categories. This is an example of binary classification, where there are two categories we're predicting. There's also multi-class classification, where more than two categories are used. The other type of supervised learning problem is called regression. In regression tasks, we're predicting some continuous value. Think of the question of financial forecasting, where you're trying to predict the revenue of the company in the next quarter. This is an unbounded number, not a distinct set of categories. With regression, we're predicting a continuous range of values, even values we haven't seen before.

Now that we know the two types of supervised learning, regression and classification, let's talk about unsupervised machine learning. In unsupervised machine learning, we don't have a label to predict. Rather, we're learning the natural structure of our data. One example of this is clustering.
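To show what clustering looks like in practice, here is a minimal sketch using Spark MLlib's KMeans on the hypothetical churn features from the earlier sketch. This is not part of the course notebook, and the column names and the number of segments are assumptions for illustration.

```python
# A minimal clustering sketch (not from the course notebook).
# Assumes the hypothetical `churn_features` DataFrame from the earlier sketch.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Combine the numeric columns into a single feature vector.
assembler = VectorAssembler(
    inputCols=["purchases_last_month", "days_as_customer"],
    outputCol="features")
customer_vectors = assembler.transform(churn_features)

# Group customers into, say, three segments based on their behavior.
kmeans = KMeans(k=3, featuresCol="features", predictionCol="segment", seed=42)
segments = kmeans.fit(customer_vectors).transform(customer_vectors)
```

Notice there is no label here; the algorithm simply groups similar customers together, and it's up to you to interpret what each segment represents.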
Going back to the customer churn example, we might want to look at the natural clusters that form within our customers, revealing different customer segments that we could serve better. Now, let's talk about the machine learning algorithm we'll use in the demo: linear regression. The goal of linear regression is to find the line of best fit. In the case of our fire calls dataset, we'll be predicting how long the response time is for an emergency call. We'll have a handful of different features, like which neighborhood the call came from and what type of call it is, such as a medical incident or a structure fire. Linear regression will help us learn the relationship between the input features and our label, the response time. We'll see that if a call is a structure fire, for instance, we'll generally get a different prediction than if it's, say, a medical incident. Linear regression will learn this association by giving us the coefficients, the numbers we multiply each feature by in order to obtain our final prediction. To say that more mathematically, when we use linear regression, we're learning the coefficients, sometimes called weights, that minimize the residuals, so we minimize the difference between the line and the various data points that we have.

Let's see how machine learning can be applied to our fire calls dataset. The remainder of this walk-through will predict the response times to calls using very simple features, such as the call type and the location of the call. This is a supervised machine learning problem. It's also a regression problem because we're predicting a continuous variable, the response time delay. Now, I'm going to skip over many aspects of this Notebook because there's quite a lot here. I just want to highlight a few key points and leave the Notebook for you to examine more closely should you be interested. I've already hit "Run all" in this Notebook, so all of the results are cached. First, I load in the data and clean it. The timestamp and delay steps calculate our label by looking at the time difference between when the call was made and when the fire department actually arrived. Next, I convert this Spark DataFrame to a pandas DataFrame. pandas is a Python library that makes data manipulation easy, and working with it is one of the core competencies of data scientists working in Python. Next, we can visualize the time delay. We can see here that many calls were responded to within two minutes, though some took quite a bit longer.

Let's skip ahead to actually building the model. Like I said, you might want to take a closer look at many of these steps because there's plenty of text in this Notebook that helps you get a better sense of everything that's going on. The key point here is that machine learning really only works with numerical data, so we need to translate the different neighborhood names (this data is from San Francisco, so those are names like the Mission District and the Tenderloin) and other strings into numerical values. This is a process called one-hot encoding, which makes each of those categories its own separate column containing either a zero or a one. For instance, with our neighborhoods, if a given row in our data is from, say, Chinatown, we'll have a one in the Chinatown column for that observation. As we scroll down, we get to the actual training of the model. Once we've trained our model, we can examine the resulting coefficients.
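To give a sense of the mechanics, here is a simplified sketch of those modeling steps using Spark ML's Pipeline API. It is not the exact notebook code: the `fire_calls` DataFrame, its column names, and the `time_delay` label are stand-ins for what the notebook actually uses.

```python
# A simplified sketch of the modeling steps, not the exact notebook code.
# Assumes a hypothetical `fire_calls` DataFrame with string columns
# `neighborhood` and `call_type` and a numeric `time_delay` label in minutes.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.regression import LinearRegression

categorical_cols = ["neighborhood", "call_type"]

# Map each categorical string to an index, then one-hot encode the indexes.
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in categorical_cols]
encoder = OneHotEncoder(inputCols=[c + "_idx" for c in categorical_cols],
                        outputCols=[c + "_vec" for c in categorical_cols])
assembler = VectorAssembler(inputCols=[c + "_vec" for c in categorical_cols],
                            outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="time_delay")

pipeline = Pipeline(stages=indexers + [encoder, assembler, lr])
train_df, test_df = fire_calls.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train_df)

# The last pipeline stage is the fitted linear regression; each entry in its
# coefficients vector corresponds to one one-hot encoded feature column.
lr_model = model.stages[-1]
print(lr_model.coefficients)
```

Because every call type and neighborhood becomes its own one-hot column, each coefficient tells us how much that category shifts the predicted delay, which is what lets us compare call types below.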
The highest coefficients correspond to the features and categories that have the strongest positive influence on our prediction of the response time. Here, you can see that the response time is higher, meaning slower, for a watercraft in distress, for instance, than it is for some of the other types of fire calls, and that makes a lot of sense: as you can imagine, it might take longer to get to a watercraft than it does to a structure fire. Now, if we take a look at the lowest coefficients, we can see that structure fire has one of the lowest. That means we are able to respond to structure fires much more quickly than to some of these other types of fire calls. This starts to give you a sense of the correlations between the different features we've been looking at over the course of the last few weeks and the response time for our fire trucks.

I know we went through this relatively fast, but my hope is that it gives you an idea of what you can do with more advanced data tools like machine learning. To work on this type of problem, you really need to be working in another programming language like Python. If you're interested in machine learning, learning Python is a very good place to start. Some statistics and machine learning classes will definitely help too. I hope you found this walk-through informative, and I encourage you to spend some time getting to know this Notebook as you consider your next steps in the world of data.