Hello and welcome to the course on using Machine Learning in BigQuery. My name is Raiyaan Serang and I'm a machine learning consultant here at GCP. BigQuery Machine Learning allows you to build machine learning models using SQL syntax. This is a really powerful feature, since anyone who knows only SQL can build ML models. BigQuery ML brings us one step closer to democratizing Machine Learning and AI, one of our core missions here at Google.

First, we'll introduce BigQuery ML and walk through the steps for an end-to-end use case. We'll then discuss the currently supported model types. At the end, you'll work through a couple of labs, so you can get hands-on with BQML.

Where does BQML fit into the greater picture of Google Cloud's AI and ML options? Well, unlike the ML APIs, you are able to create your own custom models. We'll get to AutoML in the next course, but in short, it allows us to leverage Google's ML models, and in certain cases to build our own models from scratch, using transfer learning and a form of neural architecture search.

To work with BQML, there are only a few steps from beginning to inference. First, we write a query on data stored in BigQuery to extract our training data. Then, we create a model, where we specify a model type and other hyperparameters. After the model is trained, we evaluate it and verify that it meets our requirements. Finally, we make predictions using our model on data extracted from BigQuery.

Are you familiar with Hacker News? Hacker News is a social news aggregator website focused on computer science and entrepreneurship. We won't get into all the details, but let's ask a question: given these three articles on the right, where were they actually published? TechCrunch, GitHub, or the New York Times? Of course, we may be able to tell based on the style of the articles. But what if we wanted to build an ML model to do this? First, let's write an ad-hoc query to explore our data.
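As a rough sketch, an exploration query along these lines could pull the title, URL, and publisher for the three sources. This assumes the public Hacker News dataset (`bigquery-public-data.hacker_news.stories`); the exact table and regular expression shown in the course may differ.

```sql
-- Sketch: explore Hacker News stories and pull the publisher out of the URL.
-- Assumes the public dataset `bigquery-public-data.hacker_news.stories`;
-- adjust the table and regex to match your data.
SELECT
  title,
  url,
  REGEXP_EXTRACT(url, '//([^/]*)/') AS source  -- e.g. 'github.com'
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  url LIKE '%github.com%'
  OR url LIKE '%nytimes.com%'
  OR url LIKE '%techcrunch.com%'
LIMIT 10;
```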
Here's a quick example where we're simply looking at the title of the article and the URL. We can extract the publisher information from the URL using regular expression functions. That's exactly what we'll do using the query on this slide. Our plan will be the following: extract the publication source using the REGEXP_EXTRACT function, then separate out the first five words of each title, replacing them with null values in case a title is too short. Our goal will then be to decide whether the article title is from GitHub, the New York Times, or TechCrunch using BigQuery ML.

Let's look at the syntax needed to create a model. We have the CREATE OR REPLACE MODEL statement at the top, where we give the name of the model, the options of the model, and other hyperparameters. After the AS clause, we have our query, which defines the training set from the previous slide. That's it. In this case, we're specifying that we want to build a logistic regression model. This is a type of classification model we can use to classify our input as being either a GitHub, New York Times, or TechCrunch article.

Once we have a trained model, we want to know how it performs. Since it's a classification model, we can use a variety of metrics to see how well it knew where the article originated based on just the title. To get those metrics, you simply call ML.EVALUATE on your trained model, or click on the model in the BigQuery UI and click the evaluation tab.

The metrics are scored from zero to one. Generally speaking, the closer to one, the better, but it really depends on the metric. Precision means: for the articles we made a guess on, how accurate were those guesses? High precision means a low false positive rate, meaning we really punish a model's precision if it makes a ton of bad guesses. Recall, on the other hand, is the ratio of correctly predicted positive observations to all observations in the actual class.
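Putting those pieces together, a training statement could look like the following sketch. The dataset and model names (`mydataset.article_source`) are hypothetical placeholders, and the feature query assumes the same public Hacker News table as before.

```sql
-- Sketch: train a logistic regression classifier in BigQuery ML.
-- `mydataset.article_source` is a hypothetical model name.
CREATE OR REPLACE MODEL `mydataset.article_source`
OPTIONS(
  model_type = 'logistic_reg',
  input_label_cols = ['source']  -- the column we want to predict
) AS
SELECT
  REGEXP_EXTRACT(url, '//([^/]*)/') AS source,
  -- First five words of the title; SAFE_OFFSET yields NULL
  -- when the title is too short.
  SPLIT(title, ' ')[SAFE_OFFSET(0)] AS word1,
  SPLIT(title, ' ')[SAFE_OFFSET(1)] AS word2,
  SPLIT(title, ' ')[SAFE_OFFSET(2)] AS word3,
  SPLIT(title, ' ')[SAFE_OFFSET(3)] AS word4,
  SPLIT(title, ' ')[SAFE_OFFSET(4)] AS word5
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  REGEXP_CONTAINS(url, 'github.com|nytimes.com|techcrunch.com');

-- Once training finishes, pull the evaluation metrics:
SELECT * FROM ML.EVALUATE(MODEL `mydataset.article_source`);
```

The `input_label_cols` option tells BQML which column is the label; every other column in the SELECT becomes a feature.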
In other words, how many did we get right out of both true positives and false negatives? Accuracy is simply true positives plus true negatives over the entire set of observations. There are other metrics, like F1 score, log loss, and ROC curves, that you can use as well. There's a link out to a description of each, and of when you should use which metric in your classification models.

We can also see the confusion matrix, which shows where the model made incorrect predictions. As we can see here, our model had some confusion between TechCrunch and New York Times articles.

If the model meets our requirements, great, we're ready to predict with our model. We don't need to worry about deploying our model in a separate process; it's automatically available to serve predictions here in BigQuery. Here's an example of performing batch prediction using BQML. The first example here has government, shutdown, leaves, workers, and reeling as the first five words of the title. The model's prediction in this case is the New York Times.
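A prediction call like that could be sketched as follows, again using the hypothetical `mydataset.article_source` model name. ML.PREDICT adds a `predicted_<label>` column, here `predicted_source`, alongside the input columns.

```sql
-- Sketch: batch prediction with ML.PREDICT.
-- The model name `mydataset.article_source` is hypothetical.
SELECT
  predicted_source,  -- the model's guess: github, nytimes, or techcrunch host
  word1, word2, word3, word4, word5
FROM ML.PREDICT(MODEL `mydataset.article_source`, (
  SELECT
    'government' AS word1,
    'shutdown'   AS word2,
    'leaves'     AS word3,
    'workers'    AS word4,
    'reeling'    AS word5
));
```

The second argument to ML.PREDICT is just a query, so you can score a single handcrafted row like this or an entire table in one statement.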