In this demo, we're going to apply what we've learned about k-means clustering by performing k-means clustering with Python and scikit-learn. Again, remember that for each of our demos and labs, we need to run our classroom setup notebook within each notebook. It should only take a few seconds to run.

Okay. Just like with any machine learning project, we need to prepare our data. Recall that, based on the project objective, we're interested in user-level clustering. The objective is to determine whether there are naturally occurring groups of users in our health tracker dataset, so we need our data to be at the user level. In this SQL cell, we aggregate our daily-level metrics table to the user level by grouping by device ID and aggregating the numeric columns. Note that we're taking the average of each of these numeric columns, the per-user average across all days. Once we've created that table, we can display it to see the new columns. You'll see that our aggregated columns are in this table: average resting heart rate, average active heart rate, and so on.

Next, we're going to split our data using scikit-learn's train_test_split function. You might be wondering why we're splitting our data for unsupervised learning when there are no holdout labels or target values to measure how our model generalizes. That's a really good question. Normally, we don't need to split our data when performing unsupervised learning tasks, but in this case we're going to use train_test_split to divide our data into a training DataFrame that we'll use to develop the k-means clustering model and a DataFrame that we'll use for inference later on. In other words, we're using the test set as an example of placing unseen data into clusters. The train_test_split function makes this splitting easy, so we're just repurposing it here. You'll see that the shapes of the two DataFrames indicate that the same columns are in each dataset, but they have differing numbers of rows. We'll create the k-means clustering model with the training set, which has more rows, and then perform inference with the inference set, which has fewer rows.

Okay. Now that we've prepared our data, let's focus on training the k-means algorithm. One of the key inputs to the k-means algorithm is k, the number of clusters. It needs to be specified ahead of time by you, the data scientist. We're going to set our number of clusters to four, but we're doing this arbitrarily. Be sure to pay attention to this note here about scaling our features: because k-means is distance-based, it is sensitive to the scale of the features. This means that feature variables with larger natural spreads are associated with greater distances than feature variables with smaller spreads. What do we do about this? We scale our features so that they're all on the same scale. The details are beyond the scope of this lesson, but it's important to use this scaling functionality, or something similar to it, whenever we train distance-based algorithms like k-means. A minimal sketch of these preparation and training steps is shown below.
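The following is a minimal sketch of the workflow described so far, not the exact notebook code. The DataFrame name, column names (for example, daily_metrics_df, resting_heartrate, steps), and the synthetic data are assumptions for illustration; the demo's actual table and column names may differ.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the daily-level health tracker metrics table
# (hypothetical column names).
rng = np.random.default_rng(42)
n_rows = 3_000
daily_metrics_df = pd.DataFrame({
    "device_id": rng.integers(0, 1_000, n_rows),
    "resting_heartrate": rng.normal(65, 5, n_rows),
    "active_heartrate": rng.normal(120, 10, n_rows),
    "bmi": rng.normal(24, 3, n_rows),
    "vo2_max": rng.normal(40, 6, n_rows),
    "steps": rng.normal(9_000, 2_000, n_rows),
    "workout_minutes": rng.normal(30, 10, n_rows),
})

# Aggregate daily metrics to the user (device) level: one row per device,
# averaging each numeric column across all days.
user_df = (
    daily_metrics_df
    .groupby("device_id", as_index=False)
    .mean(numeric_only=True)
)

# Repurpose train_test_split to set aside a DataFrame of "unseen" users
# for inference later on.
train_df, inference_df = train_test_split(
    user_df, test_size=0.2, random_state=42
)
print(train_df.shape, inference_df.shape)  # same columns, different row counts

# k-means is distance-based, so put all features on the same scale first.
feature_cols = [c for c in user_df.columns if c != "device_id"]
scaler = StandardScaler()
X_train = scaler.fit_transform(train_df[feature_cols])

# Fit k-means with an (arbitrarily chosen) k of 4.
k_means = KMeans(n_clusters=4, random_state=42)
k_means.fit(X_train)
```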
Once we fit our model, we'll see the object returned with a handful of parameters listed. We don't have time to go into every one of these parameters, but we'll talk about a few here. The first is the init parameter, which controls how the centroids are initialized. In basic k-means, this placement is random, kind of like we saw in the visualization in the previous video, but there are other options that can make the initialization more intelligent and help the algorithm train more efficiently. The next parameter is max_iter. Remember that k-means is an iterative algorithm: it performs the same task repeatedly until one of the stopping criteria is met. One stopping criterion is the number of iterations of the algorithm. In this case, because max_iter is set to 300, our algorithm will iterate at most 300 times. At that point, the centroids will be in their final positions.

If we want to determine where the final centroids are located for our trained k-means model, we can use the cluster_centers_ attribute. This attribute returns an array in which each element is the coordinates of a centroid in the feature space. As evidence of this, we can see that each element of the array is a list of length six, one value for each of our feature dimensions. Picture these centroids plotted in a six-dimensional scatterplot, with the surrounding points assigned to whichever centroid they're closest to.

Now that we have our trained model and our centroids, we can perform inference on new records of features we haven't seen before. This is why we set aside our inference_df DataFrame in the first place. We can place the rows from our inference DataFrame into clusters using the k_means.predict function, a method of our trained k-means model object. This function accepts a DataFrame with the same feature variables as the training DataFrame on which k_means was fit. Note that we're scaling our features again here; there are better ways to do this, and we'll cover them in the next module. The output of this predict function, as you can see here, is a NumPy array. Its length is 300, the same as the number of rows in our inference DataFrame. This isn't a coincidence: the k_means.predict function returns a cluster ID, or cluster assignment, for each of the rows in that inference_df DataFrame.

We can bind this array to our inference DataFrame by adding it as a new column to a copy of the inference DataFrame. If we view this DataFrame and scroll all the way to the right, we'll see a new column called cluster with the cluster ID associated with each row. The value in this cluster column is the cluster to which each row from the inference DataFrame has been assigned. In the rest of this lesson, we'll look at how to determine the appropriate number of clusters and how to use these clustering results on real-world projects.
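To recap the inference step just described, here is a short sketch continuing from the earlier one (and reusing its assumed names k_means, scaler, inference_df, and feature_cols). Again, this is illustrative rather than the demo notebook's exact code.

```python
# Scale the inference features with the scaler fit on the training data,
# then assign each unseen user to its nearest learned centroid.
X_inference = scaler.transform(inference_df[feature_cols])
cluster_ids = k_means.predict(X_inference)

# One cluster ID per row of the inference DataFrame.
print(cluster_ids.shape)

# The learned centroids: one row per cluster, one coordinate per feature.
print(k_means.cluster_centers_.shape)

# Bind the cluster assignments to a copy of the inference DataFrame.
clustered_df = inference_df.copy()
clustered_df["cluster"] = cluster_ids
print(clustered_df.head())
```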