Now, let's talk about K-means clustering, which is one of the simplest and most popular clustering methods. First, let me explain the name. The K stands for the K clusters that the algorithm tries to find in your data. For example, if you take K equal to 3, the K-means algorithm will find three clusters for you. Second, the "means" in the name indicates that clusters are identified by mean values. As we will see shortly, in the K-means method it is the mean value of all points in a given cluster that identifies the cluster itself.

Let's see how it works using a simple example. Imagine we want to analyze and segment companies in a given industry. Assume that we look at only two characteristics for each company, namely earnings per share, or EPS, and return on assets, or ROA. So each company will be represented as a dot on the two-dimensional plane formed by the values of EPS and ROA. The K-means algorithm starts with choosing the value of K. Let's take K equal to 2; that is, we want to find two clusters in the data. The algorithm begins with some random initialization of the K cluster centers, which we will call mu_k. At the next step, we scan over all points in the dataset, and for each point x_i we compute the distance from it to each cluster center mu_k. Then we assign each point x_i to the cluster k for which the distance d(x_i, mu_k) is minimal across all values of k. This completes the first step of the K-means algorithm. Next, for each cluster, we recompute the position of the cluster center by taking the mean value of all points that belong to this cluster. This completes the second step of the algorithm. After that, we repeat the whole two-step procedure again and again, until convergence or for a fixed number of iterations. Now, after we have completed the training, we can use the trained model: for each data point in a new dataset, we compute its distance to all cluster centers and pick the cluster with the shortest distance as its parent cluster.

The K-means algorithm is rather quick and often finds good clusters within a small number of steps, especially when the data is already segmented into well-separated groups. Here's a short video showing how K-means finds the centers for synthetic two-dimensional data generated from three Gaussian distributions centered at different corners of a rectangle.

It's useful to know that the above iterative procedure of the K-means algorithm can also be formulated as an optimization problem. It turns out that the same solution is obtained if we minimize the weighted sum of squared distances between all data points and all cluster centers, where the weight is 1 if the point belongs to the given cluster and 0 otherwise. This is the meaning of the objective function shown here, which we have to minimize with respect to two sets of variables. The first set are the vectors mu_k that define the K cluster centers. The second set are the binary values r_nk, with possible values 0 and 1. This objective function is non-convex; therefore, it can have multiple local minima. Let me describe one method that allows you to find one local minimum, typically the one closest to your initial point. Such minimization is done iteratively. More specifically, starting with some initial values of the centroids mu_k, we iterate between two steps. In the first step, we keep the values of mu_k fixed and compute the binary values r_nk that minimize the loss function for the given set of mu_k.
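To make the two-step procedure concrete, here is a minimal NumPy sketch of the iteration just described. It is an illustration under my own naming choices (the function kmeans, the toy EPS/ROA array X, and the iteration cap are not from the lecture). The same loop can be read as alternating minimization of the objective, the sum over n and k of r_nk times the squared distance between x_n and mu_k, first over the assignments r_nk and then over the centers mu_k.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: alternate the assignment step and the mean-update step."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick K distinct data points as the initial centers mu_k.
    mu = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Step 1: assign each point x_i to the nearest center mu_k (this fixes r_nk).
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # shape (N, K)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the mean of the points assigned to it.
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        # Stop once the centers no longer move (convergence).
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, labels

# Toy example: six "companies" described by (EPS, ROA), clustered with K = 2.
X = np.array([[1.0, 0.05], [1.2, 0.06], [0.9, 0.04],
              [3.0, 0.15], [3.2, 0.14], [2.8, 0.16]])
centers, labels = kmeans(X, K=2)
print(centers)   # the two cluster centers mu_k
print(labels)    # cluster assignment of each company
```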
In the second step, we fix the values of r_nk and minimize with respect to mu_k. Now, what I have just described is a special case of a more general machine learning algorithm called the EM, or Expectation-Maximization, algorithm. The first step, finding the binary variables r_nk for fixed mu_k, is the E-step, that is, the expectation step of the EM algorithm. The second step, which minimizes the loss with respect to mu_k for fixed r_nk (or, equivalently, maximizes the gain, which is the negative of the loss), is the M-step. So this is how the basic K-means algorithm works: simple and elegant.

Now, I want to conclude this video with a few further comments on the K-means algorithm. First, K-means is a deterministic algorithm, which means that for a given dataset, repeated experiments with the same starting point will always produce the same result. However, the objective function in K-means is non-convex, which means that different starting points generally lead to different cluster assignments. This also brings up the question of cluster stability: if you take different datasets drawn from the same data-generating distribution, K-means might produce fluctuating and unstable cluster labels. In addition, outliers in the data may make the stability issue even more challenging. For these reasons, good initialization schemes that help control local minima, for example a method called K-means++, are often used in practice. Also, on the practical side, data normalization or data whitening, that is, a PCA-based transformation of the data, sometimes helps to find good clusters. And while we looked at the batch version of K-means, when you have lots of data you may use mini-batch or online versions of K-means.

There are also some issues with K-means that you should be aware of. First, K-means will always find clusters for you, even if your data is pure noise: just give it a value of K, and it will happily report exactly K clusters found. Also, it's often far from obvious how to choose the right value of K for your problem; usually, you have to rely on your intuition and prior expectations. And last but not least, you have to remember that K-means relies on Euclidean distances between data points when computing clusters. If the Euclidean metric is not appropriate for your problem, then K-means will implement the famous GIGO principle: garbage in, garbage out. Let's pause here for a control question, and then move on to the next video and our favorite topic, neural networks, where we will see how they can be useful for clustering.
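As a practical illustration of these remarks, here is a minimal sketch using scikit-learn; the library choice, the stand-in data, and all parameter values are my own assumptions, not part of the lecture. It combines K-means++ initialization with several random restarts, feature normalization, and the mini-batch variant for large datasets.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.preprocessing import StandardScaler

# Stand-in data: replace with your own feature matrix (e.g., EPS and ROA columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))

# Normalizing the features often helps K-means find better clusters.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# K-means++ initialization plus several random restarts to control bad local minima.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X_scaled)
print(km.cluster_centers_)   # the learned centers mu_k
print(km.inertia_)           # value of the (non-convex) objective at the solution found

# Mini-batch version of K-means for large datasets.
mbk = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=3, random_state=0).fit(X_scaled)
print(mbk.cluster_centers_)

# Assigning new points: each one gets the cluster with the nearest center.
X_new = scaler.transform(rng.normal(size=(5, 2)))
print(km.predict(X_new))
```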