I'm now ready to introduce the idea of soft clustering. Here are three clustering problems that every human would easily solve, along with the solution of each. The problem on the left is partitioned into two clusters, the same as the problem in the middle, and the problem on the right is partitioned into three clusters. But how would the Lloyd algorithm cluster the same data sets? I'm afraid that for the data set on the left, it will produce the different solution shown here; the same holds for the data set in the middle, and once again a different solution appears for the data set on the right. I now want to discuss how we can change our approach to clustering so that we do almost as well as the human eye.

Let's take a look at the evolution of our clustering approaches. We started from the good clustering principle, which was obviously a bad approach to clustering. Afterwards, we moved to k-center clustering based on MaxDistance, and we quickly figured out the limitations of MaxDistance. We then moved to k-means clustering based on the squared error distortion, and we are now learning that k-means clustering also has limitations. But we haven't yet figured out precisely what the key limitation of the squared error distortion is, as applied to clustering, and as a result it is not clear how we should change our clustering approach.

To help us figure out how to change our approach to clustering, let's think about a very simple clustering problem: two clearly defined clusters, shown as blue and red points, and a point located approximately halfway between the blue and red clusters. How should we cluster this midpoint? Since we are currently doing hard assignments of points to centers, we have only two choices: assign it to the red cluster or to the blue cluster. And both choices are bad. So maybe we should start thinking about soft assignments of data points to clusters, where this midpoint would be assigned, say, approximately 50% to the red cluster and 50% to the blue cluster.

Soft assignments are illustrated on this slide. On the left you see a hard assignment: every point is either blue or red. On the right, on the other hand, every point is assigned two responsibilities, a red responsibility and a blue responsibility, such that the sum of the responsibilities is 1. Some points are almost 100% red, some points are almost 100% blue, but the points in the middle have roughly 50% red and roughly 50% blue responsibilities. The question remains: how do we assign these responsibilities to points?

Two unlikely characters from Hamlet, Rosencrantz and Guildenstern, will help us figure out how to assign these responsibilities to clusters. These characters were borrowed by Tom Stoppard in his famous play, where they toss a coin and it lands on heads an insane number of times, over 100 times in a row. This starts a philosophical discussion between the two characters, but we are not interested in the philosophical discussion at the moment; we are more interested in what happened. And the mathematical explanation of what happened is that the coin was simply biased toward heads.

Let's now look at the problem of estimating the unknown bias of a coin. We flip a loaded coin with an unknown bias, where the bias is simply the probability that the coin lands on heads, and suppose that in our experiment the coin lands on heads i out of n times. The question is: what is the most likely bias of the coin?
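To make this setup concrete, here is a quick simulation, a minimal sketch in Python; the true bias of 0.35 and the number of flips n = 1000 are my illustrative choices, not values from the lecture.

```python
import random

random.seed(0)

# Illustrative assumptions (not from the lecture): a loaded coin with
# true bias 0.35, i.e., probability 0.35 of landing on heads.
true_bias = 0.35
n = 1000  # number of flips

flips = [random.random() < true_bias for _ in range(n)]  # True means heads
i = sum(flips)  # number of heads observed

print(f"heads: {i} out of {n} flips")
print(f"fraction of heads: {i / n:.3f}")  # close to the true bias 0.35
```

The fraction of heads hovers around the true bias, which hints at the answer derived next.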
Well, if the bias were known, then we could compute the probability of the resulting sequence of flips: it is simply bias^i multiplied by (1 - bias)^(n - i). Therefore, the most likely bias of the coin is the bias that maximizes this probability, and simple calculus results in the following estimate for the most likely bias: it is i/n, or simply the fraction of heads in the series of coin tosses, something that you could have guessed from the very beginning.

We now move to a more complicated scenario in which there is not one but two biased coins, each with its own bias. The coins look identical, so you cannot say which coin is which. Imagine that there are five series of flips, shown on this slide, each resulting in a different number of heads. We are not really interested in the exact sequence of heads and tails in each series; we are more interested in the fraction of heads in each series, and we represent these fractions as the vector Data = (0.4, 0.9, 0.8, 0.3, 0.7) shown on the slide. Our goal is to estimate the biases of the two coins.

This actually looks like a difficult problem, but if we knew which coin was used in each series, it would turn into a simple problem. We encode which coin was used in each series of flips by a hidden vector, where 1 corresponds to the blue coin and 0 corresponds to the green coin; so our hidden vector is HiddenVector = (1, 0, 0, 1, 0). If the hidden vector is known, then estimating the most likely biases of the blue and green coins is a simple problem. Indeed, in this case the bias of the blue coin is simply the fraction of heads in all blue flips, which is (4 + 3)/20, or (0.4 + 0.3)/2 = 0.35. Likewise, the bias of the green coin is simply the fraction of heads in all green flips, which is (0.9 + 0.8 + 0.7)/3 = 0.8. So as soon as we know which coin was used in each series, we can derive the biases of the blue and green coins, which we refer to as the two-dimensional vector Parameters = (0.35, 0.8).
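Since we are assuming that the hidden vector is known, this computation is easy to check in code. Here is a minimal sketch in Python (the variable names are my own):

```python
# Fraction of heads in each of the five series of flips (the vector Data)
data = [0.4, 0.9, 0.8, 0.3, 0.7]

# Which coin produced each series: 1 for the blue coin, 0 for the green coin
hidden_vector = [1, 0, 0, 1, 0]

blue_fractions = [d for d, h in zip(data, hidden_vector) if h == 1]
green_fractions = [d for d, h in zip(data, hidden_vector) if h == 0]

theta_blue = sum(blue_fractions) / len(blue_fractions)     # (0.4 + 0.3) / 2
theta_green = sum(green_fractions) / len(green_fractions)  # (0.9 + 0.8 + 0.7) / 3

print(round(theta_blue, 2), round(theta_green, 2))  # 0.35 0.8
```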
I now want to show how to compute Parameters in a slightly different way, using dot products. Consider the bias of the blue coin, (0.4 + 0.3)/2. Its numerator, 0.4 + 0.3, can be written as the sum of five products, one per series, because the series produced by the green coin are multiplied by 0 and vanish; this sum is of course simply the dot product of Data and HiddenVector. Its denominator, 2, is the number of ones in the hidden vector, which is the dot product of AllOnes and HiddenVector, where AllOnes is the vector consisting of all ones. As a result, the bias of the blue coin is (Data · HiddenVector) / (AllOnes · HiddenVector).

Similarly, the bias of the green coin can also be represented as a dot product. Look at the expression shown in the red box: after we rewrite it, it is easy to see that its numerator is nothing but the dot product of Data and (1 - HiddenVector), and its denominator is simply the dot product of AllOnes and (1 - HiddenVector).

Therefore, we now have the formulas: the bias of the blue coin is (Data · HiddenVector) / (AllOnes · HiddenVector), and the bias of the green coin is (Data · (1 - HiddenVector)) / (AllOnes · (1 - HiddenVector)). The representation of the biases as dot products may seem to be of interest only for the mathematically curious, but what it has to do with clustering, you will learn in the next section.
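As a quick sanity check, here is a minimal sketch, assuming NumPy is available, confirming that the dot-product formulas reproduce Parameters = (0.35, 0.8):

```python
import numpy as np

data = np.array([0.4, 0.9, 0.8, 0.3, 0.7])  # the vector Data
hidden_vector = np.array([1, 0, 0, 1, 0])   # the vector HiddenVector
all_ones = np.ones(len(data))               # the vector AllOnes

# Bias of the blue coin: (Data . HiddenVector) / (AllOnes . HiddenVector)
theta_blue = np.dot(data, hidden_vector) / np.dot(all_ones, hidden_vector)

# Bias of the green coin:
# (Data . (1 - HiddenVector)) / (AllOnes . (1 - HiddenVector))
theta_green = np.dot(data, 1 - hidden_vector) / np.dot(all_ones, 1 - hidden_vector)

print(round(float(theta_blue), 2), round(float(theta_green), 2))  # 0.35 0.8
```

Note that the vectorized form computes both biases with the same two dot products, without any branching on which coin was used in each series.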