Hello, I'm Neil Clark, and in this talk I'm going to talk about self-organizing maps. The self-organizing map is a method for visualizing structure in high-dimensional data. In a sense it's similar to the principal component analysis we covered in an earlier talk: it attempts to project the high-dimensional data down onto a lower-dimensional surface which captures as much of the high-dimensional structure as possible. One of the differences is that, with self-organizing maps, there is no constraint that this surface be linear. Self-organizing maps also have strong similarities with what is called K-means clustering; in fact, a self-organizing map can be thought of as simply a constrained version of K-means clustering. So I'm going to start by describing the K-means method, then develop the self-organizing map from it, and I'm going to show you all of this on a simple three-dimensional data set. First, K-means clustering. Consider a multi-dimensional data set with elements x_i. Each x_i has multiple features, so you can think of it as residing in a space with one dimension per feature, and you can think of the distance between two data elements as the Euclidean distance in that space. For example, each x_i might be a gene expression profile, consisting of the expression values of many genes, and the distance between two profiles is then simply the Euclidean distance in the high-dimensional space. Now, with the K-means procedure, we first choose a number k, and this is going to be the number of clusters that we divide the data into.
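To make that distance concrete, here is a minimal sketch; the two profiles, the gene count, and the numbers are made up purely for illustration, not data from the talk:

```python
import math

# Two hypothetical expression profiles, one value per gene.
# The gene count and the numbers are invented for illustration.
profile_a = [2.1, 0.5, 3.3, 1.0]
profile_b = [1.9, 0.7, 2.8, 1.4]

def euclidean(x, y):
    """Distance between two points in the high-dimensional feature space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean(profile_a, profile_b))  # ~0.7, up to floating-point rounding
```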
So usually you'll have some idea of what k should be, or some method for determining it. We introduce k prototypes, or centers, each with position m_k. Each data point then becomes associated with its closest center, closest in the Euclidean sense. What we look for is the partition of the data into clusters, and the positions of the centers, which optimally cluster the data in the sense that they minimize the sum of squared errors between the data points and their associated centers: minimize Σ_k Σ_{i∈C_k} ‖x_i − m_k‖², where C_k is the set of points assigned to center m_k. In other words, we're trying to find the optimal way to cluster the data. Here I have a movie illustrating this procedure. It starts with five centers, each a different color, chosen at random, and then we iterate a procedure which moves the centers so as to minimize the sum of distances of each associated data point from its center. You can see what's happening: the green point is identifying the cluster in the top right, the yellow point is identifying the cluster in the bottom right, and so on. So that was K-means clustering, and it's going to form the basis of the self-organizing map. We'll go into a bit more detail by applying self-organizing maps to a specific data set, and because the self-organizing map is not constrained to be linear, I think it's good to use a data set which is strongly nonlinear. So here is the data set we're going to use: the points are scattered around a sphere, and you'll notice that we have several clusters of points, shown in different colors. First we can apply K-means clustering to this data: we choose a number k equal to the number of clusters that we have, and iterate the method.
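The procedure shown in the movie can be sketched in a few lines. This is a minimal pure-Python version of the standard alternation (assign points to their closest center, then move each center to the mean of its points), not the exact code behind the movie:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means (Lloyd's algorithm) sketch."""
    rng = random.Random(seed)
    # Initialize the k centers at randomly chosen data points.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: each point joins its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: move each center to the mean of its assigned points.
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = [sum(col) / len(cl) for col in zip(*cl)]
    return centers
```

With two well-separated blobs of points and k = 2, the centers converge to the two cluster means.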
I've not described the method just yet, but I will do that shortly. With each iteration of the algorithm we can look at the total error, which is the sum of the squared distances of each data point from its associated center. We're trying to minimize this error, and here we've run the algorithm with a number of different initial conditions. You can see that, generally speaking, the error falls into a local minimum. Now, just for reference, I'll show you here a principal component analysis plot of our example data. You can see that the principal component analysis has done a good job of separating the blue and the red clusters, but the green and the orange are mixed up. This is presumably because the data points actually fall on a sphere, which has been flattened in this analysis, and in the flattening process the orange and the green have been mixed together. So here I'm just going to illustrate what happens as we iterate the self-organizing map algorithm. We've got a short movie showing the initial condition: we start with the principal component analysis surface, which we have populated with a regular array of centers. We can then show the evolution of the self-organizing map surface as we iterate the algorithm. You can see it jiggles about, but eventually the surface tends to curve around and tries to fit the sphere-like structure of the data. This gives you some intuition of what the self-organizing map surface is trying to do: it's trying to find a surface which optimally fits the nonlinear structure in the data. So now I can go into a little bit more detail about how the self-organizing map iterative algorithm works. It works by scanning across all your data points and updating the positions of the centers.
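The PCA-grid initial condition can be sketched as follows. This is an assumption-laden toy version: it estimates the top two principal axes by power iteration with deflation (a real implementation would use a library SVD), then lays an evenly spaced lattice of centers on that plane. The function names, the `scale` parameter, and the grid spacing are my own choices, not from the talk:

```python
import random

def principal_axes(points, n_axes=2, iters=200, seed=0):
    """Top principal axes via power iteration on the covariance matrix."""
    d = len(points[0])
    mean = [sum(p[j] for p in points) / len(points) for j in range(d)]
    X = [[p[j] - mean[j] for j in range(d)] for p in points]
    # Covariance matrix C = X^T X / n.
    C = [[sum(row[a] * row[b] for row in X) / len(X) for b in range(d)]
         for a in range(d)]
    rng = random.Random(seed)
    axes = []
    for _ in range(n_axes):
        v = [rng.random() for _ in range(d)]
        for _ in range(iters):
            w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        axes.append(v)
        # Deflate: remove the found component so the next axis can emerge.
        lam = sum(v[a] * sum(C[a][b] * v[b] for b in range(d)) for a in range(d))
        for a in range(d):
            for b in range(d):
                C[a][b] -= lam * v[a] * v[b]
    return mean, axes

def init_grid(points, rows, cols, scale=1.0):
    """Regular (rows x cols) lattice of centers on the top-two PCA plane."""
    mean, (u, v) = principal_axes(points, 2)
    centers = {}
    for i in range(rows):
        for j in range(cols):
            a = scale * (i / (rows - 1) - 0.5)  # coordinate along axis u
            b = scale * (j / (cols - 1) - 0.5)  # coordinate along axis v
            centers[(i, j)] = [mean[k] + a * u[k] + b * v[k]
                               for k in range(len(mean))]
    return centers
```

Keying the centers by their lattice coordinates (i, j) is convenient, because the self-organizing map update below measures neighborhoods on exactly that lattice.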
So, as I illustrated in the previous movie, the initial positions of the centers form a regular array placed on the principal component analysis surface. Then we repeatedly sweep through all the data points, applying this simple update. For each data point x_i we find the closest center in Euclidean space, which we'll call m_k. Then, on the self-organizing map, we find all the centers which are within a given distance of m_k, where distance is measured not in the Euclidean sense, but on the self-organizing map surface. The way to think about that is this: if m_k has position (i, j) in the regular lattice of centers, we can calculate the lattice distance to every other center at position (i′, j′), and all the centers below a particular distance, call it r, have their positions updated. We change each of these centers from its current position to its current position plus α, a parameter called the learning rate, which multiplies the vector difference between the data point under consideration and the closest center m_k. This second step could be done in a slightly more sophisticated fashion: instead of being simply proportional to this vector, the update can be weighted by a function of the self-organizing map distance as well. Now we can take another view of the iteration of the self-organizing map algorithm by looking at how the data points are projected onto the self-organizing map surface as we iterate. In the final state, where the self-organizing map has converged, you can see a very clear separation of each of the clusters, and in this case the green and the orange clusters are no longer flattened into each other; they are very clearly separated.
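One sweep of the update just described can be sketched like this. Centers are keyed by their lattice coordinates, and the update vector α(x − m_k) is shared by the whole neighborhood, exactly as stated in the talk; note that many SOM implementations instead pull each neighbor toward x individually, often with a distance-dependent kernel weight, which is the more sophisticated variant mentioned above:

```python
def som_step(x, centers, alpha=0.1, r=1.0):
    """One self-organizing map update for a single data point x.

    `centers` maps lattice coordinates (i, j) to positions in data space.
    """
    # 1. Best-matching unit: the center closest to x in data space.
    bmu = min(centers,
              key=lambda ij: sum((a - b) ** 2 for a, b in zip(x, centers[ij])))
    # 2. Shared update vector alpha * (x - m_k), computed from the BMU.
    delta = [alpha * (a - b) for a, b in zip(x, centers[bmu])]
    # 3. Move every center within lattice distance r of the BMU.
    for (i, j) in centers:
        if (i - bmu[0]) ** 2 + (j - bmu[1]) ** 2 <= r * r:
            centers[(i, j)] = [m + d for m, d in zip(centers[(i, j)], delta)]
    return bmu
```

A full iteration of the algorithm is then just a loop calling `som_step` once per data point, typically with α and r shrinking over time.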
And this illustrates one of the advantages of this approach when your data is strongly nonlinear; it's a limitation of the principal component analysis approach. Now, there are a number of parameters that we have chosen, such as the learning rate α and the neighborhood distance r, and we want to consider how we might choose them. One way to think about that is to think about the error. Remember, the error here is the same as in K-means: the sum of the squared distances of each data point from its associated center. This is a measure of how well your K-means or self-organizing map is fitting the data. Now, because the self-organizing map is a constrained version of K-means, we would expect its error to be larger than the K-means error; in a certain sense it suffers from the constraints, whereas K-means is freer to minimize the error. But if we've chosen the parameters well, we would hope that this error is not tremendously greater than the K-means error. So here, for this example problem, the blue curve shows the squared error as a function of the iteration of the algorithm, and the orange line at the bottom indicates the K-means error. You can see that in the early iterations the error is very large, but as the self-organizing map converges, it converges to an error which, although greater than the K-means error, is not enormously so. This gives us some confidence that we've chosen the parameters reasonably. So, in summary, the self-organizing map is a method of dimension reduction. It's a way of looking at high-dimensional data which has the advantage that it's not constrained to be linear, so the surface can adapt to the nonlinearities in your data. [MUSIC]