In this lecture, we look at two examples of K-means clustering. The first uses a small dataset to show how the algorithm operates. The second is a remote sensing example that shows how clustering is used in unsupervised classification.

We now implement the K-means algorithm using the two-dimensional dataset shown here. We will see this dataset several times in these lectures. Looking at the pixel locations in this data, there seem to be two or possibly three clusters. In practice, that might not be so obvious, so a guideline is needed for the initial choice of how many clusters to find. Because we can merge clusters later, it is good to estimate on the high side, recognizing, however, that the more clusters there are, the longer the algorithm is likely to take to converge. A guideline which has been used in remote sensing is to estimate the number of information, or ground cover, classes in a scene and then search for about two to three times that number of clusters.

The choice of the initial cluster centers is important, because it can influence the speed of convergence and the ultimate set of clusters. It is important that the initial set be spread well across the data. Several guidelines are available. One is that the initial cluster centers be spaced uniformly along the multi-dimensional diagonal of the spectral space, that is, a line from the origin to the point of maximum brightness on each axis. Better still, we could choose the multi-dimensional diagonal that joins the actual spectral extremities of the data. That requires a little additional processing to find the lower and upper spectral limits in each band, but it is quite straightforward. Another approach is to space the initial cluster centers uniformly along the first principal component of the data. That will work well with highly correlated datasets, but might be less useful if the data show little correlation.
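As a minimal sketch of the ideas above, the following Python function runs basic K-means with the initial centers spaced uniformly along the diagonal joining the per-band minima and maxima of the data (the second initialization guideline mentioned). The function name and the simple convergence test are illustrative choices, not part of the lecture material.

```python
import numpy as np

def kmeans_diagonal_init(X, k, max_iter=100):
    """Basic K-means on pixel vectors X (n_pixels x n_bands).
    Initial centers are spaced uniformly along the multi-dimensional
    diagonal joining the data's per-band minima and maxima."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    fracs = np.arange(1, k + 1) / (k + 1)        # k fractions along the diagonal
    centers = lo + fracs[:, None] * (hi - lo)
    labels = None
    for _ in range(max_iter):
        # assign each pixel vector to its nearest cluster center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        # stop when no pixel migrates between clusters
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # recompute each center as the mean of its current members
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    sse = float(((X - centers[labels]) ** 2).sum())  # sum of squared error
    return centers, labels, sse
```

On the small two-cluster dataset of the next slide, an implementation like this converges in a handful of iterations, with the SSE falling at each reassignment.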
Diagrammatically, this slide shows the evolution of the K-means method, iteration by iteration, on a small dataset. Here we are searching for two clusters. The bottom row shows four iterative assignments of the pixels to the clusters, along with the corresponding sum-of-squared-error (SSE) values. After the fourth iteration, or assignment, there is no further migration of pixels between the clusters. Note, incidentally, those pixels which changed clusters in the first two steps. The top right-hand diagram shows how the means migrate with iteration. That tells us why the algorithm is sometimes called the method of migrating means.

The ISODATA algorithm is a variation of the simple K-means approach. It adds two or three further possible steps, as outlined in this slide. First, we can check whether any cluster contains so few pixels as to be meaningless. If the statistics of the clusters are important, say for use in a later maximum likelihood classification, then poor estimates will be obtained if the clusters do not contain a sufficient number of members. Secondly, we can check whether any pairs of clusters are so close that they should be merged. In Module 3, we will look at similarity measures for use in classification. They will give an indication of whether classes and clusters are so similar spectrally as not to be useful separately, and therefore should be merged. Thirdly, we can check whether some clusters are so elongated in some spectral dimensions that it would be sensible to split them. Elongated clusters are not necessarily a problem, but if they are, then comparison of the standard deviations of the clusters along each spectral dimension will help reveal the elongation.

We now look at the application of clustering to unsupervised classification, using a five-channel dataset from an image recorded by the HyMap sensor near the city of Perth in Western Australia in January 2010. The table shows where the five bands are located in the spectrum.
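The three extra ISODATA steps can be sketched as a post-processing pass over the clusters found by K-means. This is only an illustration of the logic, under assumed thresholds (`min_size`, `merge_dist`, `max_std` are hypothetical parameters, and real ISODATA implementations refine these tests considerably):

```python
import numpy as np

def isodata_adjust(X, centers, labels, min_size, merge_dist, max_std):
    """Sketch of the extra ISODATA steps: discard, split, merge.
    Thresholds are illustrative assumptions, not lecture values."""
    surviving = []
    for j, c in enumerate(centers):
        members = X[labels == j]
        if len(members) < min_size:
            continue                          # step 1: too few pixels, discard
        stds = members.std(axis=0)
        if stds.max() > max_std:              # step 3: elongated, split in two
            offset = np.zeros(X.shape[1])
            offset[stds.argmax()] = stds.max()  # split along the widest band
            surviving.extend([c - offset, c + offset])
        else:
            surviving.append(c)
    # step 2: merge any pair of surviving centers closer than merge_dist
    merged, used = [], set()
    for i in range(len(surviving)):
        if i in used:
            continue
        for j2 in range(i + 1, len(surviving)):
            if j2 not in used and np.linalg.norm(surviving[i] - surviving[j2]) < merge_dist:
                merged.append((surviving[i] + surviving[j2]) / 2)
                used.add(j2)
                break
        else:
            merged.append(surviving[i])
    return np.array(merged)
```

After such an adjustment, the pixels would be reassigned to the new set of centers and iteration would continue.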
From our knowledge of spectral reflectance characteristics, that choice seems sensible for differentiating among the cover types we most expect to see. The image display uses just three of the bands, selected to give the standard color infrared product in which vegetation is emphasized in red. A remote sensing image analysis package called MultiSpec, from Purdue University, was used for this exercise. Six clusters were specified, with no provision to merge close clusters or split elongated ones. However, if any cluster had been found to have fewer than 125 pixels, it would have been eliminated; none were found in this exercise.

Once clustering was complete, the unsupervised cluster map shown on the right-hand side of this slide was produced. The clusters, represented by different colors, follow the visual patterns of the classes in the image. It is easy to associate the brown and orange clusters with highways, road pavements, and bare regions, the yellows with buildings, and the shades of green with various types of vegetation. Clearly, the dark blue cluster is water. We strengthen this interpretation further in the next slide.

In the table here, we see the mean vectors of the final set of clusters, which, together with the spatial clues in the map itself, allow us to associate the cluster colors with ground cover classes, as indicated in the key to the cluster map, which of course is now a thematic map. Here we plot the cluster means by wavelength and see that they follow the spectral reflectance curves of the ground cover class labels assigned to the clusters. This is further information that can be used to identify the clusters. Insofar as it is possible, it is always good to look at representations such as this to help understand the identity of the clusters found by the algorithm. It is instructive to see where the cluster centers lie in spectral space.
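A table of cluster mean vectors like the one on this slide is straightforward to compute once each pixel has a cluster label. As a small sketch (the function name is an assumption, not the MultiSpec implementation):

```python
import numpy as np

def cluster_mean_vectors(pixels, labels, k):
    """Mean spectral vector of each cluster: one row per cluster,
    one column per spectral band. Plotting each row against band
    wavelength gives the curves compared with spectral reflectance."""
    return np.array([pixels[labels == j].mean(axis=0) for j in range(k)])
```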
While we can't envisage the full five-dimensional spectral space, we can look at two-dimensional scatter plots using the two bands most significant to vegetation, the near infrared and the visible red bands, as seen here. The final cluster centers represent well the scatter of the pixel vectors.

Here we summarize the essential elements of the ISODATA algorithm and how clustering is used for unsupervised classification. The first question here should allow you to develop a feeling for the importance of the placement of the initial cluster centers.