In this lecture, we look at how to do clustering with very big data sets, including hyperspectral images. We are now in the era of big data, particularly with the recording and storage of many high-volume image datasets. In 2014, NASA was managing more than nine petabytes of data, with about 6.4 terabytes a day being added. That represents an unbelievably large number of images. How do clustering techniques cope with such large amounts of data? If we want to apply clustering techniques to large images for unsupervised classification, or use clustering to extract information from archived datasets, a process called data mining, then the methods for clustering we developed in the last couple of lectures are limited in value. In this lecture, we will look at a recent clustering approach that is suitable for big datasets. There are others as well, but the one we look at here illustrates the types of method now being explored for use on so-called big data.

Before doing that, consider the time demand of the k-means algorithm. For P pixels, C clusters and I iterations, the k-means algorithm requires PCI distance calculations. For N bands, each distance calculation involves N multiplications, giving a total of PCIN multiplications to complete a k-means clustering exercise. For a 1000 by 1000 pixel image segment, involving 200 bands and searching for 15 clusters, then if 100 iterations were required, 30 × 10^10 (that is, 3 × 10^11) multiplications are needed. How can we devise an approach to clustering that is much faster and able to cope effectively with large images?

In searching for an improved technique, the k-means technique should not be abandoned. Its simplicity means it is still used with big datasets. But to make it more suitable, there are several alternatives that should be examined. First, and simplest, is to use a more powerful computer; from an operational point of view, though, it is important to note that most remote sensing practitioners would want to use readily available rather than specialized computer hardware. Secondly, a better method for initiating the cluster centers might be found that helps speed up convergence by reducing the number of iterations needed; several methods have been suggested for that in the big data context. Thirdly, a multiprocessor or multicore machine could be used to speed up the computation by taking advantage of parallel calculations. However, steps need to be taken to parallelize the k-means algorithm, which, because of its iterative nature, requires some innovative modifications. Finally, a more efficient version of the k-means algorithm might be possible. We will examine one such technique here. It significantly reduces the time required to undertake clustering and to allocate a pixel to a cluster class. In a sense, it is a particular case of the third point.

The method we will look at for fast clustering is called k-trees. We met decision trees in the context of the support vector machine, but now we want to look at them more generally. We start with some nomenclature. Trees consist of nodes, linked by branches. The uppermost node is called the root, and the lowermost nodes are called leaf nodes. In between, there are internal nodes. The nodes are arranged in layers, as shown. Progression of a pixel down the tree is based on decisions at the nodes; those decisions direct the pixel into one of the available branches. In the k-trees algorithm, we allocate leaf nodes to the clusters that we are trying to find, both in number and position in the spectral space, although we don't know beforehand how many there will be.
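To make that nomenclature concrete, here is one minimal way such a tree might be laid out in code. The layout and the helper function are illustrative assumptions for this lecture, not code taken from the k-trees paper.

```python
import numpy as np

# An internal node (including the root) holds one mean vector per branch,
# together with the child nodes those branches lead to.  A leaf node holds
# the pixel vectors of the cluster it represents.  The tree order m caps
# how many entries (means or samples) any node may hold.
internal_node = {
    "means":    [],   # one mean vector per branch leaving this node
    "children": [],   # the nodes those branches lead to
}
leaf_node = {
    "samples":  [],   # pixel vectors of the cluster this leaf represents
}

def decide(node, x):
    """Decision at a node: index of the branch whose mean vector is closest to x."""
    return int(np.argmin([np.linalg.norm(x - mu) for mu in node["means"]]))
```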
Some authors use the leaf nodes to represent the individual pixels within the clusters, with the clusters themselves being the internal nodes in the layer directly above. That does not help in developing the algorithm and just adds an unnecessary complication. The k-trees algorithm has one parameter that the user has to specify beforehand. It is called the tree order, and it specifies the maximum number of pixels in a cluster and, as we will see in the following slides, the maximum population of any node in the tree. Full details of the k-trees algorithm will be found in this paper by Geva. It is a little hard to understand in the remote sensing context, since it is written in the language of computer science. We will develop the algorithm by example, using a simple two-dimensional set of data and using remote sensing terminology.

We will use the set of eight vector samples shown here, in vector and diagram form, and choose a tree order of three. Specification of the order controls the structure of the tree, as we will see. The tree starts with a single root node and a single leaf node. We then feed in the first sample, say sample C; feeding in is also called insertion. Since we have no other information, the sample simply flows down to the leaf node, as does the second sample, A, shown on the right-hand side of the slide. We will use black letters to indicate samples of current interest and red letters to indicate samples which have already been fed into the tree. A third sample, say G, can be accommodated, but it fills the leaf node, since we have specified a tree order of three. A fourth sample, say D, cannot be accommodated in the current tree, because the leaf node cannot contain more than three samples by design. That leaf node has to be split. The k-trees algorithm does the split by performing a k-means clustering of the four samples, as on the next slide. The k-means clustering in the k-trees approach always looks for two classes, so that the overfull leaf node is split into two new leaf nodes. In this example, the vectors A, C, D and G have to be allocated to two clusters. We show that process in the slide. Because of the initial choice of cluster centers, the algorithm converges in one iteration for this very simple example. When complete, the two clusters have the mean vectors indicated on the slide.

The mean vectors from the clustering step of the previous slide become the attributes, or members, of the root node, and the original leaf is split into the two clusters, as shown in the left-hand diagram. The tree now has capacity to absorb more pixels, so our fifth sample, F, can be inserted as shown on the right-hand side of the slide. That pixel is checked against the two mean vectors in the root node and is seen to be closest to m_CD, so it is allocated to the left-hand leaf node. As in the previous slide, when pixel B is inserted, it is checked against the mean vectors held in the root node and found to be closer to m_AG, as a result of which it is allocated to the right-hand leaf node. The mean m_ABG is calculated and used in place of m_AG in the root node.
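As an aside, the insertion mechanics just described can be sketched in a few lines of code. The sketch below is illustrative only: it handles just the two-level tree of this worked example (splits are not propagated above the root), the data layout and names are assumptions rather than code from Geva's paper, and scikit-learn's KMeans stands in for the two-class clustering step.

```python
import numpy as np
from sklearn.cluster import KMeans

M = 3  # tree order: maximum number of entries in any node

def insert(root, x):
    """Insert sample x into a two-level tree: a root node above a row of leaves.

    root: {"means": [mean vector per leaf], "children": [{"samples": [...]}, ...]}
    An overfull leaf is split with a two-class k-means, as in the worked example.
    """
    # decision at the root: send x to the leaf whose mean vector is closest
    i = int(np.argmin([np.linalg.norm(x - mu) for mu in root["means"]]))
    leaf = root["children"][i]
    leaf["samples"].append(x)
    root["means"][i] = np.mean(leaf["samples"], axis=0)   # e.g. m_AG becomes m_ABG

    if len(leaf["samples"]) > M:                          # leaf is overfull: split it
        data = np.vstack(leaf["samples"])
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
        new_leaves = [{"samples": [v for v in data[km.labels_ == c]]} for c in (0, 1)]
        root["children"][i:i + 1] = new_leaves            # one leaf becomes two
        root["means"][i:i + 1] = list(km.cluster_centers_)

# e.g. start the tree with the first sample c in a single leaf (c, a, g, ... are
# hypothetical two-dimensional numpy vectors standing in for the slide values):
# root = {"means": [c], "children": [{"samples": [c]}]}
# for x in (a, g, d, f, b): insert(root, x)
```

Seeding the single starting leaf with sample C and then inserting A, G, D, F and B in turn mirrors the slides so far: the fourth sample triggers the first split, and the mean held in the root node is updated each time a sample joins a leaf.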
But when we try to insert the seventh sample, H, the capacity of the left-hand leaf node is exceeded and that node has to be split again, using the k-means algorithm. We are looking to separate the pixels C, D, F and H into two clusters using the k-means approach. Rather than go through the exercise, we assume for simplicity that the solution shown in the diagram here has been found; the means of the new clusters are shown. This leads to the tree now having three leaf nodes, as seen on the next slide. The root node now contains three mean vectors and has reached its capacity. If any further leaf nodes are added, then the root node will have to be split.

Consider the insertion of the final pattern, E, into the tree. On entering the root node, it is seen to be closest to the mean m_ABG, so it should be placed in the bottom right-hand leaf, as indicated. But since that exceeds that leaf's capacity, the leaf has to be broken into two separate leaves using k-means clustering, as on the next slide. For simplicity, we assume that the k-means algorithm has found the clusters illustrated here in the diagram to the left, with the mean vectors indicated. The tree now has four leaf nodes, as shown on the next slide. But the root node now needs to be split into two, because its capacity has been exceeded. The two new nodes resulting from the split will be internal nodes, and a new root node is created. We split the root node using the k-means approach; the elements now to be clustered are the mean vectors stored in the root node. Again, assume the results of the k-means clustering are as shown here. The new means m_CDFH and m_ABEG are now computed.

Since all pixels have now been fed into the tree, we have its final version, as seen here. It has three layers, with two internal nodes and four leaf nodes. Any pixel vector fed into the top of the tree will make its way down to one of the clusters via the decisions, that is, the distance comparisons, at the root and internal nodes. For example, the vector (3,3) will flow through, as shown by the dotted green line, into the cluster containing H.

Apart from [inaudible], there are two things we want to know about clustering algorithms. First, how long does it take to build the tree? Secondly, especially with unsupervised classification in mind, how quick is it at allocating unseen data to a cluster? If we look at the speed of allocation first, we can do so by counting the number of distance comparisons. In the simple case here, the k-trees and the equivalent k-means approach require the same number of comparisons. But what about with bigger data sets? If we take the simplest case of each node in a k-tree requiring two distance comparisons, the number of comparisons increases by two for each newly added layer, which in this case also doubles the number of clusters. By contrast, the number of distance comparisons for the k-means algorithm goes up as powers of two, so for large numbers of clusters the k-trees algorithm is much faster when allocating an unseen sample to an existing cluster.
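That allocation step is easy to picture in code. The sketch below is again illustrative only, with the tree held as nested dictionaries in the layout assumed earlier; it simply walks a pixel down the tree and counts the distance comparisons made along the way.

```python
import numpy as np

def allocate(node, x):
    """Send an unseen pixel vector x down a finished k-tree to a leaf cluster.

    Each internal node is assumed to be a dict:
        {"means": [mean vectors, one per child], "children": [child nodes]}
    and each leaf a dict:
        {"samples": [pixel vectors of the cluster]}
    Returns the leaf reached and the number of distance comparisons made.
    """
    comparisons = 0
    while "children" in node:
        # decision at this node: follow the branch with the closest mean vector
        distances = [np.linalg.norm(x - mu) for mu in node["means"]]
        comparisons += len(distances)
        node = node["children"][int(np.argmin(distances))]
    return node, comparisons
```

For the finished tree of the example, with four clusters, this gives two comparisons at the root and two more at an internal node, four in all, the same as k-means with four clusters; the advantage appears only as the number of clusters grows.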
Getting a meaningful comparison of the times needed to build the k-tree and to run the k-means approach is not straightforward. We can make comments on the numbers of nodes to be built and the checks within them, but the complexity introduced by the effect of different tree orders makes meaningful theoretical comparisons difficult. We will use an example instead, as in the next slide. This example is taken from the paper cited here, also by Geva. The data to be clustered consisted of vector samples from a three-dimensional normal distribution with zero mean and unity variance. A range of sample sizes was used, starting at 1,000 and progressing to 256,000 by successive doublings. The k-tree order was chosen as m = 50. For comparison, a k-means clustering was run on the same set of samples. It was initiated with the same number of clusters as found by the k-trees approach on the same data set. Ten sets of tests were done, with the results averaged. Here we see the comparison, which is quite compelling: for a given number of samples to be clustered, the k-trees approach is much faster than the k-means algorithm, as seen.

As always, there is a trade-off. Geva found that the k-trees clustering was not quite as accurate as the k-means approach. Given that k-means is an iterative procedure which involves all pixels at all stages, while k-trees makes a single pass per sample and segments the data during learning through its branching structure, that is not surprising. But in general, Geva found the difference not to be a problem in most practical situations, especially given the speed benefit of k-trees. Another significant factor in favor of the k-trees approach is that it can be adapted to run on multi-core processors and does not require all samples to be held in core memory during clustering. Those additional developments are new and are contained in the paper by Woodley and others cited in this slide.

We are now at the end of our lectures on clustering. It would be good to compare this summary with those of the previous three lectures, in order to reinforce overall the important aspects of clustering, especially when used as a tool for unsupervised classification. The first two questions here are important to the development of classification methodologies, which we will pursue in module three. The last question highlights a particular benefit of clustering with the k-trees approach.