[MUSIC] In this video, we will learn to create indexing structures for efficient image retrieval. As we already know, searching through a collection of images means computing an image descriptor for a given query image and retrieving the images whose descriptors lie close to the computed query vector.

The GIST descriptor is not that large a descriptor to work with. If we choose a 4x4 grid with 8 orientations and 4 scales, a typical design choice, we end up with a descriptor 2 kilobytes in length, and a database of descriptors for a collection of 1 million images takes 2 gigabytes of RAM, which easily fits on a modern computer. If, however, we describe the set of key points in each image, things get much worse. Each point is described with a simpler SIFT descriptor with fewer parameters, but every image has multiple key points, leading to a minimum of 128 kilobytes of descriptors per image. This creates memory pressure of 512 GB of RAM for 1 million images, which would not easily fit even on a modern server.

Retrieval performance is another issue, since a search query should finish within a fraction of a second to keep the user satisfied with the search. Linear search, which requires billions of comparisons, makes this performance infeasible. To speed up image retrieval, one naturally wants to organize the precomputed descriptors, represented as points in Euclidean space, into an efficient data structure that favors retrieval of the nearest neighboring points.

The main way to create approximate descriptors is vector quantization. We map the set of all vectors to cluster centroids, which is equivalent to representing every object with the number of its cluster. The simplest possible way to perform this clustering is to split the space of descriptors with a uniform grid along each dimension. Yet this is not optimal for image descriptors, which form natural clusters of images with similar content.
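The memory figures above can be checked with a quick back-of-the-envelope calculation. This is a minimal sketch of that arithmetic; the per-image keypoint count and the 4-byte float storage are assumptions, not stated in the lecture.

```python
# GIST: 4x4 grid x 8 orientations x 4 scales = 512 values per image.
gist_dims = 4 * 4 * 8 * 4
gist_bytes = gist_dims * 4            # assuming 4-byte floats -> 2 KB per image
db_gist_bytes = gist_bytes * 1_000_000  # ~2 GB for a million images

# SIFT: 128 dimensions per keypoint; the number of keypoints per image
# is an assumption chosen to reproduce the 512 GB figure from the lecture.
sift_bytes = 128 * 4                  # one SIFT descriptor as 4-byte floats
keypoints_per_image = 1000            # assumed
db_sift_bytes = sift_bytes * keypoints_per_image * 1_000_000  # ~512 GB
```

With these assumptions, `db_gist_bytes` is about 2 GB while `db_sift_bytes` is about 512 GB, matching the lecture's contrast between global and local descriptors.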
Thus, some kind of adaptive quantization, such as k-means quantization, is required. K-means quantization splits the space of descriptors into a tessellation using the well-known k-means clustering procedure. After we have clustered the space of descriptors into groups, it is convenient to represent our index in inverted form: we represent the set of descriptors as K lists, where list i contains the images quantized into the i-th cluster. The basic scheme of querying the index proceeds as follows. We quantize the query image and find the cluster nearest to it. Next, we retrieve all images in that cluster as approximate nearest neighbours and sort them according to their Euclidean distance to the query image.

Another way to quantize image descriptors is through hashing. The idea of semantic hashing is to map image descriptors from Euclidean space to the space of binary codes, that is, strings of zeros and ones, in such a way that descriptors close in Euclidean distance map to codes close in Hamming distance. This implies that, for instance, the 100 vectors closest to the query in Euclidean space would yield the 100 binary codes closest to the query code in Hamming distance.

One of the most straightforward algorithms for semantic hashing is Locality-Sensitive Hashing, or LSH. First, we select a random hyperplane, that is, a random direction in the space of all descriptors, and project all descriptors in the image collection onto that direction. Next, we choose the hyperplane threshold, or intercept, close to the median of all projections. As each image lands on one side of the hyperplane or the other, we mark it with a one or a zero according to the sign of the dot product between the descriptor vector and the hyperplane normal. If we add more random projections, we stochastically decompose the Euclidean space into closed or half-closed regions, or buckets.
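The random-hyperplane procedure just described can be sketched in a few lines of NumPy. This is a toy illustration, not the lecture's implementation; the descriptor dimensionality, code length, and data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 64, 8                        # toy descriptor size and code length
descriptors = rng.normal(size=(1000, dim))  # stand-in for an image collection

# One random hyperplane (direction) per bit of the code.
normals = rng.normal(size=(n_bits, dim))
projections = descriptors @ normals.T       # shape (1000, n_bits)

# Threshold each projection near its median, so each bit splits the
# collection roughly in half; the sign of the offset gives the bit.
thresholds = np.median(projections, axis=0)
codes = (projections > thresholds).astype(np.uint8)

def hamming(a, b):
    """Hamming distance between two binary codes."""
    return int(np.count_nonzero(a != b))
```

Each row of `codes` is the bucket label of one image; images whose codes agree in all bits fell on the same side of every hyperplane.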
Points in each bucket are marked with a binary code, and in the limit the Hamming distance between these binary codes approximates the Euclidean distance between the vectors. The LSH algorithm hashes input items so that similar items map to the same buckets with high probability. For instance, items in the blue bucket in the slide are marked with the binary code 101; those in adjacent buckets differ by one bit only, those placed further away differ by two bits, and so on.

K-means and locality-sensitive hashing both have their strong and weak points. K-means, while being an adaptive approach well suited for clustering, suffers from performance issues that effectively limit the number of clusters one may want to compute. For LSH to achieve the accuracy provided by k-means, it would require too many bits of encoding. This is why k-means and LSH are used in combination with one another.

An intuitive extension of k-means and LSH is to combine them into a more efficient data structure for search. Instead of quantizing the original vectors, we quantize the difference between each vector and its associated cluster centroid; that is, we describe the location of a vector within the cluster computed by k-means. We first apply k-means clustering to decompose the entire input set of descriptors into clusters, and then apply the semantic hashing procedure to each cluster independently to approximate the difference between each descriptor and the centroid of the cluster it was quantized to.

We now arrive at a practical algorithm for indexing large image collections, called GIST IS, for GIST Indexing Structure. We first compute image descriptors, such as neural codes or GIST, for every image in the database. Next, we cluster these descriptors into a large number of clusters, say 20,000. After clustering, we apply locality-sensitive hashing to compute 512-bit binary codes encoding the difference between the cluster centroids and the original image descriptors.
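The combined indexing scheme can be sketched end to end: cluster the descriptors with k-means, hash each descriptor's residual (vector minus centroid) with random hyperplanes, and store the codes in an inverted index. This is a toy sketch under assumed sizes (500 descriptors, 8 clusters, 16-bit codes), with a minimal Lloyd's k-means standing in for a production clustering routine.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 64))    # toy descriptor set

def kmeans(X, k, iters=10):
    # Minimal Lloyd's k-means: assign to nearest centroid, recompute means.
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(0)
    return centroids, labels

k, n_bits = 8, 16
centroids, labels = kmeans(X, k)

# LSH on the residuals: encode each vector's offset from its centroid.
normals = rng.normal(size=(n_bits, X.shape[1]))
residuals = X - centroids[labels]
codes = (residuals @ normals.T > 0).astype(np.uint8)

# Inverted index: cluster id -> image ids quantized into that cluster.
index = {j: list(np.flatnonzero(labels == j)) for j in range(k)}
```

Hashing residuals rather than the raw vectors means the binary code only has to describe a position inside one cluster, which is the intuition behind combining the two methods.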
We may choose to store only a 64-bit image identifier along with the 512-bit binary code in RAM, loading the full descriptors off the disk when required. The search procedure in such an indexing structure first quantizes the query descriptor and picks a few hundred closest clusters. In practice, this selects around 2.5% of the images, instead of the ideal 1% we would pick if the distribution over clusters were uniform. Having the clusters selected, we compute the binary code for the difference between the query and each of the cluster centers and search through the binary codes stored in memory for each cluster. We select the images whose binary codes differ from the query in fewer than, for example, 220 bits, which would leave us with approximately 5% of the images. In the end, we return the images sorted by Hamming distance or, alternatively, we may load the original neural codes off the disk and rerank the search results by the original Euclidean distance between the query and the selected images, which would be fast.

In this episode, we have described the indexing procedures required for CBIR systems to be effective and scale to billions of images. The main way to do that is through vector quantization, which provides adaptive indexing if we use k-means, for example, but slows down for huge image collections. We can also use semantic hashing, for instance LSH, among other methods that efficiently compute binary codes but are better suited to more uniform vector distributions. A reasonable trade-off is to combine k-means and semantic hashing into a structure such as GIST IS that favors quick retrieval in huge databases. [MUSIC]
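The query-time procedure described above can be sketched as follows: quantize the query, probe the nearest clusters, filter candidates by Hamming distance on the residual codes, then rerank survivors by exact Euclidean distance. This is a self-contained toy (200 vectors, 4 clusters, 16-bit codes, illustrative thresholds), not the GIST IS implementation; in the real system the vectors used for reranking would be loaded off disk.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 32))    # toy descriptor database

# Build a tiny index: assign each vector to its nearest of 4 centroids
# and encode its residual with 16 random hyperplanes.
centroids = X[rng.choice(len(X), 4, replace=False)]
labels = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)
normals = rng.normal(size=(16, 32))
codes = ((X - centroids[labels]) @ normals.T > 0).astype(np.uint8)
buckets = {j: np.flatnonzero(labels == j) for j in range(4)}

def search(query, n_probe=2, max_bits=8, top=5):
    # Probe the n_probe clusters nearest to the query.
    order = np.argsort(((centroids - query) ** 2).sum(1))[:n_probe]
    hits = []
    for j in order:
        # Query code relative to this cluster's centroid.
        q_code = ((query - centroids[j]) @ normals.T > 0).astype(np.uint8)
        for i in buckets[j]:
            # Hamming filter on the in-memory binary codes.
            if np.count_nonzero(codes[i] != q_code) <= max_bits:
                hits.append((int(i), float(((X[i] - query) ** 2).sum())))
    # Rerank the surviving candidates by exact Euclidean distance.
    hits.sort(key=lambda t: t[1])
    return [i for i, _ in hits[:top]]

results = search(X[0])    # a database vector should retrieve itself first
```

The two thresholds play the same roles as in the lecture: `n_probe` controls what fraction of the collection is scanned, and `max_bits` controls how many candidates survive to the reranking step.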