[MUSIC] From the previous lecture, you studied on powerful functionalities of DBMS. At the same time, it was clearly noted that conventional Relational DBMS is not designed for complex data types, such as spatial data. In order to fully utilize the proposed framework, you should have a firm understanding of Spatial Database Management System, which is the topic of this lecture. [SOUND] Now, let's start with the question, what is Spatial Database Management System? In other words, what is the difference of Spatial DBMS from conventional Relational DBMS? Literally speaking,Spatial DBMS is designed for handling spatial data, such as vector data and raster data for handling spatial data, it needs Object Relational DBMS, in which abstract data type, in other words, user defined new data types can be managed. Because spatial data is saved so that spatial DBMS needs spatial query language equipped with spatial operations, and spatial indexing for query optimization and so on. So Spatial DBMS can be defined as a DBMS with additional capability of handling spatial data. Now, let's think about why conventional Relational DBMS cannot deal with spatial data? In fact, Relational DBMS can handle spatial data, but the problem is it is not efficient. The figure shows the limitation. The cadastral map can be modeled and the data can be saved in Relational DBMS. However, the spatial components- polygon, edge, and point should be saved in three different tables, which makes simple spatial operation very complicated. Because all the three tables should be visited using time-consuming, expensive, table join operations. So we needed another way to deal with spatial data. One simple solution was so-called 'dual architecture', in which Relational DBMS manage only attribute data, and a separate file system is used for spatial data. It can overcome the inefficiency of Relational DBMS, however, the dual architecture cannot provide powerful DBMS features, such as transaction management, and many others. It could be used for only single user applications. When Object Relational DBMS was introduced, and it can deal with Abstract Data Types in 1990s, spatial data can be tightly integrated with DBMS. In other word, column data type can be Polygon, Line, or Point. As a result, full functionalities of DBMS can be supported with this architecture. [SOUND] I mentioned object Relational DBMS quite a few times already. Let's discuss more about what it is, in more detail. ORDBMS can be considered as a middle ground to bridge the gap between Relational DBMS and the object-oriented programming, which supports three functionalities, (1) Complex data with user-defined class, (2) type inheritance, (3) object behavior with method ORDBMS present the best of the two. Consequently, it can deal with spatial data. Among many ORDBMS, objective relational DBMS, throughout this course, I will make use of PostgreSQL, and its extension for spatial data type - PostGIS, which are open source software's discussed in the second week. The example is showing how to create user defined data type, ADDRESS is the case, which is composed of the Road, City, and Zip code. And then using SQL, we can create a table named Restaurant, which has the column ADDRESS. The final SQL actually inserts a restaurant named Spiagia to Restaurant table. As you can see, ORDBMS can store a complex data type in table, which was not possible in conventional Relational DBMS. The figure shows a series of processing in SQL window of PostGIS to create the data type, create a table, and insert a record, which was described just in the previous slide. [SOUND] Now we have a method to store spatial data in DBMS, the next issue is how to retrieve spatial data from DBMS? As we studied it, for the retrieval process, SQL is also used. That means that you need a SQL with spatial operations for spatial data. SQL-3, which is a standard SQL established 1999, allows to support spatial data type, such as Point, Curve, Surface, Geometry Collection, and operations, such as spatial reference, Envelope, some Boolean operations of topology, and spatial functions, such as distance, buffer, intersection, and many others. [SOUND] Join operation is to connect two tables based on correspondence check on common field. Join is a key operation, at the same time, a very expensive operation in DBMS query processing. Actually, indexing is to basically speed up join operation. Likewise, spatial join to connect two tables based on spatial relationship is also very powerful, at the same time, expensive, in other words, time-consuming operation. Instead of correspondence or exact matching in regular Join operation, spatial join is based on spatial relationships, such as intersect, contain, covers, and so on. Let's take a look at an example of spatial join. The query is, what subway stations are located in Little Italy in New York City? For the query, we need two tables. New York City Neighborhood Table, which is based on polygon, and subway table, based on point. The query sets two variables, N and S, to point to the two tables, and then find Little Italy from New York City neighborhood table, and find the subway stations of which location is “WITHIN” Little Italy, then finally, return Little Italy and subway station names. Please note that for spatial relationship, a Boolean function ST_Within is used for spatial join here. The figures on the side illustrate the query processing and the result of the query in table view and map view. I brought in another example of spatial join. The previous example was spatial join between polygon and point, this time it is between polygon and line. The query is, what are the roads which cross the boundary of Corona neighborhood? The SQL query is similar to the previous one, but this time the spatial relationship is defined by a Boolean function ST_Crosses. From all the streets of New York City, the query retrieves only street which meet the boundary of Corona neighborhood. And now, you are looking at the query result in table view and map view. Now let's think of the same query with different dataset. The query is, what are the roads which cross the boundary of Dane County, Wisconsin? For the query, we are using county map and road network of the whole country. For the query, all the road from Alaska to Florida should be checked if they meet the boundary of Dane County. How about that? Would it work? Yes, it would, however, it will take lot of time. [SOUND] That is why spatial indexing is required in Spatial DBMS. For efficient query processing of the previous spatial join, what if we have to check the roads only around Dane County in Wisconsin? The query processing can be done very quickly. Spatial indexing is to provide a better search performance in spatial context. For that, R-tree, which is an extension of B-tree was introduced. There are many other spatial indexing methods, but R-tree is considered as the standard. Additionally, it's worth mentioning, Geohash, which is a powerful method for spatial data searching and organization, which is going to be used in spatial big data. The figure is composed of points and rectangles. For the given spatial data, you can apply R-tree based on MBR, which stands for minimum bounding rectangles. You're looking at the given spatial data sets, which are organized by two different levels of MBR, m, n, o, and p, represent 4 higher level MBRs, and each MBR has three or 4 lower level MBRs. For example, m has a, b, c, d as lower level MBR. The hierarchical structure can be transferred to tree structure, as you can see now, which is R-tree. Similarly to B+tree, it stores data sets on the leaf nodes, and it is a balanced tree. Quadtree is another spatial data indexing method. As you can see the example, Quadtree partitions 2D space into four quadrants, recursively, and saves the data to the end nodes. The figure shows the corresponding Point-Region Quadtree, each node has exactly four child nodes, and the data is saved to the end nodes. Which one do you think is better than the other? R-tree or Quadtree? Of course it depends on your spatial data sets. Generally speaking, if you have unevenly distributed data sets, R-tree would be better, because it can balance the shape of the tree structure. On the other hand, if you have evenly distributed point data, Quadtree would be a very good choice. Now let's take a look at real example of applying spatial data indexing and its performance. The given query is, what are the name of the roads in Manhattan that is more than 30 feets width? In which ST_Contain function is used for spatial relationship for the spatial join. Without indexing on the left and with indexing on the right. The difference of the query performance is 47 seconds versus 1 second. It is a significant improvement. What if we deal with the road networks of the whole country, the US. The difference will be obviously phenomenal. [SOUND] Now let's see the difference between GIS and SDBMS again. We discussed it in the second week. GIS has a variety of spatial data handling capability, but only to a certain degree. SDBMS is basically designed for spatial data management, with all the wonderful features inherited from conventional Relational DBMS. [SOUND] In this lecture, we started with the definition of Spatial DBMS and Object Relational DBMS was introduced as a DBMS solution for complex data types, such as spatial data. SQL for spatial data and spatial join was discussed. Finally, two different spatial indexing methods were introduced, and performance enhancement with real example was given for your understanding of the value of indexing method. Alright, this is the end of this lecture. See you all in the next lecture.