Welcome back. In this video, we will provide a quick summary of the main points from our last course on big data modeling and management. If you have just completed our second course and do not need a refresher, you may skip to the next lecture. After this video, you will be able to recall why big data modeling and management is essential in preparing to gain insights from your data, summarize different kinds of data models, describe streaming data and the different challenges it presents, and explain the differences between a database management system and a big data management system.

In the second course, we described a data model as a specification that precisely characterizes the structure of the data, the operations on the data, and the constraints that may apply to the data. For example, a data model may state that the data is structured like a two-dimensional array, or matrix. For this structure, one may have a data access operation which, given an index into the array, returns the cell of the array that the index refers to. A data model may also specify constraints on the data. For example, while a whole data set may have many arrays, the name of each array must be unique, and the values of a specific array must always be greater than zero.

Database management systems handle low-level data management operations, help organize the data using a data model, and provide open, programmable access to the data.

We covered a number of data models, four of which were discussed in more detail. The relational data model is, to date, the most widely used data model. Here, data is structured into tables, which are formally called relations. The relational data model has been implemented in traditional database systems, but it is also being freshly implemented in modern data systems built over Hadoop and Spark and is being deployed on cloud platforms.
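To make the three parts of a data model concrete, here is a minimal sketch in Python. All class and method names are hypothetical, invented for illustration: the structure is a set of named two-dimensional arrays, the operation is indexed access to a cell, and the constraints are the two stated in the lecture (unique array names, values greater than zero).

```python
# Hypothetical sketch of a data model: structure, operations, constraints.

class ArrayDataset:
    """A data set of named two-dimensional arrays (the structure)."""

    def __init__(self):
        self._arrays = {}

    def add_array(self, name, rows):
        # Constraint 1: the name of each array must be unique.
        if name in self._arrays:
            raise ValueError(f"array name {name!r} is not unique")
        # Constraint 2: all values in the array must be greater than zero.
        if any(v <= 0 for row in rows for v in row):
            raise ValueError("all values must be greater than zero")
        self._arrays[name] = rows

    def get(self, name, i, j):
        """The access operation: given an index, return the cell it refers to."""
        return self._arrays[name][i][j]

ds = ArrayDataset()
ds.add_array("temperatures", [[21.5, 22.0], [19.8, 20.1]])
print(ds.get("temperatures", 0, 1))  # prints 22.0
```

A real database management system implements the same three ingredients, but at scale and with low-level storage management handled for you.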
The second category of data gaining popularity is semi-structured data, which includes documents like HTML pages, XML data, and JSON data that are used by many Internet applications. This data can have one element nested or embedded within another data element, and hence can often be modeled as a tree.

The third category of data models is graph data. A graph is a network in which nodes represent entities and edges represent relationships between pairs of such entities. For example, in a social network, nodes may represent users and edges may represent their friendships. The operations performed on graph data include traversing the network, so that one can find a friend of a friend of a friend if needed.

In contrast to the previous three models, where there is a structure to the data, text data is much more unstructured, because an entire data item, like a news article, can be just a text string. However, text is the primary form of data in information retrieval systems, or search engines, like Google.

We also discussed streaming data, or data with velocity, as a special class of data that continually comes into the system at some data rate. Examples can be found in data coming from road sensors that measure traffic patterns, or in stock price data that may come in volume from stock exchanges all over the world. Streaming data is special because a stream is technically an infinite data source: it keeps filling up memory and storage and will eventually exceed the capacity of any system. Streaming data therefore needs a different kind of management system. For this reason, streaming data is processed in memory, in chunks that are also called windows. Often, only the necessary part of the data stream, or the results of queries against the data stream, is stored. A typical type of query against streaming data is an alert or notification: the system notices an event, such as multiple stock prices changing within a short time.
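The windowed alert idea can be sketched as follows. This is a simplified illustration, not a real streaming engine: the window size and alert threshold are made-up values, and only the alert (the query result) is kept, never the raw stream.

```python
# Sketch of window-based stream processing with an alert query.
# WINDOW_SIZE and ALERT_THRESHOLD are illustrative, assumed values.

from collections import deque

WINDOW_SIZE = 4       # events kept in memory at once (the "chunk")
ALERT_THRESHOLD = 3   # price changes per window that trigger an alert

def alerts(events, window_size=WINDOW_SIZE, threshold=ALERT_THRESHOLD):
    window = deque(maxlen=window_size)   # the in-memory window of the stream
    for symbol, change in events:        # the stream is potentially infinite
        window.append((symbol, change))
        changed = [s for s, c in window if c != 0]
        if len(changed) >= threshold:
            # Store/emit only the query result, then discard the raw data.
            yield f"alert: {len(changed)} price changes in window"
            window.clear()

stream = [("AAPL", 1.2), ("GOOG", 0.0), ("MSFT", -0.5), ("IBM", 2.1)]
for a in alerts(stream):
    print(a)  # prints: alert: 3 price changes in window
```

Because the window is bounded, memory use stays constant no matter how long the stream runs, which is exactly why streaming systems process data this way.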
Streaming data is also used for prediction. For instance, based on wind direction and temperature data streams, one can predict how a wildfire is going to spread.

In the last course, we also covered a number of data systems that we called big data management systems. These systems use different data models and have different capabilities, but they are characterized by some common features. They are designed from the start for parallel and distributed processing. Most of them implement data partition parallelism, which, if you recall, refers to the process of segmenting the data across multiple machines so that data retrieval and manipulation can be performed in parallel on these machines. Many of these systems allow a large number of users to constantly update and query the system. Some of the systems do not maintain transactional consistency with every update; that means not all the machines are guaranteed to have every update at every moment. However, most of them provide a guarantee of eventual consistency, which means all the machines will receive all the updates sooner or later, trading some immediate accuracy for response time.

The third common characteristic of big data management systems is that they are often built on top of a Hadoop-like platform that provides automatic replication and a map-reduce-style processing ability. Some of the data operations performed within these systems make use of these lower-level capabilities.

After this refresher on data modeling and management, let's start big data integration and processing.
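Data partition parallelism can be sketched in a few lines. This is an in-process toy, not a distributed system: the "machines" are just lists, the records and the hash-based partitioning function are invented for illustration, and thread workers stand in for the parallel execution a real cluster would provide.

```python
# Sketch of data partition parallelism: segment the data across partitions,
# then run the same query on every partition in parallel and combine results.

from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 3  # stand-ins for separate machines (assumed)

def partition_of(key):
    # Hash partitioning: each record is assigned to exactly one partition.
    return hash(key) % NUM_PARTITIONS

# Segment the data across the partitions.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for record in [("alice", 30), ("bob", 25), ("carol", 41), ("dave", 35)]:
    partitions[partition_of(record[0])].append(record)

def query(partition):
    # Each "machine" answers the same query on its own slice of the data.
    return [name for name, age in partition if age > 28]

with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
    results = [name for part in pool.map(query, partitions) for name in part]

print(sorted(results))  # prints ['alice', 'carol', 'dave']
```

Note that the combined result is the same regardless of how records land in partitions; that independence is what lets systems like Hadoop and Spark scale the query out across machines.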