In this video, we will provide a quick summary of the main points from our first three courses to recall what you have learned. If you have just completed our third course and do not need a refresher, you can skip to the next lecture.

We started our first course by explaining how a new torrent of big data, combined with cloud computing capabilities to process data anytime and anywhere, has been at the core of the launch of the big data era. Such capabilities enable, or present opportunities for, many dynamic data-driven applications, including energy management, smart cities, precision medicine, and smart manufacturing. These applications are increasingly data-driven, dynamic, and heterogeneous in terms of their technology needs. They are also more process-driven and need to be tackled using a collaborative approach by a team that values accountability and reproducibility of results. Overall, by modeling, managing, and integrating diverse data streams, we add value to our big data and improve our business even before we start analyzing it. Part of modeling and managing big data is focusing on its scalability dimensions and considering the challenges associated with each dimension in order to pick the right tools.

We also talked about the characteristics of big data, referred to as the Vs: volume, variety, velocity, veracity, and valence. Each of these Vs presents a challenging dimension of big data, namely size, complexity, speed, quality, and connectedness. We also added a sixth V, value, referring to the real reason we are interested in big data. To turn big data into an advantage in the context of a problem using data science techniques, it needs to be analyzed. We explained a five-step process for data science that includes data acquisition, modeling, management, integration, and analysis. The influence of big data pushes for alternative scalability approaches at each step of this process. If we focus just on the scalability challenges related to the three Vs, we can say that big data has varying volume and velocity, requiring dynamic and scalable batch and stream processing, and that big data has variety, requiring the management of data in many different data systems and its integration at scale.

In our introduction to big data course, we talked about a layer diagram for the tools in the Hadoop ecosystem, organized vertically based on their interfaces: low-level interfaces for storage and scheduling at the bottom, and high-level languages and interactivity at the top. Most of the tools in the Hadoop ecosystem were initially built to complement Hadoop's capabilities for distributed file system management using HDFS, data processing using the MapReduce engine, and resource scheduling and negotiation using the YARN engine. Over time, a number of new projects were built either to add to these complementary tools or to handle additional types of big data management and processing not available in Hadoop, such as Spark. Arguably, the most important change to Hadoop over time was the separation of YARN from the MapReduce programming model to solely handle resource management concerns. This allowed Hadoop to be extended to different programming models and enabled the development of a number of processing engines for batch and stream processing. Another way to look at the vast number of tools that have been added to the Hadoop ecosystem is from the point of view of their functionality in the big data processing pipeline.
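As a quick reminder of the map-and-reduce processing style mentioned above, here is a minimal word-count sketch expressed with PySpark; it is an illustration only, and the HDFS paths are hypothetical placeholders rather than anything from the course materials.

```python
# Minimal sketch of MapReduce-style word count using PySpark RDDs.
# The input and output HDFS paths are hypothetical placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="WordCountSketch")

lines = sc.textFile("hdfs:///user/example/input.txt")   # hypothetical input path
counts = (lines.flatMap(lambda line: line.split())      # map: split lines into words
               .map(lambda word: (word, 1))             # map: emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))        # reduce: sum counts per word

counts.saveAsTextFile("hdfs:///user/example/output")    # hypothetical output path
sc.stop()
```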
Simply put, the tools in the ecosystem are associated with three distinct layers: one for data management and storage, one for data processing, and one for resource coordination and workflow management.

In our second course, we talked in detail about the bottom layer in this diagram, namely data management and storage. While this layer includes Hadoop's HDFS, there are a number of other systems that rely on HDFS as a file system or implement their own NoSQL storage option. Because big data can have a variety of structured, semi-structured, and unstructured formats and gets analyzed through a variety of tools, many tools were introduced to fit this variety of needs. We call these big data management systems. We reviewed Redis and Aerospike as key-value stores, where each data item is identified with a unique key. We also got some practical experience with Lucene and Gephi as vector and graph stores, respectively. We also talked about Vertica as a column-store database, where information is stored in columns rather than rows; Cassandra and HBase are also in this category. Finally, we introduced Solr and AsterixDB for managing unstructured and semi-structured text, and MongoDB as a document store.

The processing layer is where all these different types of data get retrieved, integrated, and analyzed, and it was the primary focus of our third course. By the integration and processing layer, we roughly refer to the tools that are built on top of HDFS and YARN, although some of them work with other storage and file systems. YARN is a significant enabler of many of these tools, making possible a number of batch and stream processing engines like Storm, Spark, Flink, and Beam. This layer also includes tools like Hive and Spark SQL for bringing a query interface on top of the storage layer, Pig for scripting simple big data pipelines using the MapReduce framework, and a number of specialized analytical libraries for machine learning and graph analytics. Giraph and Spark's GraphX are examples of such libraries for graph processing; Mahout, on top of the Hadoop stack, and Spark's MLlib are two options for machine learning. Although we gave a basic overview of graph processing and machine learning for big data analytics earlier in our second and third courses, we did not go into detail there. In this course, we will use Spark's MLlib as one of our two main tools and provide a deeper introduction to Spark's machine learning library.

The third and top layer in our diagram is the coordination and management layer. This is where integration, scheduling, coordination, and monitoring of applications across the many tools in the bottom two layers take place. This layer is also where the results of big data analysis get communicated to other programs, websites, visualization tools, and business intelligence tools. Workflow management systems help to develop automated solutions that can manage and coordinate the process of combining data management and analytical tasks in a big data pipeline as a configurable, structured set of steps. Workflow-driven thinking also matches the basic data science process that we reviewed before. Oozie is an example of a workflow scheduler that can interact with many of the tools in the integration and processing layer. ZooKeeper is the resource coordination tool that monitors, manages, and coordinates all of these tools, and it follows the ecosystem's animal-themed naming. Now that we have reviewed all three layers, we are ready to come back to the integration and processing layer, but this time in the context of machine learning.
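Since MLlib will be one of our two main tools, here is a minimal sketch of what a Spark MLlib pipeline can look like; it is an illustration only, and the file path, column names, and model choice are hypothetical assumptions rather than material taken from the course.

```python
# Minimal sketch of a Spark MLlib (DataFrame-based) pipeline.
# The CSV path, column names, and classifier choice are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.appName("MLlibSketch").getOrCreate()

# Load a table that is assumed to have numeric feature columns and a binary "label" column.
df = spark.read.csv("hdfs:///user/example/readings.csv", header=True, inferSchema=True)

# Assemble raw feature columns into a single vector column, then fit a classifier.
assembler = VectorAssembler(inputCols=["temperature", "humidity"], outputCol="features")
classifier = DecisionTreeClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[assembler, classifier])

train, test = df.randomSplit([0.8, 0.2], seed=42)   # hold out a test split
model = pipeline.fit(train)                          # train the full pipeline
predictions = model.transform(test)                  # apply it to unseen data
predictions.select("label", "prediction").show(5)

spark.stop()
```

The same pipeline pattern (feature preparation stages followed by an estimator) carries over to the other MLlib techniques covered in the coming weeks.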
There, we will apply machine learning techniques within our five-step data science process to analyze big data. A simple Google search for big data processing pipelines will bring up a vast number of pipelines built with a large number of technologies that support scalable data cleaning, preparation, and analysis. How do we make sense of all of it to make sure we use the right tools for our application? How do we pick the right pre-processing and machine learning techniques to start doing predictive modeling? Over the next few weeks, Dr. Mei will walk you through some of the most fundamental machine learning techniques, along with introductory hands-on exercises we designed to ease you into the world of machine learning. Let's get started.