[MUSIC] Hello, everybody. Last week, we learned about the issues and solutions that come up when integrating, storing, and analyzing high volumes of sources. This week, we will learn what big data is and how the Hadoop framework can bring some solutions to it.

Information is growing at a phenomenal rate: 44 times as much data and content over the coming decade, with 80% of the world's data unstructured, as the world becomes more instrumented, interconnected, and intelligent. Big data refers to data sets that are so big and complex that traditional data-processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sources. There are a number of concepts associated with big data; originally, there were three: volume, variety, and velocity. Big data shouldn't be a silo or an isolated repository; it must be an integrated part of the enterprise information architecture. An example of a big data platform solution brings together any data source at any velocity to generate insight. An example of a big data platform vision is bringing big data to the enterprise.

[MUSIC] Now, we need to have a look at some concepts. Everyone talks about big data, but what is it? Big data literally means a huge amount of data, but the underlying issue corresponds to the analysis of high volumes of any type of data in real time in order to add value to the business. When we talk about big data, we talk about three Vs, plus a fourth that is often added. Volume: very, very large amounts of data, around petabytes or exabytes (an exabyte is 1 billion billion, or 1 quintillion, bytes). Variety: heterogeneous, semi-structured, or unstructured data. Velocity: dynamic data; think of the web and Facebook. Veracity: trust in its quality; real-life data is typically dirty, which means it is not accurate.

Why is the data so big and varied? Worldwide information volume is growing annually at a minimum rate of 59%. A single jet engine produces 20 terabytes of data per hour. Facebook, as an example, has 1.38 billion users, 140 billion links, and about 300 petabytes of data. And human genome information for 1 billion people is of the order of zettabytes. Nowadays, it is possible to predict heart disease by traditional analysis from heart rate, blood pressure, etc., or by big data analysis connecting exercise and fitness data, such as diet, fat and muscle composition, genetics, environment, social media and wellness shared information, etc.

As we can see, big data is needed everywhere. In the case of social media marketing, 80% of consumers trust peer recommendations. If three close friends of person X like items P and W, and if X also likes P, then the chances are that X likes W too; a small code sketch of this rule appears at the end of this segment. Regarding social event monitoring, it is possible to prevent terrorist attacks; you can have a look at the Net Project, Shenzhen, China. If we focus on scientific research, there is a new and more effective way to develop theory by exploring and discovering correlations of seemingly disconnected factors.

[MUSIC] In the case of big data analytics, it helps to integrate high volumes of data of great variety at high speed to achieve a corporate vision that generates competitiveness and profit. Big data analytics applications often include data from both internal systems and external sources, such as weather data or demographic data on consumers compiled by third-party information service providers.
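To make the peer-recommendation rule above concrete, here is a minimal sketch in Java of one way it could be expressed. The PeerRecommender class, its in-memory maps, and the minFriends threshold are illustrative assumptions made for this lecture, not a production recommender or any specific library's API.

```java
import java.util.*;

/** Minimal sketch of the peer-recommendation rule from the lecture:
 *  if enough close friends of X like both items P and W, and X already
 *  likes P, then recommend W to X. All names here are illustrative. */
public class PeerRecommender {

    // hypothetical data: user -> set of liked items
    private final Map<String, Set<String>> likes = new HashMap<>();
    // hypothetical data: user -> set of close friends
    private final Map<String, Set<String>> friends = new HashMap<>();

    void addLike(String user, String item) {
        likes.computeIfAbsent(user, k -> new HashSet<>()).add(item);
    }

    void addFriend(String a, String b) {
        friends.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        friends.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    /** Recommend an item W to user x when x does not yet like W but at
     *  least minFriends close friends who share a liked item with x do. */
    Set<String> recommend(String x, int minFriends) {
        Set<String> xLikes = likes.getOrDefault(x, Set.of());
        Map<String, Integer> votes = new HashMap<>();
        for (String friend : friends.getOrDefault(x, Set.of())) {
            Set<String> fLikes = likes.getOrDefault(friend, Set.of());
            // the friend must share at least one liked item (the "P") with x
            if (Collections.disjoint(fLikes, xLikes)) continue;
            for (String w : fLikes) {
                if (!xLikes.contains(w)) votes.merge(w, 1, Integer::sum);
            }
        }
        Set<String> result = new HashSet<>();
        votes.forEach((item, n) -> { if (n >= minFriends) result.add(item); });
        return result;
    }

    public static void main(String[] args) {
        PeerRecommender r = new PeerRecommender();
        r.addFriend("X", "A"); r.addFriend("X", "B"); r.addFriend("X", "C");
        for (String f : List.of("A", "B", "C")) { r.addLike(f, "P"); r.addLike(f, "W"); }
        r.addLike("X", "P");
        System.out.println(r.recommend("X", 3)); // prints [W]
    }
}
```

The same counting logic scales to real data sets once it is reformulated as a distributed job, which is exactly the kind of workload the Hadoop framework discussed next is designed for.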
In addition, streaming analytics applications are becoming common in big data environments, as users look to do real-time analytics on data fed into Hadoop systems through Spark's Spark Streaming module or other open source stream-processing engines, such as Flink and Storm. There are several approaches to the problem of big data; one of them has been widely implemented: the Hadoop framework. Apache Hadoop has become a core component of the enterprise data architecture as a complement to existing data management systems. Accordingly, Hadoop is designed to easily interoperate, so you can extend your existing investments in applications, tools, and processes with Hadoop. Here, I will show the main components of the Hadoop ecosystem; we will talk about some of them in more detail later.

There is a programming model proposed to manage big data: the MapReduce paradigm. MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster of processors or standalone computers. In general terms, a MapReduce program is composed of a map procedure, which performs filtering and sorting, and a reduce method, which performs a summary operation. Here I'll show the MapReduce stages for a word count example flow; a code sketch of this example appears after this segment. The map phase is done by mappers. Mappers run on unsorted input key/value pairs. Each mapper emits zero, one, or multiple output key/value pairs for each input key/value pair. The combine phase is done by combiners. The combiners should combine key/value pairs with the same key. Each combiner may run zero, one, or multiple times. The shuffle and sort phase is done by the framework: data from all mappers are grouped and sorted by key. The programmer may supply custom compare functions for sorting and a partitioner for data splitting. The partitioner decides which reducer will get a particular key/value pair, and thus how the data is split among the reducers. Finally, the reduce phase is done by reducers, which aggregate the values grouped under each key and emit the final output.

Now, we will see the most common data types in Hadoop: sentiment, clickstream, sensor, geographic, server logs, and text. The sentiment data type corresponds to understanding how your customers feel about your brand and products right now. The clickstream type is focused on capturing and analyzing website visitors' data trails to optimize your website. In the case of sensor and machine data, we can discover patterns in data streaming automatically from remote sensors and machines. For geographic and spatial data, we can analyze location-based data to manage operations where they occur. Server log files can be used to diagnose process failures and prevent security breaches. Text documents are utilized as input to understand patterns in text across millions of web pages, emails, and documents. [MUSIC]
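To make the word count flow concrete, here is a sketch in Java along the lines of the standard WordCount example from the Apache Hadoop MapReduce tutorial. The mapper emits a (word, 1) pair per token, the summing reducer is reused as the combiner, and the framework performs the shuffle and sort between the two; input and output paths are assumed to arrive as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token of every input line.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase (also reused as the combiner): sum the counts per word.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // combine phase
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

On a configured cluster, this would typically be packaged as a JAR and launched with something like hadoop jar wordcount.jar WordCount /input /output (paths illustrative). A custom partitioner, as mentioned above, could be plugged in with job.setPartitionerClass(...) to control which reducer receives each key.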