Hi, folks. Welcome back. This is Module 3: Big Data. So far you have learned a lot about data. You learned how to deal with relational databases, which hold structured data. You also learned how to deal with a data warehouse, where the data may not be structured that well but is organized around some subject of interest. In this module, you are going to learn about big data: what big data is and how we can define it. Basically, how big does data have to be before we can call it big data? Then you are going to learn the advantages and challenges of big data, and you are going to see a list of tools you may use for big data analysis.
The learning objectives for this module: after finishing this module, you will be able to explain what big data is using the 10 V's. You will be able to explain big data applications. You will be able to describe the challenges of big data and possible solutions. You will be able to explain the workflow of a big data architecture. You will be able to identify the big data layers in this architecture with examples, and identify the tools used for big data. Those will be our goals. Let's get started.
First of all, I'm going to ask you a question. What is data? This is actually a question I asked you during our first course, relational database design. Basically, data is some numerical or non-numerical information. However, we learned that those numbers, those strings or labels, and those images may not be that useful unless we can turn them into information, into knowledge, into business intelligence for decision-making. We learned how we can use a data warehouse to partially satisfy this goal.
However, that is not enough, because most of what we have learned so far is about structured data: data that is organized very clearly and defined in a very restrictive format, in particular the tabular format, where we have rows as the records and columns as the attributes. We can search against this tabular format using the Structured Query Language and get answers from the information stored in it. This has been the most common practice in the field, partly because it is useful and accurate, and historically we could use it for all kinds of scenarios. However, nowadays new scenarios are coming up, and we need a new perspective on how to organize our data.
First of all, why do we need data? We know data has power. Data is the oil of the next century, and that's why you take data science and learn data science tools. Basically, data is the future, and I cannot say no to this conclusion.
We can also look at how science has been conducted through history to see why data is important. Before 1600, science was empirical science, where we observed something and then drew conclusions from our observations. We learned from experience. That was the time when old people always had the wisdom, because they had so much experience. From 1600 to the 1950s, we had theoretical science, where we could finally do scientific research. We could find proofs. We could find equations. We could do some simple calculations to get results that empirical science could not give us. From the 1950s to the 1990s, we call it computational science, because of the invention of computers and the power of computation. We could do a lot of calculations that we were unable to do before, which enabled science that could not be done before the computer was invented and before we had that computation power.
However, it was still limited, because we still had limited computation power. We still had limited imagination about how computers and the Internet could change the way we learn, use, and share everything, including our data. From 1990 to now, we call it data science, because now when we do something we need support and evidence, and we can find them in our data. Also, even when we do not have a very clear hypothesis, we can just learn a pattern from the data rather than have human beings define the algorithm. We can use the data as the fuel to get meaningful information out. Those things will be useful and efficient, and sometimes impressive. That is why data is the future, and that is not surprising. In 2012, Harvard Business Review said the sexiest job of the 21st century is data scientist. Not surprisingly, 10 years later this is still the case, so congratulations.
Let's talk about how data, and in particular big data, can make a huge difference in different scenarios. The first one is big data in banking and security. The banking industry uses big data a lot because there are a lot of related issues, such as risk analysis. When, and whether or not, should we issue a loan of a certain amount of money to a certain company? What is the risk of this loan, and what is the return on this loan? We have to analyze these candidates by examining not just the history of the candidates, but also the industry, maybe the political environment, maybe the trend of some environmental policy. There are so many factors that play a role in risk management and risk analysis, and we definitely cannot have a very simple model for that. What we can do is use big data to connect a lot of related information, try to find the patterns, calculate the related risk, and then make the decision.
Also, for fraud detection: we all make transactions, use credit cards, take small cash advances, and so on. How can the banking industry, how can the management team, set up rules to get a notification if a certain transaction is illegitimate? For example, suppose a purchase is made on your card at a location you seldom or never visit, where in theory you could not have been present to make that transaction. This can be caught by big data analysis, which shows that it is an extreme anomaly in your behavior. Then that fraudulent transaction can be detected and rejected. This has to happen really quickly, because you cannot just wait at the cashier for 10 minutes, which is awkward.
If our government wants to fight money laundering, which is a crime of course, then we can use big data analysis again to detect anomalies in the transactions. If somebody deposits cash in large amounts or at a very high frequency, this may not be normal behavior for a well-behaved person. That may be something we want to investigate further to find out why.
Security departments, such as Homeland Security, government security, your company's security, etc., can use big data as well. We can catch illegal trading activities, and we can detect secret words spreading on the Internet. We can use network analysis to look at the bounds and extent of your network, and if you are reaching out to a person far beyond your normal network, that will be a flag that calls for investigation. With natural language processing, we can try to understand the words spreading on the Internet, even when the words are handwritten digits, handwritten letters, or some unusual design that you want to catch. Big data can be very helpful for detecting things that we human beings may not have the power to detect.
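To make the location-based fraud check from the banking example a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not how any bank actually implements it: the Transaction class, the is_impossible_travel function, the sample coordinates, and the 900 km/h speed threshold are all assumptions made up for this example. The idea is simply to compare two consecutive transactions on the same card and ask whether a person could plausibly have traveled between the two locations in the time available.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

# Assumed threshold: anything faster than a commercial flight is implausible.
MAX_PLAUSIBLE_SPEED_KMH = 900.0


@dataclass
class Transaction:
    """Hypothetical record of one card transaction."""
    card_id: str
    timestamp: datetime
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * earth_radius_km * asin(sqrt(a))


def is_impossible_travel(previous, current,
                         max_speed_kmh=MAX_PLAUSIBLE_SPEED_KMH):
    """Return True if the implied travel speed between two transactions
    on the same card exceeds the assumed plausible maximum."""
    distance_km = haversine_km(previous.lat, previous.lon,
                               current.lat, current.lon)
    hours = (current.timestamp - previous.timestamp).total_seconds() / 3600.0
    if hours <= 0:
        # No time elapsed: flag only if the locations clearly differ.
        return distance_km > 1.0
    return distance_km / hours > max_speed_kmh


if __name__ == "__main__":
    # Made-up sample data: one card used in three cities on the same day.
    new_york = Transaction("card-1", datetime(2024, 1, 1, 12, 0), 40.7128, -74.0060)
    london = Transaction("card-1", datetime(2024, 1, 1, 13, 0), 51.5074, -0.1278)
    boston = Transaction("card-1", datetime(2024, 1, 1, 17, 0), 42.3601, -71.0589)
    print(is_impossible_travel(new_york, london))
    print(is_impossible_travel(new_york, boston))
```

With these made-up sample points, the New York to London pair one hour apart is flagged, and the New York to Boston pair five hours apart is not. A real system would likely combine many such signals and learn thresholds from data rather than rely on one fixed rule, but a simple check like this can run quickly enough that nobody has to wait at the cashier.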
Also, it is not surprising that big data has huge applications in marketing, because marketing relies on data a lot. That is why you may get a phone call asking your opinion about some product or about some experience of yours. Understanding customer behavior and preferences is very important in the marketing field. And how do we understand customer behavior? We cannot just send out some surveys, collect them, and say, this is the result. We have to analyze the survey results to make sure the survey is clean and valid, and then we can further analyze the patterns hidden in those ratings, texts, and comments, and find the useful information.
For media content, the online advertisements and the content recommenders are actually driven by analyzing customer data at a large scale. That is why, when social network companies go to IPO, they have to announce how many users they have, how many new users they can attract in a certain period of time, and how many active users they have right now: users provide data for them, and that data can be analyzed and can make a profit. That's why YouTube, Facebook (now Meta), and TikTok are such huge successes. It's because you, as a user, even though you may not pay anything to these companies, are contributing your time, your effort, and especially your behavior to be analyzed. Then they can send you advertisements you may find useful, where you may think, I'm going to try it. So those companies can sell ads on their networks, and you, even though you're not paying anything, are still getting something in return.
In healthcare, we actually just survived a pandemic. Big data is being used to deliver evidence-based diagnosis and to reduce the cost and the time of medical development. With help from big data, we can significantly shorten the time frame for drug development.
Also, we know that even in a pandemic, big data is used to analyze and track possible positive cases, to let people be aware and be prepared. Big data is also used in machine learning models to extract features and patterns from X-ray images. An image is sometimes well structured, and sometimes not structured at all. Unstructured images are very hard to analyze using conventional learning methods. However, using convolutional neural networks, we can extract features from those images, make the process doable, and then get patterns from them, so we can know what the subject is, what the object is, and what things are in the image. Now we see a lot of self-driving cars using data-driven algorithms to detect human beings and to differentiate moving objects from other objects so they can drive safely.
In manufacturing, big data is even more important, because nowadays manufacturing happens in a global setting. It is very hard and sometimes very risky. Nowadays we are having a global supply chain issue because so many things are happening: the pandemic, the lockdowns, the wars, and something happening somewhere. In a global setting, you may procure certain parts in one country and then assemble them in some other country, and so on. If you have a huge connected supply chain, there will be a lot of risks, and you have to analyze and manage them. Big data can provide quantitative and qualitative support, because we can provide data evidence, numbers, and support for high-quality analysis for decision-making. Supply chain management is, of course, one example, as we just discussed. Also, for predictive maintenance, we want to keep up with something and we want to know what will happen next.
We want to have some insight into the future, so predictive power based on data is very helpful. We can also do quality control. That is, we can analyze and examine our products and services to make sure they meet our demands and our expectations. Lastly, we can do product forecasting and risk management. Those things all rely on a large amount of data globally, so that we can get insights from the evidence.
Big data in retail is actually a combination of everything we just talked about, because retail nowadays very commonly means e-commerce. We buy things on Amazon, we buy things on eBay, and we order things without going to the actual shopping mall. Especially after the pandemic, everyone learned how to order online. This might be the biggest application of big data, because it combines so many of the aspects where big data can benefit us, from the back end, the supply chain management and the manufacturing part, to the front end, the marketing strategies to understand user behavior. It covers every single thing big data can do and help with.
This is why data, and especially big data, can be so useful and can be a must-have for everyone: not only can we benefit from big data, but we are also part of the big data. Our next topic will be the definition of big data and the many Vs. We'll take a break and I'll see you later.