A term you may have heard of before this course is Big Data. There have always been large datasets, but it seems like lately, this has become a buzzword in data science. What does it mean? We talked a little about big data in the very first lecture of this course. As the name suggests, big data are very large datasets. We previously discussed three qualities that are commonly attributed to big datasets: volume, velocity, and variety. From these three qualities, we can see that big data involves large datasets of diverse data types that are being generated very rapidly. But none of these qualities seem particularly new. Why has the concept of Big Data been so recently popularized? In part, as data storage technology has evolved to hold larger and larger datasets, the definition of "big" has evolved too. Also, our ability to collect and record data has improved with time, such that the speed with which data is collected is unprecedented. Finally, what is considered data has evolved, so that there is now more of it than ever. Companies have recognized the benefits of collecting different information, and the rise of the internet and technology has allowed different and varied datasets to be more easily collected and made available for analysis. One of the main shifts in data science has been moving from structured datasets to tackling unstructured data. Structured data is what you might traditionally think of as data: long tables, spreadsheets, or databases, with columns and rows of information that you can sum or average or analyze however you like within those confines. Unfortunately, this is rarely how data is presented to you in this day and age. The datasets we commonly encounter are much messier, and it is our job to extract the information we want and corral it into something tidy and structured.
With the digital age and the advance of the Internet, many pieces of information that weren't traditionally collected were suddenly able to be translated into a format that a computer could record, store, search, and analyze. Once this was appreciated, there was a proliferation of this unstructured data being collected from all of our digital interactions: emails, Facebook and other social media interactions, text messages, shopping habits, smartphones and their GPS tracking, the websites you visit, how long you are on each website and what you look at, CCTV cameras and other video sources, et cetera. The amount of data and the number of sources that can record and transmit data have exploded. It is because of this explosion in the volume, velocity, and variety of data that big data has become so salient a concept. These datasets are now so large and complex that we need new tools and approaches to make the most of them. As you can guess, given the variety of data types and sources, very rarely is the data stored in a neat, ordered spreadsheet that traditional methods for cleaning and analysis can be applied to. Given some of the qualities of big data above, you can already start seeing some of the challenges that may be associated with working with big data. For one, it is big. There is a lot of raw data that you need to be able to store and analyze. Second, it is constantly changing and updating. By the time you finish your analysis, there is even more new data you could incorporate into your analysis. Every second you are analyzing is another second of data you haven't used. Third, the variety can be overwhelming. There are so many sources of information that it can sometimes be difficult to determine what source of data may be best suited to answer your data science question. Finally, it is messy. You don't have neat data tables to quickly analyze. You have messy data.
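To make the idea of messy data a little more concrete, here is a minimal Python sketch of pulling structured records out of free text. The input lines, the field names (`user`, `page`, `seconds`), and the `parse_line` helper are all invented for illustration and are not from the lecture; real unstructured sources are usually far less regular than this.

```python
import re

# Hypothetical unstructured input: free-text activity notes, invented for illustration.
raw_lines = [
    "2021-03-04 user=alice visited /home for 32s",
    "user=bob visited /cart for 5s on 2021-03-04",
    "garbled entry with no usable fields",
]

# One pattern per field, so the fields may appear in any order within a line.
USER = re.compile(r"user=(\w+)")
PAGE = re.compile(r"visited (\S+)")
SECONDS = re.compile(r"for (\d+)s")


def parse_line(line):
    """Return a dict with user, page, and seconds, or None if any field is missing."""
    user = USER.search(line)
    page = PAGE.search(line)
    seconds = SECONDS.search(line)
    if not (user and page and seconds):
        return None
    return {
        "user": user.group(1),
        "page": page.group(1),
        "seconds": int(seconds.group(1)),
    }


# Keep only the lines that could be parsed into a complete record.
records = [r for r in (parse_line(line) for line in raw_lines) if r is not None]

# Once the data is structured, simple summaries become easy.
total_seconds = sum(r["seconds"] for r in records)
print(records)
print(total_seconds)
```

Once the text has been turned into rows with named fields, the usual tools apply: you can sum, average, or group the records just as you would with a spreadsheet.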
Before you can start looking for answers, you need to turn your unstructured data into a format that you can analyze. So, with all of these challenges, why don't we just stick to analyzing smaller, more manageable, curated datasets and arriving at our answers that way? Sometimes questions are best addressed using these smaller datasets, but many questions benefit from having lots and lots of data, and if there is some messiness or inaccuracy in this data, the sheer volume of it can negate the effect of these small errors. So, we are able to get closer to the truth even with these messier datasets. Additionally, when you have data that is constantly updating, while this can be a challenge to analyze, the ability to have real-time, up-to-date information allows you to do analyses that are accurate to the current state and make on-the-spot, rapid, informed predictions and decisions. One of the benefits of having all these new sources of information is that questions that weren't previously able to be answered due to lack of information suddenly have many more sources to glean information from, and new connections and discoveries are now able to be made. Questions that were previously inaccessible now have newer, unconventional data sources that may allow you to answer them. Another benefit of using big data is that it can identify hidden correlations. Since we can collect data on a myriad of qualities of any one subject, we can look for qualities that may not be obviously related to our outcome variable, but where big data can identify a correlation. Instead of trying to understand precisely why an engine breaks down or why a drug side effect disappears, researchers can instead collect and analyze massive quantities of information about such events and everything that is associated with them, looking for patterns that might help predict future occurrences. Big data helps answer what, not why. Often that's good enough.
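The earlier point about volume offsetting small errors can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true value of 10, the noise level, and the helper names `noisy_measurements` and `typical_error` are all made up for illustration, and the errors are assumed to be random and centered on zero.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

TRUE_VALUE = 10.0  # hypothetical quantity we are trying to estimate


def noisy_measurements(n, noise_sd=2.0):
    """Simulate n measurements of TRUE_VALUE, each with random, zero-centered error."""
    return [TRUE_VALUE + random.gauss(0, noise_sd) for _ in range(n)]


def typical_error(n, trials=100):
    """Average absolute error of the sample mean, over repeated simulated datasets."""
    errors = [
        abs(statistics.fmean(noisy_measurements(n)) - TRUE_VALUE)
        for _ in range(trials)
    ]
    return statistics.fmean(errors)


small_data_error = typical_error(10)    # a small, noisy dataset
big_data_error = typical_error(1000)    # a much larger, equally noisy dataset

print(f"typical error with 10 points:   {small_data_error:.3f}")
print(f"typical error with 1000 points: {big_data_error:.3f}")
```

Each individual measurement is just as messy in both cases, but the average of the larger dataset lands much closer to the true value. Note that this only works for random errors: if every measurement is off in the same direction, collecting more of them does not remove that bias.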
Big data has now made it possible to collect vast amounts of data very rapidly from a variety of sources, and improvements in technology have made it cheaper to collect, store, and analyze. But the question remains: how much of this data explosion is useful for answering the questions you care about? Regardless of the size of the data, you need the right data to answer a question. A famous statistician, John Tukey, said in 1986, "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." Essentially, any given dataset may not be suited to your question, even if you really want it to be, and big data does not fix this. Even the largest datasets around might not be big enough to answer your question if it's not the right data. In this lesson, we went over some qualities that characterize big data: volume, velocity, and variety. We compared structured and unstructured data and examined some of the new sources of unstructured data. Then, we turned to looking at the challenges and benefits of working with these big datasets. Finally, we came back to the idea that data science is question-driven science and even the largest of datasets may not be appropriate for your case.