Hi again, I'll be leading you through many of the concepts we're presenting throughout this course. As we said in the introduction, all of your instructors, including me, work at Databricks, where we build curriculum to train data teams so that they can effectively work with Spark and Delta Lake, which you'll learn about later. I personally have been developing statistics and computer science curriculum for over ten years, first in New York City Public Schools, and then later at the district level nationally. I currently live in New York City with my wife, my dog Charlie, and my cat, Loki. And I love it here. New York City is a huge hub for tech and data science, which is fun for me personally and professionally.

But I bring it up now mostly because I want all of us to take a moment and just imagine New York City as a place with a lot of people who produce a lot of data. Take a second and consider the number of people who live here: it's about eight million. A huge section of that population has smartphones, and on each of those smartphones, let's say they have ten or so apps. And those apps are collecting and producing data pretty much all the time. When we think about the vastness of that data, well, that's big data: millions of people creating hundreds of millions of data points every day, or maybe every minute, or maybe every second. The term "big data" itself was going around by 2005, right around the time data-intensive industries were looking for a new solution.

In this video, we're going to talk about the specific characteristics that define big data and uncover what it means when data is big. By the end of this video, you'll be able to describe the characteristics that define big data. When we start talking about big data, you'll find that it's often characterized by what people call the five V's of big data. Those V's are volume, velocity, variety, veracity, and value.
We'll talk about what each of these words means, and what each of them means to a data analyst. When we talk about volume, we're talking about the massive amounts of data being generated every second of every day. We worked through a little thought experiment around New York City and how much data we could reasonably expect to come from all the people living in just that one city, and we said that's big data. But when we're talking about what qualifies as a massive amount, we have to consider the whole world. Lucky for us, the International Data Corporation, the IDC, released a report in 2018 in which they found that there were, at the time, 33 zettabytes of data in existence in the world. And based on that study, they predicted that this will shoot up to 177 zettabytes of data by 2025.

Just to put that into perspective, let's think about how big a zettabyte is. My ordinary old laptop, which works well enough for me, has 256 gigabytes of storage. A really powerful computer might have about four times as much storage, coming in at one terabyte. One zettabyte is one billion terabytes, so it's vastly bigger, to say the least. And going from 33 to 177 zettabytes by 2025 is more than a fivefold increase.

Volume, from a data analyst's perspective, represents both a challenge and an opportunity. On the one hand, we've got a ton of data being generated every day, and more data potentially means better decisions. But we as analysts also have to think about how we're going to access all this data.

When we refer to velocity, we're referring to the speed at which new data is generated and the speed at which data moves around. As data is being generated, it's also in motion: into and out of databases, between databases, and back to the end user. And it's moving around really quickly.
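To make the volume numbers above concrete, here's a quick back-of-the-envelope calculation (using decimal SI prefixes, so 1 TB = 10^12 bytes and 1 ZB = 10^21 bytes; the laptop figure is just the 256 GB example from the narration):

```python
# Rough scale comparison for the storage figures discussed above.
TERABYTE = 10 ** 12   # bytes, decimal (SI) prefix
ZETTABYTE = 10 ** 21  # bytes

laptop_bytes = 256 * 10 ** 9  # the ordinary 256 GB laptop from the example

# One zettabyte expressed in terabytes: one billion.
tb_per_zb = ZETTABYTE // TERABYTE

# How many 256 GB laptops would it take to hold the 177 ZB projected for 2025?
laptops_for_2025 = (177 * ZETTABYTE) // laptop_bytes

# Growth from 33 ZB (2018) to 177 ZB (projected 2025): roughly 5.4x.
growth_factor = 177 / 33

print(f"1 ZB = {tb_per_zb:,} TB")
print(f"256 GB laptops needed for 177 ZB: {laptops_for_2025:,}")
print(f"Growth 2018 -> 2025: {growth_factor:.1f}x")
```

Even without exact figures, the point stands: the projected total is hundreds of billions of laptops' worth of data, which is why analysts have to think hard about how they'll access it.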
As data analysts, we can certainly appreciate the advantages of being able to gather, analyze, and report on these large amounts of data, but we also have to think about how it's going to get processed and served to us for analytics. Will it be easy to query, and can you create a real-time report that truly represents the data?

The next V is variety, and it refers to having different types and sources of data. Analysts usually work with structured and semi-structured data that can generally be coerced into some kind of tabular format, something that looks kind of like a spreadsheet. But many businesses are also collecting unstructured data like video files and social media posts. So we've got all these files of all these different types from all these different sources. How are we as analysts going to be able to work with all those data structures, and where are we going to be able to access each different kind of data?

The next V we want to talk about is veracity, which refers to the quality and accuracy of data. In this image, we're showing three different data sources reporting three different Q1 earnings. We've already talked about how quickly data is coming in and how quickly it's moving around any given system, so it's not hard to imagine that there may be some inconsistencies in that data. For data analysts, it's obviously important that we use high-quality, accurate data to produce the best reports, and we always have to ask: am I using the most accurate data? Can this data be trusted?

While all of the V's we've talked about affect how an analyst works, value is the V that speaks most directly to what an analyst brings to an organization. Extracting value from big data can be really complicated: the data needs to be transformed into shareable, actionable insights and made visible to the larger organization. Next, we're going to take a look at how these characteristics make it a challenge to work with big data.
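As a small aside on the variety discussion above, here's a minimal sketch of what "coercing semi-structured data into a tabular format" can look like. The records and field names here are entirely made up for illustration; real semi-structured data from apps or APIs often has this shape, where not every record carries every field:

```python
import json

# Hypothetical semi-structured JSON records: same source, but records
# don't all share the same fields (common with app or API data).
raw = """
[{"user": "a01", "city": "New York", "apps": 12},
 {"user": "a02", "apps": 7},
 {"user": "a03", "city": "Boston"}]
"""

records = json.loads(raw)

# Coerce into a tabular, spreadsheet-like shape: a fixed set of columns,
# filling in None wherever a record is missing a field.
columns = ["user", "city", "apps"]
table = [[rec.get(col) for col in columns] for rec in records]

print(columns)
for row in table:
    print(row)
```

This is the easy end of the variety spectrum; unstructured data like video or free-text social media posts can't be flattened this way, which is exactly why variety is listed as one of the challenges.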