You may have heard that data is really important for building AI systems. But, what is data really? Let's take a look. Let's look at an example of a table of data which we also call a dataset. If you're trying to figure out how to price houses that you're trying to buy or sell, you might collect a dataset like this, and this can be just a spreadsheet, like a MS excel spreadsheet of data where one column is the size of the house, say in square feet or square meters, and the second column is the price of the house. So, if you're trying to build a AI system or Machine Learning system to help you set prices for houses or figure out if a house is priced appropriately, you might decide that the size of the house is A and the price of the house is B, and have an AI system learn this input to output or A to B mapping. Now, rather than just pricing a house based on their size, you might say, "Well, let's also collect data on the number of bedrooms of this house." In that case, A can be both of these first two columns, and B can be just the price of the house. So, given that table of data, given the dataset, it's actually up to you, up to your business use case to decide what is A and what is B. Data is often unique to your business, and this is an example of a dataset that a rural state agency might have that they tried to help price houses. It's up to you to decide what is A and what is B, and how to choose these definitions of A and B to make it valuable for your business. As another example, if you have a certain budget and you want to decide what is the size of house you can afford, then you might decide that the input A is how much does someone spend and B is just the size of the house in square feet, and that would be a totally different choice of A and B that tells you, given a certain budget, what's the size of the house you should be maybe looking at. Here's another example of a dataset. Let's say that you want to build a AI system to recognize cats in pictures. I'm not sure why you might want to do that, but maybe the fun mobile app, and you want to tag all the pictures of cats. So, you might collect a dataset where the input A is a set of different images and the output B are labels that says, "First picture is a cat, that's not a cat. That's a cat, that's not a cat" and have an AI input a picture A and output B is it the cats or not so you can tag all the cat pictures on your photo feed or your mobile app. In Machine Learning tradition, there's actually a lot of cats in Machine Learning. I think some of this started when I was leaving the Google Brain team and we published the results with somewhat infamous Google cat, where an AI system learn to detect cats from watching YouTube videos. But since then, there's been a tradition of using cats as a running example when talking about Machine Learning with apologies to all the dog lovers out there. I love dogs too. So, data is important. But how do you get data? How do you acquire data? Well, one way to get data is manual labeling. For example, you might collect a set of pictures like these over here, and then you might either yourself or have someone else go through these pictures and label each of them. So, the first one is a cat, second one is not a cat, third one is a cat, fourth one is not a cat. By manually labeling each of these images, you now have a dataset for building a cat detector. To do that, you actually need more than four pictures. You might need hundreds of thousands of pictures but manual labeling is a tried and true way of getting a dataset where you have both A and B. Another way to get a dataset is from observing user behaviors or other types of behaviors. So, for example, let's say you run a website that sells things online. So, an e-commerce or an electronic commerce website where you offer things to users at different prices, and you can just observe if they buy your product or not. So, just through the act of either buying or not buying your product, you may be able to collected a data set like this, where you can store the user ID, the time the user visited your website, the price you offer the product to the users as well as whether or not they purchased it. So, just by using your website, users can generate this data from you. This was an example of observing user behaviors. We can also observe behaviors of other things such as machines. If you run a large machine in a factory and you want to predict if a machine is about to fail or have a fault, then just by observing the behavior of a machine, you can then record a dataset like this. There's a machine ID, there's a temperature of the machine, there's a pressure within the machine, and then did the machine fail or not. If your application is prevent the maintenance, say you want to figure out if a machine is about to fail, then you could for example, choose this as the input A and choose that as the output B to try to figure out if a machine is about to fail in which case you might do preventative maintenance on the machine. The third and very common way of acquiring data is to download it from a website or to get it from a partner. Thanks to the open internet, there's just so many, there's as that you can download for free, ranging from computer vision or image datasets, to self-driving car datasets, to speech recognition datasets, to medical imaging data sets to many many more. So, if your application needs a type of data, you just download off the web keeping in mind licensing and copyright, then that could be a great way to get started on the application. Finally, if you're working with a partner, say you're working with a factory, then they may already have collected a big dataset, machines, and temperatures, and pressure into the machines fail not that they could give to you. Data is important, but there's also little bit over-hyped and sometimes misused. Let me just describe to you two of the most common misuses or the bad ways of thinking about data. When I speak of seals of large companies, a few of them have even said to me, "Hey Andrew, give me three years to build up my IT team, we're collecting so much data. Then after three years, I'll have this perfect dataset, and then we'll do AI then." It turns out that's a really bad strategy. Instead, what I recommend to every company, is once you've started collecting some data, go ahead and start showing it or feeding it to an AI team. Because often, the AI team can give feedback to your IT team on what types of data to collect and what types of IT infrastructure to keep on building. For example, maybe an AI team can look at your factory data and say, "Hey. You know what? If you can collect data from this big manufacturing machine, not just once every ten minutes, but instead once every one minute, then we could do a much better job building a preventative maintenance systems for you." So, there's often this interplay of this back and forth between IT and AI teams, and my advise is usually try to get feedback from AI earlier, because it can help you guide the development of your IT infrastructure. Second, misuse of data. Unfortunately, I've seen some CEOs read about the importance of the trend in use, and then say, "Hey, I have so much data. Surely, an AI team can make it valuable." Unfortunately, this doesn't always work out. More data is usually better than less data, but I wouldn't take it for granted that just because you have many terabytes or gigabytes of data, that an AI team can actually make that valuable. So, my advice is don't throw data in a AI team and assume it will be valuable. In fact, in one extreme case, I saw one company go and acquire a whole string of other companies in medicine, on the thesis, on the hypothesis that their data would be very valuable. Now, a couple years later, as far as I know the engineers have not yet figured out how to take all this data and actually create value out of it. So, sometimes it works and sometimes it doesn't. But, they will not over-invest in just acquiring data for the sake of data until unless you're also getting an AI team to take a look at it. Because, they can help guide you to think through what is the data that is actually the most valuable. Finally, data is messy. You may have heard the phrase garbage in garbage out, and if you have bad data, then the AI will learn inaccurate things. Here are some examples of data problems. Let's say you have this data sets of size of houses, number of bedrooms, and the price. You can have incorrect labels or just incorrect data. For example, this house is probably not going to sell for $0.1 just for one dollar. Or, data can also have missing values such as we have here a whole bunch of unknown values. So, your AI team will need to figure out how to clean up the data or how to deal with these incorrect labels and all missing values. There are also multiple types of data. For example, sometimes you hear about images, audio, and text. These are types of data that humans find it very easy to interpret. There's a term for this. This is called unstructured data, and there's a certain types of AI techniques that could work with images to recognize cats or audios to recognize speech or texts, or understand that email is spam. Then, there are also datasets like the one on the right. This is an example of structured data. That basically means data that lives in a giant spreadsheet, and the techniques for dealing with unstructured data are little bit different than the techniques for dealing with structured data. But AI techniques can work very well for both of these types of data, unstructured data and structured data. In this video, you learned what is data and you also saw how not to misuse data, for example by over-investing in an IT infrastructure in the hope that it will be useful for AI in the future, but we're actually checking that they're really will be useful for the AI applications you want to build. Finally, you saw data is messy. But a good AI team would be the help you deal with all of these problems. Now, AI has a complicated terminology when people throw around terms like AI, Machine Learning, Data Science. What I want to do in the next video is share with you what these terms actually mean, so that you'd be able to confidently and accurately talk about these concepts with others. Let's go on to the next video.