Since we've spent some time discussing what data science is, we should spend some time looking at what exactly data is. First, let's look at what a few trusted sources consider data to be. First up, we'll look at the Cambridge English Dictionary, which states that data is "information, especially facts or numbers, collected to be examined and considered and used to help decision-making." Second, we'll look at the definition provided by Wikipedia, which is "a set of values of qualitative or quantitative variables."
These are slightly different definitions, and they get at different components of what data is. Both agree that data is values or numbers or facts. But the Cambridge definition focuses on the actions that surround data. Data is collected, examined and, most importantly, used to inform decisions. We've focused on this aspect before. We've talked about how the most important part of data science is the question, and how all we are doing is using data to answer the question. The Cambridge definition focuses on this.
The Wikipedia definition focuses more on what data entails. And although it is a fairly short definition, we'll take a second to parse this and focus on each component individually. So, the first thing to focus on is "a set of values." To have data, you need a set of items to measure from. In statistics, this set of items is often called the population. The set as a whole is what you are trying to discover something about. The next thing to focus on is "variables." Variables are measurements or characteristics of an item.
Finally, we have both qualitative and quantitative variables. Qualitative variables are, unsurprisingly, information about qualities. They are things like country of origin, sex or treatment group. They're usually described by words, not numbers, and they are not necessarily ordered. Quantitative variables, on the other hand, are information about quantities.
Quantitative measurements are usually described by numbers and are measured on a continuous ordered scale. They're things like height, weight and blood pressure. So, taking this whole definition into consideration, we have measurements, either qualitative or quantitative, on a set of items making up data. Not a bad definition.
When we were going over the definitions, our examples of data (country of origin, sex, height, weight) were pretty basic. You can easily envision them in a nice-looking spreadsheet like this one, with individuals along one side of the table in rows, and the measurements for those variables along the columns. Unfortunately, this is rarely how data is presented to you. The data sets we commonly encounter are much messier. It is our job to extract the information we want, corral it into something tidy like the table here, analyze it appropriately and, often, visualize our results.
These are just some of the data sources you might encounter. And we'll briefly look at what a few of these data sets often look like, or how they can be interpreted. But one thing they have in common is the messiness of the data. You have to work to extract the information you need to answer your question.
One type of data that I work with regularly is sequencing data. This data is generally first encountered in the FASTQ format, the raw file format produced by sequencing machines. These files are often hundreds of millions of lines long, and it is our job to parse this into an understandable and interpretable format, and infer something about that individual's genome. In this case, the data was processed into expression data and used to produce a plot called a volcano plot.
One rich source of information is countrywide censuses. In these, almost all members of a country answer a set of standardized questions and submit these answers to the government. When you have that many respondents, the data is large and messy.
But once this large database is ready to be queried, the answers embedded are important. Here we have a very basic result of the last US Census, in which all respondents are divided by sex and age. This distribution is plotted in this population pyramid plot. I urge you to check out your home country's census bureau, if available, and look at some of the data there.
This is a mock example of an electronic medical record. This is a popular way to store health information, and more and more population-based studies are using this data to answer questions and make inferences about populations at large, or as a method to identify ways to improve medical care. For example, if you are asking about a population's common allergies, you will have to extract many individuals' allergy information and put that into an easily interpretable table format, where you will then perform your analysis.
A more complex data source to analyze is images and videos. There is a wealth of information coded in an image or video, and it is just waiting to be extracted. An example of image analysis that you may be familiar with is when you upload a picture to Facebook. Not only does it automatically recognize faces in the picture, but it then suggests who they may be. A fun example you can play with is the Deep Dream software, which was originally designed to detect faces in an image but has since moved on to more artistic pursuits. There is another fun Google initiative involving image analysis, where you help provide data to Google's machine learning algorithm by doodling.
Recognizing that we've spent a lot of time going over what data is, we need to reiterate: data is important, but it is secondary to your question. A good data scientist asks questions first and seeks out relevant data second. Admittedly, often the data available will limit, or perhaps even enable, certain questions you are trying to ask.
In these cases, you may have to reframe your question or answer a related question, but the data itself does not drive the question asking.
In this lesson we focused on data, both in defining it and in exploring what data may look like and how it can be used. First, we looked at two definitions of data: one that focuses on the actions surrounding data, and another on what comprises data. The second definition embeds the concepts of populations and variables, and looks at the differences between quantitative and qualitative data. Second, we examined different sources of data that you may encounter and emphasized the lack of tidy data sets. Examples of messy data sets, where raw data needs to be wrangled into an interpretable form, can include sequencing data, census data and electronic medical records, among others. Finally, we returned to our beliefs on the relationship between data and your question and emphasized the importance of question-first strategies. You could have all the data you could ever hope for, but if you don't have a question to start with, the data is useless.
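To make the idea of a tidy table a little more concrete, here is a minimal sketch in Python of turning a few messy records into rows (one per individual) with named variables, and then labeling each variable as qualitative or quantitative. Everything in it is an assumption made for illustration: the records, the field names and the helper functions `parse_record`, `is_number` and `classify_variables` are hypothetical and do not come from the lesson.

```python
# Hypothetical messy records: names, fields and values are invented for
# illustration. Note that the fields do not appear in a consistent order.
raw_records = [
    "name=Ana; country=Brazil; height_cm=162.5; weight_kg=58",
    "name=Ben; height_cm=180; country=Kenya; weight_kg=77.2",
    "name=Chen; country=China; weight_kg=64; height_cm=171",
]


def parse_record(line):
    """Turn one 'key=value; key=value' string into a dict (one tidy row)."""
    row = {}
    for field in line.split(";"):
        key, value = field.strip().split("=", 1)
        row[key] = value
    return row


def is_number(text):
    """Return True if the text can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def classify_variables(rows):
    """Label a variable 'quantitative' if every value is numeric, else 'qualitative'.

    This is a rough heuristic, not a general rule.
    """
    kinds = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if all(is_number(v) for v in values):
            kinds[column] = "quantitative"
        else:
            kinds[column] = "qualitative"
    return kinds


# One row per individual, one named entry per variable.
tidy_rows = [parse_record(line) for line in raw_records]
variable_kinds = classify_variables(tidy_rows)

for column, kind in variable_kinds.items():
    print(column, "is", kind)
```

The numeric check is only a rough shortcut. A qualitative variable stored as a numeric code, such as a treatment group recorded as 1 or 2, would be mislabeled as quantitative, so in practice you would decide each variable's type from what it measures.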