We need to have data once we determine what we are interested in knowing. We got to this point by first defining some random variables that we would like to understand, such as the average amount an American spends on entertainment, or the average age of a person pursuing online education, etc. So next is to go and find the data. So where does the data come from? There are many existing sources that already have vast amounts of data. Some examples are electronic versions of government publications, company reports and business journals. But there is also a wealth of information available in the reference section of a good library or in county courthouse records. Another alternative would be to contact a data collection agency, which typically incurs some kind of cost. You can either buy subscriptions or purchase individual company financial reports from agencies like Bloomberg, Dow Jones & Company, Travel Agency of America, Graduate Management Admission Council and Educational Testing Service, just to name a few. However, there are times when the data we need doesn't exist. In such cases, we have to conduct some type of study so that we can get the data. So now let's look at what we need to do in order to conduct the study. When initiating a study, we first define our variable of interest, or response variable. Again, we are learning new vocabulary here, and it's important for you to know these terms. For one, you will come across them in reports and meetings, and you should not be at a loss about what they imply. Let me give you an example. Imagine that we are interested in knowing how efficient our employees are in doing their tasks. So we can define our variable of interest as worker productivity. Let's say we will measure this as the number of customers served per hour. We believe that worker productivity responds to several factors. Thus, it's a response variable.
Once you know the variable of interest, you need to determine what factors would influence this variable. These factors are also known as the independent variables. Let's continue with our example of measuring worker productivity. Factors that may influence one's productivity could be the amount of training the workers have received, the reliability of the equipment that they use, the number of years at this job, et cetera. So in this study, these would be the independent variables: factors which influence the response variable, worker productivity. Now let's practice. Does the number of hours spent studying impact a student's GPA? To study this, we design a study in which we select a group of incoming freshmen, these are first-year students, and ask them to keep track of the hours of studying they do per week. And we will track their GPA for the next four years. What is the variable of interest and what is the independent variable? We are interested in measuring the impact of hours spent studying on GPA. So, the variable of interest is GPA, which we believe responds to the independent variable, hours spent studying. So now that we know our variable of interest and have identified some independent variables, we can go on with conducting the study. Depending on what you're trying to do, you may be doing a study which is known as an experimental study. An experimental study is one in which you go and manipulate the independent variable. For instance, we want to measure the impact of a particular fertilizer on plant yield. Here, the variable of interest, our response variable, is plant yield, and the independent variable is the amount of fertilizer used. For different plots of land, we can manipulate how much fertilizer is given and then measure the yield at harvest time. Since in this study we can manipulate the independent variable, in this case the amount of fertilizer applied, we're doing an experimental study.
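To make the fertilizer example concrete, here is a minimal sketch of how the recorded data might be summarized. The plot records, the units, and the helper name `mean_yield_by_level` are all made up for illustration; they do not come from a real experiment.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records, one per plot of land:
# (fertilizer applied in kg, yield at harvest in kg). Made-up numbers.
plots = [
    (0, 40.0), (0, 44.0),
    (5, 52.0), (5, 56.0),
    (10, 61.0), (10, 65.0),
]

def mean_yield_by_level(records):
    """Average the response variable (yield) for each level of the
    independent variable (fertilizer amount)."""
    grouped = defaultdict(list)
    for fertilizer_kg, yield_kg in records:
        grouped[fertilizer_kg].append(yield_kg)
    return {level: mean(values) for level, values in sorted(grouped.items())}

summary = mean_yield_by_level(plots)
for level, average in summary.items():
    print(f"fertilizer={level} kg -> mean yield={average:.1f} kg")
```

The fertilizer amount is the value the researcher sets for each plot, and the yield is what gets measured afterwards, which is what makes this layout experimental rather than observational.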
Of course, for the sake of simplicity in this example, I'm assuming that all other factors, such as soil conditions, water, etc., are being held equal. One of the major issues in determining the quality of a study is how well and completely the researchers have identified the independent factors. Failing to do so could result in a wrong or incomplete understanding. There are many times when we can't control the independent variable, or when it's simply unethical for us to do so. When that is the case, we are conducting an observational study. The question of how children develop their language has always interested researchers. In such a study, one possible variable of interest could be the number of words a baby knows by age one. However, if we believe that factors such as the number of words spoken to a baby in the first year have an influence on the baby's speech development, it would be unethical for us to control this factor by, let's say, asking some parents not to speak to their child at all for the entire year, and then asking another set of parents to speak to their child for 18 hours per day in the first year, so we can determine the impact of being spoken to on the speech development of their babies. Studies such as this are only conducted as observational studies, where the participants simply self-report how much they spoke to their babies and how many words the baby could speak by the age of one. Once you have identified the response variable and the independent variables, you can start collecting the data. Here, again, you have more than one choice. Cross-sectional data are data collected at the same, or approximately the same, point in time. Suppose that a bank which pays for its employees' cell phone usage wishes to analyze last month's cell phone bills for its employees. The bank decides to look at the month of May cell phone bills of its employees. Looking at different bills for the same month will provide cross-sectional data.
Your data is considered time series data if you're collecting it over different time periods. A famous study, known as the Nurses' Health Study, is a great example of time series data. The study was initiated in 1976, and then expanded in 1989, to study women's health. This study collected health-related data on over 200,000 nurses over time, to understand the impact of various factors on women's health issues. This was an observational study, which means the nurses lived their lives without interference from the researchers, and every so often answered questions about their health and did some medical tests, and the data was collected over many years for analysis. So now let's practice. Going back to the earlier study about the impact of studying on GPA: in this study, we selected a group of incoming freshmen and asked them to keep track of the hours of studying they do per week, and we track their GPA for the next four years. Is this an observational study or an experimental study? What about the data? Is it time series or cross-sectional? We have asked the students to record their study habits without manipulating them, so it's an observational study. And the data is collected over four years of college, so it is time series. Now consider a study done by a researcher to investigate the impact of drinking beer on weight. The study was conducted on a sample taken from six districts of the Czech Republic. This is, by the way, a real study. In this study, a random sample of 1141 men and 1212 women, aged 25 to 64, completed a questionnaire and underwent a short examination in a clinic. Intake of beer, wine, and spirits during a week, frequency of drinking, and a number of other factors were measured by the questionnaire. What kind of study is this? How was the data collected? This is, again, an observational study. Data is collected over the same time period for all subjects. Thus it's cross-sectional data.
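The distinction between the two kinds of data can be sketched in a few lines of code. This is only an illustration under the lecture's two-way split: the record layout `(subject, period, value)`, the function name `classify_data`, and all of the sample values are assumptions made up for this sketch.

```python
def classify_data(observations):
    """Label a data set using the lecture's two categories.

    observations: list of (subject, period, value) tuples.
    One shared period -> cross-sectional; several periods -> time series.
    """
    if not observations:
        raise ValueError("need at least one observation")
    periods = {period for _, period, _ in observations}
    return "cross-sectional" if len(periods) == 1 else "time series"

# Hypothetical May cell phone bills for three bank employees (made-up amounts).
bank_bills = [
    ("employee_1", "May", 62.10),
    ("employee_2", "May", 48.75),
    ("employee_3", "May", 71.30),
]

# Hypothetical GPA records for two students tracked over four years.
gpa_records = [
    ("student_1", "year 1", 3.1), ("student_1", "year 2", 3.3),
    ("student_1", "year 3", 3.4), ("student_1", "year 4", 3.5),
    ("student_2", "year 1", 2.8), ("student_2", "year 2", 2.9),
    ("student_2", "year 3", 3.2), ("student_2", "year 4", 3.0),
]

print(classify_data(bank_bills))   # bills from a single month
print(classify_data(gpa_records))  # grades followed across several years
```

The check looks only at how many distinct time periods appear, which mirrors the question asked in the practice problems: was everything collected at about the same point in time, or over different periods?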
We can't finish this lesson without at least mentioning big data. This is a term that has taken hold in today's business world, and we can do a lot of statistical analysis on big data. But what is big data? You have big data when you have access to massive volumes of both structured and unstructured data. We have a lot of data available today. Companies are collecting data through every means possible: GPS devices, your smartphone, your credit card, your health tracking devices, your game console, just to name a few. The availability of this data opens up the opportunity to answer compelling questions, such as: can you tell that someone is about to have a heart attack before they show any symptoms? Can you tell if someone is going to default on their loan before issuing them that loan? The ability to make sense of this massive data is yet another way that we can find data which can be used for gaining an understanding of our variable of interest. All of the analysis done in statistics starts with a subset of the population of interest, known as a sample. So once we know what data we need, we must create the sample. In the next lesson, we will discuss how to create such a sample.