Last time we talked about the principles of data collection, we noted two important rules. One, more observations are better than fewer, and two, randomness is crucial. So where can you get your data? There are many sources, and you will be more knowledgeable about the available options for your particular situation than I am, but let's consider a few broad types of questions and the general sources of data that may be useful. Let's start with one of our favorite examples. You want to determine whether people in your organization are indeed compensated in an equal manner. Let's assume that your null hypothesis is that black men, black women, white women, and white men are compensated equivalently. We've already discussed how to take your data, divide it into appropriate subsets, and analyze it, but where can you get the data to begin with, and what data would you like to have? For people inside your organization, it is often relatively easy to get data. Presumably your organization has a payroll system, and this system can generate compensation data for all employees. Is that all you need? Maybe, if everyone in your organization has the same role and works in the same office. But for many organizations, you might need to collect more information to do an apples-to-apples comparison. In most organizations there are different occupations, different promotion levels, and maybe different office locations. You would need to collect information on each person's occupation, level, and location if you thought that those were important factors to use in defining your subsets. For some organizations there are other factors that you will want to include. For example, if an organization offers overtime work and some people choose to do this while others choose not to, then you may want to collect information on this. You can then compare subsets of white men who don't work overtime to black women who also don't work overtime, etc.
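To make the idea of like-for-like subsets concrete, here is a minimal sketch in Python. It assumes two hypothetical extracts, one from payroll and one from an HR system, that share an employee ID. All field names and figures are invented for illustration and do not come from any real system. The sketch joins the two extracts and reports the average pay for each demographic group within each comparable cell (same occupation, level, location, and overtime status).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical payroll extract: employee id -> annual compensation.
payroll = {
    101: 82000, 102: 79000, 103: 81000,
    104: 64000, 105: 66000, 106: 65000,
}

# Hypothetical HR extract: attributes that payroll alone does not hold.
hr_records = [
    {"id": 101, "group": "white men", "occupation": "analyst",
     "level": 2, "location": "NYC", "overtime": False},
    {"id": 102, "group": "black women", "occupation": "analyst",
     "level": 2, "location": "NYC", "overtime": False},
    {"id": 103, "group": "white men", "occupation": "analyst",
     "level": 2, "location": "NYC", "overtime": False},
    {"id": 104, "group": "black women", "occupation": "clerk",
     "level": 1, "location": "Boston", "overtime": False},
    {"id": 105, "group": "white men", "occupation": "clerk",
     "level": 1, "location": "Boston", "overtime": False},
    {"id": 106, "group": "black women", "occupation": "clerk",
     "level": 1, "location": "Boston", "overtime": True},
]


def average_pay_by_subset(payroll, hr_records):
    """Join the extracts on employee id, then average pay per (cell, group)."""
    pay = defaultdict(list)
    for rec in hr_records:
        if rec["id"] not in payroll:
            continue  # skip anyone missing from the payroll extract
        cell = (rec["occupation"], rec["level"],
                rec["location"], rec["overtime"])
        pay[(cell, rec["group"])].append(payroll[rec["id"]])
    return {key: mean(values) for key, values in pay.items()}


averages = average_pay_by_subset(payroll, hr_records)
for (cell, group), avg in sorted(averages.items()):
    print(cell, group, avg)
```

With real data, each cell needs enough people to support a meaningful comparison. The averages here only organize the subsets; they are not themselves a test of the null hypothesis.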
The general point is that you will likely have all this information in your organization's IT systems, but the information may be scattered throughout multiple systems. Don't limit yourself to the easy items that you get from the payroll system if there are other items that you think are important. Sometimes you might want to determine how your organization considers prospective employees. Does the organization evaluate applicants equally? Let's assume that your null hypothesis is that young and old applicants are evaluated equivalently. To be more precise, this might be that applicants below 40 years old and applicants 40 years old or older are evaluated equivalently. Here, you will want to look at subsets of applicants: those who are under 40 and those who are 40 and older. Each applicant is either given a job offer or not. You can determine whether the two age groups are getting substantially different rates of job offers, but what else do you need to perform an apples-to-apples comparison? You may get lots of applications from people who don't meet the minimum qualifications for a job, so you probably want to limit this only to the people who meet those qualifications. You may also have a handful of key criteria that you take into account during hiring: prior work experience in a similar field or a specialized graduate degree might be a plus, and so on. In this case, you will want to generate subsets of young applicants above the minimum threshold, young applicants who also have prior relevant experience, young applicants who also have a specialized graduate degree, and older applicants in each of these categories. Where will you get these data? Whereas organizations tend to have huge amounts of data on actual employees, they're not always so consistent at retaining information on prospective employees, but you may have access to this as well. Much of this may well be in an internal system, especially for applicants who reached the interview stage.
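As a rough illustration of the applicant subsets just described, the Python sketch below uses a small made-up list of applicants. The field names and records are hypothetical. It computes the job-offer rate for under-40 and 40-and-older applicants within each subset: everyone who meets the minimum qualifications, those who also have relevant experience, and those who also have a specialized graduate degree.

```python
# Hypothetical applicant records; real ones would come from the HR system
# or from hand-collected applicant files.
applicants = [
    {"age": 29, "qualified": True, "experience": True, "grad_degree": False, "offer": True},
    {"age": 34, "qualified": True, "experience": False, "grad_degree": False, "offer": False},
    {"age": 38, "qualified": True, "experience": True, "grad_degree": True, "offer": True},
    {"age": 25, "qualified": False, "experience": False, "grad_degree": False, "offer": False},
    {"age": 45, "qualified": True, "experience": True, "grad_degree": False, "offer": True},
    {"age": 52, "qualified": True, "experience": True, "grad_degree": True, "offer": False},
    {"age": 61, "qualified": True, "experience": False, "grad_degree": False, "offer": False},
    {"age": 47, "qualified": False, "experience": True, "grad_degree": False, "offer": False},
    {"age": 31, "qualified": True, "experience": False, "grad_degree": True, "offer": True},
    {"age": 58, "qualified": True, "experience": False, "grad_degree": True, "offer": True},
]


def age_group(age):
    """Split applicants at the 40-year threshold used in the null hypothesis."""
    return "under 40" if age < 40 else "40 and older"


def offer_rates(records, extra=None):
    """Offer rate by age group among qualified applicants.

    If `extra` names a field, only applicants for whom that field is true
    are counted (for example, those with relevant experience).
    """
    counts = {}  # age group -> [offers, applicants]
    for r in records:
        if not r["qualified"]:
            continue  # drop applicants below the minimum threshold
        if extra is not None and not r[extra]:
            continue
        tally = counts.setdefault(age_group(r["age"]), [0, 0])
        tally[0] += int(r["offer"])
        tally[1] += 1
    return {group: offers / n for group, (offers, n) in counts.items()}


subsets = [
    ("meets minimum qualifications", None),
    ("plus relevant experience", "experience"),
    ("plus specialized graduate degree", "grad_degree"),
]
for label, extra in subsets:
    print(label, offer_rates(applicants, extra))
```

The counts here are far too small to support any conclusion. The sketch only shows the structure of the comparison, which would then feed into whatever significance test you apply to the two offer rates.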
When the applicant information is held internally, you might be able to get everything you need from an IT system in your organization's HR or talent management office. Sometimes not everything you desire will be entered into the system, but applicants' files will be sitting in storage. You can get those files and hand-collect the specific information you need. Should you spend the time to do this? Personally, I would say absolutely yes, but then again, I think it's a great use of my time to spend weeks in the musty archives of libraries to get data from the 18th century. You might ultimately decide that it is not worth getting this additional information, or you might be able to employ an eager subordinate to collect it. It can be more difficult to access information that resides outside the organization, but this can vary quite a bit. If much of your business is conducted online, then you should have access to remarkably detailed data about customers' choices: from which websites they arrive, what on your site seems to appeal to them, what they buy, what they consider but then choose not to buy, etc. You can learn a lot about customers through their online behavior. In fact, there's a thriving cottage industry in developing simple experiments that you can use to learn about your customers' preferences through their online behavior. For example, as noted in an earlier video, thanks to the online profiles of visitors, if you think to analyze your customers by age, then you might find that senior citizens abandon your site quickly. From quick market research, you might learn that they're put off because the font is too small and the dancing logos are too annoying. Alternatively, if you have a hypothesis about why older customers find your site unappealing, you can run a low-cost experiment to find out.
Your web team can make an alternative version of some of the commonly visited web pages with larger font, and when a visitor arrives who the system knows is a senior citizen thanks to their cookies or profile, the site can randomly deliver either the regular font pages or the large font pages. You can then look at the reaction of these visitors. Do the senior citizens who see the large font stay longer and maybe even buy something? Note that this experiment allows you to create two subsets: senior citizens who see small font and senior citizens who see large font. And from our earlier videos you already know how to compare them to see if the large font generates a significant increase in website time or purchases. For brick-and-mortar customers, you frequently have more limited data. In this case, surveys of randomly selected customers may be a useful way to get data. My one bit of advice about this is that surveys can be expensive to administer, and the longer the survey, the harder it is to get someone to complete it, which raises the cost even further. So the more that you know precisely what information you need before designing the survey, the more precisely you can ask your questions and the shorter the survey can be. This usually leads to greater success in data collection. For many external stakeholders beyond customers, surveys are a common way to obtain data, and the above rule applies. Shorter surveys with more precise questions tend to be better than longer surveys with more vague questions. Finally, sometimes it's useful to get secondary data that might describe the populations that are of interest to you. For example, if you have a bank branch or other outlet at a particular location, you can see the demographics of the people who choose to enter your store. Let's say that your branch has 1000 customers composed as follows: 35% black customers, 35% Asian customers, and 30% white customers.
Or this could be 40% customers over 60 years old, 40% between 40 and 60, and 20% customers below 40 years old. No matter what subsets you care about, the question might be: is our customer base representative of the local area? Are we under-appealing to a particular group, and if so, then why? Government census data is one source of information on the demographics of the local area. You can almost certainly find the relevant information on the web. Imagine that you find that in the local area, 45% of the households are black, whereas only 35% of your customers are black. First, you can determine whether you can be 95% confident that this is a true under-appeal to potential customers who are black. Armed with this information, you can explore more deeply why this might be. The government collects a huge amount of fabulous data. Other sources of secondary data include trade associations, industry associations, and obsessive people who love to collect, clean, and share information. Many of these are free. Consulting firms often collect information and offer it for sale, and some sources of data exist on the Internet for one purpose but can be useful for others. For example, LinkedIn can provide great information on the characteristics of potential hires.
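For the census comparison above (35% of 1000 customers versus 45% in the local area), one common way to check the 95% confidence question is a one-sample z-test for a proportion. The sketch below is an illustration under stated assumptions, not necessarily the procedure used in the earlier videos: it treats the census figure as a fixed benchmark, treats the 1000 customers as a random sample, and glosses over the fact that the census figure counts households while the branch figure counts customers. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, benchmark):
    """Two-sided one-sample z-test of an observed share against a benchmark.

    Assumes the benchmark share is known exactly and the n observations
    are a random sample, so the normal approximation applies.
    """
    observed = successes / n
    std_error = sqrt(benchmark * (1 - benchmark) / n)
    z = (observed - benchmark) / std_error
    p_value = 2 * NormalDist().cdf(-abs(z))
    return observed, z, p_value


# 35% of 1000 customers are black; the census benchmark is 45%.
observed, z, p_value = proportion_z_test(successes=350, n=1000, benchmark=0.45)
significant_at_95 = p_value < 0.05

print(f"observed share = {observed:.2f}, z = {z:.2f}")
print("significant at the 95% level:", significant_at_95)
```

With these numbers the z statistic is roughly -6.4, far beyond the ±1.96 cutoff for 95% confidence, so under the assumptions above the gap would be hard to attribute to chance. It says nothing about why the gap exists, which is the question to explore next.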