Hi, in this module I'm going to introduce you to survey data. Survey data is a mainstay of quantitative social science research. If you look at any of the major social science journals, the empirical papers will overwhelmingly be making use of survey data of one form or another. Quantitative students typically conduct their first studies by analyzing existing survey data. As we'll learn in a moment, there's lots of survey data already out there, ready for you to download and analyze. Working with a mature data set like that, that is, a data set that has already been prepared and publicly released, is a good way to learn data management and advanced methods for analysis. And as I just said, many survey data sets are now publicly available alongside documentation. Now, there are fundamentally two types of survey data, or you might say two types of microdata in general: cross-sectional and longitudinal. Cross-sectional data, again, used to be a mainstay of social science research. With cross-sectional data, we have a snapshot of a population via a survey that asked people about their characteristics at a specific point in time. Cross-sectional data tends to be relatively simple. Again, it's like a snapshot. You only have to worry about dealing with people measured at one point in time. Longitudinal data, as you will soon learn, is actually quite complex. Longitudinal data includes observations of the same people again and again at different points in time, when they're different ages and their characteristics change from one observation to another. So this is very complex data, and quite often it requires some programming skills to manage the data, arrange it, and prepare it for analysis. Cross-sectional data is in some ways very diverse, because cross-sectional surveys are relatively easy to mount, and they don't require much in the way of resources.
So people can launch a cross-sectional survey with limited resources on almost any topic that interests them. If you look around, you can find hundreds, perhaps thousands, of cross-sectional surveys soliciting people's attitudes on any number of different topics. So the data out there is quite diverse, and if you can't find the data that you want on a particular topic that's interesting to you, it's relatively straightforward to run your own cross-sectional survey. By contrast, longitudinal data tends to be more focused. Longitudinal surveys are extremely complex and expensive. They typically require a large staff who can help track the respondents in the survey from year to year. Because longitudinal surveys require very large budgets, they tend to be focused. They may be very large surveys, but most longitudinal surveys will focus on a particular topic. We'll introduce some of those in just a moment. Generally, what we see is that cross-sectional surveys, especially now, are heavily focused on description. Because cross-sectional surveys are relatively easier to mount than longitudinal surveys, they can be launched with relatively limited resources. So people observing some new social phenomenon or trend can very quickly get a simple cross-sectional survey into the field to describe it and gain at least a descriptive understanding. Longitudinal surveys focus heavily on inference, that is, assessing cause and effect. By following people over time, you have more opportunity to see how their characteristics earlier in life shaped outcomes later in life. Finally, cross-sectional surveys, to the extent that they have a time dimension, tend to be retrospective, because cross-sectional surveys are typically one-shot. You go out and interview people at a fixed point in time.
If you're going to learn anything about previous points in time, you have to ask people, so the data is retrospective: you're relying on people's memory. By contrast, longitudinal studies are prospective. You begin with what's called the first wave of a longitudinal study, and then you reinterview people repeatedly at future points in time. The resulting survey is what we call prospective; it doesn't rely on recollection or recall on the part of the respondents. One of the big distinctions between different types of survey data that you'll have to keep in mind as a student, or even as a faculty researcher, is the distinction between public and proprietary data, and the trade-offs involved in using these two types of data. The first big difference is that public data, that is, data that you can download from the web, is typically fairly mature. A lot of people have already looked it over and used it, so a lot of the problems may have already been cleaned up by the time you get to it, hopefully in the process of preparing it for public release. By contrast, proprietary data is much more likely to be raw data. By proprietary data, I mean data that you have to obtain special permission to access. It might be data from your adviser, or from some other researcher who gives you special permission to make use of it. Such data, because fewer people have worked with it, is less likely to be mature than public data. It may have problems that have not yet been addressed. So when you're working with proprietary data, you have to be much more sensitive to the possibility that peculiarities you notice are ones that you will have to deal with, or perhaps fix, before you conduct your analysis. Basically, you need to be much more wary as you proceed. A related distinction between public data that you can just download and proprietary data that you obtain through special arrangement is that public data is much more likely to be documented.
So the big public data sets, especially the big public longitudinal data sets that I'm about to introduce in the next module, generally have enormous amounts of documentation. You can look at the original questionnaires, and you can look at codebooks that explain every single variable in the data set. And so you can refer to this documentation as you conduct your analysis. By contrast, proprietary data is probably a lot less likely to be documented properly. So if you run into a problem making use of proprietary data, such as peculiar values that you notice when you're conducting an analysis, you may have to go to the person who collected the data to ask them what's going on. So you'd better hope that the investigator who collected the data has a good memory, and in fact, you'd better hope that they're still around and that you can contact them. Otherwise, you may have mysteries that will remain forever unresolved. Another issue that distinguishes the use of public data from proprietary data is that, in some ways, making use of public data is more competitive than making use of proprietary data. The very fact that data is public means that not only you may be working with it, but also many other people at many other institutions. There may be other people who have independently thought of the same research question as you have, downloaded the same data, and are perhaps even making use of similar methods to analyze it. You may show up at a conference and find that somebody is presenting an identical paper, even though the two of you have never met. By contrast, with proprietary data, it's more likely that you will be the only one, or one of a small number of people, using a particular data set to work on a particular problem. So that may give you some advantages in terms of presentation or publication.
And you may not have to worry as much about the possibility that before you finish your paper, somebody may finish an essentially similar paper and publish it, making your paper superfluous. Finally, public data tends to be, you might say, familiar. Because public data has been out there and is mature, it quite often tends to focus on topics that people are already aware of, that they may have even already published on in established traditions and established literatures. By contrast, with proprietary data, there is more likelihood, hopefully, that the data are novel in the sense of addressing a topic or a question that has not been previously considered. So again, you may have more opportunity for novelty in terms of analysis and publication by making use of proprietary data, if you can gain access to it. Now, if you're going to be making use of data, public or proprietary, it's very likely that you're going to have to manage it, and I want to talk about some of the issues that are related to working with survey data and managing that data. The difficulties and complexities of working with survey data are most serious when you're working with longitudinal data. Longitudinal data, as I emphasized earlier, tends to be complex. You have multiple observations of the same people at different points in time. Data has to be rearranged to prepare it for analysis. And so, if you're working with longitudinal data, you may have to take the variables that are in the data set and transform them and rearrange them to produce the specific variables that you want to use in your analysis, whether it's an outcome variable or a right-hand-side variable. For these sorts of tasks, programming skills will come in useful. One of the limitations of traditional statistical training is that quite often statistics classes don't give much training in the management of data. They provide data predigested to students to conduct their analysis.
But they don't really teach students how to manipulate the data to get it prepared for analysis. So you may have to acquire some programming skills on your own in order to make use of data, especially longitudinal data, to carry out the analysis that you want to do. Again, as I said, statistics classes rarely teach this data management. You'll probably have to either teach yourself or take a programming class, perhaps in a computer science department, which will give you the requisite techniques. So here, I've tried to talk about the differences between cross-sectional and longitudinal data, the differences between public and proprietary data, and some of the issues that you have to keep in mind when you're preparing to analyze the data, including the importance of learning to manage the data. In the next module, I'm going to introduce you to the major sites where you can access both cross-sectional and longitudinal public data.
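To make the rearranging of longitudinal data mentioned earlier a bit more concrete, here is a minimal sketch in Python of one common task: taking repeated observations of the same people (one row per person per wave) and rearranging them into one record per person, then deriving a change variable that could serve as an outcome. Everything in it is an assumption for illustration only: the variable names (person_id, wave, age, income), the values, and the helper functions long_to_wide and add_change are all hypothetical and do not come from any real survey.

```python
from collections import defaultdict

# Hypothetical long-format panel: one row per person per wave.
long_rows = [
    {"person_id": 1, "wave": 1, "age": 30, "income": 40000},
    {"person_id": 1, "wave": 2, "age": 32, "income": 45000},
    {"person_id": 2, "wave": 1, "age": 41, "income": 52000},
    {"person_id": 2, "wave": 2, "age": 43, "income": 50000},
    # Person 3 was not reinterviewed in wave 2.
    {"person_id": 3, "wave": 1, "age": 25, "income": 28000},
]


def long_to_wide(rows, id_key, wave_key, value_keys):
    """Collapse person-wave rows into one record per person.

    Each measured variable becomes one field per wave,
    for example income_w1 and income_w2.
    """
    wide = defaultdict(dict)
    for row in rows:
        person = wide[row[id_key]]
        person[id_key] = row[id_key]
        for key in value_keys:
            person[f"{key}_w{row[wave_key]}"] = row[key]
    return [wide[pid] for pid in sorted(wide)]


def add_change(wide_rows, var, first_wave, last_wave):
    """Add <var>_change = later value minus earlier value.

    The change is None when the person is missing in either wave.
    """
    for person in wide_rows:
        start = person.get(f"{var}_w{first_wave}")
        end = person.get(f"{var}_w{last_wave}")
        if start is None or end is None:
            person[f"{var}_change"] = None
        else:
            person[f"{var}_change"] = end - start
    return wide_rows


wide_rows = long_to_wide(long_rows, "person_id", "wave", ["age", "income"])
wide_rows = add_change(wide_rows, "income", first_wave=1, last_wave=2)

for person in wide_rows:
    print(person)
```

Notice that the person who dropped out after the first wave ends up with a missing change value rather than being silently dropped. Deciding how to handle cases like that is exactly the kind of data management decision that working with longitudinal data forces you to make.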