Hello, everyone. In this demonstration, we're going to present a live example of analyzing complex sample survey data using the RStudio software. We're going to illustrate different approaches that one could take to analyzing publicly available survey data selected from a national population; in this case, we'll be looking at data from the European Social Survey (ESS), and we'll run some simple hypothetical descriptive analyses just to see what can happen when conducting these analyses of survey data. The key message here, recalling what we talked about in the lectures, is that when you download existing survey data to conduct secondary analyses, make sure that you're aware of the essential design features of that data collection, including the sample design, any weighting features, and so forth, when you're preparing for the analysis. Then, in statistical software like RStudio, you can use procedures that take those design features into account, execute those procedures, and make sure that you're generating correct inferences from the survey data. We're going to turn to the RStudio software now. Within RStudio, we have opened the R Markdown file that was available in your materials for this week; the name of that file is survey_analysis_example.Rmd. You can see it here, open in the editor window within RStudio. This R Markdown file includes R code along with text describing what the code is doing. That's a nice feature of the R Markdown environment within RStudio: you can mix regular text, which can be helpful for colleagues working on a research project by providing more descriptive information, with actual R code that can be executed to conduct analyses, perform data management, or whatever the case may be.
You see this mix of regular text and chunks of R code, indicated by these gray boxes. Where you see three backticks followed by an r in curly brackets, that marks a chunk of R code nested within the larger body of text that makes up this R Markdown file. We're going to use the Russia dataset from a recent round of the European Social Survey, perform some hypothetical analyses, and look at alternative approaches one could take when analyzing these data. The first thing we're going to do, in the first chunk of working R code, is read this ESS Russia dataset into R and create an object containing the data. The data are stored online in a CSV file; we're going to read that CSV file into an R object and then work with that object further. That's what this chunk of R code does: we use the function read.csv to read in the data file from my website. The argument header = T means that the first row of the CSV file contains variable names for all the variables in the file, and we save the result in a DataFrame object called russia. Within an R Markdown file, I can click the little green play button to run the current chunk of R code. I'm going to go ahead and do that right now. Then, in the Console below the code, I see that the command executed with no errors and no problems creating that DataFrame object. Now that the russia DataFrame object exists, I can proceed within the R Markdown file. The next thing we're going to do is load the contributed survey package in R. In the R environment, there are a variety of contributed packages that you can install and then load within RStudio to enable different analyses.
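As a sketch, the chunk described above looks something like the following. The URL is a placeholder (the actual file location is given in the .Rmd file itself), and the lowercase object name russia is an assumption consistent with the design object name russia.dsgn used later:

```r
# Read the ESS Russia data from a CSV file hosted online.
# The URL below is a placeholder; use the one given in the .Rmd file.
# header = T tells read.csv that the first row contains variable names.
russia <- read.csv("https://example.com/ess_russia.csv", header = T)
```

After this runs without errors, the russia DataFrame is available in the workspace for the rest of the analysis.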
In particular, the survey package in R lets you perform appropriate analyses of survey datasets that recognize complex sample design features: again, the weights, the stratum codes, and the cluster codes that describe the sample design used in the case of the ESS. We're going to load that contributed survey package so that all of its functions are available, and then we're going to create a new design object that contains the relevant sample design features of the ESS Russia data. In this design specification, the ids argument indicates the variable containing the sampling cluster codes from the sample design; the name of that variable in the ESS Russia dataset is psu. The strata argument to the svydesign function indicates the name of the variable in the ESS dataset that contains the stratification codes used for the sample design; as we discussed in lecture, stratification of a sample can help to improve the efficiency of your survey estimates. The name of that stratum-code variable is stratify. Then we have a weights argument that indicates the name of the variable in the ESS dataset containing the final survey weights for the survey respondents. Remember, these weights account for different probabilities of being selected into the sample and for differential nonresponse across subgroups, and they're designed to make sure that your weighted survey estimates are unbiased with respect to the overall sample design. The name of that variable in the ESS dataset is pspwght. More generally, when you download a survey dataset online, you should read the documentation for that dataset to identify the names of the variables that you need to account for in these kinds of analyses.
The data argument is the DataFrame object that I want to use for the analysis, and the argument nest = TRUE says that the sampling clusters are nested within the sampling strata in our overall sample design; that's usually the case in these national survey datasets. I'm then going to run this chunk of R code to load the survey package and call the svydesign function to create my design object, russia.dsgn, which contains all of this important design information. I'll go ahead and execute this code. Now I have the russia.dsgn object available for further analysis. The next thing we're going to do is estimate the proportion of the Russian population that voted in the last election, along with its standard error. In this analysis, we refer to the design object that we just created, russia.dsgn. This is important because the design object communicates the sample design features that need to be accounted for in our weighted estimation and our variance estimation, so that those features are reflected in the estimates we ultimately compute. Let's look at how we would do that. We're going to use the svymean function, which computes estimates of means and proportions. In the svymean function, the first argument begins with a tilde and specifies that the variable voted_lastelection, a variable in the ESS Russia dataset, is a factor variable, which just means it's a categorical variable; the factor function declares a variable to be categorical. That's the variable we're analyzing.
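Putting the pieces just described together, the design-object chunk can be sketched as follows, using the variable names from the ESS Russia dataset (psu, stratify, pspwght) and assuming the data were read into a DataFrame called russia:

```r
# Load the contributed survey package for design-based analysis.
library(survey)

# Create a design object encoding the ESS Russia sample design:
#   ids     = ~psu       sampling cluster (PSU) codes
#   strata  = ~stratify  stratum codes
#   weights = ~pspwght   final survey weights
#   nest    = TRUE       clusters are nested within strata
russia.dsgn <- svydesign(ids = ~psu, strata = ~stratify,
                         weights = ~pspwght, data = russia,
                         nest = TRUE)
```

Analysis functions from the survey package can then be pointed at russia.dsgn, and they will account for all of these design features automatically.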
The second argument is the design object that we just created, russia.dsgn, which communicates the essential design information of the ESS sample design for this analysis (the weights, the stratum codes, and the cluster codes), making sure that we do this analysis correctly and reflect those sample design features. The third argument, na.rm = T, just means remove missing values: if there are missing values on that particular variable in the CSV file, we ignore them for this initial analysis. Let's go ahead and execute this analysis, fully accounting for the sample design features. When I run the current chunk, down in the Console, and also within the R Markdown file, you see the weighted estimates of these population quantities. Our weighted estimate of the proportion of the Russian population that voted in the last election is about 62.9 percent, or 0.629 if we interpret it as a proportion. We see the weighted estimate of that proportion, along with a standard error that reflects the variability of the estimate due to the stratified sampling, the cluster sampling, and the use of weights in estimation. That standard error of 0.017 gives us a sense of the uncertainty in the estimate due to the sample design features. We have our fully weighted estimate of 62.9 percent and our fully design-adjusted standard error, so we can now make a sound population inference about the fraction of the Russian population that voted in the last election, and we can attach a standard error to that estimate to communicate its uncertainty.
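The estimation chunk just described can be sketched like this, with the design object russia.dsgn from above:

```r
# Estimate the proportion voting in the last election, with a
# design-based standard error. factor() treats the variable as
# categorical, and na.rm = T drops missing values.
svymean(~factor(voted_lastelection), russia.dsgn, na.rm = T)
# In the demo, this reports an estimated proportion of about 0.629
# for voters, with a fully design-adjusted standard error of about
# 0.017.
```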
That's the fully design-adjusted approach. Now let's take a more naive approach, where we ignore the sampling stratum codes (which, again, generally make our estimates more efficient and reduce standard errors) and the cluster codes (which reflect cluster sampling and tend to increase standard errors, because cluster sampling adds variability to our estimates). We're only going to specify the weights in the design object before we estimate the proportion. In this approach, we leave out the stratum codes and the cluster codes and specify only the weights for estimation purposes when we create the design object. Notice that in the ids argument we now say 1, which means that each individual in the dataset is treated as their own cluster, and that individuals aren't grouped by the sampling clusters that were actually selected. That is of course not the case, but we're doing this for illustration. We indicate the weights variable and, again, the same dataset. We create a new version of the design object, and then we run the same svymean command, referring to the new design object with its different set of design variables. Let's go ahead and run this analysis. We create that design object and redo the analysis, and we see that the weighted estimate is the same, as we saw in lecture, because we're still using the weights to compute the estimated proportion. But the standard error has now dropped: instead of 0.017, it's 0.011, or 1.1 on the percentage scale. We are understating the variability in our estimate because we're failing to account for those other features of the sample design.
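The weights-only version can be sketched as follows; the object name russia.dsgn2 is an illustrative choice, not necessarily the one used in the .Rmd file:

```r
# Weights-only design: ids = ~1 treats each respondent as their own
# cluster, and strata are omitted, so only the weights pspwght are
# reflected in estimation.
russia.dsgn2 <- svydesign(ids = ~1, weights = ~pspwght, data = russia)

# Same analysis, pointed at the weights-only design object.
svymean(~factor(voted_lastelection), russia.dsgn2, na.rm = T)
# In the demo, the weighted estimate is unchanged (about 0.629), but
# the standard error drops from about 0.017 to about 0.011,
# understating the true uncertainty.
```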
When we fully account for those features, as we saw above, there's more variability, more uncertainty, in our estimate, but we're doing a good job of communicating that: we're communicating the correct uncertainty. Finally, let's see what happens when we completely ignore all the design features. This is a completely naive analysis, where we ignore the weights, which correct for sample selection and potential nonresponse bias, and we ignore the stratum codes and the cluster codes that reflect the sample design. In our svydesign function, we use the same dataset, we say that there is no clustering, and we give everybody a weight of one. No one has a differential weight; it's treated as a so-called self-weighting sample, where the weights are essentially ignored. We create that new design object and then run the same svymean command, now referring to this unweighted design object. Let's go ahead and run this chunk of R code. Now look what happens in the results. Failing to use the weights when computing our population estimate of the proportion that voted in the last election increases the estimate by more than two percentage points, and failing to account for the sampling features makes the standard error even lower than it should be, so again, we're overstating how precise our estimate is. Remember, the use of the weights in particular ensures that our estimate is unbiased with respect to the sample design that was used and any nonresponse that occurred in our survey. This is a relatively mild example of what can happen depending on the approach you take to the analysis: we saw about a two percentage point shift in our estimate when ignoring the weights.
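The fully naive version can be sketched as follows; again, the object name russia.dsgn3 is illustrative:

```r
# Fully naive design: no clustering, no strata, and no weights.
# When no weights are supplied, svydesign warns that it is assuming
# equal probabilities, i.e., every respondent gets the same weight.
russia.dsgn3 <- svydesign(ids = ~1, data = russia)

# Unweighted analysis ignoring all sample design features.
svymean(~factor(voted_lastelection), russia.dsgn3, na.rm = T)
# In the demo, the estimated proportion shifts by more than two
# percentage points relative to the weighted estimate, and the
# standard error is smaller than it should be.
```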
In other cases, like the NCSES data we saw in lecture, estimates can shift substantially if you fail to account for the weights and the weights are informative for the estimate of interest. Feel free to try other example analyses of this dataset; the data are available online, and you can see how to load them into RStudio with the command we used above. Try these commands with other categorical variables, try different design specifications, and gain some familiarity with how to account for these sample design features when analyzing such datasets. Again, the best way to prevent these errors is to carefully read the documentation for these kinds of survey datasets before you embark on the analysis; this will maximize the quality of the analyses you perform on complex sample survey data. Thank you.