In this video, I'd like to give you one more example of computing base weights and then make a comment or two on what you might do about base weights for nonprobability samples. We're going to look at an example from something called the SMHO population in the PracTools package; SMHO stands for Survey of Mental Health Organizations. I'm going to use a PPS sample, which can be very efficient if the frame has a measure of size that's a good predictor of the variables you're collecting. So in our case, let's suppose that in this small population I'm going to estimate total expenditures by these organizations. The scale on the Y-axis of these graphs is in millions of dollars of expenditures, and on the X-axis I've got the number of beds in the hospitals. Not every hospital has inpatient beds; some of them have zero, and that's why you see some points down here, but we'll take care of that problem in a minute. What I've done, though, is plot separately for different types of hospitals: psychiatric, residential or veterans, general hospitals, and then multi-service and substance abuse hospitals. Now, what you see here is that the dots are all over the place for psychiatric, but this is a nonparametric smoother that's been run through the scatter plot. It's like a fancy way of taking your pencil and drawing a line through the points, sort of an eyeball regression line, except this is pretty formal. It's a smoother called lowess, which is available in R. You'll see it's quite straight, and there's a fairly good relationship between beds and expenditures here in psychiatric hospitals. Likewise, there's a relationship in residential or veterans hospitals. And you can see I made the scales the same in these plots; they go from 0 to 200 here and there and in the other two. So you can get a feel for how the ranges of expenditures differ across these groups, but at least according to the nonparametric smoother, there's a pretty nice relationship.
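To give a feel for what a smoother like lowess is doing, here is a minimal Python sketch of the core idea: for each point, fit a weighted straight line to its nearest neighbors using tricube weights. This is only the basic local-regression step, without lowess's robustness iterations, and the data in the test below are made up; it is not the R implementation.

```python
import numpy as np

def lowess_smooth(x, y, frac=0.5):
    """One pass of locally weighted linear regression (the core idea of
    lowess, without the robustness iterations).  For each point, fit a
    weighted least-squares line to its k nearest neighbors, where
    k = frac * n, using tricube weights that fall to zero at the k-th
    neighbor's distance."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    k = max(2, int(frac * n))              # neighbors used in each local fit
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        h = np.sort(d)[k - 1] or 1.0       # bandwidth = distance to k-th neighbor
        w = np.clip(1 - (d / h) ** 3, 0, 1) ** 3   # tricube kernel weights
        sw = np.sqrt(w)
        X = np.column_stack([np.ones(n), x])
        # weighted least squares: scale rows by sqrt(weight)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted
```

On exactly linear data the local fits reproduce the line, which is the "quite straight" behavior the plots show when the relationship is strong.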
So beds might be a decent measure of size if my target variable is expenditures. I'm going to use the sampling package, and I need two packages here: PracTools, which has got that SMHO dataset, and sampling, which draws samples for me. The particular dataset in PracTools is called smho98. The first thing I want to do is recode hospitals with 0 beds, because if I use beds as the measure of size to draw a PPS sample, hospitals with zero beds are going to have zero chance of being picked, and I don't want that. So this syntax says, first, assign to the size variable smho98$BEDS; those are the numbers of inpatient beds in each hospital. Then I say size less than or equal to 5; that creates a Boolean, true or false, depending on whether beds is less than or equal to five or not. And for those cases where size is 5 or less, I recode it to 5, so that takes care of the zeros and also scoots the 1s, 2s, 3s, and 4s up to five. Next I compute selection probabilities for the PPS sample and assign them to a variable named pk. I'm using the function inclusionprobabilities, which is in the sampling package: here's my measure of size, size, and here's my sample size, 10. It will go through and calculate 10 times the relative measure of size for each of these hospitals. A nice thing about the inclusionprobabilities function is that it will sort through your data, and if you've got some units that are so big they're going to be picked with probability one, it will find those and then redo the relative sizes for everything else, so that's a nice feature. In this case I don't have any certainties, because I made sure the sample size was little, at 10. I can do a summary on pk just to see what it looks like, and there's quite a range; we're doing just an example here, and you might not want this big a range in practice.
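The recode-then-compute step described above can be sketched in Python as follows. This is an illustrative reimplementation of the logic (with made-up bed counts), in the spirit of R's `sampling::inclusionprobabilities`, not a substitute for it: compute pk = n times the relative measure of size, cap any unit that would exceed 1 as a certainty, and rescale the rest.

```python
import numpy as np

def inclusion_probs(size, n):
    """PPS inclusion probabilities pk = n * size / sum(size), with any
    unit whose probability would exceed 1 set to exactly 1 (a certainty)
    and the remaining probabilities recomputed over the rest -- the same
    idea as sampling::inclusionprobabilities in R."""
    size = np.asarray(size, float)
    pk = np.ones_like(size)
    remaining = np.ones(len(size), bool)
    m = n
    while True:
        pk[remaining] = m * size[remaining] / size[remaining].sum()
        over = remaining & (pk >= 1)
        if not over.any():
            break
        pk[over] = 1.0
        m -= int(over.sum())      # certainties use up part of the sample size
        remaining &= ~over
    return pk

# Hypothetical bed counts; one huge hospital forces a certainty selection.
beds = np.array([0, 2, 12, 40, 500, 30, 8, 3, 25, 60, 15, 5])
size = np.maximum(beds, 5)        # recode 0-4 beds up to 5, as in the video
pk = inclusion_probs(size, n=4)
```

The probabilities sum to the sample size, and the 500-bed hospital comes out as a certainty with pk exactly 1.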
pk goes from a minimum of 0.0067 to a maximum of 0.3231, so relatively that's a big increase from smallest to largest. Now I use my set.seed statement to make this repeatable; here's the seed that I picked to fix the random number generator. Then the function in sampling is called UPsystematic, which, in whatever order the population is given, will pick a random starting place and then skip down systematically among the noncertainty units to select a sample, and it will also pick off any certainties that you have. This smho98 file has only got 875 hospitals in it, so this creates a vector of 875 0s and 1s: zero means you're not in the sample, one means that you are. Then sampling has got another nice function called getdata; having assigned the result of UPsystematic to sam, I retrieve the sample units from the population file smho98. This just extracts the sample data, so that's easy. Now I want to append the weights, and I write my own statement for that. In the samdat object, I cbind, or column-bind, samdat, which I just extracted with getdata, and I add a column labeled weight, wt. What's it equal to? It's equal to 1 over pk, the selection probabilities, but only for the units where sam is exactly equal to 1. That's what this sam == 1 does: it extracts those spots from pk that are in the sample. And here are the first three sample units, units 36, 103, and 156, along with one of the variables in the file that's called STRATUM. Here are my beds, here's the expenditure total, and here's the weight: 72.27, 53.55, and so forth. The sampling package will select a number of different kinds of multistage samples too; however, one way to do this that's pretty straightforward and easy to keep track of is the following. You use sampling to select the first-stage units, the clusters, and you extract the sample data for those sample clusters using the getdata function.
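The select-then-weight step can be sketched in Python like this. It is an illustrative version of the systematic PPS idea behind R's `sampling::UPsystematic` (not that function itself), with a small made-up pk vector: certainties are taken automatically, and among the rest a random start is chosen and a unit is selected each time the cumulated pk crosses the next integer. Base weights are then 1/pk for the selected units.

```python
import numpy as np

def up_systematic(pk, rng):
    """Systematic PPS selection from a vector of inclusion probabilities,
    in the spirit of sampling::UPsystematic.  Certainties (pk >= 1) are
    always selected; among the noncertainties, cumulate the pk's and take
    a unit whenever the running total crosses u + k for the random start
    u and integers k.  Returns a 0/1 in-sample indicator vector."""
    pk = np.asarray(pk, float)
    ind = (pk >= 1).astype(int)                # certainties go straight in
    idx = np.flatnonzero(pk < 1)
    cum = np.cumsum(pk[idx])
    u = rng.uniform(0, 1)                      # random starting place
    crossings = np.floor(cum - u)              # integers crossed so far
    prev = np.concatenate(([np.floor(-u)], crossings[:-1]))
    ind[idx[crossings > prev]] = 1             # crossed a new integer: selected
    return ind

pk = np.array([0.2, 0.4, 0.4, 0.6, 0.8, 0.6])  # sums to 3, so n = 3
rng = np.random.default_rng(20240601)          # fix the generator, as with set.seed
ind = up_systematic(pk, rng)
wt = 1 / pk[ind == 1]                          # base weights for the sample
```

Because the probabilities sum to an integer, systematic selection returns exactly that many units no matter where the random start lands.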
And then you can treat those first-stage units, or clusters, as strata and select a stratified sample from the sample clusters using the strata function. That works fine, and you can keep track of the selection probabilities. So in a sense it's a straightforward way to think about it, and possibly more straightforward than some of the multistage routines that are offered in the package. Now, there are other pieces of software around, certainly. SAS is one; it's got a procedure called PROC SURVEYSELECT that will select quite a few kinds of samples: simple random, Bernoulli, Poisson, PPS. You can do stratified samples of those types, so it's got some features you may like. In Stata it's easy enough to select a simple random sample using the code in the basic package. There are also user-written packages, or you can develop your own code to pick other types of samples. There are some good examples at a website at UCLA, the University of California, Los Angeles, in the US, and I'd recommend it as a good source for information on all sorts of statistical software questions and usages. They have got pages that cover Stata, SAS, R in general, and the R survey package; it's quite a nice selection of things. Now, nonprobability samples: what do we do about those? As I said earlier, you don't really have base weights in the sense of inverse selection probabilities, because you don't have a probability sample. You can compute quasi-randomization weights; at least that's one approach people have taken to dealing with these things. One way to do that is to have something called a reference sample. That could be different things: it could be census microdata, say, if you've got that available in your country; you could select a small parallel probability sample; or you could use a large independent probability sample. In other words, in the second case you select your own reference sample, and in the other two you'd use one that exists out there already.
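The clusters-as-strata recipe above can be sketched in Python with made-up numbers. The point is the bookkeeping, not the selection method, so this sketch uses simple random sampling at both stages: a cluster's units are only sampled if the cluster itself was sampled, and the base weight is the product of the inverse probabilities from the two stages.

```python
import numpy as np

# Hypothetical frame: 6 clusters with made-up unit counts.
rng = np.random.default_rng(7)
cluster_sizes = {1: 40, 2: 25, 3: 60, 4: 15, 5: 35, 6: 25}
N_clusters, n_clusters, n_within = len(cluster_sizes), 3, 5

# Stage 1: simple random sample of clusters; each has probability n/N.
sampled_clusters = rng.choice(sorted(cluster_sizes), size=n_clusters, replace=False)
p1 = n_clusters / N_clusters

rows = []
for c in sampled_clusters:
    # Stage 2: treat the sampled cluster as a stratum and draw an SRS
    # of its units; within-cluster probability is n_within / N_c.
    units = rng.choice(cluster_sizes[int(c)], size=n_within, replace=False)
    p2 = n_within / cluster_sizes[int(c)]
    for u in units:
        rows.append((int(c), int(u), 1 / (p1 * p2)))   # base weight = 1/(p1*p2)
```

Each row carries its own cluster, unit, and weight, so the selection probabilities stay easy to track, which is the advantage the video points out.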
And this last one is a handy choice because you don't have to spend any extra money, as long as it meets your needs. Now, what do you do at that point? You combine your nonprobability sample with the reference sample, whatever it is, and then you run a logistic regression, basically, to predict the probability of being in the nonprobability sample. That's a kind of pseudo selection probability, a pseudo probability, as it were, of being in the nonprobability sample. So you could use the inverse of that as a base weight, or pseudo base weight, if you wanted to. Based on empirical work that people have done, it seems like you need quite an extensive set of covariates, in both your reference and nonprobability samples, for this to be effective. If you're doing a household survey, you can't get by with just age, race, and gender, for example. You need things like education level, income level, and maybe a number of other things to get realistic pseudo-probabilities for use. So in the coming videos we'll give you some more software examples of how to fill in the steps, including nonresponse adjustment and calibration.
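The quasi-randomization steps just described can be sketched in a few lines of Python. Everything here is illustrative: the covariates are simulated stand-ins (an age and an education score, with the nonprobability sample skewing younger and more educated), and the logistic regression is fit by plain gradient ascent rather than a survey-grade estimator. The point is only the mechanics: stack the two samples, predict membership in the nonprobability sample, and invert the fitted probabilities.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: 200 nonprobability cases and a 300-case reference sample.
n_np, n_ref = 200, 300
X_np  = np.column_stack([rng.normal(35, 8, n_np),   rng.normal(14, 2, n_np)])
X_ref = np.column_stack([rng.normal(45, 12, n_ref), rng.normal(12, 3, n_ref)])

X = np.vstack([X_np, X_ref])
X = np.column_stack([np.ones(len(X)), (X - X.mean(0)) / X.std(0)])  # intercept + z-scores
y = np.r_[np.ones(n_np), np.zeros(n_ref)]   # 1 = in the nonprobability sample

# Fit logistic regression by simple gradient ascent on the log-likelihood.
beta = np.zeros(X.shape[1])
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += 0.01 * X.T @ (y - p) / len(y)

p_all = 1 / (1 + np.exp(-X @ beta))
pseudo_p  = p_all[:n_np]        # fitted pseudo-probabilities of participation
pseudo_wt = 1 / pseudo_p        # pseudo base weights for the nonprob cases
```

In practice you would carry many more covariates than these two, for the reason given above: a thin covariate set tends to produce unrealistic pseudo-probabilities.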