>> Remember, we stated that exploratory data analysis begins by looking at one variable at a time. This is called univariate or descriptive analysis. In order to convert raw data into useful information, we need to summarize and examine the distribution of any variable of interest. The variables of interest are the ones that matter to you, the researcher, in answering your research questions, addressing your research problem, and telling the story you wish to tell with your research. By the distribution of a variable, we mean what values the variable takes and how often it takes those values. Here's an example.
>> A random sample of 1,200 US college students was asked the following question as part of a larger survey: What's your perception of your own body? Do you feel that you're overweight, about right, or underweight? This table shows part of the data, 5 of the 1,200 observations. Information that would be interesting to get from this data includes: What percentage of the sampled students fall into each category? How are students divided across the three body image categories? Are they equally divided? If not, do the percentages follow some kind of pattern? There's no way that we can answer these questions by looking at the raw data, which are in the form of a long list of 1,200 responses. That's just not very useful. However, all these questions will be easily answered once we summarize and look at the frequency distribution of the variable body image, that is, once we summarize how often each of the categories occurs. In order to summarize the distribution of a categorical variable, we first create a table of the different values, or categories, the variable takes. Then we record how many times each value occurs, which is the count, and, more importantly, how often each value occurs, which is expressed by converting the counts to percentages.
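The conversion from counts to percentages described here can be written out as a short formula. The symbols are notation introduced for this sketch, not taken from the lecture: n_k is the count for category k, and N is the total number of responses (1,200 in this example).

```latex
\[
\text{percent}_k = 100 \times \frac{n_k}{N},
\qquad
N = \sum_{k} n_k = 1200
\]
```

Because each percentage is rounded separately, the reported percentages can add up to slightly more or less than 100.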
Now that we've summarized the distribution of the body image variable, let's go back and interpret the results in the context of the questions that we posed. What percentage of the sampled students fall into each category? How are students divided across the three body image categories, and are they equally divided? You can see that most of the sample, 71.3%, felt that their weight was about right, and that a comparatively small percentage, 9.2%, felt underweight. The overweight category was 19.6%. Go back to your SAS program to learn how to use the frequency procedure, which is what you use to generate distributions for the variables of interest that pertain to your research question. The frequency procedure is typed as PROC FREQ, followed by a semicolon. Next, include the TABLES statement, typed as TABLES followed by a list of the variables that you would like to examine.
>> For my research question, I'm interested in the association between how much a person smokes, that is, the quantity and frequency of smoking, and the presence or absence of nicotine dependence. Going back to the NESARC codebook, you may recall that I have a number of variables that measure smoking behavior and nicotine dependence. It's the actual name of the variable, rather than the longer descriptive name, that I'll include in my program in order to generate frequency distributions. I have the names of each of the variables circled here in red.
>> Here's an example of a NESARC program with the frequency procedure and TABLES statement added. A RUN statement is then added at the end of the program. The RUN statement is responsible for executing all the previously entered SAS statements in this program. When you add PROC FREQ and a TABLES statement with your own variables of interest from the dataset you've chosen for your research questions, and then conclude with a RUN statement, you've written an entire SAS program.
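A minimal sketch of what such a program might look like is shown below. The transcript does not spell out the file path, the dataset name, or the variable names shown on screen, so the path, `nesarc_pds`, `new`, and the three variable names in the TABLES statement are all assumptions used for illustration; substitute the names from your own codebook.

```sas
/* Sketch only: the path, dataset name, and variable names are assumptions. */

/* Assign the reference library MYDATA (hypothetical path). */
LIBNAME mydata "/path/to/course/data" access=readonly;

/* DATA step: copy the source data into a working dataset called NEW. */
DATA new;
  SET mydata.nesarc_pds;   /* assumed dataset name */

/* PROC step: request one frequency table per listed variable.     */
/* Assumed variables: nicotine dependence, smoking frequency,      */
/* and smoking quantity.                                           */
PROC FREQ;
  TABLES TAB12MDX S3AQ3B1 S3AQ3C1;
RUN;
```

With no DATA= option, PROC FREQ reads the most recently created dataset, which here is `new`. Writing `PROC FREQ DATA=new;` is an equivalent, more explicit form.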
You will build on this program as you progress through your project. Before you run the program, it's always a good idea to save it. Click the Save button at the top of the program window. After you save the program, you can run it by clicking the Run button at the top of your program menu. SAS Studio has three tabs located at the top of the main program window. You write your code on the page marked by the Code tab. Just to the right of the Code tab is the Log tab, and to the right of that is the Results tab. Once your program has finished running, either the Results tab or the Log tab will open. It's a very good habit to check your Log tab first after running a program, so you can catch any errors in your program. When you click on the Log tab, you'll see a section at the top of the page entitled Errors, Warnings, Notes. If any errors were detected in the program, the number of errors will be listed in parentheses to the right of the red X Errors icon. The same goes for warnings and notes. If there are any warnings, the number will be listed in parentheses to the right of the yellow triangle Warnings icon. And any notes SAS has about your program will be counted, again in parentheses, to the right of the blue circled-i Notes icon. If you double-click on any of these icons, a list of the corresponding errors, warnings, or notes will appear beneath it. If you do find any errors listed on the Log page, go back to the Code page, correct the errors, save the corrected code, and then run your program again. If no errors are listed on the Log page, click back to the Results page and you'll see the results of your program. In this case, they will be the distribution tables you generated. In this example, our program ran successfully, and the notes listed in blue font tell us a few things. They tell us that the reference library, which was called MYDATA, was successfully assigned.
Also, there were 43,093 observations read from the dataset.
>> That's good news, since I know that is the number of observations, or individuals, in the dataset. If the number of observations seems incorrect, it's possible that there is a problem with the program that caused it to process the data incorrectly without actually generating an error message.
>> If your program ran successfully and there are no error messages in the log, then go ahead and click the Results tab to see the frequency tables that were generated for each of your variables. Because the actual variable names are cryptic, it's a good idea to give them labels that are more easily interpreted at a glance. We'll use our NESARC program as an example and show you how to add variable labels to your program. SAS programs are made up of two distinct kinds of steps, DATA steps and PROC steps. DATA steps enable you to manage and manipulate your data: you write code telling SAS exactly what to do with your data. PROC, or procedure, steps enable you to analyze and present your data. PROC steps are pre-written procedures, so the code in a PROC step does not give SAS instructions to execute the way the code you write in a DATA step does. The PROC code you write basically controls how the procedure runs. With this in mind, you're going to enter your LABEL statement here. Labels require the keyword LABEL, followed by your variable names. Each variable name is followed by an equal sign and then the new descriptive label within quotation marks. Once you've completed the list of your new variable labels, the LABEL statement, like all SAS statements, ends with a semicolon. Then save your program and rerun it by clicking the Run button.
>> I check the Log, which shows no errors, and then I view the Results. As you can see, my frequency distributions now include both the variable names and the variable labels.
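The LABEL syntax described above might look like the following sketch. The transcript only says the statement goes "here" on screen, so placing it in the DATA step is an assumption, as are the path, the dataset name `nesarc_pds`, the variable names, and the label text.

```sas
/* Sketch only: path, dataset, variable names, and label text are assumptions. */
LIBNAME mydata "/path/to/course/data" access=readonly;  /* hypothetical path */

DATA new;
  SET mydata.nesarc_pds;   /* assumed dataset name */
  /* One LABEL statement can cover several variables:            */
  /* name = "descriptive label", with a single closing semicolon */
  LABEL TAB12MDX = "Tobacco Dependence Past 12 Months"
        S3AQ3B1  = "Usual Smoking Frequency"
        S3AQ3C1  = "Usual Smoking Quantity";

PROC FREQ;
  TABLES TAB12MDX S3AQ3B1 S3AQ3C1;
RUN;
```

A label assigned in a DATA step is stored with the dataset that step creates, so every later PROC step that reads `new` displays it. A LABEL statement placed inside a PROC step applies only to that procedure's output.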
If I look at the first table, which shows the distribution of responses for the variable tobacco dependence in the last 12 months, I know from consulting my codebook that one means yes and zero means no. Following this list of response options, the table also shows the frequency of each response, the percent of individuals with and without nicotine dependence, the cumulative frequency, and the cumulative percent. By consulting your codebook, you can interpret and describe each of these frequency tables in the same way.
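For reference, the two cumulative columns can be written as running totals down the rows of the table. The notation is introduced here for illustration and is not from the lecture: n_j is the frequency in row j, rows are taken in the order the table lists them, and N is the sum of all frequencies in the table.

```latex
\[
\text{cumulative frequency}_k = \sum_{j \le k} n_j,
\qquad
\text{cumulative percent}_k = 100 \times \frac{\sum_{j \le k} n_j}{N}
\]
```

In the last row, the cumulative frequency equals N and the cumulative percent equals 100. By default, PROC FREQ leaves missing values out of these calculations and reports their count separately beneath the table, so N can be smaller than the number of observations read.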