Now we're going to use the Chi-Square test of independence to test the hypothesis I proposed about smoking frequency and nicotine dependence from working with the NESARC data. Specifically, is how often a person smokes related to nicotine dependence among current, young adult smokers? Or, in hypothesis testing terms, are smoking frequency and nicotine dependence independent or dependent? That is, are the rates of nicotine dependence equal or not equal among individuals from my different smoking frequency categories?

For this analysis, I'm going to use a categorical explanatory variable with six levels: the number of days smoked per month, which you may remember I called USFREQMO, with the following categorical values. Smoking approximately 1 day per month, 2.5 days per month, 5 days per month, 14 days per month, 22 days per month, and 30 days per month. My response variable is categorical with two levels, that is, the presence or absence of nicotine dependence in the past 12 months, called TAB12MDX in the NESARC data set.

To run this in SAS, we need only to extend our frequency procedure. We include PROC FREQ; then the TABLES statement, our categorical response variable, an asterisk, our categorical explanatory variable, followed by a forward slash and the abbreviated term for Chi-Square, CHISQ. Again, we end the command with a semicolon, and then save and run the program.

The SAS results for a Chi-Square include a table of the response variable by the explanatory variable, and also a calculation of the Chi-Square statistic, along with the associated p-value. Looking back at the Chi-Square table, also known as the cross tabs, or cross-tabulation, you can see a myriad of numbers and percentages, with such labels as frequency, percent, row percent, and column percent. Our p-value of 0.0001 clearly tells us that smoking frequency and nicotine dependence are associated.

>> The Chi-Square table can be very confusing on first examination.
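As an aside on what the Chi-Square statistic in that output measures, here is a minimal sketch in Python rather than SAS. It is only an illustration under stated assumptions: the function name chi_square_statistic is hypothetical, and the counts are made up, not NESARC results. It compares each observed count with the count we would expect if the two variables were independent (row total times column total, divided by the grand total).

```python
def chi_square_statistic(observed):
    """Return (statistic, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count under independence of the row and column variables.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected

    # Degrees of freedom: (number of rows - 1) * (number of columns - 1).
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Hypothetical counts, NOT NESARC data.
# Rows: nicotine dependence absent (0), present (1).
# Columns: two made-up smoking frequency groups.
example = [[30, 10],
           [20, 40]]

stat, dof = chi_square_statistic(example)
print(round(stat, 2), dof)  # prints 16.67 1
```

With the six smoking frequency categories and two levels of nicotine dependence described above, the same formula gives (2 - 1) * (6 - 1) = 5 degrees of freedom.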
Before we try to interpret this output, let's look at three different tables that pull apart the different numbers represented in a cross tabs. For example, we're going to use percentages from a Chi-Square table examining the distribution of insured and uninsured individuals by geographic region.

Table A shows row percentages. Each cell includes the percent of observations within each row, that is, within each region (Northeast, Midwest, South, and West), that are either insured or uninsured. >> As you can see, adding across the rows gives 100% of the observations within each region. Table B includes the total percentage of observations in each cell. Here the percentages across all of the cells in the table, taken together, add up to 100%. Finally, Table C shows column percentages. Each cell includes the percent of observations within each column, that is, within groups, either insured or uninsured. Adding down the columns gives us 100% of observations by insurance status.

>> So which of these percentage types should we examine when trying to interpret the Chi-Square results for smoking frequency and nicotine dependence? If the output is set with the explanatory variable categories across the top of the table and the response variable categories down the side, it will be the column percents that we want to interpret. In other words, we're interested in whether the rate of nicotine dependence differs according to which explanatory group the observations belong to, that is, which smoking frequency group. Notice that we're not interested in the column percentages for those observations without nicotine dependence, indicated with the dummy code of zero. Instead, we're interested in describing the presence of nicotine dependence within the smoking frequency groups, that is, these column percentages circled in blue.
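The three kinds of percentages in a cross tabs can be sketched from a single table of counts. This is a small Python illustration, not SAS output, and the counts for region by insurance status are invented for the example. The variable names are assumptions as well.

```python
# Hypothetical counts, NOT real insurance data.
# Rows are regions; columns are insurance status.
regions = ["Northeast", "Midwest", "South", "West"]
statuses = ["insured", "uninsured"]
counts = [[80, 20],
          [90, 10],
          [120, 40],
          [110, 30]]

grand_total = sum(sum(row) for row in counts)
col_totals = [sum(col) for col in zip(*counts)]

# Row percentages: each cell as a percent of its row (region) total.
row_pct = [[100 * c / sum(row) for c in row] for row in counts]

# Total percentages: each cell as a percent of all observations.
total_pct = [[100 * c / grand_total for c in row] for row in counts]

# Column percentages: each cell as a percent of its column (status) total.
col_pct = [[100 * c / col_totals[j] for j, c in enumerate(row)]
           for row in counts]

for name, r, t, c in zip(regions, row_pct, total_pct, col_pct):
    print(name, [round(x, 1) for x in r],
          [round(x, 1) for x in t], [round(x, 1) for x in c])
```

In this sketch each row of row_pct sums to 100, each column of col_pct sums to 100, and all of the cells of total_pct together sum to 100, which mirrors Tables A, C, and B in turn.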
If I use SAS code to graph the percent of young adult smokers with nicotine dependence within each smoking frequency category, I can visualize the association and see that there seems to be a positive linear relationship. That is, the more days per month a young adult smokes, the more likely they are to have nicotine dependence. I know from looking at the significant p-value of 0.0001 that I will reject the null hypothesis and accept the alternate hypothesis that not all nicotine dependence rates are equal across smoking frequency categories. If my explanatory variable had only two levels, I could interpret the two corresponding column percentages and be able to say which group had a significantly higher rate of nicotine dependence. But my explanatory variable has six categories. So I know that not all are equal, but I don't know which are different and which are not.
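The percentages being graphed here are simply the average of the 0/1 dummy code for nicotine dependence within each smoking frequency group. A minimal Python sketch of that idea follows. The records are invented for illustration and are not NESARC data, and the function name percent_dependent_by_group is hypothetical.

```python
from collections import defaultdict

# Hypothetical (days smoked per month, nicotine dependence 0/1) records.
# These are NOT NESARC observations.
records = [(1, 0), (1, 0), (1, 1), (1, 0),
           (30, 1), (30, 1), (30, 0), (30, 1)]


def percent_dependent_by_group(rows):
    """Percent of observations coded 1 within each explanatory group."""
    totals = defaultdict(int)
    dependents = defaultdict(int)
    for group, dependent in rows:
        totals[group] += 1
        dependents[group] += dependent  # adding 0/1 codes counts the 1s
    return {g: 100 * dependents[g] / totals[g] for g in sorted(totals)}


rates = percent_dependent_by_group(records)
print(rates)  # prints {1: 25.0, 30: 75.0}
```

Each value is the column percentage for the presence of nicotine dependence in that group, which is the number a bar chart of this kind would plot.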