0:08

Now we're going to use the chi-square test of independence to

test the hypothesis I proposed about smoking frequency and nicotine dependence,

from working with NESARC data.

Specifically, is how often a person smokes related to

nicotine dependence among current young adult smokers?

0:26

Or in hypothesis testing terms, are smoking frequency and

nicotine dependence independent or dependent?

That is, are the rates of nicotine dependence equal or

not equal among individuals from my different smoking frequency categories?

0:44

For this analysis,

I'm going to use a categorical explanatory variable with six levels:

the number of days smoked per month, which you may remember I called USFREQMO,

with the following categorical values:

smoking approximately 1 day per month, 2.5 days per month, 5 days per

month, 14 days per month, 22 days per month, and 30 days per month.

1:23

To run this in Python, we'll import the SciPy stats library.

Next we will request our contingency table of observed counts, which I am calling

ct1, and we'll use the pandas crosstab function to generate these.
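The crosstab step just described can be sketched as follows. This is a minimal illustration, assuming a hypothetical data frame named `sub` with one row per participant; the ten rows are made up and are not NESARC values. Only the column names USFREQMO and TAB12MDX and the table name ct1 come from the lecture.

```python
import pandas as pd

# Hypothetical toy data (NOT NESARC values): one row per participant.
sub = pd.DataFrame({
    "USFREQMO": [1, 1, 1, 5, 5, 5, 30, 30, 30, 30],   # days smoked per month
    "TAB12MDX": [0, 0, 1, 0, 1, 0, 1, 1, 0, 1],       # 1 = nicotine dependence
})

# Response variable first (rows, down the side),
# explanatory variable second (columns, across the top).
ct1 = pd.crosstab(sub["TAB12MDX"], sub["USFREQMO"])
print(ct1)
```

Passing the response variable first puts its categories down the side of the table and the explanatory categories across the top.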

2:54

As an extra note, in Python, the object ct1 here is what is called

a two-dimensional array, where the first dimension, called axis = 0,

runs down the rows, and the second dimension, called axis = 1,

runs across the columns, so summing along axis = 0 gives one total per column.

Finally, I request chi-square calculations,

which include the chi-square value, the associated p-value, and

a table of expected counts that are used in these calculations.

I call these calculations cs1 and ask Python to print them.
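A minimal sketch of the chi-square request, assuming a small made-up table of observed counts rather than the lecture's actual ct1. The group labels "low", "mid", and "high" and the variable names other than ct1 and cs1 are hypothetical. `scipy.stats.chi2_contingency` also returns the degrees of freedom alongside the three values mentioned above.

```python
import pandas as pd
import scipy.stats

# Hypothetical observed counts (NOT the lecture's table).
# Rows: response (0 = no dependence, 1 = dependence).
# Columns: three made-up smoking frequency groups.
ct1 = pd.DataFrame(
    [[40, 30, 20],
     [10, 20, 40]],
    index=pd.Index([0, 1], name="TAB12MDX"),
    columns=pd.Index(["low", "mid", "high"], name="freq_group"),
)

# Returns the chi-square statistic, the p-value, the degrees of freedom,
# and the table of counts expected if the two variables were independent.
cs1 = scipy.stats.chi2_contingency(ct1.to_numpy())
chi2, p_value, dof, expected = cs1

print("chi-square value:", chi2)
print("p-value:", p_value)
print("degrees of freedom:", dof)
print("expected counts:")
print(expected)
```

Each expected count is the row total times the column total divided by the overall total; for the top-left cell of this toy table that is 90 × 50 / 160 = 28.125.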

3:27

My results first include the table of counts

of the response variable by the explanatory variable.

You can see that there were 64 participants who smoked approximately

one day a month without nicotine dependence,

and seven participants who smoked once a month with nicotine dependence.

3:47

At the other end of the table, among those smoking daily, that is 30 days a month,

521 participants do not have nicotine dependence,

and 799 do have nicotine dependence.

4:12

Examining these column percents for those with nicotine dependence,

that is, TAB12MDX = 1, we see that as smoking frequency increases,

the rate of nicotine dependence also increases.

Now, looking at the chi-square results, the chi-square value is large, 165,

and the p-value, shown in scientific notation, is quite small,

approximately 7.4e-34,

which clearly tells us that smoking frequency and

nicotine dependence are significantly associated.

So why did we calculate the column percents?

To better understand this choice, let's look at three different tables that pull

apart the different numbers represented in a cross-tabs contingency table.

For example, we're going to use percentages from a chi-square table examining

the distribution of insured and uninsured individuals by geographic region.

5:09

Table A shows row percentages.

Each cell includes the percent of observations within each row,

that is, within region (Northeast, Midwest, South, and West),

that are either insured or uninsured.

5:26

As you can see,

adding across the rows gives us 100% of the observations within region.

Table B includes the total percent of observations in each cell.

Here, the percentages across all cells of the table add up to 100%.

Finally, Table C shows column percentages.

Each cell includes the percent of observations within each column,

that is, within groups either insured or uninsured.
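The three percentage types can be compared side by side in a short sketch. The counts below are made up for illustration and are not the figures from the tables shown in the lecture; the variable names are hypothetical as well.

```python
import pandas as pd

# Hypothetical counts (made up): insurance status by geographic region.
counts = pd.DataFrame(
    {"insured": [80, 60, 90, 70], "uninsured": [20, 40, 60, 30]},
    index=pd.Index(["Northeast", "Midwest", "South", "West"], name="region"),
)

# Table A: row percentages, each row (region) sums to 100.
row_pct = counts.div(counts.sum(axis=1), axis=0) * 100

# Table B: total percentages, all cells together sum to 100.
total_pct = counts / counts.to_numpy().sum() * 100

# Table C: column percentages, each column (insured / uninsured) sums to 100.
col_pct = counts.div(counts.sum(axis=0), axis=1) * 100

print(row_pct.round(1))
print(total_pct.round(1))
print(col_pct.round(1))
```

When working from raw records instead of counts, the `normalize` argument of `pandas.crosstab` (`"index"`, `"all"`, or `"columns"`) should give the same three views as proportions.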

6:00

So which of these percentage types should we calculate when trying to interpret

the chi-square results for smoking frequency and nicotine dependence?

If the output is set with the explanatory variable categories across the top of

the table, and response variable categories down the side,

it will be the column percent that we want to interpret.

6:20

In other words, we're interested in whether the rate of nicotine dependence

differs according to which explanatory group the observations belong to, that is,

which smoking frequency group.

Notice that we are not interested in the column percentages for

those observations without nicotine dependence,

indicated with a dummy code of 0.

Instead, we're interested in describing the presence of nicotine dependence within

the smoking frequency groups; that is, these column percentages circled in blue.
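As a worked illustration of those column percentages, the sketch below uses only the two columns whose counts are quoted in this lecture (1 day per month and 30 days per month). The four middle groups are left out because their counts are not given here, and the names `ct_partial`, `colsum`, and `colpct` are assumptions for this example.

```python
import pandas as pd

# Counts quoted in the lecture for two of the six frequency groups.
# Rows: TAB12MDX (0 = no dependence, 1 = dependence).
ct_partial = pd.DataFrame(
    {1: [64, 7], 30: [521, 799]},
    index=pd.Index([0, 1], name="TAB12MDX"),
)
ct_partial.columns.name = "USFREQMO"

colsum = ct_partial.sum(axis=0)   # one total per column (frequency group)
colpct = ct_partial / colsum      # each cell divided by its column total

# The row of interest is TAB12MDX = 1: the dependence rate within each group.
print(colpct.loc[1])
```

With these counts, the dependence rate is about 9.9% (7 of 71) in the once-a-month group and about 60.5% (799 of 1,320) among daily smokers.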

If I want to graph the percent of young adult smokers with nicotine dependence

within each smoking frequency category, I would first import the seaborn and

matplotlib.pyplot libraries and then add the following code,

first setting our explanatory variable to categorical and

our response variable to numeric,

and then requesting a bivariate bar chart

with smoking frequency categories on the x-axis, and the mean for

nicotine dependence, which is the proportion of ones, on the y-axis.
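A rough sketch of that plotting step, under stated assumptions: the data frame `sub` and its ten rows are hypothetical (not NESARC values), and `seaborn.barplot` is used here as one way to get a bar chart of group means; the exact plotting call from the lecture is not reproduced in this transcript.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display; drop it to view the plot
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data (NOT NESARC values).
sub = pd.DataFrame({
    "USFREQMO": [1, 1, 1, 5, 5, 5, 30, 30, 30, 30],
    "TAB12MDX": [0, 0, 1, 0, 1, 0, 1, 1, 0, 1],
})

sub["USFREQMO"] = sub["USFREQMO"].astype("category")  # explanatory: categorical
sub["TAB12MDX"] = pd.to_numeric(sub["TAB12MDX"])      # response: numeric 0/1

# The mean of a 0/1 variable is the proportion of ones in each group.
rates = sub.groupby("USFREQMO", observed=True)["TAB12MDX"].mean()
print(rates)

# barplot draws the group mean by default (with error bars).
fig, ax = plt.subplots()
sns.barplot(x="USFREQMO", y="TAB12MDX", data=sub, ax=ax)
ax.set_xlabel("Days smoked per month")
ax.set_ylabel("Proportion nicotine dependent")
```

The printed `rates` are the same quantities as the bar heights, which makes it easy to check the chart against the column percentages for TAB12MDX = 1.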

Now I can visualize the association, and see even more clearly that there seems

to be a positive linear relationship; that is, the more days per month a young adult

smokes, the more likely they are to have nicotine dependence.

I know from looking at the significant p-value

that I will reject the null hypothesis in favor of the alternate hypothesis

that not all nicotine dependence rates are equal across smoking frequency categories.

If my explanatory variable had only two levels,

I could interpret the two corresponding column percentages and be able to say

which group had a significantly higher rate of nicotine dependence.

But my explanatory variable has six categories.

So I know that not all are equal,

but I don't know which are different and which are not.
