Hi, my name is Michael Sobel and I'm a statistics professor at Columbia University in New York City. I also have a background in the social sciences. My research is predominantly in the area of causal inference, where I work on a variety of problems, and I've also taught several courses at Columbia on causal inference. But the best place to start is by looking at the syllabus and talking about the course a little.

Causal relationships are of great interest in science, policy, business, and, more generally, daily life. Therefore, it's important to have principled methods for using data to draw conclusions about the presence or absence of such relationships. In the past 35 years or so, statisticians have been developing such methods, and there's now a large literature on the subject. The approaches and methods that stem from this literature have entered the mainstream of statistical thinking and are increasingly used in the sciences and social sciences as well. More and more, there are books and courses in biostatistics and statistics departments devoted to this subject.

Okay, there are a few key ideas and concepts at the heart of these methods, and it is the aim of this course to first present these ideas and concepts, and then to use them to address real-life problems of interest that arise in both experimental and non-experimental contexts where causal relationships are of interest. With such a large subject matter and a short course such as this, it is not possible to both survey the literature and cover it in depth, and I have opted for the survey approach. However, for each topic covered, I will point you to additional readings.

Now, on background, the course is geared to students who have studied at least calculus and linear algebra and who have taken several statistics courses. In terms of prerequisites, I would expect the introductory sequence in probability and statistical inference, say a one-year sequence, and a course in linear regression. Students may proceed at whatever pace they are comfortable with.

So we will begin by taking up the concept of causation. This isn't a philosophy course, so our treatment will be brief. But causal inference is the act of making inferences about causation, a concept which philosophers and others have discussed for many centuries, and upon which there is still no consensus. Therefore it behooves us to say something about the notion of causation we are making inferences about. After that, we develop a notation, the so-called potential outcomes notation, that captures this idea. This notation is the key development that lets us mathematically define what we mean by an effect, and to do so independently of any procedures we might use to make a statement about the value of an effect. This separation is absolutely critical, as it is what allows us to ask under what conditions a procedure, say linear regression, can or cannot be used to estimate such effects. Furthermore, the notation becomes indispensable in complicated situations, where it allows us to avoid confusing ourselves.

With the potential outcomes notation in hand, we can then define unit and average effects. Units are subjects in an experiment, participants in a survey, etc., and the average is taken over these units or with respect to some population from which the units have been drawn. Without making additional assumptions, of course, we cannot observe the unit effects, and so we typically focus on the average effects.
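To preview the notation in a display (a minimal sketch; the particular symbols below are illustrative choices, not yet fixed by the course):

```latex
% Potential outcomes for a binary treatment (illustrative notation).
% T_i \in \{0,1\} is the treatment unit i receives; Y_i(1) and Y_i(0) are the
% outcomes unit i would exhibit under treatment and under control.
\begin{align*}
  \tau_i     &= Y_i(1) - Y_i(0)                        && \text{unit-level effect (never observed directly)}\\
  \bar{\tau} &= \mathbb{E}\bigl[\,Y(1) - Y(0)\,\bigr]  && \text{average effect over the units or a population}\\
  Y_i        &= T_i\,Y_i(1) + (1 - T_i)\,Y_i(0)        && \text{only one potential outcome is observed per unit}
\end{align*}
```

The last line is exactly why unit effects cannot be observed without further assumptions: each unit reveals only one of its two potential outcomes.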
We will talk about conditions under which these average effects are identified and can be estimated. I'm going to begin by illustrating and discussing these issues in the context of some simple randomized experiments. We shall see that various types of average effects are identified in such experiments, and we shall discuss estimation and hypothesis testing for various types of randomized experiments. In so doing, we shall consider both randomization inference, in which we ask questions only about the subjects in the experiment, and super-population based inference, where we think of the subjects as members of a larger collection of elements that we want to make inferences about.

A good understanding of why causal inferences are justified in randomized experiments makes it possible to extend that justification to observational studies, where the researcher does not have the ability to assign study participants to receive or not receive a treatment. Observational studies are common, especially in the social sciences, and are often used to address questions about causation. For example, a researcher may want to study the effect of education on earnings, but it is not possible to assign persons to different levels of schooling. Further, in addition to wanting to know the average effect for a population of individuals, the researcher might want to know the values for various subgroups, for example, men and women, or, say, the average effect of a university education on earnings for those people who actually attend university, not for everybody, just for those who attend.

In principle, this extension to observational studies is mathematically straightforward. If the investigator knows the pretreatment variables, which we'll call covariates, that predict both subjects' treatment assignments and their outcomes, then within each value of the covariates it is as if a randomized experiment had been conducted. Analysis is then also conceptually straightforward. That said, there are many practical issues that arise in observational studies about how to estimate average treatment effects, issues which are not really problematic in randomized experiments. This has motivated a number of different approaches to the estimation of effects in observational studies, for example, matching, weighting, regression, and subclassification, and it's important for us to understand these different approaches. These approaches are also utilized in more complicated settings, for example, in sequentially randomized experiments and longitudinal observational studies, where the estimands of interest are more complicated than just an average treatment effect. We shall look at such studies and more complicated estimands in the sequel.
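To make the covariate-adjustment idea above concrete, here is a minimal sketch in code; the variable names, the simulated data, and the use of NumPy are my choices for illustration, not part of the course materials.

```python
# Minimal sketch: two estimators of an average treatment effect, assuming a
# binary treatment t, an outcome y, and a discrete covariate x that predicts
# both treatment assignment and the outcome.
import numpy as np

def difference_in_means(y, t):
    """Unadjusted contrast, appropriate for a completely randomized experiment."""
    y, t = np.asarray(y, dtype=float), np.asarray(t)
    return y[t == 1].mean() - y[t == 0].mean()

def subclassification(y, t, x):
    """Average the treated-versus-control contrast within each value of x,
    weighting each stratum by its share of the sample.  This mimics the idea
    that, within each value of the covariates, it is as if a randomized
    experiment had been conducted."""
    y, t, x = np.asarray(y, dtype=float), np.asarray(t), np.asarray(x)
    effect = 0.0
    for value in np.unique(x):
        stratum = (x == value)
        contrast = y[stratum & (t == 1)].mean() - y[stratum & (t == 0)].mean()
        effect += stratum.mean() * contrast
    return effect

# Toy illustration with simulated (entirely artificial) data.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=1000)        # covariate
t = rng.binomial(1, 0.3 + 0.4 * x)       # treatment more likely when x = 1
y = 2.0 * t + 3.0 * x + rng.normal(size=1000)
print(difference_in_means(y, t))         # biased here, since x is a confounder
print(subclassification(y, t, x))        # close to the true effect of 2
```

Matching, weighting, and regression adjustment are alternative ways of carrying out essentially the same kind of covariate adjustment, each with its own practical trade-offs.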
Furthermore, in practice, one does not know whether all the relevant covariates have been identified and measured, and it is not possible to test for this directly. We can, however, at least address this issue through sensitivity analysis and other methods. This issue doesn't arise in randomized experiments, because the investigator ensures these conditions are met by controlling the assignment mechanism, which allocates subjects to treatments. That said, even randomized experiments can be problematic in practice. There are many ways they can go awry or fall short of addressing important scientific questions.

As an example where a randomized experiment breaks down, consider a study conducted by researchers at the University of Michigan in the 1990s, in which unemployed persons who were looking for work were assigned to receive or not receive assistance in looking for employment. In follow-ups, subjects were asked about their employment status and psychological state, among other things. Now, by virtue of the random assignment, there is no problem comparing the different groups of subjects on follow-up variables by their assignment. But because a substantial percentage of persons assigned to the treatment group did not actually take the treatment, this comparison does not estimate the effect of the treatment itself. That is because the treatment received is also an outcome, and an outcome that is not randomly assigned.

One could throw away the information from those subjects who were assigned to treatment but didn't take it and analyze the resulting subset of subjects, i.e., those assigned to treatment who actually took it up and those assigned to the control group. Or one could include the subjects assigned to treatment who did not take the treatment with those who were not assigned to take the treatment. Either way, intuition suggests there may be a problem. Perhaps the subjects assigned to treatment who did not take the treatment believed the program would not benefit them. If they were correct, throwing away their data and comparing only subjects who complied with their treatment assignment would lead to overestimating the benefits of the treatment. Similarly, if these subjects were included with the untreated, the effect of the treatment would be overestimated. In the sequel we shall look at some remedies.

More generally, substantive researchers are often interested in how outcomes are mediated by intermediate variables. For example, an educational researcher might want to know the effect of studying on students' scores on a widely administered exam. He cannot make students study, so he assigns them either to receive encouragement to study or not to receive encouragement. Comparison of the results for the two groups allows him to estimate the effect of encouragement on test scores. But he's also interested in the effect of time studied. Presumably, encouragement affects study time, which in turn translates into test scores. How can he estimate the effect of time studied as well?

As another example, a bit more complicated, but also closer to the types of questions a theoretically oriented scientist might ask: a neuroscientist studying pain may want to know how exposure to temperature affects a subject's reported pain. He randomly assigns his subjects to receive varying exposures, after which he asks the subjects to report their pain level on a 0 to 100 scale. But he or she is most likely also interested in how the pain is produced, that is, the pathways by which neural activity in different brain regions mediates the overall relationship between the stimulus and the subject's response.

The examples above, which are representative of many situations of both practical and scientific interest, require a researcher to deal with a response of interest and an intermediate outcome that lies between the treatment and the response. This is a very tricky and important subject, and we will take it up further in the sequel.
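For orientation only, here is one well-known estimand for the noncompliance problem in the job-search example; the notation and the assumptions named in the comments are standard in the instrumental-variables literature but have not yet been developed in the course.

```latex
% Z_i = random assignment, D_i(z) = treatment actually taken under assignment z,
% Y_i(z) = outcome under assignment z.  Under standard assumptions (no "defiers",
% and assignment affects the outcome only through the treatment taken),
\[
  \underbrace{\mathbb{E}\bigl[\,Y(1) - Y(0) \mid D(1) > D(0)\,\bigr]}_{\text{effect among compliers}}
  \;=\;
  \frac{\mathbb{E}[\,Y \mid Z = 1\,] - \mathbb{E}[\,Y \mid Z = 0\,]}
       {\mathbb{E}[\,D \mid Z = 1\,] - \mathbb{E}[\,D \mid Z = 0\,]},
\]
% that is, the assignment-based (intention-to-treat) contrast rescaled by the
% difference in treatment take-up between the two assignment arms.
```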
We will also consider in the sequel a few important topics that have received a great deal of attention in the social sciences, especially in economics.

First, regression discontinuity, a topic which goes back at least to the educational literature of the 1960s, and which also appears as risk-based allocation in the medical literature, is concerned with situations where treatment is assigned on the basis of a threshold. For example, persons who score above a cutoff on a reading test are not assigned to treatment, while persons who score below are assigned to take the treatment, a remedial reading course. After completion of the course, the reading ability of all subjects is assessed. What we would like to know is the effect of the remedial reading course on reading ability. But we do not have any idea how the poor readers would have done in the absence of the course, nor how the better readers would have done had they been given the course. And we have no groups of otherwise comparable subjects to compare, i.e., we have no good readers who took the course and no poor readers who did not.

Another topic, fixed effects regression models, has a long tradition in economics and econometrics. Suppose one wants to study the effect of globalization on economic growth. Clearly, one cannot assign countries to different levels of globalization, so this will be an observational study. Leaving aside the issue of how to measure these things, a naive comparison of countries with different levels of globalization will not suffice, as there are many other ways in which countries with different levels of globalization can also differ, and one does not want to attribute to globalization an effect due to other variables. Even if one can measure many of these other variables, there are most likely remaining factors that have not been measured, for example, cultural factors that are related to globalization and on which economic growth depends. The fixed effects approach in this example uses panel data to deal with such hidden factors, provided these are constant over time. For each country in the analysis, globalization and economic growth are measured at two or more times, and the hidden factors disappear when the difference between the time points is considered.

The same idea can be used with clustered data, in which subsets of observations are grouped together into natural hierarchies. For example, to study the effect of education on wages, one might use monozygotic twins, as they are genetically the same and also share a familial environment that is not adequately measured, but which contributes to both educational attainment and wages. We shall see that, from a causal inference standpoint, there are serious problems with the use of these models, both in the panel data context and in the context where observations are clustered or hierarchically organized. These difficulties are not generally recognized either in the technical literature or in the substantive literature which relies upon these kinds of methods.
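To make the differencing idea concrete, here is a minimal sketch of the algebra in notation of my own choosing; it takes a linear, constant-effect model at face value, which is exactly the kind of assumption whose causal content we will later scrutinize.

```latex
% Two-period fixed effects: Y_{it} = outcome (growth) and G_{it} = exposure
% (globalization) for country i at time t, with an unobserved, time-constant
% country factor \alpha_i.
\begin{align*}
  Y_{it}          &= \beta\, G_{it} + \alpha_i + \varepsilon_{it}, \qquad t = 1, 2,\\
  Y_{i2} - Y_{i1} &= \beta\,(G_{i2} - G_{i1}) + (\varepsilon_{i2} - \varepsilon_{i1}),
\end{align*}
% so the hidden \alpha_i cancels in the difference, provided it really is
% constant over time and the model is otherwise correctly specified.
```

The twin comparison exploits the same cancellation, with the shared family factor playing the role of the hidden country factor.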
Later, I want to take up a little more systematically two other important topics: longitudinal causal inference and interference. We've already considered some special cases of longitudinal causal inference in our discussion of mediation. Longitudinal causal inference is typically concerned with the case where a treatment regimen, a sequence of treatments, is assigned to a subject. For example, an experimenter may randomly assign subjects to receive medication twice daily, daily, or weekly, and one may wish to study the effect of the different regimens on a later outcome. A more complicated situation occurs when the treatment administered in a given period is allowed to depend upon previous treatments and outcomes. The same concerns apply to longitudinal observational studies.

Now, interference is different. Thus far we have assumed that a subject's outcome is affected only by his or her treatment assignment, not by the treatment assignments of others. This is often reasonable, but there are clearly important cases where it doesn't hold, and failing to take interference into account can lead to very inaccurate conclusions. As an example, the United States government conducted a randomized experiment to study the effects of moving from housing projects to suburbs. Participants living in housing projects were either assigned, and given assistance, to move to a low-poverty area, or were assigned to one of two other groups. Many participants assigned to move did not, in fact, do so. A key point is that participants knew one another; thus, if person A is best friends with person B and both were assigned to move, they may do so, but if only person A is assigned, perhaps neither moves. Later, an outcome of interest, say how safe a person feels in their environment, is measured. If this outcome depends on whether or not a person moved, one sees that the assignment of A or B can affect the outcome of person B or of A. Interference is a very interesting topic, as is longitudinal data analysis more generally, but each topic requires a lot of setup to discuss, and the analysis is not easy. However, we will briefly touch upon these things in the sequel.
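As a closing note on notation (a sketch only, with symbols of my own choosing), the difference between the two settings can be expressed by what a unit's potential outcome is allowed to depend on.

```latex
% No interference: unit i's potential outcome depends only on its own assignment,
% Y_i(z_i) with z_i in {0,1}.  Under interference, it may depend on the entire
% assignment vector of all N units:
\[
  Y_i(z_i) \quad\longrightarrow\quad Y_i(z_1, \ldots, z_N),
\]
% so that changing person B's assignment can change person A's outcome, as in
% the housing example above.
```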