0:53
There really are two types of interviewer effects,
each with its own set of studies and a dedicated literature.
One of them is due to interviewer behavior, or so
it seems, and is associated with increased variance attributable to interviewers.
The other is due to interviewers' fixed attributes, like race, gender, and age,
and it's reflected in directional error, or bias.
For example, more pro-feminist responses
to a gender-relevant question when it is asked by female rather than male interviewers.
So the final topic in our discussion of interviews and
interviewers concerns interviewer effects.
And we'll turn first to those that are due to interviewer behavior and
are reflected in interviewer variance.
When interviewers are included as a factor or a term in the analysis of survey data,
the variance that they contribute to total variance is generally attributed to
differences in behavior between one interviewer and the next,
that is, to how they administer the questionnaire.
So, for example, one interviewer may elicit more
strongly-disagree responses to a question, while another interviewer may
elicit more strongly-agree responses.
So just to make this a little more graphical: on the left is a sort of
depiction in which the true value is the red bull's eye in the target and
the small black circles are individual responses to a question
asked by a number of interviewers, and there's no apparent clustering.
The black circles seem to be distributed around the true value
in a kind of random way.
But on the right, where again the red bull's eye is the true value,
it's clear that there's clustering
of responses, indicated by the different colors of the small circles.
So, the color coding is intended to represent interviewers.
So you can see that different interviewers are eliciting different
ranges of answers from the respondents who they happened to interview.
And this is generally not something you would want to be happening.
The idea is that it shouldn't matter who asks the question, so
to the extent that different interviewers are eliciting different ranges or
different distributions of answers,
that's generally not a good indication of data quality.
3:14
So how would you quantify this kind of clustering?
Leslie Kish, many years ago, introduced the measure rho int
to capture what is essentially a correlation between interviewers and
the answers that they elicit.
This is sometimes referred to as the intraclass correlation.
And it's conceptually kind of straightforward.
There's a distinction between between-interviewer variance, which is what you
would hope you don't have very much of, because you would ideally want very
similar distributions of answers to a question elicited by all interviewers,
and within-interviewer variance, which is generally fine; you would hope
that a question is well enough designed that it can elicit a range of answers.
Because if a question is eliciting the same answer from all respondents,
that's generally not considered very discriminating.
So rho int really looks at the between-interviewer variance, kind of
the bad stuff, over the total variance, which is the sum of between- and
within-interviewer variance.
You would like the between-interviewer variance to be zero, but it never is.
Interviewers, no matter what their training,
tend to contribute some variance to the total variance.
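To make the between/within decomposition concrete, here is a minimal sketch in Python, assuming answers are already grouped by interviewer. The function name and toy data are purely illustrative; a real analysis would estimate rho int from ANOVA mean squares or a multilevel model with the proper degrees-of-freedom corrections.

```python
import numpy as np

def rho_int(answers_by_interviewer):
    """Rough estimate of the intraclass correlation rho int:
    between-interviewer variance / (between + within variance)."""
    groups = [np.asarray(g, dtype=float) for g in answers_by_interviewer]
    grand_mean = np.concatenate(groups).mean()
    # Between-interviewer variance: how far each interviewer's mean answer
    # sits from the overall mean answer.
    between = np.mean([(g.mean() - grand_mean) ** 2 for g in groups])
    # Within-interviewer variance: spread of answers around each
    # interviewer's own mean, averaged over interviewers.
    within = np.mean([((g - g.mean()) ** 2).mean() for g in groups])
    return between / (between + within)

# Toy data: three interviewers, five respondents each (hypothetical values).
answers = [[1, 2, 3, 2, 2], [4, 5, 4, 5, 4], [3, 3, 2, 3, 4]]
print(round(rho_int(answers), 3))  # clustering by interviewer -> rho well above zero
```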
And this type of analysis, calculating rho int, really
assumes that respondents have been randomly assigned to
interviewers; this is called interpenetration.
If this isn't the case, the differences between interviewers
in the variance of the responses they elicit could in fact be due to
real differences between the respondents that they interview.
But if there's random assignment, then we assume that the true values for
the respondents interviewed by all interviewers
are pretty much the same; that is, the distribution of true values
over the cases assigned to each interviewer is assumed to be the same.
One reason to measure rho int, in addition to quantifying the clustering,
is that it allows us to compute what are called design effects due to interviewers,
which really is a way to get a handle on
the extent to which interviewers are inflating overall variance, and
essentially reducing the confidence that we have in the estimates.
So, rho int has been used and studied by a number of researchers.
Because interpenetration is required, rho int really requires
a special design where respondents are randomly assigned to interviewers.
So you can't compute rho int for every study that's conducted,
every survey that's conducted; it's really only relevant or
applicable if the survey uses an interpenetrated design.
But one study that calculated rho int for a number of different types of questions
was done by O'Muircheartaigh and Campanelli, and they found, for example,
that 26% of the attitude questions asked had a significant rho int value, and for
attitude questions with Likert scales, 33% of them had a significant rho int value.
26% of the factual questions that were asked in the study
also produced a significant rho int.
And they actually were able to show that the clustering, the inflation of
variance due to interviewers, was about the same as geographical clustering.
Geographical clustering occurs in face-to-face interviews
when interviewers are assigned respondents, or
obtain interviews from respondents, in a particular small geographical region.
And so it's not surprising that the answers one interviewer elicits
may differ from those another interviewer elicits.
But that effect, that kind of clustering, is roughly the same as what
O'Muircheartaigh & Campanelli found interviewers were contributing,
even within an interpenetrated design.
Schnell & Kreuter did a study that focused just on questions about crime,
and they looked at the interviewer effects for different types of questions as well.
They found that the interviewer effects were greater for
sensitive than non-sensitive questions; for non-factual, that is,
attitudinal or opinion questions, versus factual or behavioral questions;
and for open questions rather than closed-form questions,
that is, questions that respondents are asked to answer in their own words rather than
by picking a response option.
7:36
And interviewer effects were greater for questions that they judged to be difficult rather than easy.
They actually calculated an index of what they called harmful question properties,
properties that make questions difficult or
in some other way problematic, and they found that the effects of interviewers
increased as this harmfulness index increased as well.
To me this makes sense; there's something about sensitive questions
that might make them more affected by interviewer behavior.
And we've talked about the benefit of self administration when
collecting sensitive information.
But if an interviewer's present,
you can imagine that there's something about the way some interviewers ask those
questions that's different from the way others ask those questions, and that would lead
to a higher impact of interviewers on variance.
Or attitudinal questions may be more sensitive to interviewer behavior,
particularly if respondents haven't really thought a lot about a question and,
as the phrase goes, don't have a crystallized opinion.
9:01
Open versus closed questions.
Again, you can imagine that the way the interviewer perhaps prompts for additional
information in open questions could differ from one interviewer to another.
And questions that are difficult versus questions that are easy:
again, these would be sensitive to interviewer involvement if some
interviewers try to help respondents with a difficult task and
some don't, or they do this in different ways.
So just on intuitive grounds, it seems that there is something about
interviewer behavior that could be related to, or responsible for,
the size of rho, the size of the impact interviewers have on variance.
So a study by Mangione, Fowler and Louis found evidence for
a particular interviewer behavior that is closely related to rho, and
the particular interviewer behavior was the way interviewers probed.
As you'll recall from our discussion of standardized interviewing,
non-directive or neutral probes are really the one
part of that technique that allows interviewers to use their discretion,
to decide whether to probe and
which of the available neutral probes they should administer.
And it could well be that it's that amount of discretion that is
actually responsible for interviewer variance.
So what they found was this: they computed a correlation between rho values and
particular behaviors.
They had behavior-coded audio recordings of these interviewers.
And here are the four behaviors that were significantly correlated with rho:
questions where interviewers correctly probed; questions where they incorrectly
probed, that is, they administered a direct probe or a leading probe;
questions where they failed to probe; and then the fourth category,
which is not really related to probes, is recording errors,
that is, questions in which interviewers incorrectly entered the response.
But the idea is that questions that require more probing require more
discretion by the interviewer and so
different interviewers will probe differently and
this will inflate their impact relative to questions where there's less probing.
So this is consistent with the idea that standardized interviewing, by really
narrowing the set of possible interviewer behaviors, can reduce interviewers'
impact on answers, or at least as measured by interviewer variance.
But that doesn't necessarily mean that alternative techniques,
which may give interviewers other kinds of discretion, will increase
their impact on answers or on interviewer variance.
As we discussed earlier, proponents of standardized interviewing have suggested
that departure from the script increases interviewer variance.
In a recent study by West and his colleagues, they looked at interviewer
variance in standardized interviewing and conversational interviewing
in face-to-face interviews in a national sample study done in Germany.
And they found that conversational interviewing improved data quality
in 25% of the questions.
Very similar to the findings in the studies we discussed earlier about
the benefits of conversational interviewing for response accuracy and
data quality.
But the proponents of standardization would argue, well that
increase in quality will come at the cost of increased interviewer variance.
But, West and his colleagues found that in only 10% of questions
did conversational interviewing increase interviewer variance.
So conversational interviewing improved data quality in
substantially more questions than it inflated interviewer variance in.
And even when interviewer variance increased in conversational interviewing,
it didn't offset the improved quality that came from
the interviewing technique, that is, from explaining the meanings of the questions,
the intentions behind the questions.
Moreover they found that interviewer variance was similar
between the interviewing techniques.
And finally they found that conversational interviewing led to more variance in
duration.
Remember, in the studies we talked about, it almost always led to longer interviews.
But actually in their study, it didn't lead to longer interviews overall,
just more variation in the duration.
And so what this suggests is that conversational interviewing
might be implemented more variably between interviewers
perhaps because more discretion is involved.
But that this is not increasing interviewer variance,
as the proponents of standardization were concerned would be the case.
14:03
Another source of interviewer effects is interviewers' ability to recruit participants to take part in interviews.
So this would really be a kind of non-response effect of interviewers, or
a non-response origin of interviewer effects.
So what appears to be an effect due to measurement may actually
reflect differences in recruiting by different interviewers.
That is, different interviewers might recruit different types of respondents,
whose true values differ from those recruited by other interviewers.
And this can happen despite an interpenetrated design
where cases are randomly assigned to interviewers.
If different interviewers are recruiting different respondents
from those that have been assigned to them,
we can see something that's ordinarily indistinguishable from
interviewer effects due to something about the way the questions are asked.
So West and Olson demonstrated this phenomenon, or confirmed that it is a real
possibility, using administrative records, which served as their true values, so
they could see where interviewer behavior was leading to departures from the true value,
and whether there were also interviewer effects that were due to non-response, or
differences in the true values of the respondents who were recruited.
And what they found was that for
two questions, rho int was reliably different from zero.
So they focused on those two to see what the records
could tell them about the origins of those effects.
The questions were age when the respondent was married, and
age when the respondent was divorced.
Based on the records, what they found was that the effect for
the question regarding age at marriage was due to measurement error,
kind of the classic interviewer effect.
There were errors in the answers, which were being elicited differently by
some interviewers than by other interviewers.
But for the other question,
age at divorce, the effect was due to a significant non-response error.
So they were able to confirm that the true values of those who were recruited
by different interviewers actually differed.
So in a way, it kind of undermines the interpenetration,
the random assignment of cases to interviewers.
So essentially, the reason there's an interviewer effect for
the question about age at divorce is because some interviewers were recruiting
younger respondents than other interviewers.
This, in effect, undermines the interpenetration.
Even though cases were randomly assigned to interviewers, differential
recruiting led to a different mix of true values for different interviewers.
So as I said earlier, one reason to calculate rho int,
in addition to just quantifying clustering, is to quantify
the impact of interviewers on overall variance in the study.
So there's a measure that does this.
It's called the design effect due to interviewers.
So the design effect due to interviewers is a measure of the extent to which
interviewers increase the total variance in the survey.
So you can see that there really are two parts to the equation,
deff_int = 1 + rho_int (m - 1).
The first is the number one, which just represents the variance in
a design where there are no interviewers, or no clustering due to interviewers.
And we're just going to call that one.
And we want to know what additional variance can be attributed
to clustering due to interviewers, which has two parts to it: rho int,
which we've been discussing, and m - 1, which we can just think of as m,
where m is the average interviewer workload, or the number of interviews per interviewer.
And the idea is that the more interviews any one interviewer conducts, the greater
the impact of their idiosyncrasies or the particular way they administer
the questionnaire might be compared to one of their interviewer colleagues.
So the bigger the m, the greater the impact of any one interviewer. Another way
of thinking about this is that by reducing the workload,
the number of interviews that any one interviewer conducts,
we lower the impact of any one interviewer.
So for a fixed budget, it would be better from this perspective, to
have a larger number of interviewers each conduct a smaller number of interviews.
There's a variant of the design effect which is called the design factor,
often abbreviated deft, and
it's really just the square root of the design effect, so
instead of talking about variances, we're talking about standard errors.
As I understand it, they're really used interchangeably.
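As a small illustration of how rho int and the workload combine, here is a sketch, assuming a rho int estimate and an average workload m are already in hand; the numeric values are made up purely for illustration.

```python
import math

def deff_int(rho_int, m):
    """Design effect due to interviewers: 1 + rho_int * (m - 1),
    where m is the average number of interviews per interviewer."""
    return 1 + rho_int * (m - 1)

def deft_int(rho_int, m):
    """Design factor: the square root of the design effect, i.e., the
    inflation of standard errors rather than variances."""
    return math.sqrt(deff_int(rho_int, m))

# Illustrative values: the same rho int inflates variance more with a larger workload.
print(deff_int(0.01, 11))            # 1.1
print(deff_int(0.01, 51))            # 1.5
print(round(deft_int(0.01, 51), 3))  # about 1.225
```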
18:26
Having done this, having computed the design effect due to interviewers,
we can calculate the effective sample size,
which is essentially the number of interviews that are conducted,
the number of respondents, divided by the design effect.
And what this is telling us is that we might have, for
example, conducted 1,000 interviews.
If the design effect is, for example, 1.26,
you can tell by looking at the equation that the effective
sample size is going to be smaller than the actual sample size of 1,000.
It turns out to be about 793.
And this is a way of saying that our confidence in the results is as if we had
conducted 793 interviews instead of 1,000 interviews.
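Continuing that example, here is a quick sketch of the calculation; the 1,000 interviews and the design effect of 1.26 are the numbers used above.

```python
def effective_sample_size(n, deff):
    """Effective sample size: the nominal number of interviews deflated
    by the design effect due to interviewers."""
    return n / deff

# The example above: 1,000 interviews with a design effect due to interviewers of 1.26.
print(round(effective_sample_size(1000, 1.26)))  # about 794, i.e., roughly the 793 quoted above
```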
So this is a useful set of statistics that one can calculate.
Again, you can't calculate rho int, and therefore the design effect
due to interviewers and the effective sample size, in this way in most surveys,
because they don't involve this random assignment of respondents to interviewers.
But where you can, this really I think makes tangible the impact of
interviewers on overall precision of the survey estimates.
And I think you can see that here in this table which comes from
a paper by Groves and Magilavy.
So you can see that there is a row for average workload, a row in the table that
is, a row for rho, and a row for the design effect due to interviewers.
And what you can see is that the rho value for the survey on the right
is actually more than three times as big as the rho value for
the survey represented by the middle column, 0.0067 versus 0.0018.
But if you look at the bottom row, the design effect
due to interviewers is about the same, 1.09 versus 1.10.
And this is because the average workload is different in the two cases.
It's much smaller in the survey on the right.
So while the rho value,
the clustering due to interviewers, is much greater than in the study in the middle
column, Health in America, that larger rho value, the impact of clustering, is
diluted because the interviewers are conducting fewer interviews each.
And so the result is really about the same
impact on overall variance due to the interviewers.
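To see how a rho value more than three times as large can still yield about the same design effect, here is a back-of-the-envelope check using the rho and design-effect values just quoted; the workloads are implied by the formula, not taken directly from the Groves and Magilavy table.

```python
def deff_int(rho_int, m):
    """Design effect due to interviewers: 1 + rho_int * (m - 1)."""
    return 1 + rho_int * (m - 1)

# Back-solving m - 1 = (deff - 1) / rho from the reported values:
#   middle column: rho = 0.0018, deff about 1.09  ->  m - 1 about 0.09 / 0.0018 = 50
#   right column:  rho = 0.0067, deff about 1.10  ->  m - 1 about 0.10 / 0.0067 ~ 15
print(round(deff_int(0.0018, 51), 2))  # about 1.09
print(round(deff_int(0.0067, 16), 2))  # about 1.10
```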
20:49
Interviewer-related error is essentially a correlation between interviewers and
the responses they elicit.
Again, it shouldn't matter who asks the question.
So any value greater than zero is undesirable,
it suggests that there's variation in the quality of answers.
And rho is a very useful statistic for these reasons.
But rho is an indirect measure of data quality.
It's possible for rho int to be low, which would be good; it would mean that,
with respect to variance, there's relatively little impact of interviewers.
There's relatively little clustering.
But at the same time response validity could be low, so
a low rho doesn't necessarily mean high quality data.
If for example,
all the interviewers consistently collect the wrong answer to a question.
So you can imagine that interviewers are asking respondents if they have recently
visited a doctor.
And the survey is intended, this is hypothetical of course,
to measure visits to MDs, medical doctors.
But the interviewers either are not instructed to, or
maybe because of the constraints of the interviewing technique are not
permitted to, define what the survey means by doctors.
So you can imagine many respondents might adopt a more inclusive definition and
consider the question to also be about non-MDs.
So like physician's assistants or podiatrists.
So what we would have with this example is an occasion in which rho is low,
because the interviewers are essentially administering this question the same way,
if none of them is defining doctor.
22:25
But data quality would be low because some percentage of the responses would have
included doctors that shouldn't have been included, leading to answers that
are greater than the true value.
Just to wrap up our discussion of interviewer effects that are due
to interviewer behavior and reflected by increased variance:
the assumption, anyway, is
that interviewer variance is related to interviewer behavior.
And the evidence from the study by Mangione et al. is that interviewer
behaviors like probing, which are less scripted than other behaviors,
may be related to the impact of interviewers on variance.
That is, questions with more probing lead to a greater
impact of interviewers, greater rho values.
But this doesn't mean that conversational interviewing,
where interviewers have a certain amount of discretion in what words to use when
following up on questions, produces
higher interviewer variance than standardized interviewing.
In fact, conversational interviewing produced no greater interviewer
variance than standardized interviewing in the study by West and his colleagues, and
it did improve response accuracy and quality.
Rho int can be inflated by non-response errors, as the study by West and Olson showed,
if interviewers recruit respondents whose true values
differ from those recruited by other interviewers.
This can produce significant rho values because the answers collected by
one interviewer will differ from those collected by another,
not because of their behavior but
because of the true values of the respondents they're actually interviewing.
The impact of rho can be diluted by keeping the workload small,
that is, fewer interviews per interviewer.
So as I said before,
if you have a fixed budget, then having more interviewers conduct fewer interviews
will reduce the impact of rho on the design effect due to interviewers.
And then, finally, rho int is generally not a measure of response accuracy,
it's a measure of variance.
And so it's an indirect measure of data quality. To the extent that one can
look at response accuracy, which is often difficult to do and
requires records or some other external measure of the truth,
one gets a more direct sense of how
interviewers are affecting the quality of data.