It's my pleasure today to introduce you to Warren Li, a PhD student working with me here at the University of Michigan School of Information. Warren is in the fifth year of his PhD, and his research focuses on learning and on how learners interact with data systems, artificial intelligence, and data collection systems. The work he's presenting today was done with Florian Schaub, a collaborator of mine, and Kaiwen Sun, another PhD student whom I advise. Now, my research focuses on learning analytics. I'm interested in using data science to understand learning as a phenomenon and to help learners, and often this involves building predictive models of learners. There's a huge interest in making sure that this data is meaningful. We've talked a little bit in this course about bias and population bias, and that's the subject of Warren's work here. Warren, thanks very much for joining me. I'm looking forward to hearing what you have to say.

Thanks for having me. I'm happy to be here, and thanks for that great introduction as well. As Chris mentioned, today we will be talking about disparities in students' propensity to consent to learning analytics. That's a bit of a mouthful. By propensity, we mean their likelihood to share data, and learning analytics here just refers to using data to better understand learners and help improve their learning. To give you an example, one area where we might use data in the classroom is predictive models. For instance, let's say that students are in a class, engaging with course material. We might have different companies creating software to help identify students who might be at risk. Here's an example: we have a student, and the system says, okay, they have a D in the course, and we're afraid they might not pass. The hope is to be able to give them an intervention, for instance, to provide them with some support so that we can turn things around.

In fact, in this course, the Coursera learning management system uses predictive models as well. Now, I don't do much with them myself, but students may have noticed throughout the course that they'll get little nudges, things like: if you complete this assignment, your chance of passing this course is a little greater. I'd love to hear from them what they think about those. I think it's really interesting that there's a lot of work being done by vendors to build these systems regardless of what institutions want, but there's also a lot of interest in institutions like the University of Michigan, where we build some of our own predictive analytics and our own dashboards.

Absolutely. We're actually going to talk about some of those concerns around sharing data, for instance, with other companies. This leads, of course, to some privacy and ethical issues. As you hinted, many of these systems lead to increased data collection. They're based on AI and machine learning algorithms, and to build these models we need lots and lots of data. That raises the question, of course: what is this data, and for whom? Unfortunately, a lot of the time it's sensitive data, perhaps your academic record, perhaps demographic information, something that people might not always be comfortable sharing. It's further complicated because there are often unclear sharing arrangements with third parties, external companies, and so forth.
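To make the idea of an at-risk predictive model concrete, here is a minimal sketch in Python. The features, data, and choice of logistic regression are illustrative assumptions only; this is not the model behind Coursera's nudges or Michigan's dashboards.

```python
# Hypothetical sketch of an "at-risk" predictive model; the features and data
# are made up for illustration and do not reflect any real system.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Illustrative engagement features: fraction of assignments submitted,
# average quiz score, and number of forum posts.
X = np.column_stack([
    rng.uniform(0, 1, n),        # assignments_submitted
    rng.uniform(0, 100, n),      # avg_quiz_score
    rng.poisson(3, n),           # forum_posts
])

# Synthetic label: 1 = "at risk" of not passing, loosely tied to low engagement.
risk_score = 2.0 - 2.5 * X[:, 0] - 0.02 * X[:, 1] - 0.1 * X[:, 2]
y = (risk_score + rng.normal(0, 0.5, n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The model outputs a probability that each student is at risk, which an
# instructor or a nudging system could use to decide who gets extra support.
print("Held-out accuracy:", model.score(X_test, y_test))
print("P(at risk) for a low-engagement student:",
      model.predict_proba([[0.2, 45.0, 0]])[0, 1])
```

Real systems differ in the features they log and the models they use, but the basic shape is the same: engagement data in, a risk probability out, and an intervention decision built on top of it.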
Well, even with researchers like ourselves on campus, we have to go through a number of different procedures in order to get access to student data as internal researchers. There's a lot tied up in here: there's legislation involved, there's local policy, and, of course, the student's best interests, which is what we all want here.

Exactly, and that's the perfect segue, because many researchers, as well as students themselves, have argued that we should be giving students agency, because this helps demonstrate respect for their decisions, and that we should view them as collaborators rather than data producers; students aren't just here generating data, they're part of the process together with us. But there can also be unintended consequences. For instance, we know that biased samples, models trained on data that isn't representative of the full population, can have disparate effects on different sub-populations. If we allow everyone to opt out and many choose to, this could degrade some of those models, and for particular groups especially. This is particularly concerning since we know there are disparities in educational achievement.

Yeah, absolutely. Often in the United States we think about these groups based on ethnicity or race, and we've seen that in this course when we looked at some of that data with the BRFSS and other behavioral surveillance systems. But this is actually a very global issue, and it's not just ethnicity and race. It breaks down along any line where you can demarcate a group: gender lines, but also discipline, educational preparedness, and first-in-family status, for instance first-generation college students, and so forth.

Absolutely. Just for some context, I'm glad you brought that up, because in this particular study we are looking at gender and ethnicity. But as you mentioned, there are lots of different groups, and across different cultures and countries, what those groups are might look a little different. To give a particular example of inequities from a predictive model: the comic here on the right-hand side notes that, over the years, there has been a lot more interest in fairness in machine learning, and I'm going to give one example right here. Ocumpaugh and others in 2014 trained an affect detector, which is just a fancy way of saying a model of students' emotions; for example, perhaps someone is frustrated in class, or maybe they're bored, and we're trying to detect those states. What they did was take a lot of data from urban, suburban, and rural students, and what they found is that the model performed pretty well for the urban students, pretty well for the suburban students, but not so well for those in rural areas. Part of the reason is that there are far fewer students from rural areas in the data. That's what we mean when we're talking about disparate impacts in models; for instance, you might need to separate the groups and train these models separately.

Yeah, this was wonderful research coming out of Penn, and I think it really speaks to the need to collect data from the populations that you're interested in supporting. We actually face a huge replication crisis in the field of education and the learning sciences. It's very expensive to run these surveys or to run full-on experiments, and as you well know, it takes a lot of time.
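One way to see the kind of disparity just described, a model that works well for larger groups but poorly for an under-represented one, is to evaluate it separately for each subgroup rather than reporting a single overall score. The sketch below is purely illustrative, with synthetic data and made-up group sizes; it is not the Ocumpaugh et al. analysis itself.

```python
# Illustrative sketch: overall accuracy can hide poor performance on a small,
# under-represented subgroup. Groups, sizes, and data are all synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_group(n, shift):
    """Synthetic students for one group; `shift` changes the feature-label
    relationship so a model fit mostly on other groups transfers poorly."""
    X = rng.normal(0, 1, (n, 3)) + shift
    y = ((X[:, 0] - shift[0]) + rng.normal(0, 0.5, n) > 0).astype(int)
    return X, y

# Deliberately unbalanced group sizes, e.g. many "urban"/"suburban" students
# and few "rural" students in the training data.
groups = {"urban":    make_group(400, np.array([0.0, 0.0, 0.0])),
          "suburban": make_group(400, np.array([0.2, 0.0, 0.0])),
          "rural":    make_group(40,  np.array([1.5, -1.0, 0.5]))}

X_all = np.vstack([X for X, _ in groups.values()])
y_all = np.concatenate([y for _, y in groups.values()])
model = LogisticRegression().fit(X_all, y_all)

print("Overall accuracy:", round(model.score(X_all, y_all), 3))
for name, (X, y) in groups.items():
    # Disaggregated evaluation: report accuracy separately for each group.
    print(f"{name:9s} accuracy:", round(model.score(X, y), 3))
```

Because the smallest group contributes so little to training, the pooled model fits the majority groups and the overall number looks fine, while the per-group report reveals that the small group is poorly served.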
Often, we only have one or two data points, and even if we have 100 or 1,000 students in there, if those students all come from the same group, what we can really generalize to is limited. I think this was excellent work that really brought that issue to the forefront in the field of educational data mining.

As you mentioned, this is extremely important. Let's also talk about some of the consequences of getting it wrong. For instance, if someone is labeled at risk, that carries a negative stereotype; we don't want to just say, "Well, let's label everyone at risk," because that comes with negative consequences. Similarly, if we misclassify students, we'll end up missing interventions where we could have, perhaps, supported students a little earlier on.

Yeah, I think one of the challenges I face in the research that we do is that there's a difference between prediction and intervention, but they really need to go hand in hand. We need to identify the students who need help, and then we need to be able to intervene to actually help those students. Now, some interventions we can use to help everybody, like sending an email the night before the assignment saying, "Hey, here's a hint," or sending it maybe a week before and saying, "Hey, remember, go do that." That's lightweight and easy. But it's hard for a big institution, even one like Michigan, to give individual tutoring advice or to pick up the phone and call someone, though we do that for the students we identify as most at risk. These two pieces, intervention and prediction, really need to go hand in hand.

Exactly, and it's a complicated puzzle, as you mentioned. There are simple interventions, such as sending an email to everyone, but some are more costly; for instance, if you recommend a whole different class to someone, that's going to make a huge difference.
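As a rough illustration of the prediction-intervention tradeoff discussed above, the sketch below assumes we already have predicted risk scores and must choose an alert threshold; all numbers are hypothetical. Lowering the threshold catches more struggling students (fewer false negatives, i.e. missed interventions) but also labels more students at risk who would have been fine (more false positives, with the stigma and cost that carries).

```python
# Hypothetical sketch of trading off missed interventions (false negatives)
# against unnecessary "at risk" labels (false positives) as the alert
# threshold changes. Scores and outcomes are synthetic.
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Synthetic ground truth (True = student actually struggles) and model scores
# that are informative but imperfect.
truly_struggling = rng.random(n) < 0.2
scores = np.clip(0.5 * truly_struggling + rng.normal(0.3, 0.2, n), 0, 1)

for threshold in (0.3, 0.5, 0.7):
    flagged = scores >= threshold
    false_negatives = np.sum(truly_struggling & ~flagged)   # missed interventions
    false_positives = np.sum(~truly_struggling & flagged)   # stigmatizing labels
    print(f"threshold={threshold:.1f}: flagged={flagged.sum():4d}  "
          f"missed={false_negatives:3d}  over-flagged={false_positives:3d}")
```

In practice the threshold choice also depends on capacity: a cheap, low-stakes nudge can go to a large flagged group, while an expensive intervention like individual tutoring forces a higher threshold and accepts more missed students.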