Welcome back. In this video, we're bringing you a guest interview with Bart Knijnenburg, a PhD candidate at the University of California, Irvine, who's done extensive work on user-based evaluation of recommender systems, particularly in collaboration with Martijn Willemsen at Eindhoven University of Technology. Bart, welcome to the class. >> Thank you. Thank you for having me. It's a pleasure. >> So we've already talked a fair amount in this class about different ways of evaluating recommender systems. We've talked about how to use public data sets in cross-validation, we've done some talking about field trials, and we've also started talking a little bit about involving the users more in the evaluation process. But could you tell us why we want to involve the users? What can we learn from them that we can't learn from some of these other evaluation methodologies? >> All right, well, let me give you a bit of an overview of that, starting with a little background. As you students know by now, the field of recommender systems has traditionally evaluated the fruits of its labor using metrics of algorithmic accuracy and precision. More recently, researchers have come to realize that the goal of a recommender system actually extends beyond accurate predictions, because the primary goal, the real-world purpose of a recommender system, is to provide personalized help in discovering relevant content or items. I think this has caused two important changes in the field. The first change, instigated by Sean McNee and his co-authors, was the argument that being accurate by itself is not enough, and that we should instead study recommenders from a user-centric perspective, to make them not only more accurate, but also more helpful and a pleasure to use. The second change is a broadening of the scope of research beyond just the algorithm. In this class you've primarily talked about algorithms, but aside from the algorithm, there are actually two important interactive components to any recommender. On the one hand you have the mechanism that allows users to express their preferences; somehow you need to get the user's preferences, and that's one interactive component. On the other hand, once you've calculated good recommendations, how do you present them to the user? So there's an input and an output, and the interface for doing that input and output is very important. In fact, one important industry executive, when he gave a keynote speech at the Recommender Systems conference in 2009, said that the interactive components account for about 50% of the commercial success of a recommender, compared to what he estimated was only 5% for the algorithm. Now, in my own research, and also in the work of Pearl Pu, who does very interesting research on this topic, we have independently found that, indeed, the preference elicitation method and the way we present the recommendations to the user really do have a substantial impact on users' acceptance and evaluation of the recommender system, as well as on their usage behavior. Of course, I'm not saying that accurate recommendations are not important. It's very important to have good recommendations, because without good recommendations you never get a good user experience. And a lot of studies, including your own [LAUGH] Mike,
have found that there's a good correlation between accuracy and retention and user satisfaction. But there are actually a couple of important exceptions where the more accurate recommender turned out to perform worse in terms of retention and satisfaction. So without actually measuring retention and satisfaction, it's very difficult to extend your claims beyond accuracy: you can't really say that your system will give users a pleasant and useful experience if you don't actually measure these user-centric aspects. And again, a large part of what determines the success of a recommender actually lies beyond the algorithm, so if you only cross-validate the algorithm, you get a very narrow view into the potential success of the system. So there are certain ways we have to upgrade our evaluation methods in order to evaluate these other aspects and to measure these other outcomes. Now, the first step, as you already said, is to go live with the recommender, to actually run an online evaluation like an A/B test between different algorithms. And the nice thing is, you can do other things than just change the algorithm in these A/B tests: you can vary other aspects of the system, and you can also measure other outcomes. You don't just have to measure the accuracy of the recommendations; you can also measure purchase volume, or retention, or something like that. But even that, even going online with your evaluation, doesn't allow you to measure the user experience of the system in a direct way. And I think that's actually very important, for several reasons. The first reason is that behavioral metrics like retention and purchase behavior are very noisy. Users buy stuff for all kinds of reasons, and there are all kinds of ways they interact with the system that are not caused by the things you're evaluating. This means that you often can't find the subtle differences you're looking for just by looking at behavior. Another problem is that a lot of these behavioral measures are slow. If you want to robustly measure retention, well, retention is something that happens over months, so you have to test for months. The nice thing is that user experience metrics are more instantaneous, and they actually provide a good predictor of what retention would be. And I think the most important drawback of behavioral measures is that they're often very hard to interpret, and they often have a very counter-intuitive interpretation. To give you an example of that, I was involved in a project with an industrial partner who was testing a state-of-the-art video clip recommender. They were testing this state-of-the-art system against a system that gave completely random recommendations, so obviously the state-of-the-art system should be better. But to their surprise, the total viewing time of the users was actually higher for the random system than for the system that gave state-of-the-art recommendations. That shouldn't happen. How is that possible? So I got involved, and I conducted a user experiment in which we not only measured the total viewing time but also users' choice satisfaction and their perceived system effectiveness, using subjective questionnaires.
Now when we analyzed these metrics, we found that they were actually higher for the state-of-the-art recommender, so users perceived it to be better. But perceived system effectiveness was also negatively correlated with the total viewing time: people who watched less in total actually found the system more effective. So then [LAUGH] it dawned on us that people with a longer viewing time possibly just spent more time sitting through bad video clips, trying to find something worthwhile to watch. So we ended up swapping out the total viewing time metric for the number of clips that people watched completely, from beginning to end. And indeed, this metric was higher for users of the state-of-the-art recommender. So the lesson we learned there is that including subjective measures of the user's experience, like choice satisfaction and perceived system effectiveness, made us realize that total viewing time was actually a measure of bad rather than good user experience, and that we had to use the number of clips watched completely as an alternative. So, to conclude: if you want to test all aspects of the recommender, not just the algorithm, and if you want to test the impact not just on the quality of the recommendations but also on users' retention and purchase behavior, and even on their direct user experience, then we have to move beyond cross-validation studies to A/B tests, or preferably to these online user experiments with user-centric evaluation metrics that span behavior as well as attitudes like usability, choice satisfaction, and perceived usefulness. >> Okay, so if I want to run one of these studies, how do I go about designing it and executing it in an actual system? >> It's important to realize that there are actually two types of user studies: formative studies and summative studies. Formative studies are non-scientific user studies, and you use them for the purpose of improving the usability and user experience of a recommender system that's still under development. Formative studies are a crucial aspect of the development of any kind of system; for any computer system you're building, you have to do these tests while you're developing it. And there's a wide range of formative user study methods that you can use to do this. I could teach an entire MOOC on this topic. [LAUGH] In fact, there are actually a couple of MOOCs on these topics, so in the links that we provide for this lecture you can find some of them and look at them if you want. It's a very interesting topic, and it's very important for any kind of system developer. Now, moving on to summative user studies: these are scientific experiments that compare different versions of the system against each other in an effort to find out which version is statistically significantly superior. These studies are very closely related to A/B tests, the online tests that you've already talked about a little bit, with the addition of subjective measurements using questionnaires. So if you're interested in the details of running these kinds of studies, I highly recommend reading my upcoming book chapter on this topic, and we'll provide that in a link as well. It's on running user experiments for recommender systems specifically, but I'll give you an overview now, all right?
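Before that overview, a brief aside on the metric swap from the video-clip anecdote: the sketch below shows, in Python, how the two behavioral metrics might be computed from a viewing log and checked against a questionnaire scale. The log schema, the toy numbers (contrived to reproduce the pattern in the anecdote), and the perceived-effectiveness scores are all hypothetical, not data from the actual study.

```python
# Hypothetical viewing log: one row per viewing event.
import pandas as pd
from scipy.stats import spearmanr

log = pd.DataFrame({
    "user_id":     [1,   1,  1,   1,   2,   2,  3,  3],
    "clip_length": [120, 90, 200, 150, 100, 60, 80, 150],  # seconds
    "watch_time":  [30,  40, 60,  50,  100, 60, 80, 90],   # seconds actually watched
})

# A clip counts as "watched completely" if the user watched it to the end.
log["completed"] = log["watch_time"] >= log["clip_length"]

per_user = log.groupby("user_id").agg(
    total_viewing_time=("watch_time", "sum"),  # the misleading metric
    clips_completed=("completed", "sum"),      # the replacement metric
)

# Hypothetical perceived-effectiveness scores from a multi-item questionnaire.
per_user["perceived_effectiveness"] = [3.0, 6.5, 5.0]

# The warning sign from the anecdote: total viewing time correlates negatively
# with perceived effectiveness, while completed clips correlate positively.
for metric in ("total_viewing_time", "clips_completed"):
    rho, _ = spearmanr(per_user[metric], per_user["perceived_effectiveness"])
    print(f"{metric}: Spearman rho = {rho:+.2f}")
```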
So the first step in conducting any good experiment is to have a good research model, a descriptive theory of the system aspects that you're trying to evaluate and of how these system aspects influence the user's experience and their behavior. That's a tricky process, because you have to figure out how the thing you're trying to evaluate will influence the user. Now, to help you with that, I suggest you consult another paper of ours, a paper on the user-centric evaluation of recommender systems. That paper has a framework that explains how the user experience comes about, so you can really use it as a framework for your own studies. Specifically, the framework argues that there are objective system aspects, the things that you're trying to evaluate, like the algorithm or the preference elicitation method. These influence users' subjective evaluations of those aspects, which I call in the framework the subjective system aspects: things like the perceived recommendation quality or the usability of the preference elicitation method. These subjective interpretations in turn influence the user's experience, such as their satisfaction and, as I've said, perceived system effectiveness. And they also influence users' interactions, so their choice behavior and their retention. That's the standard chain of effects, and this chain is also influenced by situational characteristics, like the user's choice goal in this particular situation, and by personal characteristics, for instance the user's domain knowledge in the domain where you recommend items. So if you use this general framework, you can set up your own research model explaining how the things you're trying to test actually influence the user's experience and interaction. And if you read the paper, you'll see a couple of examples that can walk you through that a little bit. So that's the first step: get a theory and figure out what should be going on. The next step is to operationalize that research model in your own experiment, and that actually involves a couple of steps. The first is that you take the objective system aspect you're trying to evaluate and turn it into an experimental manipulation. What you do is create different versions of the system that differ only in that system aspect. So you create two versions that differ only in the algorithm, or two versions that differ only in the preference elicitation method; those are the versions that you're going to test. The next step is to recruit a sample of test users that is representative of the real users of the system. I've found that it's actually better to use crowdsourcing platforms like Amazon Mechanical Turk than to use your friends and colleagues, because you don't want to influence the test users; it's better to run it online with people you don't know, so it's really hard for you to influence the test. Now, in terms of the number of test users, that's a little tricky. There are some calculations that you can perform to figure out how many you should have, and you'll find some pointers to that in the handbook chapter that I referenced. Okay, so now you have your experimental manipulations, these different versions, and you have a set of test users.
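As a rough illustration of the kind of sample-size calculation just mentioned (the referenced chapter is the real guide), here is a minimal power-analysis sketch using the statsmodels package for a simple two-condition, between-subjects comparison. The effect size is an assumed placeholder that you would normally base on pilot data or prior studies.

```python
# Minimal sketch: estimating participants per condition for a two-condition
# between-subjects comparison. The assumed effect size is a placeholder.
from statsmodels.stats.power import TTestIndPower

n_per_condition = TTestIndPower().solve_power(
    effect_size=0.4,  # assumed standardized mean difference (Cohen's d)
    alpha=0.05,       # significance level
    power=0.8,        # desired probability of detecting the effect if it exists
)
print(f"About {n_per_condition:.0f} participants per condition")  # roughly 100 here
```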
Now what you're going to do is randomly assign each test user to one of these different versions of the system. Or, if you want to test all the versions with each user, which is called a within-subjects test, you let each user use each version, but you randomize the order in which they use them, so that you cancel out order effects. Again, the book chapter that I referenced has a good discussion of the pros and cons of testing either between subjects or within subjects, so you can read that if you want to learn more. But generally speaking, I'm a bigger fan of between-subjects experiments because they're a little more ecologically valid, although they do require more test subjects. All right, so now each user is assigned to one of the systems, and you let them interact with the system. And of course, if your research model requires some behavioral variables, you have to measure those. So if the user clicks on something that indicates interest, you have to somehow log those clicks, all right? Then, after the users are done using the system, and this can be either a single session or you can require them to use the system for several sessions, but once they're all done, that's when you ask them to fill out the questionnaires I was talking about, the ones that measure the subjective system aspects and the experiences that you specified in the research model. Now, measuring these subjective system aspects and experiences is not really an exact science; it's not like measuring someone's height or someone's weight, it's a little more difficult than that. So typically it's better to ask multiple questions per thing that you're trying to measure, and there are statistical methods that you can use to then combine, for instance, seven questions on satisfaction into one single robust measurement scale. If you want to know how to create those questionnaires, the appendix of the paper with the user-centric evaluation framework has a couple of really good scales that you can use. But if you measure something that I don't define in that paper, I encourage you to develop your own scales, based on those scales. All right, so now you've measured these things, the users are done with the experiment, and you have measured everything on every user. Now it's time to analyze the data you've collected, to see whether the research model that you developed at the very beginning was correct or not, and that's kind of the final step. >> So how do we go about doing that? In particular, you've talked about having the multiple questions; that's going to factor in there somewhere. What's the paradigm and the set of tools that you need in order to do a robust and defensible analysis of an experiment like this? >> So in essence, what you're trying to analyze is whether there are statistically significant differences between the different versions of the system that you tested, in terms of users' subjective evaluations, experiences, and behaviors. So if you've ever taken an applied statistics or research methods class, this should remind you of something like a t-test or an ANOVA, where you compare different conditions with each other in terms of some outcome.
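In its simplest form, the item-combining step mentioned above can look like the minimal sketch below: made-up 7-point responses to five satisfaction items, a reliability check with Cronbach's alpha, and an item average as the scale score. A factor-analytic measurement model, which comes up next, is the more rigorous route.

```python
# Minimal sketch: combining multiple questionnaire items into one scale score.
# The responses are made-up 7-point Likert answers to five satisfaction items.
import pandas as pd

items = pd.DataFrame({   # rows = participants, columns = questionnaire items
    "sat1": [6, 5, 2, 7, 4, 3],
    "sat2": [7, 5, 3, 6, 4, 2],
    "sat3": [6, 4, 2, 7, 5, 3],
    "sat4": [5, 5, 1, 6, 4, 2],
    "sat5": [6, 6, 2, 7, 3, 3],
})

def cronbach_alpha(df: pd.DataFrame) -> float:
    """Classic internal-consistency estimate for a multi-item scale."""
    k = df.shape[1]
    item_variances = df.var(axis=0, ddof=1).sum()
    total_variance = df.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")  # rule of thumb: above ~0.7

# If reliability is adequate, average the items into one satisfaction score per
# participant (a factor score from a measurement model is the stricter option).
satisfaction = items.mean(axis=1)
print(satisfaction)
```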
Now, as you already mentioned, it's a little more complex than that. First of all, you have subjective constructs that are measured with multi-item scales, so you need to somehow combine those items into a robust measurement. Also, the behavioral data is usually not nicely normally distributed, and that invalidates a lot of these traditional tests, like the t-test and ANOVA, because they have normally distributed data as a precondition. And finally, your research model is likely to have some kind of causal pathway. Typically you have something like: this new algorithm is supposed to increase the perceived novelty of the recommendations, and that in turn should increase user satisfaction and their exploration behavior. So it's a pathway: A causes B causes C. Now, these kinds of paths are really hard to test with traditional statistics, because those typically test just a single path from A to B; there's no way to test the full chain. So to do all this, my preferred method of analysis is structural equation modeling, because it really allows you to make all these interesting additions to a traditional test. A structural equation model consists of two parts. The first thing you do is establish a measurement model, which turns those multi-item scales into robust subjective constructs. The second step is, based on your research model, to test all the causal paths: the paths between the manipulation, the subjective constructs, and the behavioral outcomes. Again, if you want to learn more about how to do this, the book chapter on user experiments for recommender systems contains a lot of detail, and also some nice worked-out examples of how to do it in two popular statistical packages. That will give you a bit of an idea of how to run these tests. >> So you talked about the causal pathway. One of the core lessons that's drilled into statisticians in training, in particular, is that correlation does not equal causation, and causation is extremely difficult to prove. Where does the directionality of this causation come from? Is that crucial? Are there more conservative ways to interpret the model for those who might not be comfortable with the causal interpretations? >> So yeah, I mean, it's really hard to prove whether A causes B or B causes A. But there's one point where you can definitely prove the causal direction, which is the manipulation. The objective system aspect that you manipulate cannot be influenced by anything, because you randomize your participants; there's no way anything can influence which version they got. So the objective system aspect is always, so to say, on the left side of the model, on the side where it can only cause things; it cannot be caused by anything. Now, in terms of how the different subjective variables influence each other, and how they influence behavior or behavior influences them, there are always two possible causal directions, from A to B or from B to A. And the best way to resolve this problem is to refer to theory. Typically, we believe that intentions influence actual behavior, and that attitudes influence actual behavior. So in most cases, unless you're dealing with an exception, your attitude influences your behavior, and your perception of something influences your attitudes.
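As an aside, here is a minimal sketch of what such an analysis can look like in code, using the third-party semopy package, which accepts lavaan-style model syntax: a measurement model for two latent constructs plus the structural chain from the manipulation through perceptions to behavior. The constructs, item names, paths, and the synthetic data are illustrative assumptions, not the worked examples from the book chapter.

```python
# Minimal SEM sketch with semopy: measurement model for two latent constructs,
# plus the chain condition -> perceived novelty -> satisfaction -> exploration.
import numpy as np
import pandas as pd
import semopy

# Synthetic data standing in for a real experiment (one row per participant).
rng = np.random.default_rng(0)
n = 300
condition = rng.integers(0, 2, n)                  # 0 = baseline, 1 = new algorithm
novelty = 0.8 * condition + rng.normal(size=n)     # latent perceived novelty
satisfaction = 0.6 * novelty + rng.normal(size=n)  # latent satisfaction
df = pd.DataFrame({
    "condition": condition,
    **{f"nov{i}": novelty + rng.normal(scale=0.6, size=n) for i in (1, 2, 3)},
    **{f"sat{i}": satisfaction + rng.normal(scale=0.6, size=n) for i in (1, 2, 3)},
    "num_explored": 1.5 * satisfaction + rng.normal(size=n),  # behavioral outcome
})

# Measurement model (=~) and structural paths (~), lavaan-style.
model_desc = """
perceived_novelty =~ nov1 + nov2 + nov3
satisfaction =~ sat1 + sat2 + sat3
perceived_novelty ~ condition
satisfaction ~ perceived_novelty
num_explored ~ satisfaction
"""

model = semopy.Model(model_desc)
model.fit(df)
print(model.inspect())  # path estimates, standard errors, p-values
```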
So in that sense, the framework really incorporates this kind of psychological theory where some objective perception influences my subjective perception, influences my attitudes, influences my behavior. And so typically, that's the causal pathway that is kind of the most conservative of the two to test, and you should always in your papers make the note that there might be some feedback loops going on. But this is typically the most obvious pathways to test, and so you should test typically those pathways. If you really can't figure it out, you can specify a correlation rather than a causal path, but it's typically better to make sure that the model fits to actually have these causal paths. So you have to think about that a little bit and make sure that what you are testing kind of makes sense theoretically speaking. That's why you need to have this theoretical model before you start actually evaluating. >> And then, do you need to check, and if so, how do you test for relationships that maybe you didn't expect from the theory to see if they're showing up anyway and causing a problem for your results? >> Yeah, there are some things that you should do in that sense. So typically for instance, when you have a path from a causes b causes c, I typically add the path from a causes c. Because what you're saying is that, my algorithm, for instance, influences perceived novelty which then influences satisfaction. But there might be other reasons why this algorithm causes me to be more satisfied, or it might be that despite the fact that this algorithm increases perceived novelty which increases my satisfaction, some other thing about the algorithm causes me to perceive the lower satisfaction. So I always add this extra direct path into the model as well in order to kind of test whether there are some effects that I hadn't hypothesized. So typically, I try to what they call saturate the model. So basically line up the variables in the causal order that is theoretically expected and then basically try all the arrows going forward. And the book chapter actually talks about this a little bit and gives you an example of why and how you should do that. >> Well, thank you, and I hope that our students go out there and run some fascinating user studies and report on the results. >> Thanks for having me. I hope that I have inspired the students to indeed conduct their own experiments. And I also hope that I have given them enough advice to actually get started on this. It's a difficult undertaking, but it's definitely worth it. And if your students have any questions about an experiment that you're conducting, or about just some sequence analysis that you're trying to do, feel free to shoot me an email as well, and I can see if I can help you out. I'll do my best to try to help you. Thanks.
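As a closing note, saturating the structural part of the earlier SEM sketch just means adding the theoretically forward-pointing direct paths alongside the hypothesized chain. The illustrative, assumed model description below would be fitted exactly as before; it is not the chapter's worked example.

```python
# Saturated version of the earlier illustrative model: the extra forward paths
# let unhypothesized direct effects of the manipulation show up in the estimates.
saturated_desc = """
perceived_novelty =~ nov1 + nov2 + nov3
satisfaction =~ sat1 + sat2 + sat3
perceived_novelty ~ condition
satisfaction ~ perceived_novelty + condition
num_explored ~ satisfaction + perceived_novelty + condition
"""
```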