In this lecture, we're going to talk about trying out your interface with people, and doing so in a way that lets you improve your designs based on what you learn.

One of the most common things people ask when running studies is, "Do you like my interface?" It's a really natural thing to ask, because on some level it's what we all want to know. But it's problematic on a whole lot of levels. For one, it's not very specific. Sometimes people try to improve on it by asking something like, "How do you like my interface, on a one-to-five scale?" Or, "This is a useful interface: agree or disagree, on a one-to-five scale." That adds a patina of scientific-ness, but really it's the same thing: you're asking somebody, "Do you like my interface?" And people are nice. They're going to say, "Sure, I like your interface." This is the please-the-experimenter bias, and it can be especially strong when there are social, cultural, or power differences between you and the people trying out your interface. For example, Indrani Medhi and colleagues showed this effect in India, where it was exacerbated when the experimenter was white.

Now, you shouldn't take this to mean that developers shouldn't try stuff out with users. Being the person who is both building something and watching people try it out is incredibly valuable. One example I like a lot is Mike Krieger, one of the Instagram founders and a former master's student and TA of mine. When Mike left Stanford for Silicon Valley, every Friday afternoon he would bring people into the lab, into his office, and have them try out whatever the team was working on that week. That way they got regular feedback every week, and the people building the system got to see real people trying it out. This can be nails-on-the-chalkboard painful, but you also learn a ton.

So how do we get beyond "Do you like my interface?" The basic strategy we're going to talk about today is using specific measures and concrete questions to deliver meaningful results. One of the problems with "Do you like my interface?" is: compared to what? I think one of the reasons people say "yeah, sure" is that there's no comparison point. So one thing that's really important when you're measuring the effectiveness of your interface, even informally, is to have some kind of comparison. It's also important to think about the yardstick: what constitutes good in this arena? What are the measures you're going to use?

One way to start is by asking a base-rate question, like: what fraction of people click on the first link in a search results page? Or: what fraction of students come to class? Once we start to measure correlations, things get even more interesting. Is there a relationship between the time of day a class is offered and how many students attend it? Or between the position of a search result and its click-through rate? For both the students and the click-throughs, there can be multiple explanations. For example, if fewer students attend early-morning classes, is that a function of when students want to show up, or of when good professors want to teach? With the click-through example, there are also two kinds of explanations.
If lower-placed links yield fewer clicks, is that because those links are intrinsically poorer quality, or is it because people just click on the first link and don't bother getting to the second one, even if it might be better? To identify placement as playing a causal role, you need to isolate it as a variable, by, say, randomizing the order of the search results (there's a small code sketch of this idea below).

As we talk about these experiments, let's introduce a few terms that will help us. The different conditions that we try are the things we're manipulating: for example, the time a class is offered, or the location of a particular link on a search results page. These manipulations are independent variables, because they're independent of what the user does; they're under the control of the experimenter. Then we measure what the user does, and those measures are called dependent variables, because they depend on what the user does. Common dependent measures in HCI include task completion time: how long does it take somebody to complete a task, such as finding something they want to buy, creating a new account, or ordering an item? Accuracy: how many mistakes did people make, and were they fatal errors or things they could quickly recover from? Recall: how much does a person remember afterward, or after a period of time? And emotional response: how does the person feel about the task they completed? Were they confident? Were they stressed? Would they recommend the system to a friend? So your independent variables are the things you manipulate, and your dependent variables are the things you measure.

How reliable is your experiment? If you ran it again, would you see the same results? That's the internal validity of the experiment. To have a precise experiment, you need to remove confounding factors, and it's important to study enough people that the result is unlikely to have arisen by chance. You might be able to run the same study over and over and get the same result, yet it may not matter in any real-world sense: external validity is the generalizability of your results. Does this apply only to 18-year-olds in a college classroom, or does it apply to everybody in the world?

Let's bring this back to HCI and talk about one of the problems you're likely to face as a designer. One of the things we commonly want to ask is: is my cool new approach better than the industry standard? Because, after all, that's why you're making the new thing. One of the challenges, especially early in the design process, is that your approach may still be very much in its prototype stage, while the industry standard is likely to benefit from years and years of refinement, and at the same time may be stuck with years and years of cruft that may or may not be intrinsic to its approach. So if you compare your cool new tool to some industry standard, there are two things varying: one is the fidelity of the implementation, and the other, of course, is the approach. Consequently, when you get the results, you can't know whether to attribute them to fidelity, to approach, or to some combination of the two. So we're going to talk about ways of teasing apart those different causal factors.
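Before we do, here's that sketch: a minimal example in Python of the search-results idea, to make the independent/dependent variable distinction concrete. The data and function names are hypothetical, not from any real search engine: we randomize the position each result is shown at (the independent variable) and record whether it gets clicked (the dependent variable), then compute click-through rate by position.

    import random
    from collections import defaultdict

    def randomized_order(results):
        # Shuffle the results so position is assigned by chance rather than
        # by relevance, isolating placement as the manipulated variable.
        shuffled = list(results)
        random.shuffle(shuffled)
        return shuffled

    def click_through_by_position(log):
        # Dependent variable: fraction of impressions at each position
        # that received a click.
        shown = defaultdict(int)
        clicked = defaultdict(int)
        for position, was_clicked in log:
            shown[position] += 1
            if was_clicked:
                clicked[position] += 1
        return {pos: clicked[pos] / shown[pos] for pos in sorted(shown)}

    # Example with made-up data: (position shown, whether it was clicked).
    page = randomized_order(["result_a", "result_b", "result_c"])
    log = [(1, True), (1, True), (1, False), (2, True), (2, False), (3, False)]
    print(click_through_by_position(log))  # position 1 gets the highest rate in this toy log

If position still predicts clicks after the order has been randomized, then placement itself is doing causal work, independent of link quality.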
Now, one thing I should say right off the bat is that there are times when it's more or less relevant whether you have a good handle on what the causal factors are. For example, if you're trying to decide between two digital cameras, at the end of the day maybe all you care about is image quality, or usability, or some other factor, and exactly what makes that image quality better or worse may be less relevant to you. If you don't have control over the variables, then identifying cause may not be what you need. But when you're a designer, you do have control over the variables, and that's when it's really important to ascertain cause.

Here's an example of a study that came out right when the iPhone was released, done by a research firm called User Centric. I'm going to read from a news article about it: "Research firm User Centric has released a study that tries to gauge how effective the iPhone's unusual on-screen keyboard is. The goal is certainly a noble one, but I can't say that the survey's approach results in data that makes much sense. User Centric brought in 20 owners of other phones; half had QWERTY keyboards, half had ordinary numeric phone keypads. None were familiar with the iPhone. The research involved having the test subjects enter six sample text messages with the phones they already had, and six with the iPhone. The end result was that the iPhone newbies took twice as long to enter text with the iPhone as they did with their own phones, and made lots more typos."

So let's critique the study and talk about its benefits and drawbacks. Here's the web page directly from User Centric. What's the manipulation in this study? The manipulation is the input style. And what's the measure? Words per minute. There's absolutely value in measuring the initial usability of the iPhone, for several reasons: if you're introducing a new technology, it's beneficial if people can get up to speed quickly. However, it's important to realize that this comparison is intrinsically unfair, because the users of the previous cell phones were experts at that input modality, while the people using the iPhone were novices at it. It seems quite likely that iPhone users, once they become actual users, will get better over time. So if you're not used to something the first time you try it, that may not be a deal killer, and it's certainly not an apples-to-apples comparison.

Another thing we don't get out of this article is whether the difference is significant. We read that each person typed six messages in each of two conditions, their own device and the iPhone, and that the people typing with the iPhone were half as fast as when they got to type on the device they were accustomed to. So while this may tell us something about the initial usability of the iPhone, in terms of long-term usability I don't think we get much out of it.

If you weren't satisfied by that initial data, you're in good company: neither were the authors of the study. They went back a month later and ran another study, bringing 40 new people into the lab who were either iPhone users, QWERTY users, or nine-key users.
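As an aside on that significance question: because each participant typed in both conditions, a paired, within-subject test would be the natural check. Here's a rough sketch in Python with invented numbers; these are not User Centric's data, just an illustration of the shape of the analysis.

    from scipy import stats

    # Invented words-per-minute figures, one pair per participant:
    # speed on the phone they already owned vs. speed on the iPhone.
    own_phone_wpm = [32, 28, 35, 30, 27, 33, 29, 31, 34, 26]
    iphone_wpm    = [16, 15, 18, 14, 13, 17, 15, 16, 19, 12]

    # The same person contributes a measurement in both conditions,
    # so a paired (within-subject) t-test is the appropriate comparison.
    t_stat, p_value = stats.ttest_rel(own_phone_wpm, iphone_wpm)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

A small p-value would tell you the first-use gap is unlikely to be chance, but it still says nothing about how people perform after practice, which is exactly what the follow-up study looked at.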
And now it's more of an apples-to-apples comparison, in that they're testing people who are relative experts in each of the three modalities. After about a month on the iPhone, you're probably starting to asymptote in terms of your performance; it definitely keeps improving over time, even past a month, but a month of experience starts to be reasonable. Once again the manipulation is input style, and we're measuring speed; this time we're also measuring error rate. What they found is that iPhone users and QWERTY users are essentially the same speed, and the numeric keypad users are much slower. However, the iPhone users make many more errors.

Now, one thing I should point out about this study is that each device was used by a different group of people. It was done this way so that each device was used by people who were comfortable and experienced with it, which removes the worry of having newbies working on these devices. However, especially in 2007, there may have been significant differences in who those people were: the early adopters of the 2007 iPhone, the business users particularly drawn to QWERTY devices, and the people who had better things to do with their time than send email on a phone and were still using nine-key devices. So while this comparison is better than the previous one, the potential for variation between the user populations is still problematic. If what you'd like to claim is something about the intrinsic properties of the device, the result may, at least in part, have to do with the users.

So what are some strategies for a fairer comparison? To brainstorm a few options: one thing you can do is put your approach into a production setting. That may seem like a lot of work, and sometimes it is, but in the age of the web it's a lot easier than it used to be, and it's possible even if you don't have access to the servers of the service you're comparing against; you can use things like a proxy server or client-side scripting to insert your own technique and get an apples-to-apples comparison. A second strategy for neutralizing the difference between a production version and your new approach is to build a version of the production system in the same style as your new approach, so that the two are equivalent in implementation fidelity. A third strategy, used commonly in research, is to scale things down, so you're looking at just a piece of the system at a particular point in time; that way you don't have to implement a whole big giant thing, and you can focus on one small piece and have that comparison be fair. And a fourth strategy is that when expertise is relevant, train people up: give them the practice they need so that they start to hit that asymptote in performance, and you get a better read than you would from newbies.

So now, to close out this lecture: if somebody asks you, "Is interface X better than interface Y?", you know we're off to a good start, because we have a comparison. However, you also know to be worried: what does "better" mean? Often, in a complex system, you're going to have several measures, and that's totally fine. There's a lot of value, though, in being explicit about what you mean by "better."
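One way to stay honest about that is to report each measure separately rather than collapsing everything into a single "better" score. Here's a minimal sketch with made-up numbers and hypothetical measure names:

    from statistics import mean

    # Invented data: several separate dependent measures for two interfaces.
    results = {
        "interface_x": {"seconds_per_task": [41, 38, 45, 40],
                        "errors": [1, 0, 2, 1],
                        "satisfaction_1_to_5": [4, 5, 4, 4]},
        "interface_y": {"seconds_per_task": [52, 49, 55, 47],
                        "errors": [0, 0, 1, 0],
                        "satisfaction_1_to_5": [3, 4, 4, 3]},
    }

    # Report each dependent measure on its own, rather than one combined score.
    for name, measures in results.items():
        print(name)
        for measure, values in measures.items():
            print(f"  {measure}: mean = {mean(values):.1f}")

In this toy data, interface X wins on speed while interface Y wins on errors; which one is "better" depends on which measure matters for what you're trying to do.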
What are you trying to accomplish? What are you trying to improve? And if anybody ever tells you that their interface is always better, don't believe them, because nearly all the time the answer is going to be "it depends." The interesting question is: what does it depend on? Most interfaces are good for some things and not for others. For example, if you have a tablet computer where all of the screen is devoted to the display, that's going to be great for reading, web browsing, looking at pictures, that kind of activity. It's not so good if you want to type a novel.

So here we've introduced controlled comparison as a way of finding the smoking gun, as a way of inferring cause; when you have only two conditions, we'll talk about that as a minimal pairs design. As a practicing designer, the reason to care about what's causal is that it gives you the material to make better decisions going forward. A lot of studies violate this constraint, and that gets dangerous, because it prevents you from making sound decisions. I hope the tools we've talked about today and in the next several lectures will help you become a wise skeptic, like our friend in this xkcd comic. I'll see you next time.