In this lecture, we're going to talk about running controlled experiments. This is a great thing to do on your own if you're trying to analyze the usability of a system. But a lot of components of this carry over to the other types of analysis we'll use on systems, so there are a lot of lessons here.
Let's go through the process of running controlled experiments. We're essentially going to use the scientific method. We start out by stating a lucid, testable hypothesis. And we're going to go through examples of all these stages in this lecture. Once we have our hypothesis, we need to identify independent and dependent variables. That is, what kinds of things are we going to alter, and what kinds of things are we going to measure? Then, we design the experimental protocol. That's the actual steps of the experiment that we're going to carry out in order to test our hypothesis. Then, we choose a user population. These are hopefully the same people who would be using our system. But if those people aren't available to us, we should choose people who have similar experience, awareness of the context and environment they'll be working in, or a similar technical background.
The next stage is to run some pilot participants. These can be anyone. Bring in some people who are friendly and have them go through your experiment before you unleash it on the world. Every single experiment I've run, and I have run hundreds of them in my life as a researcher, has had problems in it the first time around. You have to run a pilot experiment because there will be difficulties with your instructions, things will be unclear, people will do things incorrectly. You will have trouble measuring. There are tons of things that can go wrong, and something always does. So if you run a pilot test, where you have some participants come in, try to do the test and give you some feedback, you'll be able to refine your experiment to make it more effective.
Fix your experimental protocol after you've done the pilot test, and then you actually run the experiment. You're going to collect data from all of your participants, and when you're done, you perform a statistical analysis. That will let you know whether you have real results that have meaning, or whether there's no difference between the conditions that you tested. Finally, you draw your conclusions and communicate your results.
Recall the experiment that we did when we were studying usability and we wanted to see if it was faster to log in to the iPhone using thumbprint authentication versus typing in the PIN. We had the same user do both and we timed it, and in the example of this user, we found that it was much faster to log in with the thumbprint. But is that universally true? In this lecture, we're going to do a controlled experiment that you all will participate in. And we'll compare the biometric versus PIN authentication. And for those of you who have Android devices, we'll have a third data point for gesture-based authentication.
So let's start with step one, stating a lucid, testable hypothesis. In this case, we're going to use the example that a mobile phone log-in using a fingerprint is faster than with a PIN entry. This is easy to test and it's very clear. We're going to time how long it takes to log in with a fingerprint and time how long it takes to log in with a PIN.
Next, we need to choose the variables. We need to pick the independent variables, and these are the things that we manipulate, the things we change. In this case, it's our log-in method. Our users will log in either with a PIN or with the thumbprint. Our dependent variables, the things that depend on what we're testing, are the things that we're measuring. In this case, that will be the time to log in. It can be confusing to remember independent and dependent variables.
But if you remember that there are things that you're manipulating and things that you're measuring, you want to have a list of both of those.
Next, you design your experimental protocol. That's actually the steps that people will go through when they're doing your experiment. When you're doing this in HCI, you need to choose the tasks first. So if you're analyzing the security of a system, like the authentication system, what are the tasks that a person will do? Well, if we're trying to authenticate on a mobile device, the only task we're going to be testing here is to log in. But you can have a lot of different things going on in your system, and you want a task for each of those. We talked about tasks when we were discussing usability in the first week, and these can be similar tasks to those. You want clear, concrete things that a user would do in your system, and those become the tasks in the experiment.
You then want to ask if you're going to do a between or within subjects experiment. In a between subjects experiment, each subject has one condition. So, if we did a between subjects experiment for our phone log-in, half the people would log in with the PIN and the other half would log in with a fingerprint. A within subjects experiment has each subject run multiple conditions. So in our experiment, the same person would log in both with a thumbprint and with a PIN. Within subjects experiments tend to be more desirable. But there can be some learning effects, so you want to make sure that your users are doing things in different orders. That way they don't get better at a task in one condition and then perform well simply because they're doing that same task again later under a different condition.
The experimental protocol for this experiment is spelled out on the form that you all will complete. If we look at the text zoomed in, it says, using a stopwatch timer, precisely time how long it takes you to log in to your mobile device.
Begin with the lock screen visible. Start the timer as you start authenticating, and stop the timer when the home screen appears. These instructions actually evolved out of doing pilot testing. In the first set of instructions, it was unclear whether the phone should be completely off or whether you should be on the lock screen. And it was unclear when you should stop the stopwatch. So, these instructions came out of the pilot testing. Users will time themselves following these instructions and then enter the results on this form, which I'll discuss later in the lecture.
Once you have your experimental protocol put together, you want to run the experiment. You should have done the pilot study. And then you want to have a checklist of all the steps, so all the users are doing the same thing. I have a pretty simple set of instructions because this is a pretty straightforward experiment. However, often you'll have users doing multiple tasks and you want to have a list so you give exactly the same instructions to everyone, and you have them do the same process. If you don't have that list, it's likely that you'll forget something, and that can skew your results so you don't get anything accurate at the end. Going through your checklist, have everyone do the experiment, collect your data, and then you're ready to actually do the analysis.
It's important when you're doing a quantitative experiment that you do a statistical comparison. A t-test is an example of a simple statistical test. We're not going to get deep into statistics in this course, but I will give you an example of how to do a basic t-test analysis. So here, we're looking at an Excel spreadsheet with the example data from our test. We have ten users who've entered the data, and for each one we have the time that it took them to log in using the biometric and the time it took them to log in using the PIN.
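If you'd rather keep the data outside a spreadsheet, the same layout (one row per participant, one column per condition) can be sketched in Python. This is only an illustration: the column names `participant`, `biometric_s`, and `pin_s`, the `Timing` class, and the sample values are all made up for this sketch and are not the lecture's form or data.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Timing:
    participant: str
    biometric_s: float  # seconds to log in with the fingerprint
    pin_s: float        # seconds to log in with the PIN


def load_timings(csv_text: str) -> list[Timing]:
    """Parse CSV text with the (assumed) columns participant, biometric_s, pin_s."""
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        row = Timing(rec["participant"], float(rec["biometric_s"]), float(rec["pin_s"]))
        # A stopwatch time can't be zero or negative, so reject obvious entry errors.
        if row.biometric_s <= 0 or row.pin_s <= 0:
            raise ValueError(f"non-positive time for {row.participant}")
        rows.append(row)
    return rows


# Made-up values for illustration only; not the data shown in the lecture.
SAMPLE = """participant,biometric_s,pin_s
P1,1.2,3.1
P2,1.6,3.9
"""

timings = load_timings(SAMPLE)
print(timings[0])
```

Keeping both times on the same row matters because this is a within subjects design: each fingerprint time is paired with the PIN time from the same person.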
Now the first thing that we want to do is just see what the average time was for each. So we can use the average function from Excel and select the data from the column. And here we see for the biometric, the average time is 1.44 seconds. We can just copy that formula cell, and we can then see that the average PIN log-in time was 3.54 seconds. Now at first glance, it's pretty clear that the PIN log-in time is a lot longer than the biometric time. But in order to actually say that there's a difference, we have to show that it takes significantly longer to log in with the PIN than with biometrics. And it doesn't matter how big a difference there is in those numbers, you actually have to do a statistical test.
In this case, we're comparing two populations, the test with biometric and the test with the PIN. Since we have two groups and we want to see if there's a significant difference between the average time in each, we can use a t-test, which is a standard statistical test. It gives you the probability that a difference as large as the one you see in those two averages would happen by chance. So if we use a simpler example, pretend that our biometric log-in time was 1.44 seconds and the PIN log-in time was 1.45 seconds. Yes, 1.45 is larger than 1.44. But that difference is so small that it could be attributable to one person happening to take a tiny bit longer, or having a tiny bit of difference in starting and stopping their stopwatch. The t-test actually shows us if there's a real difference there, or if it's too small to be considered significant.
The way that you run this is to use the t-test function, which is built into Excel. You first select the first set of results, then the second set of results. If you click on the t-test option here, you'll get a set of options that you can put into this formula. I know what those options are, and you can read a little more about them. I'm going to do a two-tailed t-test, and I'm going to do it as type one, which in Excel is the paired test. That fits here because the same people produced both sets of times.
Again, if you click on this link that pops up when you start entering the formula, Excel will explain a little bit more about what those last two values mean. But what we end up with after running the t-test is this value. Ours is 0.001. That's the probability that a difference this large between the two means would show up by chance alone. In other words, there's a one-tenth of 1% chance of seeing this gap if it were simply random variation between the time to log in with the PIN and the time to log in with a biometric. That's a pretty small probability. In general for a t-test, if this number is less than 0.05, you can say there's a significant difference between the means. If it's greater than 0.05, you can't say there's a significant difference. What that means is there has to be no more than a 5% chance that a difference like that is just random. In this case, our result is 0.001, which is a lot smaller than 0.05, so we can say we have a significant difference. Once you've done your analysis, you report your results.
So now that we've seen this process, let's actually do it. You're going to follow those instructions, time yourself and enter your data into the form that's linked on the website. If you go to the section of the website that has that link, you'll be able to see the instructions so you can go over them before you time yourself and enter your data. We've got this set up so that if you've got an iPhone, you can enter your thumbprint authentication time versus the PIN. But if you have an Android, you can also do the gesture-based authentication, and we'll collect all that data. When you're done, you'll see that there's a link for you to check out the results. And that'll include some statistics like the t-test that we looked at before, and those are also linked on the site. So this is an example of how you do an experimental analysis
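For anyone who wants to try the same kind of analysis without Excel, here is a minimal Python sketch of a paired, two-tailed t-test using only the standard library. It is an illustration under stated assumptions, not the spreadsheet from the lecture: the function names are invented for this sketch, the p-value is approximated by numerically integrating the t distribution, and the timing values are made up.

```python
import math
from statistics import mean, stdev


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t: float, df: int, steps: int = 2000) -> float:
    """Approximate two-tailed p-value using Simpson's rule on [0, |t|]."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1 - 2 * area)


def paired_t_test(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (t statistic, two-tailed p) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Fails with ZeroDivisionError if every difference is identical.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, two_tailed_p(t, n - 1)


# Made-up timings in seconds, one pair per participant; not the lecture's data.
biometric = [1.3, 1.5, 1.2, 1.8, 1.4]
pin = [3.2, 3.9, 3.1, 4.0, 3.5]

t_stat, p_value = paired_t_test(biometric, pin)
print(f"mean biometric = {mean(biometric):.2f} s, mean PIN = {mean(pin):.2f} s")
print(f"t = {t_stat:.3f}, p = {p_value:.5f}, significant at 0.05: {p_value < 0.05}")
```

The test works on the per-person differences rather than on the two columns separately, which is what makes it a paired test. In practice a statistics library would usually be the better choice; the hand-rolled integration here is only to keep the sketch self-contained.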