This lecture is about some practical issues that you would have to address in the evaluation of text retrieval systems. We will continue the discussion of evaluation and cover some practical issues that you have to solve when actually evaluating text retrieval systems. In order to create a test collection, we have to create a set of queries, a set of documents, and a set of relevance judgments, and it turns out that each is actually challenging to create. First, the documents and queries must be representative: they must represent the real queries and real documents that users handle. We also have to use many queries and many documents in order to avoid biased conclusions. For the matching of relevant documents with the queries, we also need to ensure that there exist enough relevant documents for each query. If a query has only one relevant document in the collection, it's not very informative to compare different methods using that query, because there's not much room to see a difference. So ideally there should be more relevant documents in the collection, yet the queries should still represent the real queries that we care about. In terms of relevance judgments, the challenge is to ensure complete judgments of all the documents for all the queries while minimizing human effort, because we have to use human labor to label these documents, and that is very labor intensive. As a result, it's impossible to actually label all the documents for all the queries, especially for a giant data set like the web. So this is a major and very difficult challenge. Measures are also challenging, because we want measures that accurately reflect the perceived utility of users. We have to consider carefully what the users care about and then design measures to capture that. If your measure is not measuring the right thing, then your conclusions could be misleading, so this is very important. So we're going to talk about a couple of issues here. One is the statistical significance test, which is also a reason why we have to use a lot of queries. The question here is: how sure can you be that an observed difference doesn't simply result from the particular queries you chose? Here are some sample results of average precision for System A and System B in two different experiments, and at the bottom we have the mean average precision (MAP). If you look at the MAP values, they are exactly the same in both experiments: 0.20 for System A and 0.40 for System B in Experiment 1, and again 0.20 and 0.40 in Experiment 2, so they are identical. Yet if you look at the exact average precision values for the individual queries, you will realize that in one case you would feel you can trust the conclusion given by the average, while in the other case you would feel unsure. So why don't you take a look at all these numbers for a moment; pause the video. If we look only at the mean average precision, we can easily say that System B is better, right? After all, 0.40 is twice as much as 0.20, so that's better performance. But if you look at the detailed per-query results of the two experiments, you will see that we would be more confident about this conclusion in Experiment 1 than in Experiment 2.
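To make the per-query numbers concrete, here is a minimal sketch, not taken from the lecture, of how average precision per query and the mean average precision could be computed; the rankings, relevance judgments, and relevant-document counts below are hypothetical, purely for illustration.

```python
# Minimal sketch (hypothetical data): per-query average precision (AP) and
# mean average precision (MAP) for two systems on the same set of queries.

def average_precision(ranked_relevance, num_relevant):
    """ranked_relevance: 0/1 judgments for a ranked list, top first;
    num_relevant: total relevant documents for the query in the collection."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank      # precision at each relevant document
    return total / num_relevant if num_relevant else 0.0

# Hypothetical rankings (1 = relevant) and total relevant counts per query.
queries = [
    {"A": [1, 0, 0, 0], "B": [1, 1, 0, 0], "num_relevant": 2},
    {"A": [0, 1, 0, 0], "B": [1, 0, 1, 0], "num_relevant": 2},
    {"A": [0, 0, 1, 0], "B": [0, 1, 1, 0], "num_relevant": 2},
]

ap_a = [average_precision(q["A"], q["num_relevant"]) for q in queries]
ap_b = [average_precision(q["B"], q["num_relevant"]) for q in queries]
print("per-query AP for A:", ap_a, "MAP:", sum(ap_a) / len(ap_a))
print("per-query AP for B:", ap_b, "MAP:", sum(ap_b) / len(ap_b))
```

Computing AP per query first and only then averaging is what lets us inspect the spread across queries, which is exactly what the significance tests discussed next exploit.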
In Experiment 1, the numbers seem to be consistently better for System B, whereas in Experiment 2 we're not so sure: on some queries System A is better, on others System B is better, and yet if we look only at the average, System B wins. So what do you think? How reliable is our conclusion if we only look at the average? Intuitively, we feel Experiment 1 is more reliable, but how can we quantify the answer to this question? This is why we need to do a statistical significance test. The idea of a statistical significance test is basically to assess the variance across these different queries. If there is a big variance, that means the results could fluctuate a lot depending on the queries, and unless we have used a lot of queries, the results might change if we use another set of queries. So if there is high variance, the conclusion is not very reliable. Let's look at the results of the second experiment again. Here we show two different ways to compare the systems. One is the sign test, where we just look at the sign of the difference: if System B is better than System A on a query, we record a plus sign; when System A is better, we record a minus sign, and so on. In this case there are seven queries: four where System B is better, but three where System A is better. Intuitively, this is almost like a random result, right? If you flip seven coins and use plus to denote heads and minus to denote tails, this could easily be the outcome of just randomly flipping those seven coins. So the fact that the average is larger doesn't tell us much; we can't reliably conclude that System B is better. This can be quantified with a p-value, which is roughly the probability of seeing a result like this purely because of random fluctuation. In this case, the p-value is 1.0, meaning the result is entirely consistent with random fluctuation. The Wilcoxon signed-rank test is a non-parametric test in which we look not only at the signs but also at the magnitudes of the differences, and here it leads to a similar conclusion: the observed difference could very well be random. To illustrate this, let's think about the distribution of the difference under the null hypothesis, where the mean is zero. That is, we start with the assumption that there is no real difference between the two systems, but because of random fluctuations depending on the queries, we might still observe a difference; the observed difference might fall on the left side or on the right side of zero, and this curve shows the probability of observing values that deviate from zero. If the observed difference falls near the center of this distribution, then the chance is very high that it is in fact just a random observation. We can define a region of likely observations due to random fluctuation, covering, say, 95% of all the outcomes; a difference falling inside this region may still be due to random fluctuation. But if you observe a difference far out in the tail, outside this region, then it is unlikely to come from random fluctuation; there is only a very small probability of observing such a difference by chance. In that case, we can conclude that the difference must be real, and System B is indeed better. So this is the idea of the statistical significance test.
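As a rough illustration, not part of the lecture, the sketch below runs a sign test and a Wilcoxon signed-rank test on hypothetical per-query average precision values for the two systems; it assumes SciPy is available (scipy.stats.wilcoxon and, for SciPy >= 1.7, scipy.stats.binomtest).

```python
# Rough sketch: paired significance tests on hypothetical per-query AP values.
from scipy.stats import binomtest, wilcoxon

# Hypothetical average precision per query (same seven queries, two systems).
ap_a = [0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12]
ap_b = [0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46]

diffs = [b - a for a, b in zip(ap_a, ap_b)]

# Sign test: count queries where B beats A; under the null hypothesis that
# the systems are equivalent, this count follows a Binomial(n, 0.5) distribution.
wins_b = sum(d > 0 for d in diffs)
n = sum(d != 0 for d in diffs)          # ignore tied queries
sign_p = binomtest(wins_b, n, p=0.5).pvalue

# Wilcoxon signed-rank test: also uses the magnitude of the differences.
wilcoxon_p = wilcoxon(ap_a, ap_b).pvalue

print(f"sign test p = {sign_p:.3f}, Wilcoxon p = {wilcoxon_p:.3f}")
# A small p-value (e.g., < 0.05) suggests the difference is unlikely to be
# due to random fluctuation across queries alone.
```

A paired test is used because both systems are evaluated on the same queries, so the per-query differences are what carry the signal.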
The takeaway message here is that you have to use many queries to avoid jumping to a conclusion, as in this case concluding that System B is better. There are many different ways of doing such a statistical significance test. Now let's talk about the other problem: making judgments. As we said earlier, it's very hard to judge all the documents completely unless it's a very small data set. So the question is: if we can't afford judging all the documents in the collection, which subset should we judge? The solution here is pooling, a strategy that has been used in many cases to solve this problem. The idea of pooling is the following. We first choose a diverse set of ranking methods; these are text retrieval systems, and we hope these methods can help us nominate likely relevant documents. The goal is to pick out the relevant documents, because we want to make judgments on relevant documents; those are the most useful documents from the user's perspective. Then we have each system return its top-K documents; K can vary across systems, but the point is to ask each system to suggest its most likely relevant documents. We then simply combine all these top-K sets to form a pool of documents for the human assessors to judge. So imagine you have many systems, each returning its top-K documents; we take those top-K sets and form their union. Of course, many documents will be duplicated, because many systems might have retrieved the same relevant documents, so there will be some duplicates, and there will also be unique documents that are returned by only one system. The idea of having a diverse set of ranking methods is to ensure that the pool is broad and includes as many relevant documents as possible. The human assessors then make complete judgments on this pool, and the remaining unjudged documents are usually just assumed to be non-relevant. If the pool is large enough, this assumption is okay; but if the pool is not very large, the assumption has to be reconsidered, and there are indeed other methods to handle such cases. Such a strategy is generally okay for comparing the systems that contributed to the pool: if your system participated in contributing to the pool, it's unlikely to be penalized, because its top-ranked documents have all been judged. However, this is problematic for evaluating a new system that did not contribute to the pool. Such a new system might be penalized, because it might have retrieved some relevant documents that have not been judged, and those documents would be assumed to be non-relevant. That's unfair.
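Here is a tiny sketch of the pooling step, with made-up run data rather than anything from the lecture: for one query, take each participating system's top-K results and form the union, which is the set handed to the human assessors.

```python
# Tiny sketch of pooling (hypothetical runs and K): union of each system's
# top-K documents for a single query; only pooled documents get judged.

def build_pool(runs, k):
    """runs: dict mapping system name -> ranked list of doc IDs for one query."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])   # duplicates across systems collapse in the set
    return pool

runs = {
    "system_A": ["d3", "d7", "d1", "d9"],
    "system_B": ["d7", "d2", "d3", "d5"],
    "system_C": ["d8", "d3", "d6", "d2"],
}
pool = build_pool(runs, k=3)
print(sorted(pool))   # documents outside the pool are assumed non-relevant
```

A more diverse set of contributing runs makes the union larger and more likely to contain most of the truly relevant documents, which is exactly why diversity of the ranking methods matters.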
To summarize this whole part on text retrieval evaluation: it is extremely important, because retrieval is an empirically defined problem; if we don't rely on users, there is no way to tell whether one method works better than another. If we have an improper experiment design, we might misguide our research or applications and draw wrong conclusions, as we have seen in some of our discussions, so make sure to get it right for your research or application. The main methodology is the Cranfield evaluation methodology, and it is the main paradigm used in all kinds of empirical evaluation tasks, not just search engine evaluation. MAP and nDCG are the two main measures that you should definitely know about; they are appropriate for comparing ranking algorithms, and you will see them often in research papers. Precision at 10 documents is easier to interpret from the user's perspective, so that's also often useful. What's not covered here are some other evaluation strategies, such as the A/B test, where the system mixes the results of two methods randomly and shows the mixed results to users. Of course, the users don't see which result came from which method; they simply judge the results or click on documents in a search engine application. The search engine can then check the clicked documents and see whether one method has contributed more of them: if users tend to click on the results from one method, that suggests this method may be better (a rough sketch of this idea appears at the end of this section). This leverages the real users of a search engine to do evaluation; it's called an A/B test, and it's a strategy often used by modern commercial search engines. Another way to evaluate information retrieval or text retrieval systems is user studies, which we haven't covered either. I've put some references here that you can look at if you want to know more about that. There are three additional readings, three mini books about evaluation; they are all excellent and cover a broad review of information retrieval evaluation, including some of the things that we discussed, but they also have a lot more to offer.
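As a rough sketch of the "randomly mix two result lists and credit clicks" idea mentioned above: the code below is an illustrative simplification with hypothetical document IDs and clicks, not the lecture's method, and production systems use more careful interleaving and A/B schemes.

```python
# Rough sketch of click-based online comparison: randomly merge two ranked
# lists, show the merged list, and credit each click to the method that
# contributed the clicked document. All data here is hypothetical.
import random

def interleave(list_a, list_b, rng=random):
    """Randomly alternate between two rankings, skipping duplicate documents."""
    merged, credit = [], {}
    ia, ib = 0, 0
    while ia < len(list_a) or ib < len(list_b):
        pick_a = ib >= len(list_b) or (ia < len(list_a) and rng.random() < 0.5)
        doc = list_a[ia] if pick_a else list_b[ib]
        if pick_a:
            ia += 1
        else:
            ib += 1
        if doc not in credit:
            merged.append(doc)
            credit[doc] = "A" if pick_a else "B"   # which method contributed it
    return merged, credit

ranking_a = ["d1", "d4", "d2", "d7"]
ranking_b = ["d3", "d1", "d5", "d2"]
merged, credit = interleave(ranking_a, ranking_b)

clicks = ["d3", "d5"]                 # hypothetical user clicks on the merged list
wins = {"A": 0, "B": 0}
for doc in clicks:
    wins[credit[doc]] += 1
print(merged)
print(wins)   # more credited clicks for one method suggests it may be better
```

In practice such click comparisons are aggregated over many users and queries before drawing any conclusion about which method is better.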