INST 734
Information Retrieval Systems
Fall 2014
Exercise E5


Evaluating Systems

The purpose of this assignment is to gain experience with the evaluation of information retrieval systems. We will use the University of Delaware Virtual IR Lab for this assignment. Note, however, that the Virtual IR Lab system is still under development, and some future changes to the system may result in changes to details of what things are called. If I learn of any such cases before the due date of this assignment, I will update the assignment. In the meantime, if anything is confusing, just ask about it!

The version of the Virtual IR Lab that you should use is the UMD instance, which is at http://infolab.ece.udel.edu:8008/VIRLab_UMD (other versions that you might find through a Web search cannot be used for this assignment). Each student in the class has an account on the UMD instance of the Virtual IR Lab. Your userid is the same as the userid that you use for ELMS. Your password for the UMD instance of the Virtual IR Lab can be found on ELMS (and you can and should change it after you log in, but don't lose your new password!).

The first thing to do after you log in is to create the two scoring functions that we will compare. For this, go to "Create Function" in the left-hand menu, enter "BM25a" (without the quotes) in the "function name" field, and then cut and paste the following into the content:

double okapiK1 = 2;
double okapiB = 0.75;
double okapiK3 = 1000;
for(occur)
{
        double idf = log((docN-DF[i]+0.5)/(DF[i]+0.5));
        double weight = ((okapiK1+1.0)*tf[i]) / (okapiK1*(1.0-okapiB+okapiB*docLength/docLengthAvg)+tf[i]);
        double tWeight = ((okapiK3+1)*qf[i])/(okapiK3+qf[i]);
        score+=idf*weight*tWeight;
}
and hit the save button. Then go to "Manage Function" on the left menu and you should see your new function there, listed under "Functions can be used to make search engine". There will also be two other functions that cannot be used -- ignore them. Then go to Create Function again, and create a function called "BM25b". Cut and paste the following into the content for that one:
double okapiK1 = 1.2;
double okapiB = 0.75;
double okapiK3 = 1000;
for(occur)
{
        double idf = log((docN-DF[i]+0.5)/(DF[i]+0.5));
        double weight = ((okapiK1+1.0)*tf[i]) / (okapiK1*(1.0-okapiB+okapiB*docLength/docLengthAvg)+tf[i]);
        double tWeight = ((okapiK3+1)*qf[i])/(okapiK3+qf[i]);
        score+=idf*weight*tWeight;
}
and hit the save button. These two functions are the version of BM25 that we discussed in Module 3 (BM25a) and a second version (BM25b) that differs only in the value of a single parameter (okapiK1 is 2 in BM25a and 1.2 in BM25b).
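If you are curious about exactly what those pasted functions compute, the sketch below re-expresses the per-document score in ordinary Python. The names tf, df, qf, doc_n, doc_length, and doc_length_avg are just illustrative stand-ins for the similarly named VIRLab variables; you do not need to run this for the assignment, and the VIRLab's internal details may differ slightly.

import math

def bm25_score(tf, df, qf, doc_n, doc_length, doc_length_avg,
               k1=2.0, b=0.75, k3=1000.0):
    # tf, df, and qf are parallel lists over the query terms that occur in
    # the document: term frequency in the document, document frequency in
    # the collection, and term frequency in the query.
    score = 0.0
    for tf_i, df_i, qf_i in zip(tf, df, qf):
        idf = math.log((doc_n - df_i + 0.5) / (df_i + 0.5))
        weight = ((k1 + 1.0) * tf_i) / (k1 * (1.0 - b + b * doc_length / doc_length_avg) + tf_i)
        t_weight = ((k3 + 1.0) * qf_i) / (k3 + qf_i)
        score += idf * weight * t_weight
    return score

Changing k1 from 2.0 to 1.2 (the only difference between BM25a and BM25b) changes how quickly the term frequency contribution saturates, which is why the two systems will rank some documents differently.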

Now go to "Compare Search Engines" in the left menu. When you do, the left menu changes, and if you want to go back you will need to select "Retrieval Function" in the top menu. If all else fails (as it will if you select About -- don't do that!) try the back button and be patient. And if that fails, start over and log in again from the URL above -- your two retrieval functions (which actually are a combination of the document representation functions and comparison function that we discussed previously) should both still be there!

In "Compare Search Engines" first select your two retrieval functions. I suggest putting BM25a on the left and BM25b on the right. Then select ap8889 as the collection -- this is a collection of news stories from the Associated Press from 1988 and 1989. Then, click on the green "Select a TREC Query" button, from the popup menu select the only entry (ap8889), and and then from the list of queries that appears select query number 1 (which is "antitrust case pend" -- these queries are already stemmed; the real TREC query was "antitrust cases pending"). Finally, click on the bluish "Compare" button. The result (after a few seconds) should be two ranked lists of document numbers. Each list has a value labeled "MAP" above the list; this is the average precision achieved for that system on that query (MAP is actually an error in the labeling; these are AP values for single queries). You should get 0.0425 on the left and 0.04079 on the right. If so, the cut and paste of your retrieval functions and your operation of the system to this point has been correct, and you are ready to actually do the rest of this exercise. If not, check with me to sort out what you (or the system!) are doing wrong.

Now look at the annotations on the right side of the rightmost ranked list. Here you can see a notation that the document that had been in position 3 in the left list is in a higher position in the right list (position 2 in this case). Looking now at both lists, you can also see the scores for each document generated by each system -- they are shown below each document identifier. For example, the system on the left shows a score of 16.9643 for the document at rank 1, which is higher than any other score assigned to this query by that system. Indeed, that's how it got to be at rank 1; both systems sorted the documents by score. Of course, the document at rank 1 for the system on the right also has a score higher than any other document for that query by that system. Note, however, that the same document has different scores on the left and the right. The scores between the two systems cannot meaningfully be compared -- I could multiply all the scores on the right by 20 without changing the ranking by that system. You also can't meaningfully compare scores assigned even by the same system for different queries because the largest possible score often depends on how many query terms there are (and on how common each of those query terms is). Indeed, we only look at scores when we are checking to be sure our computation of the score is correct; otherwise we only use them as a basis for sorting to get the ranked list.

To the right or left of each score you will see a notation of whether the document is relevant or not. If it is relevant, that is shown in green as Relevant(1) because the system is capable of recording different levels of relevance; the ap8889 collection only contains binary relevance judgments, however, so 1 is the only value you will ever see for relevant documents. A quick skim will indicate that the system on the left found two relevant documents in the top 10, whereas the system on the right found only one. P@10 (precision at a cutoff of 10 documents) is thus 0.2 on the left and 0.1 on the right for this query. To save you from flipping to the second and third pages, the system shows you P@30 (labeled "P30") for each list. AP is also computed by the system (and mislabeled as "MAP") by doing the AP computation down to 1000 documents, which will save you a lot of scrolling!
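Precision at a cutoff is even simpler; here is a matching Python sketch (again, the judgments are made up for illustration, not taken from the system):

def precision_at_k(relevant_flags, k):
    # Fraction of the top k documents that are relevant.
    return sum(relevant_flags[:k]) / k

# Two relevant documents in the top 10 gives P@10 = 0.2
print(precision_at_k([1, 0, 0, 1, 0, 0, 0, 0, 0, 0], 10))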

Okay, now that you understand how to operate the system and what it is displaying, you are ready to do the assignment. To save some time, we will use only the first 20 ap8889 queries. Select each, one at a time, and then select Compare to see the result from each system for that query. Use Excel (or any other system you want) to make a table on which you can record the AP value (you can ignore P@10 and P@30) for each of the first 20 queries. In each row, record three numbers: the query number, the AP for BM25a and the AP for BM25b.

Now that we have those numbers, we can use them to determine which system is "better". We can start by averaging the AP across the queries for each system to get a MAP (Mean Average Precision) for each system. A higher MAP value would indicate that the system might be better. Another way to look at this is to count the number of queries for which the system on the left got higher, tied, or lower values of AP. Whichever system got more queries in the "higher" category might be better. The pattern of higher / tied / lower that you observe might simply result from chance, however. To check that, you can use the Sign Test calculator at the University of Amsterdam at http://www.fon.hum.uva.nl/Service/CGI-Inline/HTML/Statistics.html. If you had 4 higher, 7 the same, and 9 lower, you would enter 4 in "n+" (the number of queries where the first system wins) and 9 in "n-" (the number where the second system wins), and the system would compute p<=0.267, which means that there's a 26.7% chance (those Dutch are very precise!) that this outcome resulted from chance. When the p value comes out below 0.05, we say that the result is statistically significant. You should also run a Student's t-test (which is not so named because you are a student, but rather because the inventor published it under the pseudonym "Student" to avoid problems with his employer -- Wikipedia has the full story), for which you will need to enter the two columns of figures from your table because the t-test is computed on the (paired) differences in the AP values, not just on how many times each system won. If you have trouble getting to the t-test on the University of Amsterdam site, you could also use the T-Test for Dependent Means at http://www.socscistatistics.com/tests/signedranks (the tests on the University of Amsterdam site work fine, but its internal navigation links don't all work correctly). The t-test is generally more sensitive than the sign test, so it might find a difference (again, we say a statistically significant difference was detected when p<0.05) when the sign test does not.
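The online calculators are all you need, but if you would rather check your numbers locally, the following Python sketch computes the two MAP values, an exact two-sided sign test, and a paired t-test. It assumes a reasonably recent SciPy (binomtest was added in SciPy 1.7; older versions call it binom_test). The AP lists below are placeholders, not real results -- substitute the 20 values from your table, in the same query order for both systems.

from scipy import stats

ap_bm25a = [0.10, 0.25, 0.05, 0.40, 0.00]   # placeholder values only
ap_bm25b = [0.12, 0.20, 0.05, 0.35, 0.02]   # placeholder values only

print("MAP BM25a:", sum(ap_bm25a) / len(ap_bm25a))
print("MAP BM25b:", sum(ap_bm25b) / len(ap_bm25b))

# Sign test: count per-query wins for each side (ties are dropped), then
# run an exact two-sided binomial test with success probability 0.5.
n_plus = sum(a > b for a, b in zip(ap_bm25a, ap_bm25b))
n_minus = sum(a < b for a, b in zip(ap_bm25a, ap_bm25b))
print("sign test p:", stats.binomtest(n_plus, n=n_plus + n_minus, p=0.5).pvalue)

# Paired t-test on the per-query differences in AP.
t_stat, t_p = stats.ttest_rel(ap_bm25a, ap_bm25b)
print("paired t-test p:", t_p)

As a sanity check, stats.binomtest(4, n=13, p=0.5).pvalue comes out to about 0.267, matching the 4-higher, 9-lower example above.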

Summarize all of your results (the table, the two MAP values, the p values from the sign test and the t-test, and your conclusion about which system was better or whether no statistically significant difference can be detected) in a text document (e.g., using Word). Then pick one case where the system with the lower MAP actually won in the pairwise comparison of AP and examine the results to see if you can explain why that happened. Were the relevant documents found by one system markedly longer (or shorter) than those found by the other? Did one system match certain words better than others? Are there other factors that might explain the differences? For this comparison, the comparison display of which documents moved up and which moved down may be useful. You may or may not be able to find an obvious explanation (depending on which query you pick), but by trying to do so you will get a better idea of what different ranking algorithms -- the combination of representation functions and comparison functions -- actually do. Summarize your observations on this in a paragraph or so (and don't spend more than about 30 minutes on this part of the assignment).

If you ran into any difficulties with this assignment that you were not able to sort out with my help, please make a note of them in what you submit. This is a new system, and thus a new assignment, and we want to work out any teething pains before we use this system for the term project!

Please also include with your assignment a ranked list of which additional readings you would be interested in doing for Modules 6, 7, 8, and 9. Please select at least 5 readings, include at least one from each module, and list them in most-preferred-first order. You can refer to readings by module number and reading number using the numbering on the assigned summaries page. I will send out assignments on Monday, October 6 so that those who will be doing readings in Module 6 will have time to prepare.

You should submit your assignment on ELMS. Instructions for doing this are on the course ELMS site.


Doug Oard
Last modified: Fri Sep 26 21:31:25 2014