Overall Analysis
Measures are tuned for Interactive UI
This implementation is targeted for interactive use in search engines. A search UI usually has the first few results shown in ranked order, with the option to go to the next few results. This UI is intended to show the first three ranked sentences at the top of an entry with the theme words highlighted. Users are not forgiving of mistakes in these situations. The first result is much more important than the second, and so forth. People rarely click through to the second page.
The measures of effectiveness are formulated with this in mind. We used three:
- A variant of Mean Reciprocal Rank (MRR).
- "Rating" is a measure we created to model the user's behavior in a summarization UI. Our MMR variant and Rating are defined in the next Post.
- Non-zero counts whether the algorithm placed any recommendations in the top three. "Did we even hit the dartboard?"
A separate problem is that sentences with more words can dominate the Sentence Length measures the length of the highest rated sentences. In this chart "Inverse Length" measures how well the algorithm countered the effects of sentence length.
Overall Comparison Chart
- Key to algorithm names: "binary_normal" means that "binary" was used to create each cell, while "normal" multiplied each term vector with the mean normalized term vector. If there is no second key, the global value was 1. See post #1 for the full list of algorithms.
This display is a normalized version of the mean results for all 24 algorithm pairs, with four different measures. In all four, higher is better. "Inverse Length" means "how well it suppresses the length effect", "Rating" is the rating algorithm described above, "MRR" is our variant implementation of Mean Reciprocal Rank, and >0 is the number of results where any of the first three were in the first three sentences. None of these are absolutes, and the scales do not translate between measures. They simply show a relative ranking for each algorithm pair in the four measures: compare green to green, etc. The next post gives the detailed measurements in real units.
Grand Revelation
The grand revelation is:
always normalize the term vector! All 5 local algorithms worked best with normal as the global algorithm. The binary function truncates the term counts to 1. Binary plus normalizing the term vector was by far the best in all three success measurements, and was middling in counteracting sentence length. AugNorm + normal was the highest achiever which compensates well for sentence length. TF + normal was the best overall for sentence length, but was only average for the three effectiveness measures.
No comments:
Post a Comment