The Measurements
In this post we review the individual measures. These charts show each measure applied to all the algorithms, along with the basic statistics summary of the full dataset (not the per-algorithm aggregates). The measures are MRR, Rating, Sentence Length, and Non-Zero. MRR is a common method for evaluating search results and other kinds of recommendations; the other three were fabricated for this analysis.
Mean Reciprocal Rank
MRR is a common measure for search results. It attempts to model the unforgiving way in which users react to mistakes in the ordering of search results. It measures the position of the preferred result in the result list: if the "right" answer is third in the list, the reciprocal rank is 1/3.
The statistic here is the mean of three reciprocal ranks, one for each recommended sentence, scored by how far it lands from where it should be. If the second sentence is recommended at #2, that scores 1; at #1 or #3, 1/2. If the third sentence lands at #3, that scores 1; at #2, 1/2; at #1, 1/3. The measure only looks down to the 5th sentence.
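A minimal sketch of that scheme, assuming each target sentence scores the reciprocal of one plus its distance from its expected spot, and zero if it does not appear within the first five recommendations (the function name and signature are mine, not from the original code):

```python
def modified_mrr(recommended, expected=(1, 2, 3), window=5):
    """Mean of per-sentence reciprocal ranks.

    Each expected sentence scores 1 / (1 + distance) between the spot it
    landed in and the spot it should occupy, or 0 if it is not in the
    first `window` recommendations.

    recommended: sentence numbers in ranked order, e.g. [2, 1, 4]
    expected: the sentence numbers that should fill spots 1, 2, 3
    """
    top = recommended[:window]
    scores = []
    for spot, sentence in enumerate(expected, start=1):
        if sentence in top:
            actual = top.index(sentence) + 1
            scores.append(1.0 / (1 + abs(actual - spot)))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)

# Sentence 2 recommended first, sentence 1 second, sentence 3 third:
# (1/2 + 1/2 + 1) / 3 = 0.667
print(modified_mrr([2, 1, 3]))
```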
Rating (0-5)
Rating is a heavily mutated form of "Precision@3". It tries to model how a user reacts to a UI that shows snippets of the top three recommended sentences. 0 means no interest, 1 means at least one sentence placed (the Non-Zero measure), and 2-5 measure how well the first three recommendations match the first and second spots. In detail (a code sketch follows the list):
5: first and second results are lede and subordinate
4: first result is lede
3: second result is lede
2: first or second are within first three sentences
1: third result is within first three sentences
0: anything else
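As a sketch, this is how I read the scale, assuming recommendations and article sentences are both numbered from 1 and that "lede" and "subordinate" mean the article's first and second sentences (the function name is mine):

```python
def rating(recommended):
    """Map the top three recommended sentence numbers to the 0-5 scale.

    recommended: the first three sentence numbers the algorithm picked,
    e.g. [4, 3, 6]. Sentence 1 is the lede, sentence 2 the subordinate.
    """
    first, second, third = (list(recommended) + [None, None, None])[:3]
    if first == 1 and second == 2:
        return 5      # lede and subordinate fill the first two spots
    if first == 1:
        return 4      # lede is the first result
    if second == 1:
        return 3      # lede is the second result
    if first in (1, 2, 3) or second in (1, 2, 3):
        return 2      # first or second result is within the first three sentences
    if third in (1, 2, 3):
        return 1      # only the third result is within the first three sentences
    return 0          # anything else

print(rating([1, 2, 7]))   # 5
print(rating([5, 8, 2]))   # 1
```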
MRR and Rating (green and yellow) correlate very consistently in the graphic in Post #3. Rating tracks MRR, but is more extreme; note the wider standard deviation. This suggests the Rating formula is a good statistic for modeling unforgiving users.
Sentence Length
This measures the mean sentence length of the top two recommended sentences, where sentence length is the number of nouns and verbs in the sentence, not the number of words. It indicates how well the algorithm compensates for the dominance of longer sentences.
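The post does not say which part-of-speech tagger is used, so this is only a sketch with NLTK's default tagger (requires the 'punkt' and 'averaged_perceptron_tagger' data):

```python
import nltk

def sentence_length(sentence):
    """Sentence length as the count of noun and verb tokens, not total words."""
    tags = nltk.pos_tag(nltk.word_tokenize(sentence))   # Penn Treebank tags
    return sum(1 for _, tag in tags if tag.startswith(("NN", "VB")))

def mean_top_two_length(top_two_sentences):
    """Mean length of the two highest-ranked sentences."""
    lengths = [sentence_length(s) for s in top_two_sentences]
    return sum(lengths) / len(lengths)
```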
Non-Zero
The percentage of results that placed the first, second, or third sentence of the article in one of the three top spots.
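A sketch of that percentage, counting a result as a hit when any of its top three recommendations is one of the article's first three sentences (names are mine):

```python
def non_zero(results):
    """Percentage of results with at least one of sentences 1-3 in the top three spots.

    results: one recommendation list per article, e.g. [[4, 3, 6], [1, 5, 9]]
    """
    hits = sum(1 for top_three in results
               if any(s in (1, 2, 3) for s in top_three[:3]))
    return 100.0 * hits / len(results)
```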
The mean article length in the corpus is (I think) 22 sentences. A Non-Zero mean of 60% when picking 3 sentences out of 22 is much better than random recommendations.
Precision vs. Recall
In Information Retrieval jargon, precision is the accuracy of a ranking algorithm and recall is its ability to find the relevant results. The three success measurements are precision measures; the "dartboard" measure is a recall measure.
From reading the actual sentences and recommendations, binary+normal and augNorm+normal had pretty good precision. These two also achieved the best recall, at around 65%. That level would not be useful on its own in a document summarization UI; I would pair it with a looser strategy that finds related sentences by searching with the theme words.
Previous Example
In the example in Post #2, the three top-rated sentences were 4, 3, and 6. Since only one of three made the cut, the rating algorithm gave this a three. Note that the example was not processed with Parts-of-Speech removal and used the binary algorithm, and still hit the dartboard. This article is the first in the dataset, and was chosen effectively at random.