Came across these 12-year-old stats recently:
Bad news I think…
1. If two groups of people construct thesauri in a particular subject area, the overlap of index terms will only be 60%.
2. Two indexers using the same thesaurus on the same document use common index terms in only 30% of cases.
3. The output from two experienced database searchers has only 40% overlap.
4. Experts’ judgements of relevance concur in only 60% of cases.
[Source: J. A. A. Sillince (1992), "Literature searching with unclear objectives: a new approach using argumentation", On-line Review, 16(6), 391–409]
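To make the overlap figures concrete, here's a minimal sketch of one way to measure term overlap between two indexers. The Jaccard measure and the example term sets are my assumptions for illustration; Sillince doesn't specify exactly how overlap was computed.

```python
def jaccard_overlap(terms_a, terms_b):
    """Fraction of index terms shared between two term sets
    (size of intersection over size of union)."""
    a, b = set(terms_a), set(terms_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# Two hypothetical indexers tagging the same document
indexer_1 = {"argumentation", "information retrieval", "search", "relevance"}
indexer_2 = {"argumentation", "search strategy", "search", "indexing", "relevance"}

print(jaccard_overlap(indexer_1, indexer_2))  # 3 shared terms of 6 distinct = 0.5
```

Even this toy pair of indexers, who agree on the core concepts, lands at only 50% overlap, which makes the reported 30% figure feel plausible.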
I think that just goes to show that the quality and knowledge of your indexers (human or otherwise) is incredibly important.
For me, the stat that really tells a tale is the 60% relevance overlap. However good and however knowledgeable the indexers are, they're going to have different takes on things: left-wing versus right-wing, and so on. Those takes can be influenced by any number of soft factors and tacit prejudices. In hard-copy media, these differences seem to be negotiated through argument. What we're interested in is a) the facts, and critically b) various pundits' opinions on and interpretations of those facts.
The tantalizing link is that social networks, clusters, etc. may offer a useful extra level of classification. If we can categorize pundits, then we can categorize opinions, and so navigate a better understanding of the facts being presented to us.
I agree that the editorial bent of the indexer is important. Indexers have huge influence on the language used by the organization, which I would expect to have subtle but strong effects.
My one concern with a social network approach to indexing is that it can be hard enough to get the resources/time for one person to do it, let alone a network of them rating each other. Although, I suppose that is essentially what the Google search algorithm does in part.
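The comparison to Google is apt: PageRank effectively lets a network of pages "rate" one another through links, with no human indexer in the loop. Here's a toy sketch of that idea, applied to a hypothetical network of indexers citing each other. The graph, names, and parameter values are all made up for illustration; this is the textbook power-iteration form, not Google's actual implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank over {node: [nodes it links to]}.
    Each node's score is redistributed along its outbound links every round."""
    nodes = list(links)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in nodes}
        for p, outs in links.items():
            if not outs:
                # Dangling node: spread its rank evenly across everyone
                for q in nodes:
                    new[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# Hypothetical mini-network: who cites whom
graph = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": ["alice"]}
ranks = pagerank(graph)
print(ranks)  # alice, cited by both others, ends up ranked highest
```

The appeal for indexing is exactly the resource problem mentioned above: the "rating" falls out of links people were creating anyway, rather than requiring a second layer of dedicated human effort.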
In any case, thanks much for posting the stats in the first place!
An example of a working social-network approach to indexing can currently be found in libraries, in the bibliographic records generated by catalogers. There isn't explicit rating per se, but the vast majority of libraries participate in one bibliographic utility or another, and within that utility, catalogers (with the appropriate level of access) can modify and improve records generated earlier by other catalogers.
It's an imperfect system, but an interesting model to look at, I think. Catalogers (are supposed to) follow strict sets of rules and the same essential thesaurus (an authority file of index terms), though per your listed stat #2, that might not help as much as you'd hope. And anyway, the actual relevance of those index terms to anyone but other librarians probably isn't much… 🙂
These results don’t seem right to me.
1. Disagreement in thesaurus terms. This is probably true, but I expect the underlying concepts are very similar; it is normal for the terms themselves to differ. This is "the vocabulary problem".
2. Indexer agreement. This result is the opposite of studies showing 85% agreement between professional indexers using the MEDLINE thesaurus. MEDLINE is very detailed, which makes exact agreement even harder and makes the 85% figure all the more impressive.
3. Search result overlap. This does not match results from TREC, where different engines with different searchers get fairly similar results.
4. Disagreement in relevance judgments. This is totally at odds with studies of TREC judges, which have extremely high agreement.
It’s good to have the balance of Walter’s comments, but whether the accuracy of indexing and retrieval is at 30%, 60%, or 85%, we’re still not addressing the business problems directly:
(1) The lack of precision of retrieval is incredibly costly, even if precision is 95%. Errors of fact and errors of omission creep in … and proliferate. Human indexing by itself will never solve that problem, no matter how good the indexer is.
(2) There is no assurance that the information retrieved and selected is trustworthy. (Maybe Medline is an exception to that generalization.)
(3) The knowledge generated in the processes of indexing and retrieval is never integrated into an organizational resource.
I'm hoping that the Semantic Web and several other technologies and standards will help solve some of these problems. See, especially, Paul Ford's superbly written piece, "A Response to Clay Shirky's 'The Semantic Web, Syllogism, and Worldview'": http://www.ftrain.com/ContraShirky.html