[Sigcr-l] Exhaustivity and specifity of indexing

Andrew Grove Andrew.Grove at microsoft.com
Fri Sep 22 19:42:01 EDT 2006


I'm finding similar data in studies of very large search logs.  1/3 of search strings are unique, another 1/3 are used 2 or 3 times.

Andrew

-----Original Message-----
From: sigcr-l-bounces at asis.org [mailto:sigcr-l-bounces at asis.org] On Behalf Of Barbara Kwasnik
Sent: Friday, September 22, 2006 3:07 PM
To: sigcr-l at asis.org; BH at db.dk; shumphrey at mail.nih.gov; Andrew Grove; humphrey at nlm.nih.gov; L.Will at willpowerinfo.co.uk
Subject: Re: [Sigcr-l] Exhaustivity and specifity of indexing

Thanks for the interesting study, Susanne.

I quote your results:
"20 of these [49 terms] were used by only one indexer
11 were used by two indexers
8 were used by three indexers
4 were used by four indexers
3 were used by five indexers
1 was used by six indexers; this term was a checktag
7 were used by all seven indexers; 5 of these terms were checktags

The upshot of it is that the seven indexers agreed on only two of 49 substantive indexing terms, and at the other end, 20 of 49 substantive indexing terms were used uniquely."

This is in perfect accord with my dissertation work, in which only one term for office documents ("books") was used by all 8 of my participants, and the distribution was more or less like yours for the rest, with the vast majority of names offered for documents being unique. Susan Dumais and others had similar results in their article "The Vocabulary Problem..." (something like that, circa 1985). They found that the chance of two people coming up with the same term was only 1 in 5. So, I know your study looked at how they eliminated terms, rather than came up with terms, but the results don't surprise me at all.
Barbara



>>> Susanne M Humphrey <shumphrey at mail.nih.gov> 09/21/06 11:30 PM >>>
Note:  I originally replied only to Andrew, selecting Reply rather than Reply to All, by mistake.  So this is a correction that sends the reply to all.  I hope it's easy enough for people who don't care about this much detail to ignore this.  I just couldn't resist communicating this experiment to such a large audience, since I'm sorry I never got to publish it.
If the notion of idea-based indexing being of paramount importance and the methodology of having participants cross off bad terms rather than indexing the document aren't that novel (probably the former has been written about but I don't know where offhand), I'd appreciate knowing this.
smh

Andrew,

Thanks for your interest in my experiment.  Below (under PROJECT DESCRIPTION) is a copy of an e-mail I sent to a colleague describing my experiment.  The intention was to advocate the notion of idea-based indexing, and specifically to suggest that my knowledge-based computer-assisted indexing system, MedIndEx, would promote idea-based indexing by virtue of having indexers fill out frames.  There was an outlier, indexer 01, who missed 5 of the 10 ideas.
Below that (under TERMS CATEGORIZED BY 10 IDEAS) is a copy of a file that breaks the document down into the ideas (e.g., primary problem).  Under each idea is a table of:

number of indexers keeping the term, the term itself, and indexer IDs of indexers keeping the term.

For each idea, I also designated indexers who did not cover the idea at all, and indexers with the "best coverage" of the idea, meaning the indexer who used the most terms.

At the end, I summarized "best coverage", "not covered", and ranked indexers from best coverage to least coverage, including the number of terms each indexer used.  In general, the better the coverage, the more terms used.  But there was an exception.  Indexer 02 had better coverage than indexer 05, but 02 used fewer terms.

Over the years, I submitted the experiment to a "call for projects" to be performed by NLM Library Associates (interns), but nobody wanted to do it, so I gave up.

Also, try as I might, there were still some terms I hadn't thought of in the exhaustive indexing.  As part of one of the instructions, I invited participants to submit additional indexing terms that weren't on the exhaustive list.  I probably should have gotten an indexer or two (non-participant of course) to help me with the exhaustive list.  These were the five terms:

5 Hodgkin's Disease/THERAPY 01 02 05 06 07
3 Neoplasms/THERAPY 02 05 07
1 Graft vs Host Disease/IMMUNOLOGY 05
1 Hematologic Diseases/THERAPY 07
1 HLA Antigens/ANALYSIS 05

I think missing the terms with subheading THERAPY, suggested by five of the indexers, is kind of serious.  The DRUG THERAPY subheading used by the other two indexers follows
_______________________________________________
Sigcr-l mailing list
Sigcr-l at asis.org
http://mail.asis.org/mailman/listinfo/sigcr-l