[Sigcr-l] Exhaustivity and specificity of indexing

Barbara Kwasnik bkwasnik at syr.edu
Fri Sep 22 18:07:03 EDT 2006


Thanks for the interesting study, Susanne.

I quote your results: 
"20 of these [49 terms] were used by only one indexer
11 were used by two indexers
8 were used by three indexers
4 were used by four indexers
3 were used by five indexers
1 was used by six indexers; this term was a checktag
7 were used by all seven indexers; 5 of these terms were checktags

The upshot of it is that the seven indexers agreed on only two of 49
substantive indexing terms, and at the other end, 20 of 49 substantive
indexing terms were used uniquely."
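For concreteness, a minimal Python sketch of how such a distribution is tallied (the indexer term sets below are hypothetical, not the actual study data): for each term, count how many indexers kept it, then bucket the terms by that count.

from collections import Counter

# Hypothetical example: each indexer's set of retained terms.
indexer_terms = {
    "01": {"Hodgkin Disease", "Bone Marrow Transplantation", "Human"},
    "02": {"Hodgkin Disease", "Transplantation, Autologous", "Human"},
    "03": {"Hodgkin Disease", "Human", "Adult"},
}

# For each term, count how many indexers kept it.
term_counts = Counter(t for terms in indexer_terms.values() for t in terms)

# Distribution: how many terms were kept by exactly k indexers.
distribution = Counter(term_counts.values())
for k in sorted(distribution):
    print(f"{distribution[k]} term(s) used by {k} indexer(s)")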

Your distribution is in perfect accord with my dissertation work, in which only one term for office documents ("books") was used by all 8 of my participants, and the distribution was more or less like yours for the rest, with the vast majority of names offered for documents being unique. Susan Dumais and others had similar results in their article "The Vocabulary Problem..." (something like that, circa 1985). They found that the chances of two people coming up with the same term were only 1 in 5. So, I know your study looked at how indexers eliminated terms rather than came up with terms, but the results don't surprise me at all.

Barbara



>>> Susanne M Humphrey <shumphrey at mail.nih.gov> 09/21/06 11:30 PM >>>
Note: I originally replied only to Andrew, selecting Reply rather than Reply to All by mistake, so this is a correction that sends the reply to all. I hope it's easy enough for people who don't care about this much detail to ignore. I just couldn't resist communicating this experiment to such a large audience, since I'm sorry I never got to publish it. If the notion of idea-based indexing being of paramount importance, or the methodology of having participants cross off bad terms rather than index the document, isn't that novel (the former has probably been written about, but I don't know where offhand), I'd appreciate knowing.

smh

Andrew,

Thanks for your interest in my experiment. Below (under PROJECT DESCRIPTION) is a copy of an e-mail I sent to a colleague describing the experiment. The intention was to advocate the notion of idea-based indexing, and specifically to suggest that my knowledge-based computer-assisted indexing system, MedIndEx, would promote idea-based indexing by virtue of having indexers fill out frames. There was an outlier, indexer 01, who missed 5 of the 10 ideas.
Below that (under TERMS CATEGORIZED BY 10 IDEAS) is a copy of a file that breaks the document down into the ideas (e.g., primary problem). Under each idea is a table of: the number of indexers keeping the term, the term itself, and the IDs of the indexers keeping the term.

For each idea, I also designated the indexers who did not cover the idea at all, and the indexers with the "best coverage" of the idea, meaning those who used the most terms for it.

At the end, I summarized "best coverage" and "not covered", and ranked the indexers from best coverage to least coverage, including the number of terms each indexer used. In general, the better the coverage, the more terms. But there was an exception: indexer 02 had better coverage than indexer 05, but 02 used fewer terms.
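To make that bookkeeping concrete, here is a minimal Python sketch (the idea-to-term-to-indexer mapping is hypothetical, not the actual data) of how per-idea coverage and the indexer ranking can be computed:

from collections import defaultdict

# Hypothetical mapping: idea -> {term: set of indexer IDs who kept the term}.
ideas = {
    "primary problem": {"Hodgkin Disease": {"01", "02", "05"},
                        "Neoplasms": {"02"}},
    "treatment": {"Bone Marrow Transplantation": {"02", "05", "07"},
                  "Transplantation, Autologous": {"05"}},
}

all_indexers = {"01", "02", "05", "07"}
ideas_covered = defaultdict(set)   # indexer -> ideas the indexer covered
terms_used = defaultdict(int)      # indexer -> total terms kept

for idea, terms in ideas.items():
    per_idea = defaultdict(int)    # terms each indexer kept for this idea
    for term, who in terms.items():
        for indexer in who:
            per_idea[indexer] += 1
            ideas_covered[indexer].add(idea)
            terms_used[indexer] += 1
    best = max(per_idea.values())
    best_coverage = sorted(i for i, n in per_idea.items() if n == best)
    not_covered = sorted(all_indexers - set(per_idea))
    print(f"{idea}: best coverage {best_coverage}; not covered {not_covered}")

# Rank indexers from most ideas covered to fewest, noting terms used by each.
for i in sorted(all_indexers, key=lambda x: -len(ideas_covered[x])):
    print(f"indexer {i}: {len(ideas_covered[i])} ideas covered, {terms_used[i]} terms used")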

Early on, and over the years, I submitted the experiment to a "call for projects" to be performed by NLM Library Associates (interns), but nobody wanted to do it, so I gave up.

Also, try as I might, there were still some terms I hadn't thought of in the exhaustive indexing. As part of one of the instructions, I invited participants to submit additional indexing terms that weren't on the exhaustive list. I probably should have gotten an indexer or two (non-participants, of course) to help me with the exhaustive list. These were the five terms they submitted:

5 Hodgkin's Disease/THERAPY 01 02 05 06 07
3 Neoplasms/THERAPY 02 05 07
1 Graft vs Host Disease/IMMUNOLOGY 05
1 Hematologic Diseases/THERAPY 07
1 HLA Antigens/ANALYSIS 05

I think missing the terms with the subheading THERAPY, suggested by five of the indexers, is kind of serious. The DRUG THERAPY subheading used by the other two indexers follows.


