[Sigcr-l] Exhaustivity and specifity of indexing

Susanne M Humphrey shumphrey at mail.nih.gov
Thu Sep 21 23:30:53 EDT 2006


Note:  I originally replied only to Andrew, selecting Reply rather than
Reply to All, by mistake.  So this is a correction that sends the reply
to all.  I hope it's easy enough for people who don't care about this much
detail to ignore this.  I just couldn't resist communicating this experiment
to such a large audience, since I'm sorry I never got to publish it.
If the notion of idea-based indexing being of paramount importance and
the methodology of having participants cross off bad terms rather than
indexing the document isn't that novel (probably the former has been
written about but I don't know where offhand), I'd appreciate knowing this.
smh

Andrew,

Thanks for your interest in my experiment.  Below (under PROJECT DESCRIPTION)
is a copy of an e-mail I sent to a colleague describing my experiment.  The
intention was to advocate the notion of idea-based indexing, and specifically
to suggest that my knowledge-based computer-assisted indexing system,
MedIndEx, would promote idea-based indexing by virtue of having indexers fill
out frames.  There was an outlier, indexer 01, who missed 5 of the 10 ideas.
Below that (under TERMS CATEGORIZED BY 10 IDEAS) is a copy of a file that
breaks the document down into the ideas (e.g., primary problem).  Under
each idea is a table of:

number of indexers keeping the term, the term itself, and indexer IDs of
indexers keeping the term.

For each idea, I also designated indexers who did not cover the idea at all,
and the indexer with the "best coverage" of the idea, meaning the indexer who
used the most terms.
At the end, I summarized "best coverage", "not covered", and ranked indexers
from best coverage to least coverage, including the number of terms each
indexer used.  In general, the better the coverage, the more terms used.  But
there was an exception.  Indexer 02 had better coverage than indexer 05, but
02 used fewer terms.

Over the years, I submitted the experiment to a "call for projects" to be
performed by NLM Library Associates (interns), but nobody wanted to do it, so
I gave up.

Also, try as I might, there were still some terms I hadn't thought of in
the exhaustive indexing.  As part of one of the instructions, I invited
participants to submit additional indexing terms that weren't on the
exhaustive list, and they submitted the five terms below.  I probably should
have gotten an indexer or two (non-participants, of course) to help me with
the exhaustive list.  These were the five terms:

5 Hodgkin's Disease/THERAPY 01 02 05 06 07
3 Neoplasms/THERAPY 02 05 07
1 Graft vs Host Disease/IMMUNOLOGY 05
1 Hematologic Diseases/THERAPY 07
1 HLA Antigens/ANALYSIS 05

I think missing the terms with subheading THERAPY, suggested by five of
the indexers, is kind of serious.  The DRUG THERAPY subheading used by
the other two indexers follows the specificity rule better, but I should
have anticipated that THERAPY is reasonable if trying to think of terms
exhaustively.
Don't know if it's serious enough to invalidate the study, however.

I also have a table according to the number of indexers using terms, e.g.
(the first group is the terms all 7 indexers used, then the terms 6 indexers
used, then the terms 5 indexers used, then the terms 4 indexers used, ...):


7 Adult 01 02 03 04 05 06 07
7 Blood Transfusion/ADVERSE EFFECTS 01 02 03 04 05 06 07
7 Case Report 01 02 03 04 05 06 07
7 Female 01 02 03 04 05 06 07
7 Graft vs Host Disease/ETIOLOGY 01 02 03 04 05 06 07
7 Human 01 02 03 04 05 06 07
7 Male 01 02 03 04 05 06 07

6 Adolescence 01 02 03 05 06 07

5 Combined Modality Therapy 02 03 05 06 07
5 Erythema/ETIOLOGY 02 03 04 05 06
5 Hodgkin's Disease/THERAPY 01 02 05 06 07

4 Blood Platelets/TRANSPLANTATION 03 05 06 07
4 Erythema/PATHOLOGY 03 04 05 06
4 Graft vs Host Disease/PATHOLOGY 03 05 06 07
4 Hodgkin's Disease/IMMUNOLOGY 02 03 05 07

etc.

As expected, there was good agreement on check tags; still, indexer 04 missed
the age group (Adolescence).

And I have the instructions given to the indexers.  There are basically
4 instructions (read the article, how to consider each term, how to modify the
exhaustive list, and where to return the list with modifications).

The head of NLM's Index Section at that time recruited the indexers, and they
were coded from 01 - 07.  I never knew who they were.

I recruited the 7 searchers, some of whom were MDs at NLM and perhaps also in
the field of medical informatics, but I don't remember who
they were or their exact characteristics.  There may be a folder in my
office that records this, but I don't know where to begin to look.  I should
have kept track in a computer file, and figured out how to encrypt their
names or something.

I have similar files where the test subjects were searchers.  For example,
here's the beginning of TERMS CATEGORIZED BY 10 IDEAS for the searchers:

primary problem

5 Hematologic Diseases/COMPLICATIONS 02 03 05 06 07
3 Hematologic Diseases/DRUG THERAPY 03 05 06
5 Hematologic Diseases/IMMUNOLOGY 02 03 05 06 07
6 Hodgkin's Disease/COMPLICATIONS 01 02 03 05 06 07
5 Hodgkin's Disease/DRUG THERAPY 01 03 04 05 06
5 Hodgkin's Disease/IMMUNOLOGY 01 02 03 05 06
6 Leukemia/COMPLICATIONS 01 02 03 05 06 07
4 Leukemia/DRUG THERAPY 03 04 05 06
4 Leukemia/IMMUNOLOGY 02 03 05 06
6 Lymphoma/COMPLICATIONS 01 02 03 05 06 07
4 Lymphoma/DRUG THERAPY 03 04 05 06
4 Lymphoma/IMMUNOLOGY 02 03 05 06
6 Neoplasms/COMPLICATIONS 01 02 03 05 06 07
3 Neoplasms/DRUG THERAPY 03 05 06
3 Neoplasms/IMMUNOLOGY 03 05 06
most comprehensive 03 05 06
least comprehensive 01

Note there are no 0's in the first column, which means none of the terms
for the primary problem was crossed off by searchers.
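
The rows in these tables all follow the same layout (count, term, indexer or
searcher IDs), so they can be checked mechanically.  A minimal sketch in
Python, assuming that layout; the function name parse_row and the variable
names are illustrative, not part of the original files, and only three of the
rows above are used as sample input:

```python
import re

def parse_row(line):
    """Split a table row into (count, term, ids).

    Assumed layout: '<count> <term> <id> <id> ...', where each ID is two
    digits.  A trailing '*' marker (used in some indexer rows; its meaning
    is not stated in this message) is ignored.
    """
    tokens = line.split()
    if tokens and tokens[-1] == "*":
        tokens.pop()
    count = int(tokens[0])
    ids = []
    # Peel two-digit IDs off the end; whatever remains is the term.
    while len(tokens) > 1 and re.fullmatch(r"\d{2}", tokens[-1]):
        ids.append(tokens.pop())
    ids.reverse()
    term = " ".join(tokens[1:])
    return count, term, ids

rows = [
    "5 Hematologic Diseases/COMPLICATIONS 02 03 05 06 07",
    "6 Hodgkin's Disease/COMPLICATIONS 01 02 03 05 06 07",
    "3 Neoplasms/DRUG THERAPY 03 05 06",
]
parsed = [parse_row(r) for r in rows]

# The first column should equal the number of IDs listed on the row.
mismatches = [term for count, term, ids in parsed if count != len(ids)]

# Terms with a count of 0 were crossed off by every participant.
crossed_off_by_all = [term for count, term, _ in parsed if count == 0]
print(mismatches, crossed_off_by_all)
```

Running the same check over a full table would flag any row whose count and
ID list disagree.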


I wouldn't mind teaming up with someone of high-caliber who might like to
co-author a paper, and probably do most of the rest of the work based on what
I have done so far.  I don't have time to collaborate with a junior person
alone, or, so far, to bring it to publication by myself.
I don't realistically expect to find somebody.

A limitation of the study is, of course, that this is just one article.
But I think it does bring out some original notions: breaking the document
down into generic ideas and distributing the terms among the ideas, and also
the methodology of having participants cross off terms they don't like, rather
than having them all index the document de novo, which would be much more work
for them.

This might be more than you wanted to know.  I understand if you don't
really care about much of this.

--Susanne


PROJECT DESCRIPTION

This was an unpublished study I did a few years ago.

Briefly, what I did was index a particular document ridiculously
exhaustively.  Then I gave the document and my list of indexing terms to
two groups of test subjects:  7 indexers and 7 users.  Instructions for
both groups were to cross off the inappropriate indexing terms.  For
indexers, the criterion was normal indexing policy.  For users, the
criterion was, considering the indexing term to be a MEDLINE query, if
the test document were retrieved in response to this query, would you
view it as relevant (I encouraged a comprehensive view of relevance).


I did some work in tabulating and analyzing the results.  From this
work, I concluded that the most meaningful way to measure indexing
quality was in terms of agreement on coverage of ideas in the document
rather than agreement on indexing terms.

Here's the agreement measured in terms of actual indexing term:

54 different indexing terms were used; 5 of these were checktags

20 of these were used by only one indexer
11 were used by two indexers
8 were used by three indexers
4 were used by four indexers
3 were used by five indexers
1 was used by six indexers; this term was a checktag
7 were used by all seven indexers; 5 of these terms were checktags

The upshot of it is that the seven indexers agreed on only two of 49
substantive indexing terms, and at the other end, 20 of 49 substantive
indexing terms were used uniquely.
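
A tally like the one above can be produced directly from the per-term lists.
A minimal sketch in Python, using only the 15 terms listed elsewhere in this
message as kept by four or more indexers (not the full set of 54); the names
kept_by and agreement are illustrative:

```python
from collections import Counter

# Excerpt only: terms kept by four or more indexers, as listed in this message.
kept_by = {
    "Adult": "01 02 03 04 05 06 07",
    "Blood Transfusion/ADVERSE EFFECTS": "01 02 03 04 05 06 07",
    "Case Report": "01 02 03 04 05 06 07",
    "Female": "01 02 03 04 05 06 07",
    "Graft vs Host Disease/ETIOLOGY": "01 02 03 04 05 06 07",
    "Human": "01 02 03 04 05 06 07",
    "Male": "01 02 03 04 05 06 07",
    "Adolescence": "01 02 03 05 06 07",
    "Combined Modality Therapy": "02 03 05 06 07",
    "Erythema/ETIOLOGY": "02 03 04 05 06",
    "Hodgkin's Disease/THERAPY": "01 02 05 06 07",
    "Blood Platelets/TRANSPLANTATION": "03 05 06 07",
    "Erythema/PATHOLOGY": "03 04 05 06",
    "Graft vs Host Disease/PATHOLOGY": "03 05 06 07",
    "Hodgkin's Disease/IMMUNOLOGY": "02 03 05 07",
}

# Count the indexers per term, then count the terms at each agreement level.
agreement = Counter(len(ids.split()) for ids in kept_by.values())
for n in sorted(agreement, reverse=True):
    print(f"{agreement[n]} term(s) kept by {n} indexers")
```

For this excerpt the counts at the five-, six- and seven-indexer levels match
the tally above (3, 1 and 7 terms respectively).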

Then I analyzed the document into 10 ideas, and I distributed the 49
substantive (non-checktag) indexing terms that were used into these idea
categories.  In other words, I wasn't that concerned about the indexing
term used for expressing the idea, but just that the idea was somehow
covered.

Here are the results:

10 different ideas

1 of the ideas was covered by three indexers
1 of the ideas was covered by four indexers
1 of the ideas was covered by five indexers
2 of the ideas were covered by six indexers
5 of the ideas were covered by all seven indexers

If you remove the indexer who missed five ideas:

1 of the ideas was covered by three indexers
1 of the ideas was covered by four indexers
1 of the ideas was covered by five indexers
7 of the ideas were covered by all six remaining indexers

This doesn't seem too bad - all indexers covered 7 of the 10 ideas in
the document.  But actually, missing an idea can be quite detrimental to
retrieval, as it implies that no matter how good a search strategy you
employ, you can't retrieve the document in response to a request for
information on that idea.

But the actual result is that all indexers covered only half the ideas.

The way I look at it is that a retrieval system can compensate for
different term use by indexers, but nothing can be done when an idea is
missed.
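
The idea-level figures can be recomputed from the "not covered" lines in the
TERMS CATEGORIZED BY 10 IDEAS file.  A minimal sketch in Python, with those
lines transcribed by hand; the names not_covered and covered_by_all are
illustrative:

```python
from collections import Counter

INDEXERS = {"01", "02", "03", "04", "05", "06", "07"}

# Indexers who did not cover each idea, transcribed from the "not covered"
# lines; an empty set means every indexer covered the idea.
not_covered = {
    "primary problem": set(),
    "treatment of primary problem": set(),
    "secondary problem (general)": {"01"},
    "secondary problem (specific)": {"01", "05", "06", "07"},
    "treatment of secondary problem": set(),
    "tertiary problem (general)": set(),
    "tertiary problem (specific - immunology)": {"01", "06"},
    "tertiary problem (specific - manifestation)": set(),
    "diagnosis of tertiary problem": {"01"},
    "treatment of tertiary problem": {"01", "05", "06"},
}

def covered_by_all(indexers):
    """Ideas that none of the given indexers missed."""
    return [idea for idea, missed in not_covered.items()
            if not (missed & indexers)]

# Number of ideas at each coverage level (how many indexers covered the idea).
distribution = Counter(len(INDEXERS - missed) for missed in not_covered.values())

print(dict(distribution))
print(len(covered_by_all(INDEXERS)))           # all seven indexers
print(len(covered_by_all(INDEXERS - {"01"})))  # leaving out indexer 01
```

With all seven indexers, five ideas are covered by everyone; leaving out
indexer 01 raises that to seven.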

What I am after in my MedIndEx system is to avoid missing the indexing
of ideas.  According to this study (limited as it may be, i.e., a single
document):

3 of 7 indexers missed none of the ideas.
4 of 7 indexers missed at least 10% of the ideas.
3 of 7 indexers missed at least 20% of the ideas.
2 of 7 indexers missed at least 30% of the ideas.
1 of 7 indexers missed at least half the ideas.

I think we can do better than 43% of the indexers missing 20-50% of the
ideas in a document.

I may be too idealistic, but I think that 80% of indexers shouldn't miss
more than 10% of the ideas.
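
The threshold figures above, and the 80%/10% target, can be checked against
the per-indexer counts of missed ideas.  A minimal sketch in Python, with the
counts taken from the "not covered, by indexer ID" summary (indexers not
listed there missed nothing); the function and variable names are
illustrative:

```python
TOTAL_IDEAS = 10

# Number of ideas each indexer missed.
missed = {"01": 5, "02": 0, "03": 0, "04": 0, "05": 2, "06": 3, "07": 1}

def indexers_missing_at_least(percent):
    """Count indexers who missed at least `percent` percent of the ideas."""
    # Integer arithmetic avoids floating-point comparisons.
    return sum(1 for m in missed.values() if m * 100 >= percent * TOTAL_IDEAS)

for percent in (10, 20, 30, 50):
    n = indexers_missing_at_least(percent)
    print(f"{n} of {len(missed)} indexers missed at least {percent}% of the ideas")

# Target: 80% of indexers miss no more than 10% of the ideas.
within_target = sum(1 for m in missed.values() if m * 100 <= 10 * TOTAL_IDEAS)
share = within_target / len(missed)
print(f"{within_target} of {len(missed)} ({share:.0%}) missed no more than 10%")
```

On this single document, 4 of 7 indexers (57%) missed no more than 10% of the
ideas, short of the 80% target.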


TERMS CATEGORIZED BY 10 IDEAS

Under each idea:
no. of indexers keeping the term, the term, indexer ID

primary problem

0 Hematologic Diseases/COMPLICATIONS
0 Hematologic Diseases/DRUG THERAPY
2 Hematologic Diseases/IMMUNOLOGY 02 07
1 Hematologic Diseases/THERAPY 07
3 Hodgkin's Disease/COMPLICATIONS 03 04 06 *
2 Hodgkin's Disease/DRUG THERAPY 03 04 *
4 Hodgkin's Disease/IMMUNOLOGY 02 03 05 07 *
5 Hodgkin's Disease/THERAPY 01 02 05 06 07 *
0 Leukemia/COMPLICATIONS
0 Leukemia/DRUG THERAPY
0 Leukemia/IMMUNOLOGY
0 Lymphoma/COMPLICATIONS
0 Lymphoma/DRUG THERAPY
0 Lymphoma/IMMUNOLOGY
2 Neoplasms/COMPLICATIONS 04 06
1 Neoplasms/DRUG THERAPY 04
3 Neoplasms/IMMUNOLOGY 02 05 07
3 Neoplasms/THERAPY 02 05 07
best coverage 07

treatment of primary problem

0 Antineoplastic Agents/ADVERSE EFFECTS
2 Antineoplastic Agents, Combined/ADVERSE EFFECTS 03 04 *
5 Combined Modality Therapy 02 03 05 06 07
0 Hematologic Diseases/DRUG THERAPY
1 Hematologic Diseases/THERAPY 07
2 Hodgkin's Disease/DRUG THERAPY 03 04 *
5 Hodgkin's Disease/THERAPY 01 02 05 06 07 *
0 Leukemia/DRUG THERAPY
0 Lymphoma/DRUG THERAPY
1 Neoplasms/DRUG THERAPY 04
3 Neoplasms/THERAPY 02 05 07
best coverage 07

secondary problem (general)

0 Antineoplastic Agents/ADVERSE EFFECTS
2 Antineoplastic Agents, Combined/ADVERSE EFFECTS 03 04
0 Drug Therapy/ADVERSE EFFECTS
0 Hematologic Diseases/COMPLICATIONS
2 Hematologic Diseases/IMMUNOLOGY 02 07
3 Hodgkin's Disease/COMPLICATIONS 03 04 06 *
4 Hodgkin's Disease/IMMUNOLOGY 02 03 05 07 *
0 Leukemia/COMPLICATIONS
0 Leukemia/IMMUNOLOGY
0 Lymphoma/COMPLICATIONS
0 Lymphoma/IMMUNOLOGY
2 Neoplasms/COMPLICATIONS 04 06
3 Neoplasms/IMMUNOLOGY 02 05 07
0 Radiation Injuries
0 Radiotherapy/ADVERSE EFFECTS
1 not covered 01
best coverage 04

secondary problem (specific)

2 Anemia/CHEMICALLY INDUCED 03 04
1 Anemia/ETIOLOGY 02
3 Anemia/THERAPY 02 03 04
1 Thrombocytopenia/CHEMICALLY INDUCED 04
0 Thrombocytopenia/ETIOLOGY
1 Thrombocytopenia/THERAPY 03
4 not covered 01 05 06 07
best coverage 03

treatment of secondary problem

3 Anemia/THERAPY 02 03 04
4 Blood Platelets/TRANSPLANTATION 03 05 06 07
7 Blood Transfusion/ADVERSE EFFECTS 01 02 03 04 05 06 07
1 Blood Transfusion/METHODS 04
3 Erythrocytes/TRANSPLANTATION 03 05 07
1 Thrombocytopenia/THERAPY 03
best coverage 03

tertiary problem (general)

7 Blood Transfusion/ADVERSE EFFECTS 01 02 03 04 05 06 07
0 Graft vs Host Disease/CHEMICALLY INDUCED
3 Graft vs Host Disease/DIAGNOSIS 02 03 04
0 Graft vs Host Disease/DRUG THERAPY
7 Graft vs Host Disease/ETIOLOGY 01 02 03 04 05 06 07 *
1 Graft vs Host Disease/IMMUNOLOGY 05
4 Graft vs Host Disease/PATHOLOGY 03 05 06 07
2 Graft vs Host Disease/PREVENTION & CONTROL 02 04
1 Graft vs Host Disease/THERAPY 03
3 Immune Tolerance 02 03 07
0 Immunocompetence
0 Immunologic Deficiency Syndromes/CHEMICALLY INDUCED
1 Immunologic Deficiency Syndromes/ETIOLOGY 04
1 Immunologic Deficiency Syndromes/THERAPY 04
1 Lymphocytes/TRANSPLANTATION 07
0 T-Lymphocytes/TRANSPLANTATION
0 Mortality
2 Risk Factors 03 07
best coverage 07

tertiary problem (specific - immunology)

1 Graft vs Host Disease/IMMUNOLOGY 05
2 Hematologic Diseases/IMMUNOLOGY 02 07
0 Histocompatibility
1 Histocompatibility Testing 03
1 HLA Antigens/ANALYSIS 05
0 HLA-A1 Antigen
0 HLA-B8 Antigen
4 Hodgkin's Disease/IMMUNOLOGY 02 03 05 07 *
3 Immune Tolerance 02 03 07
0 Immunocompetence
0 Immunologic Deficiency Syndromes/CHEMICALLY INDUCED
1 Immunologic Deficiency Syndromes/ETIOLOGY 04
1 Immunologic Deficiency Syndromes/THERAPY 04
0 Leukemia/IMMUNOLOGY
0 Lymphoma/IMMUNOLOGY
3 Neoplasms/IMMUNOLOGY 02 05 07
2 not covered 01 06
best coverage 03

tertiary problem (specific - manifestation)

0 Dermatitis/CHEMICALLY INDUCED
0 Dermatitis/DIAGNOSIS
2 Dermatitis/ETIOLOGY 05 07 *
2 Dermatitis/PATHOLOGY 05 07
0 Erythema/CHEMICALLY INDUCED
1 Erythema/DIAGNOSIS 03
5 Erythema/ETIOLOGY 02 03 04 05 06 *
4 Erythema/PATHOLOGY 03 04 05 06
3 Skin/PATHOLOGY 03 04 05
0 Skin Diseases/CHEMICALLY INDUCED
0 Skin Diseases/DIAGNOSIS
2 Skin Diseases/ETIOLOGY 01 07 *
1 Skin Diseases/PATHOLOGY 07
0 Skin Manifestations
best coverage 05

diagnosis of tertiary problem

2 Biopsy 03 07
0 Dermatitis/DIAGNOSIS
2 Dermatitis/PATHOLOGY 05 07
1 Diagnosis, Differential 03
1 Erythema/DIAGNOSIS 03
4 Erythema/PATHOLOGY 03 04 05 06
3 Graft vs Host Disease/DIAGNOSIS 02 03 04
4 Graft vs Host Disease/PATHOLOGY 03 05 06 07
3 Skin/PATHOLOGY 03 04 05
0 Skin Diseases/DIAGNOSIS
1 Skin Diseases/PATHOLOGY 07
1 not covered 01
best coverage 03

treatment of tertiary problem

1 Blood Platelets/RADIATION EFFECTS 07
1 Blood Transfusion/METHODS 04
1 Erythrocytes/RADIATION EFFECTS 07
0 Graft vs Host Disease/DRUG THERAPY
2 Graft vs Host Disease/PREVENTION & CONTROL 02 04
1 Graft vs Host Disease/THERAPY 03
1 Immunologic Deficiency Syndromes/THERAPY 04
1 Lymphocytes/RADIATION EFFECTS 07
1 T-Lymphocytes/RADIATION EFFECTS 04
3 not covered 01 05 06
best coverage 07

best coverage, by indexer ID
03 xxxx (best coverage for four ideas)
04 x
05 x
07 xxxx

not covered, by indexer ID
01 xxxxx (five ideas not covered)
05 xx
06 xxx
07 x
ranking (most to least coverage), by indexer ID, e.g.:
indexer 03 kept 35 terms, had best coverage for four ideas, and omitted no
idea
03 - 35  best=4, omitted=0
07 - 30  best=4, omitted=1
04 - 30  best=1, omitted=0
02 - 19  best=0, omitted=0
05 - 27  best=1, omitted=2
06 - 14  best=0, omitted=3
01 - 6   best=0, omitted=5

>From: Andrew Grove <Andrew.Grove at microsoft.com>
>To: "Humphrey, Susanne (NIH/NLM/LHC) [E]" <humphrey at nlm.nih.gov>, Birger 
Hjørland <BH at db.dk>, Leonard Will <L.Will at willpowerinfo.co.uk>, 
"sigcr-l at asis.org" <sigcr-l at asis.org>
>Date: Wed, 20 Sep 2006 10:51:18 -0700
>Subject: Re: [Sigcr-l] Exhaustivity and specifity of indexing
>
>Susanne,
>
>Thank you for the excellent discussion illustrating interdependencies between 
thesauri, taxonomies, indexing, classification, etc. and retrieval systems.
>
>One thing I might add -- in many business situations, the "thesaurus" is 
dynamic and constantly growing as demands of the business require.  In those 
cases, it's no easy matter to rely on a stable, developed thesaurus, evaluating 
or using the indexing and retrieval systems in context: the context constantly 
changes.  The challenge then becomes one of completely understanding those 
systems and developing a thesaurus in their context, rather than vice-versa.
>
>A request - is there any possibility the experiment could be brushed up and 
published?  Failing that, a matrix of the results with 7 indexers would be a 
handy graphic to illustrate the complexities and interdependencies.  It's maybe 
more relevant than ever now, especially with ad hoc IR on the scene as the 
current magic bullet.
>
>Thanks,
>Andrew
>
>-----Original Message-----
>From: Humphrey, Susanne (NIH/NLM/LHC) [E] [mailto:humphrey at nlm.nih.gov]
>Sent: Thursday, September 07, 2006 4:20 PM
>To: Andrew Grove; Birger Hjørland; Leonard Will; sigcr-l at asis.org
>Subject: RE: [Sigcr-l] Exhaustivity and specifity of indexing
>
>Note:
>I am using Outlook Web Access from home, which I find awkward, so I am afraid 
this is being addressed to some individuals as well as the list serve.  I don't 
know how to fix this.
>I hope at least it does reach the list serve and not just the individuals.
>Barbara K., did you receive it?
>If not, I guess I need you to tell me how to send it to the list serve properly 
(Yes, I should know)
>
>
>
>Let me jump into this with a specific example from PubMed that I focused on in 
an unpublished experiment many years ago:
>
>PMID- 2221937
>OWN - NLM
>STAT- MEDLINE
>DA  - 19901115
>DCOM- 19901115
>LR  - 20051116
>PUBM- Print
>IS  - 0003-987X (Print)
>VI  - 126
>IP  - 10
>DP  - 1990 Oct
>TI  - Transfusion-associated graft-vs-host disease in patients with
>      malignancies. Report of two cases and review of the literature.
>PG  - 1324-9
>AB  - Graft-vs-host disease can develop in immunosuppressed individuals who
>      receive blood-product transfusions that contain immunocompetent
>      lymphocytes. We report two cases of fatal transfusion-associated
>      graft-vs-host disease that developed in patients with Hodgkin's disease
>      who were undergoing therapy. We review all cases of this entity in
>      patients with malignancies, represented predominantly by patients with
>      hematologic malignancies. The groups at risk for development of
>      transfusion-associated graft-vs-host disease, the clinical presentation
>      and course, and methods of diagnosis are summarized. Prevention of this
>      highly fatal condition is possible by irradiation of blood products given
>      to patients at risk, but problems remain in determining the groups that
>      warrant such measures. Dermatologists need to have heightened awareness of
>      this entity to facilitate more complete diagnosis and allow establishment
>      of effective standards of care.
>AD  - Department of Dermatology, Harvard Medical School, Boston, Mass.
>FAU - Decoste, S D
>AU  - Decoste SD
>FAU - Boudreaux, C
>AU  - Boudreaux C
>FAU - Dover, J S
>AU  - Dover JS
>LA  - eng
>PT  - Case Reports
>PT  - Journal Article
>PT  - Review
>PL  - UNITED STATES
>TA  - Arch Dermatol
>JT  - Archives of dermatology.
>JID - 0372433
>SB  - AIM
>SB  - IM
>CIN - Arch Dermatol. 1990 Oct;126(10):1347-50. PMID: 2221941
>MH  - Adolescent
>MH  - Adult
>MH  - Blood Transfusion/*adverse effects
>MH  - Female
>MH  - Graft vs Host Disease/*etiology/pathology
>MH  - Hodgkin Disease/*immunology
>MH  - Humans
>MH  - Immune Tolerance
>MH  - Male
>MH  - Skin Diseases/etiology/pathology
>RF  - 50
>EDAT- 1990/10/01
>MHDA- 1990/10/01 00:01
>PST - ppublish
>SO  - Arch Dermatol. 1990 Oct;126(10):1324-9.
>
>My technique was to index this article ridiculously exhaustively and give this 
indexing to indexers and searchers and have them cross out the terms they 
thought shouldn't apply.  (Naturally, indexers crossed out many more terms than 
searchers.)
>
>Anyway, let's talk about the neoplasm concept.  At that time MeSH had:
>Neoplasms
>Hematologic Diseases
>Hodgkin's Disease
>
>Using today's MeSH, the terms would be:
>Neoplasms
>Hematologic Neoplasms
>Hodgkin Disease
>
>Let's talk in terms of today's MeSH.
>
>Note that the title says "patients with malignancies"
>Note the abstract has:
>patients with Hodgkin's disease
>because the two cases had this disease
>but it also has:
>patients with hematologic malignancies as this represents predominantly the 
patients.
>
>So is this article about neoplasms, hematologic malignancies, or Hodgkin 
disease?
>If you apply the specificity principle, the original indexer was correct in 
that the data pertained only to two cases with HD.
>However, does this represent the gist of the article?
>
>This illustrates that the retrieval system also has something to do with 
choosing the correct level of specificity.  The fact that PubMed retrieval does 
automatic explosion means that if a searcher enters the search term:
>
>Neoplasms
>
>this citation will be retrieved because the search automatically explodes the 
term (searches the union of the term and its indentions), and Hodgkin Disease 
is in the Neoplasms tree.
>
>Thus, indexing with the most specific term is a good thing because searching 
the broader term Neoplasms will retrieve this citation as well, and you don't 
need to cover multiple levels of specificity because of this.
>
>However, if the retrieval system doesn't do this, then searching the broader 
Neoplasms would miss this citation.
>
>Hematologic Neoplasms is another story, however, because Hodgkin Disease is not 
in the Hematologic Neoplasms hierarchy.  Thus searching Hematologic Neoplasms 
will not retrieve this citation.  I would say that this citation is definitely 
relevant for Hematologic Neoplasms, but that indexing term is not there.
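
The effect of automatic explosion described above can be illustrated with a
toy model.  This is a minimal sketch in Python, not PubMed's actual
implementation: the hierarchy holds only the relationships stated in this
message, flattened to direct parent links (an assumption; the real MeSH tree
has intermediate levels), and the names PARENTS, explode, and search are
hypothetical.

```python
# Toy hierarchy: child heading -> parent headings.  Assumption: flattened to
# direct links; only the relationships mentioned in this message are included.
PARENTS = {
    "Hodgkin Disease": ["Neoplasms"],
    "Hematologic Neoplasms": ["Neoplasms"],
}

def explode(term):
    """Return the term together with every heading indented under it."""
    result = {term}
    for child, parents in PARENTS.items():
        if term in parents:
            result |= explode(child)
    return result

def search(term, index, exploded=True):
    """Return documents indexed with the term (or, if exploded, any narrower term)."""
    terms = explode(term) if exploded else {term}
    return sorted(doc for doc, headings in index.items() if headings & terms)

# The citation as originally indexed: only the most specific neoplasm term.
index = {"PMID 2221937": {"Hodgkin Disease"}}

print(search("Neoplasms", index))                  # found via explosion
print(search("Neoplasms", index, exploded=False))  # missed without explosion
print(search("Hematologic Neoplasms", index))      # missed: not in that hierarchy
```

Adding Hematologic Neoplasms to the citation's headings in this toy index
would make the third search succeed, which is the case for indexing at more
than one level of specificity.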
>
>In my experiment with 7 indexers, none crossed out Hodgkin's Disease, but only 
5 kept Neoplasms, and only 2 kept Hematologic Diseases (remember, Hematologic 
Neoplasms was not a term then).  Actually, because the indexing of hematologic 
malignancies should be Neoplasms + Hematologic Diseases, it seems fair to say 
that two of the five who kept Neoplasms did so in coordination with Hematologic 
Diseases.  So in summary,
>
>7 used Hodgkin Disease
>2 used Hematologic Diseases (as coord with Neoplasms)
>5 used Neoplasms (2 as coord with Hematologic Diseases)
>
>Indexers #1 & 3 used Hodgkin's Disease
>Indexers #2 & 7 used Hodgkin's Disease + Hematologic Diseases + Neoplasms
>Indexers #4-6 used Hodgkin's Disease + Neoplasms
>
>but interpreting through the current MeSH:
>
>7 used Hodgkin Disease
>2 used Hematologic Neoplasms
>3 used Neoplasms
>
>Specifically:
>Indexers #1 & 3 used Hodgkin Disease
>Indexers #2 & 7 used Hodgkin Disease + Hematologic Neoplasms
>Indexers #4-6 used Hodgkin Disease + Neoplasms
>
>So five of seven indexers assigned (i.e., did not cross out) not only Hodgkin's 
Disease but also at least one of the broader terms despite the specificity rule 
of indexing.  Two went one level broader to the hematologic malignancies, and 
three went two levels broader to just malignancies.
>
>This work was done in 1991, and frankly I don't know the state of the retrieval 
system at that time as to whether there was automatic explode or not.  But in 
any case Hodgkin's Disease was not in the Hematologic Diseases hierarchy either, 
but at least the indexing of "hematologic malignancies" required BOTH 
Hematologic Diseases AND Neoplasms, so under automatic pre-explode, the 
Neoplasms part would retrieve this article.  Today, it's worse because as I said 
above Hematologic Neoplasms hierarchy does not include Hodgkin Disease.
>
>Also, there was the matter of printed Index Medicus (which no longer exists).
>According to the original indexing, this citation appeared in IM only under 
Hodgkin's Disease.  This means a person looking in the printed index under 
Hematologic Diseases or under Neoplasms would not find this article, and in my 
opinion this omission would be quite significant.  That is, in perusing 
Hematologic Diseases and Neoplasms in print, this article would be quite 
relevant, but it would not be there, and the print searcher would have to know 
that such an article was printed ONLY under some more specific Hematologic 
Diseases or Neoplasms term (of which there are quite a few).
>
>So I guess what I am saying is that sometimes the gist of the article suggests 
multiple levels of specificity for indexing, and also the retrieval system might 
compensate when the specificity rule is strictly applied.
>
>I personally would argue for using all three terms in this case.
>
>I would say that this topic has to do with the indexing application more than 
with the thesaurus, in that all levels of specificity are represented in the 
thesaurus.  The issue is which levels the indexer selects.
>
>Susanne Humphrey
>humphrey at nlm.nih.gov
>
>
>
>
>
>
>
>
>-----Original Message-----
>From: Andrew Grove [mailto:Andrew.Grove at microsoft.com]
>Sent: Sat 8/26/2006 3:16 PM
>To: Birger Hjørland; Leonard Will; sigcr-l at asis.org
>Subject: Re: [Sigcr-l] Exhaustivity and specifity of indexing
>
>Birger, et al.:
>I've been following the discussion with great interest.  All relevant.  My 
goals for asking what is the more specific question were several:
>1.  Identify more clearly Kora's original information need in order to avoid 
misunderstanding it and spending time answering something different.  Motivated 
by personal time management objectives.
>2.  Identify deficiencies in the literature in order to identify opportunities 
for contribution to it.
>3.  Possibly identify related bodies of literature that might contribute to 
answering Kora's question(s).
>
>That said, I will add these brief comments.
>
>This sounds very much like the long-standing discussion in Taxonomy between 
"lumping" and "splitting".  As a practitioner, not a scholar, I make a pretty 
clear distinction between the two.  Both are useful for describing and 
retrieving information objects.  Classification ("lumping") provides relatively 
broad, general categories which serve to group similar objects (topics, 
concepts, provenance, purpose, etc.).  Indexing ("splitting") marks objects in a 
manner which distinguishes each from others which are similar but not the same.  
Because of the multiplicity of objects having the same or similar 
characteristics, indexing also serves to group them -- but at a very specific 
level.  Because of the multiplicity of objects alone, classification also serves 
to distinguish them -- but at broad and general levels.  A highly detailed 
classification, which an extended DDC could become, tends to dive into the realm 
of indexing languages.  A broad, general index language, which many are for 
pragmatic reasons based on collection size and scope, tends to "bubble up" into 
the realm of classification schemes.  The distinction between classification and 
indexing ends up becoming situational and very fluid.
>
>For what it's worth, I will suggest examination of the literature on Taxonomy, 
and the branches of Logic and Linguistics which deal, specifically, with the 
relationships of objects to each other.
>
>Most respectfully yours,
>Andrew
>
>-----Original Message-----
>From: Birger Hjørland [mailto:BH at db.dk]
>Sent: Saturday, August 26, 2006 11:34 AM
>To: Andrew Grove; Leonard Will; sigcr-l at asis.org
>Subject: SV: [Sigcr-l] Exhaustivity and specifity of indexing
>
>Answer to Andrew:
>Yes, I believe the literature is comprehensive and answers most questions. My 
point was that Kora expected separate literatures about the specificity of 
indexing and of classification, whereas I propose that this is fundamentally the 
same issue. The next round was about the specificity of the indexing language 
versus the actual indexing/classification practice, where I suggested, following 
Cutter (1876), indexing as specifically as possible in the given system.
>
>kind regards Birger
>
>
>
>
>________________________________
>
>From: sigcr-l-bounces at asis.org on behalf of Andrew Grove
>Sent: Sat 26-08-2006 16:51
>To: Leonard Will; sigcr-l at asis.org
>Subject: Re: [Sigcr-l] Exhaustivity and specifity of indexing
>
>
>
>Hello,
>
>I am resisting the urge to leap in too quickly here.  In Kora's original 
message, there's mention of literature on the subject but it does not suffice.  
So true, there is a wealth of literature on the subject.  So much of it in fact, 
I wonder in what manner it does not suffice.  What is, forgive me, the more 
specific question the literature does not answer?
>
>Most respectfully,
>Andrew
>
>Andrew Grove
>Program Manager, Taxonomy
>Knowledge Network Group
>Microsoft Corporation
>425 706-5557
>
>
>-----Original Message-----
>From: sigcr-l-bounces at asis.org [mailto:sigcr-l-bounces at asis.org] On Behalf Of 
Leonard Will
>Sent: Saturday, August 26, 2006 6:34 AM
>To: sigcr-l at asis.org
>Subject: Re: [Sigcr-l] Exhaustivity and specifity of indexing
>
>In message <FB64419FDA34834382771A20964B15CED27AF9 at amon.it.lth.se> on Thu, 24 
Aug 2006, Koraljka Golub <kora at it.lth.se> wrote
>>
>>Does anyone know of any references or have any opinion about
>>exhaustivity and specificity of classification, meaning assignment of
>>classes from a classification scheme.
>
>In message <73573C2DCB0154408D790B1E7EDB0C521B9E1F at ka-exch01.db.dk> on Sat, 26 
Aug 2006, Birger Hjørland <BH at db.dk> wrote
>>Dear Kora,
>>I believe that you are making the wrong assumption that indexing and
>>classification are different in this respect. If you take a concept from
>>a controlled vocabulary (say, a thesaurus) this is in my opinion
>>similar to taking a class from a classification system (which also
>>represents a concept). So, the specificity of a term in a thesaurus
>>depends on the number of terms given and the specificity of a class in
>>a classification system depends on the number of classes given (the
>>more terms/classes, the greater the specificity of applying a given
>>term/class). It is worth considering, however, that although the overall
>>specificity can be measured by counting the number of
>>descriptors/classes, any given system will have a greater specificity
>>in some areas compared to others (DDC, for example, is much more
>>specific in Christianity compared to other religions).
>
>I agree with what Birger says, but I think that Koraljka's question was not so 
much about the specificity provided in the scheme itself, but the specificity 
with which it is applied when classifying documents, i.e., for example, is it 
worth while to use the full specificity possible in DDC by adding all the 
possible common subdivisions, "divide-like" instructions and so on, or is it 
better to simplify by limiting class numbers to 3 (or 6 or whatever) digits?
>
>The answer to this must be that it depends on the material being classified. 
The aim should be to classify specifically enough to make it easy for the user 
to scan through the items in a class. I usually think of this as meaning that a 
class should contain between 10 and 50 items.
>If the collection is large, or concentrated in a single subject area, more 
specificity will be needed than if it is a small, general collection.
>
>Other considerations are:
>
>a. Allowing for growth of the collection. You don't want to have to go back and 
re-classify if more material is added in a given subject area.
>
>b. Compatibility with what is being done elsewhere. Do you share records, 
obtain them from elsewhere or merge them in a combined catalogue?
>
>c. Provision of access from concepts that are scattered by the classification. 
These may come later in the citation order of combining facets in a synthesised 
class number, and if the number is truncated they will be lost.
>
>d. Adequacy of the alphabetical index constructed to show where topics have 
been classed. It is seldom adequate to rely on the index published with the 
schedules, but far too often that is all that is provided. It will not show many 
synthesised numbers, and there is little point in creating these if you do not 
also create the means of finding them.
>
>Exhaustivity is more a matter of subject analysis of the documents. Do you 
identify and record topics that are only treated incidentally in a document, or 
do you restrict indexing and classification to the main topics only? There is no 
simple answer, so much depending on the nature of the collection, the users, and 
the purpose of the catalogue.
>
>Leonard Will
>
>--
>Willpower Information       (Partners: Dr Leonard D Will, Sheena E Will)
>Information Management Consultants              Tel: +44 (0)20 8372 0092
>27 Calshot Way, Enfield, Middlesex EN2 7BQ, UK. Fax: +44 (0)870 051 7276
>L.Will at Willpowerinfo.co.uk               Sheena.Will at Willpowerinfo.co.uk
>---------------- <URL:http://www.willpowerinfo.co.uk/> -----------------
>
>
>_______________________________________________
>Sigcr-l mailing list
>Sigcr-l at asis.org
>http://mail.asis.org/mailman/listinfo/sigcr-l
>



