[Sigcr-l] Exhaustivity and specifity of indexing

Humphrey, Susanne (NIH/NLM/LHC) [E] humphrey at nlm.nih.gov
Thu Sep 7 19:20:24 EDT 2006


Note:
I am using Outlook Web Access from home, which I find awkward, so I am afraid this
is being addressed to some individuals as well as the list serve.  I don't know
how to fix this.
I hope at least it does reach the list serve and not just the individuals.
Barbara K., did you receive it?
If not, I guess I need you to tell me how to send it to the list serve properly
(Yes, I should know)



Let me jump into this with a specific example from PubMed that I focused on in an
unpublished experiment many years ago:

PMID- 2221937
OWN - NLM
STAT- MEDLINE
DA  - 19901115
DCOM- 19901115
LR  - 20051116
PUBM- Print
IS  - 0003-987X (Print)
VI  - 126
IP  - 10
DP  - 1990 Oct
TI  - Transfusion-associated graft-vs-host disease in patients with
      malignancies. Report of two cases and review of the literature.
PG  - 1324-9
AB  - Graft-vs-host disease can develop in immunosuppressed individuals who
      receive blood-product transfusions that contain immunocompetent
      lymphocytes. We report two cases of fatal transfusion-associated
      graft-vs-host disease that developed in patients with Hodgkin's disease
      who were undergoing therapy. We review all cases of this entity in
      patients with malignancies, represented predominantly by patients with
      hematologic malignancies. The groups at risk for development of
      transfusion-associated graft-vs-host disease, the clinical presentation
      and course, and methods of diagnosis are summarized. Prevention of this
      highly fatal condition is possible by irradiation of blood products given
      to patients at risk, but problems remain in determining the groups that
      warrant such measures. Dermatologists need to have heightened awareness of
      this entity to facilitate more complete diagnosis and allow establishment
      of effective standards of care.
AD  - Department of Dermatology, Harvard Medical School, Boston, Mass.
FAU - Decoste, S D
AU  - Decoste SD
FAU - Boudreaux, C
AU  - Boudreaux C
FAU - Dover, J S
AU  - Dover JS
LA  - eng
PT  - Case Reports
PT  - Journal Article
PT  - Review
PL  - UNITED STATES
TA  - Arch Dermatol
JT  - Archives of dermatology.
JID - 0372433
SB  - AIM
SB  - IM
CIN - Arch Dermatol. 1990 Oct;126(10):1347-50. PMID: 2221941
MH  - Adolescent
MH  - Adult
MH  - Blood Transfusion/*adverse effects
MH  - Female
MH  - Graft vs Host Disease/*etiology/pathology
MH  - Hodgkin Disease/*immunology
MH  - Humans
MH  - Immune Tolerance
MH  - Male
MH  - Skin Diseases/etiology/pathology
RF  - 50
EDAT- 1990/10/01
MHDA- 1990/10/01 00:01
PST - ppublish
SO  - Arch Dermatol. 1990 Oct;126(10):1324-9.

My technique was to index this article ridiculously exhaustively and give this
indexing to indexers and searchers and have them cross out the terms they thought
shouldn't apply.  (Naturally, indexers crossed out many more terms than searchers.)

Anyway, let's talk about the neoplasm concept.  At that time MeSH had:
Neoplasms
Hematologic Diseases
Hodgkin's Disease

Using today's MeSH, the terms would be:
Neoplasms
Hematologic Neoplasms
Hodgkin Disease

Let's talk in terms of today's MeSH.

Note that the title says "patients with malignancies"
Note the abstract has:
patients with Hodgkin's disease
because the two cases had this disease
but it also has:
patients with hematologic malignancies as this represents predominantly
the patients.

So is this article about neoplasms, hematologic malignancies, or Hodgkin disease?
If you apply the specificity principle, the original indexer was correct in that
the data pertained only to two cases with HD.
However, does this represent the gist of the article?

This illustrates that the retrieval system also has something to do with choosing
the correct level of specificity.  The fact that PubMed retrieval does automatic
explosion means that if a searcher enters the search term:

Neoplasms

this citation will be retrieved because the search automatically explodes the term
(searches the union of the term and its indentions), and Hodgkin Disease is in
the Neoplasms tree.

Thus, indexing with the most specific term is a good thing because searching
the broader term Neoplasms will retrieve this citation as well, and you don't
need to cover multiple levels of specificity because of this.

However, if the retrieval system doesn't do this, then searching the broader
Neoplasms would miss this citation.

Hematologic Neoplasms is another story, however, because Hodgkin Disease is not
in the Hematologic Neoplasms hierarchy.  Thus searching Hematologic Neoplasms
will not retrieve this citation.  I would say that this citation is definitely
relevant for Hematologic Neoplasms, but that indexing term is not there.
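
To make the explosion behaviour concrete, here is a minimal sketch in Python.
The hierarchy is a simplified toy fragment written for illustration, not the
actual MeSH tree, and the names CHILDREN, CITATIONS, explode and search are
hypothetical.  It is not how PubMed is implemented; it only shows the set logic.

```python
# Toy fragment of a MeSH-like hierarchy (parent -> children).
# Simplified for illustration; an assumption, not the actual MeSH tree.
CHILDREN = {
    "Neoplasms": ["Neoplasms by Histologic Type", "Neoplasms by Site"],
    "Neoplasms by Histologic Type": ["Lymphoma"],
    "Lymphoma": ["Hodgkin Disease"],
    "Neoplasms by Site": ["Hematologic Neoplasms"],
    "Hematologic Neoplasms": [],
    "Hodgkin Disease": [],
}

# Neoplasm-related heading on the citation as originally indexed.
CITATIONS = {"2221937": {"Hodgkin Disease"}}


def explode(term):
    """Return the term together with all of its descendants (indentions)."""
    terms = {term}
    for child in CHILDREN.get(term, []):
        terms |= explode(child)
    return terms


def search(term, exploded=True):
    """Return PMIDs indexed with the term or, if exploded, any descendant."""
    wanted = explode(term) if exploded else {term}
    return sorted(p for p, headings in CITATIONS.items() if headings & wanted)


print(search("Neoplasms"))                  # exploded search on the broad term
print(search("Neoplasms", exploded=False))  # same term without explosion
print(search("Hematologic Neoplasms"))      # sibling branch of the tree
```

In this toy tree the exploded search on Neoplasms finds the citation, the
unexploded search does not, and Hematologic Neoplasms misses it either way
because Hodgkin Disease sits in a different branch.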

In my experiment with 7 indexers, none crossed out Hodgkin's Disease, 5 did not
cross out Neoplasms, and 2 did not cross out Hematologic Diseases (remember
Hematologic Neoplasms was not a term then).  Because hematologic malignancies
had to be indexed as Neoplasms + Hematologic Diseases, it seems fair to say that
two of the five who kept Neoplasms kept it as a coordinate of Hematologic
Diseases.  So in summary,

7 used Hodgkin Disease
2 used Hematologic Diseases (as coord with Neoplasms)
5 used Neoplasms (2 as coord with Hematologic Diseases)

Indexers #1 & 3 used Hodgkin's Disease
Indexers #2 & 7 used Hodgkin's Disease + Hematologic Diseases + Neoplasms
Indexers #4-6 used Hodgkin's Disease + Neoplasms

but interpreting through the current MeSH:

7 used Hodgkin Disease
2 used Hematologic Neoplasms
3 used Neoplasms

Specifically, indexers #1 & 3 used Hodgkin Disease
Indexers #2 & 7 used Hodgkin Disease + Hematologic Neoplasms
Indexers #4-6 used Hodgkin Disease + Neoplasms

So five of seven indexers assigned (i.e., did not cross out) not only Hodgkin's
Disease but also at least one of the broader terms, despite the specificity rule
of indexing.  Two went one level broader, to hematologic malignancies, and three
went two levels broader, to malignancies in general.
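
The tallies can be recomputed from the per-indexer lists given above (the
current-MeSH reading).  This is just a small Python sketch of the arithmetic;
the variable names are mine.

```python
from collections import Counter

# Terms each of the seven indexers left standing, read through current MeSH.
ASSIGNED = {
    1: {"Hodgkin Disease"},
    2: {"Hodgkin Disease", "Hematologic Neoplasms"},
    3: {"Hodgkin Disease"},
    4: {"Hodgkin Disease", "Neoplasms"},
    5: {"Hodgkin Disease", "Neoplasms"},
    6: {"Hodgkin Disease", "Neoplasms"},
    7: {"Hodgkin Disease", "Hematologic Neoplasms"},
}

# Terms broader than the most specific heading, Hodgkin Disease.
BROADER = {"Hematologic Neoplasms", "Neoplasms"}

# How many indexers kept each term.
counts = Counter(term for terms in ASSIGNED.values() for term in terms)

# Indexers who kept at least one broader term in addition to Hodgkin Disease.
went_broader = sorted(i for i, terms in ASSIGNED.items() if terms & BROADER)

print(counts)
print(went_broader)
```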

This work was done in 1991, and frankly I don't know whether the retrieval
system at that time had automatic explode or not.  In any case, Hodgkin's
Disease was not in the Hematologic Diseases hierarchy then either.  But at least
the indexing of "hematologic malignancies" required BOTH Hematologic Diseases
AND Neoplasms, so under automatic pre-explode, the Neoplasms part would
retrieve this article.  Today, it's worse because, as I said above, the
Hematologic Neoplasms hierarchy does not include Hodgkin Disease.

Also, there was the matter of printed Index Medicus (which no longer exists).
According to the original indexing, this citation appeared in IM only under
Hodgkin's Disease.  This means a person looking in the printed index under
Hematologic Diseases or under Neoplasms would not find this article, and in my
opinion this omission would be quite significant.  That is, in perusing
Hematologic Diseases and Neoplasms in print, this article would be quite
relevant, but it would not be there, and the print searcher would have to know
that such an article was printed ONLY under some more specific Hematologic Diseases
or Neoplasms term (of which there are quite a few).

So I guess what I am saying is that sometimes the gist of the article suggests
multiple levels of specificity for indexing, and also the retrieval system might
compensate when the specificity rule is strictly applied.

I personally would argue for using all three terms in this case.

I would say that this topic has to do with the indexing application more than
with the thesaurus, in that all levels of specificity are represented in the
thesaurus.  The issue is which levels the indexer selects.

Susanne Humphrey
humphrey at nlm.nih.gov








-----Original Message-----
From: Andrew Grove [mailto:Andrew.Grove at microsoft.com]
Sent: Sat 8/26/2006 3:16 PM
To: Birger Hjørland; Leonard Will; sigcr-l at asis.org
Subject: Re: [Sigcr-l] Exhaustivity and specifity of indexing
 
Birger, et al.:
I've been following the discussion with great interest.  All relevant.  My goals for asking what is the more specific question were several:
1.  Identify more clearly Kora's original information need in order to avoid misunderstanding it and spending time answering something different.  Motivated by personal time management objectives.
2.  Identify deficiencies in the literature in order to identify opportunities for contribution to it.
3.  Possibly identify related bodies of literature that might contribute to answering Kora's question(s).

That said, I will add these brief comments.

This sounds very much like the long-standing discussion in Taxonomy between "lumping" and "splitting".  As a practitioner, not a scholar, I make a pretty clear distinction between the two.  Both are useful for describing and retrieving information objects.  Classification ("lumping") provides relatively broad, general categories which serve to group similar objects (topics, concepts, provenance, purpose, etc.).  Indexing ("splitting") marks objects in a manner which distinguishes each from others which are similar but not the same.  Because of the multiplicity of objects having the same or similar characteristics, indexing also serves to group them -- but at a very specific level.  Because of the multiplicity of objects alone, classification also serves to distinguish them -- but at broad and general levels.  A highly detailed classification, which an extended DDC could become, tends to dive into the realm of indexing languages.  A broad, general index language, which many are for pragmatic reasons based on collection size and scope, tends to "bubble up" into the realm of classification schemes.  The distinction between classification and indexing ends up becoming situational and very fluid.

For what it's worth, I will suggest examination of the literature on Taxonomy, and the branches of Logic and Linguistics which deal, specifically, with the relationships of objects to each other.

Most respectfully yours,
Andrew

-----Original Message-----
From: Birger Hjørland [mailto:BH at db.dk] 
Sent: Saturday, August 26, 2006 11:34 AM
To: Andrew Grove; Leonard Will; sigcr-l at asis.org
Subject: Re: [Sigcr-l] Exhaustivity and specifity of indexing

Answer to Andrew: 
Yes, I believe the literature is comprehensive and answers most questions. My point was that Kora expected separate literatures about the specificity of indexing and of classification, whereas I propose that these are fundamentally the same. The next round was about the specificity of the indexing language versus the actual indexing/classification practice, where I suggested, following Cutter (1876), indexing as specifically as possible in the given system.
 
kind regards Birger
 
 
 

________________________________

From: sigcr-l-bounces at asis.org on behalf of Andrew Grove
Sent: Sat 26-08-2006 16:51
To: Leonard Will; sigcr-l at asis.org
Subject: Re: [Sigcr-l] Exhaustivity and specifity of indexing



Hello,

I am resisting the urge to leap in too quickly here.  In Kora's original message, there's mention of literature on the subject but it does not suffice.  So true, there is a wealth of literature on the subject.  So much of it in fact, I wonder in what manner it does not suffice.  What is, forgive me, the more specific question the literature does not answer?

Most respectfully,
Andrew

Andrew Grove
Program Manager, Taxonomy
Knowledge Network Group
Microsoft Corporation
425 706-5557


-----Original Message-----
From: sigcr-l-bounces at asis.org [mailto:sigcr-l-bounces at asis.org] On Behalf Of Leonard Will
Sent: Saturday, August 26, 2006 6:34 AM
To: sigcr-l at asis.org
Subject: Re: [Sigcr-l] Exhaustivity and specifity of indexing

In message <FB64419FDA34834382771A20964B15CED27AF9 at amon.it.lth.se> on Thu, 24 Aug 2006, Koraljka Golub <kora at it.lth.se> wrote
>
>Does anyone know of any references or have any opinion about 
>exhaustivity and specificity of classification, meaning assignment of 
>classes from a classification scheme.

In message <73573C2DCB0154408D790B1E7EDB0C521B9E1F at ka-exch01.db.dk> on Sat, 26 Aug 2006, Birger Hjørland <BH at db.dk> wrote
>Dear Kora,
>I believe that you are making the wrong assumption that indexing and 
>classification are different in this respect. If you take a concept from 
>a controlled vocabulary (say, a thesaurus) this is in my opinion 
>similar to taking a class from a classification system (which also 
>represents a concept). So, the specificity of a term in a thesaurus 
>depends on the number of terms given and the specificity of a class in 
>a classification system depends on the number of classes given (the 
>more terms/classes, the greater the specificity of applying a given 
>term/class). It is worth considering, however, that although the overall 
>specificity can be measured by counting the number of 
>descriptors/classes, any given system will have a greater specificity 
>in some areas compared to others (DDC, for example, is much more 
>specific in Christianity compared to other religions).

I agree with what Birger says, but I think that Koraljka's question was not so much about the specificity provided in the scheme itself, but the specificity with which it is applied when classifying documents. For example, is it worthwhile to use the full specificity possible in DDC by adding all the possible common subdivisions, "divide-like" instructions and so on, or is it better to simplify by limiting class numbers to 3 (or 6 or whatever) digits?

The answer to this must be that it depends on the material being classified. The aim should be to classify specifically enough to make it easy for the user to scan through the items in a class. I usually think of this as meaning that a class should contain between 10 and 50 items. If the collection is large, or concentrated in a single subject area, more specificity will be needed than if it is a small, general collection.
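
As a rough illustration of this rule of thumb, the Python sketch below truncates class numbers to a given number of digits and looks for the shortest length at which no class holds more than 50 items. The collection, the function names and the 3-to-6-digit range are all hypothetical, chosen only to show the idea; it checks the upper bound only and is not a recommendation for any real catalogue.

```python
from collections import Counter


def class_sizes(numbers, digits):
    """Count items per class when numbers are truncated to `digits` digits."""
    return Counter(n.replace(".", "")[:digits] for n in numbers)


def shortest_adequate_length(numbers, max_items=50, max_digits=6):
    """Return the shortest truncation length with no class above max_items.

    Returns None if even max_digits leaves some class too large.
    """
    for digits in range(3, max_digits + 1):
        if max(class_sizes(numbers, digits).values()) <= max_items:
            return digits
    return None


# Hypothetical collection concentrated in one subject area (made-up counts).
LARGE = ["025.431"] * 40 + ["025.47"] * 30 + ["200"] * 5
# Hypothetical small, general collection.
SMALL = ["025.431"] * 10 + ["200"] * 5

print(class_sizes(LARGE, 3))
print(shortest_adequate_length(LARGE))
print(shortest_adequate_length(SMALL))
```

With these made-up numbers the concentrated collection needs longer class numbers than the small general one, which is the point made above.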

Other considerations are:

a. Allowing for growth of the collection. You don't want to have to go back and re-classify if more material is added in a given subject area.

b. Compatibility with what is being done elsewhere. Do you share records, obtain them from elsewhere or merge them in a combined catalogue?

c. Provision of access from concepts that are scattered by the classification. These may come later in the citation order of combining facets in a synthesised class number, and if the number is truncated they will be lost.

d. Adequacy of the alphabetical index constructed to show where topics have been classed. It is seldom adequate to rely on the index published with the schedules, but far too often that is all that is provided. It will not show many synthesised numbers, and there is little point in creating these if you do not also create the means of finding them.

Exhaustivity is more a matter of subject analysis of the documents. Do you identify and record topics that are only treated incidentally in a document, or do you restrict indexing and classification to the main topics only? There is no simple answer, so much depending on the nature of the collection, the users, and the purpose of the catalogue.

Leonard Will

--
Willpower Information       (Partners: Dr Leonard D Will, Sheena E Will)
Information Management Consultants              Tel: +44 (0)20 8372 0092
27 Calshot Way, Enfield, Middlesex EN2 7BQ, UK. Fax: +44 (0)870 051 7276
L.Will at Willpowerinfo.co.uk               Sheena.Will at Willpowerinfo.co.uk
---------------- <URL:http://www.willpowerinfo.co.uk/> -----------------


_______________________________________________
Sigcr-l mailing list
Sigcr-l at asis.org
http://mail.asis.org/mailman/listinfo/sigcr-l
