[Sigcr-l] Exhaustivity and specifity of indexing
Andrew Grove
Andrew.Grove at microsoft.com
Wed Sep 20 13:51:18 EDT 2006
Susanne,
Thank you for the excellent discussion illustrating interdependencies between thesauri, taxonomies, indexing, classification, etc. and retrieval systems.
One thing I might add -- in many business situations, the "thesaurus" is dynamic and constantly growing as the demands of the business require. In those cases, it is no easy matter to rely on a stable, developed thesaurus when evaluating or using the indexing and retrieval systems in context, because the context constantly changes. The challenge then becomes one of completely understanding those systems and developing a thesaurus in their context, rather than vice versa.
A request - is there any possibility the experiment could be brushed up and published? Failing that, a matrix of the results with 7 indexers would be a handy graphic to illustrate the complexities and interdependencies. It's maybe more relevant than ever now, especially with ad hoc IR on the scene as the current magic bullet.
Thanks,
Andrew
-----Original Message-----
From: Humphrey, Susanne (NIH/NLM/LHC) [E] [mailto:humphrey at nlm.nih.gov]
Sent: Thursday, September 07, 2006 4:20 PM
To: Andrew Grove; Birger Hjørland; Leonard Will; sigcr-l at asis.org
Subject: RE: [Sigcr-l] Exhaustivity and specifity of indexing
Note:
I am using Outlook Web Access from home, which I find awkward, so I am afraid this is being addressed to some individuals as well as the list serve. I don't know how to fix this.
I hope at least it does reach the list serve and not just the individuals.
Barbara K., did you receive it?
If not, I guess I need you to tell me how to send it to the list serve properly. (Yes, I should know.)
Let me jump into this with a specific example from PubMed that I focused on in an unpublished experiment many years ago:
PMID- 2221937
OWN - NLM
STAT- MEDLINE
DA - 19901115
DCOM- 19901115
LR - 20051116
PUBM- Print
IS - 0003-987X (Print)
VI - 126
IP - 10
DP - 1990 Oct
TI - Transfusion-associated graft-vs-host disease in patients with
malignancies. Report of two cases and review of the literature.
PG - 1324-9
AB - Graft-vs-host disease can develop in immunosuppressed individuals who
receive blood-product transfusions that contain immunocompetent
lymphocytes. We report two cases of fatal transfusion-associated
graft-vs-host disease that developed in patients with Hodgkin's disease
who were undergoing therapy. We review all cases of this entity in
patients with malignancies, represented predominantly by patients with
hematologic malignancies. The groups at risk for development of
transfusion-associated graft-vs-host disease, the clinical presentation
and course, and methods of diagnosis are summarized. Prevention of this
highly fatal condition is possible by irradiation of blood products given
to patients at risk, but problems remain in determining the groups that
warrant such measures. Dermatologists need to have heightened awareness of
this entity to facilitate more complete diagnosis and allow establishment
of effective standards of care.
AD - Department of Dermatology, Harvard Medical School, Boston, Mass.
FAU - Decoste, S D
AU - Decoste SD
FAU - Boudreaux, C
AU - Boudreaux C
FAU - Dover, J S
AU - Dover JS
LA - eng
PT - Case Reports
PT - Journal Article
PT - Review
PL - UNITED STATES
TA - Arch Dermatol
JT - Archives of dermatology.
JID - 0372433
SB - AIM
SB - IM
CIN - Arch Dermatol. 1990 Oct;126(10):1347-50. PMID: 2221941
MH - Adolescent
MH - Adult
MH - Blood Transfusion/*adverse effects
MH - Female
MH - Graft vs Host Disease/*etiology/pathology
MH - Hodgkin Disease/*immunology
MH - Humans
MH - Immune Tolerance
MH - Male
MH - Skin Diseases/etiology/pathology
RF - 50
EDAT- 1990/10/01
MHDA- 1990/10/01 00:01
PST - ppublish
SO - Arch Dermatol. 1990 Oct;126(10):1324-9.
My technique was to index this article ridiculously exhaustively and give this indexing to indexers and searchers and have them cross out the terms they thought shouldn't apply. (Naturally, indexers crossed out many more terms than searchers.)
Anyway, let's talk about the neoplasm concept. At that time MeSH had:
Neoplasms
Hematologic Diseases
Hodgkin's Disease
Using today's MeSH, the terms would be:
Neoplasms
Hematologic Neoplasms
Hodgkin Disease
Let's talk in terms of today's MeSH.
Note that the title says "patients with malignancies"
Note the abstract has:
patients with Hodgkin's disease
because the two cases had this disease
but it also has:
patients with hematologic malignancies
as these make up most of the patients reviewed.
So is this article about neoplasms, hematologic malignancies, or Hodgkin disease?
If you apply the specificity principle, the original indexer was correct in that the data pertained only to two cases with HD.
However, does this represent the gist of the article?
This illustrates that the retrieval system also has something to do with choosing the correct level of specificity. The fact that PubMed retrieval does automatic explosion means that if a searcher enters the search term:
Neoplasms
this citation will be retrieved because the search automatically explodes the term (searches the union of the term and its indentions), and Hodgkin Disease is in the Neoplasms tree.
Thus, indexing with the most specific term is a good thing because searching the broader term Neoplasms will retrieve this citation as well, and you don't need to cover multiple levels of specificity because of this.
However, if the retrieval system doesn't do this, then searching the broader Neoplasms would miss this citation.
Hematologic Neoplasms is another story, however, because Hodgkin Disease is not in the Hematologic Neoplasms hierarchy. Thus searching Hematologic Neoplasms will not retrieve this citation. I would say that this citation is definitely relevant for Hematologic Neoplasms, but that indexing term is not there.
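The effect of automatic explosion described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the flattened hierarchy follows the description in this message rather than the real MeSH tree structure, only a few of the record's headings are included, and `explode` and `search` are hypothetical helper names, not PubMed functions.

```python
# Toy hierarchy, child -> parent. ASSUMPTION: flattened to what this message
# describes; intermediate MeSH levels are omitted on purpose.
PARENT = {
    "Hematologic Neoplasms": "Neoplasms",
    "Hodgkin Disease": "Neoplasms",  # under Neoplasms, NOT under Hematologic Neoplasms
}

# Some of the headings assigned to the example citation (from its MH fields).
INDEX = {
    "2221937": {"Hodgkin Disease", "Blood Transfusion", "Graft vs Host Disease"},
}


def explode(term):
    """Return the term together with all of its descendants in PARENT."""
    result = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def search(term, index, auto_explode=True):
    """Return the citations indexed with the term (or, if exploded, any descendant)."""
    terms = explode(term) if auto_explode else {term}
    return sorted(pmid for pmid, headings in index.items() if headings & terms)


print(search("Neoplasms", INDEX))                      # exploded search
print(search("Neoplasms", INDEX, auto_explode=False))  # exact-term search
print(search("Hematologic Neoplasms", INDEX))          # exploded, different branch
```

With explosion the broad search finds the citation through its specific heading; without explosion, or from a branch that does not contain the heading, it does not.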
In my experiment with 7 indexers, none crossed out Hodgkin's Disease, 5 did not cross out Neoplasms, and 2 did not cross out Hematologic Diseases (remember, Hematologic Neoplasms was not a term then). Actually, because hematologic malignancies would have been indexed as Neoplasms + Hematologic Diseases, it seems fair to say that two of the five retentions of Neoplasms go with the retention of Hematologic Diseases. So in summary,
7 used Hodgkin's Disease
2 used Hematologic Diseases (as coord with Neoplasms)
5 used Neoplasms (2 as coord with Hematologic Diseases)
Indexers #1 & 3 used Hodgkin's Disease
Indexers #2 & 7 used Hodgkin's Disease + Hematologic Diseases + Neoplasms
Indexers #4-6 used Hodgkin's Disease + Neoplasms
but interpreting through the current MeSH:
7 used Hodgkin Disease
2 used Hematologic Neoplasms
3 used Neoplasms
Specifically:
Indexers #1 & 3 used Hodgkin Disease
Indexers #2 & 7 used Hodgkin Disease + Hematologic Neoplasms
Indexers #4-6 used Hodgkin Disease + Neoplasms
So five of seven indexers assigned (i.e., did not cross out) not only Hodgkin's Disease but also at least one of the broader terms despite the specificity rule of indexing. Two went one level broader to the hematologic malignancies, and three went two levels broader to just malignancies.
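The current-MeSH reading of the seven indexers' choices can be laid out as a small indexer-by-term matrix. A minimal sketch in Python: the data are exactly the assignments listed above, while the names `ASSIGNED`, `TERMS`, and `matrix_rows` are illustrative only.

```python
from collections import Counter

TERMS = ["Hodgkin Disease", "Hematologic Neoplasms", "Neoplasms"]

# Terms each indexer kept (did not cross out), read through current MeSH.
ASSIGNED = {
    1: {"Hodgkin Disease"},
    2: {"Hodgkin Disease", "Hematologic Neoplasms"},
    3: {"Hodgkin Disease"},
    4: {"Hodgkin Disease", "Neoplasms"},
    5: {"Hodgkin Disease", "Neoplasms"},
    6: {"Hodgkin Disease", "Neoplasms"},
    7: {"Hodgkin Disease", "Hematologic Neoplasms"},
}


def matrix_rows(assigned, terms):
    """One text row per indexer: 'x' where the term was kept, '.' where crossed out."""
    rows = []
    for indexer in sorted(assigned):
        marks = " ".join("x" if t in assigned[indexer] else "." for t in terms)
        rows.append(f"#{indexer} {marks}")
    return rows


# How many indexers kept each term, and how many kept a broader term as well.
counts = Counter(t for kept in ASSIGNED.values() for t in kept)
went_broader = sum(1 for kept in ASSIGNED.values() if len(kept) > 1)

print("   " + " | ".join(TERMS))
for row in matrix_rows(ASSIGNED, TERMS):
    print(row)
print(dict(counts), "indexers with a broader term:", went_broader)
```

The columns run from most specific to broadest, so a row with more than one mark is an indexer who went beyond the specificity rule.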
This work was done in 1991, and frankly I don't know whether the retrieval system at that time had automatic explode. In any case, Hodgkin's Disease was not in the Hematologic Diseases hierarchy either, but at least the indexing of "hematologic malignancies" required BOTH Hematologic Diseases AND Neoplasms, so under automatic pre-explode, the Neoplasms part would retrieve this article. Today it's worse because, as I said above, the Hematologic Neoplasms hierarchy does not include Hodgkin Disease.
Also, there was the matter of printed Index Medicus (which no longer exists).
According to the original indexing, this citation appeared in IM only under Hodgkin's Disease. This means a person looking in the printed index under Hematologic Diseases or under Neoplasms would not find this article, and in my opinion this omission would be quite significant. That is, in perusing Hematologic Diseases and Neoplasms in print, this article would be quite relevant, but it would not be there, and the print searcher would have to know that such an article was printed ONLY under some more specific Hematologic Diseases or Neoplasms term (of which there are quite a few).
So I guess what I am saying is that sometimes the gist of the article suggests multiple levels of specificity for indexing, and also the retrieval system might compensate when the specificity rule is strictly applied.
I personally would argue for using all three terms in this case.
I would say that this topic has to do with the indexing application more than with the thesaurus, in that all levels of specificity are represented in the thesaurus. The issue is which levels the indexer selects.
Susanne Humphrey
humphrey at nlm.nih.gov
-----Original Message-----
From: Andrew Grove [mailto:Andrew.Grove at microsoft.com]
Sent: Sat 8/26/2006 3:16 PM
To: Birger Hjørland; Leonard Will; sigcr-l at asis.org
Subject: Re: [Sigcr-l] Exhaustivity and specifity of indexing
Birger, et al.:
I've been following the discussion with great interest. All relevant. My goals in asking what the more specific question is were several:
1. Identify more clearly Kora's original information need in order to avoid misunderstanding it and spending time answering something different. Motivated by personal time management objectives.
2. Identify deficiencies in the literature in order to identify opportunities for contribution to it.
3. Possibly identify related bodies of literature that might contribute to answering Kora's question(s).
That said, I will add these brief comments.
This sounds very much like the long-standing discussion in Taxonomy between "lumping" and "splitting". As a practitioner, not a scholar, I make a pretty clear distinction between the two. Both are useful for describing and retrieving information objects. Classification ("lumping") provides relatively broad, general categories which serve to group similar objects (topics, concepts, provenance, purpose, etc.). Indexing ("splitting") marks objects in a manner which distinguishes each from others which are similar but not the same. Because of the multiplicity of objects having the same or similar characteristics, indexing also serves to group them -- but at a very specific level. Because of the multiplicity of objects alone, classification also serves to distinguish them -- but at broad and general levels. A highly detailed classification, which an extended DDC could become, tends to dive into the realm of indexing languages. A broad, general index language, which many are for pragmatic reasons based on collection size and scope, tends to "bubble up" into the realm of classification schemes. The distinction between classification and indexing ends up becoming situational and very fluid.
For what it's worth, I will suggest examination of the literature on Taxonomy, and the branches of Logic and Linguistics which deal, specifically, with the relationships of objects to each other.
Most respectfully yours,
Andrew
-----Original Message-----
From: Birger Hjørland [mailto:BH at db.dk]
Sent: Saturday, August 26, 2006 11:34 AM
To: Andrew Grove; Leonard Will; sigcr-l at asis.org
Subject: SV: [Sigcr-l] Exhaustivity and specifity of indexing
Answer to Andrew:
Yes, I believe the literature is comprehensive and answers most questions. My point was that Kora expected separate literatures about the specificity of indexing and of classification, whereas I propose that these are fundamentally the same. The next round was about the specificity of the indexing language versus the actual indexing/classification practice, where I suggested, following Cutter (1876), indexing as specifically as possible in the given system.
kind regards Birger
________________________________
From: sigcr-l-bounces at asis.org on behalf of Andrew Grove
Sent: Sat 26-08-2006 16:51
To: Leonard Will; sigcr-l at asis.org
Subject: Re: [Sigcr-l] Exhaustivity and specifity of indexing
Hello,
I am resisting the urge to leap in too quickly here. Kora's original message mentions the literature on the subject but says it does not suffice. True, there is a wealth of literature on the subject. So much of it, in fact, that I wonder in what manner it does not suffice. What is, forgive me, the more specific question the literature does not answer?
Most respectfully,
Andrew
Andrew Grove
Program Manager, Taxonomy
Knowledge Network Group
Microsoft Corporation
425 706-5557
-----Original Message-----
From: sigcr-l-bounces at asis.org [mailto:sigcr-l-bounces at asis.org] On Behalf Of Leonard Will
Sent: Saturday, August 26, 2006 6:34 AM
To: sigcr-l at asis.org
Subject: Re: [Sigcr-l] Exhaustivity and specifity of indexing
In message <FB64419FDA34834382771A20964B15CED27AF9 at amon.it.lth.se> on Thu, 24 Aug 2006, Koraljka Golub <kora at it.lth.se> wrote
>
>Does anyone know of any references or have any opinion about
>exhaustivity and specificity of classification, meaning assignment of
>classes from a classification scheme.
In message <73573C2DCB0154408D790B1E7EDB0C521B9E1F at ka-exch01.db.dk> on Sat, 26 Aug 2006, Birger Hjørland <BH at db.dk> wrote
>Dear Kora,
>I believe that you are making the wrong assumption that indexing and
>classification are different in this respect. If you take a concept from
>a controlled vocabulary (say, a thesaurus), this is in my opinion
>similar to taking a class from a classification system (which also
>represents a concept). So, the specificity of a term in a thesaurus
>depends on the number of terms given and the specificity of a class in
>a classification system depends on the number of classes given (the
>more terms/classes, the greater the specificity of applying a given
>term/class). It is worth considering, however, that although the overall
>specificity can be measured by counting the number of
>descriptors/classes, any given system will have a greater specificity
>in some areas compared to others (DDC, for example, is much more
>specific in Christianity compared to other religions).
I agree with what Birger says, but I think that Koraljka's question was not so much about the specificity provided in the scheme itself as about the specificity with which it is applied when classifying documents. For example, is it worthwhile to use the full specificity possible in DDC by adding all the possible common subdivisions, "divide-like" instructions and so on, or is it better to simplify by limiting class numbers to 3 (or 6 or whatever) digits?
The answer to this must be that it depends on the material being classified. The aim should be to classify specifically enough to make it easy for the user to scan through the items in a class. I usually think of this as meaning that a class should contain between 10 and 50 items.
If the collection is large, or concentrated in a single subject area, more specificity will be needed than if it is a small, general collection.
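The rule of thumb above can be turned into a rough check. The following Python sketch finds the shortest class-number length at which no class holds more than 50 items. It rests on several assumptions: class numbers are treated as plain digit strings (no decimal point, no synthesised notation), the sample collections are invented for illustration and are not real DDC assignments, and `largest_class` and `digits_needed` are hypothetical helper names. It checks only the upper bound of the 10-to-50 range.

```python
from collections import Counter


def largest_class(numbers, digits):
    """Size of the most crowded class when numbers are cut to `digits` digits."""
    return max(Counter(n[:digits] for n in numbers).values())


def digits_needed(numbers, max_items=50, min_digits=3):
    """Smallest truncation length at which no class holds more than max_items."""
    longest = max(len(n) for n in numbers)
    for digits in range(min_digits, longest + 1):
        if largest_class(numbers, digits) <= max_items:
            return digits
    return longest  # even full-length numbers leave a crowded class


# Invented data: a small general collection ...
small_general = ["510"] * 20 + ["520"] * 20 + ["610"] * 10
# ... and a larger collection concentrated in one subject area.
concentrated = ["616994"] * 40 + ["616995"] * 40 + ["616100"] * 30

print(digits_needed(small_general))
print(digits_needed(concentrated))
```

On these invented collections the small general one is served by three digits, while the concentrated one needs the full six before every class becomes scannable.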
Other considerations are:
a. Allowing for growth of the collection. You don't want to have to go back and re-classify if more material is added in a given subject area.
b. Compatibility with what is being done elsewhere. Do you share records, obtain them from elsewhere or merge them in a combined catalogue?
c. Provision of access from concepts that are scattered by the classification. These may come later in the citation order of combining facets in a synthesised class number, and if the number is truncated they will be lost.
d. Adequacy of the alphabetical index constructed to show where topics have been classed. It is seldom adequate to rely on the index published with the schedules, but far too often that is all that is provided. It will not show many synthesised numbers, and there is little point in creating these if you do not also create the means of finding them.
Exhaustivity is more a matter of subject analysis of the documents. Do you identify and record topics that are only treated incidentally in a document, or do you restrict indexing and classification to the main topics only? There is no simple answer, so much depending on the nature of the collection, the users, and the purpose of the catalogue.
Leonard Will
--
Willpower Information (Partners: Dr Leonard D Will, Sheena E Will)
Information Management Consultants Tel: +44 (0)20 8372 0092
27 Calshot Way, Enfield, Middlesex EN2 7BQ, UK. Fax: +44 (0)870 051 7276
L.Will at Willpowerinfo.co.uk Sheena.Will at Willpowerinfo.co.uk
---------------- <URL:http://www.willpowerinfo.co.uk/> -----------------
_______________________________________________
Sigcr-l mailing list
Sigcr-l at asis.org
http://mail.asis.org/mailman/listinfo/sigcr-l