[Sigia-l] using thesauri to improve search

Leonard Will L.Will at willpowerinfo.co.uk
Mon Jun 10 18:08:24 EDT 2002


In message <3D63713E at webmail.drexel.edu> on Mon, 10 Jun 2002, Michael 
Fry <mwf24 at drexel.edu> wrote
>
>So far, we've been working toward a faceted thesaurus that, where 
>appropriate, breaks multi-word concepts or phrases (e.g., "drug 
>addiction" and "low income youth") into discrete facets.

Be careful to recognise the distinction between "concepts" and 
"phrases".
Drug addiction is a single concept, which you may choose to label with a 
multi-word term such as "drug addiction" or a single word term such as 
"addiction". Low income youth, on the other hand, is a combination of 
two concepts: "people with low incomes" and "young people", which are 
distinct and each of which can have its own scope note in a thesaurus. 
You may choose a single or multi-word term to label each of these 
concepts; that does not affect the distinction between the concepts 
themselves.
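
As a rough sketch of this distinction, a concept can be modelled as a
record that carries its labels, so that a phrase resolves to one concept
or to several. The class, field names and scope notes below are invented
for illustration and are not taken from any particular thesaurus
software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A unit of meaning; the terms are only labels attached to it."""
    preferred_term: str        # may be one word or several
    scope_note: str = ""       # illustrative text only
    entry_terms: tuple = ()    # non-preferred labels for the same concept


DRUG_ADDICTION = Concept("drug addiction",
                         scope_note="(illustrative scope note)",
                         entry_terms=("addiction",))
LOW_INCOME = Concept("people with low incomes",
                     scope_note="(illustrative scope note)")
YOUNG_PEOPLE = Concept("young people",
                       scope_note="(illustrative scope note)",
                       entry_terms=("youth",))

# Hypothetical lookup from a phrase to the concept(s) it expresses.
PHRASES = {
    "drug addiction": [DRUG_ADDICTION],                 # one concept
    "low income youth": [LOW_INCOME, YOUNG_PEOPLE],     # two concepts
}


def concepts_in(phrase):
    """Return the preferred terms of the concepts a phrase expresses."""
    return [c.preferred_term for c in PHRASES[phrase.lower()]]
```

Changing the label of a concept changes only `preferred_term`; the number
of concepts a phrase maps to stays the same.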

>(FYI, we'll probably also build a browsable version of the terms, but 
>the initial goal is to improve search.)

It would be a good idea to do so, and to make this an integral part of 
the search interface, as Avi Rappoport has suggested. If you are using a 
controlled vocabulary I think it is best to let the users see what it 
is, so that they can choose the terms that best match their needs. Your 
search interface will be two-stage:

1. Map from the terms the user thinks of to the terms of the controlled 
vocabulary. You can use free text search techniques for this, including 
string matching, stemming, truncation and so on, to display possible 
terms from the controlled vocabulary for the user to choose from, with 
the option of navigating to broader or narrower terms, selecting 
subtrees, related terms and so on. The system should help the user to 
define and isolate the concepts in the enquiry and to combine terms to 
express each of them.

2. Use the chosen controlled vocabulary terms to retrieve documents of 
interest.
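
The two stages above might be sketched as follows. This is only a
minimal illustration: the vocabulary, the document index and the crude
truncation rule are all assumptions, and a real system would use a
proper stemmer and let the user navigate broader and narrower terms.

```python
import re

# Hypothetical vocabulary: preferred term -> entry terms (non-preferred).
VOCABULARY = {
    "drug addiction": ["substance abuse", "addiction"],
    "young people": ["youth", "adolescents"],
    "people with low incomes": ["low income", "poor people"],
}

# Hypothetical index: document id -> controlled terms assigned to it.
INDEX = {
    "doc1": {"drug addiction", "young people"},
    "doc2": {"people with low incomes", "young people"},
    "doc3": {"drug addiction"},
}


def truncate(word, length=5):
    """Crude truncation standing in for real stemming."""
    return word[:length]


def keys(text):
    """Lower-case a string, split it into words and truncate each one."""
    return {truncate(w) for w in re.findall(r"[a-z]+", text.lower())}


def suggest_terms(query):
    """Stage 1: offer controlled terms whose labels match the query."""
    wanted = keys(query)
    hits = []
    for preferred, entries in VOCABULARY.items():
        label_keys = keys(" ".join([preferred] + entries))
        if wanted & label_keys:
            hits.append(preferred)
    return sorted(hits)


def retrieve(chosen_terms):
    """Stage 2: return documents indexed with all of the chosen terms."""
    chosen = set(chosen_terms)
    return sorted(doc for doc, terms in INDEX.items() if chosen <= terms)
```

The user would see the output of `suggest_terms("addicted youths")`,
pick the terms that fit, and only then call `retrieve` with that choice.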

>For example:
>
>populations
> <by age>
>  adolescents
>  adults
>  youth
> <by economic status>
>  low income populations
>  middle class populations
>  working class populations
> <by condition>
>  drug addiction
>  glycemia
>  malnutrition

As an aside, this does not seem to be a very good example. Perhaps 
"populations" is the most used term in the subject area of the 
thesaurus, but I would prefer "people" as the broader term of 
"adolescents" and "adults". These are _kinds of_ people, not kinds of 
population.

"Youth" is not a kind of person or population; "young people" would be 
better, unless you mean "children" (scope notes presumably clarify 
this).

"Drug addiction" and "malnutrition" are not kinds of populations or 
kinds of people. They are either "social problems" or "medical 
conditions" or both, and should go in those facets. "Glycemia" is a 
medical condition. You would either have to list these concepts under 
more appropriate headings, or change the terms to something like "drug 
addicts", "people with glycemia" and "malnourished people", if that is 
what you mean.
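
One possible rearrangement along these lines is sketched below. The
facet names follow the suggestions above; the wording of the economic
status terms and the choice of which conditions count as social problems
are my assumptions, given only to show the structure.

```python
# Hypothetical facet layout: top-level facet -> sub-facets or terms.
FACETS = {
    "people": {
        "by age": ["adolescents", "adults", "young people"],
        "by economic status": [
            "people with low incomes",
            "middle class people",      # assumed wording
            "working class people",     # assumed wording
        ],
    },
    "medical conditions": ["drug addiction", "glycemia", "malnutrition"],
    "social problems": ["drug addiction", "malnutrition"],
}


def terms_in(node):
    """Yield every term below a facet node (a dict or a list)."""
    if isinstance(node, dict):
        for child in node.values():
            yield from terms_in(child)
    else:
        yield from node


def facets_of(term):
    """Return the top-level facets under which a term is listed."""
    return sorted(name for name, node in FACETS.items()
                  if term in set(terms_in(node)))
```

A term such as "drug addiction" then appears under both "medical
conditions" and "social problems", and under "people" not at all.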

>Specifically, if 'drug addiction' isn't submitted as a phrase (i.e. 
>wrapped in quotes), how does the search software, Inktomi, know that 
>users are looking for the 'drug addiction' term in our vocabulary?

I don't know how Inktomi works, but if it just presents the usual little 
dumb box saying "type your search here", then you will have to go 
through the two-stage process I have noted above to guide the user to 
use the controlled vocabulary properly.

>Is there something about search engine software that I'm 
>underestimating?

No, I think that most search engine interfaces are designed for simple 
text searches and don't allow for the greater power and functionality 
that controlled vocabularies provide.

>Do we have to design a more complex search UI in order to facilitate 
>the translation?

Yes.

>Should we be building a vocabulary that's pre-coordinated rather than 
>post-coordinated?

Pre-coordination may be helpful for browsing, but not for specific
searches. You can
have both by building a combined thesaurus and a faceted classification 
using the same terms. I would like to see more use made of the kind of 
interfaces that Avi Rappoport mentions to allow intelligent searching 
using a faceted thesaurus, with feedback at each stage so that the user 
can refine the search according to the results.
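
To illustrate having both from the same terms, the sketch below builds
post-coordinated postings (terms combined at search time) and
pre-coordinated browse headings (terms combined at indexing time) from
one set of document assignments. The document ids, the assignments and
the " : " heading separator are invented for the example.

```python
# Hypothetical assignments: document id -> controlled terms.
DOC_TERMS = {
    "doc1": ["young people", "drug addiction"],
    "doc2": ["young people", "people with low incomes"],
    "doc3": ["people with low incomes", "drug addiction"],
}


def build_postings(doc_terms):
    """Post-coordinated view: each single term -> set of documents."""
    postings = {}
    for doc, terms in doc_terms.items():
        for term in terms:
            postings.setdefault(term, set()).add(doc)
    return postings


def search(postings, *terms):
    """Combine terms at search time by intersecting their postings."""
    sets = [postings.get(term, set()) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []


def browse_headings(doc_terms):
    """Pre-coordinated view: one compound heading per term combination."""
    headings = {}
    for doc, terms in doc_terms.items():
        heading = " : ".join(sorted(terms))
        headings.setdefault(heading, []).append(doc)
    return headings


POSTINGS = build_postings(DOC_TERMS)
```

Here `search(POSTINGS, "young people", "people with low incomes")` finds
the same document that a user browsing would reach under the compound
heading "people with low incomes : young people".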

Leonard Will
-- 
Willpower Information       (Partners: Dr Leonard D Will, Sheena E Will)
Information Management Consultants              Tel: +44 (0)20 8372 0092
27 Calshot Way, Enfield, Middlesex EN2 7BQ, UK. Fax: +44 (0)870 051 7276
L.Will at Willpowerinfo.co.uk               Sheena.Will at Willpowerinfo.co.uk
---------------- <URL:http://www.willpowerinfo.co.uk/> -----------------


