[Sigia-l] RE: using thesauri to improve search

Tue Jun 11 08:30:23 EDT 2002

> -----Original Message-----
> From: sigia-l-request at asis.org [mailto:sigia-l-request at asis.org]
> Sent: Monday, June 10, 2002 3:36 PM
> To: sigia-l at asis.org
> Subject: Sigia-l digest, Vol 1 #119 - 30 msgs

mwf24 at drexel.edu wrote:
> The plan is for these terms to be assigned to relevant documents, so that a
> document might be simultaneously indexed with "drug addiction,"
> "adolescents" and "low income populations," and thus be retrieved whenever
> somebody's query includes the appropriate, matching concepts.
>
> The problem (I think) is that users aren't likely to search with the
> vocabulary we build, and aren't likely to explicitly specify phrases in
> their queries. So what we've got is a post-coordinate vocabulary trying to
> match up with a mix of pre- and post-coordinate query types.
>
> If that's the case--and ignoring issues of synonymy for the moment--how do
> we map multi-word, multi-concept queries such as [drug addiction teens] to
> the appropriate, individual indexing terms, i.e. "drug addiction" and
> "adolescents"? Specifically, if 'drug addiction' isn't submitted as a
> phrase (i.e. wrapped in quotes), how does the search software, Inktomi,
> know that users are looking for the 'drug addiction' term in our
> vocabulary?

You have several discrete challenges - all associated with the major goal of
"understanding what the user wants." In this example, you need to determine
that a search for "drug addicted teens" should find items that sit at the
intersection of the facets "drug addiction" and "adolescents." Having some
experience with AltaVista, I can explain how they do it - and some of their
limitations.

1. Understand the searcher's original search terms. In this case, that means
knowing that a search for "drug addicted teens" is really looking for "drug
addicted" and "teens." This is done through phrase detection. AltaVista
provides a phrase dictionary - you supply all the phrases you want to
detect, and - if the user types those terms in that order - it will find
the phrase and treat it as if the user had explicitly wrapped it in quotes.

2. Understand that "drug addicted" is (almost) the same as "drug addiction."
This is done through a stemming dictionary, which cross-references different
forms of the same root word. In this case, "drug addiction" and "drug
addicted" (or even "addiction" and "addicted") can be stemming translations
of each other. This would expand the search to:
("drug addicted" OR "drug addiction") AND "teens"

3. Understand that "teens" is a synonym for "adolescents." This is where the
thesaurus comes in. You can create a thesaurus where "teens" and
"adolescents" are synonyms, thus expanding your search further to:
("drug addicted" OR "drug addiction") AND ("teens" OR "adolescents").
And actually, stemming and synonyms would further expand your search terms
(the singular "teen" for example).
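
Here is a rough Python sketch of the three steps above, in case it helps to
see them chained together. It is not AltaVista's API: the dictionary
contents and the names detect_phrases, expand and build_query are made up
for illustration, and it assumes the idealized case where a detected phrase
can be passed on to the stemming and synonym steps.

```python
# Illustrative dictionaries; contents and formats are assumptions.
PHRASES = {("drug", "addicted"), ("drug", "addiction")}
STEM_GROUPS = [{"drug addicted", "drug addiction"}, {"teen", "teens"}]
SYNONYM_GROUPS = [{"teens", "adolescents"}]


def detect_phrases(query):
    """Step 1: group adjacent words that appear in the phrase dictionary."""
    words = query.lower().split()
    units, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            units.append(" ".join(pair))
            i += 2
        else:
            units.append(words[i])
            i += 1
    return units


def expand(unit, groups):
    """Return the unit plus the other members of any group containing it."""
    result = [unit]
    for group in groups:
        if unit in group:
            result += sorted(group - {unit})
    return result


def build_query(query):
    """Chain phrase detection, stemming and synonyms into a Boolean query."""
    clauses = []
    for unit in detect_phrases(query):
        variants = expand(unit, STEM_GROUPS)            # step 2: stemming
        for variant in list(variants):                  # step 3: synonyms
            for synonym in expand(variant, SYNONYM_GROUPS):
                if synonym not in variants:
                    variants.append(synonym)
        quoted = " OR ".join(f'"{v}"' for v in variants)
        clauses.append(f"({quoted})" if len(variants) > 1 else quoted)
    return " AND ".join(clauses)


print(build_query("drug addicted teens"))
# ("drug addicted" OR "drug addiction") AND ("teens" OR "teen" OR "adolescents")
```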

Again, back to my experience with AV - it supports each one of those
linguistic tools, and I'd imagine many other search engines can do the same.
The limitations (so far) are:
* It does not cascade the results of one linguistic expansion into the next
- so if you have phrase detection first (and find "drug addicted") that
phrase doesn't get sent to the stemming or synonym tool - just the
individual terms.
* It doesn't permit phrases in the stemming and synonym dictionaries - so
you can have a synonym match of "drug=narcotic" but not "drug=unprescribed
narcotic."
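
A toy Python model of those two limitations, for what it's worth. The
dictionaries and the names add_synonym and run_tools are hypothetical, and
it only assumes what is described above: each tool sees the original single
words, and dictionary entries must be single words. How AltaVista combines
the outputs internally is not modeled.

```python
# Toy model only: dictionary contents and function names are assumptions,
# not AltaVista's actual formats or API.
PHRASES = {("drug", "addicted")}
STEMS = {"addicted": ["addiction"], "teens": ["teen"]}
SYNONYMS = {"teens": ["adolescents"]}


def add_synonym(table, term, synonym):
    """Second limitation: reject any entry containing a multi-word phrase."""
    if " " in term or " " in synonym:
        raise ValueError("phrases are not allowed in this dictionary")
    table.setdefault(term, []).append(synonym)


def run_tools(query):
    """First limitation: every tool sees only the original words, never the
    phrase that phrase detection found."""
    words = query.lower().split()
    phrases = [
        " ".join(words[i:i + 2])
        for i in range(len(words) - 1)
        if tuple(words[i:i + 2]) in PHRASES
    ]
    stems = {w: STEMS[w] for w in words if w in STEMS}
    synonyms = {w: SYNONYMS[w] for w in words if w in SYNONYMS}
    return {"phrases": phrases, "stems": stems, "synonyms": synonyms}


add_synonym(SYNONYMS, "drug", "narcotic")   # accepted: single words only
result = run_tools("drug addicted teens")
```

In this model the phrase "drug addiction" never shows up in any output:
"addicted" is only stemmed as a single word, so the detected phrase "drug
addicted" is searched exactly as typed.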

Mario Sanchez