[Sigia-l] costs of classification

Laura Norvig lauran at etr.org
Fri Nov 8 13:01:38 EST 2002


Well, I find myself in the position of being able to use PARTS of an 
existing thesaurus (ERIC), but needing to add significant chunks of 
vocabulary (in this case related to community service and volunteering).

It's difficult because existing thesauri from the library world 
change slowly, whereas the nomenclature of the people in the field we 
are trying to serve changes FAST!

At some point I would like to be able to measure the ROI (there's 
that pesky term again) of manually assigning keywords vs. leveraging 
some kind of natural language processing. The latter is always going 
to fall somewhat short unless it builds in a way to point to keywords 
that are not present in the actual document.
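
To make that last point concrete, here is a rough sketch of what "pointing 
to keywords not present in the document" could look like: a lookup from 
entry terms (the words people in the field actually use) to preferred 
thesaurus terms. The term pairs and the function name suggest_keywords are 
made up for illustration; they are not taken from ERIC or from any 
particular tool.

```python
# Hypothetical mini-thesaurus: entry (non-preferred) term -> preferred term.
# These pairs are illustrative assumptions, not real ERIC descriptors.
ENTRY_TERMS = {
    "volunteering": "Volunteers",
    "volunteerism": "Volunteers",
    "community service": "Service Learning",
    "service-learning": "Service Learning",
}


def suggest_keywords(text, entry_terms=ENTRY_TERMS):
    """Return the preferred terms whose entry terms occur in the text."""
    lowered = text.lower()
    # Simple substring matching; a real system would tokenize and stem.
    matches = {
        preferred
        for entry, preferred in entry_terms.items()
        if entry in lowered
    }
    return sorted(matches)


doc = "Students logged 40 hours of community service and volunteering."
print(suggest_keywords(doc))  # ['Service Learning', 'Volunteers']
```

Neither suggested keyword appears literally in the sample sentence, which 
is the gap plain term extraction leaves. Note that the lookup table is 
itself a small thesaurus that someone still has to build and maintain.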

However, it's not really the assigning of keywords that is 
cost-intensive; it's the building and maintaining of the thesaurus.

All that aside, I do want to mention that the little product MultiTes 
has been quite handy for building the thesaurus. Too bad it's 
PC-based and I have to run it on Virtual PC.

Laura

>On Mon, 4 Nov 2002, Laura Norvig wrote:
>
>>  Christina: thanks for pointing out the difference between
>>  successfully using facets to describe a domain like wine, vs. a large
>>  domain like, um, the entire web.
>>
>>  That got me thinking about Marti's Flamenco project. The art (or
>>  architecture) described there is a larger info space than wine, but
>>  still somewhat contained, and perhaps was able to use existing
>>  keywords adapted from LC.
>>
>>  What I'm wondering is, how many hours did it take to develop the
>>  thesaurus and the facet structure and catalog the records? How do we
>>  get project managers to build in enough time to do such detailed work
>>  on web projects?
>
>The cost grows with the size of the required thesaurus, but so do
>the chances that you will be able to use an existing thesaurus.  Much of
>the need for building local thesauri is based on the more specific focus
>of smaller sites, in terms of subject coverage and/or audience.
>
>Andrew



