[Sigia-l] Re: translating taxonomies (responses)

Peter Van Dijck peter at poorbuthappy.com
Tue Nov 9 02:51:29 EST 2004


Below are the responses I received on- and off-list, for posterity.

Thanks to all and cheers!
Peter

----------
Dear Peter,
The ISO standard is possibly the most authoritative source, but you might
find more up to date information at www.ifla.org/VII/s29/wgmt.htm concerning
the work of a multilingual thesaurus working group. If you are Dutch (as
your name suggests) and if you are based in the Netherlands (which your URL
hides) you will see that the Chairman is Gerhard Riesthuis - which might be
helpful.
You might also be interested to know that the British Standards Institution
is currently updating its two standards on monolingual and multilingual
thesaurus construction. (These two standards were originally transcribed
into the two current ISO standards). The new work will be a single standard
- but in five parts. The part covering multilingual thesauri (now named
Interoperability and discussing the general problem of mapping) is nearly
complete.
You might also be interested to look at the ILO and EUROVOC thesauri to see
how they are coping with multilingualism, but less "disciplined" taxonomies
may be more problematical.
Best wishes,
Alan

Alan Gilchrist
----------
If you are trying to make sense of the standards, then I would highly
recommend:

Jean Aitchison, Alan Gilchrist, David Bawden. Thesaurus construction and
use: a practical manual. 4th ed. ASLIB: London, 2000. xiv, 218 pp.
ISBN 0-85142-446-5

There was also a conference sponsored by Multites at the end of last
year, where many speakers addressed the issues of translation. The
presentations, which include some case studies, are available from:
http://www.multites.com/conference03.htm

Regards
Vanessa
---
On Amazon, they have "Thesaurus Construction and Use: A Practical Manual"
for sale for $70 ("usually ships within 1 to 3 weeks"):
http://www.amazon.com/exec/obidos/tg/detail/-/1579582737
It came recommended for making sense of the standards. According to the
TOC, it has 6 pages on multilingual thesauri.
-----------
Hi, Peter:

I can think of 2 areas in LIS that might be fruitful for you:

1. Stuff on indexing and abstracting.  Cleveland and Cleveland, in their
book "Introduction to Indexing and Abstracting," have an early chapter that
breaks down some of the problems you encounter translating concept lists
into controlled vocabulary terms.  Similarly, Lancaster's venerable book,
"Vocabulary Control for Information Retrieval," discusses some core
challenges in taxonomy design and implementation.

2. Metadata crosswalk studies.  I have my students read a 1998 NISO piece
by St. Pierre and LaPlant which summarizes many of the issues:

http://www.niso.org/press/whitepapers/crsswalk.html

I did some work, some years ago, in mapping LC subject headings to terms in
the Alcohol and Other Drug Thesaurus.  A few things I noted:

1. As you say, not all categories exist in all cultures.  The International
Society for Knowledge Organization has numerous papers talking about
problems where a word in 1 language has no correlate in another.

Even within the same language, not all communities find specific categories
of interest.  In my case, LCSH had a heading for "Product label," which was
perfectly fine for general libraries.  But addiction research requires a
distinction between a product label (which tells you how much riboflavin
you're getting in your corn flakes) and a warning label (akin to that soft
voice on the TV ads that whispers, beneath the pastoral images of sunlit
seas, that some people taking this medication have experienced loss of
their limbs).

A lot of library taxonomies tend to be less specific than we'd sometimes
like, simply because, once you start making categories at a low level, it
becomes very difficult to keep them up.  Having a term for "dog" is one
thing, and even "retriever," but when you start creating terms like "Nova
Scotia duck tolling retriever," you become obligated to have terms for all
those other kinds of retrievers.

2. Taxonomic relationships for human reading are sometimes very different
from those for machine use.  The Library of Congress Subject Headings, for
instance, will show you all kinds of wild and wooly "narrower terms," under
the assumption that a human eye is looking down the list and selecting the
1 or 2 that seem interesting.  "Children," for instance, has about 25
narrower terms, some of which are ... well, weird.

Taxonomies that support electronic databases, on the other hand, tend to be
more rigorously constructed.  NLM's Medical Subject Headings (MeSH), for
instance, are designed so that broad terms can be "exploded," enabling all
the subordinate terms to be searched as well.  When you're jumping from a
hit list of 2 to a hit list of 35, you want more assurance that the system
is retrieving something remotely related to your query.
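
As a rough illustration of what "exploding" a broad term means, here is a
minimal Python sketch. The tiny hierarchy and the function name explode are
made up for the example; they are not real MeSH headings or any NLM
interface.

```python
# Toy hierarchy: broader term -> list of narrower terms.
# The terms are invented for illustration, not taken from MeSH.
NARROWER = {
    "dogs": ["retrievers", "terriers"],
    "retrievers": ["golden retriever", "Nova Scotia duck tolling retriever"],
    "terriers": [],
}


def explode(term, narrower=NARROWER):
    """Return the term plus every term below it in the hierarchy."""
    found = []
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles or terms with two parents
        seen.add(current)
        found.append(current)
        # reversed() so children come off the stack in their listed order
        stack.extend(reversed(narrower.get(current, [])))
    return found


# A search on "dogs" would then be run against every term in this list.
print(explode("dogs"))
```

The expansion is only as trustworthy as the hierarchy: every narrower term
has to be a genuine kind of its parent, which is why taxonomies built for
this use tend to be stricter.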

My GUESS is that mapping between 2 taxonomies for machine use would be very
difficult, but could be very rewarding, while mapping to a human-centered
taxonomy would be easier, but also of limited use.
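
To make the mapping problem concrete, here is a small Python sketch of a
crosswalk table, under loud assumptions: only the "Product label" heading
comes from the example above, while the target terms, the second entry, and
the function names are hypothetical.

```python
# Hypothetical crosswalk: source heading -> candidate terms in the target
# vocabulary. A general heading may correspond to several specialist terms.
CROSSWALK = {
    "Product label": ["product label", "warning label"],
    "Alcoholism": ["alcoholism"],  # invented one-to-one entry for contrast
}


def map_term(source_term, crosswalk=CROSSWALK):
    """Return the candidate target terms, or an empty list if unmapped."""
    return list(crosswalk.get(source_term, []))


def needs_review(source_term, crosswalk=CROSSWALK):
    """Flag headings a person should check: unmapped or one-to-many."""
    return len(map_term(source_term, crosswalk)) != 1


print(map_term("Product label"))      # two candidates, so ambiguous
print(needs_review("Product label"))  # True
```

The one-to-many and unmapped cases are where the human effort goes; a table
like this only records the decisions, it does not make them.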

The work of Gerhard Riesthuis (inter-language thesaurus design) and of
Rebecca Green (formalizing relationships for cross-database and
cross-domain information retrieval) would probably be helpful to you.
Google them both.

As you can tell, I have a pile of classification and indexing papers that
I'm trying to avoid marking...

Cheers,
Grant
------------
Here's a citation to one of the best articles I've read on this subject:

	Multilingual Thesaurus Construction (by Michele Hudon)
	Information Services and Use (ISSN: 0167-5265), Vol. 17, No. 2/3, 1997,
	pp. 111-123.

It's not just a language translation issue. The way people categorize
information also varies between languages and cultures, so the fundamental
organization schemes may need to be different.

See also: Women, Fire, and Dangerous Things by George Lakoff

Peter Morville
-----------------
Hello Peter et al.:

First, sorry for my English (I'm Spanish).

Second, about translating taxonomies, maybe I can contribute another
point, mentioned here by Richard Wiggins a few days ago: search log
analysis.

When translating, if you take search engine optimization into account, it
is advisable to check the most commonly searched expressions related to
your own web content (Pareto's law and Zipf's law changed my life too,
Richard). The best tool for doing so (if you know of anything better,
please tell me) is Google AdWords, which lets me facet the list of most
commonly searched keywords by country and language, something really
important for this issue.

I'll use an example:

Let's look at http://www.comunitatvalenciana.com/ , a Spanish regional
tourism website. You can see the translations at the top, marked with
flags (no need to criticize the site, I can see your hunting expressions;
it is just an example  ;-)

The question is this: when I want this site to be found by Spanish users,
I can find lots of really specific expressions, for local or regional
targets, about local tourism, because they already know the local names:
"tourism beaches calpe" (Calpe is a little town), "renting costablanca
villas calpe alicante" (Alicante is a province), and things like that.

When I focus on the French market, or China (French or Chinese people
looking for tourism in this part of Spain, Valencia, south of Barcelona),
I can see a trend: the further the country is from my site (Spain), the
more general the search keywords are: "beaches Spain", "tourism spain",
and not so much about Calpe the village or Alicante the
province/region/county. People don't know much about specific locations,
so they look for general locations and use more general keywords.

So in this example, I ask myself whether I have to translate the
taxonomies literally, or use instead (or in addition) a more general
keyword, synonym, whatever. We use these synonyms in meta tags and
within the text too.
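
A minimal Python sketch of that decision, assuming you already have search
counts per market from a keyword tool or your own logs. All the numbers,
keywords, market codes, and the function name rank_keywords are invented
for illustration.

```python
# Invented search counts per market; real figures would come from a
# keyword tool or from search logs.
SEARCH_COUNTS = {
    "es": {"playas calpe": 900, "playas costa blanca": 300},
    "fr": {"plages calpe": 40, "plages espagne": 1200},
}

# For each market: the literal translation of the category label and a
# more general alternative.
CANDIDATES = {
    "es": ["playas calpe", "playas costa blanca"],
    "fr": ["plages calpe", "plages espagne"],
}


def rank_keywords(market):
    """Order a market's candidate labels by how often people search them."""
    counts = SEARCH_COUNTS.get(market, {})
    return sorted(
        CANDIDATES[market],
        key=lambda keyword: counts.get(keyword, 0),
        reverse=True,
    )


for market in ("es", "fr"):
    print(market, rank_keywords(market))
```

With these made-up figures the specific label ranks first for the local
market and the general one ranks first for the distant market, which is
the pattern described above. Keeping both labels, in ranked order, covers
the "instead or in addition" question.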

It could be the same for other subjects, so better to take a look at
Google AdWords and then decide. Does that look reasonable?

My two cents. Cheers,

Jorge Serrano Cobos





