[Sigia-l] Distributed thesaurus?

Sun Sep 22 08:58:12 EDT 2002

* Lars Marius Garshol
|
| This is what topic maps do. You've just described a straightforward
| topic map application. (I call it that because you are restricting
| the association types and probably also the topic types.)

* Eric Scheid
| 
| Tell us more about this, specifically the linking of topic maps
| together.

I'm happy to. Apologies for the length of this reply. I hope it's
worth reading even so.

Topic maps have a well-defined process for what is called merging,
that is, taking two topic maps and automatically producing a single
coherent topic map from them. Using this you can import pieces of
topic maps into other topic maps, merge full topic maps, and so on.

The result is that if you maintain a thesaurus about one subject and I
maintain another about a slightly overlapping, but different, subject
we can have our thesauri match up provided we either take care to use
the right declarations for merging or are willing to invest human
effort in merging them.

The merging rules are actually quite simple: every topic in a topic
map represents a single subject. The topic is just a construct used in
topic maps to represent subjects, which are the things we really want
to talk about. A subject can be anything: you, me, this email, this
mailing list, the concept of "love", etc.

The key issue in merging is knowing when two different topics (whether
in the same topic map or in different topic maps) represent the same
subject. If they do, you want to merge the topics so that they become
a single topic.

There are two ways to declare what the subject of your topic is:

  - if the subject is, say, a web page, you just point to it with a
    URI and say "that's my subject". The URI is then known as the
    subject address.

  - if the subject is a person or something else which is not digital
    (say "love") you obviously can't point at it. What you can do is
    to use a subject indicator, which is an information resource that
    describes the subject to a human. The URI of that resource then
    becomes a subject identifier since it identifies your subject.

The idea is that whenever two topics have the same subject identifier
or address, they represent the same subject and must be merged. If they
don't, they may *still* represent the same subject, but now you need
either application-specific information or human intervention to
determine it.
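
To make the rule concrete, here is a minimal sketch in Python. It is
not taken from any topic map engine: the Topic class, the merge_topics
function and the example.org URIs are all made up for illustration,
and a real implementation would also have to merge associations,
occurrences, scoped names and so on. It only shows the core idea that
topics sharing a subject identifier, or sharing a subject address,
collapse into one topic.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    # Hypothetical, stripped-down topic: just names and identity URIs.
    names: set = field(default_factory=set)
    subject_identifiers: set = field(default_factory=set)  # URIs of subject indicators
    subject_addresses: set = field(default_factory=set)    # URIs of subjects that are resources


def same_subject(a, b):
    """True if the two topics share a subject identifier or a subject address."""
    return bool(a.subject_identifiers & b.subject_identifiers
                or a.subject_addresses & b.subject_addresses)


def merge_topics(*topic_maps):
    """Return a list of topics in which no two topics share an identity URI."""
    merged = []
    for topic in (t for tm in topic_maps for t in tm):
        # Work on a copy so the input topic maps are left untouched.
        current = Topic(set(topic.names),
                        set(topic.subject_identifiers),
                        set(topic.subject_addresses))
        # Absorbing one topic can add URIs that match yet another topic,
        # so keep scanning until nothing more matches.
        absorbed = True
        while absorbed:
            absorbed = False
            for other in merged:
                if same_subject(current, other):
                    current.names |= other.names
                    current.subject_identifiers |= other.subject_identifiers
                    current.subject_addresses |= other.subject_addresses
                    merged.remove(other)
                    absorbed = True
                    break
        merged.append(current)
    return merged


# Made-up URIs, for illustration only.
NORWAY = "http://example.org/psi/country#NO"
LOVE = "http://example.org/psi/love"
PAGE = "http://example.org/some-page.html"

yours = [Topic({"Norway"}, {NORWAY}), Topic({"love"}, {LOVE})]
mine = [Topic({"Norge"}, {NORWAY}),
        Topic({"Some page"}, subject_addresses={PAGE})]

result = merge_topics(yours, mine)
norway = next(t for t in result if NORWAY in t.subject_identifiers)
```

Here the two topics for Norway end up as a single topic carrying both
names, while the other topics are left alone. Note that the sketch
compares identifiers only with identifiers and addresses only with
addresses: the same URI used once as an address and once as an
identifier refers to two different subjects (the resource itself
versus what the resource describes).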

Of course, the problem here is that unless the parties merging topic
maps have had prior communication, so that they use the same subject
identifiers, their topic maps will not merge automatically. There is
an activity
known as the "published subjects" activity which is working on this
problem, essentially by leveraging existing code sets (ISO 3166, 639,
15924, Ethnologue, UNSPSC, UN Locode, etc etc etc) to create large
published subject sets with subject identifiers for all kinds of
subjects.

The idea is also that this allows anyone who so wishes to sit down and
publish subject indicators for any domain they are interested in. This
requires very little effort, but can yield enormous interoperability
benefits since anyone can then use those indicators to identify their
subjects, without having to communicate directly.
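
As a rough illustration of why an existing code set helps, consider
this small Python sketch. The base URI and the country_identifier
function are hypothetical (a real published subject set would define
its own URIs); the point is only that two parties who have never
spoken arrive at the same subject identifier, because both start from
the same ISO 3166 code.

```python
# Hypothetical base URI, for illustration only.
PSI_BASE = "http://psi.example.org/iso3166/"


def country_identifier(alpha2_code):
    """Build a subject identifier from a two-letter ISO 3166 country code."""
    code = alpha2_code.strip().upper()
    if len(code) != 2 or not code.isalpha():
        raise ValueError("expected a two-letter country code: %r" % alpha2_code)
    return PSI_BASE + code


# Two independent authors, each working from the same code set.
yours = country_identifier("no")
mine = country_identifier(" NO ")
same = yours == mine
```

Since both topics carry the same identifier, they will be merged
without any coordination between the authors beyond agreeing on the
published subject set.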

In fact, I personally merge topic maps almost every single day. I do
a lot of work on automatically generating topic maps, and being able
to merge previously generated topic maps, data from different
sources, etc. is enormously useful when doing so.

I could go on about this, but I'll stop here or nobody will ever read
all of this. :)

| Most of the obvious links I find for topic maps deals with the
| internal self-contained usage scenarios, not interoperability.

Admittedly, the topic maps literature is thin on documentation of that
issue. I'm hoping my upcoming XML.com article will alleviate that at
least slightly. (Which means this discussion is really useful for
me. :-)

| Also, are there defined protocols for exchanging chunks of these
| topicmap/thesaurus structures, or is it assumed that if I want to
| examine your topic map I'll download the whole thing?

Nobody expects you to do that, but on the other hand there is no
standard "Remote Topic Protocol" at the moment. Various experiments
have been done, but so far there is nothing really ready and tested.

I know Robert Barta of Bond University has experimented with that, and
I have the design and implementation of such a protocol in my head,
but there are quite a few things in the writing-down queue ahead of
it. I have the XTM Fragment Interchange part written down.

-- 
Lars Marius Garshol, Ontopian         <URL: http://www.ontopia.net >
ISO SC34/WG3, OASIS GeoLang TC        <URL: http://www.garshol.priv.no >