[Sigia-l] metadata redux
Karl Fast
karl.fast at pobox.com
Mon Jul 21 23:33:39 EDT 2003
> Yes, as long as you can trust it.
Clifford Lynch wrote an excellent and thoroughly readable paper
about trust in information retrieval systems on the Web.
Highly recommended:
When Documents Deceive: Trust and Provenance as New Factors
for Information Retrieval in a Tangled Web
JASIST, 52(1), 12-17
http://www.cs.ucsd.edu/~rik/others/lynch-trust-jasis00.pdf
To quote a key piece:
Traditional information retrieval systems make several fundamental
environmental assumptions that are so basic it sounds strange and a
little crazy to question them. In particular:
(1) The documents that an IR system "sees" (e.g., in the indexing,
retrieval, or ranking process) are the same ones that a user
would retrieve if he or she chose to select those documents. How
could it be otherwise? These documents are part of a database
that is an integral component of the information retrieval
system, and the system is internally consistent; every read
operation on a given document should produce the same result.
(2) Metadata (surrogate records) for documents can be taken at face
value as honest attempts to accurately describe documents, and
should be treated this way in retrieval systems. A retrieval
system either works with documents or with surrogates; if it
works with surrogates, the relationship between surrogate and
document is outside the scope of the IR system proper. For all
practical purposes, the surrogates are the documents in this
scenario.
These two assumptions are just different aspects of the same general
view of the world. In one case, the creation/extraction/computation
of metadata is done within the IR system as part of indexing or
retrieval (indexing is just precomputation for retrieval in some
sense); in the other case, the development of metadata (or at least
the first step) takes place "outside" the IR system, and it is
assumed that it is done in a disinterested and accurate fashion
(bibliographic citations, abstracts, etc), whether by computer
algorithms or human beings. It is considered legitimate to discuss
how much access or retrieval quality is lost by replacing documents
with these externally produced surrogates (e.g., debates about full
text versus surrogate retrieval), but the assumption is always that
the creators of surrogates do the best job they can, subject perhaps
to some fundamental constraints about economics, time, protection of
intellectual property, computational resources, size of surrogate,
etc.
These core design assumptions are *completely at odds* with the
realities of the distributed information environment found on the
World Wide Web today.
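
To make assumption (1) concrete, here is a minimal Python sketch of how it
can fail on the Web. It is an illustration only, not something from Lynch's
paper: the Index class, the cloaking_server function, the user-agent strings,
and the example.org URL are all hypothetical. The idea is that an indexer
records a fingerprint of what it was served, and a later check compares that
with what an ordinary reader is served for the same URL.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a stable digest of the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class Index:
    """Hypothetical index that remembers what the crawler was served."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}

    def record(self, url: str, text: str) -> None:
        # Store only a fingerprint of the text seen at indexing time.
        self._seen[url] = fingerprint(text)

    def matches(self, url: str, text_served_to_user: str) -> bool:
        # True only if the reader gets exactly what the indexer saw.
        return self._seen.get(url) == fingerprint(text_served_to_user)


def cloaking_server(url: str, user_agent: str) -> str:
    """Hypothetical server that answers crawlers and readers differently."""
    if "crawler" in user_agent:
        return "Peer-reviewed survey of metadata standards."
    return "Buy cheap watches now!"


index = Index()
url = "http://example.org/page"

# Indexing time: the crawler fetches the page and records what it saw.
index.record(url, cloaking_server(url, "example-crawler"))

# Retrieval time: a reader follows the search result to the same URL.
consistent = index.matches(url, cloaking_server(url, "browser"))
print("indexer and reader saw the same document:", consistent)
```

In a closed database the final check would always succeed, which is why the
assumption sounds too obvious to state. On the open Web the server decides
what each requester sees, so the check can fail.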
--karl