[Sigia-l] Comparing search engines

Listera listera at rcn.com
Thu Jun 19 21:30:10 EDT 2003


"Lyle_Kantrovich at cargill.com" wrote:

> Not sure that I understand exactly what you're asking.  You point out a
> lot of aspects of the "system":

I'm actually referring to the mundane issue of a DB returning predictable
results. In other words, if you had a million loose, unstructured PDFs and
text docs and did a search, you really don't know a priori what you'll get.
But if you had a DB with a million records all indexed according to *some*
criteria, the results are predictable, as you are querying against known
indexes or discrete columns of data. The latter is pre-structured, the
former is not.

Generally speaking, a 'search engine' is a disguised (SQL) query editor to
the DB, if all the data is in one, as the original post indicated. By and
large SQL is standardized, so the results are predictable. But a 'search
engine' to an unstructured pile of data depends largely on the nature of the
algorithms used, and thus the results can be quite variable: Google's
results are not the same as AltaVista's.
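
To make the contrast concrete, here is a rough sketch in Python. Everything
in it is made up for illustration: the table, the sample texts and both
scoring functions are assumptions of mine, not how any real engine ranks.
The structured query can only ever return the rows that match the indexed
column, while the two toy rankers disagree about the same pile of text.

```python
import sqlite3

# Structured case: an exact-match query against an indexed column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, author TEXT, year INTEGER)")
conn.execute("CREATE INDEX idx_author ON docs (author)")
conn.executemany(
    "INSERT INTO docs (id, author, year) VALUES (?, ?, ?)",
    [(1, "smith", 2001), (2, "jones", 2002), (3, "smith", 2003)],
)

def structured_search(author):
    """Return the ids of all rows whose author column equals `author`."""
    rows = conn.execute(
        "SELECT id FROM docs WHERE author = ? ORDER BY id", (author,)
    ).fetchall()
    return [row[0] for row in rows]

# Unstructured case: the same texts, ranked by two hypothetical algorithms.
TEXTS = {
    "a": "search search search engines and other unrelated filler words here",
    "b": "search engines",
    "c": "a note about databases",
}

def _rank(term, score):
    """Order documents by descending score, dropping those that score zero."""
    scored = [(score(text.split(), term), doc_id) for doc_id, text in TEXTS.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ranked if s > 0]

def rank_by_count(term):
    """Toy ranker 1: raw number of occurrences of the term."""
    return _rank(term, lambda words, t: words.count(t))

def rank_by_density(term):
    """Toy ranker 2: occurrences divided by document length."""
    return _rank(term, lambda words, t: words.count(t) / len(words))

print(structured_search("smith"))   # [1, 3]
print(rank_by_count("search"))      # ['a', 'b']
print(rank_by_density("search"))    # ['b', 'a']
```

Same query, same documents, two defensible orderings: that is the
variability I mean.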

Now, you bring up the issue of user intentions, which obviously is the holy
grail. In a general context, I don't believe it's solvable. If you have a
very large pool of input you can try to identify patterns that are
statistically meaningful, like Google. If you have a small, closed system or
an intranet, then 'best bets' may have some usefulness. I prefer a
two-stage approach, where the first inquiry is further clarified by an
additional set of user-selectable filters. (I find, for example, Getty
archives' <http://creative.gettyimages.com/source/home/home.asp> secondary
Clarification filters very useful.)
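
A minimal sketch of what I mean by two stages, in Python. The records,
field names and function names are all hypothetical, and this is not a
description of how Getty's site works. Stage one is a plain keyword match;
stage two shows the user which filter values exist in those hits and then
narrows by whatever they pick.

```python
from collections import Counter

# Hypothetical image records; the fields are invented for this example.
IMAGES = [
    {"id": 1, "keywords": {"bridge", "city"}, "orientation": "horizontal", "color": "bw"},
    {"id": 2, "keywords": {"bridge", "fog"}, "orientation": "vertical", "color": "color"},
    {"id": 3, "keywords": {"bridge", "night"}, "orientation": "horizontal", "color": "color"},
    {"id": 4, "keywords": {"beach"}, "orientation": "horizontal", "color": "color"},
]

def first_pass(term):
    """Stage one: every record tagged with the search term."""
    return [img for img in IMAGES if term in img["keywords"]]

def available_filters(results, field):
    """Count the values of `field` among the hits, to offer as filter choices."""
    return Counter(img[field] for img in results)

def refine(results, **filters):
    """Stage two: keep only hits matching every filter the user selected."""
    return [
        img for img in results
        if all(img[field] == value for field, value in filters.items())
    ]

hits = first_pass("bridge")
choices = available_filters(hits, "orientation")
narrowed = refine(hits, orientation="horizontal", color="color")
print([img["id"] for img in narrowed])   # [3]
```

The point of the second stage is that the user, not the algorithm,
supplies the missing intent.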

The fundamental issue of 'search engines' is that no algorithm can
substitute for our brains. But an engine that can statistically poll a very
large pool of brains can get pretty close. Unfortunately, if you are not
Google or Yahoo, the size of that sampling pool is almost never large enough
to produce meaningful results.
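
As a rough illustration of the sample-size problem, here is a sketch in
Python that ranks results by how often past users clicked them, and declines
to answer when there are too few observations. The click-log format, the
function name and the threshold of 30 are arbitrary assumptions of mine,
chosen only to show the idea.

```python
from collections import Counter

def popular_results(click_log, query, min_samples=30):
    """Rank documents for `query` by past clicks.

    `click_log` is a list of (query, clicked_document) pairs.
    Returns None when fewer than `min_samples` clicks were recorded,
    since a ranking drawn from a handful of users is mostly noise.
    """
    clicks = Counter(doc for q, doc in click_log if q == query)
    if sum(clicks.values()) < min_samples:
        return None
    return [doc for doc, _ in clicks.most_common()]

# A large pool yields a usable ranking; a tiny intranet-sized pool does not.
big_log = [("pricing", "page_a")] * 40 + [("pricing", "page_b")] * 25
small_log = [("pricing", "page_a")] * 2 + [("pricing", "page_b")] * 1

print(popular_results(big_log, "pricing"))     # ['page_a', 'page_b']
print(popular_results(small_log, "pricing"))   # None
```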

This analogy is useful for understanding search engine issues: There's an
almost infinite variety of news on the Internet, and I use an RSS news
aggregator/reader as well as GoogleNews to parse through dozens of news
sources daily. But, still, my 'search engine' for (general) news is the New
York Times.

Ziya
Nullius in Verba 




