[Sigia-l] search results and thesauri
Andrew McNaughton
andrew at scoop.co.nz
Sat May 25 06:29:58 EDT 2002
On Sat, 25 May 2002, Ziya Oz wrote:
> >> Sets are pre-calculated and thus set operations are instantaneous, requiring
> >> zero search of actual records. Once a small number is reached, a button
> >> would perform the actual search across records.
> >
> > This works for a very small number of terms, but the server load and page
> > size required to deliver this pre-calculated info to a client side tool
> > increases dramatically with the number of terms involved.
>
> Actually, that's not really a bottleneck at all (unless, of course, you were
> Google :-). I designed the app above to work with up to 65,000
> terms/categories. But it could be more. Actual word-set being used at the
> time was somewhere around 7,000.
I meant the number of terms suggested as related to the user's last
search. I took it that you were pre-calculating the size of all set
intersections amongst these. combinatorial functions grow quickly which
was my concern.
> Before a 'document' was included into the DB, it was parsed. All distinct
> words minus common terms like "the" and others from an in-house list were
> discarded. What was left was matched against the existing category list; if
> there were new words in this doc, they were added to the list. Each word
> represented a set. Each set contained a pointer to the record/doc that
> contained that word.
Ie a typical inverted index with stopwords as in most small to medium
search engines.
> So when the user entered a word, all documents that contained that word
> could be returned instantaneously and *without* doing a DB search. Set
> operations like union and intersect are virtually instantaneous and, again,
> they require no trip to the DB.
Whether you think of it as a DB or not, you do have to look up an index.
The lookup you do is essentially the same as for any other search except
that you only retrieve (the size of) a list of document id's, not titles,
summaries and so forth. To get the size of the intersection, your search
engine would still be considering all members of the set internally, even
if it didn't return this info.
For smaller collections, it's retrieving the details that bites, so the
cost of getting the intersection set sizes is minimal, depending how many
times you do it. If however you want to pre-calculate for all
combinations of a set of terms this gets ridiculous, so you'd have to do
it on demand.
> Since each set 'knows' what and how many documents are included in it, you
> can return that info immediately to the user. If they perform Boolean ops on
> sets, again, you can return the resulting set operation result immediately
> to the browser. (Of course, in a client/server situation this is
> transparent, extremely fluid and pretty much in real time.)
If you have an open connection it's quick, but still substantially
increases the number of searches done by the server since you are only
doing the cheap part of the search this is OK.
> Now when the user is done with all the selections/combinations and ends up
> with a small number of indicated docs, then he can submit it for the actual
> search of the database records. Until then, it's mostly passing a small
> amount of numbers back and forth, between the browser and the app/server.
Keeping connections open and computing intersection sizes on the fly is
the key, and is what I had missed in writing my previous post. I was
thrown off by your mentioning pre-computed results. I take it that
referred to your inverted index?
> Anyhow, the trick here is that you never touch the DB until and unless you
> need to. Sets allow you to create virtual collections of taxonomies mapped
> against actual records with just a few bytes of info on each. Needless to
> say there are some things to watch out for in this architecture, depending
> on what exactly you might be doing.
Am I still missing something? Are you making a distinction between these
sets and your DB? Wherever you're storing these sets, it probably
ammounts to much the same B-Tree index of set elements?
Andrew
More information about the Sigia-l
mailing list