No subject


Tue Dec 6 21:10:36 EST 2011


various combinations/scenarios, without loading actual records at the DB or
transporting them to the browser. It's like 'free' dotcom money :-)

> If however you want to pre-calculate for all combinations of a set of terms
> this gets ridiculous, so you'd have to do it on demand.

There's no pre-calculation at all, other than creating the sets. Set
operations are fast enough on the fly.
 
> If you have an open connection it's quick, but it still substantially
> increases the number of searches done by the server. Since you are only
> doing the cheap part of the search, this is OK.

Exactly. All the 'searching' you're doing here is not on actual records but
on measly pointers within sets. That's about as efficient as you're likely
to get.
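
As a rough illustration only (not the original implementation; the term
names, record IDs and the count_matches helper are all made up for this
sketch), counting matches for any combination of terms on the fly could look
like this in Python:

```python
# Hypothetical sets: each term maps to the IDs ("pointers") of matching records.
# No record data is held here, only the IDs.
term_sets = {
    "red": {1, 2, 3, 5, 8},
    "cotton": {2, 3, 4, 8, 9},
    "large": {3, 8, 10},
}

def count_matches(terms):
    """Return how many records match ALL the given terms, without loading any record."""
    if not terms:
        return 0
    # Start from the smallest set so the intermediate result stays small.
    sets = sorted((term_sets.get(t, set()) for t in terms), key=len)
    result = set(sets[0])
    for s in sets[1:]:
        result &= s  # in-place set intersection
    return len(result)

print(count_matches(["red", "cotton"]))
print(count_matches(["red", "cotton", "large"]))
```

Nothing is pre-calculated per combination; each count is a handful of set
intersections over IDs.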
 
> Keeping connections open and computing intersection sizes on the fly is
> the key, and is what I had missed in writing my previous post.  I was
> thrown off by your mentioning pre-computed results.  I take it that
> referred to your inverted index?

The sets, containing pointers to records.

>> Anyhow, the trick here is that you never touch the DB until and unless you
>> need to. Sets allow you to create virtual collections of taxonomies mapped
>> against actual records with just a few bytes of info on each. Needless to
>> say there are some things to watch out for in this architecture, depending
>> on what exactly you might be doing.
> 
> Am I still missing something?  Are you making a distinction between these
> sets and your DB?

Yes.

> Wherever you're storing these sets, it probably amounts to much the same
> B-Tree index of set elements?

What's costly in dealing with a DB is the creation of indexes and
loading/reading through records. The point is to avoid this as long as you
can. A set is a flat array in memory that simply contains pointers to actual
records. Loading, say, 3 million records into memory is prohibitive; loading
a few thousand sets, each containing nothing but pointers a few bytes in
size, is painless. Loading the actual records once the set is pared down to
optimal size is instantaneous because, unlike searching through an index,
the pointers (in the resulting set) take you directly to the records.
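
A minimal Python sketch of that idea, under stated assumptions: the
fixed-width record layout, the in-memory byte stream standing in for the
record store, and the names in_stock, on_sale, intersect_sorted and
load_records are all hypothetical, chosen only to show pointers leading
straight to records.

```python
from array import array
import io

RECORD_SIZE = 16  # hypothetical fixed-width record, in bytes

# Stand-in for the record store: ten fixed-width records in one byte stream.
store = io.BytesIO(
    b"".join(f"record-{i:02d}".encode().ljust(RECORD_SIZE) for i in range(10))
)

# Each set is a flat, sorted array of unsigned ints (typically 4 bytes each).
# The values are record numbers, i.e. the "pointers".
in_stock = array("I", [1, 3, 4, 7, 9])
on_sale = array("I", [0, 3, 5, 7, 8])

def intersect_sorted(a, b):
    """Merge-style intersection of two sorted pointer arrays."""
    out = array("I")
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def load_records(pointers):
    """Fetch only the records named by the pared-down set, seeking directly to each."""
    records = []
    for p in pointers:
        store.seek(p * RECORD_SIZE)  # the pointer gives the offset; no index lookup
        records.append(store.read(RECORD_SIZE).decode().rstrip())
    return records

matches = intersect_sorted(in_stock, on_sale)
print(load_records(matches))
```

The record store is touched only in load_records, and only for the records
that survive the intersection.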

Now, you can keep the sets in the main DB or in an app server if you have a
persistence layer or in a totally separate DB, if you want maximum
performance.

Remember I had to design this architecture in the mid-90s when RAM, CPU,
storage and bandwidth were at a premium. I had to find a very fast way of
dealing with a potentially very large number of records. Sets and set
operations allowed me to do just that.

Best,

Ziya



