[Sigia-l] Information Visualization

Tue Nov 11 16:55:55 EST 2003

> > the database has 1.26 million citation records...
> 
> You gotta be kiddin'. This wouldn't even register on my
> 'large-scale' radar screen. I was referring to datasets with
> BILLIONS of records, in the multiple terabyte range.

Several comments:

(a) Yes, 1.26 million is small compared to billions. But datasets
    with millions of records are much more common than datasets
    with billions.

    It's like search engines. Google deals with billions of pages and
    millions of queries/day. Most of us don't have the search
    problems that Google has.

    If you can scale to millions, or tens of millions, of records,
    you're going to solve a lot of problems for a lot of people.

(b) You don't need to display everything. Just because you have a
    billion records doesn't mean you need to display them.
    The trick is in selecting the subset of records to display.

    That's more a data-mining issue than a core infoviz problem.

(c) Real progress is being made. In five years, the authorlink case
    I cited has gone from needing a Cray to handle 500 documents to
    processing a million+ records in real time on average hardware
    (I'm not sure of the exact specs).
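
To make point (b) concrete: one simple way to pick a displayable
subset is to draw a uniform random sample in a single pass over the
data (reservoir sampling). This is only a sketch of one possible
selection strategy, not how authorlink works; the function name
sample_for_display and its parameters are made up for illustration.
A real system would more likely select by relevance to a query,
which is where the data-mining part comes in.

```python
import random


def sample_for_display(records, k, seed=None):
    """Return up to k records chosen uniformly at random from an iterable.

    Uses reservoir sampling, so the full dataset never has to fit in
    memory: only k records are held at any time.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1),
            # evicting a randomly chosen current member.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


# Hypothetical usage: pick 1,000 record ids out of 1.26 million to draw.
subset = sample_for_display(range(1_260_000), 1000, seed=0)
```

The point is only that the display cost depends on k, not on the
size of the underlying dataset.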

Don't dismiss the authorlink achievement just because it's merely
1.26 million records. This is one of the largest author indexes in
the world. The problem it solves is non-trivial.

The ISI citation index lists every author who has ever published in
the scientific literature (so far as ISI knows). As far as I know,
it is the largest such index in the world; the only author index
that might be larger is the author authority file at the Library of
Congress.

Criticizing authorlink for scaling to a mere 1.26 million records
obscures the real message: for the problem it is trying to solve
there is *no need* for it to scale any higher.

I agree that small datasets are not interesting.

And I also agree that datasets with billions of records are
interesting and cool, and that we're a long way from cracking that
nut.

But I *disagree* with the suggestions that (a) infoviz is a failure
unless it can deal with billions, and (b) solving a problem with a
million records isn't really useful and is a trivial problem.

--karl