[Sigia-l] Findability

Simon Wistow simon at thegestalt.org
Tue Jan 28 12:05:31 EST 2003


On Tue, Jan 28, 2003 at 11:23:49AM -0500, Listera said:
> > Since more and more web pages are becoming coherent 'sites' rather than
> > just collections of individual pages I think this is going to be more of
> > a problem.
> 
> Could you elaborate on this?

[ This is all just my opinion and observations, Your Mileage May Vary,
the value of your shares may go up and down. Your gnome is at risk if
you don't keep up repayments etc etc]

Well, basically way back when, webpages used to be just that - pages. 

Take my site for example

	http://thegestalt.org/simon/

full of angry, incoherent rants (warning, lots of swearing and
bitterness, probably not work safe) all contained within individual
pages. There may be cross-links between them but, essentially, all the
information about any given topic is on one page. All those different
pages could be distributed across different servers or under other URLs
and it wouldn't make a jot of difference to the coherence of the site.

However, other sites - mostly larger corporate ones, but even something
like http://thegestalt.org/flash/ - are a deliberate collection of pages
on the same topic. The information is split apart for better readability
(or findability, he says, ducking :) but collected together into the
site.

[ ASIDE : sorry if this is written in baby talk, I'm kind of clarifying
this for myself in my head as I go along ]

As such, a search engine that works by looking for phrases on individual
pages isn't going to be as good as one that tries to understand the
meaning of the whole site. However this is Hard (in the computer science
sense as well as the more usual English one). Humans do it even better
than algorithms but they're slow (puny humans!) and can't cover as much
ground as a crawling robot.
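
To make that concrete, here's a rough sketch in Python of the difference
(the URLs and page text below are made up purely for illustration): a
page-level search only finds pages that contain every query term
themselves, while a site-level search pools everything a site says
before matching, so a query whose terms are spread across several pages
of one coherent site still turns that site up.

import re
from collections import defaultdict
from urllib.parse import urlparse

# Toy corpus: URL -> page text.  All made up, just to show the shape of the idea.
pages = {
    "http://thegestalt.org/flash/intro.html": "getting started with flash",
    "http://thegestalt.org/flash/scripting.html": "animating things with actionscript",
    "http://example.com/big-page.html": "flash and actionscript animation, all on one page",
}

def terms(text):
    return set(re.findall(r"\w+", text.lower()))

def page_level_search(query, pages):
    # Score each page in isolation: it matches only if it contains every query term itself.
    q = terms(query)
    return [url for url, text in pages.items() if q <= terms(text)]

def site_level_search(query, pages):
    # Pool all of a site's pages together first, then match against the combined vocabulary.
    q = terms(query)
    sites = defaultdict(set)
    for url, text in pages.items():
        sites[urlparse(url).netloc] |= terms(text)
    return [site for site, vocab in sites.items() if q <= vocab]

print(page_level_search("flash actionscript", pages))
# -> only example.com's single big page matches
print(site_level_search("flash actionscript", pages))
# -> both sites match, because the flash site's pages collectively
#    cover both terms even though no single page does

Obviously real engines do far more than keyword matching; the Hard bit
is the grouping step - deciding what counts as "the whole site" and what
its pages collectively mean.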

So you're faced with a problem - more and more sites are being created
that split information apart in this way, or that generate content on
the fly from hidden information stores, which makes it difficult for
search crawlers to extract relevant information. On the other hand, the
web is getting much bigger, almost exponentially, which makes it hard
for humans to categorize it.

Interestingly, Google partnered with DMOZ (http://www.dmoz.org), which
attempts to provide an open version of Yahoo!'s directory by using the
human equivalent of distributed computing - chucking loads of people at
the problem. A million monkeys, etc. etc.

Like I said, it was just a thought I had and an impression I've gained
over the last few years.

Simon