[Sigia-l] Don't submit websites to search engines?

Listera listera at rcn.com
Tue May 18 14:51:51 EDT 2004


Alex Wright:

> reputable sources like Yahoo! and Brightplanet have pegged the size of the
> Deep Web at upwards of 100 billion documents.

Based on what evidence? It's easy to throw around round numbers. Case in
point:

Brightplanet (which bills itself as "widely credited with popularizing the
'Deep Web'" and which syndicated your article for their site) is in the
business of selling tools to mine what it's hyping as the "Deep Web."

At one point on their site, they claim:

"These generally topical databases contain from 10 to 500 times more content
than can be obtained through standard search engines."

Then it's:

"The Deep Web is made up of hundreds of thousands publicly accessible
databases and is approximately 500 times bigger than the surface Web."

The spread between 10 and 500 times is gigantic, so gigantic as to be
meaningless. The next thing we "know from reputable sources" is that there's
a "Deep Web" fully 500 times larger and that search engines index less than
1% of the Web. All of this is conjecture.

Nobody is questioning the fact that a significant portion of the Web is
unindexed and inaccessible, and that it would be better if it weren't. I'm
all for Brightplanet, Yahoo, or anyone else for that matter making things
accessible, but let's not get lost in the hype of numbers thrown about with
abandon.

To fetch data, we are moving from primitive "window scraping" to published
SOA/web-services APIs, normalizing and rationalizing information access.
There will be a more fundamentally sound public architecture for accessing
dynamic data, and, hopefully, proprietary/ad hoc data-aggregation solutions
will be history. The problem is not one of technology.
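To make the contrast concrete, here is a minimal Python sketch of the two
approaches. The example.com URLs, the HTML markup, and the "price" element
are made up purely for illustration; any real service would define its own
contract.

  import re
  import urllib.request
  import xml.etree.ElementTree as ET

  # "Window scraping": fetch a rendered HTML page and dig the value out
  # with a pattern match, which breaks whenever the page layout changes.
  html = urllib.request.urlopen("http://example.com/quotes.html").read().decode()
  match = re.search(r'<td class="price">([\d.]+)</td>', html)
  price = match.group(1) if match else None

  # Published web-service API: the same data exposed as a structured XML
  # document under an explicit contract, so parsing is simple and stable.
  response = urllib.request.urlopen("http://example.com/api/quote?symbol=XYZ").read()
  price = ET.fromstring(response).findtext("price")

The point is architectural, not about any particular wire format: once the
data is published behind an explicit interface, aggregation no longer depends
on fragile, ad hoc parsing of presentation markup.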

Ziya
Nullius in Verba 
