[Sigia-l] The Deep Web

Christina Wodtke cwodtke at eleganthack.com
Wed Mar 10 19:04:04 EST 2004


great article on some of the IR problems the Web presents, and some of the
ways people are trying to solve them.


In search of the deep Web
The next generation of Web search engines will do more than give you a
longer list of search results. They will disrupt the information economy.
- - - - - - - - - - - -
By Alex Wright



March 9, 2004  |  When Yahoo announced its Content Acquisition Program on
March 2, press coverage zeroed in on its controversial paid inclusion
program, whereby customers can pony up in exchange for enhanced search
coverage and a vaunted "trusted feed" status. But lost amid the inevitable
search-wars storyline was another, more intriguing development: the
unlocking of the deep Web.

Those of us who place our faith in the Googlebot may be surprised to learn
that the big search engines crawl less than 1 percent of the known Web.
Beneath the surface layer of company sites, blogs and porn lies another,
hidden Web. The "deep Web" is the great lode of databases, flight schedules,
library catalogs, classified ads, patent filings, genetic research data and
another 90-odd terabytes of data that never find their way onto a typical
search results page.

Today, the deep Web remains invisible except when we engage in a focused
transaction: searching a catalog, booking a flight, looking for a job.
That's about to change. In addition to Yahoo, outfits like Google and IBM,
along with a raft of startups, are developing new approaches for trawling
the deep Web. And while their solutions differ, they are all pursuing the
same goal: to expand the reach of search engines into our cultural, economic
and civic lives.

As new search spiders penetrate the thickets of corporate databases,
government documents and scholarly research databanks, they will not only
help users retrieve better search results but also siphon transactions away
from the organizations that traditionally mediate access to that data. As
organizations commingle more of their data with the deep Web search engines,
they are entering into a complex bargain, one they may not fully understand.

Case in point: In 1999, the CIA issued a revised edition of "The Chemical
and Biological Warfare Threat," a report by Steven Hatfill (the bio-weapons
specialist who became briefly embroiled in the 2001 anthrax scare). It's a
public document, but you won't find it on Google. To find a copy, you need
to know your way around the U.S. Government Printing Office catalog
database.

The world's largest publisher, the U.S. federal government generates
millions of documents every year: laws, economic forecasts, crop reports,
press releases and milk pricing regulations. The government does maintain an
ostensible government-wide search portal at FirstGov -- but it performs no
better than Google at locating the Hatfill report. Other government branches
maintain thousands of other publicly accessible search engines, from the
Library of Congress catalog to the U.S. Federal Fish Finder.

"The U.S. Government Printing Office has the mandate of making the documents
of the democracy available to everyone for free," says Tim Bray, CTO of
Antarctica Systems. "But the poor guys have no control over the upstream
data flow that lands in their laps." The result: a sprawling pastiche of
databases, unevenly tagged, independently owned and operated, with none of
it searchable in a single authoritative place.

If deep Web search engines can penetrate the sprawling mass of government
output, they will give the electorate a powerful lens into the public
record. And in a world where we can Google our Match.com dates, why
shouldn't we expect that kind of visibility into our government?

When former Treasury Secretary Paul O'Neill gave reporter Ron Suskind 19,000
unclassified government files as background for the recently published
"Price of Loyalty," Suskind decided to conduct "an experiment in
transparency," scanning in some of the documents and posting them to his Web
site. If it weren't for the work of Suskind (or at least his intern), Yahoo
Search would never find Alan Greenspan's scathing 2002 comments about
corporate-governance reform.

The CIA and Dick Cheney notwithstanding, there is no secret government
conspiracy to hide public documents from view; it's largely a matter of
bureaucratic inertia. Federal information technology organizations may not
solve that problem anytime soon. The deep Web search engines may just solve
it for them.

For almost as long as there has been a Web, there have been Web search
engines. So one might reasonably ask why the deep Web has remained out of
view for so long.

Traditionally, Web search engines have grown their databases through simple
brute force. All the major search engines survey the Web by dispatching
legions of simple programs known as spiders, crawlers, robots or harvesters
to trace their way through the endless chains of hyperlinks that tie Web
pages together.
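
In code, that brute-force approach amounts to little more than a
breadth-first walk over hyperlinks. The Python sketch below is purely
illustrative -- the seed URL, page limit and error handling are placeholder
choices, not how any particular engine's crawler actually works -- but it
shows why only linked-to, statically addressable pages ever get indexed.

# A minimal sketch of the brute-force crawl described above: start from a
# seed URL, fetch each page, and follow every hyperlink found in the HTML.
# Real crawlers add politeness delays, robots.txt checks and parallelism.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=50):
    """Breadth-first crawl that only ever reaches static, linked-to pages."""
    seen, queue, fetched = {seed_url}, deque([seed_url]), 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable page; skip it
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen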

That method works well for the static HTML pages and predictable URLs that
make up the upper strata of the Web. But the deep Web resides mostly in
databases, shielded by a lattice of registration gateways, session cookies
and dynamically generated links. Unless an organization consciously chooses
to share its data, by opening up an API or Web services feed -- the way
Amazon books show up in a Google search -- then the data will likely remain
unseen by most users.

New search engines now under development are exploring methods for
penetrating the database barriers. BrightPlanet has developed a formula for
brokering queries across multiple deep Web data sources at once, aggregating
the results and letting users compare changes to those results over time -- 
a process known as "differencing."

That capability has attracted considerable interest from certain government
agencies that shall remain nameless. "Some of our clients are spooky," says
BrightPlanet COO Duncan Wittes. Other BrightPlanet customers include state
governments, competitive intelligence researchers, and political campaigns
whose "oppo" teams may want to search not only for what a candidate has said
but also for what he or she may have "unsaid" over time.
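
"Differencing" in this sense is conceptually simple: take snapshots of the
aggregated results at different times and compare them. The Python sketch
below is a toy illustration under that assumption; the record format and the
example URL are invented, and it says nothing about BrightPlanet's actual,
unpublished formula.

# Compare two snapshots of aggregated search results taken at different
# times and report what appeared, vanished, or changed.
def diff_snapshots(old, new):
    """old/new map a result URL to the text captured for that result."""
    added = {url: new[url] for url in new.keys() - old.keys()}
    removed = {url: old[url] for url in old.keys() - new.keys()}
    changed = {url: (old[url], new[url])
               for url in old.keys() & new.keys()
               if old[url] != new[url]}
    return added, removed, changed


# Example: a statement present last month has been "unsaid" this month.
march = {"example.gov/speech": "I support the measure."}
april = {"example.gov/speech": "I have always opposed the measure."}
print(diff_snapshots(march, april))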

Soon-to-launch Dipsie is pursuing an alternative approach to unlocking the
dynamic Web, by deploying a kind of souped-up spider that penetrates
barriers like forms, drop-down lists, dynamically generated URLs and session
cookies. Dipsie's spider works by emulating a "well-formed user" that, from
the Web site's point of view, behaves just like a real flesh-and-mouse user,
enabling the spider to cache the kind of data typically visible only to a
human user.
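
As a rough illustration of the "well-formed user" idea, a scripted client
can hold session cookies and submit a search form the way a browser would,
reaching dynamically generated result pages that a link-following spider
never sees. The sketch below uses the Python requests library; the site URL
and form field names are hypothetical, and this is not Dipsie's actual
implementation.

# A scripted "user": keep session cookies, load the form, submit a query.
import requests

session = requests.Session()  # holds session cookies across requests

# Step 1: load the search form page, acquiring any cookies the site sets.
# (The catalog URL and field names below are made up for illustration.)
session.get("https://catalog.example.gov/search")

# Step 2: submit the form with a query, as a browser would on "Submit".
response = session.post(
    "https://catalog.example.gov/search",
    data={"query": "chemical and biological warfare", "results_per_page": 50},
)

# Step 3: the returned HTML is a dynamically generated page that an ordinary
# link-following spider would never reach; it can now be cached and indexed.
print(response.status_code, len(response.text))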

Other search developers, including IBM, Google and Intelliseek, are
exploring their own approaches to mining the deep Web. But in the wake of
this week's announcement, Yahoo is now the elephant in the living room.

Yahoo won't discuss the specifics of how its search algorithms work. But the
company does acknowledge that its Content Acquisition Program will give
paying customers a more direct pipeline into its search database. Yahoo
Search vice president Tim Cadogan says, "Ultimately we want to search the
whole Web for free," but he nonetheless sees the CAP program as a way of
enabling "direct, structured relationships with content providers" to
"deliver a higher-quality search experience for users."

It takes a fine ear for P.R. nuance to distinguish "higher-quality search
experience" from "better results." Yahoo has issued copious disclaimers
assuring non-paying customers that they will receive the same algorithmic
treatment as paying ones. But the company acknowledges that paying customers
will likely benefit from a "quality review" designed to help companies
improve their chances of showing up in search results.

"Cadogan claims that people who send money can't count on getting better
results," Bray says. "Do you believe that? I don't."

Every year, the University of California at Davis pays the publisher John
Wiley about $14,000 for a subscription to the Journal of Comparative
Neurology, which publishes breaking research in its field. That may sound
like a steep price tag for what is essentially a magazine subscription, but
it's a tiny dollop of the $20 million the U.C. libraries spend every year on
scholarly journals.

Scientific, technology and medical publishing constitutes an $11 billion
industry. And like the rest of the publishing business, scholarly publishers
have undergone massive consolidation in the past two decades. Once the
province of small university presses and boutique academic imprints,
scholarly journals now emanate from giant publishing conglomerates such as
Elsevier, Thompson and Blackwells.

"The well-established subscription model that evolved around print journals
is a cash cow," says Peter Lyman, professor at the UC-Berkeley School of
Information Management and Systems. "One that the publishers are terrified
of damaging accidentally, through online publishing."

But unlike trade-book publishers, who count on Amazon and Barnes & Noble to
move physical units of the latest Harry Potter tome, scholarly publishers
rely increasingly on electronic journal subscriptions and paid search
services to fuel their revenues. Their customers -- mostly academic
institutions and research organizations -- insist on providing Web access to
journal content. To meet that demand while protecting their valuable data
stores, the large publishers have responded by rolling out private
permission-based search gateways to the contents of their journals, usually
under highly restrictive license terms and tightly managed IP access.

But those pricey journal databases now compete for the attention -- and the
search queries -- of students and faculty with ready access to Google, Yahoo
and the rest. And while the public search engines may not find every article in
the journal literature, a growing portion of published research also finds
its way out onto the Web.

For example, when gene researchers identify a new DNA sequence, they usually
submit the sequence to the National Institutes of Health's GenBank -- a
public deep Web resource -- before submitting it to journals for
publication.

Legislation pending in Congress would ensure that all research funded by
federal taxpayers be made available free of charge to the public, over the
Internet. Meanwhile, new cooperative academic initiatives like the Public
Library of Science and the National Science Digital Library are trying to
expand access to scholarly research, opening up more indirect competition
for the proprietary publishing systems.

And as more scholarship finds its way onto the Web, page-ranking algorithms
are also providing an alternative quality rating system to the traditional
scholarly peer review that journals have always employed.

While page ranking won't replace the scholarly review process anytime soon,
the expansion of public Web search engines will put downward pressure on the
premium that publishers can command. "I don't think [page ranking] is more
reliable," says Lyman, "but I do think it's perceived as legitimate. The
cost of creating formally quality-controlled information may drive people to
consider lower-cost alternatives."

Lyman adds, "When the public begins to use and accept non-qualified
information -- relying on Google or other things to perform that function,
like Technorati -- there are beginning to be quality mechanisms out there
that are user-centric or generated by users."

How will scholarly publishers react to the encroaching competition from deep
Web search engines? "The publishing industry is not famous for being
progressive, forward thinking or fast moving," Bray says. "But if they
ignore [deep Web search], they could find themselves in a situation like the
record companies, where someone finds a way to subvert them."

- - - - - - - - - - - -


The deep Web contains some 500 times more data than the surface Web; but to
regard the deep Web as simply a bigger and better version of the current Web
is to overlook the essential feature of databases, which is structure. Most
of the deep Web is structured or semi-structured data, as opposed to the sea
of flotsam HTML that bobs across the surface Web.

"Once you get into the deep Web, all of these data sources often have much
more metadata available," says Bray. "This could be a huge opportunity for
companies looking at new ways of presenting search results."

Deriving search results from structured data sets will open up new
possibilities for search engines. In all likelihood, search engines will
gradually abandon the flat listings-style result pattern you see on a
typical 12-page Google result. (And who ever gets to the 12th page, anyway?)
Not only could deep Web search engines present more useful and manipulable
views into structured data but, given some basic lingua franca of structural
vocabularies, they could also aggregate those results in endlessly
permutable combinations.

"It's ridiculous to think that the one-dimensional result list is going to
be the universal paradigm for all imaginable searches forever," Bray says.
"If you type 'bicycle' into Google, you get a list of results having to do
with bicycles. But that result is, in a very important way, a lie. It
ignores the fact that some of these things are about bicycle racing, some
are about bicycle manufacturing. It ignores things that Google might not
even know about."
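
One way to picture what structured metadata buys you: if each result carries
even a single subject field, an engine can group hits into facets instead of
flattening them into one list. The snippet below is a deliberately simple
sketch with invented records and field names, not a description of any
existing engine's result format.

# Group search hits by a subject facet rather than returning a flat list.
from collections import defaultdict

results = [
    {"title": "Tour de France preview", "subject": "bicycle racing"},
    {"title": "Carbon frame tolerances", "subject": "bicycle manufacturing"},
    {"title": "Spring classics calendar", "subject": "bicycle racing"},
    {"title": "Hub assembly line audit", "subject": "bicycle manufacturing"},
]

facets = defaultdict(list)
for record in results:
    facets[record["subject"]].append(record["title"])

for subject, titles in sorted(facets.items()):
    print(f"{subject} ({len(titles)})")
    for title in titles:
        print("  -", title)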

As deep Web search engines unearth the structures of large data sets and
make those structures visible across organizations, they will create a
powerful incentive for organizations to invest in more consistent,
predictable structures (a trend already manifest in the growth of Web
services and in Yahoo's search quality guidelines). In exchange for the
benefits of increased exposure, these organizations will yield another level
of autonomy.

While government and academic institutions may generate the greatest volume
of deep Web content, corporations undoubtedly generate the most monetary
value in Web data: customer databases, product catalogs, technical knowledge
bases and myriad other data sources with quantifiable business value.

Over the last decade, companies have invested heavily in Web infrastructure,
including countless local search engines. While many companies already
outsource their public Web site search functions to companies like Google,
many also have developed specialized search engines for their own deep Web
data, like technical support databases.

Those investments make plenty of sense when that data won't readily show up
in a public Web search. But as deep Web searchers penetrate these gateways,
will companies continue to see the value of investing in their own public
interfaces?

In the near term, deep Web search engines will likely dampen company
expenditures on local search initiatives. But in the longer term, the
changes may prove more far reaching. "The quality and ubiquity of Web search
engines hides the fact that most organizations have really crappy search
mechanisms," Bray says. "I think that's creating a tension within
organizations."

As public search engines continue to supplant the role of organizations' own
information-retrieval systems -- be they search databases, call centers or
sales engineers -- once internal-facing systems will assume increasingly
outward-facing roles. "When the ability to develop different messages for
different audiences is curtailed by universal availability," says Gartner
analyst Whit Andrews, "the nature of the message, its format and associated
issues become paramount."

No one expects IT departments to go out of business, but the external
pressures of deep Web search will almost certainly force long-term changes
in the role, structure and autonomy of local IT organizations as they
gradually lose direct control over customer transactions.

- - - - - - - - - - - -


Every search query is a unit of desire. Search companies, like all
businesses, exist by transforming desire into hard currency. As deep Web
search engines insinuate themselves into deeper and deeper levels of
organizations, they will not only offload search traffic, they will trigger
a series of massive disruptions in the information economy.

If you buy the Cluetrain maxim that "hyperlinks subvert hierarchy," then
surely deep Web search engines will amplify that subversion. As search
engines extend their reach deeper into and across organizations, the
boundaries between those organizations will feel more fluid -- both to
consumers and to the organizations themselves. The first thing most of us
notice may be better search results.

Somewhere inside that complex apparatus of desire and fulfillment, a
transformation is taking place, one whose effects we can barely foresee.


- - - - - - - - - - - -

      About the writer
      Alex Wright is a writer and user experience architect in San
Francisco, Calif.


