ART:Lawrence (NYT), Web Size

Fri Jul 9 14:54:22 EDT 1999

In the interests of getting the figures right, here is a summary I wrote
for Information Today.  I have a copy of the Nature article because I was
asked to comment on it by a reporter.  Note that I couldn't resist
editorializing, though:

Newsbreak:  New Study of WWW search engine coverage.

By  Susan Feldman

A new study of coverage of the indexable web by web search engines states
that, as of February 1999, only 16% of the web is indexed by the combined
search engines.  The study, by Steve Lawrence and C. Lee Giles of NEC
Research Institute, appeared in the July 8, 1999 issue of Nature
(pp. 107-109).
Lawrence and Giles made headlines a year ago with their study of overlap
among search engines which showed that each web search engine indexed a
fairly discrete corner of the WWW, with little overlap among them.  In that
study, Lawrence and Giles reported that the combined coverage by all
web search engines was about 60% of the web.  Their conclusion from
comparing the two studies is that the web search engines are not keeping
pace with the growth of the web.

Some figures:
Using randomly generated web addresses, they estimate a total of 16 million
web servers in existence.  Of these, they estimate that roughly 2.8 million
are publicly accessible and present indexable information for web search
engines to collect.  There are, they say, 800 million publicly indexable
web pages, accounting for 6 terabytes of text (not image) data.  Much of
the web is not indexable, since it resides behind query boxes, in
non-indexable databases, or specifies that web crawlers and spiders may not
index the server's contents (robots exclusion policy).  (A study we did at
Datasearch in 1997 indicated that approximately 50% of the web was not
indexable.)
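
The arithmetic behind figures like these can be sketched in a few lines
of Python.  This is only an illustration, not the authors' procedure: the
function name, the sample counts, and the assumed address space of 2**32
(the full IPv4 range) are hypothetical.  Only the 16 million, 2.8 million,
800 million and 6 terabyte figures come from the summary above.

```python
def estimate_servers(sample_size, responding, address_space=2**32):
    """Scale the share of sampled addresses that answered as web servers
    up to the whole address space (a plain proportion estimate)."""
    if sample_size <= 0 or not 0 <= responding <= sample_size:
        raise ValueError("need sample_size > 0 and 0 <= responding <= sample_size")
    return responding / sample_size * address_space


# Hypothetical sample, chosen only so the result lands near 16 million.
estimate = estimate_servers(sample_size=1_000_000, responding=3_725)

# Figures reported in the summary above.
total_servers = 16e6
indexable_servers = 2.8e6
indexable_pages = 800e6
text_bytes = 6e12  # assumption: 1 terabyte = 10**12 bytes

indexable_share = indexable_servers / total_servers     # share of servers that are indexable
pages_per_server = indexable_pages / indexable_servers  # pages per indexable server
bytes_per_page = text_bytes / indexable_pages           # text bytes per indexable page

print(f"estimated servers: {estimate / 1e6:.1f} million")
print(f"indexable share of servers: {indexable_share:.1%}")
print(f"pages per indexable server: {pages_per_server:.0f}")
print(f"text per page: {bytes_per_page:.0f} bytes")
```

Taken at face value, the reported figures imply that about 17.5% of
servers are publicly indexable, with roughly 286 pages per indexable
server and about 7,500 bytes of text per page.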

As noted above, the authors used random web URLs to estimate the total
number of web servers in existence, which they put at approximately 16
million at present.  The profile of the web indexed by all search engines
together was:

  83%    commercial sites
   6%    scientific or educational sites
   1.5%  pornographic
   1.2%  government
   2.8%  health
   2.3%  personal
   1.4%  community
   0.8%  religion
   1.9%  societies

Metadata:  Only about a third of web servers contain metadata on their home
pages.  And only 0.3% used the Dublin Core.  The lack of standardized tags
was quite evident:  they found 123 distinct tags.

One disturbing, but not surprising, finding is that "popular sites" (sites
which have many links to them) are much more likely to be indexed than sites
which have few links to them.  Since web spiders follow links in order to
discover new sites, it is harder for a site with no links to it to be found
in a web crawl.  The study also found that search engines are behind,
taking months to index a new page.  The median age of "new" pages
was 57 days.  They found that Infoseek, despite its smaller size, has a
higher probability of indexing random new sites.  This bears out another
recent study at the Wharton School which called Infoseek an "overachiever".
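
The point about spiders following links can be shown with a toy crawl.
This is a minimal sketch, not any engine's actual crawler: the link graph,
the page names and the crawl() function are all hypothetical.  It walks
the graph breadth-first from a seed page and reports which pages were
never reached.

```python
from collections import deque


def crawl(links, seeds):
    """Return the set of pages reachable from the seeds by following links."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "portal": ["news", "shop"],
    "news": ["shop", "portal"],
    "shop": ["portal"],
    "new-lab-page": ["portal"],  # links out, but nothing links to it
}

found = crawl(links, ["portal"])
missed = set(links) - found

print("found:", sorted(found))
print("missed:", sorted(missed))
```

In this toy graph the page with no inbound links is the only one the
crawl misses, even though it links to a popular page itself.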

Ranking the search engines
Lawrence and Giles used 1,050 real queries from NEC researchers in order to
test web engine coverage.  Of the 16% of the web covered by the web
engines, here's a breakdown of coverage by each web search engine:

Northern Light  38.3%
Snap            37.1%
Alta Vista      37.1%
HotBot          27.1%
Microsoft       20.3%
Infoseek        19.2%
Google          18.6%
Yahoo           17.6%
Excite          13.5%
Lycos            5.9%
Euroseek         5.2%

While these results are startling, they may not give a complete picture of
web contents or research.  Giles and Lawrence used basic Boolean queries
which required exact matches.  In other words, they asked for a Boolean
AND.  They turned off truncation, and "transformed queries to the advanced
syntax for Alta Vista".  As any researcher knows, insisting on exact
matches greatly diminishes the set of retrieved documents.  It increases
precision, but decreases recall.  Turning off truncation, as well as
concept searching, further diminishes the recall.  Thus, we might expect
the coverage of these topics to be considerably larger than this study
would indicate.  In addition, the study appeared to classify as
"science/education" only sites which were university, college, or research
laboratory sites.  This eliminates large valuable archives from publishers,
scholarly societies, or commercial entities such as the Special Collection
at Northern Light.
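
The precision and recall trade-off described above can be made concrete
with a toy example.  This is a sketch under stated assumptions, not the
query setup used in the study: the four documents, the relevance
judgments and all function names are hypothetical, and "truncation" is
modeled simply as prefix matching on words.

```python
def matches_exact(words, terms):
    """Boolean AND with exact matching: every term must appear as a word."""
    return all(term in words for term in terms)


def matches_truncated(words, terms):
    """Boolean AND with truncation: every term must be a prefix of some word."""
    return all(any(word.startswith(term) for word in words) for term in terms)


def search(docs, terms, matcher):
    """Return the sorted names of documents accepted by the matcher."""
    return sorted(
        name for name, text in docs.items()
        if matcher(set(text.lower().split()), terms)
    )


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection and relevance judgments.
docs = {
    "d1": "neural network training",
    "d2": "training neural networks on large data",
    "d3": "neural networking events for recruiters",
    "d4": "networked printers",
}
relevant = {"d1", "d2"}
query = ["neural", "network"]

exact = search(docs, query, matches_exact)
truncated = search(docs, query, matches_truncated)

print("exact:", exact, precision_recall(exact, relevant))
print("truncated:", truncated, precision_recall(truncated, relevant))
```

Here the exact-match query retrieves only d1, while the truncated query
also picks up d2 (relevant) and d3 (not relevant): recall rises and
precision falls.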

"One of the great promises of the web is that of equalizing the
accessibility of information", conclude the authors.  But the search
engines "typically index a biased sample of the web", they state.  They
point to the overemphasis on popular pages, or pages with many links, and
suggest that valuable new research does not get found by the researcher who
needs it because of this propensity.  Tools such as Direct Hit or Google
use popularity or number-of-links measures to improve the precision and
quality of their searches.
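
A link-count measure of this kind can be sketched very simply.  This is a
deliberately crude illustration, not how Direct Hit or Google actually
rank results: it just counts inbound links and sorts matching pages by
that count.  The link graph, the page names and both function names are
hypothetical.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts


def rank_by_popularity(links, hits):
    """Order matching pages by inbound link count, most-linked first."""
    counts = inbound_link_counts(links)
    return sorted(hits, key=lambda page: (-counts[page], page))


# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "portal": ["news", "shop"],
    "news": ["portal", "shop"],
    "shop": ["portal"],
    "new-lab-page": ["portal"],
}

ranked = rank_by_popularity(links, ["new-lab-page", "shop", "portal"])
print(ranked)
```

Under such a measure a new page with no inbound links sorts last among
the matches, whatever its content.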

Giles and Lawrence call for more equal and better coverage for research and
educational information.  The question of what to include in order to serve
the public is an old one.  Librarians deal with it constantly, under the
rubric of "selection".  Today's search engines appear to be working toward
providing less information of higher quality (some good answers) instead
of complete coverage, no matter the quality.  This is a
direct response to screams from the public of information overload.
Perhaps there is a place for broad coverage in narrow fields for those who
want "all" the answers instead of just some good ones.
        __________________________________

        Susan Feldman, Pres.     Datasearch
        sef2 at cornell.edu                 170 Lexington Dr.
        607-257-0937 (phone/fax)         Ithaca NY 14850
                        www.datasearch1.com
        __________________________________