[Eurchap] What is the Invisible Web? A Crawler Perspective
Thelwall, Mike (Dr)
M.Thelwall at wlv.ac.uk
Thu Jun 17 11:22:34 EDT 2004
Accepted abstract from the forthcoming ASIST-AoIR workshop
http://www.asis.org/Chapters/europe/announcements/AoIR.htm
What is the Invisible Web? A Crawler Perspective
Natalia Arroyo
Laboratorio de Internet. CINDOC-CSIC
Joaquin Costa, 22. 28002 Madrid. SPAIN
The invisible Web, also known as the deep Web or dark matter, is an
important problem for Webometrics due to difficulties of conceptualization
and measurement. The invisible Web has been defined to be the part of the
Web that cannot be indexed by search engines, including databases and
dynamically generated pages. Some authors have recognized that this is a
rather subjective concept that depends on the point of view of the observer:
what is visible to one observer may be invisible to another. In the generally
accepted definition of the invisible Web, only the point of view of search
engines has been taken into account. Search engines are considered to be the
eyes of the Web, both for measuring and for searching.
In addition to commercial search engines, other tools have also been used
for quantitative studies of the Web, such as commercial and academic
crawlers. Commercial crawlers are programs developed by software companies
for purposes other than Webometrics, such as Web site management, but they can
also be used for crawling Web sites and reporting on their characteristics
(size, hypertext structure, embedded resources, etc.). Academic crawlers are
programs developed by academic institutions for measuring Web sites for
Webometric purposes.
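As an informal illustration of the kind of report such tools produce (not the
actual implementation of any of the crawlers studied here), the following
Python sketch uses only the standard library, follows links within a single
site from a hypothetical starting URL, and counts pages, hyperlinks and
embedded resources:

    import urllib.parse
    import urllib.request
    from collections import deque
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collect hyperlinks (<a href>) and embedded resources (img/script src)."""
        def __init__(self):
            super().__init__()
            self.links, self.resources = [], []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "a" and attrs.get("href"):
                self.links.append(attrs["href"])
            elif tag in ("img", "script") and attrs.get("src"):
                self.resources.append(attrs["src"])

    def crawl(start_url, max_pages=50):
        """Breadth-first crawl restricted to the start site; returns simple counts."""
        site = urllib.parse.urlparse(start_url).netloc
        seen, queue = set(), deque([start_url])
        pages = links = resources = 0
        while queue and pages < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    if "text/html" not in resp.headers.get("Content-Type", ""):
                        continue  # only HTML is parsed in this sketch
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue  # unreachable pages remain "invisible" to this crawler
            pages += 1
            parser = LinkExtractor()
            parser.feed(html)
            links += len(parser.links)
            resources += len(parser.resources)
            for href in parser.links:
                target = urllib.parse.urljoin(url, href)
                if urllib.parse.urlparse(target).netloc == site:
                    queue.append(target)
        return {"pages": pages, "links": links, "embedded_resources": resources}

    if __name__ == "__main__":
        # http://example.org/ is only a placeholder starting point
        print(crawl("http://example.org/"))

A real site-management or Webometric crawler must additionally deal with
robots.txt, non-HTML file formats, forms and scripted links, which is exactly
where the differences examined in this paper arise.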
In this paper, Sherman and Price's "truly invisible Web" is studied from the
point of view of crawlers. The truly invisible Web consists of pages that
cannot be indexed for technical reasons. Crawler parameters differ
significantly from those of search engines, because different design purposes
result in different technical specifications. In addition, previous
investigations have demonstrated large differences among crawlers in their
coverage of the Web. Both aspects are clarified through an experiment in which
different Web sites, covering diverse file formats and built with different
types of Web programming, are analyzed on a set
date, with seven commercial crawlers (Astra SiteManager, COAST WebMaster,
Microsoft Site Analyst, Microsoft Content Analyzer, WebKing, Web Trends and
Xenu), and an academic crawler (SocSciBot). Each Web site had previously been
copied to a hard disk using a file-retrieving tool, so that the local copy
could be compared with the data obtained by the crawlers. The results are reported
and analyzed in detail to produce a definition and classification of the
invisible Web for commercial and academic crawlers.
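As a minimal sketch of that comparison, assuming a hypothetical directory
holding the local copy and a plain-text export of the URLs reported by one
crawler (real report formats differ from tool to tool), the pages present in
the copy but absent from the crawler's output are candidates for that
crawler's invisible Web:

    import pathlib
    import urllib.parse

    def local_pages(mirror_root, base_url):
        """Map files in the local site copy back to URLs under base_url."""
        root = pathlib.Path(mirror_root)
        return {
            urllib.parse.urljoin(base_url, p.relative_to(root).as_posix())
            for p in root.rglob("*") if p.is_file()
        }

    def crawler_pages(report_file):
        """Read one URL per line from a crawler's exported report."""
        with open(report_file, encoding="utf-8") as fh:
            return {line.strip() for line in fh if line.strip()}

    def compare(mirror_root, base_url, report_file):
        local = local_pages(mirror_root, base_url)
        crawled = crawler_pages(report_file)
        return {
            "missed_by_crawler": local - crawled,  # candidate invisible pages
            "crawler_only": crawled - local,
            "found_by_both": local & crawled,
        }

    if __name__ == "__main__":
        # "site_copy/" and "crawler_report.txt" are placeholder names
        result = compare("site_copy/", "http://example.org/", "crawler_report.txt")
        print(len(result["missed_by_crawler"]), "pages missed by this crawler")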