Jepsen, ET; Seiden, P; Ingwersen, P; Bjorneborn, L "Characteristics of scientific web publications: Preliminary data gathering and analysis" JASIST 55 (14). DEC 2004

Because of the increasing presence of scientific
publications on the Web, combined with the existing difficulties in
easily verifying and retrieving these publications, research on
techniques and methods for retrieval of scientific Web publications is
called for. In this article, we report on the initial steps taken toward
the construction of a test collection of scientific Web publications
within the subject domain of plant biology. The steps reported are those
of data gathering and data analysis aiming at identifying characteristics
of scientific Web publications. The data used in this article were
generated from specifically selected domain topics that were searched
for in three publicly accessible search engines (Google, AllTheWeb, and
AltaVista). A sample of the retrieved hits was analyzed with regard to
how various publication attributes correlated with the scientific quality
of the content and whether this information could be employed to harvest,
filter, and rank Web publications. The attributes analyzed were inlinks,
outlinks, bibliographic references, file format, language, search engine
overlap, structural position (according to site structure), and the
occurrence of various types of metadata. As could be expected, the ranked
output differs between the three search engines. Apparently, this is
caused by differences in ranking algorithms rather than in the databases
themselves. In fact, because scientific Web content in this subject
domain receives few inlinks, both AltaVista and AllTheWeb retrieved a
higher degree of accessible scientific content than Google. Because of
the search engine cutoffs of accessible URLs, the feasibility of using
search engine output for Web content analysis is also discussed.
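
As an illustration of the search engine overlap attribute mentioned above, the following is a minimal sketch of how overlap between result lists might be computed. It is not the authors' procedure: the URL normalization rule, the use of the Jaccard coefficient as the overlap measure, and the function names normalize, pairwise_overlap, and found_by_all are all assumptions made for this example.

```python
from itertools import combinations
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to lower-cased host plus path without a trailing slash.

    Assumption: scheme, query string, and fragment are ignored when deciding
    whether two hits refer to the same page.
    """
    parts = urlsplit(url.strip())
    return f"{parts.netloc.lower()}{parts.path.rstrip('/')}"


def pairwise_overlap(results):
    """Return the Jaccard overlap for every pair of engines.

    `results` maps an engine name to an iterable of retrieved URLs.
    """
    sets = {engine: {normalize(u) for u in urls} for engine, urls in results.items()}
    overlap = {}
    for a, b in combinations(sorted(sets), 2):
        union = sets[a] | sets[b]
        overlap[(a, b)] = len(sets[a] & sets[b]) / len(union) if union else 0.0
    return overlap


def found_by_all(results):
    """Return the normalized URLs retrieved by every engine."""
    sets = [{normalize(u) for u in urls} for urls in results.values()]
    return set.intersection(*sets) if sets else set()
```

Under these assumptions, calling pairwise_overlap with one URL list per engine (for example, the top hits from Google, AllTheWeb, and AltaVista for a single topic) yields one score per engine pair, and found_by_all gives the hits common to all three.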

