More on search engines with scientometric purposes
Isidro F. Aguillo
isidro at CINDOC.CSIC.ES
Wed Jan 10 08:24:21 EST 2001
Following recent messages by Ronald Rousseau, and with his cooperation,
I would like to share with you some information about Internet search
engines as a source of data for scientometric analysis.
Perhaps, if enough people are interested, a special session on
cybermetric methodology could be suggested to the ISSI Sydney conference
organizers, in a similar fashion to the one held in Colima.
Over the last few months we have tried to find out more about the
inconsistencies of the search engines, in order to provide information
about the "stability" and usefulness of these tools for quantitative
analysis.
We present some "tricks" that were valid at the time we tested (winter 2001).
** FAST (www.alltheweb.com). This large database (perhaps the largest)
is updated roughly every two months, so you have a long "time window"
for calculating statistics without worrying about the stability of the
results. There is no warning when the database is updated, but this can
be worked around by monitoring the results for a control word.
The advanced search screen is by far the best to work with, as it gives
access to (implicit) boolean queries and other useful delimiters
(our term for field-restricted searching).
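
The control-word idea can be sketched in a few lines of Python. This is
only an illustration: it assumes the hit counts for the control word are
noted down once a day (by hand or otherwise), and the function name, the
tolerance parameter and the figures are all invented for the example.

```python
from datetime import date


def detect_updates(observations, tolerance=0.0):
    """Return the dates on which the control-word hit count changed.

    observations: list of (date, hit_count) pairs, sorted by date.
    tolerance: relative change treated as noise (0.0 = any change counts).
    """
    updates = []
    # Compare each observation with the one recorded just before it.
    for (_, previous), (day, count) in zip(observations, observations[1:]):
        baseline = max(previous, 1)  # avoid dividing by zero
        if abs(count - previous) / baseline > tolerance:
            updates.append(day)
    return updates


# Made-up counts for a hypothetical control word.
log = [
    (date(2001, 1, 8), 15200),
    (date(2001, 1, 9), 15200),
    (date(2001, 1, 10), 15200),
    (date(2001, 1, 11), 16950),
    (date(2001, 1, 12), 16950),
]

for day in detect_updates(log):
    print("Database probably updated on", day.isoformat())
```

Samples collected between two detected changes can then be treated as
belonging to the same "time window".
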
The size (number of pages) and visibility (number of links received) of
a website are easily derived using the "in the URL" and "in the link to
the URL" options respectively. By selecting "must" and "must not" you
can exclude self-citations, so that true "sitations" are obtained (this
boolean option works fine according to our results). There is a very
useful trick regarding the domain delimiter (there are two boxes, one
for included domains and another for excluded ones): both boxes accept
several (many) domains, using a blank space as separator (an implicit
"OR").
** ALTAVISTA (www.av.com). Also very large and perhaps the most
intuitive for our work, but unfortunately the most inconsistent engine.
All the screens give different results, but the advanced search is the
recommended option, as it provides the highest counts and also works
with explicit boolean operators. However, the use of these operators is
not recommended for large samples, as they give wrong results. The
database is altered (mainly due to saturation-avoiding procedures) at
irregular intervals, often several times during the same day. As a rule
we suggest obtaining the whole sample within the same half day. The
delimiters are well developed (domain, host, url and link), but they are
not to be trusted if you combine them with boolean operators or use them
to compare samples obtained on different days.
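
The half-day rule can be checked mechanically if the time of each query
is noted down. The following Python sketch assumes that "half day" means
a span of at most 12 hours between the first and the last query; the
function name, the queries and the timestamps are invented for the
example.

```python
from datetime import datetime, timedelta


def within_window(timestamps, window=timedelta(hours=12)):
    """Return True if all timestamps fall within a single time window."""
    if not timestamps:
        return True  # nothing collected, nothing to compare
    return max(timestamps) - min(timestamps) <= window


# Hypothetical log: query -> moment at which the count was recorded.
sample = {
    "link:example.org": datetime(2001, 1, 9, 9, 15),
    "host:example.org": datetime(2001, 1, 9, 10, 40),
    "domain:es": datetime(2001, 1, 10, 16, 5),
}

if not within_window(list(sample.values())):
    print("Sample spans more than half a day: counts may not be comparable")
```

If the check fails, the safest course under the rule above is to repeat
the whole set of queries in one sitting.
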
** INKTOMI (www.hotbot.com and www.iwon.com). In theory both provide
access to the same database, but the results are not exactly the same.
We recommend using the advanced option of Iwon, which is larger than the
standard one. Our personal feeling is that results stay stable for
longer than in Altavista, but only for one or two days. Both support
booleans (with some level of confidence), and their most important
delimiters are domain and linkdomain. To obtain the numerical results in
Hotbot you must go to the "second" (next) screen.
** GOOGLE (www.google.com). The second (?) largest database and the best
search results, but very few cybermetric options (almost no boolean
support): the link operator refers mainly to the page, not to the
complete website, and to use the site delimiter you must add a search
term (our best results are obtained with +www, the plus sign is very
** INFOSEEK (www.go.com). The smallest of the reviewed databases,
although its size has increased considerably of late. Useful mainly for
comparative purposes. Site and url are used as domain delimiters and
link for citations, with implicit boolean support (in the past, booleans
did not work well with large result sets).
** NORTHERN LIGHT (www.northernlight.com). Large database, with
increased cybermetric options, some of them undocumented, as they are
not mentioned in the help provided (!). In the Power Search option you
need to exclude the "special collection" databases first, but then it is
possible to use the link (not documented but working fine) and url
delimiters. You can even use domain filters (undocumented too), as each
country has a different control number in the query string (we can
provide a list of such country numbers). Caution: the minus sign (-) no
longer works properly, so you should
We can provide additional information based on our experience to those
of you interested in using search engines, but caution is recommended,
as changes are very frequent and unexpected.
Happy New Year and see you all in Sydney,
Isidro F. AGUILLO isidro at cindoc.csic.es
CINDOC-CSIC Tel: +34-91-563.54.82
Joaquin Costa, 22 Mobile: +34-630.858997
28002 Madrid. ESPAÑA/SPAIN Fax: +34-91-564.26.44
Cybermetrics, e-Journal (http://www.cindoc.csic.es/cybermetrics)