Web robot accuracy analysis: suggestions invited

Sat Dec 17 10:38:19 EST 2005

Dear Sigmetrics,

We are doing tests on the accuracy of a robot that is trawling
the web looking for freely available full-texts of articles
from the ISI index.

Any methodological comments/suggestions would be much appreciated.

Stevan Harnad

---------- Forwarded message ----------
Date: Sat, 17 Dec 2005 15:26:32 +0000 (GMT)
From: Stevan Harnad <harnad at ecs.soton.ac.uk>
To: Chawki Hajjem <hajjem at vif.com>
     Anurag Acharya <acha at cs.ucsb.edu>, acha at google.com,
     Lee Giles <giles at ist.psu.edu>, oaci-working-group at mailhost.soros.org
Subject: Control tests to do for OA robot accuracy

Dear Chawki,

I am writing this in English so our collaborators understand what
we are doing. Here are the tests and controls that need to be done
to determine both the robot's accuracy in detecting and estimating
%OA and the causality of the observed citation advantage:

(1) When you re-do the searches in Biology and Sociology (to begin with;
other disciplines can come later), make sure to (1a) store the number as
well as the URLs of all retrieved sites that match the reference-query and
(1b) make the robot check the whole list (up to at least the prespecified
N-item limit you used before) rather than stopping as soon as it thinks
it has found that the item is "OA," as in your prior searches.

That way you will have, for each of your Biology and Sociology ISI
reference articles, not only their citation counts, but also their
query-match counts (from the search-engines) and also the number and
ordinal position for every time the robot calls them "OA." (One item
might have, say, k query-matches, with the 3rd, 9th and kth one judged
"OA" by the robot, and the other k-3 judged non-OA.)

The number (and URLs) of query-matches, the ordinal position of the
first "OA"-call, and the total number and proportion of OA-calls will
all be important test data to make sure that the OA citation advantage
is *not* just a query-match-frequency artefact, or a query-match-frequency
plus false-alarm artefact. (The potential artefact is that the putative OA
advantage is not an OA advantage at all, but merely a reflection of the
fact that more highly cited articles are more likely to have online
items that *cite* them, and that these online items are the ones the
robot is *mistaking* for OA full-texts of the *cited* article itself.)
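
The per-article data described in (1) could be stored roughly as follows.
This is only a minimal sketch: the class name ReferenceRecord and all of
its field names are hypothetical, and it assumes the robot returns one
OA/non-OA judgment per query-match, in search-engine order.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReferenceRecord:
    """One ISI reference article plus everything the robot found for it."""
    article_id: str                 # hypothetical identifier for the article
    citation_count: int
    match_urls: List[str] = field(default_factory=list)  # all query-matches, in order
    oa_flags: List[bool] = field(default_factory=list)   # robot's call per match

    def __post_init__(self) -> None:
        if len(self.match_urls) != len(self.oa_flags):
            raise ValueError("need exactly one OA judgment per query-match")

    @property
    def n_matches(self) -> int:
        return len(self.match_urls)

    @property
    def oa_positions(self) -> List[int]:
        # 1-based ordinal positions of every match the robot called "OA"
        return [i + 1 for i, is_oa in enumerate(self.oa_flags) if is_oa]

    @property
    def first_oa_position(self) -> Optional[int]:
        positions = self.oa_positions
        return positions[0] if positions else None

    @property
    def oa_proportion(self) -> Optional[float]:
        # Undefined (None) when the search returned no matches at all
        if self.n_matches == 0:
            return None
        return len(self.oa_positions) / self.n_matches


# The example from the text with k = 10: 3rd, 9th and kth match judged "OA".
k = 10
flags = [pos in (3, 9, k) for pos in range(1, k + 1)]
example = ReferenceRecord("example-article", 12,
                          [f"url-{pos}" for pos in range(1, k + 1)], flags)
```

Keeping the full list of judgments, rather than only the first "OA" call,
is what allows the match count and the OA-call count to be compared
against the citation count afterwards.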

(2) As a further check on robot accuracy, please use a subset
of URLs for articles we *know* to be OA (e.g., from PubMed
Central, Google Scholar, Arxiv, CogPrints) and try both the search-engines
(for % query-matches) and the robot (for "%OA") on them. That will give
another estimate of the *miss* rate of the search-engines as well
as of the robot's algorithm for OA.
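
The two miss rates in (2) could be computed along these lines. This is a
sketch under stated assumptions: the function name miss_rates and the
input layout (one pair of match count and robot verdict per known-OA
article) are hypothetical, and an article counts as an "OA" call if the
robot tagged at least one of its matches as OA.

```python
from typing import List, Optional, Tuple


def miss_rates(results: List[Tuple[int, bool]]) -> Tuple[float, Optional[float], float]:
    """Miss rates over articles that are all *known* to be OA.

    results: one (n_query_matches, robot_called_oa) pair per known-OA article.
    Returns (search_miss, robot_miss, overall_miss):
      search_miss  - share of articles for which the search found no match
      robot_miss   - among articles with matches, share with no "OA" call
      overall_miss - share of all articles that did not end up tagged "OA"
    """
    total = len(results)
    if total == 0:
        raise ValueError("need at least one known-OA article")
    found = [(n, oa) for n, oa in results if n > 0]
    search_miss = 1 - len(found) / total
    robot_miss = (
        sum(1 for _, oa in found if not oa) / len(found) if found else None
    )
    overall_miss = sum(1 for n, oa in results if not (n > 0 and oa)) / total
    return search_miss, robot_miss, overall_miss


# Illustrative made-up input: four known-OA articles.
demo = [(5, True), (0, False), (3, False), (2, True)]
search_miss, robot_miss, overall_miss = miss_rates(demo)
```

Separating the two rates shows whether a missed OA article was lost at
the search-engine stage or at the robot's classification stage.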

(3) While you are doing this, in addition to the parameters that
are stored with the reference (the citation count, the URLs for every
query-match by the search, the number, proportion, and ordinal position
of those of the matches that the robot tags as "OA"), please also store
the citation impact factor of the journal in which the reference article
was published. (We will use this to do subanalyses to see whether the pattern
is the same for high and low impact journals, and across disciplines; we will
also look separately at %OA among articles at different citation
levels (1, 2-3, 4-7, 8-15, 16-31, 32-63, 64+), again within and across
years and disciplines.)
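
Assigning articles to the citation levels in (3) might look like this.
The sketch reads the levels as non-overlapping doubling ranges (1, 2-3,
4-7, 8-15, 16-31, 32-63, 64+), which is an assumption; the function
names, and the separate "0" level for uncited articles, are also
assumptions that are not part of the list above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (lower bound, label), highest first; bounds double at each level (assumed).
LEVELS = [(64, "64+"), (32, "32-63"), (16, "16-31"), (8, "8-15"),
          (4, "4-7"), (2, "2-3"), (1, "1")]


def citation_level(count: int) -> str:
    """Return the citation-level label for a citation count."""
    if count < 0:
        raise ValueError("citation count cannot be negative")
    for lower, label in LEVELS:
        if count >= lower:
            return label
    return "0"  # uncited articles: a level added here by assumption


def percent_oa_by_level(articles: Iterable[Tuple[int, bool]]) -> Dict[str, float]:
    """articles: (citation_count, is_oa) pairs; returns %OA per level."""
    tallies = defaultdict(lambda: [0, 0])  # label -> [oa count, total count]
    for count, is_oa in articles:
        tally = tallies[citation_level(count)]
        tally[0] += int(is_oa)
        tally[1] += 1
    return {label: 100.0 * oa / n for label, (oa, n) in tallies.items()}
```

The same grouping can be repeated within each year, discipline, or
journal impact band by filtering the input before calling it.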

(4) The sampling for Biology and Sociology should be based on *pairs*
within the same journal/year/issue-number: Assuming that you will be
sampling 500 pairs (i.e., 1000 items) in each discipline (1000 Biology,
1000 Sociology), please first pick a *random* sample of 50 pairs for
each year, and then, within each pair, pick, at *random*, one OA and one non-OA
article from the same issue. Use only the robot's *first* ordinal OA as your criterion
for "OA" (so you use the same methodology as the robot had used); the criterion
for non-OA is, as before, none found among all of the matches. If you feel you
have the time, it would also be informative to check the 2nd or 3rd "OA"
if the robot found more than one. That too would be a good control datum,
for evaluating the robot's accuracy under different conditions (number
of matches; number/proportion of them judged "OA").

    http://eprints.ecs.soton.ac.uk/11687/
    http://eprints.ecs.soton.ac.uk/11688/
    http://eprints.ecs.soton.ac.uk/11689/
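
One way the within-issue pair sampling in (4) might be carried out is
sketched below. The function name sample_pairs and the dictionary keys
(journal, year, issue, is_oa) are hypothetical, is_oa is assumed to
already reflect the robot's first-ordinal-OA criterion, and an issue is
treated as eligible only if it contains at least one OA and one non-OA
article.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def sample_pairs(articles: List[Dict], pairs_per_year: int = 50,
                 seed: int = 0) -> List[Tuple[Dict, Dict]]:
    """Return (OA article, non-OA article) pairs, each from a single issue."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced

    # Group the articles by issue, split into OA and non-OA.
    by_issue = defaultdict(lambda: {"oa": [], "noa": []})
    for art in articles:
        key = (art["year"], art["journal"], art["issue"])
        by_issue[key]["oa" if art["is_oa"] else "noa"].append(art)

    # Only issues with both kinds of article can supply a pair.
    eligible = defaultdict(list)
    for key, groups in by_issue.items():
        if groups["oa"] and groups["noa"]:
            eligible[key[0]].append(key)

    pairs = []
    for year in sorted(eligible):
        issues = sorted(eligible[year])
        # Random issues for this year, then one random article of each kind.
        for key in rng.sample(issues, min(pairs_per_year, len(issues))):
            groups = by_issue[key]
            pairs.append((rng.choice(groups["oa"]), rng.choice(groups["noa"])))
    return pairs
```

Each issue contributes at most one pair here, so a year with fewer
eligible issues than pairs_per_year yields fewer pairs rather than
repeated issues.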

(5) Count also the number of *journals* for which the robot judges that
it is at or near 100% OA (for those are almost certainly OA journals
and not self-archived articles). Include them in your %OA counts,
but of course not in your OA/NOA ratios. (It would be a good
idea to check all the ISI journal names against the DOAJ OA journals
list -- about 2000 journals -- to make sure you catch all the OA
journals.) Keep a count also of how many individual journal *issues*
have either 100% OA or 0% OA (and were hence eliminated from the OA/NOA
citation ratio). Those numbers will also be useful for later analyses and
estimates.
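
The counts in (5) could be tallied as in the sketch below. The function
name tally_oa_extremes and the dictionary keys are hypothetical, and the
0.95 cutoff standing in for "at or near 100% OA" is an arbitrary
assumption that would need to be chosen deliberately.

```python
from collections import defaultdict
from typing import Dict, List


def tally_oa_extremes(articles: List[Dict],
                      journal_threshold: float = 0.95) -> Dict:
    """Count all-OA issues, no-OA issues, and journals at/near 100% OA."""
    issue_counts = defaultdict(lambda: [0, 0])    # issue -> [OA, total]
    journal_counts = defaultdict(lambda: [0, 0])  # journal -> [OA, total]
    for art in articles:
        issue_key = (art["journal"], art["year"], art["issue"])
        for table, key in ((issue_counts, issue_key),
                           (journal_counts, art["journal"])):
            table[key][0] += int(art["is_oa"])
            table[key][1] += 1

    return {
        # Issues that cannot contribute to an OA/NOA citation ratio
        "issues_all_oa": sum(1 for oa, n in issue_counts.values() if oa == n),
        "issues_no_oa": sum(1 for oa, n in issue_counts.values() if oa == 0),
        # Journals the robot judges to be at or near 100% OA
        "journals_near_all_oa": sorted(
            j for j, (oa, n) in journal_counts.items()
            if oa / n >= journal_threshold
        ),
    }
```

Journals flagged this way would still be worth checking by name against
the DOAJ list, since the robot's own judgment is what is being tested.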

With these data we will be in a much better position to estimate
the robot's accuracy and the cause of the OA citation advantage.

Chrs, Stevan