Control tests to do for OA robot accuracy

Stevan Harnad harnad at ECS.SOTON.AC.UK
Sat Dec 17 13:15:03 EST 2005


On Sat, 17 Dec 2005, Tim Brody wrote:

> Stevan Harnad wrote:
>
> > Both the number (and URLs) of query-matches and the ordinal position
> > of the first "OA"-call, and the total number and proportion of OA-calls
> > will be important test data to make sure that the OA citation advantage
> > is *not* just a query-match-frequency artefact and/or a query-match-frequency
> > plus false-alarm artefact. (The potential artefact is that the putative OA
> > advantage is not an OA advantage at all, but merely a reflection of the
> > fact that more highly cited articles are more likely to have online
> > items that *cite* them, and that these online items are the ones the
> > robot is *mistaking* for OA full-texts of the *cited* article itself.)
>
> Does the robot check the URL (i.e. download the page and perform some
> level of check on it)? I had assumed Chawki's robot did this, otherwise
> discerning between the paper and a reference to the paper only from the
> search engine result is nigh on impossible.

Of course:

    "The robot's search algorithm was the following: (1) Send request to
    ISI database for metadata of article (first-author name and article
    title). (2) Send request (name, title) to: Yahoo, Metacrawler,
    Vivisimo, Eo, AlltheWeb and Altavista. (3) Extract external
    (irrelevant) links. (4) Remove duplicate URLs. (5) Sort URLs
    to process PDF and PS files first (probable full-texts). (6)
    Convert files (PDF, PS, LaTeX, HTML, XML, RTF, and Word) to text.
    (7) Parse files to test for full-text of reference article
    (name/title in first 20% of text, references in last 20%).  (8)
    If, in parsing HTML file, title found but not full text, extract
    and follow links in file further as references possibly leading
    to the full text (to depth of 3 levels). (9) Sort articles by
    discipline/journal/issue/year; calculate percent OA articles within
    each; then by discipline/journal; and finally for each discipline. (10)
    Sort articles by discipline/journal/issue/year, calculate citation
    ratio as (OA-NOA)/NOA within each, then by discipline/journal and
    finally for each discipline. (11) Exclude data for all journals
    that are 100% OA (OA journals) from both the article counts and the
    citation counts (as we are only doing within-journal comparisons for
    NOA journals); exclude data from all single issues that are 100% OA
    (to eliminate denominators)."
    http://eprints.ecs.soton.ac.uk/11688/
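
To make the parsing test concrete, here is a minimal sketch of the
duplicate-removal, PDF/PS-first sorting and "name/title in first 20%,
references in last 20%" steps quoted above. It is not the robot's actual
code: the function names are hypothetical, and treating the words
"references" or "bibliography" as the marker of a reference list is an
assumption made only for illustration.

```python
import re

# Hypothetical helper names; a sketch of the heuristic, not the robot's code.

def normalize(text):
    """Lowercase and collapse punctuation/whitespace so layout does not matter."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def order_candidates(urls):
    """Remove duplicate URLs, then move PDF and PS files to the front."""
    unique = list(dict.fromkeys(urls))  # dict keeps first-seen order
    # sorted() is stable, so ties keep their search-result order.
    return sorted(unique, key=lambda u: not u.lower().endswith((".pdf", ".ps")))

def looks_like_full_text(text, first_author, title):
    """Name and title in the first 20% of the text, references in the last 20%."""
    norm = normalize(text)
    cut = len(norm) // 5
    if cut == 0:
        return False
    head, tail = norm[:cut], norm[-cut:]
    has_header = normalize(title) in head and normalize(first_author) in head
    # Assumption: a reference list is signalled by one of these headings.
    has_refs = any(word in tail for word in ("references", "bibliography"))
    return has_header and has_refs
```

The point of the positional test is that a page which merely *cites* the
target article carries the title near the end (in its own reference list),
not near the start, so it should fail the check.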

> Given you're searching from the ISI database some simple tests of the
> full-text are:
> 1) Size of document (I'd say must be at least 5 pages long)
> 2) Title on the first page [will still match publication list]
> 3) Authors on the first page [ditto]
> 4) Use the years (or other bibliographic part) from the ISI reference
> list as a key (if the same years - in order - are present in the document)
>
> Also I'd suggest capturing how many URLs are PDFs, postscript, etc. If
> you're getting a lot of HTML matches then I suggest they're probably not
> author self-archived!

All being done, Tim, rest assured!
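
For the record, Tim's four tests could be sketched roughly as follows.
This is only an illustration under stated assumptions: the function names
are hypothetical, `pages` is assumed to be a list of per-page text strings
produced by the conversion step, and the ISI reference years are assumed
to arrive as four-digit strings in reference-list order.

```python
import re

def years_in_order(reference_years, document_text):
    """True if the reference years occur in the document in the same order."""
    found = iter(re.findall(r"\b(?:19|20)\d{2}\b", document_text))
    # "year in found" consumes the iterator, so each match must come
    # after the previous one: a subsequence test, not a set test.
    return all(year in found for year in reference_years)

def passes_checks(pages, title, authors, reference_years, min_pages=5):
    """Apply the size, title, author and reference-year tests in turn."""
    if len(pages) < min_pages:                      # test 1: document size
        return False
    first_page = pages[0].lower()
    if title.lower() not in first_page:             # test 2: title on page 1
        return False
    if not all(a.lower() in first_page for a in authors):  # test 3: authors
        return False
    return years_in_order(reference_years, " ".join(pages))  # test 4: year key
```

As Tim notes, tests 2 and 3 alone would still match a publication list;
it is the size test and the ordered-years key that do most of the work of
separating the paper itself from a page that only mentions it.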

> > (5) Count also the number of *journals* for which the robot judges that
> > it is at or near 100% OA (for those are almost certainly OA journals
> > and not self-archived articles). Include them in your %OA counts,
> > but of course not in your OA/NOA ratios. (It would be a good
> > idea to check all the ISI journal names against the DOAJ OA journals
> > list -- about 2000 journals -- to make sure you catch all the OA
> > journals.) Keep a count also of how many individual journal *issues*
> > have either 100% OA or 0% OA (and were hence eliminated from the OA/NOA
> > citation ratio). Those numbers will also be useful for later analyses and
> > estimates.
>
> Only a few hundred journals in ISI are OA, although I don't know if ISI
> publishes that list (might be something Ulrich's would give).

ISI published a few articles about their OA subset, so the list might be
in one of them.  But a DIFF against the DOAJ list (http://www.doaj.org/) will
do just as well, as it is *highly* unlikely that an ISI-indexed OA journal
will not have registered as OA with DOAJ.
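
A rough sketch of that DIFF, together with the per-issue OA/NOA citation
ratio it feeds into, might look like this. The function names are
hypothetical, matching on normalized journal titles is an assumption
(ISSNs would probably be more reliable if both lists carry them), and
computing the ratio on *mean* citation counts per article is likewise an
assumption about how (OA-NOA)/NOA is applied.

```python
def normalize_name(name):
    """Lowercase, unify '&' with 'and', and collapse runs of whitespace."""
    return " ".join(name.lower().replace("&", "and").split())

def split_by_doaj(isi_journals, doaj_journals):
    """Return (journals also listed in DOAJ, remaining journals)."""
    doaj = {normalize_name(j) for j in doaj_journals}
    oa = [j for j in isi_journals if normalize_name(j) in doaj]
    rest = [j for j in isi_journals if normalize_name(j) not in doaj]
    return oa, rest

def issue_ratio(oa_cites, noa_cites):
    """(OA - NOA) / NOA on mean citations for one journal issue.

    Returns None for issues that are 100% OA or 0% OA, or whose NOA
    articles have no citations, since no within-issue comparison is
    possible there. Callers can count the None results separately.
    """
    if not oa_cites or not noa_cites:
        return None
    mean_oa = sum(oa_cites) / len(oa_cites)
    mean_noa = sum(noa_cites) / len(noa_cites)
    if mean_noa == 0:
        return None
    return (mean_oa - mean_noa) / mean_noa
```

Journals in the first list would still count toward %OA but be left out
of the ratio; issues returning None are the ones to tally for the later
analyses and estimates.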

    McVeigh, M. E. (2004) Open Access Journals in the ISI Citation
    Databases: Analysis of Impact Factors and Citation Patterns Thomson
    Scientific, October 2004

    Pringle, J. (2004) Do Open Access Journals have Impact?  Nature,
    Web Focus: access to the literature, May 7, 2004
    http://www.nature.com/nature/focus/accessdebate/19.html

    Testa, J. and McVeigh, M. E. (2004) The Impact of Open Access
    Journals: A Citation Study from Thomson ISI (pdf 17pp)
    Author eprint, 14 April 2004
    http://www.isinet.com/media/presentrep/acropdf/impact-oa-journals.pdf

Chrs, Stevan


