Scientometric OAI Search Engines

Tue May 4 18:15:49 EDT 2004

I would like to recommend two excellent, insightful articles by Peter Suber
in The SPARC Open Access Newsletter, issue #73
http://www.earlham.edu/~peters/fos/newsletter/05-03-04.htm

    "The case for OAI in the age of Google"
    http://www.earlham.edu/~peters/fos/newsletter/05-03-04.htm#oai-google
    "Two distractions"
    http://www.earlham.edu/~peters/fos/newsletter/05-03-04.htm#distractions

I have deleted the lion's share of the text of these articles below, because most
of it is so on-target. The few commented excerpts here are just to amplify a few
points for emphasis:

About whether to provide OA by self-archiving in an OAI-compliant archive
or on just a plain-vanilla website, I agree completely with Peter that the
most important thing by far is to provide OA at all! Making it OAI-compliant
is only a perk (though a desirable and easy-to-provide perk!).

Google Rules! And it will not need much tweaking either (1) to restrict
search to, or add extra weight to, OAI-compliant contents, or (2) to
restrict ranking to, or add extra weight to, citation links (usually in
articles in OAI-compliant contents) over ordinary links.

The easiest way to integrate this in one's mind is to realize that
OAI-compliance is not dependent on *location* but on *metadata-tagging*:
OAI-compliance just means tagging "author," "title," "journalname",
"date" etc., as such. And of course the all-important tag "peer-reviewed
postprint" or "unrefereed preprint". With those tags to go by, google
could in principle pick out what is and is not a journal article, and
could serve you journal articles only, if you please, ranked only by the
(citation) links between them -- and with inverted full-text search!
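
To make the tagging idea concrete, here is a minimal sketch (not how
google or any existing OAI service actually works). It assumes each record
carries a hypothetical "status" tag plus a list of the records it cites;
the field names and the sample records are invented for illustration.

```python
from collections import Counter

# Hypothetical tagged records; field names and values are assumptions.
records = [
    {"id": "a1", "title": "Paper A", "status": "peer-reviewed postprint", "cites": ["a2"]},
    {"id": "a2", "title": "Paper B", "status": "peer-reviewed postprint", "cites": []},
    {"id": "p1", "title": "Draft C", "status": "unrefereed preprint", "cites": ["a2", "a1"]},
    {"id": "w1", "title": "Course page", "status": None, "cites": ["a2"]},
]


def journal_articles_by_citations(records, wanted_status="peer-reviewed postprint"):
    """Keep only records tagged wanted_status, ranked by citations
    received from other kept records (most cited first)."""
    kept = {r["id"]: r for r in records if r["status"] == wanted_status}
    counts = Counter()
    for r in kept.values():
        for target in r["cites"]:
            # Count only links between tagged articles; ignore self-citations.
            if target in kept and target != r["id"]:
                counts[target] += 1
    return sorted(kept.values(), key=lambda r: counts[r["id"]], reverse=True)


ranked_ids = [r["id"] for r in journal_articles_by_citations(records)]
```

Without the "status" tag there would be nothing to filter on, and the
untagged course page and the preprint would compete in the same hit list.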

Without the tags, google will still find an article if it's OA, and
PageRank will get it to the top of the hit list, but it definitely will
not be the same as searching all and only journal articles. There will
be other look-alike junk too, and PageRank will not be infallible in
shuffling it to the bottom of the hit list.

(Sooner or later PageRank is bound to be helped out as well by
download-sniffers that weight hits by the correlation
between downloads and citations 6-24 months later:
http://citebase.eprints.org/analysis/correlation.php ).
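
For readers who want to see what such a download/citation correlation
amounts to, here is a minimal sketch using the ordinary Pearson coefficient.
It is only an illustration, not Citebase's actual method, and the numbers
are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented figures: early downloads per article, and citations to the
# same articles counted some months later.
downloads = [120, 45, 300, 80, 10]
citations_later = [6, 2, 14, 5, 0]

r = pearson(downloads, citations_later)
```

A high r on real data would suggest that early downloads could serve as an
advance signal for later citations when weighting hits.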

So OA is good, OAI-tagged is better.

>    When asked about archiving inertia, some faculty say that putting
>    an eprint on a personal web site is just as good as putting it in an
>    OAI-compliant archive.  Google will find an eprint on a personal web
>    site and make it visible to those who might need it for their research

Google will find the eprint if it is out there, and probably put it on
the top of the hit list if that was what you were looking for. But on
a keyword full-text search on a topic, rather than a targeted search
for a specific article, google will still deliver a lot of junk too,
and PageRank will not ensure that all the OA articles come on top and
the junk below if the OA articles are not OAI-tagged as such.

>    So, is Google good enough?  If not, why not?

>    The OA-OAI proponent might concede that eprints on personal web
>    sites can be OA.  "But OA-OAI archiving enhances visibility more
>    than Google indexing does."

OAI-compliance does substantially advance visibility for items in
full-text keyword searches because the search can be restricted to the journal
article literature alone, whereas google will also retrieve great
quantities of junk hits that even PageRank will not reliably relegate to
the bottom. This is a quantitative question. It can be tested. But I very
much doubt that if fed sample keyword searches, PageRank will successfully
sort it all into journal-articles, top, vs. all-other-stuff, bottom.

We tend to intuitively evaluate the (truly uncanny) power of google based
on the "I'm feeling lucky" case, in which there is just one target, or a
few, that I'm looking for, and google miraculously delivers them to me
at or near the top of its huge hit-list. But what no one has tested is
how google fares quantitatively when there are *many* relevant targets
(as in a literature search in ISI's web of science), but mixed in with
a lot of look-alike junk (which is absent from the ISI database): What
percentage of the many targets actually appears on the top of the hit
list, interspersed with what percent of junk, and what percentage of the
targets ends up missed, interspersed instead with the much larger mass
of junk further down? This can't be determined from experience with
the one/few case alone. (Even in the one/few case, we tend to forget or
overlook the occasions when the target is *not* anywhere near the top of
our list, rare as these may be! We operate with the usual "confirmation
bias," and that's how we remember the experience.)

Nor is it clear -- even if google can successfully partition
journal-article vs. other hits on full-text searches -- whether the rank
order PageRank delivers within the journal-article segment is optimal
(until/unless there is enough OA literature that is citation-linked to
allow citation-link ranks to prevail, as in http://citebase.eprints.org/).

>    Google has a large and useful cache that greatly mitigates the damage
>    of link rot.

Yes, but as far as I know, that cache only lasts a month, and then it
rots too, and the only recourse left becomes the WayBack Machine! (Please
correct me if I am wrong about this.)

>    It's true that OAI tools will provide better visibility to those
>    who search by citations.  But talented Google searchers will prefer
>    to search by content-based keywords, not by citations.  If they do,
>    then they will likely find the same articles by a different route,
>    though they will be combined with all the other articles that also
>    satisfy the keywords.  Insofar as the size of the hit list is a
>    problem, see the next OAI argument.

I have addressed the problem of the size of the hit list above, and alas
I am not sure that what follows below solves the problem: It is not the
size of the hit-list that is a problem. The problem is when a keyword
search nets, say, 100 relevant journal articles in ISI's web-of-science
and (let us say, for the sake of argument), all 100 also happen to be
OA: Will the first 100 items in a webwide google search be all and only
those 100 relevant journal articles? If not, how bad will the mix
be? That is the real question with webwide keyword-based searches on
content that is mostly not journal articles, but without any tag for
what *is* a journal article.

>    In Google, you may get more hits than you could ever scan, and many
>    of them will be worse than useless, but Google's PageRank algorithm
>    does a pretty good job of putting the ones you want near the top.

How good a job, when it is a keyword search and the list of relevant hits is
large, but the background noise is far larger? This could be quantitatively
benchmarked using a known set of OA articles.
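
One way such a benchmark could be scored (a sketch only; the identifiers
and the hit list are invented) is with the standard precision-at-k and
recall-at-k measures: of the top k hits, what fraction are known relevant
articles, and what fraction of all the known relevant articles made it
into the top k?

```python
def precision_at_k(ranked_hits, relevant, k):
    """Fraction of the top-k hits that are in the known relevant set.
    Assumes each hit appears at most once in ranked_hits."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_hits[:k]
    return sum(1 for hit in top if hit in relevant) / k


def recall_at_k(ranked_hits, relevant, k):
    """Fraction of the known relevant set that appears in the top-k hits."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top = ranked_hits[:k]
    return len(set(top) & set(relevant)) / len(relevant)


# Invented example: four known relevant OA articles, and a ranked hit list
# in which they are interleaved with look-alike junk.
relevant = {"art1", "art2", "art3", "art4"}
ranked_hits = ["art1", "junk1", "art3", "junk2", "junk3", "art2"]

p3 = precision_at_k(ranked_hits, relevant, 3)  # 2 of the top 3 are relevant
r3 = recall_at_k(ranked_hits, relevant, 3)     # 2 of the 4 relevant are found
```

Running the same keyword queries against a webwide engine and against a
journal-only database, and comparing these two figures at several values
of k, would put numbers on "how bad the mix is."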

>    Another place where Google has the advantage is full-text indexing.

Agreed! A *huge* advantage (though there is no reason the OAI engines can't
add this too -- or even co-opt google to do it en passant). But it is full-text
search of the journal literature, with many relevant hits, that is the critical
test for google.

>    Depositing eprints in OAI-compliant archives makes those eprints
>    fodder for all future OAI-compliant data services.  Depositing eprints
>    on a personal web site makes them fodder for all future iterations
>    and rivals of Google.  We don't have to wait for these services to
>    emerge, or to reach a certain level of adequacy, before we provide
>    OA to our eprints.  On the contrary, we should provide OA to our
>    work right now and let evolving data services compete to improve
>    upon the visibility and longevity of our work for the rest of time.

Hear, hear! (But providing OAI-compliant OA involves so little additional effort
compared to providing OA alone, for so much more benefit, that it hardly seems
sensible not to bother!)

>    First, providing OA does not require publisher setbacks.  Second,
>    undermining toll-access (TA) publishers does not necessarily
>    advance OA.

I could hardly agree more!

http://www.ecs.soton.ac.uk/~harnad/Temp/self-archiving_files/Slide0008.gif
http://www.ecs.soton.ac.uk/~harnad/Temp/self-archiving_files/Slide0009.gif

>    We know that providing OA does not require publisher setbacks
>    because some publishers are providing OA and others are considering
>    experiments with it.

And let us not leave out the third and most important category of
publishers! The "gold" publishers are the OA publishers/journals
who have adopted the OA publishing model or are experimenting with it
(c. 5%). But the "green" publishers/journals are the ones who have *not*
adopted the OA publishing model, hence are not providing OA themselves,
yet they have given their authors the "green light" to go ahead and
provide OA themselves, by self-archiving. And these now constitute
58% of publishers and 83% of journals (including the OA journals, of
course). So only 42% of publishers and 17% of journals are still "gray"
on OA!

http://www.ecs.soton.ac.uk/~harnad/Temp/self-archiving_files/Slide0036.gif
http://www.ecs.soton.ac.uk/~harnad/Temp/self-archiving_files/Slide0037.gif

>    The same conclusion follows from the fact
>    that OA and TA can coexist, as we know from present experience.
>    We can discuss the long-term prospects for their coexistence, but it
>    seems very likely that they will coexist for the indefinite future
>    while only their proportions will vary.  OA progress is entirely
>    compatible with TA survival.

Again, I couldn't agree more! No need for authors to speculate about
hypothetical future developments in journal publishing in the online age:
What authors need to do is to self-archive their own articles!

    "The Green Road to Open Access: A Leveraged Transition"
    http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/3378.html
    http://www.ecs.soton.ac.uk/~harnad/Temp/greenroad.html
    http://www.ecs.soton.ac.uk/~harnad/Tp/resolution.htm#4.2
    http://www.eprints.org/self-faq/#31.Waiting

>    TA publishers are not the enemy.  They are only unpersuaded.
>    Even when they are opposed, and not merely unpersuaded, they are only
>    enemies if they have the power to stop OA.  No publisher has this
>    power, or at least not by virtue of publishing under a TA business
>    model.  If we have enemies, they are those who can obstruct progress
>    to OA.  The only people who fit this description are friends of OA who
>    are distracted from providing OA by other work or other priorities.

Hear, hear! We have met the enemy, and he is us! The solution? We already
know the remedy too: Swan & Brown (2004) "asked authors to say how they
would feel if their employer or funding body required them to deposit
copies of their published articles in one or more... repositories. The
vast majority... said they would do so willingly."

    http://www.eprints.org/signup/sign.php

>    (2) Don't be distracted by public debate.
>
>    stick to the primary work of delivering OA... delivering OA is more
>    important than persuading publishers to join us in delivering OA... We
>    can provide OA without their consent, cooperation, or assistance.
>    The unpersuaded are not enemies.  Persuasion can fail while OA
>    succeeds.  We don't need unanimity; we need OA.

Hear, hear! From Peter's lips to the ears of the research community, their
institutions and their funders!

    Relevant prior threads:

    "Re: proposed collaboration: google + open citation linking"
    http://www.openarchives.org/pipermail/oai-general/2001-June/000035.html

    "Economic effects of link-based search engines on e-journals"
    http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/0894.html

    "A Search Engine for Searching Across Distributed Eprint Archives"
    http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/0927.html

    "Testing the citation-ranking search engine: Citebase"
    http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2121.html

    "Scientometric OAI Search Engines"
    http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2237.html

    "Need for systematic scientometric analyses of open-access data"
    http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2521.html

    "How to compare research impact of toll- vs. open-access research"
    http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2858.html

Stevan Harnad

NOTE: A complete archive of the ongoing discussion of providing open
access to the peer-reviewed research literature online (1998-2004)
is available at the American Scientist Open Access Forum:
        To join the Forum:
http://amsci-forum.amsci.org/archives/American-Scientist-Open-Access-Forum.html
        Post discussion to:
    american-scientist-open-access-forum at amsci.org
        Hypermail Archive:
    http://www.cogsci.soton.ac.uk/~harnad/Hypermail/Amsci/index.html

Unified Dual Open-Access-Provision Policy:
    BOAI-2 ("gold"): Publish your article in a suitable open-access
            journal whenever one exists.
            http://www.earlham.edu/~peters/fos/boaifaq.htm#journals
    BOAI-1 ("green"): Otherwise, publish your article in a suitable
            toll-access journal and also self-archive it.
            http://www.eprints.org/self-faq/
    http://www.soros.org/openaccess/read.shtml
    http://www.eprints.org/signup/