How to Compare IRs and CRs - or maybe how not to?

Sun Feb 10 11:37:43 EST 2008

On Sat, 9 Feb 2008, Armbruster, Chris wrote:

> I also have my doubts that IRs, federated IRs and OAI-PMH will do the
> job...

But what job, exactly, is it that you doubt they can do, Chris? Because
searching over nonexistent content cannot be done by anyone or anything!

> but CRs are also sometimes no better. Even assuming that content
> is self-archived, will it be found?

This is rather like asking: "But assuming we have a cure for cancer,
how will we distribute it?"

The immediate goal is to find a cure for cancer. Let's wait till we have
one before assuming we have a nontrivial distribution problem too!

(Fortunately, in the case of OA, we already know the cure: mandate
self-archiving in OAI-compliant IRs.)

The reason the OA target content cannot be found today is that most
of it isn't there; hence no resource is (or needs to be) developed and
implemented today on the assumption that the target content is all or
mostly there, free for all on the web, and that the only thing we are
missing is a reliable way to find it.

What we need is that nonexistent content, not the content-finder.

(In a parallel reply to David Wojick I address the question of free
content in the "deep web," not indexed by Google: The solution there,
too, is to bring it to the reachable, surfable surface, by mandating
that it be deposited in the researcher's OAI-compliant Institutional
Repository [IR].)

> Consider this: It is often assumed that what stands in the way of
> enhanced functionality and quality is the lack of journal articles
> available in open access. However, a critical experiment has shown that
> databases already have problems with coverage even if items are available
> in open access.  It has been found (Bergstrom/Lavaty 2007) that for
> 33 key economic journals, ninety percent of articles in the most-cited
> journals had been self-archived and about fifty percent of articles in
> less-cited journals were also available freely online. All of the freely
> available articles were found through Google. Using Google Scholar, they
> found about 10% less. However, when using OAIster they found only 1/4 of
> the freely available articles and results were only marginally better
> for SSRN and RePEc searches.

(0) (It is noteworthy that the B&L study is in Economics, which, along
with Physics and Computer Science, is one of the three disciplines that
have been spontaneously self-archiving for over a decade and a half now. But
the OA problem is with all the other disciplines: They have not followed
this admirable example. Nor have even these three laudable disciplines
come anywhere near depositing 100% of their annual article output.)

(1) But I'm not sure what, exactly, your point is, Chris: If all of
the free articles were indeed found with Google, then find them with
Google! OAIster and Google Scholar will get them too, once they are
deposited in mandated OAI-compliant IRs, as proposed, rather than on
arbitrary websites, as now.

(2) Of course a specific-item Google search only works if you know that
the item is on the web, and you know some or all of the boolean search
words that will pick it out, Google-style. No use expecting much of
that content to pop up in a generic-topic Google search, where you have no
idea what is and isn't out there.

(3) The remedy for that is to have all of it in OAI-compliant IRs. Then
you can restrict the boolean full-text Google search to OA content, and
OA content alone, instead of searching for it in a haystack of at least
30 billion web pages (in Feb 2007).

Here is the sort of thing it would be absurd to expect to succeed
today, on the full web -- but would be a trivial piece of cake if
the full texts of all 2.5 million articles published annually in the
planet's 25,000 journals were self-archived in OAI-compliant IRs:

(i) Do a generic boolean search, GB, using content terms, on a dedicated
database, such as PubMed.

(ii) Then take the references for all the P PubMed hits, and first do a
specific-item boolean search, SB, for each of them, item by item, by
reference term, on the full web via Google.

(iii) Let's say the SB search on the web finds W of those P hits as
full texts on the web. W/P is the proportion of the PubMed hits that is
currently available free on the web (apart from the "deep web"
unreachable by Google).

(iv) Now re-do the generic boolean search GB (i.e., using content
terms rather than each item's reference) this time directly on the web,
via Google.

(v) Of course the result will be a huge and unnavigable mess, despite
the miracle of PageRank. PageRank is good enough for rank-ordering the
single targeted item reference search, but not for the generic boolean
search GB on content terms.

(vi) Why not? Two obvious reasons: (a) the target content that is there
is embedded in too large a mess of irrelevant content and (b) most of
the target content is not there.

(vii) Remedy: (a) get all of the target content out there in OAI-compliant
IRs so that (b) the search can be restricted to all and only the relevant
content, as it is in PubMed.
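
To make the W/P measure in steps (ii)-(iii) concrete, here is a minimal
Python sketch. It does not query PubMed or Google: the list of hits and
the found_on_web check are hypothetical stand-ins for the real
item-by-item specific-item search SB, and the function name is my own.

```python
from typing import Callable, Iterable


def web_coverage(references: Iterable[str],
                 found_on_web: Callable[[str], bool]) -> float:
    """Return W/P: the share of the P database hits whose full text
    the specific-item search (SB) finds free on the web."""
    refs = list(references)
    if not refs:
        raise ValueError("no hits: P is zero, so W/P is undefined")
    # W = number of hits for which the item-by-item search succeeds
    w = sum(1 for ref in refs if found_on_web(ref))
    return w / len(refs)


# Hypothetical data: four database hits, two of them free on the web.
pubmed_hits = ["ref A", "ref B", "ref C", "ref D"]
free_on_web = {"ref A", "ref C"}

coverage = web_coverage(pubmed_hits, lambda ref: ref in free_on_web)
print(f"W/P = {coverage:.2f}")
```

With real data, found_on_web would have to wrap whatever item-by-item
lookup is actually available; the arithmetic stays the same.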

But before someone draws it, let me point out that the *wrong* conclusion
to draw from this is that therefore the target content should be
deposited directly in a CR like PubMed Central, rather than in each
author's own institutional IR!

That would be like trying to inventory what each person on the planet
spends, on what, daily, by asking them to deposit each separate individual
purchase, item by item, directly into the central thematic database
or databases corresponding to the category or categories in which each
purchase falls (milk, dairy, food, toothpaste, movies, airplane tickets,
travel, leisure, traffic tickets...) rather than simply inventorying all
their individual purchases in their own local bank account, and letting
software harvest and classify the items centrally in the various ways
it sees fit.

It is not too far-fetched to say that that sort of direct centralism
would be tantamount to each author's having to deposit each publication
into a CR corresponding to every possible keyword and boolean combination
(when even depositing each publication into just the CR that provides the
"closest match" would be absurd).

Again, local deposit in each researcher's own OAI-compliant IR, and
then central harvesting by whatever central services we want to build on
the distributed IRs is the natural, obvious, and optimal solution that
will scale systematically to all of OA output worldwide.
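
As a rough illustration of "deposit locally, harvest centrally," here
is a minimal Python sketch of what a central harvester might do with
OAI-PMH ListRecords responses from several IRs. The repository names,
URLs and sample records are hypothetical, the XML is built in memory
rather than fetched, and paging via resumption tokens is left out.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url: str) -> str:
    """Build the ListRecords request a harvester would send to one IR."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    return base_url + "?" + query


def parse_titles(xml_text: str) -> List[str]:
    """Extract the Dublin Core title of each record in one response."""
    root = ET.fromstring(xml_text)
    titles = []
    for record in root.iter(OAI_NS + "record"):
        title = record.find(".//" + DC_NS + "title")
        if title is not None and title.text:
            titles.append(title.text.strip())
    return titles


def harvest(responses: Dict[str, str]) -> List[Tuple[str, str]]:
    """Merge the records of many local IRs into one central index."""
    index = []
    for repo, xml_text in responses.items():
        for title in parse_titles(xml_text):
            index.append((repo, title))
    return index


def sample_response(title: str) -> str:
    """Hypothetical one-record response, used here instead of a fetch."""
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
        "<ListRecords><record><metadata>"
        '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"'
        ' xmlns:dc="http://purl.org/dc/elements/1.1/">'
        "<dc:title>" + title + "</dc:title>"
        "</oai_dc:dc></metadata></record></ListRecords></OAI-PMH>"
    )


# Two hypothetical institutional repositories, one deposit each.
responses = {
    "ir-a.example.org": sample_response("Paper deposited at A"),
    "ir-b.example.org": sample_response("Paper deposited at B"),
}
index = harvest(responses)
print(list_records_url("http://ir-a.example.org/oai"))
print(index)
```

The point of the sketch is only the division of labour: each author
deposits once, locally, and any number of central services can then
harvest and classify the same records in whatever way they see fit.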

> Given the high propensity of economists
> to self-archive and the availability of institutional and disciplinary
> repositories, the differences between Google and the non-commercial
> solutions are so dramatic as to warrant the conclusion that the
> non-commercial solutions, whatever their merits, have only very limited
> potential.

But:

    (a) The "availability" of IRs and CRs has not been availed of
    spontaneously by the vast majority of researchers, in all disciplines:
    That's why IRs and CRs are largely empty (relative to their annual
    target content). And that's why institutional and funder self-archiving
    mandates are needed.

    (b) Although the spontaneous self-archiving propensity of economists
    (and physicists and computer scientists) is indeed much higher than
    that of most other disciplines, this admirable propensity predated
    the OA/OAI/IR era, and continues along the same spontaneous lines
    established over a decade and a half ago (direct deposit in the
    arXiv CR for physicists, deposit in local or central "working papers"
    series, harvested by the RePEc CR in economics, and arbitrary local
    website deposit, harvested by the CiteSeer CR in computer science).
    So that's another reason institutional and funder self-archiving
    mandates are needed.

(I am not quite sure what Google and commercial/noncommercial solutions
have to do with this. OA is not trying to compete with Google. Google's
fine. OA is trying to generate OA content, in OAI-compliant OA IRs and
CRs. Once it's all there, we can see whether boolean search with Google,
restricted to OA content alone, will be enough for search and navigation
purposes. If not, further resources will be developed. There's plenty
of scope for creativity there; the only thing missing is the content to
apply it to.)

Stevan Harnad
AMERICAN SCIENTIST OPEN ACCESS FORUM:
http://amsci-forum.amsci.org/archives/American-Scientist-Open-Access-Forum.html
     http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/

UNIVERSITIES and RESEARCH FUNDERS:
If you have adopted or plan to adopt a policy of providing Open Access
to your own research article output, please describe your policy at:
     http://www.eprints.org/signup/sign.php
     http://openaccess.eprints.org/index.php?/archives/71-guid.html
     http://openaccess.eprints.org/index.php?/archives/136-guid.html

OPEN-ACCESS-PROVISION POLICY:
     BOAI-1 ("Green"): Publish your article in a suitable toll-access journal
     http://romeo.eprints.org/
OR
     BOAI-2 ("Gold"): Publish your article in an open-access journal if/when
     a suitable one exists.
     http://www.doaj.org/
AND
     in BOTH cases self-archive a supplementary version of your article
     in your own institutional repository.
     http://www.eprints.org/self-faq/
     http://archives.eprints.org/
     http://openaccess.eprints.org/