University Institutional Repository impact on citation of journal articles

Fri Nov 23 08:39:50 EST 2007

On Fri, 23 Nov 2007, Chiner Arias, Alejandro wrote:

> My interest is on any study that specifically concentrates on
> "Institutional Repository" self-archiving. It could be an added bonus
> if the study is making a distinction between pre-prints and post-prints,
> publisher's proof copy or publisher's published, or between immediate
> OA and embargo with metadata exposure only.
>
> For the purpose of IR advocacy, studies on "Open Access" advantage do
> little to persuade those who are already using central repositories
> outside the institution.

Ah, now I understand. Such a study is possible, if done by hand
(harvesting OA articles by robot, then hand-sifting them into (1) IR,
(2) CR, and (3) ordinary website content, and then comparing the
citation counts of each with the citation counts of matched non-OA
articles in the same journal issue).
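
To make the comparison concrete, here is a minimal sketch in Python of how the hand-sifted data might be tallied. The record layout, the locus labels and the sample numbers are all hypothetical, invented purely for illustration. Each OA article is compared with the mean citation count of the non-OA articles in the same journal issue, and the ratios are then averaged per deposit locus.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical hand-sifted records: (journal_issue, locus, citations).
# locus is "IR", "CR", "WEB", or "NONE" for non-OA articles.
SAMPLE = [
    ("J1-2005-3", "IR", 12),
    ("J1-2005-3", "CR", 15),
    ("J1-2005-3", "NONE", 6),
    ("J1-2005-3", "NONE", 4),
    ("J2-2005-1", "WEB", 9),
    ("J2-2005-1", "IR", 6),
    ("J2-2005-1", "NONE", 3),
]

def oa_advantage_by_locus(records):
    """Return, per deposit locus, the mean ratio of an OA article's
    citations to the mean citations of non-OA articles in the same issue."""
    # Collect the non-OA citation counts of each issue as matched controls.
    baseline = defaultdict(list)
    for issue, locus, cites in records:
        if locus == "NONE":
            baseline[issue].append(cites)

    ratios = defaultdict(list)
    for issue, locus, cites in records:
        if locus == "NONE":
            continue
        non_oa = baseline.get(issue)
        if not non_oa or mean(non_oa) == 0:
            continue  # no usable matched controls for this issue
        ratios[locus].append(cites / mean(non_oa))

    return {locus: mean(vals) for locus, vals in ratios.items()}

if __name__ == "__main__":
    for locus, ratio in sorted(oa_advantage_by_locus(SAMPLE).items()):
        print(f"{locus}: {ratio:.2f}x the matched non-OA baseline")
```

A real study would also need significance testing and far larger samples; the sketch shows only the matching-within-issue step.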

I do not believe, however, that it would be worth the effort such a
study would entail. It is unlikely that the existence or size of the OA
advantage will co-vary much with the form of deposit. More important,
even if it does, it is extremely unlikely that that would be because of
something *intrinsic* about the kind of repository: It would simply
reflect the accidental historic OA content situation today, with most
articles (85%) not yet being made OA by their authors in *any* of the
three ways, but with one CR (Arxiv) having the (incidental) advantage of
a historic (15-year) head-start in self-archiving in its subject-matter
(physics), and one other CR (PubMed Central) having the (incidental)
advantage of being coupled with a widely-used subject-specific *non-OA*
index (PubMed) that covers all and only its subject matter
(biomedicine).

It is quite possible that articles self-archived in those two CRs (and
*only* those two CRs, as there are no other such special cases) will
have somewhat higher citation counts than articles self-archived in IRs
or on ordinary websites today, because those two CRs (Arxiv and PubMed
Central) have strong direct user traffic of their own, in one case
because it is a CR of very long standing with a much larger than
baseline share of the OA content (Arxiv) and in the other because it is
associated with a particularly strong and heavily used host (PubMed
Central and PubMed), whereas the distributed IR harvesters (OAIster --
as well as Google and Google Scholar) are all still struggling with very
low-percentage OA content overall (15%), mixed across all subjects.

The two high-percentage CRs, being restricted to a well-stocked subject
area, currently have the advantage that you can not *only* search in
them for a specific item or author, known in advance -- in that
capability they have no advantage at all over IRs or websites -- but you
can also search for keywords within a subject area, where those two CRs
do not have the liability of bringing in a lot of irrelevant noise, or
drawing a near-blank, as searching over all IRs or websites with OAIster
or Google does -- *today*.

The emphasis is on *today*, because it should be obvious that the
advantage of Arxiv is not its centrality but the fact that it hosts most
of the relatively high percentage of OA content in its subject. PubMed
Central does not yet have much OA content, but it has the advantage
of being associated with PubMed (which has *all* the content of its
subject, mostly non-OA), and restricts search to that content alone.

So there are two completely independent issues here: (1) percentage OA
content and (2) subject-specific search. I think it is obvious that OA
content is the decisive factor. For if the content in all subjects were
already 100% OA *and* in IRs, it would be a relatively simple matter to
optimize OAIster and Google Scholar, the OA IR harvesters, to search
over all and only a given subject matter -- especially with the
help of the IRs' OAI metadata tags, which include the department of the
author (and sometimes even, unnecessarily, subject-descriptor tags). In
other words, restricting content by subject is a minor,
harvesting/tagging issue, not a deposit-locus issue, whereas generating
OA content in the first place is the major problem (and major obstacle
to access, usage and citations).
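
As a rough illustration of why subject restriction is only a harvesting/tagging matter, here is a minimal Python sketch of a keyword search limited to one subject over records harvested from several IRs. The record fields, repository names and the department-to-subject mapping are assumptions made up for this example; they are not the actual OAI metadata schema, nor a description of how OAIster or Google Scholar work.

```python
# Hypothetical harvested metadata records. The field names are made up
# for this sketch and are not the actual OAI-PMH schema.
RECORDS = [
    {"title": "Citation impact of self-archived physics preprints",
     "department": "Physics", "repository": "ir.univ-a.example"},
    {"title": "Protein folding and citation networks",
     "department": "Biochemistry", "repository": "ir.univ-b.example"},
    {"title": "Impact factors in medieval history journals",
     "department": "History", "repository": "ir.univ-a.example"},
]

# Assumed mapping from department names to broad subject areas.
DEPT_TO_SUBJECT = {
    "Physics": "physics",
    "Biochemistry": "biomedicine",
    "History": "humanities",
}

def search_within_subject(records, subject, keyword):
    """Return titles of records in the given subject whose title
    contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [
        r["title"]
        for r in records
        if DEPT_TO_SUBJECT.get(r["department"]) == subject
        and keyword in r["title"].lower()
    ]

if __name__ == "__main__":
    print(search_within_subject(RECORDS, "physics", "citation"))
```

The filter runs entirely on the harvester's side, whichever repository each record was deposited in.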

But from what I've said so far, it still sounds as if it makes no
difference whether the OA content is deposited in a CR, an IR or a
website, just as long as it's OA, and there for the harvesting. This
focus on the locus of the article misses the most fundamental point,
which is the *source* of the article: For all articles have *authors*
and (just about) all authors have institutions. So it is 85% of authors
who are not yet self-archiving, but (just about) every one of those
authors also has an institution that is likewise losing a good deal from
the fact that they are not self-archiving and -- most important -- is in
a position to *mandate* that their institutional authors self-archive in
their own institutional repository, in a position to monitor that they
do so, and in a position to reward compliance with the usual rewards for
enhanced research impact. Institutions tile all of research output
space. CRs do not; CRs make much more sense as harvesters. Moreover,
neither CRs nor "subjects" (disciplines) are entities -- the way their
authors' institutional employers are -- with any means to require or
reward self-archiving. CRs rely entirely on authors' spontaneous
inclination to self-archive -- and that, apart from the prominent but
lonely exception of (certain parts of) physics, keeps hovering at 15%.

Self-archiving mandates are at long last starting to be adopted, by both
institutions and funders, but here too we have to be careful to think
through the strategic question of the *locus* that the mandate should
dictate for the self-archiving. It's fairly obvious that it makes no
sense for institutions to mandate that their own authors deposit in CRs
rather than in their own IRs: Apart from the fact that CRs do not yet
exist in most fields, institutions are not in a position to monitor
compliance for all possible CRs, nor do they stand to benefit nearly as
much, in terms of institutional visibility and record-keeping, if they
mandate CR deposit willy-nilly rather than local deposit in their own
IRs. From an institutional point of view, a local IR deposit mandate
makes most sense, leaving CRs to be the harvesters they ought to be,
rather than the loci of deposit.

What about the funder's standpoint? The biggest funder mandate momentum
today is in biomedicine, inspired by Harold Varmus and PubMed Central,
so most of the biomedical funder mandates have stipulated depositing in
PubMed Central. This will soon substantially increase the content of
PubMed Central, and the percentage OA in biomedicine. But there are
several problems: (1) not all biomedical research is funded by the
mandating funders; (2) not all research is funded at all; and, most
important, (3) not all research is biomedical. Unlike institutions,
funders in general (and biomedical research funders in particular) *do
not tile all of research space*. Nor are they one single entity.
Moreover, all the advantages -- for funders -- that accrue from
mandating OA would remain if funders mandated each researcher's own *IR*
as the locus of the deposit! That way funding mandates could reinforce
institutional mandates, helping to tile all of research output space;
and, if the funders desire, the content can be harvested into designated
CRs as they see fit.

The web is not a central locus, it is a distributed network of local
websites. Nor is Google a central locus: it is a harvester of
distributed content. This distributed-content/central-harvesting+search
principle seems to have evolved naturally on the web. There is every
reason for OA IRs and CRs to build upon it, rather than unimaginatively
regressing to a time when central benefits could only be had if content
had a central locus.

So this is all a lengthy way of explaining why any incidental citation
count advantages that some CRs might enjoy over IRs today would not at
all mean what we might naively be tempted to interpret them to mean:
that it is better to deposit in a CR than an IR. Rather, they mean that
one CR (Arxiv) happens to have a 15-year head start and another (PubMed
Central) may soon have a funder kick-start -- but any resultant citation
advantages of CRs are just search advantages that are just as possible
with distributed archiving, harvesting, and adapted search engines
(e.g. citebase), and not at all intrinsic to deposit locus. The optimal
locus for deposit is the IR. Both institutions and funders should mandate
IR deposit. And CRs should be harvesters, not primary deposit loci.

Coda: The query was about whether the OA citation advantage is greater
for articles in OA CRs, OA IRs or OA websites, but one might just as
well have asked about articles in OA journals! Eysenbach found that the
OA citation advantage was greater for PNAS articles archived on the PNAS
journal website than for those self-archived in the author's IR or
website. This too is merely an artifact of the fact that so little
content on the web is OA today, whereas the PNAS website is a high-profile
locus for direct visits and local search. But this local advantage would
vanish if all articles were OA (somewhere on the Web), as then it would
make as little sense to seek an article by directly visiting the PNAS
website as by directly visiting any particular IR: Searching and harvesting
is a global matter, over distributed content; only deposit is a local
matter. And the optimal locus for OA content, the one that scales to
all of research space, is each author's own OAI-compliant IR.

> -----Original Message-----
> From: Stevan Harnad
> Sent: 22 November 2007 21:16
> To: SIGMETRICS (ASIS&T Special Interest Group on Metrics)--
> listserv.utk.edu
> Subject: Re:University Institutional Repository impact on citation of
> journal articles
>
> On Tue, 20 Nov 2007, Chiner Arias, Alejandro wrote:
>
>> Does article self-archiving in an Institutional Repository increase
>> citation of the articles that are later published in peer-reviewed
>> scholarly journals?
>
> Yes:
>
>     Brody, T., Harnad, S. and Carr, L. (2006) Earlier Web Usage
>     Statistics as Predictors of Later Citation Impact. Journal of the
>     American Society for Information Science and Technology (JASIST),
>     57 (8). pp. 1060-1072. http://eprints.ecs.soton.ac.uk/10713/
>
> See also the work of Kurtz et al, and of Moed, on the "Early
> Advantage," in the OpCit Bibliography that you cite below.

Stevan Harnad
AMERICAN SCIENTIST OPEN ACCESS FORUM:
http://amsci-forum.amsci.org/archives/American-Scientist-Open-Access-Forum.html
     http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/

UNIVERSITIES and RESEARCH FUNDERS:
If you have adopted or plan to adopt a policy of providing Open Access
to your own research article output, please describe your policy at:
     http://www.eprints.org/signup/sign.php
     http://openaccess.eprints.org/index.php?/archives/71-guid.html
     http://openaccess.eprints.org/index.php?/archives/136-guid.html

OPEN-ACCESS-PROVISION POLICY:
     BOAI-1 ("Green"): Publish your article in a suitable toll-access journal
     http://romeo.eprints.org/
OR
     BOAI-2 ("Gold"): Publish your article in an open-access journal if/when
     a suitable one exists.
     http://www.doaj.org/
AND
     in BOTH cases self-archive a supplementary version of your article
     in your own institutional repository.
     http://www.eprints.org/self-faq/
     http://archives.eprints.org/
     http://openaccess.eprints.org/