Open Access: sample size, generalizability and self-selection

Stevan Harnad amsciforum at GMAIL.COM
Sat Nov 27 10:03:32 EST 2010


On Thu, Nov 25, 2010 at 11:26 PM, Philip Davis <pmd8 at cornell.edu> wrote:

> Stevan,
>
> You seem to have conceded that your small sample size critique does not
> hold.


Phil,

I'm not sure why you speak of "concession" since we are talking here about
data, not our respective opinions or preferences.

Your one-year sample was definitely too small and too early to conclude that
there was not going to be an OA citation advantage. However, I agree
completely that a null effect after three years does count as a failure to
replicate the frequently reported OA citation advantage, because many of
those reports were themselves based on samples and time-intervals of
comparable size.

What this means is that your three-year outcome would definitely qualify for
inclusion in a meta-analysis of all the tests of the OA citation advantage
<http://opcit.eprints.org/oacitation-biblio.html> -- whether their outcomes
were positive, negative or null -- with a higher weighting (for sample size,
interval and power) than the one-year study would have done.
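
For concreteness, here is a minimal sketch (in Python, with invented effect
sizes and standard errors, not figures from any of the actual studies) of how
a fixed-effect meta-analysis weights each study by the inverse of its
variance, so that bigger, more precise studies count for more:

    import numpy as np

    # Hypothetical per-study OA citation-advantage estimates (log citation
    # ratios) and their standard errors; larger studies have smaller SEs.
    effects = np.array([0.25, 0.30, 0.10, -0.02, 0.18])   # invented values
    se      = np.array([0.05, 0.08, 0.15, 0.12, 0.04])    # invented values

    # Fixed-effect (inverse-variance) weighting: weight = 1 / variance.
    weights   = 1.0 / se**2
    pooled    = np.sum(weights * effects) / np.sum(weights)
    pooled_se = np.sqrt(1.0 / np.sum(weights))

    print(f"pooled effect: {pooled:.3f} +/- {1.96 * pooled_se:.3f} (95% CI)")

On this weighting, a null result from a large, long, broad sample pulls the
pooled estimate down far more than a null result from a small one -- which is
exactly why sample size, interval and power matter here.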

What it certainly does not mean, however, is that your null outcome has now
demonstrated that all the positive outcomes were just a result of a
self-selection artifact (which your randomization has now eliminated)!

The reason -- to remind you again -- is that your study did not first
replicate the self-selected OA advantage in its own sample population
(neither in the 1-year sample nor in the 3-year sample). Without that
baseline, it has definitely not been demonstrated that randomization
eliminates the OA advantage. Your study has just shown (as a few other
studies have already done) that in some samples no OA citation advantage is
found.

These null studies have been a small minority, and their sample sizes have
been small -- small not only in terms of the number of articles and journals
sampled but (perhaps even more important) small in terms of the number of
fields sampled.

All of these variables can be duly taken into account in a meta-analysis,
should someone take up Alma Swan's call
<http://eprints.ecs.soton.ac.uk/18516/> to do one: see Gene Glass's comments
<http://scholarlykitchen.sspnet.org/2010/03/11/rewriting-the-history-on-access/>
on this.

Let's move to your new concern about generalizability:

The concern that tests for the OA citation advantage should be done across
all fields is not a new one, and it continues to be a valid one. One cannot
draw conclusions about all or even most fields unless the samples are
representative of all or most of them (i.e., not just big enough and long
enough, but broad enough).

> While I can't claim negative results across all fields and across all times,
> our randomized controlled trials (RCTs) did involve 36 journals produced by
> 7 different publishers in the medical, biological, and multi-disciplinary
> sciences, plus the social sciences and humanities.


That's correct. And any meta-analysis would duly take that into account.

> The nature of the RCTs means a lot of human intervention goes into setting
> up and running the experiments.  In comparison, retrospective observational
> studies (the studies you cite as comparisons) are largely automated and can
> gather a huge amount of data quickly with little human intervention.  Yet,
> if you are basing your comparison solely on number of journals and number of
> articles, then you are completely missing the rationale for conducting the
> RCTs in the first place:

The many published comparisons are based on comparing OA and non-OA articles
within the same journal and year. Those are the studies that simply test
whether there is an OA citation advantage.

But the comparison you are talking about is the comparison between
self-selected and imposed OA. For that, you need to have a (sufficiently
large, long, broad and representative) sample of self-selected and imposed
OA. Then you have to see which hypothesis the outcome supports:

According to the self-selection artifact hypothesis, the self-selection
sample should show the usual OA citation advantage whereas the imposed
sample should show no citation advantage (or a significantly smaller one).
This would show whether (and to what degree) the OA citation advantage is
the result of a self-selection bias.

But if the outcome is that the OA citation advantage is the same whether the
OA is self-selected or imposed, then this shows that the self-selection
artifact hypothesis is incorrect.
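
A minimal sketch of that comparison (in Python, on invented citation counts,
not your data or ours) is to compute the OA/non-OA citation ratio separately
in the self-selected arm and in the imposed arm, and then bootstrap the
difference:

    import numpy as np

    rng = np.random.default_rng(0)

    def oa_advantage(oa, non_oa):
        """OA citation advantage as the ratio of mean citation counts."""
        return oa.mean() / non_oa.mean()

    # Hypothetical citation counts for the four cells (all values invented).
    self_sel_oa  = rng.poisson(12, 400)    # self-selected OA articles
    self_sel_non = rng.poisson(8, 1600)    # non-OA, same journals and years
    imposed_oa   = rng.poisson(12, 400)    # OA imposed (mandate or lottery)
    imposed_non  = rng.poisson(8, 1600)

    adv_self    = oa_advantage(self_sel_oa, self_sel_non)
    adv_imposed = oa_advantage(imposed_oa, imposed_non)

    # Bootstrap a 95% CI for the difference between the two advantages.
    diffs = []
    for _ in range(2000):
        a = oa_advantage(rng.choice(self_sel_oa, self_sel_oa.size),
                         rng.choice(self_sel_non, self_sel_non.size))
        b = oa_advantage(rng.choice(imposed_oa, imposed_oa.size),
                         rng.choice(imposed_non, imposed_non.size))
        diffs.append(a - b)
    lo, hi = np.percentile(diffs, [2.5, 97.5])

    print(f"self-selected advantage: {adv_self:.2f}")
    print(f"imposed advantage:       {adv_imposed:.2f}")
    print(f"difference, 95% CI:      [{lo:.2f}, {hi:.2f}]")

Under the self-selection artifact hypothesis the interval should lie well
above zero; if it straddles zero (as it does in this simulation, where both
arms were generated with the same advantage), the artifact hypothesis is not
supported.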

Your study has shown that in your sample (consisting of OA imposed by
randomization), there is no OA citation advantage (only an OA download
advantage). But it has not shown that there is any OA self-selection
advantage either. Without that, there is only the non-replication of the OA
citation advantage.

Recall that the few other studies that have failed to replicate the OA
citation advantage were all based on self-selected OA. So it does happen,
occasionally, that a sample fails to find an OA citation advantage. To show
that this is because the randomization has eliminated a self-selection bias
requires a lot more.

And meanwhile, we too have tested whether the OA advantage is an artifact of
a self-selection bias, using mandated OA as the means of imposing the OA,
instead of randomization. This allowed us to test a far bigger, longer, and
broader sample. We replicated the widely reported OA citation advantage for
self-selected OA, and found that OA imposed by mandates results in just as
big an OA citation advantage.

(Yassine Gargouri will soon post the standard error of the mean, based on
our largest sample, for subsamples of the same size as yours; this will give
an idea of the underlying variability as well as the probability of
encountering a subsample with a null outcome. I stress, though, that this is
not a substitute for a meta-analysis.)
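
The computation itself is straightforward. Here is a minimal sketch (in
Python, on simulated per-article advantage scores, not Gargouri's actual
numbers) of estimating the standard error of the mean for a given subsample
size, and the chance that a random subsample of that size comes out null:

    import numpy as np

    rng = np.random.default_rng(1)

    # Simulated per-article log citation-advantage scores for a large
    # population (mean and spread invented for illustration).
    population = rng.normal(loc=0.05, scale=1.0, size=20_000)

    n_sub = 1200   # assumed size of the smaller study's sample
    sem = population.std(ddof=1) / np.sqrt(n_sub)
    print(f"SEM for n={n_sub}: {sem:.3f}")

    # Fraction of random subsamples of that size whose 95% CI covers zero,
    # i.e. the probability of encountering a null outcome by chance.
    trials, nulls = 5_000, 0
    for _ in range(trials):
        sub = rng.choice(population, n_sub, replace=False)
        m, se = sub.mean(), sub.std(ddof=1) / np.sqrt(n_sub)
        if abs(m) < 1.96 * se:
            nulls += 1
    print(f"fraction of null subsamples: {nulls / trials:.2%}")

With a small true advantage and a noisy population, a substantial fraction of
subsamples of that size can come out null even though the population effect
is real -- which is the point about underlying variability.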

> By design, RCTs are better at isolating possible causes, determining the
> direction of causation, and ruling out confounding variables.  While it is
> impossible to prove cause and effect, RCTs generally provide much stronger
> evidence than retrospective observational studies.

What you are calling "retrospective observational studies" are studies of
self-selected OA. Randomized OA is one way to test the effect of
self-selection; mandated OA is another. Both can also use multiple regression
to control for the many other variables correlated with citations (and have
done so), but mandates have the advantage of generating a much bigger,
longer, and broader sample. (As mandates grow, it will become easier and
easier for others to replicate our findings with studies of their own, even
estimating the time it takes for mandates to take effect.)
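
As a sketch of what such a regression looks like (in Python with statsmodels,
on a simulated dataset; every column name and coefficient here is invented
for illustration, not drawn from our study):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    n = 5000

    # Simulated article-level data; all columns and effects are invented.
    df = pd.DataFrame({
        "oa": rng.integers(0, 2, n),                  # 1 = open access
        "age": rng.integers(1, 10, n),                # years since publication
        "journal_if": rng.lognormal(0.5, 0.5, n),     # journal impact factor
        "field": rng.choice(["bio", "med", "soc", "hum"], n),
    })
    df["log_cites"] = (0.15 * df["oa"] + 0.3 * np.log(df["journal_if"])
                       + 0.05 * df["age"] + rng.normal(0, 1, n))

    # OLS of log citations on OA status, controlling for article age,
    # journal impact and field; the coefficient on "oa" is the adjusted
    # OA advantage.
    fit = smf.ols("log_cites ~ oa + age + np.log(journal_if) + C(field)",
                  data=df).fit()
    print(fit.summary().tables[1])

Adding an interaction term between OA status and whether the OA was mandated
would test directly whether mandated OA carries the same advantage as
self-selected OA.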

(I expect that the next redoubt of the self-selectionists will be to suggest
that there is a "self-selection bias" in mandate adoption, with elite [more
cited/citeable] institutions more likely to adopt an OA mandate. But with a
range that includes Harvard and MIT, to be sure, but also Queensland
University of Technology and the University of Minho -- the world's and
Europe's first institutions to adopt a university-wide mandate [Southampton's
departmental mandate having been the first of all] -- it will be easy enough
for anyone who is motivated to test these increasingly far-fetched bias
hypotheses to control for each institution's pre-mandate citation rank: there
are now over a hundred mandates
<http://www.eprints.org/openaccess/policysignup/> to choose from. No need to
wait for RCTs there...)
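
Such a control is simple enough to run. A minimal sketch (in Python with
statsmodels, on invented per-institution figures): regress each mandated
institution's OA citation advantage on its pre-mandate citation rank and look
at the slope:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)

    # Hypothetical per-institution data for ~100 mandated institutions
    # (ranks and advantages invented for illustration).
    inst = pd.DataFrame({
        "pre_mandate_rank": np.arange(1, 101),        # 1 = most cited
        "oa_advantage": rng.normal(1.4, 0.3, 100),    # OA/non-OA ratio
    })

    # If elite status drove both mandate adoption and the advantage, the
    # slope on rank should be clearly negative; a flat slope argues
    # against that bias.
    fit = smf.ols("oa_advantage ~ pre_mandate_rank", data=inst).fit()
    print(fit.params)
    print(fit.pvalues)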

Your last concern was about a self-selection control group:
> There were not a lot of cases of self-archiving in our dataset.  Remember
> that we were not studying physics and that our studies began in 2007.  You
> will note that I report a positive citation effect in my Appendix, but
> because the act of self-archiving was out of our control, we could not
> distinguish between access and self-selection as a definitive cause.  I also
> report in my dissertation that articles selected and promoted by editors
> were more highly cited, but it appears that editors were simply selecting
> more citable articles (e.g. reviews) to promote.

Yes, there are big problems with gathering the self-selection control data
for randomized OA studies. That's why I recommend using mandated OA instead.
And you can find plenty of it in all fields, not just physics.

Stevan Harnad