Predicting later citation counts from very early data
Stevan Harnad
harnad at ECS.SOTON.AC.UK
Mon Apr 21 07:47:13 EDT 2008
On 20-Apr-08, at 9:47 AM, Peter Suber wrote:
> Hi Stevan: Yesterday I tried to send the message below to the OACI
> list. But I got an error message suggesting that the list has been
> discontinued.
>
> Instead of predicting citations from early downloads, as you've
> done, this team predicts citations from properties of the article.
> Prediction of citation counts for clinical articles at two years
> using data available within three weeks of publication:
> retrospective cohort study, BMJ, February 21, 2008. http://dx.doi.org/10.1136/bmj.39482.526713.BE
> Conclusion: Citation counts can be reliably predicted at two years
> using data within three weeks of publication.
Hi Peter,
I am forwarding your post instead to the Sigmetrics list: SIGMETRICS at LISTSERV.UTK.EDU
This interesting article finds (using multiple regression analysis)
that a number of metrics available immediately upon publication
predict citations two years later.
1274 articles from 105 journals published from January to June 2005,
randomly divided into a 60:40 split to provide derivation and
validation datasets. 20 article and journal features, including
ratings of clinical relevance and newsworthiness, routinely collected
by the McMaster online rating of evidence system, compared with
citation counts at two years. The derivation analysis showed that the
regression equation accounted for 60% of the variation (R²=0.60, 95%
confidence interval 0.538 to 0.629). This model applied to the
validation dataset gave a similar prediction (R²=0.56, 0.476 to 0.596,
shrinkage 0.04; shrinkage measures how well the derived equation
matches data from the validation dataset). Cited articles in the top
half and top third were predicted with 83% and 61% sensitivity and 72%
and 82% specificity. Higher citations were predicted by indexing in
numerous databases; number of authors; abstraction in synoptic
journals; clinical relevance scores; number of cited references; and
original, multicentred, and therapy articles from journals with a
greater proportion of articles abstracted. Conclusion: Citation
counts can be reliably predicted at two years using data within three
weeks of publication.
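The derivation/validation procedure summarised above can be sketched roughly as follows. This is only an illustration on synthetic data, not the authors' analysis: the three predictors, their coefficients, the noise level, the sample size of 1000 and the helper names (solve, fit, predict, r_squared, top_half_flags) are all invented for the example. Shrinkage is taken here to be the derivation R² minus the validation R², which matches the figures quoted (0.60 - 0.56 = 0.04), and the "top half" rule is assumed to be a simple median cut.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit(xs, ys):
    """Ordinary least squares via the normal equations, with an intercept."""
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

def predict(coef, xs):
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], x)) for x in xs]

def r_squared(ys, preds):
    """Proportion of variance in ys accounted for by preds."""
    mean = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def top_half_flags(values):
    """Flag values at or above the upper median as 'top half'."""
    threshold = sorted(values)[len(values) // 2]
    return [v >= threshold for v in values]

# Synthetic "articles": three made-up predictors and a noisy outcome (assumption).
rng = random.Random(0)
n_articles = 1000
xs = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n_articles)]
ys = [2.0 + 1.5 * x[0] + 0.8 * x[1] - 0.5 * x[2] + rng.gauss(0, 1.5) for x in xs]

# Random 60:40 split into derivation and validation sets.
idx = list(range(n_articles))
rng.shuffle(idx)
cut = int(0.6 * n_articles)
deriv, valid = idx[:cut], idx[cut:]
x_deriv, y_deriv = [xs[i] for i in deriv], [ys[i] for i in deriv]
x_valid, y_valid = [xs[i] for i in valid], [ys[i] for i in valid]

# Fit on the derivation set only, then score both sets with the same equation.
coef = fit(x_deriv, y_deriv)
r2_deriv = r_squared(y_deriv, predict(coef, x_deriv))
pred_valid = predict(coef, x_valid)
r2_valid = r_squared(y_valid, pred_valid)
shrinkage = r2_deriv - r2_valid

# How well does the equation pick out the validation articles in the top half?
actual = top_half_flags(y_valid)
called = top_half_flags(pred_valid)
tp = sum(a and c for a, c in zip(actual, called))
tn = sum((not a) and (not c) for a, c in zip(actual, called))
sensitivity = tp / sum(actual)
specificity = tn / (len(actual) - sum(actual))

print(f"R2 derivation={r2_deriv:.3f} validation={r2_valid:.3f} shrinkage={shrinkage:.3f}")
print(f"top half: sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Because the median cut makes the actual and predicted top halves the same size, sensitivity and specificity coincide in this sketch; a different cut-off for the predicted values would trade one against the other.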
This finding reinforces the importance of taking into account as many
predictor metrics as possible, though a number of the metrics do seem
specific to clinical medical articles. Physician ratings of clinical
relevance, whose high correlation with citations was apparently
already known, are a variable specific to this field. (The metrics
used are listed at the end of this message.)
We might perhaps make a distinction between static and dynamic
metrics. This study was based largely on static metrics, in that they
are fixed as of the day of publication. Dynamic metrics like early
downloads (which have also been found to predict later citations) were
not included (the Perneger study was cited but the Brody et al study
was not), nor were early citation growth metrics (also predictive of
later citations).
Perneger TV. Relation between online "hit counts" and subsequent
citations: prospective study of research papers in the BMJ. BMJ
2004;329:546-7. doi:10.1136/bmj.329.7465.546
Brody, T., Harnad, S. and Carr, L. (2006) Earlier Web Usage Statistics
as Predictors of Later Citation Impact. Journal of the American
Society for Information Science and Technology (JASIST) 57(8) pp.
1060-1072. http://eprints.ecs.soton.ac.uk/10713/
Journal impact factor was not included either, because it was not
available for a large number of journals in the sample.
To my mind, the article reinforces the importance of validating all
these metrics, not just against one another, but against peer
evaluations, in all fields, as in the RAE 2008 database:
Harnad, S. (2007) Open Access Scientometrics and the UK Research
Assessment Exercise. In Proceedings of 11th Annual Meeting of the
International Society for Scientometrics and Informetrics 11(1), pp.
27-33, Madrid, Spain. Torres-Salinas, D. and Moed, H. F., Eds. http://eprints.ecs.soton.ac.uk/13804/
Stevan Harnad
----------------------------------------------------------------------------------------------------------
Predictor variable: Hypothesised influence

Article specific, from external sources:
  No of authors: More authors
  Residence of first author in North America: North America
  No of pages: Longer article
  No of references in bibliography: More references
  No of participants: More participants
  Structured abstract: Structured abstracts
  Length of abstract: Longer
  Multicentre studies: If multicentred
  Original article rather than systematic review: If systematic review
  Dealing with therapy: If therapy

Article specific, from internal sources:
  No of disciplines chosen relevant to article (breadth of interest): More disciplines
  Average relevance scores over all raters: Higher scores
  Average newsworthiness scores over all raters: Higher scores
  Average time taken by raters to rate article: More time
  Whether article was selected for abstraction in 1 of 3 synoptic journals: If yes
  No of views per email alert sent: More views per alert

Journal specific, using internal data:
  Proportion of articles that passed criteria (2005): Higher proportion
  Proportion abstracted by 3 synoptic journals: Higher proportion

Journal specific, using external data:
  No of databases that index journal: More databases