Predicting later citation counts from very early data
John McDonald
John.McDonald at LIBRARIES.CLAREMONT.EDU
Sun Apr 27 00:33:45 EDT 2008
If you are interested in the use of the Negative Binomial Regression
model in citation analysis, see my paper:
McDonald, JD (2007) Understanding journal usage: A statistical analysis
of citation and use. JASIST, 58:1, p.39-50.
http://resolver.caltech.edu/CaltechLIB:2006.001
And the first (and possibly only other) example of its use in
bibliometrics:
Van Dalen, HP and Henkens K. (2001). What makes a scientific article
influential? The Case of Demographers. Scientometrics, 50:3 (455-482).
http://www.akademiai.com/content/r1601w1nx453281n/fulltext.pdf
John McDonald
Assistant Director
User Services & Technology Innovation
Libraries of the Claremont Colleges
800 N. Dartmouth Avenue
Claremont, CA 91711
909-621-8014
From: ASIS&T Special Interest Group on Metrics
[mailto:SIGMETRICS at listserv.utk.edu] On Behalf Of Phil Davis
Sent: Monday, April 21, 2008 6:24 AM
To: SIGMETRICS at listserv.utk.edu
Subject: Re: [SIGMETRICS] Predicting later citation counts from very
early data
Take a look at Fig 2, the residual plot of the regression analysis.
There is clearly something else going on in the data that the authors
are unable to model, as the residuals should be evenly dispersed
around the zero line.
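As a rough illustration of that check (a sketch only, on made-up data, not the BMJ dataset), one can fit a straight line, sort the residuals by fitted value, and look at the mean residual in each band. The helper names `fit_line` and `mean_residual_by_bin` and the toy data are assumptions for the example. If the model were adequate, each band's mean would sit near zero; a systematic pattern suggests missing structure.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def mean_residual_by_bin(xs, ys, n_bins=3):
    """Mean residual within bands of increasing fitted value."""
    a, b = fit_line(xs, ys)
    # Pair each fitted value with its residual, then sort by fitted value.
    pairs = sorted((a + b * x, y - (a + b * x)) for x, y in zip(xs, ys))
    size = len(pairs) // n_bins
    bins = [pairs[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(pairs[(n_bins - 1) * size:])  # last bin takes the remainder
    return [mean(r for _, r in chunk) for chunk in bins]

# Hypothetical data: the outcome grows as the square of the predictor,
# so a straight line is the wrong functional form.
xs = list(range(1, 13))
ys = [x * x for x in xs]

bin_means = mean_residual_by_bin(xs, ys)
print(bin_means)
```

With this toy data the low and high bands have positive mean residuals and the middle band a negative one, which is the kind of pattern a residual plot makes visible.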
Some of the data transformations in this paper are unorthodox (like
taking the square root instead of the natural log) and have no theoretical
basis. Since they have a lot of zeros in their dataset, a Negative
Binomial Regression model may have been a better choice than a linear
model. I'm surprised the reviewers of this article (or the staff
statistician at BMJ) didn't see Figure 2 as a red flag that their model
was problematic. Notice that they defined any article that received
more than 150 citations as an 'outlier' and threw it out for the simple
reason that it reduced the power of their prediction -- this reinforced
the notion that their model is problematic!
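To make the case for a count model concrete, here is a minimal sketch on invented citation counts (not the study's data). A Poisson model assumes the variance equals the mean; the common NB2 form of the Negative Binomial allows Var = mu + alpha * mu^2. The function name `overdispersion_summary` is hypothetical, and the method-of-moments estimate of alpha is only a quick diagnostic, not a fitted regression.

```python
from statistics import mean, variance

def overdispersion_summary(counts):
    """Return (mean, variance, alpha) for a list of non-negative counts.

    alpha is a method-of-moments estimate of the NB2 dispersion
    parameter, from Var = mu + alpha * mu**2. Poisson implies alpha = 0.
    """
    mu = mean(counts)
    var = variance(counts)  # sample variance (n - 1 denominator)
    alpha = max(0.0, (var - mu) / mu ** 2)
    return mu, var, alpha

# Hypothetical two-year citation counts: many zeros, a few large values.
citations = [0, 0, 0, 0, 1, 1, 2, 3, 5, 8, 20, 60]

mu, var, alpha = overdispersion_summary(citations)
zero_share = citations.count(0) / len(citations)
print(f"mean={mu:.2f} variance={var:.2f} alpha={alpha:.2f} zeros={zero_share:.0%}")
```

A variance far above the mean (alpha well above zero) is the usual sign that neither a linear model on raw counts nor a plain Poisson model fits comfortably. An actual Negative Binomial regression with covariates would normally be fitted with a statistics package.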
Still, the key message in this article, that characteristics of an
article (like article length, number of authors, etc.) can predict
citations, is not new (see Stewart, 1983). Judging from a quick glance
at their bibliography, the authors don't seem to be aware of the field
of bibliometrics.
Stewart, J. A. (1983). Achievement and Ascriptive Processes in the
Recognition of Scientific Articles. Social Forces, 62(1), 166-189.
--Phil Davis
Stevan Harnad wrote:
On 20-Apr-08, at 9:47 AM, Peter Suber wrote:
Hi Stevan: Yesterday I tried to send the message below to the OACI
list. But I got an error message suggesting that the list has been
discontinued.
Instead of predicting citations from early downloads, as you've done,
this team predicts citations from properties of the article.
Prediction of citation counts for clinical articles at two years
using data available within three weeks of publication: retrospective
cohort study, BMJ, February 21, 2008.
http://dx.doi.org/10.1136/bmj.39482.526713.BE
Conclusion: Citation counts can be reliably predicted at two
years using data within three weeks of publication.
Hi Peter,
I am forwarding your post instead to the Sigmetrics list:
SIGMETRICS at LISTSERV.UTK.EDU
This interesting article finds that a number of metrics available
immediately upon publication predict citations two years later
(using multiple regression analysis).
1274 articles from 105 journals published from January to June
2005, randomly divided into a 60:40 split to provide derivation and
validation datasets. 20 article and journal features, including ratings
of clinical relevance and newsworthiness, routinely collected by the
McMaster online rating of evidence system, compared with citation counts
at two years. The derivation analysis showed that the regression
equation accounted for 60% of the variation (R2=0.60, 95% confidence
interval 0.538 to 0.629). This model applied to the validation dataset
gave a similar prediction (R2=0.56, 0.476 to 0.596, shrinkage 0.04;
shrinkage measures how well the derived equation matches data from the
validation dataset). Cited articles in the top half and top third were
predicted with 83% and 61% sensitivity and 72% and 82% specificity.
Higher citations were predicted by indexing in numerous databases;
number of authors; abstraction in synoptic journals; clinical relevance
scores; number of cited references; and original, multicentred, and
therapy articles from journals with a greater proportion of articles
abstracted. Conclusion: Citation counts can be reliably predicted at
two years using data within three weeks of publication.
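The figures in this abstract can be reproduced arithmetically. The sketch below assumes that shrinkage is simply the derivation R2 minus the validation R2 (which matches the 0.60, 0.56 and 0.04 quoted) and uses the standard definitions of sensitivity and specificity. The function names and the example labels are made up for illustration.

```python
def shrinkage(r2_derivation, r2_validation):
    """Drop in explained variance when the equation meets new data."""
    return r2_derivation - r2_validation

def sensitivity_specificity(actual, predicted):
    """actual/predicted are sequences of 1 (e.g. top half) or 0."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# R2 values as quoted in the abstract.
print(round(shrinkage(0.60, 0.56), 2))

# Hypothetical labels: 1 = article ended up in the top half by citations.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
sens, spec = sensitivity_specificity(actual, predicted)
print(sens, spec)
```

Sensitivity is the share of truly highly cited articles that the model flags; specificity is the share of the remaining articles that it correctly leaves unflagged.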
This finding reinforces the importance of taking into account as many
predictor metrics as possible, though a number of the metrics do seem
specific to clinical medical articles. The (apparently already known)
high correlation with physician ratings for clinical relevance is a
variable specific to this field. (The metrics used are listed at the end
of this message.)
We might perhaps make a distinction between static and dynamic metrics.
This study was based largely on static metrics, in that they are fixed
as of the day of publication. Dynamic metrics like early downloads
(which have also been found to predict later citations) were not
included (the Perneger study was cited but the Brody et al study was
not), nor were early citation growth metrics (also predictive of later
citations).
Perneger TV. Relation between online "hit counts" and subsequent
citations: prospective study of research papers in the BMJ. BMJ
2004;329:546-7. doi:10.1136/bmj.329.7465.546
Brody, T., Harnad, S. and Carr, L. (2006) Earlier Web Usage
Statistics as Predictors of Later Citation Impact. Journal of the
American Society for Information Science and Technology (JASIST)
57(8) pp. 1060-1072. http://eprints.ecs.soton.ac.uk/10713/
Journal impact factor was not included either, because it was not
available for a large number of journals in the sample.
To my mind, the article reinforces the importance of validating all
these metrics, not just against one another, but against peer
evaluations, in all fields, as in the RAE 2008 database:
Harnad, S. (2007) Open Access Scientometrics and the UK Research
Assessment Exercise. In Proceedings of 11th Annual Meeting of the
International Society for Scientometrics and Informetrics 11(1), pp.
27-33, Madrid, Spain. Torres-Salinas, D. and Moed, H. F., Eds.
http://eprints.ecs.soton.ac.uk/13804/
Stevan Harnad
------------------------------------------------------------------------
Predictor variables                                                         Hypothesised influences

Article specific from external sources:
  No of authors                                                             More authors
  Residence of first author in North America                                North America
  No of pages                                                               Longer article
  No of references in bibliography                                          More references
  No of participants                                                        More participants
  Structured abstract                                                       Structured abstracts
  Length of abstract                                                        Longer
  Multicentre studies                                                       If multicentred
  Original article rather than systematic review                            If systematic review
  Dealing with therapy                                                      If therapy

Article specific from internal sources:
  No of disciplines chosen relevant to article (breadth of interest)        More disciplines
  Average relevance scores over all raters                                  Higher scores
  Average newsworthiness scores over all raters                             Higher scores
  Average time taken by raters to rate article                              More time
  Whether article was selected for abstraction in 1 of 3 synoptic journals  If yes
  No of views per email alert sent                                          More views per alert

Journal specific using internal data:
  Proportion of articles that passed criteria (2005)                        Higher proportion
  Proportion abstracted by 3 synoptic journals                              Higher proportion

Journal specific using external data:
  No of databases that index journal                                        More databases
--
Philip M. Davis
PhD Student
Department of Communication
336 Kennedy Hall
Cornell University, Ithaca, NY 14853
email: pmd8 at cornell.edu
phone: 607 255-4735
https://confluence.cornell.edu/display/~pmd8/resume