Predicting later citation counts from very early data

John McDonald John.McDonald at LIBRARIES.CLAREMONT.EDU
Sun Apr 27 00:33:45 EDT 2008


 
If you are interested in the use of the Negative Binomial Regression
model in citation analysis, see my paper:
 
McDonald, J. D. (2007). Understanding journal usage: A statistical analysis
of citation and use. JASIST, 58(1), 39-50.
http://resolver.caltech.edu/CaltechLIB:2006.001
And the first (and possibly only other) example of its use in
bibliometrics:
 
Van Dalen, H. P., & Henkens, K. (2001). What makes a scientific article
influential? The case of demographers. Scientometrics, 50(3), 455-482.
http://www.akademiai.com/content/r1601w1nx453281n/fulltext.pdf
 
 
 
John McDonald
Assistant Director
User Services & Technology Innovation
Libraries of the Claremont Colleges
800 N. Dartmouth Avenue
Claremont, CA 91711
909-621-8014
 
 
 

From: ASIS&T Special Interest Group on Metrics
[mailto:SIGMETRICS at listserv.utk.edu] On Behalf Of Phil Davis
Sent: Monday, April 21, 2008 6:24 AM
To: SIGMETRICS at listserv.utk.edu
Subject: Re: [SIGMETRICS] Predicting later citation counts from very
early data
 
Take a look at Fig 2, the residual plot of the regression analysis.
There is clearly something else going on in the data that the authors
are unable to model: the residuals should be dispersed evenly around
the zero line, and they are not.
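For anyone who wants to reproduce this kind of diagnostic, here is a
minimal sketch (not the authors' code; the data file, column names, and
the OLS-on-square-root model are assumptions for illustration):

    # Sketch only: assumes a CSV of articles with a two-year citation count
    # and a few predictor columns; the file and column names are hypothetical.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    df = pd.read_csv("articles.csv")
    y = np.sqrt(df["citations_2yr"])      # the paper's square-root transform
    X = sm.add_constant(df[["n_authors", "n_pages", "n_references"]])

    ols = sm.OLS(y, X).fit()

    # A well-specified model should show residuals scattered evenly around
    # the zero line, with no funnel shape or systematic drift.
    plt.scatter(ols.fittedvalues, ols.resid, s=10)
    plt.axhline(0, color="red")
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()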

Some of the data transformations in this paper are unorthodox (taking
the square root rather than the natural log, for example) and have no
theoretical basis.  Since the dataset contains a lot of zeros, a
Negative Binomial Regression model may have been a better choice than a
linear model.  I'm surprised that the reviewers of this article (or the
staff statistician at BMJ) didn't see Figure 2 as a red flag that the
model was problematic.  Notice that the authors defined any article
that received more than 150 citations as an 'outlier' and threw it out,
for the simple reason that it reduced the power of their prediction --
which only reinforces the point that their model is problematic!
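And for comparison, a minimal sketch (again my own illustration, not
code from either paper; the same hypothetical data frame and column
names are assumed) of fitting a negative binomial regression to the raw
counts instead of a linear model on transformed counts:

    # Sketch only: df, "citations_2yr" and the predictor names are hypothetical.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("articles.csv")
    y = df["citations_2yr"]
    X = sm.add_constant(df[["n_authors", "n_pages", "n_references"]])

    # Negative binomial GLM on the raw counts: it accommodates the many zeros
    # and the overdispersion typical of citation data without an ad hoc
    # transformation of the response.
    nb = sm.GLM(y, X, family=sm.families.NegativeBinomial()).fit()

    # The criticised alternative: ordinary least squares on square-root counts.
    ols = sm.OLS(np.sqrt(y), X).fit()

    print(nb.summary())
    print(ols.summary())

One could then compare the two fits' residual plots (or information
criteria) to see which handles the zeros and the long tail better.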

Still, the key message of this article -- that characteristics of an
article (such as article length and number of authors) can predict
citations -- is not new (see Stewart, 1983).  From a quick glance at
their bibliography, the authors do not seem to be aware of the field of
bibliometrics.


Stewart, J. A. (1983). Achievement and Ascriptive Processes in the
Recognition of Scientific Articles. Social Forces, 62(1), 166-189.

--Phil Davis


Stevan Harnad wrote: 
On 20-Apr-08, at 9:47 AM, Peter Suber wrote:



Hi Stevan:  Yesterday I tried to send the message below to the OACI
list.  But I got an error message suggesting that the list has been
discontinued.  

Instead of predicting citations from early downloads, as you've done,
this team predicts citations from properties of the article.
	Prediction of citation counts for clinical articles at two years
using data available within three weeks of publication: retrospective
cohort study, BMJ, February 21, 2008.
http://dx.doi.org/10.1136/bmj.39482.526713.BE
	Conclusion:  Citation counts can be reliably predicted at two
years using data within three weeks of publication.
 
Hi Peter,
 
I am forwarding your post instead to the Sigmetrics list:
SIGMETRICS at LISTSERV.UTK.EDU
 
This interesting article finds that a number of metrics available
immediately upon publication predict citations two years later (using
multiple regression analysis).
 
	1274 articles from 105 journals published from January to June
2005, randomly divided into a 60:40 split to provide derivation and
validation datasets. 20 article and journal features, including ratings
of clinical relevance and newsworthiness, routinely collected by the
McMaster online rating of evidence system, compared with citation counts
at two years. The derivation analysis showed that the regression
equation accounted for 60% of the variation (R2=0.60, 95% confidence
interval 0.538 to 0.629). This model applied to the validation dataset
gave a similar prediction (R2=0.56, 0.476 to 0.596, shrinkage 0.04;
shrinkage measures how well the derived equation matches data from the
validation dataset). Cited articles in the top half and top third were
predicted with 83% and 61% sensitivity and 72% and 82% specificity.
Higher citations were predicted by indexing in numerous databases;
number of authors; abstraction in synoptic journals; clinical relevance
scores; number of cited references; and original, multicentred, and
therapy articles from journals with a greater proportion of articles
abstracted. Conclusion:  Citation counts can be reliably predicted at
two years using data within three weeks of publication.
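To make the abstract's procedure concrete, here is a rough sketch
(illustrative only, with assumed data and column names, and a plain
linear model rather than the authors' exact specification) of the 60:40
derivation/validation split, the shrinkage figure, and the top-half
sensitivity/specificity:

    # Sketch only: data, column names, and the OLS model are illustrative
    # assumptions, not the BMJ authors' analysis.
    import pandas as pd
    import statsmodels.api as sm
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score

    df = pd.read_csv("articles.csv")
    predictors = ["n_authors", "n_references", "n_databases", "relevance_score"]

    # 60:40 split into derivation and validation sets, as in the BMJ study.
    deriv, valid = train_test_split(df, train_size=0.6, random_state=0)

    model = sm.OLS(deriv["citations_2yr"],
                   sm.add_constant(deriv[predictors])).fit()
    r2_deriv = model.rsquared

    pred = model.predict(sm.add_constant(valid[predictors]))
    r2_valid = r2_score(valid["citations_2yr"], pred)

    # Shrinkage: how much the derivation R^2 drops when the equation is
    # applied to the validation data (0.60 - 0.56 = 0.04 in the paper).
    shrinkage = r2_deriv - r2_valid

    # Sensitivity/specificity for predicting membership in the top half of
    # actual citation counts from the top half of predicted counts.
    actual_top = valid["citations_2yr"] >= valid["citations_2yr"].median()
    pred_top = pred >= pred.median()
    sensitivity = (actual_top & pred_top).sum() / actual_top.sum()
    specificity = (~actual_top & ~pred_top).sum() / (~actual_top).sum()

    print(r2_deriv, r2_valid, shrinkage, sensitivity, specificity)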
 
This finding reinforces the importance of taking into account as many
predictor metrics as possible, though a number of the metrics do seem
specific to clinical medical articles. The (apparently already known)
high correlation with physician ratings for clinical relevance is a
variable specific to this field. (The metrics used are listed at the end
of this message.)
 
We might perhaps make a distinction between static and dynamic metrics.
This study was based largely on static metrics, in that they are fixed
as of the day of publication. Dynamic metrics like early downloads
(which have also been found to predict later citations) were not
included (the Perneger study was cited but the Brody et al. study was
not), nor were early citation growth metrics (also predictive of later
citations).
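As a sketch of how such a dynamic metric can be tested (illustrative
only, with hypothetical column names, and not the Brody et al. method
verbatim), one would correlate early download counts with later
citation counts:

    # Sketch only: the column names are hypothetical.
    import pandas as pd
    from scipy.stats import spearmanr

    df = pd.read_csv("articles.csv")
    rho, p = spearmanr(df["downloads_first_6mo"], df["citations_2yr"])
    print(f"Spearman rho = {rho:.2f} (p = {p:.3g})")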
 
	Perneger TV. Relation between online "hit counts" and subsequent
citations: prospective study of research papers in the BMJ. BMJ
2004;329:546-7. doi:10.1136/bmj.329.7465.546
 
	Brody, T., Harnad, S. and Carr, L. (2006) Earlier Web Usage
Statistics as Predictors of Later Citation Impact. Journal of the
American Society for Information Science and Technology (JASIST)
57(8) pp. 1060-1072. http://eprints.ecs.soton.ac.uk/10713/
	 
Journal impact factor was not included either, because it was not
available for a large number of journals in the sample.
 
To my mind, the article reinforces the importance of validating all
these metrics, not just against one another, but against peer
evaluations, in all fields, as in the RAE 2008 database:
 
	Harnad, S. (2007) Open Access Scientometrics and the UK Research
Assessment Exercise. In Proceedings of 11th Annual Meeting of the
International Society for Scientometrics and Informetrics 11(1), pp.
27-33, Madrid, Spain. Torres-Salinas, D. and Moed, H. F., Eds.
http://eprints.ecs.soton.ac.uk/13804/
	 
Stevan Harnad
 
------------------------------------------------------------------------
	Predictor variables -- hypothesised influence on citations:

	Article specific, from external sources:
	No of authors -- more authors
	Residence of first author in North America -- North America
	No of pages -- longer article
	No of references in bibliography -- more references
	No of participants -- more participants
	Structured abstract -- structured abstract
	Length of abstract -- longer abstract
	Multicentre studies -- if multicentred
	Original article rather than systematic review -- if systematic review
	Dealing with therapy -- if therapy

	Article specific, from internal sources:
	No of disciplines chosen as relevant to article (breadth of interest) -- more disciplines
	Average relevance score over all raters -- higher scores
	Average newsworthiness score over all raters -- higher scores
	Average time taken by raters to rate article -- more time
	Whether article was selected for abstraction in 1 of 3 synoptic journals -- if yes
	No of views per email alert sent -- more views per alert

	Journal specific, using internal data:
	Proportion of articles that passed criteria (2005) -- higher proportion
	Proportion abstracted by 3 synoptic journals -- higher proportion

	Journal specific, using external data:
	No of databases that index journal -- more databases
 
 
 



-- 
Philip M. Davis
PhD Student
Department of Communication
336 Kennedy Hall
Cornell University, Ithaca, NY 14853
email: pmd8 at cornell.edu
phone: 607 255-4735
https://confluence.cornell.edu/display/~pmd8/resume 

