Predicting later citation counts from very early data

Glanzel, Wolfgang Wolfgang.Glanzel at ECON.KULEUVEN.AC.BE
Sun Apr 27 14:10:47 EDT 2008


Dear Colleague,

A model based on an inhomogeneous birth-process is described in

W. GLÄNZEL, A. SCHUBERT, Predictive Aspects of a Stochastic Model for Citation Processes, Information Processing & Management, 31 (1), 1995, 69-80.
and

W. GLÄNZEL, On the Reliability of Predictions Based on Stochastic Citation Processes, Scientometrics, 40 (3), 1997, 481‑492.

Examples are given as well.
Best regards,

Wolfgang Glänzel.
________________________________
From: ASIS&T Special Interest Group on Metrics [SIGMETRICS at LISTSERV.UTK.EDU] on behalf of John McDonald [John.McDonald at LIBRARIES.CLAREMONT.EDU]
Sent: Sunday, April 27, 2008 6:33
To: SIGMETRICS at LISTSERV.UTK.EDU
Subject: Re: [SIGMETRICS] Predicting later citation counts from very early data

If you are interested in the use of the Negative Binomial Regression model in citation analysis, see my paper:


McDonald, JD (2007) Understanding journal usage: A statistical analysis of citation and use.  JASIST, 58:1, p.39-50.
http://resolver.caltech.edu/CaltechLIB:2006.001
And the first (and possibly only other) example of its use in bibliometrics:

Van Dalen, HP and Henkens K. (2001). What makes a scientific article influential? The Case of Demographers. Scientometrics, 50:3 (455-482).
http://www.akademiai.com/content/r1601w1nx453281n/fulltext.pdf



John McDonald
Assistant Director
User Services & Technology Innovation
Libraries of the Claremont Colleges
800 N. Dartmouth Avenue
Claremont, CA 91711
909-621-8014



From: ASIS&T Special Interest Group on Metrics [mailto:SIGMETRICS at listserv.utk.edu] On Behalf Of Phil Davis
Sent: Monday, April 21, 2008 6:24 AM
To: SIGMETRICS at listserv.utk.edu
Subject: Re: [SIGMETRICS] Predicting later citation counts from very early data


Some of the data transformations in this paper are unorthodox (like taking the square root instead of the natural log) and have no theoretical basis.  Since they have a lot of zeros in their dataset, a Negative Binomial Regression model may have been a better choice than a linear model.  I'm surprised the reviewers of this article (or the staff statistician at BMJ) didn't see Figure 2 as a red flag that their model was problematic.  Notice that they defined any article that received more than 150 citations as an 'outlier' and threw it out for the simple reason that it reduced the power of their prediction -- this reinforces the notion that their model is problematic!
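To illustrate why a count model with an extra dispersion parameter handles zero-heavy citation data better, here is a minimal sketch. It is not a regression and not the BMJ authors' method: it only fits a Poisson and a negative binomial distribution by the method of moments to a hypothetical list of citation counts (invented for illustration) and compares the share of zeros each one implies with the share observed.

```python
import math
from statistics import mean, pvariance

# Hypothetical citation counts (illustrative only, NOT data from the BMJ
# study): many uncited articles and a few heavily cited ones.
citations = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2, 3, 4, 5, 8, 12, 20, 35, 60]

mu = mean(citations)
var = pvariance(citations)

# A Poisson model forces variance == mean. The negative binomial allows
# variance = mu + mu**2 / r, where r is a dispersion parameter.
# Method-of-moments estimate of r (only valid when var > mu):
r = mu ** 2 / (var - mu)

# Share of zero-citation articles: observed vs. implied by each model.
observed_zero_share = citations.count(0) / len(citations)
poisson_zero_share = math.exp(-mu)
negbin_zero_share = (r / (r + mu)) ** r

print(f"mean={mu:.2f}  variance={var:.2f}  dispersion r={r:.3f}")
print(f"share of zeros  observed={observed_zero_share:.3f}  "
      f"Poisson={poisson_zero_share:.4f}  NegBin={negbin_zero_share:.3f}")
```

When the variance is far above the mean, as is typical for citation counts, the Poisson fit implies almost no uncited articles, while the negative binomial fit can sit close to the observed share. A full negative binomial regression would add predictors on top of this distributional assumption.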

Still, the key message in this article -- that characteristics of an article (like article length, number of authors, etc.) can predict citations -- is not new (see Stewart, 1983).  Judging from a quick glance at their bibliography, the authors don't seem to be aware of the field of bibliometrics.


Stewart, J. A. (1983). Achievement and Ascriptive Processes in the Recognition of Scientific Articles. Social Forces, 62(1), 166-189.

--Phil Davis


Stevan Harnad wrote:


Hi Stevan:  Yesterday I tried to send the message below to the OACI list.  But I got an error message suggesting that the list has been discontinued.

Instead of predicting citations from early downloads, as you've done, this team predicts citations from properties of the article.
Prediction of citation counts for clinical articles at two years using data available within three weeks of publication: retrospective cohort study, BMJ, February 21, 2008. http://dx.doi.org/10.1136/bmj.39482.526713.BE
Conclusion:  Citation counts can be reliably predicted at two years using data within three weeks of publication.

Hi Peter,

I am forwarding your post instead to the Sigmetrics list: SIGMETRICS at LISTSERV.UTK.EDU

This interesting article finds that a number of metrics available immediately upon publication predict citations two years later (using multiple regression analysis).

1274 articles from 105 journals published from January to June 2005, randomly divided into a 60:40 split to provide derivation and validation datasets. 20 article and journal features, including ratings of clinical relevance and newsworthiness, routinely collected by the McMaster online rating of evidence system, compared with citation counts at two years. The derivation analysis showed that the regression equation accounted for 60% of the variation (R2=0.60, 95% confidence interval 0.538 to 0.629). This model applied to the validation dataset gave a similar prediction (R2=0.56, 0.476 to 0.596, shrinkage 0.04; shrinkage measures how well the derived equation matches data from the validation dataset). Cited articles in the top half and top third were predicted with 83% and 61% sensitivity and 72% and 82% specificity. Higher citations were predicted by indexing in numerous databases; number of authors; abstraction in synoptic journals; clinical relevance scores; number of cited references; and original, multicentred, and therapy articles from journals with a greater proportion of articles abstracted. Conclusion:  Citation counts can be reliably predicted at two years using data within three weeks of publication.
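The derivation/validation procedure in the abstract above can be sketched roughly as follows. This is only an illustration under loud assumptions: the data are synthetic, there is a single made-up predictor instead of the study's 20 features, and the model is a plain one-variable least-squares line. Shrinkage is taken here as the drop in R2 from derivation to validation, which matches the figures quoted (0.60 - 0.56 = 0.04).

```python
import random

random.seed(0)  # fixed seed so the illustrative split is reproducible

# Synthetic data (an assumption, not the study's data): one predictor,
# e.g. a relevance score, and a noisy citation count that rises with it.
n = 200
scores = [random.uniform(1, 7) for _ in range(n)]
cites = [3.0 * s + random.gauss(0, 4) for s in scores]

# Random 60:40 split into derivation and validation sets.
idx = list(range(n))
random.shuffle(idx)
cut = int(0.6 * n)
derive, validate = idx[:cut], idx[cut:]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(xs, ys, intercept, slope):
    """Share of the variation in ys explained by the given line."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def sens_spec_top_half(xs, ys, intercept, slope):
    """Sensitivity and specificity for flagging top-half cited articles."""
    preds = [intercept + slope * x for x in xs]
    y_cut = sorted(ys)[len(ys) // 2]
    p_cut = sorted(preds)[len(preds) // 2]
    tp = sum(1 for p, y in zip(preds, ys) if y >= y_cut and p >= p_cut)
    fn = sum(1 for p, y in zip(preds, ys) if y >= y_cut and p < p_cut)
    tn = sum(1 for p, y in zip(preds, ys) if y < y_cut and p < p_cut)
    fp = sum(1 for p, y in zip(preds, ys) if y < y_cut and p >= p_cut)
    return tp / (tp + fn), tn / (tn + fp)

dx, dy = [scores[i] for i in derive], [cites[i] for i in derive]
vx, vy = [scores[i] for i in validate], [cites[i] for i in validate]

# Fit on the derivation set only, then apply the same equation to both.
intercept, slope = fit_line(dx, dy)
r2_derive = r_squared(dx, dy, intercept, slope)
r2_validate = r_squared(vx, vy, intercept, slope)
shrinkage = r2_derive - r2_validate
sens, spec = sens_spec_top_half(vx, vy, intercept, slope)

print(f"R2 derivation={r2_derive:.3f}  validation={r2_validate:.3f}  "
      f"shrinkage={shrinkage:.3f}")
print(f"top-half sensitivity={sens:.2f}  specificity={spec:.2f}")
```

The key design point is that the equation is estimated once, on the derivation set, and never refitted on the validation set; otherwise the validation R2 would not measure how well the derived equation generalises.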

This finding reinforces the importance of taking into account as many predictor metrics as possible, though a number of the metrics do seem specific to clinical medical articles. The (apparently already known) high correlation with physician ratings for clinical relevance is a variable specific to this field. (The metrics used are listed at the end of this message.)

We might perhaps make a distinction between static and dynamic metrics. This study was based largely on static metrics, in that they are fixed as of the day of publication. Dynamic metrics like early downloads (which have also been found to predict later citations) were not included (the Perneger study was cited but the Brody et al study was not), nor were early citation growth metrics (also predictive of later citations).

Perneger TV. Relation between online "hit counts" and subsequent citations: prospective study of research papers in the BMJ. BMJ 2004;329:546-7. doi:10.1136/bmj.329.7465.546

Brody, T., Harnad, S. and Carr, L. (2006) Earlier Web Usage Statistics as Predictors of Later Citation Impact. Journal of the American Association for Information Science and Technology (JASIST) 57(8) pp. 1060-1072. http://eprints.ecs.soton.ac.uk/10713/

Journal impact factor was not included either, because it was not available for a large number of journals in the sample.

To my mind, the article reinforces the importance of validating all these metrics, not just against one another, but against peer evaluations, in all fields, as in the RAE 2008 database:

Harnad, S. (2007) Open Access Scientometrics and the UK Research Assessment Exercise. In Proceedings of 11th Annual Meeting of the International Society for Scientometrics and Informetrics 11(1), pp. 27-33, Madrid, Spain. Torres-Salinas, D. and Moed, H. F., Eds. http://eprints.ecs.soton.ac.uk/13804/

Stevan Harnad

----------------------------------------------------------------------------------------------------------
Predictor variables                                                        Hypothesised influences:

Article specific from external sources:

 No of authors                                                             More authors
 Residence of first author in North America                                North America
 No of pages                                                               Longer article
 No of references in bibliography                                          More references
 No of participants                                                        More participants
 Structured abstract                                                       Structured abstracts
 Length of abstract                                                        Longer
 Multicentre studies                                                       If multicentred
 Original article rather than systematic review                            If systematic review
 Dealing with therapy                                                      If therapy

Article specific from internal sources:

 No of disciplines chosen relevant to article (breadth of interest)        More disciplines
 Average relevance scores over all raters                                  Higher scores
 Average newsworthiness scores over all raters                             Higher scores
 Average time taken by raters to rate article                              More time
 Whether article was selected for abstraction in 1 of 3 synoptic journals  If yes
 No of views per email alert sent                                          More views per alert

Journal specific using internal data:

 Proportion of articles that passed criteria (2005)                        Higher proportion
 Proportion abstracted by 3 synoptic journals                              Higher proportion

Journal specific using external data:

 No of databases that index journal                                        More databases






--

Philip M. Davis

PhD Student

Department of Communication

336 Kennedy Hall

Cornell University, Ithaca, NY 14853

email: pmd8 at cornell.edu

phone: 607 255-4735

https://confluence.cornell.edu/display/~pmd8/resume


