Predicting later citation counts from very early data

Stevan Harnad harnad at ECS.SOTON.AC.UK
Mon Apr 21 07:47:13 EDT 2008


On 20-Apr-08, at 9:47 AM, Peter Suber wrote:

> Hi Stevan:  Yesterday I tried to send the message below to the OACI  
> list.  But I got an error message suggesting that the list has been  
> discontinued.
>
> Instead of predicting citations from early downloads, as you've  
> done, this team predicts citations from properties of the article.
> Prediction of citation counts for clinical articles at two years  
> using data available within three weeks of publication:  
> retrospective cohort study, BMJ, February 21, 2008. http://dx.doi.org/10.1136/bmj.39482.526713.BE
> Conclusion:  Citation counts can be reliably predicted at two years  
> using data within three weeks of publication.


Hi Peter,

I am forwarding your post instead to the Sigmetrics list: SIGMETRICS at LISTSERV.UTK.EDU

This interesting article finds that a number of metrics available
immediately upon publication predict citations two years later
(using multiple regression analysis).

1274 articles from 105 journals published from January to June 2005,  
randomly divided into a 60:40 split to provide derivation and  
validation datasets. 20 article and journal features, including  
ratings of clinical relevance and newsworthiness, routinely collected  
by the McMaster online rating of evidence system, compared with  
citation counts at two years. The derivation analysis showed that the  
regression equation accounted for 60% of the variation (R2=0.60, 95%  
confidence interval 0.538 to 0.629). This model applied to the  
validation dataset gave a similar prediction (R2=0.56, 0.476 to 0.596,  
shrinkage 0.04; shrinkage measures how well the derived equation  
matches data from the validation dataset). Cited articles in the top  
half and top third were predicted with 83% and 61% sensitivity and 72%  
and 82% specificity. Higher citations were predicted by indexing in  
numerous databases; number of authors; abstraction in synoptic  
journals; clinical relevance scores; number of cited references; and  
original, multicentred, and therapy articles from journals with a  
greater proportion of articles abstracted. Conclusion:  Citation  
counts can be reliably predicted at two years using data within three  
weeks of publication.
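
As a rough illustration of the derivation/validation procedure summarised above, here is a minimal Python sketch. It is not the authors' code or data: the three predictors (number of authors, number of references, relevance score), their coefficients, and the noise level are all made-up assumptions, chosen only so that R2 comes out somewhere near 0.6. It fits a multiple regression on a random 60% of 1274 synthetic articles, applies the same equation to the remaining 40%, takes shrinkage as the drop in R2, and computes sensitivity and specificity for picking out the top half by citations.

```python
import random
import statistics

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(A[i][c]))
        A[c], A[p] = A[p], A[c]
        for i in range(k):
            if i != c:
                f = A[i][c] / A[c][c]
                A[i] = [a - f * b for a, b in zip(A[i], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(beta, X):
    return [beta[0] + sum(b * x for b, x in zip(beta[1:], r)) for r in X]

def r_squared(y, y_hat):
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Synthetic articles: every number below is an assumption for illustration.
random.seed(2008)
articles = []
for _ in range(1274):
    n_authors = random.randint(1, 12)
    n_refs = random.randint(5, 60)
    relevance = random.uniform(1, 7)
    citations = (2.0 * n_authors + 0.5 * n_refs + 3.0 * relevance
                 + random.gauss(0, 10))  # unclipped, so it can go negative
    articles.append(((n_authors, n_refs, relevance), citations))

# Random 60:40 split into derivation and validation sets.
random.shuffle(articles)
cut = int(0.6 * len(articles))
derivation, validation = articles[:cut], articles[cut:]
X_d, y_d = [a[0] for a in derivation], [a[1] for a in derivation]
X_v, y_v = [a[0] for a in validation], [a[1] for a in validation]

# Derive the equation on one set, then apply it unchanged to the other.
beta = fit_ols(X_d, y_d)
r2_derivation = r_squared(y_d, predict(beta, X_d))
pred_v = predict(beta, X_v)
r2_validation = r_squared(y_v, pred_v)
shrinkage = r2_derivation - r2_validation

# Top-half classification in the validation set.
actual_median = statistics.median(y_v)
pred_median = statistics.median(pred_v)
actual_top = [y > actual_median for y in y_v]
pred_top = [p > pred_median for p in pred_v]
tp = sum(a and p for a, p in zip(actual_top, pred_top))
tn = sum((not a) and (not p) for a, p in zip(actual_top, pred_top))
sensitivity = tp / sum(actual_top)
specificity = tn / (len(actual_top) - sum(actual_top))

print(f"R2 derivation={r2_derivation:.2f} validation={r2_validation:.2f} "
      f"shrinkage={shrinkage:.2f}")
print(f"top half: sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

The point of the split is that R2 on the derivation set is optimistic, since the coefficients were tuned to those very articles; the validation R2 is the fairer estimate, and a small shrinkage suggests the equation is not merely fitting noise.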

This finding reinforces the importance of taking into account as many
predictor metrics as possible, though a number of the metrics do seem
specific to clinical medical articles. The (apparently already known)
high correlation between citations and physician ratings of clinical
relevance, for instance, rests on a variable specific to this field.
(The metrics used are listed at the end of this message.)

We might perhaps make a distinction between static and dynamic  
metrics. This study was based largely on static metrics, in that they  
are fixed as of the day of publication. Dynamic metrics like early  
downloads (which have also been found to predict later citations) were  
not included (the Perneger study was cited but the Brody et al study  
was not), nor were early citation growth metrics (also predictive of
later citations).

Perneger TV. Relation between online "hit counts" and subsequent  
citations: prospective study of research papers in the BMJ. BMJ  
2004;329:546-7. doi:10.1136/bmj.329.7465.546

Brody, T., Harnad, S. and Carr, L. (2006) Earlier Web Usage Statistics
as Predictors of Later Citation Impact. Journal of the American
Society for Information Science and Technology (JASIST) 57(8) pp.
1060-1072. http://eprints.ecs.soton.ac.uk/10713/
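
For a dynamic metric, the simplest check is a correlation between an early count and a later one. The sketch below is only an illustration on invented numbers, not a reproduction of either study cited above: the variable names, the download range, and the assumed relation between downloads and citations are all hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented data: downloads in the first weeks, citations two years on.
random.seed(1)
early_downloads = [random.randint(0, 200) for _ in range(500)]
later_citations = [max(0.0, 0.05 * d + random.gauss(0, 4))
                   for d in early_downloads]

r = pearson(early_downloads, later_citations)
print(f"correlation between early downloads and later citations: r={r:.2f}")
```

An early download count of this kind could simply be added as one more column alongside the static predictors in a multiple regression.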

Journal impact factor was not included either, because it was not  
available for a large number of journals in the sample.

To my mind, the article reinforces the importance of validating all  
these metrics, not just against one another, but against peer  
evaluations, in all fields, as in the RAE 2008 database:

Harnad, S. (2007) Open Access Scientometrics and the UK Research  
Assessment Exercise. In Proceedings of 11th Annual Meeting of the  
International Society for Scientometrics and Informetrics 11(1), pp.  
27-33, Madrid, Spain. Torres-Salinas, D. and Moed, H. F., Eds. http://eprints.ecs.soton.ac.uk/13804/

Stevan Harnad

----------------------------------------------------------------------------------------------------------
Predictor variables 	Hypothesised influences:
	
Article specific from external sources:

  No of authors 	More authors
  Residence of first author in North America 	North America
  No of pages 	Longer article
  No of references in bibliography 	More references
  No of participants 	More participants
  Structured abstract 	Structured abstracts
  Length of abstract 	Longer
  Multicentre studies 	If multicentred
  Original article rather than systematic review 	If systematic review
  Dealing with therapy 	If therapy

Article specific from internal sources: 	

  No of disciplines chosen relevant to article (breadth of interest) 	More disciplines
  Average relevance scores over all raters	Higher scores
  Average newsworthiness scores over all raters 	Higher scores
  Average time taken by raters to rate article 	More time
  Whether article was selected for abstraction in 1 of 3 synoptic journals 	If yes
  No of views per email alert sent 	More views per alert

Journal specific using internal data: 	

  Proportion of articles that passed criteria (2005)	Higher proportion
  Proportion abstracted by 3 synoptic journals	Higher proportion

Journal specific using external data: 	

  No of databases that index journal	More databases


