Citation statistics
Stevan Harnad
harnad at ECS.SOTON.AC.UK
Sun Jun 15 12:01:47 EDT 2008
On Thu, 12 Jun 2008, Charles Oppenheim wrote:
>> Re:
>> International Mathematical Union announces Citation Statistics report
>> Numbers with a number of problems
>> Robert Adler, John Ewing (Chair), Peter Taylor
>> http://www.mathunion.org/Publications/Report/CitationStatistics
>
> CHARLES OPPENHEIM:
> I've now read the whole report. Yes, it tilts at the
> usual windmills, and rightly dismisses the use of Impact
> factors for anything but crude comparisons, but it fails
> to address the fundamental issue, which is: citation and
> other metrics correlate superbly with subjective peer
> review. Both methods have their faults, but they are
> clearly measuring the same (or closely related) things.
> Ergo, if you have to evaluate research in some way, there is
> no reason NOT to use them! It also keeps referring to
> examples from the field of maths, which is a very strange
> subject citation-wise.
I have now read the IMU report too, and agree with Charles that it
makes many valid points but it misunderstands the one fundamental
point concerning the question at hand: Can and should metrics be used
in place of peer-panel based rankings in the UK Research Assessment
Exercise (RAE) and its successors and homologues elsewhere? And there
the answer is a definite Yes.
The IMU critique points out that research metrics in particular and
statistics in general are often misused, and this is certainly true.
It also points out that metrics are often used without validation.
This too is correct. There is also a simplistic tendency to try to
use one single metric, rather than multiple metrics that can
complement and correct one another. There too, a practical and
methodological error is correctly pointed out. It is also true that
the “journal impact factor” has many flaws, and should on no
account be used to rank individual papers of researchers, and
especially not alone, as a single metric.
But what all this valuable, valid cautionary discussion overlooks is
not only the possibility but the empirically demonstrated fact that
there exist metrics that are highly correlated with human expert
rankings. It follows that to the degree that such metrics account for
the same variance, they can substitute for the human rankings. The
substitution is desirable, because expert rankings are extremely
costly in terms of expert time and resources. Moreover, a metric that
can be shown to be highly correlated with an already validated
variable (such as expert rankings) thereby itself
becomes a validated predictor variable. And this is why the answer to
the basic question of whether the RAE’s decision to convert to
metrics was a sound one is: Yes.
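To make the cross-validation logic concrete, here is a minimal sketch in
Python of how a single candidate metric would be tested against peer
rankings. The numbers are invented for illustration only (they are not RAE
data), and scipy is assumed to be available:

    # Hypothetical illustration: correlate a departmental citation metric
    # with peer-panel ranks (1 = best). All numbers are invented.
    from scipy.stats import spearmanr

    citation_counts = [4200, 3100, 2500, 1800, 900, 400]  # one value per department
    peer_ranks      = [1,    2,    3,    4,    5,   6]

    rho, p = spearmanr(citation_counts, peer_ranks)
    print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
    # A strong correlation (here negative, because a low rank means a better
    # department) is exactly the kind of pairwise validation described above.

A metric validated this way, discipline by discipline, can then stand in
for the peer rankings to the degree that it predicts them.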
Nevertheless, the IMU’s cautions are welcome: Metrics do need to be
validated; they do need to be multiple, rather than a single,
unidimensional index; they do have to be separately validated for each
discipline, and the weights on the multiple metrics need to be
calibrated and adjusted both for the discipline being assessed and for
the properties on which it is being ranked. The RAE 2008 database
provides the ideal opportunity to do all this discipline-specific
validation and calibration, because it is providing parallel data from
both peer panel rankings and metrics. The metrics, however, should be
as rich and diverse as possible, to capitalize on this unique
opportunity for joint validation.
Here are some comments on particular points in the IMU report. (All
quotes are from the report):
> The meaning of a citation can be even more subjective than peer review.
True. But if there is a non-metric criterion measure – such as peer
review – on which we already rely, then metrics can be cross-
validated against that criterion measure, and this is exactly what the
RAE 2008 database makes it possible to do, for all disciplines, at the
level of an entire sizeable nation’s total research output.
> The sole reliance on citation data provides at best an incomplete and
> often shallow understanding of research—an understanding that is valid
> only when reinforced by other judgments.
This is correct. But the empirical fact has turned out to be that a
department’s total article/author citation counts are highly
correlated with its peer rankings in the RAE in every discipline
tested. This does not mean that citation counts are the only metric
that should be used, or that they account for 100% of the variance in
peer rankings. But it is strong evidence that citation counts should
be among the metrics used, and it constitutes a (pairwise) validation.
> Using the impact factor alone to judge a journal is like using weight
> alone to judge a person's health.
Using only the journal impact factor (the average citation counts of
articles published in that journal) in place of the actual citation
counts for individual articles and authors is of course as absurd as
using only the average marks of a candidate’s secondary school,
instead of the candidate’s own actual marks, to decide on university
admission. However, the journal’s average might still be used as one
of the battery of candidate metrics to be validated and calibrated
jointly, discipline by discipline, as it may give further, valid
independent information about the level of the publication venue
itself, over and above the individual citation counts.
> For papers, instead of relying on the actual count of citations to
> compare individual papers, people frequently substitute the impact
> factor of the journals in which the papers appear.
As noted, this is a foolish error if the journal impact factor is used
alone, but it may enhance predictivity and hence validity if added to
a battery of jointly validated metrics.
> The validity of statistics such as the impact factor and h‐index is
> neither well understood nor well studied.
The h-index (and its variants) were created ad hoc, without
validation. They turn out to be highly correlated with citation counts
(for obvious reasons, since they are in part based on them). Again,
they are all welcome in a battery of metrics to be jointly cross-
validated against peer rankings or other already-validated or face-
valid metrics.
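For concreteness, the h-index is simple to compute from an author’s raw
citation counts; here is a minimal sketch in Python, with invented counts:

    def h_index(citations):
        """Largest h such that h papers each have at least h citations."""
        ranked = sorted(citations, reverse=True)
        h = 0
        for i, c in enumerate(ranked, start=1):
            if c >= i:
                h = i
            else:
                break
        return h

    print(h_index([25, 8, 5, 4, 3, 1, 0]))  # -> 4: four papers with >= 4 citations each

Precisely because it is derived from those same citation counts, its
correlation with them demonstrates nothing about validity against an
external criterion such as peer rankings.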
> citation data provide only a limited and incomplete view of research
> quality, and the statistics derived from citation data are sometimes
> poorly understood and misused.
It is certainly true that there are many more potential metrics of
research performance, productivity, impact and quality than just
citation metrics (e.g., download counts, student counts, research
funding, etc.). They should all be jointly validated, discipline by
discipline, and each metric should be weighted according to what
percentage of the criterion variance (e.g., RAE 2008 peer rankings) it
predicts.
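As a sketch of what such joint validation and weighting could look like,
here is a minimal Python example using ordinary least squares on
standardized metrics. The departmental figures are invented and the model
is deliberately simplistic; a real exercise would use the full RAE dataset
and proper model selection:

    import numpy as np

    # Rows = departments; columns = candidate metrics (all numbers invented):
    # [citations, downloads, research income]
    X = np.array([[4200, 90000, 5200],
                  [3100, 61000, 4100],
                  [2500, 48000, 2600],
                  [1800, 30000, 2900],
                  [ 900, 15000, 1100],
                  [ 400,  8000,  700]], dtype=float)
    y = np.array([6.1, 5.4, 4.8, 4.2, 2.9, 2.1])  # peer-panel quality scores

    # Standardize so the regression weights are comparable across metrics.
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    Xz = np.column_stack([np.ones(len(y)), Xz])   # add an intercept column

    beta, *_ = np.linalg.lstsq(Xz, y, rcond=None)
    r2 = 1 - np.sum((y - Xz @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
    print("standardized weights:", beta[1:].round(2), "R^2 =", round(r2, 2))

The fitted weights, recomputed separately for each discipline, are what
would then be carried forward into metric-only assessments, with the R^2
indicating how much of the peer-ranking variance the battery captures.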
> relying primarily on metrics (statistics) derived from citation data
> rather than a variety of methods, including judgments by scientists
> themselves…
The whole point is to cross-validate the metrics against the peer
judgments, and then use the weighted metrics in place of the peer
judgments, in accordance with their validated predictive power.
> bibliometrics (using counts of journal articles and their citations)
> will be a central quality index in this system [RAE]
Yes, but it is not yet clear which metrics the RAE’s successor will
use, or whether and how it will validate them. There is still
some risk that a small number of metrics will simply be picked a
priori, without systematic validation. It is to be hoped that the IMU
critique, along with other critiques and recommendations, will result
in the use of the 2008 parallel metric/peer data for a systematic and
exhaustive cross-validation exercise, separately for each discipline.
Future assessments can then use the metric battery, with initialized
weights (specific to each discipline), and can calibrate and optimize
them across the years, as more data accumulates – including spot-
checks cross-validating periodically against “light-touch” peer
rankings and other validated or face-valid measures.
> sole reliance on citation‐based metrics replaces one kind of judgment
> with another. Instead of subjective peer review one has the subjective
> interpretation of a citation's meaning.
Correct. This is why multiple metrics are needed, and why they need to
be systematically cross-validated against already-validated or face-
valid criteria (such as peer judgment).
> Research usually has multiple goals, both short‐term and long, and it
> is therefore reasonable that its value must be judged by multiple
> criteria.
Yes, and this means multiple, validated metrics. (Time-course
parameters, such as growth and decay rates of download, citation and
other metrics are themselves metrics.)
> many things, both real and abstract, that cannot be simply ordered,
> in the sense that each two can be compared
Yes, we should not compare the incomparable and incommensurable. But
whatever we are already comparing, by other means, can be used to
cross-validate metrics. (And of course it should be done discipline by
discipline, and sometimes even by sub-discipline, rather than by
treating all research as if it were of the same kind, with the same
metrics and weights.)
> plea to use multiple methods to assess the quality of research
Valid plea, but the multiple “methods” should mean multiple metrics, to
be tested for reliability and validity against already validated
methods.
> Measures of esteem such as invitations, membership on editorial
> boards, and awards often measure quality. In some disciplines and in
> some countries, grant funding can play a role. And peer review—the
> judgment of fellow scientists—is an important component of assessment.
These are all sensible candidate metrics to be included, alongside
citation and other candidate metrics, in the multiple regression
equation to be cross-validated jointly against already validated
criteria, such as peer rankings (especially in RAE 2008).
> lure of a simple process and simple numbers (preferably a single
> number) seems to overcome common sense and good judgment.
Validation should definitely be done with multiple metrics, jointly,
using multiple regression analysis, not with a single metric, and not
one at a time.
> special citation culture of mathematics, with low citation counts for
> journals, papers, and authors, makes it especially vulnerable to the
> abuse of citation statistics.
Metric validation and weighting should be done separately, field by
field.
> For some fields, such as bio‐medical sciences, this is appropriate
> because most published articles receive most of their citations soon
> after publication. In other fields, such as mathematics, most
> citations occur beyond the two‐year period.
Chronometrics – growth and decay rates and other time-based
parameters for download, citations and other time-based, cumulative
measures – should be among the battery of candidate metrics for
validation.
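As an illustrative sketch of one such chronometric, here is how a citation
decay rate might be estimated in Python from a single article's yearly
citation counts (the counts are invented, and a simple log-linear fit is
assumed):

    import numpy as np

    # Hypothetical citations per year since publication for one article.
    years = np.arange(1, 9)
    cites = np.array([2, 9, 14, 11, 8, 6, 4, 3], dtype=float)

    # Fit log(citations) ~ a + k*t on the post-peak tail (k < 0 means decay).
    tail = slice(int(cites.argmax()), None)
    k, a = np.polyfit(years[tail], np.log(cites[tail]), 1)
    print(f"decay rate ~ {-k:.2f} per year; half-life ~ {np.log(2) / -k:.1f} years")

Parameters like this decay rate (or the corresponding growth rate for
downloads) can then join the battery of candidate metrics to be validated.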
> The impact factor varies considerably among disciplines... The impact
> factor can vary considerably from year to year, and the variation
> tends to be larger for smaller journals.
All true. Hence the journal impact factor – perhaps with various time
constants – should be part of the battery of candidate metrics, not
simply used a priori.
> The most important criticism of the impact factor is that its meaning
> is not well understood. When using the impact factor to compare two
> journals, there is no a priori model that defines what it means to be
> "better". The only model derives from the impact factor itself — a
> larger impact factor means a better journal... How does the impact
> factor measure quality? Is it the best statistic to measure quality?
> What precisely does it measure? Remarkably little is known...
And this is because the journal impact factor (like most other
metrics) has not been cross-validated against face-valid criteria,
such as peer rankings.
> employing other criteria to refine the ranking and verify that the
> groups make sense
In other words, systematic cross-validation is needed.
> impact factor cannot be used to compare journals across disciplines
All metrics should be independently validated for each discipline.
> impact factor may not accurately reflect the full range of citation
> activity in some disciplines, both because not all journals are
> indexed and because the time period is too short. Other statistics
> based on longer periods of time and more journals may be better
> indicators of quality. Finally, citations are only one way to judge
> journals, and should be supplemented with other information
Chronometrics. And multiple metrics.
> The impact factor and similar citation‐based statistics can be
> misused when ranking journals, but there is a more fundamental and
> more insidious misuse: Using the impact factor to compare individual
> papers, people, programs, or even disciplines
Individual citation counts and other metrics: Multiple metrics,
jointly validated.
> the distribution of citation counts for individual papers in a
> journal is highly skewed, approximating a so‐called power law...
> highly skewed distribution and the narrow window of time used to
> compute the impact factor
To the extent that distributions are pertinent, they too can be
parametrized and taken into account in validating metrics. Comparing
like with like (e.g., discipline by discipline) should also help
maximize comparability.
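For instance, the skew itself can be reduced to a parameter and entered
into the battery; here is a minimal sketch (with invented counts) using the
standard continuous maximum-likelihood estimate of a power-law exponent:

    import numpy as np

    # Hypothetical per-article citation counts for one journal (highly skewed).
    counts = np.array([120, 45, 30, 12, 9, 7, 5, 4, 3, 3, 2, 2, 1, 1, 1, 1],
                      dtype=float)

    x_min = 1.0  # smallest count included in the fit
    x = counts[counts >= x_min]
    alpha = 1.0 + len(x) / np.sum(np.log(x / x_min))
    print(f"estimated power-law exponent: alpha ~ {alpha:.2f}")

A distributional parameter of this kind says something about a field's
citation culture, which is one more reason to validate and weight metrics
discipline by discipline.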
> using the impact factor as a proxy for actual citation counts for
> individual papers
No need to use one metric as a proxy for another. Jointly validate
them all.
> if you want to rank a person's papers using only citations to
> measure the quality of a particular paper, you must begin by counting
> that paper's citations. The impact factor of the journal in which the
> paper appears is not a reliable substitute.
Correct, but this obvious truth does not need to be repeated so many
times, and it is an argument against single metrics in general, and
against the journal impact factor as a single metric in particular. But there’s
nothing wrong with using it in a battery of metrics for validation.
> h‐index Hirsch extols the virtues of the h‐index by claiming that "h
> is preferable to other single‐number criteria commonly used to
> evaluate scientific output of a researcher…"[Hirsch 2005, p. 1], but
> he neither defines "preferable" nor explains why one wants to find
> "single‐number criteria."... Much of the analysis consists of showing
> "convergent validity," that is, the h‐index correlates well with
> other publication/citation metrics, such as the number of published
> papers or the total number of citations. This correlation is
> unremarkable, since all these variables are functions of the same
> basic phenomenon
The h-index is again a single metric. And cross-validation only works
against either an already validated or a face-valid criterion, not
just another unvalidated metric. And the only way multiple metrics,
all inter-correlated, can be partitioned and weighted is with multiple
regression analysis – and once again against a criterion, such as
peer rankings.
> Some might argue that the meaning of citations is immaterial because
> citation‐based statistics are highly correlated with some other
> measure of research quality (such as peer review).
Not only might some say it: Many have said it, and they are quite
right. That means citation counts have been validated against peer
review, pairwise. Now it is time to cross-validate an entire spectrum
of candidate metrics, so each can be weighted for its predictive
contribution.
> The conclusion seems to be that citation‐based statistics, regardless
> of their precise meaning, should replace other methods of assessment,
> because they often agree with them. Aside from the circularity of
> this argument, the fallacy of such reasoning is easy to see.
The argument is circular only if unvalidated metrics are being cross-
correlated with other unvalidated metrics. Then it’s a skyhook. But
when they are cross-validated against a criterion like peer rankings,
which have been the predominant basis for the RAE for 20 years, they
are being cross-validated against a face-valid criterion – for which
they can indeed be subsequently substituted, if the correlation turns
out to be high enough.
> "Damned lies and statistics"
Yes, one can lie with unvalidated metrics and statistics. But we are
talking here about validating metrics against validated or face-valid
criteria. In that case, the metrics lie no more (or less) than the
criteria did, before the substitution.
> Several groups have pushed the idea of using Google Scholar to
> implement citation‐based statistics, such as the h‐index, but the
> data contained in Google Scholar is often inaccurate (since things
> like author names are automatically extracted from web postings)...
This is correct. But Google Scholar’s accuracy is growing daily, with
growing content, and there are ways to triangulate author identity
from such data even before the (inevitable) unique author identifier
is adopted.
> Citation statistics for individual scientists are sometimes difficult
> to obtain because authors are not uniquely identified...
True, but a good approximation is -- or will soon be -- possible (not
for arbitrary search on the works of “Lee,” but, for example, for
all the works of all the authors in the UK university LDAPs).
> Citation counts seem to be correlated with quality, and there is an
> intuitive understanding that high‐quality articles are highly‐cited.
The intuition is replaced by objective data once the correlation with
peer rankings of quality has been demonstrated, and it is replaced in
proportion to the proportion of the criterion variance accounted for by
the predictor metric.
> But as explained above, some articles, especially in some
> disciplines, are highly‐cited for reasons other than high quality,
> and it does not follow that highly‐cited articles are necessarily
> high quality.
This is why validation/weighting of metrics must be done separately,
discipline by discipline, and why citation metrics alone are not
enough: multiple metrics are needed to take into account multiple
influences on quality and impact, and to weight them accordingly.
> The precise interpretation of rankings based on citation statistics
> needs to be better understood.
Once a sufficiently broad and predictive battery of metrics is
validated and its weights initialized (e.g., in RAE 2008), further
interpretation and fine-tuning can follow.
> In addition, if citation statistics play a central role in research
> assessment, it is clear that authors, editors, and even publishers
> will find ways to manipulate the system to their advantage.
True, but inasmuch as the new metric batteries will be Open Access,
there will also be multiple metrics for detecting metric anomalies,
inconsistency and manipulation, and for naming and shaming the
manipulators, which will serve to control the temptation.
Harnad, S. (2001) Research access, impact and assessment. Times Higher
Education Supplement 1487: p. 16. http://cogprints.org/1683/
Harnad, S., Carr, L., Brody, T. & Oppenheim, C. (2003) Mandated online
RAE CVs Linked to University Eprint Archives:
Improving the UK Research Assessment Exercise whilst making it cheaper
and easier. Ariadne 35. http://www.ariadne.ac.uk/issue35/harnad/
Brody, T., Kampa, S., Harnad, S., Carr, L. and Hitchcock, S. (2003)
Digitometric Services for Open Archives Environments. In Proceedings
of European Conference on Digital Libraries 2003, pp. 207-220,
Trondheim, Norway. http://eprints.ecs.soton.ac.uk/7503/
Harnad, S. (2007) Open Access Scientometrics and the UK Research
Assessment Exercise. In Proceedings of 11th Annual Meeting of the
International Society for Scientometrics and Informetrics 11(1), pp.
27-33, Madrid, Spain. Torres-Salinas, D. and Moed, H. F., Eds. http://eprints.ecs.soton.ac.uk/13804/
Brody, T., Carr, L., Harnad, S. and Swan, A. (2007) Time to Convert to
Metrics. Research Fortnight pp. 17-18. http://eprints.ecs.soton.ac.uk/14329/
Brody, T., Carr, L., Gingras, Y., Hajjem, C., Harnad, S. and Swan, A.
(2007) Incentivizing the Open Access Research Web: Publication-
Archiving, Data-Archiving and Scientometrics. CTWatch Quarterly 3(3). http://eprints.ecs.soton.ac.uk/14418/
Harnad, S. (2008) Self-Archiving, Metrics and Mandates. Science Editor
31(2): 57-59.
Harnad, S. (2008) Validating Research Performance Metrics Against Peer
Rankings. Ethics in Science and Environmental Politics 8. doi:
10.3354/esep00088 http://eprints.ecs.soton.ac.uk/15619/
> On Wed, 11 Jun 2008 18:05:36 +0200
> "Armbruster, Chris" <Chris.Armbruster at EUI.EU> wrote:
>> It is true that Thomson is misspelled as Thompson, but
>> it is so consistently. It is also the case that the Leiden
>> stalwarts A.J.F. van Raan (wide body of work on
>> performance measurement, university ranking etc.) and
>> H.F. Moed (Book: Citation analysis in research
>> evaluation) are not cited.
>>
>> Nevertheless, after reading the report, I would caution
>> against dismissing it. Science and scientists should be
>> concerned about the politicisation of metrics.
>> Politicisation comes from governments and research
>> funders but is also going on inside academic
>> institutions. Moreover, in a general sense the citation
>> and usage metrics currently available are not 'fit for
>> purpose'. Worse still, politicisation carries with it the
>> significant risk of arresting the development of tools
>> for metric research evaluation. Evaluation is often
>> narrowly defined as assessment and performance of
>> institutions and individuals for the purpose of awarding
>> or denying funding and employment. This is something
>> entirely different from metric evaluation as research
>> information service to aid scientists in reducing the
>> complexity of scientific information in their daily
>> research.
>>
>> All we have at the moment are some 'quick fix metrics'.
>> And these are increasingly used to make and legitimate
>> all kinds of decisions. It is thus welcome that
>> mathematicians and statisticians scrutinise current
>> practices and show up the lack of validity and
>> reliability of many measures, technical faults as well as
>> the misguided judgements of peers, university management,
>> funding agencies and government.
>>
>> My own contribution (working paper) may be found with
>> SSRN:
>> Armbruster, Chris, "Access, Usage and Citation Metrics:
>> What Function for Digital Libraries and Repositories in
>> Research Evaluation?" (January 29, 2008).
>> Available at SSRN: http://ssrn.com/abstract=1088453
>>
>> If the link is broken, please use a search engine *SSRN
>> plus title*
>>
>> Chris Armbruster
>>
>> -----Original Message-----
>> From: American Scientist Open Access Forum on behalf of
>> C.Oppenheim at lboro.ac.uk
>> Sent: Wed 11/06/2008 14:56
>> To:
>> AMERICAN-SCIENTIST-OPEN-ACCESS-FORUM at LISTSERVER.SIGMAXI.ORG
>> Subject: Re: Citation statistics
>>
>> I haven't had a chance to read the report yet, but I'd
>> be suspicious of any report that fails to spell "Thomson"
>> correctly and fails to cite Ton van Raan, THE expert on
>> the subject.
>>
>> Charles
>>
>> Professor Charles Oppenheim
>> Head
>> Department of Information Science
>> Loughborough University
>> Loughborough
>> Leics LE11 3TU
>>
>> Tel 01509-223065
>> Fax 01509 223053
>> e mail c.oppenheim at lboro.ac.uk
>> -----Original Message-----
>> From: American Scientist Open Access Forum
>> [mailto:AMERICAN-SCIENTIST-OPEN-ACCESS-FORUM at LISTSERVER.SIGMAXI.ORG]
>> On Behalf Of Jean Kempf
>> Sent: 11 June 2008 12:01
>> To:
>> AMERICAN-SCIENTIST-OPEN-ACCESS-FORUM at LISTSERVER.SIGMAXI.ORG
>> Subject: Citation statistics
>>
>> Here's a report on citation statistics written by a
>> statistician
>>
>> http://www.mathunion.org/Publications/Report/CitationStatistics
>>
>> A press release that was mailed out today to journalists
>> is at:
>>
>> http://www.mathunion.org/Publications/PressRelease/2008-06-11/CitationStatistics