Future UK RAEs to be Metrics-Based

Tue Mar 28 06:19:27 EST 2006

This is an anonymised exchange from a non-public list concerning
scientometrics and the future of the UK Research Assessment Exercise.
I think it has important general scientometric implications.
By way of context: The RAE was an expensive, time-consuming
submission/peer-re-evaluation exercise, performed every 4 years. It turned
out a few simple metrics were highly correlated with its outcome. So it
was proposed to scrap the expensive method in favour of just using
the metrics. -- SH

---------- Forwarded message ----------

On Tue, 28 Mar 2006, [identity deleted] wrote:

> At 8:34 am -0500 27/3/06, Stevan Harnad wrote:
> >SH: Scrap the RAE make-work, by all means, but don't just rely on one
> >metric! The whole point of metrics is to have many independent
> >predictors, so as to account for as much as possible of the
> >criterion variance:
>
> This seems extremely naive to me. All the proposed metrics I have
> seen are *far* from independent - indeed they seem likely to be
> strongly positively associated.

That's fine. In multiple regression it is not necessary that each
predictor variable be orthogonal; they need only predict a significant
portion of the residual variance in the target (or "criterion") after
the correlated portion has been partialled out. If you are trying to
predict university performance and you have maths marks, english marks
and letters of recommendation (quantified), it is not necessary, indeed
not even desirable, that the correlation among the three predictors
should be zero. That they are correlated shows that they are partially
measuring the same thing. What is needed is that the three jointly, in a
multilinear equation, should predict university performance better than
any one of them alone. Their respective contributions to the variance
can then be given a weight.
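
To make the point concrete, here is a minimal sketch in Python (using
NumPy) with purely synthetic data. The variable names, noise levels and
sample size are illustrative assumptions, not real marks or RAE figures.
Three positively correlated predictors are fitted jointly by ordinary
least squares, and the joint R^2 is compared with the R^2 of each
predictor alone.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of a least-squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(0)
n = 500

# Hypothetical shared component that all three predictors partly measure.
ability = rng.normal(size=n)
maths = ability + rng.normal(size=n)
english = ability + rng.normal(size=n)
letters = ability + rng.normal(size=n)

# Synthetic criterion: university performance.
performance = ability + 0.5 * rng.normal(size=n)

predictors = np.column_stack([maths, english, letters])

# Pairwise correlations among the predictors (far from zero by construction).
predictor_corr = np.corrcoef(predictors, rowvar=False)

# Variance accounted for by each predictor alone, and by all three jointly.
r2_single = [r_squared(predictors[:, [j]], performance) for j in range(3)]
r2_joint = r_squared(predictors, performance)

print("predictor correlations:\n", predictor_corr.round(2))
print("R^2 of each predictor alone:", np.round(r2_single, 3))
print("R^2 of all three jointly:  ", round(r2_joint, 3))
```

In this toy setup the predictors are strongly intercorrelated, yet the
joint fit accounts for more of the criterion variance than any single
predictor does.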

The analogy is vectors, a linear combination of several of which may
yield another, target vector. It need not be a linear combination of
orthogonal vectors, just linearly independent ones.
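
The vector analogy can be checked in a few lines of Python (NumPy). The
three vectors and the target below are arbitrary illustrative values:
the vectors are not orthogonal, but they are linearly independent, so a
unique set of weights reproduces the target.

```python
import numpy as np

# Columns are three vectors that are NOT mutually orthogonal
# but ARE linearly independent (illustrative values only).
V = np.array([[1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
target = np.array([2.0, 3.0, 5.0])

# Nonzero off-diagonal entries of the Gram matrix show non-orthogonality.
gram = V.T @ V

# Full rank means the columns are linearly independent.
rank = np.linalg.matrix_rank(V)

# Weights of the linear combination that yields the target vector.
weights = np.linalg.solve(V, target)

print("Gram matrix:\n", gram)
print("rank:", rank)
print("weights:", weights)
print("reconstructed target:", V @ weights)
```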

Three other points:

(1) RCUK ranking itself is just a predictor, not the criterion that is
being predicted and against which the predictor(s) need to be validated.
The criterion is research performance/quality. Only metrics with face
validity can be taken to be identical with the criterion, as opposed to
mere predictors of it, and the RAE outcome is certainly not face-valid.

(2) Given (1), it follows that the *extremely* high correlation between
prior funding and RAE rank (0.98 was mentioned) is *not* a desirable
thing. The predictive power of the RAE ranking needs to be increased, by
adding more (semi-independent but not necessarily orthogonal) predictor
metrics to a regression equation (such as funding, citations, downloads,
co-citations, completed PhDs, and many other potential metrics that will
emerge from an Open Access database and digital performance record-keeping
CVs, customised for each discipline) rather than being replaced by a
single one-dimensional predictor metric (prior funding) that happens to
co-vary almost identically with the prior RAE outcome in many disciplines.

(3) Validating predictor metrics against the target criterion is
notoriously difficult when the criterion itself has no direct
face-valid measure. (An example is the problem of validating IQ tests.)
The solution is partly internal validation (validating multiple
predictor metrics against one another) and partly calibration, which
is the adjustment of the weight and number of the predictor metrics
according to corrective feedback from their outcome: In the case of the
RAE multiple regression equation, this could be done partly on the basis
of the 4-year predictive power of metrics against their own later values,
and partly against subjective peer rankings of departmental performance
and quality as well as peer satisfaction ratings for the RAE outcomes
themselves. (There may well be other validating methods.)

> This sounds perilously close to what I used to read in the software
> metrics literature, where attempts were made to capture 'complexity'
> in order to predict the success or failure of software projects.
> People there adopted a
> measure-everything-you-can-think-of-and-hope-something-useful-pops-up
> approach. The problem was that all the different metrics turned out
> to be variants of 'size', and even together they did not enable good
> prediction.

It is conceivable that all research performance predictor
metrics turn out to be measuring the same thing, and that none of them
contributes a separate independent component to the variance of the
outcome; but I rather doubt it. At the risk of arousing other prejudices,
I would make an analogy with psychometrics: Tests of cognitive
performance capacity (formerly called "IQ" tests) (maths, spatial,
verbal, motor, musical, reasoning, etc.) are constructed and validated
by devising test items and testing them first for reliability (i.e.,
how well they correlate with themselves on repeated administration)
and then cross-correlation and external validation. The (empirical)
result has been the emergence of one general or "G" factor for which
the weight or "load" of some tests is greater than others, so that no
single test measures it exactly, and hence a multiple regression battery,
with each test weighted according to the amount of variance it accounts
for, is preferable to relying on just a single test. And the outcome
is that there turns out to be the one large underlying G factor, with
a component in every one of the tests, plus a constellation of special
factors, associated with special abilities supplementing the G factor,
each adding a smaller but significant component to the variance too,
but varying by individual and field in their predictive power.
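
As a rough illustration of what "one large general factor plus special
factors" looks like numerically, here is a sketch in Python (NumPy) on
simulated scores. The loadings, sample size and number of tests are
assumptions chosen for the example, and the first principal component of
the correlation matrix is used as a simple stand-in for a proper factor
analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
n_people = 2000

# Hypothetical loadings of five tests on one general factor.
loadings = np.array([0.8, 0.7, 0.6, 0.5, 0.4])

# Each score = general component + test-specific component (unit variance).
g = rng.normal(size=n_people)
specific = rng.normal(size=(n_people, loadings.size))
scores = g[:, None] * loadings + specific * np.sqrt(1.0 - loadings**2)

# Correlation matrix of the tests: the "positive manifold".
corr = np.corrcoef(scores, rowvar=False)

# eigh returns eigenvalues in ascending order; take the largest.
eigvals, eigvecs = np.linalg.eigh(corr)
first = eigvecs[:, -1]
first = first * np.sign(first.sum())   # fix the arbitrary sign
share = eigvals[-1] / eigvals.sum()

print("smallest pairwise correlation:", corr.min().round(3))
print("first-component weights:", first.round(3))
print("share of variance on first component:", round(share, 3))
```

Every test gets a weight of the same sign on the first component, but
the weights differ, and the remaining components carry the test-specific
variance.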

The controversy has been about whether the fact that the tests are
validated on the basis of positive correlations among the items is the
artifactual source of the positive manifold underlying G. I am not
a statistician or a psychometrician, but I think the more competent,
objective verdict (the one not driven by a-priori ideological views)
has been that G is *not* an artifact of the selection for positive
correlations, but a genuine empirical finding about a single general
(indeed biological) factor underlying intelligence.

I am not saying there will be a "G" underlying research performance!
Just that the multilinear (and indeed nonlinear) regression method can
be used to tease out the variance and the predictivity from a rich and
diverse set of intercorrelated predictor metrics. (It can also sort
out the duds, that are either redundant or predict nothing of interest
at all.)
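
One simple way to sort out duds is to look at how much each candidate
metric adds to the variance accounted for, once a metric already in the
equation has been partialled out. The sketch below is Python (NumPy) on
synthetic data only: "citations" and "downloads" are just labels for
simulated columns, and the "redundant" and "dud" metrics are
hypothetical constructs for the example.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of a least-squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(2)
n = 1000

quality = rng.normal(size=n)                        # unobserved criterion
citations = quality + rng.normal(size=n)            # metric already in use
downloads = quality + rng.normal(size=n)            # correlated, but informative
redundant = citations + 0.01 * rng.normal(size=n)   # near-copy of citations
dud = rng.normal(size=n)                            # unrelated to anything
outcome = quality + 0.5 * rng.normal(size=n)        # observed outcome measure

base = citations[:, None]
base_r2 = r_squared(base, outcome)

# Gain in R^2 from adding each candidate to the base equation.
candidates = {"downloads": downloads, "redundant": redundant, "dud": dud}
gain = {
    name: r_squared(np.column_stack([base, x]), outcome) - base_r2
    for name, x in candidates.items()
}

print("base R^2 (citations only):", round(base_r2, 3))
for name, value in gain.items():
    print(f"added R^2 from {name}: {value:.4f}")
```

In-sample gains are never negative, so in practice one would test them
for significance or check them on held-out data rather than read them
off directly.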

> > SH: Metrics are trying to measure and evaluate research performance,
>
> I think you mean 'predict' - not the same thing at all

They measure the predictor variable and try to predict the criterion
variable. As such, they are meant to provide an objective (but
validated) basis for evaluation.

> >SH: not just to 2nd-guess the present RAE outcome,
> >nor merely to ape existing funding levels. We need a rich multiple
> >regression equation, with many weighted predictors, not just one
> >redundant mirror image of existing funding!
>
> Well.... In fact 'existing funding' *may* actually be a good
> predictor of whatever it is we want to predict (see [deleted]'s recent
> posting)!

To repeat: The RAE itself is a predictor, in want of validation. Prior
funding correlates 0.98 with this predictor (in some fields, and is
hence virtually identical with it), but is itself in want of validation.
This high correlation with the actual RAE outcome is already rational
grounds for scrapping the time-wasting and expensive ritual that is the
present RAE, but it is certainly not grounds for scrapping other metrics
that can and should be weighted components in the metric equation that
replaces the current wasteful and redundant RAE. The metric predictors
can then be enriched, cross-tested, and calibrated. (It is my
understanding that RAE 2008 will consist of a double exercise: yet
another iteration of the current ergonomically profligate RAE ritual
plus a parallel metric exercise. I think they could safely scrap the
ritual already, but the parallel testing of a rich battery of actual
and potential metrics is an extremely good -- and economical -- idea.)

> We can only test such hypotheses when we are clear what it
> is we want to predict, and what we mean by 'accuracy' of prediction.

In the first instance, in the decision about whether or not to scrap the
expensive and inefficient current RAE ritual, it is sufficient to
predict the current RAE outcome with metrics.

In order to go on to test and strengthen the predictive power of
that battery of metrics, they need to be enriched and diversified,
internally validated and weighted against one another (and the prior
RAE), and externally validated against the kinds of measure I mentioned
(subjective peer evaluations, predictive power across time, perhaps
other outcome metrics, etc.).

> Even if we knew this, I'm not sure the right data is available. But
> in the absence of such a proper investigation, let's not pretend that
> the answer is obvious, as you seem to be doing.

The answer is obvious insofar as scrapping the prior RAE method is
concerned, given the strong correlations. The answer is also obvious
regarding the fact that multiple metrics are preferable to a single
one. Ways of strengthening the predictive power of objective measures
of research performance are practical and empirical matters we need to
be analysing and upgrading continuously.

Stevan Harnad