Enriching the Impact Regression Equation

Stevan Harnad harnad at ECS.SOTON.AC.UK
Sun Jan 16 10:01:57 EST 2005

In the OACI Leiden statement (if there is to be one)
the following constructive recommendations could perhaps be made:

The 2-year average number of citations to a journal (i.e., the ISI impact
factor) is not meaningless and unpredictive, but merely a needlessly
crude measure of the impact of either an article, an author or a journal.
It can be greatly refined and improved.

Apart from exact citation counts for articles (and authors), and apart
from avoiding the comparison of apples with oranges (by making sure these
measures are used in comparing like with like), there are obvious ways
that even journal impact factors could be made far more accurate and
representative of true research impact.

Right now, "like tends to cite like" in more ways than one! Not only do
articles in phytology tend to cite articles in phytology, but average research
tends to cite average research! This means that there is necessarily a quantitative
citation bulge toward the middle (mean) of the distribution that masks any far more
important qualitative impact from the smaller, higher-quality tail-end of the
distribution.

There are at least five ways that this could be remedied -- and it makes
no sense to wait for ISI, with their primary need to pay more attention
to market matters, to get around to doing all this for us. A growing
Open Access full-text corpus can count on many talented and enterprising
doctoral students like Tim Brody doing this and more:

(1) RECURSIVE "CiteRank": A recursive measure of citation weight could
replace flat citation counting: If article A cites article B, the weight
of A's citation is not 1 but a normalized multiple of
1 based on the number of citations the *citing* article has itself
received. This would go some way toward replacing the pure weight of
numbers by a recursive measure of the weight of the numbers (without ever
yet leaving the circle of citation counts themselves). Average work will
lose some of its strength-of-numbers unless it manages to draw citations
from above-average articles too (still in terms of citation counts).

[This recursive technique is analogous to Google's PageRank, hence could
perhaps be called CiteRank; it is ironic that Google got the idea of
PageRank from citation ranking, but then improved it, yet the improvement
has not yet percolated back to citation ranking, because ISI had no
particular motive to implement it -- perhaps even a disincentive, as it
might reduce the journal impact factor of the large, average journals
which are of necessity ISI's numerical mainstay!]
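
As a minimal sketch of (1), assuming a PageRank-style iteration is an acceptable stand-in for the "normalized multiple" described above: the function name cite_rank, the damping value 0.85 (borrowed from PageRank) and the toy corpus are all illustrative assumptions, not part of any existing ISI or OA service.

```python
def cite_rank(cites, damping=0.85, iterations=50):
    """Recursive citation weight for each article.

    cites maps every article to the list of articles it cites.
    The damping factor and iteration count are assumed values.
    """
    articles = set(cites)
    for refs in cites.values():
        articles.update(refs)
    n = len(articles)
    score = {a: 1.0 / n for a in articles}
    for _ in range(iterations):
        # Every article keeps a small baseline weight.
        new = {a: (1.0 - damping) / n for a in articles}
        for citer in articles:
            refs = cites.get(citer, [])
            if refs:
                # A citer passes its own weight on to what it cites.
                share = damping * score[citer] / len(refs)
                for cited in refs:
                    new[cited] += share
            else:
                # An article citing nothing spreads its weight evenly.
                share = damping * score[citer] / n
                for a in articles:
                    new[a] += share
        score = new
    return score


# Hypothetical toy corpus: each key cites the articles in its list.
toy_cites = {"A": ["C"], "B": ["C"], "C": ["D"], "E": ["F"], "D": [], "F": []}
ranks = cite_rank(toy_cites)
```

Here D and F each have a flat citation count of 1, but D's one citation comes from the twice-cited C while F's comes from the uncited E, so D ends up with the higher score.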

(2) USAGE COUNTS: The circularity of citation counting can also be broken
in various ways. One is by adding download counts to the impact measure,
not as a weight on the citation count, but as a second variable in a
multiple regression equation. We now know from Tim Brody's findings that
downloads correlate with and hence predict citations. That means citation
counts plus download counts are better predictors of impact than just
citation counts alone, and are especially good at correcting for early
impact, which may not yet be felt in the citation counts.
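
A two-predictor version of such an equation can be sketched as follows. This is only an illustration under stated assumptions: later citation counts are taken as the criterion, the helper names solve and fit_impact are hypothetical, and the numbers are invented and noise-free purely so the fit can be checked by eye (downloads are in hundreds).

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_impact(rows, target):
    """Ordinary least squares: target ~ b0 + b1*x1 + b2*x2 + ..."""
    design = [[1.0] + list(r) for r in rows]  # prepend intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, target)) for i in range(k)]
    return solve(xtx, xty)


# Invented data: (early citations, downloads in hundreds) per article.
predictors = [(0, 1), (1, 3), (2, 2), (3, 9), (5, 4)]
# Invented criterion, built as 1 + 2*citations + 1*downloads.
later_citations = [2, 6, 7, 16, 15]

coefficients = fit_impact(predictors, later_citations)
# coefficients is [intercept, weight on citations, weight on downloads]
```

Further variables (rating scores, self-citation scores) would simply be extra columns in predictors, with the fit assigning each its own weight.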

(3) RATING SCORES: There is a more radical way to break out of the
circularity of citation counting, and its results can be used in two
ways: Systematic rating polls can easily be
conducted, asking researchers (by field and subfield) to rank the N most
important articles in their field in the past year (or two). Even with
the inevitable incest this will evoke, a good-sized systematic sample will
pick out the recurrent articles (because, by definition, local-average
mediocrity effects are merely local) and then the rankings could either
be used as (3a) a third independent variable in the impact regression
equation or, perhaps more interestingly, as (3b) another constraint on
the weighting of the CiteRank score (effectively making that weight the
result of a 2nd order regression equation based on the citer's citation
count as well as on the citer's rating score: the download count could also be
used instead as a 3rd component in this 2nd order regression). The result
will be a still better adjustment of the citation count for an article
(and hence an adjustment of the journal's average citation count too).
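
How the poll ballots are turned into a rating score is left open above. One possible sketch, assuming a simple Borda-style points scheme (rank 1 earns N points, rank N earns 1), is shown here; the function name aggregate_rankings and the ballots are hypothetical.

```python
from collections import defaultdict


def aggregate_rankings(ballots, n):
    """Sum Borda-style points over ballots; each ballot lists up to n
    articles, most important first."""
    scores = defaultdict(int)
    for ballot in ballots:
        for position, article in enumerate(ballot[:n]):
            scores[article] += n - position
    return dict(scores)


# Hypothetical ballots from three researchers in one subfield (N = 3).
ballots = [["X", "P", "Q"], ["X", "R", "S"], ["T", "X", "U"]]
rating = aggregate_rankings(ballots, 3)
```

Article X recurs across all three ballots and so accumulates the highest score, whereas each locally favoured article appears only once.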

(4) CO-CITATION & HUB-AUTHORITY SCORES: Although I would need to consult
with a statistician to sort it out optimally, I am certain that
co-citation (what article/author is co-cited with what article/author)
can also be used to correct or add to the impact regression equation. So,
I expect, could a hub (fan-out) and authority (fan-in) score, as well
as a better use of citation latency (ISI's "immediacy index") in the
impact equation.
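
The co-citation part, at least, is easy to compute from the reference lists alone. A rough sketch, with a hypothetical function name and toy data, and making no claim about how the resulting counts should best enter the regression:

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(cites):
    """Count how often each pair of articles appears together in the
    reference list of the same citing article."""
    pairs = Counter()
    for refs in cites.values():
        for a, b in combinations(sorted(set(refs)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical citing papers and their reference lists.
reference_lists = {"P1": ["A", "B", "C"], "P2": ["A", "B"], "P3": ["B", "C"]}
cocited = cocitation_counts(reference_lists)
```

In this toy data A and B are co-cited twice (by P1 and P2), as are B and C, while A and C are co-cited only once.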

(5) AUTHOR/JOURNAL SELF-CITATIONS: Another clean-up factor for citation
counts is of course the elimination of self-citations, which would
be interesting not only for author self-citations, but also journal
self-citations: This too might be added as another pair of variables in
the regression equation (author self-citation score and journal
self-citation score), with each weight adjusting itself as the variable
proves its predictivity.
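
Both self-citation scores could be tallied roughly as follows. This is a sketch only: it assumes author names are already disambiguated and that each article's journal is known, and the function name and data are invented for illustration.

```python
def self_citation_scores(citations, authors, journal):
    """For each cited article, count all citations, those sharing an
    author with the citing article, and those from the same journal."""
    counts = {}
    for citing, cited in citations:
        c = counts.setdefault(
            cited, {"total": 0, "author_self": 0, "journal_self": 0})
        c["total"] += 1
        if set(authors[citing]) & set(authors[cited]):
            c["author_self"] += 1
        if journal[citing] == journal[cited]:
            c["journal_self"] += 1
    return counts


# Hypothetical metadata for four articles.
authors = {"A": ["Smith"], "B": ["Smith", "Jones"], "C": ["Lee"], "D": ["Kim"]}
journal = {"A": "J1", "B": "J2", "C": "J1", "D": "J3"}
# (citing, cited) pairs: B, C and D all cite A.
citations = [("B", "A"), ("C", "A"), ("D", "A")]

scores = self_citation_scores(citations, authors, journal)
```

Of A's three citations, one is an author self-citation (B shares an author with A) and one is a journal self-citation (C appeared in the same journal); the two counts are kept separate so each can enter the regression as its own variable.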

The predictivity and validity of the regression equation should of
course also be actively tested and calibrated by validating it against
(a) later citation impact, (b) subjective impact ratings (3, above), (c)
other impact measures such as prizes, funding, and time-line descendants
that are further than one citation-step away (A is cited by B, B is
cited by C: this could be an uncited credit to A...)
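
The two-step descendants in (c) can be found directly from the citation graph. A small sketch, with a hypothetical function name and toy graph:

```python
def indirect_citers(cites, article):
    """Articles that cite a citer of `article` without citing it directly."""
    cited_by = {}
    for citer, refs in cites.items():
        for ref in refs:
            cited_by.setdefault(ref, set()).add(citer)
    direct = cited_by.get(article, set())
    second = set()
    for b in direct:
        second |= cited_by.get(b, set())
    return second - direct - {article}


# Hypothetical graph: B and D cite A; C and D cite B; E cites C.
graph = {"B": ["A"], "C": ["B"], "D": ["B", "A"], "E": ["C"]}
uncited_credit = indirect_citers(graph, "A")
```

Only C is returned: it cites B, which cites A, but never cites A itself. D is excluded because it already cites A directly, and E is three steps away.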

And all of this is without even mentioning full-text "semantic" analysis.
So the potential world of impact analysis is a rich and diverse one. Let us not be
parochial, focussing only on the limits of the ISI 2-year average journal
citation-count that has become so mindlessly overused by libraries and
assessors. Let us talk instead about the positive horizons OA opens up!

Cheers, Stevan

More information about the SIGMETRICS mailing list