skewed citation distributions should not be averaged

Thu Sep 1 09:12:09 EDT 2011

The crucial issue here is the point, which Dr. Glaenzel emphasizes, that
for the Central Limit Theorem to be applicable, and hence for the mean to
be valid, the distribution has to fall in the "domain of attraction of the
Gaussian distribution".  As others have pointed out, the Pareto or
power-law distribution that the citation distribution is believed to
approximate does not fall in this domain of attraction if its exponent is
less than 3.  Thus, the theorem is not wrong, but it's not applicable here.
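
To spell out the exponent condition, here is a short sketch, assuming a
pure continuous power law with a lower cutoff.  The symbols \alpha,
x_{\min} and C are notation introduced here for illustration, and real
citation data follow this form only approximately, in the tail.

```latex
p(x) = C\,x^{-\alpha}, \qquad x \ge x_{\min}, \qquad
C = (\alpha-1)\,x_{\min}^{\alpha-1}

\langle x \rangle = \int_{x_{\min}}^{\infty} x\,p(x)\,dx
  = \frac{\alpha-1}{\alpha-2}\,x_{\min} \quad (\alpha > 2),
  \qquad \text{divergent for } \alpha \le 2

\langle x^2 \rangle = \int_{x_{\min}}^{\infty} x^2\,p(x)\,dx
  = \frac{\alpha-1}{\alpha-3}\,x_{\min}^2 \quad (\alpha > 3),
  \qquad \text{divergent for } \alpha \le 3
```

So for exponents below 3 the variance is infinite, which is what takes the
distribution out of the Gaussian domain of attraction, and below 2 even the
mean of the underlying distribution diverges.  The boundary case of an
exponent of exactly 3 is marginal and is not covered by this sketch.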

What does this mean in practice?  Of course one can always calculate a mean
number of citations for a given data sample.  But if one calculates such
means for different samples -- even samples drawn from the exact same
underlying distribution -- one will get wildly different answers.  Indeed,
it can be shown that the values of the mean themselves follow a power law
under these circumstances, and hence can themselves vary over orders of
magnitude.

When people say the mean is invalid, this is what they are referring to.
You may calculate a mean from this year's data and get a value of 1, then
calculate it again for next year's and get a value of 100.  Under such
conditions, it does seem ill-advised to impute much meaning to such
measurements.
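
The sample-to-sample variation described above can be checked numerically.
What follows is only a minimal sketch, assuming a pure continuous power law
p(x) ~ x^(-alpha) for x >= x_min; the function names, the exponents tried,
and the sample sizes are all illustrative choices, not anything taken from
real citation data.

```python
import random
import statistics


def sample_power_law(n, alpha, x_min=1.0, rng=random):
    """Draw n values from p(x) ~ x**(-alpha), x >= x_min (inverse transform).

    Assumes alpha > 1 so that the distribution can be normalized.
    """
    exponent = -1.0 / (alpha - 1.0)
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and is never zero.
    return [x_min * (1.0 - rng.random()) ** exponent for _ in range(n)]


def sample_means(alpha, n_samples=20, sample_size=10_000, seed=0):
    """Means of n_samples independent samples from the same distribution."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    return [
        statistics.fmean(sample_power_law(sample_size, alpha, rng=rng))
        for _ in range(n_samples)
    ]


if __name__ == "__main__":
    # Illustrative exponents: below 2, between 2 and 3, and above 3.
    for alpha in (1.8, 2.5, 3.5):
        means = sample_means(alpha)
        spread = max(means) / min(means)
        print(f"alpha={alpha}: smallest mean={min(means):.3g}, "
              f"largest mean={max(means):.3g}, ratio={spread:.3g}")
```

Every sample for a given exponent comes from the same distribution, so the
ratio of the largest to the smallest mean is a rough gauge of how far the
"same" measurement can move from one sample to the next.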

Mark Newman

--
Mark Newman
Paul Dirac Professor of Physics
University of Michigan

On 11-09-01 03:54 AM, Glanzel, Wolfgang wrote:
>
> Dear Colleagues,
>
> Please read the text of 2.7 Myth #7 carefully. It is not about the distribution itself but about the distribution of the mean value. Furthermore, the text is not a mere assertion but is based on a proven theorem in probability theory. One needs large, though not huge, data sets for empirical application.
> I would also like to stress that the mean value is still an efficient and unbiased estimator of the expected value of the underlying random variable. This applies to all distributions (continuous, discrete, symmetrical, skewed, or otherwise) as long as the latter is finite.
> Best regards,
>
> Wolfgang