Re: [SIGMETRICS] skewed citation distributions should not be averaged

Thu Sep 1 07:41:56 EDT 2011

Dear Jesper and colleagues,

This discussion is a bit amazing to me.

> Concerning sampling. Random sampling has a precise, technical meaning:
> sample units are drawn independently, and each unit in the population has an
> equal chance to be drawn at each stage.
>

This is how I learned it. We almost never have this condition in our type of
studies. For example, if one is interested in the impact of a research unit
in a field of science, one does not draw randomly from the papers in this
field of science, but selects them using very precise criteria. Thus, we do
not study random samples in this type of design. Do we agree?

> Independence, equal chance and the population are central issues. Leaving
> independence and chance aside (most often they are violated), defining the
> population is central to your question. You need a clearly defined real
> population, or a natural chance mechanism, in order to justify random
> sampling - in order to claim a random sample of something manifest.
>

Isn't the reference set considered as the population? For example, the
journal, the field, or the country. One cannot even compute percentiles
without specification of a reference set. The reference set is also needed
for the specification of the expectation.

Unlike biology (or medicine), we are not dealing with populations and random
samples from these, but with specific--culturally meaningful--sets and
subsets thereof.

> Imaginary, assumed or super-populations are ill-defined and quite frankly
> convenient fictions because they do not have any empirical existence of
> their own. Random sampling from such "populations" also becomes fictitious as
> the data-generation mechanism is uncertain and assumptions become
> unjustifiable.
>

Wolfgang's point is that the means of sufficiently large random samples are
approximately normally distributed even if the underlying distribution is
not. Thus, one can compare means as "unbiased estimators" of the expected
values of the distributions (using parametric statistics, e.g., the t-test).
Now that I fully understand this point (given some offline conversation), I
agree with it.
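
A minimal sketch of this point, using only the Python standard library. The
lognormal distribution is an assumption chosen as a stand-in for a skewed
citation distribution, and the sample sizes are arbitrary; the samples here
are genuinely random draws, which is exactly the condition under discussion.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def draw_skewed(n, rng):
    """Draw n values from a lognormal distribution (assumed stand-in)."""
    return [rng.lognormvariate(0.0, 1.0) for _ in range(n)]


def means_of_samples(n_samples, sample_size, rng):
    """Mean of each of n_samples independent random samples."""
    return [statistics.fmean(draw_skewed(sample_size, rng))
            for _ in range(n_samples)]


if __name__ == "__main__":
    rng = random.Random(2011)  # fixed seed so the run is repeatable
    raw = draw_skewed(20000, rng)
    means = means_of_samples(500, 500, rng)
    print("skewness of raw values:  ", round(skewness(raw), 2))
    print("skewness of sample means:", round(skewness(means), 2))
```

The skewness of the sample means should come out much closer to zero than
that of the raw values. This says nothing about samples selected by
intellectual criteria.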

However, one should not use means (or other central tendency statistics) for
performance evaluation. (If one wishes nevertheless to use means then these
means are normally distributed. :-)). Using the mean, the N in the
denominator has detrimental effects on performance measurement. For example,
take two equally highly cited authors with equal N, and add a number of PhD
students and postdocs to one of them but not to the other. The two research
groups will then be evaluated very differently when the N is used in the
denominator. However, the less-cited papers of the junior researchers add
to the impact of the unit.
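
A small worked example of this effect. The citation counts below are
invented for illustration only; they are not taken from any data set.

```python
from statistics import fmean

# Hypothetical citation counts per paper (illustrative values only).
seniors = [40, 35, 30]   # papers of a highly cited author
juniors = [5, 3, 2]      # papers added by PhD students and postdocs


def summarize(citations):
    """Return the number of papers, total citations, and mean citations."""
    return {
        "n": len(citations),
        "total": sum(citations),
        "mean": fmean(citations),
    }


group_a = summarize(seniors)            # author alone
group_b = summarize(seniors + juniors)  # same author plus juniors

if __name__ == "__main__":
    print("group A:", group_a)
    print("group B:", group_b)
```

Group B has more citations in total than group A, yet its mean is lower
because the N in the denominator has doubled.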

If I hit you twice, the impact is not the average of the two hits, but the
sum. The surfaces beneath the citation curves have to be integrated and can
then be summed and subtracted. However, the integration would lead to
"total cites", and this is too raw a number. By first normalizing in terms
of percentiles, one can make sure that top-1% is compared with top-1%, etc.
Both averages and sum totals are too crude as measures. The Integrated
Impact Indicator solves these issues elegantly.
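
A rough sketch of the idea of normalizing to percentiles first and then
summing. This is a simplified illustration under my own assumptions, not the
published definition of the Integrated Impact Indicator: the percentile rule
(share of reference-set papers cited less often), the function names, and
the citation counts are all hypothetical.

```python
from bisect import bisect_left


def percentile_rank(citations, sorted_reference):
    """Percentage of reference-set papers cited less often than this paper."""
    below = bisect_left(sorted_reference, citations)
    return 100.0 * below / len(sorted_reference)


def integrated_score(unit, reference):
    """Sum (not average) of the percentile ranks of the unit's papers."""
    ref = sorted(reference)
    return sum(percentile_rank(c, ref) for c in unit)


# Hypothetical reference set (e.g., a journal or field) and two units.
reference = [0, 0, 1, 2, 3, 5, 8, 13, 40, 100]
seniors_only = [40, 100]
with_juniors = [40, 100, 2, 1]

if __name__ == "__main__":
    print(integrated_score(seniors_only, reference))  # 170.0
    print(integrated_score(with_juniors, reference))  # 220.0
```

With these numbers the junior papers raise the summed score from 170 to
220, whereas the average percentile would fall from 85 to 55. Note that the
percentiles cannot be computed at all without first choosing the reference
set.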

> In observational studies, we are often left with samples of convenience or
> "apparent populations" and we very seldom examine or justify the
> data-generation mechanism needed for a probability sample. In scientometrics
> we have "large" or "huge" second-order data sets that resemble "apparent
> populations" - we can draw random samples from these if we clearly define
> the population and ensure that a sampling technique, where units are drawn
> independently and with equal chance, is used. Using "journals", "keywords",
> "institutions" or the like as a selection criterion without defining a real
> population can easily create "selection bias" and violate the assumptions
> needed.
>

It seems to me that we agree. These assumptions -- needed for using means --
are usually violated in empirical studies. We deliberately introduce a
selection bias by using intellectual criteria. :-) This is not biology (or
thermodynamics).

Best wishes, Loet

> Kind regards - Jesper W. Schneider
>
> -----Original Message-----
> From: ASIS&T Special Interest Group on Metrics [mailto:
> SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Loet Leydesdorff
> Sent: 1 September 2011 11:43
> To: SIGMETRICS at LISTSERV.UTK.EDU
> Subject: Re: [SIGMETRICS] skewed citation distributions should not be averaged
>
> Administrative info for SIGMETRICS (for example unsubscribe):
> http://web.utk.edu/~gwhitney/sigmetrics.html
>
> Dear Wolfgang:
>
> Let's try to take this further. I have two questions:
>
> 1. You formulate:
>
> "According to the central limit theorem, the distribution of the means of
> random samples is approximately normal for a large sample size, provided
> the underlying distribution of the population is in the domain of
> attraction of the Gaussian distribution."
>
> What is a "large" as distinct from a "huge" sample size? In Pajek, one
> calls networks "huge" with more than 100,000 nodes. Do you mean that order
> of magnitude? (10^5)
>
> 2. Are samples such as all citable items of a specific journal random
> samples? The same in the case of performance measurement: can the sample of
> all papers of the University of Louvain in 2010 be considered as a random
> sample? Can samples based on specific selection criteria (such as search
> strings) or stratified samples equally be considered as random?
>
> Perhaps, I learned wrongly how to draw a random sample. :-)
>
> Best wishes,
> Loet
>
> -----Original Message-----
> From: ASIS&T Special Interest Group on Metrics
> [mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Glanzel, Wolfgang
> Sent: Thursday, September 01, 2011 9:55 AM
> To: SIGMETRICS at LISTSERV.UTK.EDU
> Subject: Re: [SIGMETRICS] skewed citation distributions should not be
> averaged
>
> Dear Colleagues,
>
> Please, read the text of 2.7 Myth #7 carefully. It is not about the
> distribution itself but about the distribution of the mean value.
> Furthermore, the text is not a mere statement but is based on a proven
> theorem in probability theory. One needs large, though not huge, data sets
> for empirical application.
> I would also like to stress that the mean value is still an efficient and
> unbiased estimator of the expected value of the underlying random variable.
> This applies to all (continuous, discrete, symmetrical, skewed or
> whatsoever) distributions as long as the latter one is finite.
> Best regards,
>
> Wolfgang
> -----Original Message-----
> From: ASIS&T Special Interest Group on Metrics
> [mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Sylvan Katz
> Sent: Wednesday, 31 August 2011 23:24
> To: SIGMETRICS at LISTSERV.UTK.EDU
> Subject: Re: [SIGMETRICS] skewed citation distributions should not be
> averaged
>
> Loet,
>
> Yes - perhaps something in the order of 20-30 years of Scopus or WoS data
> might be large enough.
>
> Sylvan
>
>
> --On Wednesday, August 31, 2011 11:08 PM +0200 Loet Leydesdorff
> <loet at LEYDESDORFF.NET> wrote:
>
> >> A closer look at the evolution of the citation distributions over a
> >> long period of time may be necessary before a definitive answer can be
> >> given to the question of whether "Citation distributions are so skewed
> >> that using the mean or any other central tendency measure is
> >> ill-advised."
> >
> > Dear Sylvan,
> >
> > Wouldn't one need very large samples (N > 10^6) to test this?
> > Typically, IFs, for example, are computed over 10^2 - 10^3 citable items.
> >
> > Best,
> > Loet
> >
>
>
>
> Dr. J. Sylvan Katz, Visiting Fellow
> SPRU, University of Sussex
> http://www.sussex.ac.uk/Users/sylvank
>
>

-- 
Prof. Loet Leydesdorff
Amsterdam School of Communications Research (ASCoR)
Kloveniersburgwal 48, 1012 CX Amsterdam
Tel.: +31-20- 525 6598; fax: +31-20- 525 3681
loet at leydesdorff.net ; http://www.leydesdorff.net/