Re: [SIGMETRICS] skewed citation distributions should not be averaged

Thu Sep 1 07:06:43 EDT 2011

Dear Loet,

Concerning sampling. Random sampling has a precise, technical meaning: sample units are drawn independently, and each unit in the population has an equal chance to be drawn at each stage.
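As a minimal sketch of that definition (the function name and the toy population are my own illustration, not part of any established tool), independent equal-chance draws from a fully enumerated population could look like this in Python:

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n units independently; every unit has the same chance at every draw."""
    rng = random.Random(seed)
    # rng.choice picks uniformly from the list, and successive calls do not
    # depend on each other, so this is sampling with replacement.
    return [rng.choice(population) for _ in range(n)]


if __name__ == "__main__":
    # Hypothetical population: it must be fully enumerated before any draw.
    papers = [f"paper_{i}" for i in range(1000)]
    print(simple_random_sample(papers, 5, seed=42))
```

Note that sampling without replacement (for example with random.sample) keeps the equal chance but makes the draws dependent on each other, which matters little when the population is much larger than the sample.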

Independence, equal chance and the population are the central issues. Leaving independence and equal chance aside (most often they are violated), defining the population is central to your question. You need a clearly defined real population, or a natural chance mechanism, in order to justify random sampling, that is, in order to claim a random sample of something manifest.

Imaginary, assumed or super-populations are ill-defined and, quite frankly, convenient fictions, because they have no empirical existence of their own. Random sampling from such "populations" also becomes fictitious, as the data-generation mechanism is uncertain and the assumptions become unjustifiable.

In observational studies, we are often left with samples of convenience or "apparent populations", and we very seldom examine or justify the data-generation mechanism needed for a probability sample. In scientometrics we have "large" or "huge" second-order data sets that resemble "apparent populations". We can draw random samples from these if we clearly define the population and ensure that we use a sampling technique in which units are drawn independently and with equal chance. Using "journals", "keywords", "institutions" or the like as a selection criterion without defining a real population can easily create "selection bias" and violate the assumptions needed.
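To illustrate the selection-bias point, here is a small Python sketch on purely synthetic data. The two journals "A" and "B", their citation rates and all function names are invented for the illustration; nothing here is measured from a real database.

```python
import random
from statistics import mean


def build_population(rng, per_journal=5000):
    """Synthetic papers as (journal, citations); the citation rates are invented."""
    papers = []
    for journal, avg_cites in (("A", 20.0), ("B", 2.0)):
        for _ in range(per_journal):
            # expovariate takes the rate, so 1/avg_cites gives mean avg_cites.
            papers.append((journal, rng.expovariate(1.0 / avg_cites)))
    return papers


def compare_selections(seed=0, n=1000):
    """Return (population mean, mean of journal-A selection, mean of random sample)."""
    rng = random.Random(seed)
    papers = build_population(rng)
    population_mean = mean(c for _, c in papers)
    # Selection by criterion: keep only the papers of journal "A".
    criterion_mean = mean(c for j, c in papers if j == "A")
    # Random sample: n independent, equal-chance draws from the whole population.
    random_mean = mean(rng.choice(papers)[1] for _ in range(n))
    return population_mean, criterion_mean, random_mean


if __name__ == "__main__":
    print(compare_selections())
```

Under these assumptions the journal-based selection estimates the mean of journal "A", not of the defined population, while the random sample stays close to the population mean.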

Kind regards - Jesper W. Schneider

-----Original Message-----
From: ASIS&T Special Interest Group on Metrics [mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Loet Leydesdorff
Sent: September 1, 2011 11:43
To: SIGMETRICS at LISTSERV.UTK.EDU
Subject: Re: [SIGMETRICS] skewed citation distributions should not be averaged

Dear Wolfgang: 

Let's try to take this further. I have two questions:

1. You formulate:

"According to the central limit theorem, the distribution of the means of
random samples is approximately normal for a large sample size, provided the
underlying distribution of the population is in the domain of attraction of
the Gaussian distribution."

What is a "large" as distinct from a "huge" sample size? In Pajek, networks
with more than 100,000 nodes are called "huge". Do you mean that order of
magnitude (10^5)?

2. Are samples such as all citable items of a specific journal random
samples? The same in the case of performance measurement: can the sample of
all papers of the University of Louvain in 2010 be considered as a random
sample? Can samples based on specific selection criteria (such as search
strings) or stratified samples equally be considered as random?

Perhaps I learned the wrong way to draw a random sample. :-)

Best wishes, 
Loet

-----Original Message-----
From: ASIS&T Special Interest Group on Metrics
[mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Glanzel, Wolfgang
Sent: Thursday, September 01, 2011 9:55 AM
To: SIGMETRICS at LISTSERV.UTK.EDU
Subject: Re: [SIGMETRICS] skewed citation distributions should not be
averaged

Dear Colleagues,

Please read the text of 2.7 Myth #7 carefully. It is not about the
distribution itself but about the distribution of the mean value.
Furthermore, the text is not a mere statement but is based on a proven
theorem in probability theory. One needs large, though not huge, data sets
for empirical application.
I would also like to stress that the mean value is still an efficient and
unbiased estimator of the expected value of the underlying random variable.
This applies to all distributions (continuous, discrete, symmetrical, skewed
or otherwise) as long as that expected value is finite.
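The distinction between the distribution itself and the distribution of the mean can be sketched with a short simulation. This is only an illustration under stated assumptions: a lognormal distribution (finite variance, hence within the domain of attraction of the Gaussian) stands in for a skewed citation distribution, and the sample sizes and function names are arbitrary choices, not values taken from the text under discussion.

```python
import random
from statistics import mean, pstdev


def skewness(values):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def simulate_means(sample_size=500, n_samples=1000, seed=0):
    """Return raw skewed draws and the means of n_samples random samples."""
    rng = random.Random(seed)

    def draw():
        # Assumed stand-in for a skewed distribution with finite variance.
        return rng.lognormvariate(0.0, 1.0)

    raw = [draw() for _ in range(n_samples)]
    means = [
        sum(draw() for _ in range(sample_size)) / sample_size
        for _ in range(n_samples)
    ]
    return raw, means


if __name__ == "__main__":
    raw, means = simulate_means()
    print("skewness of raw values:  ", skewness(raw))
    print("skewness of sample means:", skewness(means))
```

In this sketch the raw values stay strongly skewed, whereas the sample means are much closer to symmetric and centre on the expected value. A distribution with infinite variance (for instance a Pareto tail with exponent 2 or less) would fall outside the classical theorem and would need a separate treatment.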
Best regards,

Wolfgang
-----Original Message-----
From: ASIS&T Special Interest Group on Metrics
[mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Sylvan Katz
Sent: Wednesday, August 31, 2011 23:24
To: SIGMETRICS at LISTSERV.UTK.EDU
Subject: Re: [SIGMETRICS] skewed citation distributions should not be
averaged

Loet,

Yes - perhaps something on the order of 20-30 years of Scopus or WoS data
might be large enough.

Sylvan

--On Wednesday, August 31, 2011 11:08 PM +0200 Loet Leydesdorff
<loet at LEYDESDORFF.NET> wrote:

>> A closer look at the evolution of the citation distributions over a long
>> period of time may be necessary before a definitive answer can be given
>> to the question of whether "Citation distributions are so skewed that
>> using the mean or any other central tendency measure is ill-advised."
>
> Dear Sylvan,
>
> Wouldn't one need very large samples (N > 10^6) to test this?
> Typically, IFs, for example, are computed over 10^2 - 10^3 citable items.
>
> Best,
> Loet
>

Dr. J. Sylvan Katz, Visiting Fellow
SPRU, University of Sussex
http://www.sussex.ac.uk/Users/sylvank