ABS: Rousseau, LOTKA: A program to fit a power law distribution to observed frequency data
Eric Archambault
Eric_Archambault at INRS-URB.UQUEBEC.CA
Fri Jan 26 17:04:00 EST 2001
I agree with Mark on the danger of working with power law distributions.
Those who have tried to replicate Lotka's work will have noticed that
although he made no mention of it, Lotka excluded some data from his dataset.
One has to set some of the data aside as outliers to
alleviate the problems outlined by Mark, in the case of number-frequency
distributions in scientometrics at least. As he so rightly pointed out,
regressions will be overestimated or underestimated regardless of whether
the data is number-frequency or rank-frequency. In number-frequency
distributions, it is very difficult to calculate a valid regression whatever
the method used. This is inherent to the data and not to the method. This
applies more to problems in the social sciences than in the natural
sciences, although I do not pretend it is absent in the latter. In social
systems, for number-frequency distributions, it is very difficult to
calculate a valid exponent without the use of outliers.
The use of rank-frequency only shifts the problem around. The solution that
I have favoured for rank-frequency data is to minimise this effect by
binning: each frequency is assigned the mid-point of the ranks that share
it, so the data become a mean-rank/frequency distribution. Once this
transformation is done, I'm not certain that least-squares fitting is so
bad, hence my suggestion to test the difference between the results of the
maximum-likelihood and least-squares methods, and why not the
maximum-entropy method while we're at it.
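To make the mean-rank transformation concrete, here is a minimal Python sketch of one reading of it: sources are ranked in decreasing order of output, and all sources tied at a given frequency are replaced by a single point at the mid-point of their ranks. The function name and the sample data are hypothetical, chosen only for illustration.

```python
from collections import Counter


def mean_rank_frequency(counts):
    """Return one (mean_rank, frequency) point per distinct frequency.

    `counts` holds one value per source (e.g. papers per author).
    Ranks run from 1 (most productive) downward; tied sources share
    the mid-point of the ranks they occupy.
    """
    tally = Counter(counts)      # frequency -> number of tied sources
    points = []
    next_rank = 1                # first rank not yet assigned
    for value in sorted(tally, reverse=True):
        first = next_rank
        last = next_rank + tally[value] - 1
        points.append(((first + last) / 2, value))
        next_rank = last + 1
    return points


# Hypothetical data: ten authors and their paper counts.
papers_per_author = [9, 4, 4, 2, 2, 2, 1, 1, 1, 1]
print(mean_rank_frequency(papers_per_author))
# -> [(1.0, 9), (2.5, 4), (5.0, 2), (8.5, 1)]
```

The resulting points can then be passed to whichever fitting method one wishes to compare.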
In the end, the epistemological question remains of how to choose the best
answer, and hence, the best method. If we do not know a priori the power
coefficient of a distribution, and given the weakness of our theoretical
knowledge of why power-law distributions arise in social systems (with
apologies to those drawing (weak) analogies between sand-piles, dinosaur
extinction, and scientific publications: this is neither a theory nor an
explanation for what we observe in scientometric research), there is no
foolproof method to determine which measure is the "real one".
Eric Archambault
-----Original Message-----
From: Mark Newman [mailto:mark at SANTAFE.EDU]
Sent: January 26, 2001 15:31
To: SIGMETRICS at LISTSERV.UTK.EDU
Subject: Re: [SIGMETRICS] ABS: Rousseau, LOTKA: A program to fit a power law
distribution to observed frequency data
> Eric Archambault wrote:
>
> My contribution aimed to inform the people on the list on an additional
> way to calculate power law distributions. I believe that Rousseau's
> contribution is useful since it uses a maximum likelihood approach. It
> would be interesting to compare the extent of the difference between this
> method compared to using least-square fits.
As has been pointed out by many people before (though perhaps
not on this list), performing least-squares fits to data in
order to fit a power law is fraught with danger. The principal
objection to this method is that, with logarithmic fits, the
statistical fluctuations in the logarithms of the data are
greater in the downward direction than in the upward one, for
obvious reasons. This effect is more pronounced in the tail
of the power law, and this has the result that there is a
systematic tendency for least-squares fits to overestimate the
slope of the power law. How much they overestimate depends on
the size of the statistical fluctuations, and is therefore
rather hard to control for. For this reason simple least-squares
fits are to be avoided.
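For reference, the simple procedure being cautioned against can be sketched as follows: an ordinary least-squares line fitted to the logarithms of the data. This is a minimal illustration, not anyone's published code; the function name is hypothetical. Note that points with zero frequency have no logarithm and must be dropped, which is one way the sparse tail gets distorted.

```python
import math


def loglog_least_squares(xs, ys):
    """Fit log(y) = slope * log(x) + intercept by ordinary least squares.

    Points with a non-positive x or y are dropped, since their
    logarithm is undefined. Returns (slope, intercept).
    """
    pts = [(math.log(x), math.log(y)) for x, y in zip(xs, ys) if x > 0 and y > 0]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two positive points")
    mean_x = sum(p[0] for p in pts) / n
    mean_y = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in pts)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in pts)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Noise-free check: y = 100 * x**-2 should give a slope of about -2.
xs = [1, 2, 3, 4, 5]
ys = [100 / x ** 2 for x in xs]
print(loglog_least_squares(xs, ys))
```

On exact data the slope is recovered; the concern raised above applies to real counts, where the fluctuations in the tail are large.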
Two common methods are used to circumvent this problem, neither
of which is perfect: (1) One calculates a backward cumulated
histogram of one's data (also called a rank/frequency plot).
This much improves the statistical fluctuations, but has the
undesirable property that successive data points become
correlated, making the simple statistical estimate of error
on the fit invalid. (2) One performs logarithmic binning of
the data, i.e., binning where the widths of adjacent bins are
a constant ratio, and normalizes by bin width. This reduces
the effects of the fluctuations, but for power laws with slope
greater than -1 it does not eliminate them altogether. (This
latter is my favored method.)
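Both transformations can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function names, the default bin ratio of 2, and the choice to divide by the sample size as well as the bin width (so that the result is a density) are assumptions made here, not details given in the message. For integer-valued data, treating a bin as a continuous interval is itself an approximation.

```python
import math


def rank_frequency(values):
    """Backward cumulated histogram.

    Returns (x, number of observations >= x) for each distinct x,
    in increasing order of x.
    """
    ordered = sorted(values, reverse=True)
    points = []
    for i, v in enumerate(ordered, start=1):
        # Emit a point at the last member of each group of ties.
        if i == len(ordered) or ordered[i] != v:
            points.append((v, i))
    return points[::-1]


def log_binned_density(values, ratio=2.0, x_min=1.0):
    """Histogram with bin edges x_min * ratio**k, normalized by bin width.

    Returns (geometric bin centre, count / (bin width * sample size))
    for each non-empty bin. Values below x_min are ignored.
    """
    counts = {}
    for v in values:
        if v < x_min:
            continue
        k = int(math.log(v / x_min) / math.log(ratio))
        # Guard against floating-point rounding at exact bin edges.
        while x_min * ratio ** (k + 1) <= v:
            k += 1
        while x_min * ratio ** k > v:
            k -= 1
        counts[k] = counts.get(k, 0) + 1
    n = sum(counts.values())
    result = []
    for k in sorted(counts):
        low = x_min * ratio ** k
        high = low * ratio
        result.append((math.sqrt(low * high), counts[k] / ((high - low) * n)))
    return result


data = [1, 1, 1, 1, 2, 2, 2, 4, 4, 9]   # hypothetical observations
print(rank_frequency(data))
print(log_binned_density(data))
```

Either set of points can then be plotted or fitted on logarithmic axes, keeping in mind the caveats stated above for each method.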
Ronald Rousseau proposes a further method based on maximization
of likelihood. This is also a good method to use, but is also
not perfect since, like all maximum likelihood methods, it
implicitly assumes that the probability of the model given the
data is equal to the probability of the data given the model,
which is only strictly true if the prior probabilities of both
model and data are uniform, which in general they are not.
The ultimate correct way of doing it is to use maximum entropy,
given the correct prior on the model. The trouble is, we
rarely know what the correct prior is, which is why maximum
likelihood is popular.
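As a point of comparison, a maximum-likelihood estimate can be sketched as below. This is not Rousseau's LOTKA program, whose details are not given in this thread. It assumes a continuous power law p(x) proportional to x**(-alpha) for x >= x_min, with x_min known, for which the estimate has the closed form alpha = 1 + n / sum(ln(x_i / x_min)). Applying it to integer counts is only an approximation. The function names and the synthetic test data are hypothetical.

```python
import math
import random


def power_law_mle(values, x_min=1.0):
    """Maximum-likelihood exponent for a continuous power law.

    Assumes p(x) ~ x**(-alpha) for x >= x_min. Returns (alpha, stderr),
    where stderr is the large-sample standard error (alpha - 1) / sqrt(n).
    """
    tail = [v for v in values if v >= x_min]
    log_sum = sum(math.log(v / x_min) for v in tail)
    if log_sum <= 0:
        raise ValueError("need at least one observation above x_min")
    alpha = 1.0 + len(tail) / log_sum
    return alpha, (alpha - 1.0) / math.sqrt(len(tail))


def sample_power_law(n, alpha, x_min=1.0, seed=0):
    """Draw n synthetic values from the same continuous power law
    by inverse-transform sampling (for testing the estimator only)."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


# Synthetic check: the estimate should land close to the true exponent.
synthetic = sample_power_law(20000, alpha=2.5)
print(power_law_mle(synthetic))
```

Here alpha is the positive exponent, so the slope on logarithmic axes is -alpha. The estimate uses every observation directly and needs no binning, which makes it a convenient baseline against the least-squares and binned fits discussed above.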
Mark Newman.
--
Prof. M. E. J. Newman
Santa Fe Institute
Santa Fe, New Mexico