Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel "Journal Status" arXiv:cs.GL/0601030 v1 9 Jan 2006

Thu Mar 9 15:44:55 EST 2006

Hooray!  I got a rise out of the guy I was going to name as probably the
only person active today who could possibly solve this problem.  However,
instead of giving a history and bibliography of the thing, why don't you
write a practitioner piece designed for idiots, explaining in simple
mathematical terms--kindergarten, if possible--how a practitioner like me
can handle the problem.  I am tired of always having to dodge around the
issue due to my stupidity, and the ability to estimate the invisible zero
class is very important in the practical management of library collections
and electronic databases.

SB

"Quentin L. Burrell" <quentinburrell at MANX.NET>@LISTSERV.UTK.EDU> on
03/09/2006 02:26:01 PM

Please respond to ASIS&T Special Interest Group on Metrics
       <SIGMETRICS at LISTSERV.UTK.EDU>

Sent by:    ASIS&T Special Interest Group on Metrics
       <SIGMETRICS at LISTSERV.UTK.EDU>

To:    SIGMETRICS at LISTSERV.UTK.EDU
cc:     (bcc: Stephen J Bensman/notsjb/LSU)

Subject:    Re: [SIGMETRICS] Johan Bollen, Marko A. Rodriguez, and Herbert
       Van de Sompel "Journal Status" arXiv:cs.GL/0601030 v1 9 Jan 2006

I have been following these various exchanges with much interest. Let me
pick up a subthread from one of Stephen's earlier pieces, quoted in part:

----- Original Message -----
From: "Stephen J Bensman" <notsjb at LSU.EDU>
To: <SIGMETRICS at LISTSERV.UTK.EDU>
Sent: Monday, March 06, 2006 5:41 PM
Subject: Re: [SIGMETRICS] Johan Bollen, Marko A. Rodriguez, and Herbert Van
de Sompel "Journal Status" arXiv:cs.GL/0601030 v1 9 Jan 2006

> Small as this may be, the
> probabilities and lambda were actually much smaller, for Garfield's
> constant is based on the set of articles actually cited that year, i.e.,
> it is truncated on the left and does not take into account the articles
> that could have been cited but were not.  I do not have the technical or
> intellectual ability to estimate this zero class.  I do know that Sir
> Maurice Kendall backed off from the problem when he confronted it in
> Bradford's Law, and who the hell am I compared to Maurice Kendall.  I
> wish that somebody would write an article understandable to simpletons
> on how to make such estimates.  From my perspective, this would be one
> of the most important articles ever written.
>

The estimation of the zero class is a longstanding problem that I recently
referred to in a Letter to the Editor of JASIS&T ("Sample-size dependence
or time dependence of statistical measures in informetrics?" 55(2),
183-184, 2004).
The relevant extract and some historical references are as follows:

"Yoshikane et al. (2003a, 2003b) also make reference to Good (1953), Good &
Toulmin (1956), and Efron & Thisted (1976) in the context of interpolation
and extrapolation of data. The extrapolation problem - for instance, given
the cumulated data for 1992-1997, what can we say about the distribution if
the cumulation were to be extended to cover 1998? - has a long history.
Within bibliometrics Kendall (1960), in his discussion of Bradford's work
on journal productivity, posed the problem "there is also a non-observed
class of journals which have not carried a relevant article in the period
examined but may do so at any moment in the future. One would like to be
able to estimate the size of this potentially contributory class". This
problem is equivalent to the so-called unseen species problem in ecology
and dates back at least to Fisher et al. (1943), see also Engen (1978). In
the ecological context the extrapolation may be in the sense of widening
the geographical area, in bibliometrics it is in the sense of increasing
the time scale of observation - obviously there are other variants.
Kendall's problem was addressed by Brookes (1975) who essentially, but
independently, demonstrated a special case of the so-called Good & Toulmin
formula (see Burrell (1988)). Efron & Thisted's (1976) empirical approach
was further developed and applied within bibliometrics by Burrell (1989,
1990). The current setting of a database being cumulated over time, using
the Burrell (1992a) data, was addressed by Burrell (1992b).

References

Brookes, B. C. (1975). A sampling theorem for finite discrete
distributions. Journal of Documentation, 31, 26-35.

Burrell, Q. L. (1988). A simple empirical method for predicting library
circulations. Journal of Documentation, 44, 302-314.

Burrell, Q. L. (1989). On the growth of bibliographies with time: an
exercise in bibliometric prediction. Journal of Documentation, 45, 302-317.

Burrell, Q. L. (1990). Empirical prediction of library circulations based
on negative binomial processes. In L. Egghe & R. Rousseau (Eds.),
Informetrics 89/90: Selection of papers submitted for the Second
International Conference on Bibliometrics, Scientometrics and Informetrics
(pp. 57-64). Amsterdam: Elsevier.

Burrell, Q. L. (1992a). The dynamic nature of bibliometric processes: a
case study. In I. K. Ravichandra Rao (Ed.), Informetrics - 91: selected
papers from the Third International Conference on Informetrics
(pp. 97-129). Bangalore: Ranganathan Endowment.

Burrell, Q. L. (1992b). One-step-ahead prediction for a growing database:
an empirical Bayes approach. Journal of Scientific and Industrial
Research, 51, 756-762.

Burrell, Q. L. (2003). The sample size dependency of statistical measures
in informetrics? Some comments. Journal of the American Society for
Information Science and Technology. (Published online 12 June, 2003.)

Efron, B. & Thisted, R. (1976). Estimating the number of unseen species:
How many words did Shakespeare know? Biometrika, 63, 435-447.

Engen, S. (1978). Stochastic abundance models. London: Chapman and Hall.

Fisher, R. A., Corbet, A. S. & Williams, C. B. (1943). The relation between
the number of species and the number of individuals in a random sample from
an animal population. Journal of Animal Ecology, 12, 42-58.

Good, I. J. (1953). The population frequencies of species and the
estimation of population parameters. Biometrika, 40, 237-264.

Good, I. J. & Toulmin, G. H. (1956). The number of new species, and the
increase in population coverage, when a sample is increased. Biometrika,
43, 45-63.

Kendall, M. G. (1960). The bibliography of operational research.
Operational Research Quarterly, 11, 31-36.

Yoshikane, F., Kageura, K. & Tsuji, K. (2003a). A method for the
comparative analysis of concentration of author productivity, giving
consideration to the effect of sample size dependency of statistical
measures. Journal of the American Society for Information Science and
Technology, 54, 521-528.

Yoshikane, F., Kageura, K. & Tsuji, K. (2003b). The sample size dependency
of statistical measures and synchronic potentiality in informetrics. Some
comments on some comments by Professor Burrell. Journal of the American
Society for Information Science and Technology. (Published online 25 June,
2003.)"

I am not sure that any of these could be described as "one of the most
important articles ever written", from whatever perspective, but at least
they show that the problem has been considered and some results derived -
and applied.
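
For readers who want to try the idea on their own counts, the following is
a minimal Python sketch of the Good & Toulmin (1956) estimate in its usual
textbook form; it is not a reproduction of the derivations in Brookes
(1975) or Burrell (1988). If n_k is the number of journals (or other
sources) observed exactly k times so far, and the observation period is
extended by a fraction t of its current length, the expected number of
previously unseen sources is approximately
t*n_1 - t^2*n_2 + t^3*n_3 - ...
The function names and the sample counts are invented for illustration,
and the series should only be trusted for t no greater than 1 (that is, at
most doubling the period).

```python
from collections import Counter


def frequency_of_frequencies(counts):
    """Return a mapping k -> n_k: how many sources were seen exactly k times."""
    return Counter(c for c in counts if c > 0)


def good_toulmin_new_sources(counts, t=1.0):
    """Estimate how many unseen sources would appear if the observation
    period were extended by a fraction t (0 < t <= 1) of its current length.

    Uses the alternating series sum_k (-1)^(k+1) * t^k * n_k.
    """
    if not 0 < t <= 1:
        raise ValueError("the alternating series is only reliable for 0 < t <= 1")
    n = frequency_of_frequencies(counts)
    return sum((-1) ** (k + 1) * t ** k * n_k for k, n_k in n.items())


# Hypothetical data: number of relevant articles found in each of ten journals.
observed = [1, 1, 1, 1, 1, 2, 2, 3, 5, 8]

# Expected number of new journals if the observation period were doubled.
print(good_toulmin_new_sources(observed, t=1.0))
```

With t = 1 the series reduces to n_1 - n_2 + n_3 - ..., so the invented
data above (n_1 = 5, n_2 = 2, n_3 = n_5 = n_8 = 1) give 5 - 2 + 1 + 1 - 1 = 4
new journals expected over a second period of equal length. This only
estimates the part of the zero class that surfaces in the extended period,
not the whole unseen population.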

Sorry for this diversion from the main theme!

Quentin

***********************************************
Dr Quentin L Burrell
Isle of Man International Business School
The Nunnery
Old Castletown Road
Douglas
Isle of Man IM2 1QB
via United Kingdom

q.burrell at ibs.ac.im

www.ibs.ac.im