# White HD "Author cocitation analysis and ...

Loet Leydesdorff loet at LEYDESDORFF.NET
Tue Dec 23 02:27:29 EST 2003

Dear Steve,

Thank you for the interesting contribution. Let me make a few remarks:

1. Why did you reduce the matrices studied to binary ones? ("The (i,j)th
element of O(p,ra) is unity if paper i cites reference author j one or
more times, zero otherwise." at
http://samorris.ceat.okstate.edu/web/rxy/default.htm .) Both r and the
cosine are well defined for frequency distributions.

The cosine between two vectors x(i) and y(i) is defined as:

cosine(x,y) = Sigma(i) x(i)y(i) /
              sqrt(Sigma(i) x(i)^2 * Sigma(i) y(i)^2)

For those of you who read this in html:

<http://users.fmg.uva.nl/lleydesdorff/stemcell/index_files/image021.gif>

In the case of the binary matrix this formula degenerates to the simpler
format that you used:

cos=n(i,j)/sqrt[n(i)*n(j)]

SPSS calls this simpler format the "Ochiai". Salton & McGill (1983)
provided the full formula in their "Introduction to Modern Information
Retrieval" (Auckland, etc.: McGraw-Hill).
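
To make the difference concrete, here is a minimal Python sketch. The
citation counts are made up for illustration and the function names are
my own assumptions, not anything taken from Steve's page or from SPSS.
It only shows that the full cosine and the binary shortcut can differ
on the same data, while the shortcut agrees with the cosine of the
binarized vectors.

```python
import math

def cosine(x, y):
    """Salton's cosine between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def ochiai(x, y):
    """Binary shortcut n(i,j) / sqrt(n(i) * n(j)) on presence/absence."""
    n_i = sum(1 for a in x if a > 0)
    n_j = sum(1 for b in y if b > 0)
    n_ij = sum(1 for a, b in zip(x, y) if a > 0 and b > 0)
    return n_ij / math.sqrt(n_i * n_j)

# Hypothetical data: one entry per paper, giving how often that paper
# cites reference author i (x) and reference author j (y).
x = [3, 0, 1, 0, 2]
y = [1, 0, 4, 1, 0]

# Binarized versions: 1 if the paper cites the author at all, else 0.
x_bin = [1 if a > 0 else 0 for a in x]
y_bin = [1 if b > 0 else 0 for b in y]

full = cosine(x, y)             # uses the citation frequencies
binary = cosine(x_bin, y_bin)   # uses presence/absence only
shortcut = ochiai(x, y)         # same value as `binary`

print(f"full cosine:   {full:.4f}")
print(f"binary cosine: {binary:.4f}")
print(f"Ochiai:        {shortcut:.4f}")
```

With these invented counts the frequency-based cosine is lower than the
binary one, because the two authors are cited heavily by different
papers; other data could of course move in the opposite direction.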

There seems to be no reason to throw away part of the information that
is available in your datasets. I would be curious to see what your
curves would look like using the full data. I expect some effects.

purposes, one may wish to use either measure as White (2003) posits.
However, the fundamental points remain the same, don't they? One could
also have a zero variance in an ACA matrix, couldn't one? The problem
with the zeros signalled by Ahlgren et al. (2003) also remains in this
case, doesn't it?
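
A small Python sketch of these two issues, using toy binary vectors
that I made up to match the example in Steve's message quoted below
(two authors each cited by half the papers and cocited by a quarter).
The helper names are assumptions for illustration only. It checks that
padding the collection with papers that cite neither author changes
Pearson's r but not the cosine, and that r is undefined for an author
cited by every paper.

```python
import math

def cosine(x, y):
    """Salton's cosine between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def pearson_r(x, y):
    """Pearson's r; returns None when either vector has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

def r_from_counts(N, n_i, n_j, n_ij):
    """The count-based rxy formula from the quoted message (binary data)."""
    num = N * n_ij - n_i * n_j
    den = math.sqrt((N * n_i - n_i ** 2) * (N * n_j - n_j ** 2))
    return num / den

# Four papers: each author cited by two of them, both cited by one.
x = [1, 1, 0, 0]
y = [1, 0, 1, 0]
r_before = pearson_r(x, y)
cos_before = cosine(x, y)

# Add four papers that cite neither author (joint zeros).
x_pad = x + [0, 0, 0, 0]
y_pad = y + [0, 0, 0, 0]
r_padded = pearson_r(x_pad, y_pad)
cos_padded = cosine(x_pad, y_pad)

# An author cited by every paper has zero variance.
everyone = [1, 1, 1, 1]
r_undefined = pearson_r(everyone, y)

print(r_before, cos_before)
print(r_padded, cos_padded)
print(r_undefined, cosine(everyone, y))
```

In this toy case r moves from 0 to about 1/3 when the zeros are added,
while the cosine stays at 1/2; the zero-variance author yields no r at
all but still has a cosine with the others.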

3. In addition to the technical differences, there may be differences
stemming from the research design that make the researcher decide to use
one or the other measure. For example, in a factor analytic design one
uses Pearson's r. For mapping purposes one may also consider the
Euclidean distance, but this is expected to provide very different
results. The theoretical purposes of the research have first to be
specified, in my opinion.

4. My interest in this issue is driven by my interest in the evolution
of communication systems. One can expect communication systems to
develop in different phases such as segmentation, stratification, and
differentiation. In a segmented communication system only mutual
relations would count. Euclidean distances may be the right measure.

In a fully differentiated one, one would expect eigenvectors to be
spanned orthogonally at the network level. Here factor analysis provides
us with insights into the structural differentiation. In the in-between
stage a stratified communication system is expected to be hierarchically
organized. The grouping is then reduced to a ranking. For this case, the
cosine seems a good mapping tool since it organizes the "star" of the
network in the center of the map (using a visualization tool). Pearson's
r in this case has the disadvantages mentioned previously during this
discussion.

The Jaccard index seems to operate somewhere between the Euclidean
distance and the cosine. It focusses on segments, but the interpretation
is closer to the cosine than to the Euclidean distance measure. Thus, I
am not sure that one should use this measure in an evolutionary
analysis.
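
For binary data the three measures can be written in terms of the same
counts n(i), n(j) and n(i,j), which makes them easy to compare side by
side. The following Python sketch uses invented counts and a helper
name of my own; it does not settle where the Jaccard index belongs, it
only lets one see how the measures respond to a "star" author.

```python
import math

def binary_measures(n_i, n_j, n_ij):
    """Return (jaccard, cosine, euclidean) from binary occurrence counts.

    n_i, n_j: number of papers citing author i and author j;
    n_ij: number of papers citing both.
    """
    jaccard = n_ij / (n_i + n_j - n_ij)
    cos = n_ij / math.sqrt(n_i * n_j)
    # For 0/1 vectors the squared distance is the number of mismatches.
    euclid = math.sqrt(n_i + n_j - 2 * n_ij)
    return jaccard, cos, euclid

# Hypothetical counts: a heavily cited "star" author versus a minor
# author whose citing papers all cite the star as well.
star_vs_minor = binary_measures(n_i=80, n_j=5, n_ij=5)

# Two minor authors who are always cited together.
minor_pair = binary_measures(n_i=5, n_j=5, n_ij=5)

print("star vs minor:", star_vs_minor)
print("minor pair:   ", minor_pair)
```

In the first case the Jaccard index (0.0625) is well below the cosine
(0.25), and the Euclidean distance is large; in the second case all
three agree that the authors are identical.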

I mentioned the forthcoming paper of Caroline Wagner and me about
coauthorship relations (http://www.leydesdorff.net/sciencenets ) in
which we showed how the cosine-based analysis and mapping versus the
Pearson-correlation based factor analysis enabled us to explore
different aspects of the same matrix. These different aspects can be
provided with different interpretations: the hierarchy in the network
and the competitive relations among leading countries, respectively. But
I still have to develop the fundamental argument more systematically.

With kind regards,

Loet
_____

Loet Leydesdorff
Amsterdam School of Communications Research (ASCoR)
Kloveniersburgwal 48, 1012 CX Amsterdam
Tel.: +31-20- 525 6598; fax: +31-20- 525 3681
loet at leydesdorff.net ; http://www.leydesdorff.net/

The Challenge of Scientometrics
<http://www.upublish.com/books/leydesdorff-sci.htm> ; The
Self-Organization of the Knowledge-Based Society
<http://www.upublish.com/books/leydesdorff.htm>

> -----Original Message-----
> From: ASIS&T Special Interest Group on Metrics
> [mailto:SIGMETRICS at listserv.utk.edu] On Behalf Of Steven Morris
> Sent: Tuesday, December 23, 2003 3:26 AM
> To: SIGMETRICS at listserv.utk.edu
> Subject: Re: [SIGMETRICS] White HD "Author cocitation analysis and ...
>
>
> Dear colleagues,
>
> Regarding rxy vs. cosine similarity:
>
> Web of Science, where a paper to reference author citation
> matrix can be extracted, the calculation of cosine similarity
> and rxy, the correlation coefficient, are both
> straightforward. Similarity is based on the number of times a
> pair of authors are cited together. N is the number of papers
> in the collection, n(i) and n(j) are the numbers of citations
> received by ref authors i and j, and n(i,j) is the number of
> papers citing both ref author i and ref author j. The
> correlation coefficient is calculated from
> rxy=[N*n(i,j)-n(i)*n(j)]/sqrt[(N*n(i)-n(i)^2)*(N*n(j)-n(j)^2)]
> while the cosine similarity is calculated using
> s=n(i,j)/sqrt[n(i)*n(j)]. If N is large compared to the
> product of the number of cites received by a pair of authors,
> then the rxy and cosine formulas give nearly equal results.  See
> http://samorris.ceat.okstate.edu/web/rxy/default.htm
> for crossplots of cosine similarity vs. rxy for reference
> authors from several collections of papers.
>
> For collections of papers without dominant reference
> authors there is very little difference between cosine and
> rxy.  For collections with dominant reference authors that
> are cited by a large fraction of the total number of papers,
> rxy can be much less than cosine similarity.
>
> Correlation coefficient is problematic in this case because
> it is possible for pairs of authors with large co-citation
> counts to have zero rxy.  For example, two authors, both
> cited by half the papers in the collection, but cocited by
> 1/4 of the papers will have a correlation coefficient of zero
> but a cosine similarity of 1/2. Also, the correlation
> coefficient is not defined for any author that is cited by
> all papers in the collection, since that author has zero
> variance. Recall that rxy is cov(x,y)/sqrt[var(x)*var(y)], so
> zero variance drives the denominator to zero in the rxy
> equation, leaving rxy undefined.
>
> For this reason it's probably better to use cosine similarity
> than rxy for ACA analysis based on a paper to ref author
> matrix.  Converting similarities to distances for clustering
> is less problematic as well.
>
> The situation is different for ACA based on a co-citation
> count matrix. In this case the similarity between two authors
> is not based on how often they are cited together, but
> whether the two authors are  co-cited in the same proportions
> among the other authors in the collection.  In this case it
> would seem that rxy would be the appropriate measure of
> similarity to use.
>
> S. Morris
>
>
>
> Loet Leydesdorff wrote:
> >  > -----Original Message-----
> >  > From: ASIS&T Special Interest Group on Metrics
> >  > [mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Eugene Garfield
> >  > Sent: Monday, December 01, 2003 9:57 PM
> >  > To: SIGMETRICS at LISTSERV.UTK.EDU
> >  > Subject: [SIGMETRICS] White HD "Author cocitation analysis
> >  > and Pearson's r" Journal of the American Society for
> >  > Information Science and Technology 54(13):1250-1259 November 2003,
> >  >
> >  > Howard D. White : Howard.Dalby.White at drexel.edu
> >  >
> >  > TITLE    Author cocitation analysis and Pearson's r
> >  >
> >  > AUTHOR   White HD
> >  >
> >  > JOURNAL  JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION
> >  >          SCIENCE AND TECHNOLOGY 54 (13): 1250-1259 NOV 2003
> >
> > Dear Howard and colleagues,
> >
> > practical purposes Pearson's r will do a job similar to Salton's
> > cosine. Nevertheless, the argument of Ahlgren et al. (2002) seems
> > convincing to me. Scientometric distributions are often highly
> > skewed and the mean can easily be distorted by the zeros. The
> > cosine elegantly solves this problem.
> >
> > A disadvantage of the cosine (in comparison to the r) may be that it
> > does not become negative in order to indicate dissimilarity. This is
> > particularly important for the factor analysis. I have
> > input-ing the cosine matrix into the factor analysis (SPSS allows for
> > importing a matrix in this analysis), but that seems a bit tricky.
> >
> > Caroline Wagner and I did a study on coauthorship relations entitled
> > "Mapping Global Science using International Coauthorships: A
> > comparison of 1990 and 2000" (Intern. J. of Technology and
> > Globalization, forthcoming) in which we used the same matrix for
> > mapping using the cosine (and then Pajek for the visualization) and
> > for the factor analysis using Pearson's r. The results are provided
> > as factor plots in the preprint version of the paper at
> > http://www.leydesdorff.net/sciencenets/mapping.pdf .
> >
> > While the cosine maps exhibit the hierarchy by placing the central
> > cluster in the center (including the U.S.A. and some Western-European
> > countries), the factor analysis reveals the main structural axes of
> > the system as competitive relations between the U.S.A., U.K., and
> > continental Europe (Germany + Russia). The French system can be
> > considered as a fourth axis. These eigenvectors function as
> > competitors for collaboration with authors from other (smaller or
> > more peripheral) countries.
> >
> > Thus, the two measures enable us to show something different:
> > Salton's cosine exhibits the hierarchy and one might say that the
> > factor analysis on the basis of Pearson's r enables us to show the
> > heterarchy among competing axes in the system.
> >
> > With kind regards,
> >
> > Loet
> >
> >
> > ------------------------------------------------------------------------
> > Loet Leydesdorff
> > Amsterdam School of Communications Research (ASCoR)
> > Kloveniersburgwal 48, 1012 CX Amsterdam
> > Tel.: +31-20- 525 6598; fax: +31-20- 525 3681
> > loet at leydesdorff.net <mailto:loet at leydesdorff.net>;
> > http://www.leydesdorff.net/
> >
> > The Challenge of Scientometrics
> > <http://www.upublish.com/books/leydesdorff-sci.htm> ; The
> > Self-Organization of the Knowledge-Based Society
> > <http://www.upublish.com/books/leydesdorff.htm>
> >
> >
> >
> >  >
> >  >
> >  >  Document type: Article  Language: English  Cited References: 20
> >  >  Times Cited: 0
> >  >
> >  > Abstract:
> >  > In their article "Requirements for a cocitation similarity
> >  > measure, with special reference to Pearson's correlation
> >  > coefficient," Ahlgren, Jarneving, and Rousseau fault
> >  > traditional author cocitation analysis (ACA) for using
> >  > Pearson's r as a measure of similarity between authors
> >  > because it fails two tests of stability of measurement. The
> >  > instabilities arise when rs are recalculated after a first
> >  > coherent group of authors has been augmented by a second
> >  > coherent group with whom the first has little or no
> >  > cocitation. However, AJ&R neither cluster nor map their data
> >  > to demonstrate how fluctuations in rs will mislead the
> >  > analyst, and the problem they pose is remote from both theory
> >  > and practice in traditional ACA. By entering their own rs
> >  > into multidimensional scaling and clustering routines, I show
> >  > that, despite rs fluctuations, clusters based on it are much
> >  > the same for the combined groups as for the separate groups.
> >  > The combined groups when mapped appear as polarized clumps of
> >  > points in two-dimensional space, confirming that differences
> >  > between the groups have become much more important than
> >  > differences within the groups-an accurate portrayal of what
> >  > has happened to the data. Moreover, r produces clusters and
> >  > maps very like those based on other coefficients that AJ&R
> >  > mention as possible replacements, such as a cosine similarity
> >  > measure or a chi square dissimilarity measure. Thus, r
> >  > performs well enough for the purposes of ACA. Accordingly, I
> >  > argue that qualitative information revealing why authors are
> >  > cocited is more important than the cautions proposed in the
> >  > AJ&R critique. I include notes on topics such as handling the
> >  > diagonal in author cocitation matrices, lognormalizing data,
> >  > and testing r for significance.
> >  >
> >  > KeyWords Plus:
> >  > INTELLECTUAL STRUCTURE, SCIENCE
> >  >
> >  > White HD, Drexel Univ, Coll Informat Sci & Technol, 3152
> >  > Chestnut St, Philadelphia, PA 19104 USA Drexel Univ, Coll
> >  > Informat Sci & Technol, Philadelphia, PA 19104 USA
> >  >
> >  > Publisher:
> >  > JOHN WILEY & SONS INC, 111 RIVER ST, HOBOKEN, NJ 07030 USA
> >  >
> >  > IDS Number:
> >  > 730VQ
> >  >
> >  >
> >  >  Cited Author     Cited Work              Volume  Page  Year
> >  >
> >  >  AHLGREN P        J AM SOC INF SCI TEC    54      550   2003
> >  >  BAYER AE         J AM SOC INFORM SCI     41      444   1990
> >  >  BORGATTI SP      UCINET WINDOWS SOFTW                  2002
> >  >  BORGATTI SP      WORKSH SUNB 20 INT S                  2000
> >  >  DAVISON ML       MULTIDIMENSIONAL SCA                  1983
> >  >  EOM SB           J AM SOC INFORM SCI     47      941   1996
> >  >  EVERITT B        CLUSTER ANAL                          1974
> >  >  GRIFFITH BC      KEY PAPERS INFORMATI            R6    1980
> >  >  HOPKINS FL       SCIENTOMETRICS          6       33    1984
> >  >  HUBERT L         BRIT J MATH STAT PSY    29      190   1976
> >  >  LEYDESDORFF L    INFORMERICS 87 88               105   1988
> >  >  MCCAIN KW        J AM SOC INFORM SCI     41      433   1990
> >  >  MCCAIN KW        J AM SOC INFORM SCI     37      111   1986
> >  >  MCCAIN KW        J AM SOC INFORM SCI     35      351   1984
> >  >  MULLINS NC       THEORIES THEORY GROU                  1973
> >  >  WHITE HD         BIBLIOMETRICS SCHOLA            84    1990
> >  >  WHITE HD         J AM SOC INF SCI TEC    54      423   2003
> >  >  WHITE HD         J AM SOC INFORM SCI     49      327   1998
> >  >  WHITE HD         J AM SOC INFORM SCI     41      430   1990
> >  >  WHITE HD         J AM SOC INFORM SCI     32      163   1981
> >  >
> >  >
> >  > When responding, please attach my original message
> >  > ______________________________________________________________
> >  > _________
> >  > Eugene Garfield, PhD.  email: garfield at codex.cis.upenn.edu
> >  > Tel: 215-243-2205 Fax 215-387-1266
> >  > President, The Scientist LLC. www.the-scientist.com
> >  > Chairman Emeritus, ISI www.isinet.com
> >  > Past President, American Society for Information Science and
> >  > Technology
> >  > (ASIS&T)  www.asis.org
> >  > ______________________________________________________________
> >  > _________
> >  >
> >  >
> >  >
> >  > ISSN:
> >  > 1532-2882
> >  >
> >
>
>
> --
> ---------------------------------------------------------------
> Steven A. Morris                            samorri at okstate.edu
> Electrical and Computer Engineering        office: 405-744-1662
> 202 Engineering So.
> Oklahoma State University
> Stillwater, Oklahoma 74078
> http://samorris.ceat.okstate.edu
>
