White HD "Author cocitation analysis and ...

Steven Morris samorri at OKSTATE.EDU
Mon Dec 22 21:26:02 EST 2003


Dear colleagues,

Regarding rxy vs. cosine similarity:

When working with a collection of papers downloaded from the Web of
Science, from which a paper to reference author citation matrix can be
extracted, the cosine similarity and rxy, the correlation coefficient,
are both straightforward to calculate. Similarity is based on the number
of times a pair of authors is cited together. N is the number of papers
in the collection, n(i) and n(j) are the numbers of citations received
by ref authors i and j (that is, the numbers of papers citing each), and
n(i,j) is the number of papers citing both ref author i and ref author j.
The correlation coefficient is calculated from

    rxy = [N*n(i,j) - n(i)*n(j)] / sqrt[(N*n(i) - n(i)^2)*(N*n(j) - n(j)^2)]

while the cosine similarity is calculated using

    s = n(i,j) / sqrt[n(i)*n(j)]

If N is large compared to the product of the number of cites received by
a pair of authors, then the rxy and cosine formulas give nearly equal
results.  See
http://samorris.ceat.okstate.edu/web/rxy/default.htm
for crossplots of cosine similarity vs. rxy for reference authors from
several collections of papers.

For collections of papers without dominant reference authors there is
very little difference between cosine and rxy.  For collections with
dominant reference authors that are cited by a large fraction of the
total number of papers, rxy can be much less than cosine similarity.
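
To make the two regimes concrete, here is a minimal Python sketch of
the rxy and cosine formulas given earlier in this message. The function
names and all of the counts are hypothetical, chosen only for
illustration; they are not taken from the collections in the crossplots.

```python
import math

def rxy(N, n_i, n_j, n_ij):
    """Correlation coefficient from counts, using the formula in the message."""
    denom_sq = (N * n_i - n_i ** 2) * (N * n_j - n_j ** 2)
    if denom_sq == 0:
        return None  # undefined when either author has zero variance
    return (N * n_ij - n_i * n_j) / math.sqrt(denom_sq)

def cosine(n_i, n_j, n_ij):
    """Cosine similarity from counts, using the formula in the message."""
    return n_ij / math.sqrt(n_i * n_j)

# Hypothetical case 1: no dominant authors.
# N (10000) is large compared to n_i * n_j (600).
r_sparse = rxy(10000, 20, 30, 10)
s_sparse = cosine(20, 30, 10)

# Hypothetical case 2: both authors are cited by half or more of the papers.
r_dominant = rxy(100, 50, 60, 40)
s_dominant = cosine(50, 60, 40)

print(f"sparse:   rxy={r_sparse:.3f}  cosine={s_sparse:.3f}")
print(f"dominant: rxy={r_dominant:.3f}  cosine={s_dominant:.3f}")
```

With these made-up counts the two measures nearly coincide in the first
case, while in the second case rxy comes out well below the cosine.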

The correlation coefficient is problematic in this case because it is
possible for pairs of authors with large co-citation counts to have zero
rxy.  For example, two authors, both cited by half the papers in the
collection but cocited by 1/4 of the papers, will have a correlation
coefficient of zero but a cosine similarity of 1/2. Also, the
correlation coefficient is not defined for any author that is cited by
all papers in the collection, since that author has zero variance.
Recall that rxy is cov(x,y)/sqrt[var(x)*var(y)], so zero variance drives
the denominator to zero in the rxy equation, leaving rxy undefined.
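
Both problem cases can be reproduced on a toy paper to ref author
matrix. The sketch below is only an illustration: the four-paper matrix,
the author labels A, B, C, and the helper names are all hypothetical.
Authors A and B are each cited by half the papers and cocited by one
quarter of them; author C is cited by every paper.

```python
import math

# Hypothetical paper-to-ref-author matrix: rows are papers, columns are
# ref authors A, B, C; a 1 means the paper cites that author.
papers = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
]

def counts(matrix, i, j):
    """Return N, n(i), n(j), n(i,j) for columns i and j."""
    N = len(matrix)
    n_i = sum(row[i] for row in matrix)
    n_j = sum(row[j] for row in matrix)
    n_ij = sum(row[i] * row[j] for row in matrix)
    return N, n_i, n_j, n_ij

def rxy(N, n_i, n_j, n_ij):
    denom_sq = (N * n_i - n_i ** 2) * (N * n_j - n_j ** 2)
    if denom_sq == 0:
        return None  # zero variance: rxy is undefined
    return (N * n_ij - n_i * n_j) / math.sqrt(denom_sq)

def cosine(n_i, n_j, n_ij):
    return n_ij / math.sqrt(n_i * n_j)

# A vs. B: cited by half the papers each, cocited by one quarter.
N, n_a, n_b, n_ab = counts(papers, 0, 1)
r_ab = rxy(N, n_a, n_b, n_ab)
s_ab = cosine(n_a, n_b, n_ab)

# A vs. C: C is cited by every paper, so C has zero variance.
N, n_a, n_c, n_ac = counts(papers, 0, 2)
r_ac = rxy(N, n_a, n_c, n_ac)
s_ac = cosine(n_a, n_c, n_ac)

print("A,B:", r_ab, s_ab)
print("A,C:", r_ac, s_ac)
```

Here rxy for A and B is zero while their cosine similarity is 1/2, and
rxy for A and C is reported as None (undefined) although the cosine
still gives a usable value.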

For this reason it's probably better to use cosine similarity than rxy
for ACA based on a paper to ref author matrix.  Converting
similarities to distances for clustering is less problematic as well.
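
The message does not say which similarity-to-distance conversion is
meant. As a sketch under one common convention (an assumption, not the
author's stated method), d = 1 - s works directly for the cosine, which
lies between 0 and 1 for count data, while rxy needs both a rescaling
choice for its negative values and a policy for undefined values. The
function names below are hypothetical.

```python
def cosine_to_distance(s):
    """Assumed convention: cosine in [0, 1] maps to distance in [0, 1]."""
    return 1.0 - s

def rxy_to_distance(r):
    """Assumed convention: rxy in [-1, 1] is rescaled to a distance in [0, 1].

    An undefined rxy (None) has no distance, so the caller has to decide
    what to do with that author pair before clustering.
    """
    if r is None:
        raise ValueError("rxy is undefined for this pair; no distance available")
    return (1.0 - r) / 2.0

d_cos = cosine_to_distance(0.5)   # pair with cosine similarity 1/2
d_rxy = rxy_to_distance(0.0)      # same pair if its rxy is zero
```

Other conversions are possible; the point is only that the cosine case
needs fewer decisions.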

The situation is different for ACA based on a co-citation count matrix.
In this case the similarity between two authors is based not on how
often they are cited together, but on whether the two authors are co-cited
in the same proportions among the other authors in the collection.  In
this case it would seem that rxy would be the appropriate measure of
similarity to use.
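
A minimal sketch of this profile-based similarity follows. The
co-citation counts are made up, the function names are hypothetical, and
leaving out the cells for authors i and j themselves is just one assumed
way of handling the diagonal.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists; None if a variance is zero."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

def profile_similarity(C, i, j):
    """Correlate the co-citation profiles of authors i and j over the other authors."""
    others = [k for k in range(len(C)) if k not in (i, j)]
    return pearson([C[i][k] for k in others], [C[j][k] for k in others])

# Hypothetical symmetric co-citation count matrix for five authors.
C = [
    [0,  2, 10, 20, 30],
    [2,  0,  1,  2,  3],
    [10, 1,  0,  5,  4],
    [20, 2,  5,  0,  6],
    [30, 3,  4,  6,  0],
]

r_01 = profile_similarity(C, 0, 1)
print("authors 0 and 1:", r_01)
```

In this toy matrix authors 0 and 1 are rarely cited together (count 2),
yet they are co-cited with the remaining authors in the same
proportions, so the profile correlation between them is high.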

S. Morris



Loet Leydesdorff wrote:
>  > -----Original Message-----
>  > From: ASIS&T Special Interest Group on Metrics
>  > [mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Eugene Garfield
>  > Sent: Monday, December 01, 2003 9:57 PM
>  > To: SIGMETRICS at LISTSERV.UTK.EDU
>  > Subject: [SIGMETRICS] White HD "Author cocitation analysis
>  > and Pearson's r" Journal of the American Society for
>  > Information Science and Technology 54(13):1250-1259 November 2003,
>  >
>  >
>  > Howard D. White : Howard.Dalby.White at drexel.edu
>  >
>  > TITLE    Author cocitation analysis and Pearson's r
>  >
>  > AUTHOR   White HD
>  >
>  > JOURNAL  JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION
>  >          SCIENCE AND TECHNOLOGY 54 (13): 1250-1259 NOV 2003
>
> Dear Howard and colleagues,
>
> I read this article with interest and I agree that for most practical
> purposes Pearson's r will do a job similar to Salton's cosine.
> Nevertheless, the argument of Ahlgren et al. (2002) seems convincing to
> me. Scientometric distributions are often highly skewed and the mean can
> easily be distorted by the zeros. The cosine elegantly solves this problem.
>
> A disadvantage of the cosine (in comparison to the r) may be that it
> does not become negative in order to indicate dissimilarity. This is
> particularly important for the factor analysis. I have thought about
> inputting the cosine matrix into the factor analysis (SPSS allows for
> importing a matrix in this analysis), but that seems a bit tricky.
>
> Caroline Wagner and I did a study on coauthorship relations entitled
> "Mapping Global Science using International Coauthorships: A comparison
> of 1990 and 2000" (Intern. J. of Technology and Globalization,
> forthcoming) in which we used the same matrix for mapping using the
> cosine (and then Pajek for the visualization) and for the factor
> analysis using Pearson's r. The results are provided as factor plots in
> the preprint version of the paper at
> http://www.leydesdorff.net/sciencenets/mapping.pdf .
>
> While the cosine maps exhibit the hierarchy by placing the central
> cluster in the center (including the U.S.A. and some Western-European
> countries), the factor analysis reveals the main structural axes of the
> system as competitive relations between the U.S.A., U.K., and
> continental Europe (Germany + Russia). The French system can be
> considered as a fourth axis. These eigenvectors function as competitors
> for collaboration with authors from other (smaller or more peripheral)
> countries.
>
> Thus, the two measures enable us to show different things: Salton's
> cosine exhibits the hierarchy and one might say that the factor analysis
> on the basis of Pearson's r enables us to show the heterarchy among
> competing axes in the system.
>
> With kind regards,
>
> Loet
>
> ------------------------------------------------------------------------
> Loet Leydesdorff
> Amsterdam School of Communications Research (ASCoR)
> Kloveniersburgwal 48, 1012 CX Amsterdam
> Tel.: +31-20- 525 6598; fax: +31-20- 525 3681
> loet at leydesdorff.net ; http://www.leydesdorff.net/
>
> The Challenge of Scientometrics
> <http://www.upublish.com/books/leydesdorff-sci.htm> ; The
> Self-Organization of the Knowledge-Based Society
> <http://www.upublish.com/books/leydesdorff.htm>
>
>
>
>  >
>  >
>  >  Document type: Article  Language: English  Cited References:
>  > 20  Times Cited: 0
>  >
>  > Abstract:
>  > In their article "Requirements for a cocitation similarity
>  > measure, with special reference to Pearson's correlation
>  > coefficient," Ahlgren, Jarneving, and Rousseau fault
>  > traditional author cocitation analysis (ACA) for using
>  > Pearson's r as a measure of similarity between authors
>  > because it fails two tests of stability of measurement. The
>  > instabilities arise when rs are recalculated after a first
>  > coherent group of authors has been augmented by a second
>  > coherent group with whom the first has little or no
>  > cocitation. However, AJ&R neither cluster nor map their data
>  > to demonstrate how fluctuations in rs will mislead the
>  > analyst, and the problem they pose is remote from both theory
>  > and practice in traditional ACA. By entering their own rs
>  > into multidimensional scaling and clustering routines, I show
>  > that, despite r's fluctuations, clusters based on it are much
>  > the same for the combined groups as for the separate groups.
>  > The combined groups when mapped appear as polarized clumps of
>  > points in two-dimensional space, confirming that differences
>  > between the groups have become much more important than
>  > differences within the groups, an accurate portrayal of what
>  > has happened to the data. Moreover, r produces clusters and
>  > maps very like those based on other coefficients that AJ&R
>  > mention as possible replacements, such as a cosine similarity
>  > measure or a chi square dissimilarity measure. Thus, r
>  > performs well enough for the purposes of ACA. Accordingly, I
>  > argue that qualitative information revealing why authors are
>  > cocited is more important than the cautions proposed in the
>  > AJ&R critique. I include notes on topics such as handling the
>  > diagonal in author cocitation matrices, lognormalizing data,
>  > and testing r for significance.
>  >
>  > KeyWords Plus:
>  > INTELLECTUAL STRUCTURE, SCIENCE
>  >
>  > Addresses:
>  > White HD, Drexel Univ, Coll Informat Sci & Technol, 3152
>  > Chestnut St, Philadelphia, PA 19104 USA Drexel Univ, Coll
>  > Informat Sci & Technol, Philadelphia, PA 19104 USA
>  >
>  > Publisher:
>  > JOHN WILEY & SONS INC, 111 RIVER ST, HOBOKEN, NJ 07030 USA
>  >
>  > IDS Number:
>  > 730VQ
>  >
>  >
>  >  Cited Author    Cited Work            Volume  Page  Year
>  >
>  >  AHLGREN P       J AM SOC INF SCI TEC      54   550  2003
>  >  BAYER AE        J AM SOC INFORM SCI       41   444  1990
>  >  BORGATTI SP     UCINET WINDOWS SOFTW                2002
>  >  BORGATTI SP     WORKSH SUNB 20 INT S                2000
>  >  DAVISON ML      MULTIDIMENSIONAL SCA                1983
>  >  EOM SB          J AM SOC INFORM SCI       47   941  1996
>  >  EVERITT B       CLUSTER ANAL                        1974
>  >  GRIFFITH BC     KEY PAPERS INFORMATI            R6  1980
>  >  HOPKINS FL      SCIENTOMETRICS             6    33  1984
>  >  HUBERT L        BRIT J MATH STAT PSY      29   190  1976
>  >  LEYDESDORFF L   INFORMERICS 87 88              105  1988
>  >  MCCAIN KW       J AM SOC INFORM SCI       41   433  1990
>  >  MCCAIN KW       J AM SOC INFORM SCI       37   111  1986
>  >  MCCAIN KW       J AM SOC INFORM SCI       35   351  1984
>  >  MULLINS NC      THEORIES THEORY GROU                1973
>  >  WHITE HD        BIBLIOMETRICS SCHOLA            84  1990
>  >  WHITE HD        J AM SOC INF SCI TEC      54   423  2003
>  >  WHITE HD        J AM SOC INFORM SCI       49   327  1998
>  >  WHITE HD        J AM SOC INFORM SCI       41   430  1990
>  >  WHITE HD        J AM SOC INFORM SCI       32   163  1981
>  >
>  >
>  > When responding, please attach my original message
>  > ______________________________________________________________
>  > Eugene Garfield, PhD.  email: garfield at codex.cis.upenn.edu
>  > home page: www.eugenegarfield.org
>  > Tel: 215-243-2205 Fax 215-387-1266
>  > President, The Scientist LLC. www.the-scientist.com
>  > Chairman Emeritus, ISI www.isinet.com
>  > Past President, American Society for Information Science and
>  > Technology
>  > (ASIS&T)  www.asis.org
>  > ______________________________________________________________
>  >
>  >
>  >
>  > ISSN:
>  > 1532-2882
>  >
>


--
---------------------------------------------------------------
Steven A. Morris                            samorri at okstate.edu
Electrical and Computer Engineering        office: 405-744-1662
202 Engineering So.
Oklahoma State University
Stillwater, Oklahoma 74078
http://samorris.ceat.okstate.edu


