White HD "Author cocitation analysis and ...

Steven A. Morris samorri at OKSTATE.EDU
Wed Jan 7 10:57:18 EST 2004


Dear Loet,

Thank you for a fruitful discussion.  I guess I don't have much to add at
this point.
To be truthful, in practice I haven't noticed much difference in results
when using different similarity measures: rxy, "binary" cosine,
"non-binary" cosine, or overlap. This isn't surprising, since authors are
usually clustered in a "core and scatter" pattern, comprised of small,
well-defined "core" groups of closely related authors and a large "scatter"
group of authors with sparse and highly overlapping relations. So a map of a
collection of authors is like looking down on a pot of chicken and
dumplings: it's easy to spot the dumplings (the core groups), but the stew
(scatter authors) always looks different depending on how you stirred the
pot (rxy, cosine, or whatever).

The overlap similarity is just another way of defining a co-citation count.
I think it was introduced by Salton or one of his co-workers.  You can find
a good discussion of overlap similarity in: Jones, W. P. and G. W. Furnas
(1987). "Pictures of relevance: A geometrical analysis of similarity
measures." Journal of the American Society for Information Science 38(6):
420-442.  This paper also discusses many other similarity measures, such
as rxy, cosine, and Dice, and I think it gives a good discussion of the
merits of each type of measure.
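[A quick sketch of the measures under discussion, applied to a pair of made-up binary paper-to-author citation vectors; the vectors and values are purely illustrative, not data from any of the collections mentioned.]

```python
import math

x = [1, 1, 1, 0, 1, 0, 0, 1]   # papers citing author x (hypothetical)
y = [1, 1, 0, 1, 1, 0, 1, 1]   # papers citing author y (hypothetical)

def cosine(a, b):
    # Salton's cosine: dot product over the geometric mean of squared norms
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return dot / math.sqrt(sum(v * v for v in a) * sum(v * v for v in b))

def dice(a, b):
    # Dice: twice the co-occurrence count over the sum of occurrence counts
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return 2 * dot / (sum(a) + sum(b))

def overlap(a, b):
    # overlap: co-occurrence count over the smaller occurrence count
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return dot / min(sum(a), sum(b))

def pearson(a, b):
    # rxy: covariance over the product of standard deviations
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)
```

On the same pair of vectors the four measures give noticeably different values, which is the "stirring of the pot" referred to above.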

Thanks kindly,

S. Morris





On Tue, 6 Jan 2004 20:20:20 +0100, Loet Leydesdorff <loet at LEYDESDORFF.NET>
wrote:

>Dear Steven,
>
>Thank you for communicating these experimental results. They are
>interesting.
>
>It seems to me that you have convincingly shown that the two measures
>(the binary and the non-binary one) are different in the case that there
>is information available at a measurement scale higher than dichotomous
>(e.g., at the interval level). Of course, if one has only binary
>information, one can use the binary formulation of the formula, but this
>works only because the square or root of one is one, and the square or
>root of zero is zero. Thus, the cosine is defined more generally in
>terms of what you call the non-binary formulation.
>
>I don't agree with the overlap function. It seems to me most natural
>to return to the original matrix of authors cited as cases and citations
>as variables (columns). A cocitation is then the case that two cells are
>filled in the same column. One can then compute cosines between authors
>as the cases. Choose within SPSS for Analyze > Correlate > Distances and
>you find all the options, including cosines between cases. There is no
>need for the invention of a new function, in my opinion.
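[The procedure Loet describes can be sketched directly; the matrix below is hypothetical, and the computation mirrors what SPSS produces under Analyze > Correlate > Distances with cosines between cases.]

```python
import math

# rows = cited authors (cases), columns = citing papers (variables);
# counts are invented for illustration
matrix = [
    [2, 0, 1, 3],   # author A
    [1, 0, 2, 1],   # author B
    [0, 4, 0, 0],   # author C
]

def cosine(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return dot / math.sqrt(sum(v * v for v in a) * sum(v * v for v in b))

# pairwise cosines between the author rows
similarities = {
    (i, j): cosine(matrix[i], matrix[j])
    for i in range(len(matrix))
    for j in range(i + 1, len(matrix))
}
```

A cocitation corresponds to two rows being filled in the same column: authors A and B load on the same papers, so their cosine is high, while author C shares no papers with either and gets a cosine of zero.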
>
>With kind regards,
>
>
>Loet
>
>
>
>Dear Loet,
>
>Thanks very much for your interesting remarks.
>In answer to item 1 below, I have always converted the paper to
>reference authors matrix and paper to term matrix to binary matrices so
>that co-occurrences can be calculated easily by multiplying the matrices
>by their transpose.  I'd actually never thought of using the cosine
>formula that you give below.  I did try that calculation on non-binary
>paper to reference authors matrices using:
>
>       cosine(x,y) = Sigma(i) x(i)y(i) / sqrt(Sigma(i) x(i)^2 *
>Sigma(i) y(i)^2)
>
>I crossplotted the similarity values thus obtained against "binary"
>cosine similarity values.  The results can be seen at:
>http://samorris.ceat.okstate.edu/web/non_bin_cos/default.htm
>There does appear to be a lot of scatter between these two measures,
>though in most of the paper collections it doesn't appear to be biased
>off the 1:1 line.  I don't know what effect this difference would have
>on clustering of authors. I'm not sure I agree with you that using the
>binary version of the cosine similarity is "throwing away information."
>
>After all, references are cited multiple times in papers but the data we
>have available (from ISI) only shows that a reference showed up at least
>once, yet the data is still very useful.  Granted that knowing the exact
>number of times an author was cited in a paper adds more information,
>I'm still not sure that using the non-binary cosine formula above is the
>most appropriate way to exploit that extra information.  Alternate
>approaches are available, for example, using the 'overlap' measure.
>
>I have tried using an "overlap" function to compute cocitation counts
>for cosine calculations.  For a paper, the overlap of ref author i and
>ref author j is defined as min[m(i), m(j)], where m(i) and m(j) are the
>number of times author i and author j were cited in the paper,
>respectively.  This appears to be a reasonable measure of multiple
>co-citation, as it doesn't give a lot of weight to co-citations with
>authors that tend to appear many times in papers.  So "overlap cosine
>similarity" can be calculated using s(i,j) = sum[overlap(i,j)] /
>sqrt[n(i)*n(j)], where the sum is over all papers and n(i) and n(j)
>are the sums over all papers of the number of citations to authors i
>and j, respectively.  For the datasets I have, you can see crossplots of
>"overlap cosine similarity" against "binary cosine similarity" at:
>http://samorris.ceat.okstate.edu/web/overlap/default.htm .  These plots
>show that overlap similarity tends to be a little larger than binary
>similarity. This may imply that the overlap method generally tends to
>increase similarity over the binary method, but proportionally, so that
>there is no effect on distances between authors and thus no effect, bad
>or good, on clustering.
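[The overlap cosine similarity defined in the paragraph above can be sketched in a few lines; the paper-by-author citation counts here are invented for illustration.]

```python
import math

# m[p][a] = number of times paper p cites author a (hypothetical counts)
m = [
    [3, 1, 0],
    [2, 2, 1],
    [0, 1, 4],
]

def overlap_cosine(m, i, j):
    # per-paper overlap is min[m(i), m(j)]; sum it over all papers
    num = sum(min(row[i], row[j]) for row in m)
    n_i = sum(row[i] for row in m)   # total citations to author i
    n_j = sum(row[j] for row in m)   # total citations to author j
    return num / math.sqrt(n_i * n_j)
```

For authors 0 and 1 the per-paper overlaps are 1, 2, and 0, against totals of 5 and 4 citations, so s(0,1) = 3/sqrt(20).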
>
>On point 2 below, similarity between a pair of authors using a
>co-citation count matrix is based on whether those two authors are
>cocited in the same proportions among the other authors.  Correlation
>seems a natural measure for this, as it is the measure used for
>estimating linear dependence.  Also, it would seem that negative
>correlation would be applicable:
>
>Suppose there are two "camps" among a group of 10 authors and that the
>1st and 10th authors are the leaders of the two groups respectively.
>Assume
>two authors have the following co-citation counts:
>
>x = [  1     2     3     4     5     6     7     8     9    10 ]
>y = [ 10     9     8     7     6     5     4     3     2     1 ]
>
>so author x is in author 10's camp and author y is in author 1's camp.
>
>In this case rxy = -1, and (1+rxy)/2 gives a similarity of 0, while the
>cosine gives a similarity of 0.5714.
>
>So the rxy similarity shows the authors as dissimilar (logical, since
>they belong to different camps), while the cosine similarity shows that
>they are similar.  Wouldn't this type of effect be a problem with using
>the cosine similarity for co-citation count matrices?
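[The arithmetic in the two-camps example can be checked directly; this is just a verification sketch of the numbers quoted above.]

```python
import math

x = list(range(1, 11))        # co-citation counts for author x
y = list(range(10, 0, -1))    # author y, in the opposite camp

# Pearson's rxy: covariance over product of standard deviations
n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
rxy = cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))

# cosine similarity between the same two count vectors
cos = sum(a * b for a, b in zip(x, y)) / math.sqrt(
    sum(a * a for a in x) * sum(b * b for b in y))

print(rxy, (1 + rxy) / 2, round(cos, 4))   # -1.0 0.0 0.5714
```

So (1+rxy)/2 = 0 while the cosine is 220/385 = 0.5714, exactly as stated.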
>
>With correlation there is still the problem of what to do with authors
>that have zero variance or cocitation count matrices that have large
>numbers of zeros.
>
>Thanks kindly,
>
>Steven Morris
>
>
>
>
>Loet Leydesdorff wrote:
>
>> Dear Steve,
>>
>> Thank you for the interesting contribution. Let me make a few remarks:
>>
>> 1. Why did you reduce the matrices studied to binary ones? ("The
>> (i,j)th element of O(p,ra) is unity if paper i cites reference author
>> j one or more times, zero otherwise." at
>> http://samorris.ceat.okstate.edu/web/rxy/default.htm .) Both r and the
>> cosine are well defined for frequency distributions.
>>
>> The cosine between two vectors x(i) and y(i) is defined as:
>>
>>         cosine(x,y) = Sigma(i) x(i)y(i) / sqrt(Sigma(i) x(i)^2 *
>> Sigma(i) y(i)^2)
>>
>> In the case of the binary matrix this formula degenerates to the
>> simpler format that you used:
>>
>>             cos = n(i,j)/sqrt[n(i)*n(j)]
>>
>> SPSS calls this simpler format the "Ochiai". Salton & McGill (1983)
>> provided the full formula in their "Introduction to Modern Information
>> Retrieval" (Auckland, etc.: McGraw-Hill).
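[The degeneration to the simpler binary format can be verified in a few lines; the binary vectors are invented for illustration.]

```python
import math

def full_cosine(x, y):
    # the general formula, valid for any frequency distributions
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def ochiai(x, y):
    # the binary shortcut: n(i,j) / sqrt[n(i)*n(j)]
    n_ij = sum(1 for a, b in zip(x, y) if a and b)
    return n_ij / math.sqrt(sum(x) * sum(y))

x = [1, 0, 1, 1, 0, 1]
y = [1, 1, 0, 1, 0, 1]
assert math.isclose(full_cosine(x, y), ochiai(x, y))  # identical on 0/1 data
```

On 0/1 data each squared term equals the term itself, so both formulas give 3/4 here; on count data they diverge.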
>>
>> There seems no reason to throw away part of the information that is
>> available in your datasets. I would be curious to see what your curves
>> would look like using the full data. I expect some effects.
>>
>> 2. Why would your reasoning not hold for ACA? For rough-and-ready
>> purposes, one may wish to use either measure, as White (2003) posits.
>> However, the fundamental points remain the same, don't they? One could
>> also have a zero variance in an ACA matrix, could one not? The problem
>> with the zeros signalled by Ahlgren et al. (2003) also remains in this
>> case, doesn't it?
>>
>> 3. In addition to the technical differences, there may be differences
>> stemming from the research design that make the researcher decide to
>> use one or the other measure. For example, in a factor analytic design
>> one uses Pearson's r. For mapping purposes one may also consider the
>> Euclidean distance, but this is expected to provide very different
>> results. The theoretical purposes of the research have first to be
>> specified, in my opinion.
>>
>> 4. My interest in this issue is driven by my interest in the evolution
>> of communication systems. One can expect communication systems to
>> develop in different phases, such as segmentation, stratification, and
>> differentiation. In a segmented communication system only mutual
>> relations would count; Euclidean distances may be the right measure.
>>
>> In a fully differentiated one, one would expect eigenvectors to be
>> spanned orthogonally at the network level. Here factor analysis
>> provides us with insights into the structural differentiation. In the
>> in-between stage, a stratified communication system is expected to be
>> hierarchically organized. The grouping is then reduced to a ranking.
>> For this case, the cosine seems a good mapping tool, since it organizes
>> the "star" of the network in the center of the map (using a
>> visualization tool). Pearson's r in this case has the disadvantages
>> mentioned previously during this discussion.
>>
>> The Jaccard index seems to operate somewhere between the Euclidean
>> distance and the cosine. It focuses on segments, but the
>> interpretation is closer to the cosine than to the Euclidean distance
>> measure. Thus, I am not sure that one should use this measure in an
>> evolutionary analysis.
>>
>> I mentioned the forthcoming paper by Caroline Wagner and me about
>> coauthorship relations (http://www.leydesdorff.net/sciencenets ) in
>> which we showed how the cosine-based analysis and mapping versus the
>> Pearson-correlation-based factor analysis enabled us to explore
>> different aspects of the same matrix. These different aspects can be
>> provided with different interpretations: the hierarchy in the network
>> and the competitive relations among leading countries, respectively.
>> But I still have to develop the fundamental argument more
>> systematically.
>>
>> With kind regards,
>>
>>
>> Loet
>> ------------------------------------------------------------------------
>> Loet Leydesdorff
>> Amsterdam School of Communications Research (ASCoR)
>> Kloveniersburgwal 48, 1012 CX Amsterdam
>> Tel.: +31-20- 525 6598; fax: +31-20- 525 3681
>> loet at leydesdorff.net <mailto:loet at leydesdorff.net>;
>> http://www.leydesdorff.net/
>>
>> The Challenge of Scientometrics
>> <http://www.upublish.com/books/leydesdorff-sci.htm> ; The
>> Self-Organization of the Knowledge-Based Society
>> <http://www.upublish.com/books/leydesdorff.htm>
>>
>> > -----Original Message-----
>> > From: ASIS&T Special Interest Group on Metrics
>> > [mailto:SIGMETRICS at listserv.utk.edu] On Behalf Of Steven Morris
>> > Sent: Tuesday, December 23, 2003 3:26 AM
>> > To: SIGMETRICS at listserv.utk.edu
>> > Subject: Re: [SIGMETRICS] White HD "Author cocitation analysis and
>> > ...
>> >
>> >
>> > Dear colleagues,
>> >
>> > Regarding rxy vs. cosine similarity:
>> >
>> > When working with a collection of papers downloaded from the Web of
>> > Science, where a paper to reference author citation matrix can be
>> > extracted, the calculation of cosine similarity and rxy, the
>> > correlation coefficient, are both straightforward. Similarity is
>> > based on the number of times a pair of authors are cited together. N
>> > is the number of papers in the collection, n(i), n(j) is the number
>> > of citations received by ref author i and j, n(i,j) is the number of
>> > papers citing both ref author i and ref author j. The
>> > correlation coefficient is calculated from
>> > rxy = [N*n(i,j)-n(i)*n(j)] / sqrt[(N*n(i)-n(i)^2)*(N*n(j)-n(j)^2)]
>> > while the cosine similarity is calculated using
>> > s = n(i,j)/sqrt[n(i)*n(j)]. If N is large compared to the
>> > product of the number of cites received by a pair of authors,
>> > then rxy and the cosine formula give equal results.  See
>> > http://samorris.ceat.okstate.edu/web/rxy/default.htm
>> > for crossplots of cosine similarity vs. rxy for reference
>> > authors from several collections of papers.
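[The claimed large-N convergence of rxy toward the cosine can be checked numerically from the two formulas above; the citation counts are hypothetical.]

```python
import math

def rxy(N, n_i, n_j, n_ij):
    # correlation coefficient from the count summaries:
    # N papers, n_i/n_j citations per author, n_ij co-citing papers
    return (N * n_ij - n_i * n_j) / math.sqrt(
        (N * n_i - n_i ** 2) * (N * n_j - n_j ** 2))

def cos_sim(n_i, n_j, n_ij):
    return n_ij / math.sqrt(n_i * n_j)

# hypothetical counts; as N grows relative to n_i*n_j, rxy -> cosine
n_i, n_j, n_ij = 8, 10, 4
for N in (50, 500, 5000):
    print(N, rxy(N, n_i, n_j, n_ij), cos_sim(n_i, n_j, n_ij))
```

With N = 50 the two measures differ visibly; by N in the thousands they agree to three decimal places.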
>> >
>> > For collections of papers without dominant reference authors there
>> > is very little difference between cosine and rxy.  For collections
>> > with dominant reference authors that are cited by a large fraction
>> > of the total number of papers, rxy can be much less than cosine
>> > similarity.
>> >
>> > The correlation coefficient is problematic in this case because it
>> > is possible for pairs of authors with large co-citation counts to
>> > have zero rxy.  For example, two authors, both cited by half the
>> > papers in the collection but cocited by 1/4 of the papers, will have
>> > a correlation coefficient of zero but a cosine similarity of 1/2.
>> > Also, the correlation coefficient is not defined for any author that
>> > is cited by all papers in the collection, since that author has zero
>> > variance. Recall that rxy is cov(x,y)/sqrt[var(x)*var(y)], so
>> > zero variance drives the denominator to zero in the rxy
>> > equation, thus leaving rxy undefined.
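[The half/quarter example works out exactly as stated; here it is as a small verification sketch, with N chosen arbitrarily.]

```python
import math

N = 100          # papers in the collection (arbitrary)
n_i = n_j = 50   # each author cited by half the papers
n_ij = 25        # cocited by a quarter of the papers

# phi-style correlation from the count summaries
rxy = (N * n_ij - n_i * n_j) / math.sqrt(
    (N * n_i - n_i ** 2) * (N * n_j - n_j ** 2))
# binary cosine (Ochiai)
cos = n_ij / math.sqrt(n_i * n_j)

print(rxy, cos)  # 0.0 0.5
```

The numerator N*n(i,j) - n(i)*n(j) = 2500 - 2500 vanishes, so rxy is exactly zero while the cosine is 1/2.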
>> >
>> > For this reason it's probably better to use cosine similarity than
>> > rxy for ACA based on a paper to ref author matrix.
>> > Converting similarities to distances for clustering is less
>> > problematic as well.
>> >
>> > The situation is different for ACA based on a co-citation count
>> > matrix. In this case the similarity between two authors is not based
>> > on how often they are cited together, but on whether the two authors
>> > are co-cited in the same proportions among the other authors in the
>> > collection.  In this case it would seem that rxy would be the
>> > appropriate measure of similarity to use.
>> >
>> > S. Morris
>> >
>> >
>> >
>> > Loet Leydesdorff wrote:
>> > > > -----Original Message-----
>> > > > From: ASIS&T Special Interest Group on Metrics
>> > > > [mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Eugene Garfield
>> > > > Sent: Monday, December 01, 2003 9:57 PM
>> > > > To: SIGMETRICS at LISTSERV.UTK.EDU
>> > > > Subject: [SIGMETRICS] White HD "Author cocitation analysis
>> > > > and Pearson's r" Journal of the American Society for Information
>> > > > Science and Technology 54(13):1250-1259 November 2003
>> > > >
>> > > > Howard D. White : Howard.Dalby.White at drexel.edu
>> > > >
>> > > > TITLE    Author cocitation analysis and Pearson's r
>> > > >
>> > > > AUTHOR   White HD
>> > > >
>> > > > JOURNAL  JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION
>> > > >          SCIENCE AND TECHNOLOGY 54 (13): 1250-1259 NOV 2003
>> > >
>> > > Dear Howard and colleagues,
>> > >
>> > > I read this article with interest and I agree that for most
>> > > practical purposes Pearson's r will do a job similar to Salton's
>> > > cosine. Nevertheless, the argument of Ahlgren et al. (2002) seems
>> > > convincing to me. Scientometric distributions are often highly
>> > > skewed and the mean can easily be distorted by the zeros. The
>> > > cosine elegantly solves this problem.
>> > >
>> > > A disadvantage of the cosine (in comparison to the r) may be that
>> > > it does not become negative in order to indicate dissimilarity.
>> > > This is particularly important for the factor analysis. I have
>> > > thought about inputting the cosine matrix into the factor analysis
>> > > (SPSS allows for importing a matrix in this analysis), but that
>> > > seems a bit tricky.
>> > >
>> > > Caroline Wagner and I did a study on coauthorship relations
>> > > entitled "Mapping Global Science using International
>> > > Coauthorships: A comparison of 1990 and 2000" (Intern. J. of
>> > > Technology and Globalization, forthcoming) in which we used the
>> > > same matrix for mapping using the cosine (and then Pajek for the
>> > > visualization) and for the factor analysis using Pearson's r. The
>> > > results are provided as factor plots in the preprint version of
>> > > the paper at http://www.leydesdorff.net/sciencenets/mapping.pdf .
>> > >
>> > > While the cosine maps exhibit the hierarchy by placing the central
>> > > cluster in the center (including the U.S.A. and some
>> > > Western-European countries), the factor analysis reveals the main
>> > > structural axes of the system as competitive relations between the
>> > > U.S.A., U.K., and continental Europe (Germany + Russia). The
>> > > French system can be considered as a fourth axis. These
>> > > eigenvectors function as competitors for collaboration with
>> > > authors from other (smaller or more peripheral) countries.
>> > >
>> > > Thus, the two measures enable us to show different things:
>> > > Salton's cosine exhibits the hierarchy, and one might say that the
>> > > factor analysis on the basis of Pearson's r enables us to show the
>> > > heterarchy among competing axes in the system.
>> > >
>> > > With kind regards,
>> > >
>> > > Loet
>> > >
>> > >
>> > > ----------------------------------------------------------------------
>> > > Loet Leydesdorff
>> > > Amsterdam School of Communications Research (ASCoR)
>> > > Kloveniersburgwal 48, 1012 CX Amsterdam
>> > > Tel.: +31-20- 525 6598; fax: +31-20- 525 3681
>> > > loet at leydesdorff.net <mailto:loet at leydesdorff.net>; http://www.leydesdorff.net/
>> > >
>> > > The Challenge of Scientometrics
>> > > <http://www.upublish.com/books/leydesdorff-sci.htm> ; The
>> > > Self-Organization of the Knowledge-Based Society
>> > > <http://www.upublish.com/books/leydesdorff.htm>
>> > >
>> > >
>> > >
>> > > >
>> > > > Document type: Article  Language: English  Cited References: 20
>> > > > Times Cited: 0
>> > > > Abstract:
>> > > > In their article "Requirements for a cocitation similarity
>> > > > measure, with special reference to Pearson's correlation
>> > > > coefficient," Ahlgren, Jarneving, and Rousseau fault
>> > > > traditional author cocitation analysis (ACA) for using
>> > > > Pearson's r as a measure of similarity between authors because
>> > > > it fails two tests of stability of measurement. The
>> > > > instabilities arise when rs are recalculated after a first
>> > > > coherent group of authors has been augmented by a second
>> > > > coherent group with whom the first has little or no cocitation.
>> > > > However, AJ&R neither cluster nor map their data to demonstrate
>> > > > how fluctuations in rs will mislead the analyst, and the
>> > > > problem they pose is remote from both theory and practice in
>> > > > traditional ACA. By entering their own rs into multidimensional
>> > > > scaling and clustering routines, I show that, despite rs
>> > > > fluctuations, clusters based on it are much the same for the
>> > > > combined groups as for the separate groups. The combined groups
>> > > > when mapped appear as polarized clumps of points in
>> > > > two-dimensional space, confirming that differences between the
>> > > > groups have become much more important than differences within
>> > > > the groups, an accurate portrayal of what has happened to the
>> > > > data. Moreover, r produces clusters and maps very like those
>> > > > based on other coefficients that AJ&R mention as possible
>> > > > replacements, such as a cosine similarity measure or a chi
>> > > > square dissimilarity measure. Thus, r performs well enough for
>> > > > the purposes of ACA. Accordingly, I argue that qualitative
>> > > > information revealing why authors are cocited is more important
>> > > > than the cautions proposed in the AJ&R critique. I include
>> > > > notes on topics such as handling the diagonal in author
>> > > > cocitation matrices, lognormalizing data, and testing r for
>> > > > significance.
>> > > > KeyWords Plus:
>> > > > INTELLECTUAL STRUCTURE, SCIENCE
>> > > >
>> > > > Addresses:
>> > > > White HD, Drexel Univ, Coll Informat Sci & Technol, 3152
>> > > > Chestnut St, Philadelphia, PA 19104 USA; Drexel Univ, Coll
>> > > > Informat Sci & Technol, Philadelphia, PA 19104 USA
>> > > >
>> > > > Publisher:
>> > > > JOHN WILEY & SONS INC, 111 RIVER ST, HOBOKEN, NJ 07030 USA
>> > > >
>> > > > IDS Number:
>> > > > 730VQ
>> > > >
>> > > > Cited Author    Cited Work              Volume  Page  Year
>> > > >
>> > > > AHLGREN P       J AM SOC INF SCI TEC      54     550  2003
>> > > > BAYER AE        J AM SOC INFORM SCI       41     444  1990
>> > > > BORGATTI SP     UCINET WINDOWS SOFTW                  2002
>> > > > BORGATTI SP     WORKSH SUNB 20 INT S                  2000
>> > > > DAVISON ML      MULTIDIMENSIONAL SCA                  1983
>> > > > EOM SB          J AM SOC INFORM SCI       47     941  1996
>> > > > EVERITT B       CLUSTER ANAL                          1974
>> > > > GRIFFITH BC     KEY PAPERS INFORMATI              R6  1980
>> > > > HOPKINS FL      SCIENTOMETRICS             6      33  1984
>> > > > HUBERT L        BRIT J MATH STAT PSY      29     190  1976
>> > > > LEYDESDORFF L   INFORMERICS 87 88                105  1988
>> > > > MCCAIN KW       J AM SOC INFORM SCI       41     433  1990
>> > > > MCCAIN KW       J AM SOC INFORM SCI       37     111  1986
>> > > > MCCAIN KW       J AM SOC INFORM SCI       35     351  1984
>> > > > MULLINS NC      THEORIES THEORY GROU                  1973
>> > > > WHITE HD        BIBLIOMETRICS SCHOLA              84  1990
>> > > > WHITE HD        J AM SOC INF SCI TEC      54     423  2003
>> > > > WHITE HD        J AM SOC INFORM SCI       49     327  1998
>> > > > WHITE HD        J AM SOC INFORM SCI       41     430  1990
>> > > > WHITE HD        J AM SOC INFORM SCI       32     163  1981
>> > > >
>> > > > When responding, please attach my original message
>> > > > _______________________________________________________________
>> > > > Eugene Garfield, PhD.  email: garfield at codex.cis.upenn.edu
>> > > > home page: www.eugenegarfield.org
>> > > > Tel: 215-243-2205 Fax 215-387-1266
>> > > > President, The Scientist LLC. www.the-scientist.com
>> > > > Chairman Emeritus, ISI www.isinet.com
>> > > > Past President, American Society for Information Science and
>> > > > Technology (ASIS&T)  www.asis.org
>> > > > _______________________________________________________________
>> > > >
>> > > > ISSN:
>> > > > 1532-2882
>> > >
>> >
>> >
>> > --
>> > ---------------------------------------------------------------
>> > Steven A. Morris                            samorri at okstate.edu
>> > Electrical and Computer Engineering        office: 405-744-1662
>> > 202 Engineering So.
>> > Oklahoma State University
>> > Stillwater, Oklahoma 74078
>> > http://samorris.ceat.okstate.edu
>> >
>>