From kraft at BIT.CSC.LSU.EDU Fri Jan 2 10:18:13 2004 From: kraft at BIT.CSC.LSU.EDU (Dr. Don Kraft) Date: Fri, 2 Jan 2004 09:18:13 -0600 Subject: A message from a colleague on Pearsons r and other statistics Message-ID: In recent times there have been articles now in press in JASIST with issues at stake that need clarification. Pearson's r does have the characteristic of being responsive to 0s, and this may affect the outcome of the analysis, depending upon what the purpose of the analysis is. My basic point still holds: If people think that Pearson's r is distortive because of this characteristic, then it is beholden upon them to demonstrate in a practical manner how this can affect conclusions and how a measure that does not distort in this manner can lead to a different interpretation of the data. This some authors adamantly refused to do. Mathematical fine points are all very well and good, but there needs to be a "so what" section at the end. I have spent the last few years reading Karl Pearson, R.A. Fisher, Student, Richard von Mises, Bortkiewicz, etc., and their mathematical capabilities far exceeded those of Rousseau and company. However, their papers--sometimes described as "a jungle of formulae"--always had "so what" sections, in which the issue at stake was clearly stated in clear language and pictures. If people want a model, then they should read the 3rd 1911 edition of Karl Pearson's "The Grammar of Science," where he explains contingency and correlation in clear terms to laymen. It is a model of clear thinking, which is often lacking in present-day work. 
Stephen J Bensman notsjb at lsu.edu From loet at LEYDESDORFF.NET Sun Jan 4 08:29:01 2004 From: loet at LEYDESDORFF.NET (Loet Leydesdorff) Date: Sun, 4 Jan 2004 14:29:01 +0100 Subject: A message from a colleague on Pearsons r and other statistics In-Reply-To: <200401021518.JAA03168@bit.csc.lsu.edu> Message-ID: -----Original Message----- From: ASIS&T Special Interest Group on Metrics [mailto:SIGMETRICS at listserv.utk.edu] On Behalf Of Dr. Don Kraft Sent: Friday, January 02, 2004 4:18 PM To: SIGMETRICS at listserv.utk.edu Subject: [SIGMETRICS] A message from a colleague on Pearsons r and other statistics In recent times there have been articles now in press in JASIST with issues at stake that need clarification. Pearson's r does have the characteristic of being responsive to 0s, and this may affect the outcome of the analysis, depending upon what the purpose of the analysis is. My basic point still holds: If people think that Pearson's r is distortive because of this characteristic, then it is beholden upon them to demonstrate in a practical manner how this can affect conclusions and how a measure that does not distort in this manner can lead to a different interpretation of the data. This some authors adamantly refused to do. Mathematical fine points are all very well and good, but there needs to be a "so what" section at the end. I have spent the last few years reading Karl Pearson, R.A. Fisher, Student, Richard von Mises, Bortkiewicz, etc., and their mathematical capabilities far exceeded those of Rousseau and company. However, their papers--sometimes described as "a jungle of formulae"--always had "so what" sections, in which the issue at stake was clearly stated in clear language and pictures. If people want a model, then they should read the 3rd 1911 edition of Karl Pearson's "The Grammar of Science," where he explains contingency and correlation in clear terms to laymen. It is a model of clear thinking, which is often lacking in present-day work. 
Stephen J Bensman notsjb at lsu.edu Dear colleague, I have argued in a series of articles (in Scientometrics, in the early 1990s, and later compiled in a book entitled "The Challenge of Scientometrics") that one cannot expect scientific communications to be normally distributed and that therefore parametric measures like the Pearson correlation are often distorting. Negative powerlaws, for example, cannot be represented by using the means of the distribution. The large number of zeros in scientometric matrices are a consequence of the shape of the distributions and therefore not avoidable in most applications. Fortunately, a mathematical theory of communication is available based on non-parametric statistics. I have elaborated this theory for scientometric applications in the book mentioned above. The probabilistic entropy measures are non-parametric. Unfortunately, most of the available software and most of the statistics is based on parametric measures (e.g., factor analysis) because attributes to agents are often normally distributed. Many scientometric indicators (e.g., impact factors) are based on averages although one is aware that the underlying processes are the result of a dynamic at the network level. (Of course, one can still use the average as an indicator, but the interpretation is different from the case of a normal distribution.) Salton's cosine has certain properties that merit discussion. It allows for a hierarchical representation (because one can also use the cosine among centroids). The Vector Space Model has more recently been developed as a form of multidimensional scaling (e.g., Ortego Priego, 2003). Furthermore, this measure is most easy in the computation, while information-theoretical (probabilistic) measures are often computationally intensive. I have argued in previous postings in this thread that the cosine enables us to make a spatial representation different from the factor analysis (which is usually based on the Pearson correlation). 
The cosine, for example, enables us to visualize a hierarchy in the network, while the factor analysis exhibits (heterarchical) dimensions. Thus, there may be a beginning of an answer to your "so what?" question. With kind regards, Loet _____ Loet Leydesdorff Science & Technology Dynamics, University of Amsterdam Amsterdam School of Communications Research (ASCoR) Kloveniersburgwal 48, 1012 CX Amsterdam Tel.: +31-20-525 6598; fax: +31-20-525 3681 loet at leydesdorff.net; http://www.leydesdorff.net From loet at LEYDESDORFF.NET Tue Jan 6 14:20:20 2004 From: loet at LEYDESDORFF.NET (Loet Leydesdorff) Date: Tue, 6 Jan 2004 20:20:20 +0100 Subject: White HD "Author cocitation analysis and ... In-Reply-To: <3FF1AC25.6050808@okstate.edu> Message-ID: Dear Steven, Thank you for communicating these experimental results. They are interesting. It seems to me that you have convincingly shown that the two measures (the binary and the non-binary one) are different in the case that there is information available at a measurement scale higher than dichotomous (e.g., at the interval level). Of course, if one has only binary information, one can use the binary formulation of the formula, but this is generated only because the square or the root of one is also one, and the square or root of zero is also zero. Thus, the cosine is defined more generally in terms of what you call the non-binary formulation. I don't agree with the overlap function. It seems to me most naturally to return to the original matrix of authors cited as cases and citations as variables (columns). A cocitation is then the case that two cells are filled in the same column. One can then compute cosines between authors as the cases. Choose within SPSS for Analyze > Correlate > Distances and you find all the options, including cosines between cases. There is no need for the invention of a new function, in my opinion. With kind regards, Loet Dear Loet, Thanks very much for your interesting remarks. 
In answer to item 1 below, I have always converted the paper to reference authors matrix and paper to term matrix to binary matrices so that co-occurences can be calculated easily by multiplying the matrices by their transpose. I'd actually never thought of using the cosine formula that you give below. I did try that calculation on non-binary paper to reference authors matrices using: cosine(x,y) = Sigma(i) x(i)y(i) / sqrt(Sigma(i) x(i)^2) * Sigma(i) y(i)^2)) I crossplotted the similarity values thus obtained against "binary" cosine similarity values. The results can be seen at: http://samorris.ceat.okstate.edu/web/non_bin_cos/default.htm There does appear to be a lot of scatter between these two measures, though in most of the paper collections it doesn't appear to be biased off the 1:1 line. I don't know what effect this difference would have on clustering of authors. I'm not sure I agree with you that using the binary version of the cosine similarity is "throwing away information." After all, references are cited multiple times in papers but the data we have available (from ISI) only shows that a reference showed up at least once, yet the data is still very useful. Granted that knowing the exact number of times an author was cited in a paper adds more information, I'm still not sure that using the non-binary cosine formula above is the most appropriate way to exploit that extra information. Alternate approaches are available, for example, using the 'overlap' measure. I have tried using an "overlap" function to compute cocitation counts for cosine calculations. For a paper the overlap of ref author i and ref author j is defined as min[m(i), m(j)], m(i) and m(j) are the number of times author i and author j were cited in the paper respectively. This appears to be a reasonable measure of multiple co-citation as it doesn't give a lot of weight to co-citations with authors that tend to appear many times in papers. 
So "overlap cosine similarity" can be calculated using s(i,,j) = sum[overlap(i,j)] / sqrt( n(i)*n(j)) ) , where the sum is over all papers and n(i) and n(j) are the sum over all papers of the number of citations to author i and j respectively. For the datasets I have, you can see crossplots of "overlap cosine similarity" against "binary cosine similarity at: http://samorris.ceat.okstate.edu/web/overlap/default.htm . These plots show that overlap similarity tends to be a little larger than binary similarity. This may imply the the overlap method generally tends to increase similarity over the binary method, but proportionally, so that there is no effect of distances between authors and thus no effect, bad or good, on clustering. On point 2 below, similarities between a pair of authors using a co-citation count matrix is based on whether those two authors are cocited in the same proportions among the other authors. Correlation seems a natural measure for this, as it is the measure used for estimating linear dependence. Also it would seem that negative correlation would be applicable: Suppose there are two "camps" among a group of 10 authors and that the 1st and 10th authors are the leaders of the two groups respectively. Assume two authors have the following co-citation counts: x = [ 1 2 3 4 5 6 7 8 9 10 ] y = [ 10 9 8 7 6 5 4 3 2 1 ] so author x is in author 10's camp and author y is in author 1's camp. in this case rxy = -1, and (1+rxy)/2 gives a similarity of 0. while cosine s = 0.5714. as cosine similarity. so the rxy similarity shows the authors as disimilar (logical since they belong to different camps). while cosine similarity shows that they are similar. Wouldn't this type of effect be a problem with using the cosine similarity for co-citation count matricies? With correlation there is still the problem of what to do with authors that have zero variance or cocitation count matrices that have large numbers of zeros. 
Thanks kindly, Steven Morris Loet Leydesdorff wrote: > Dear Steve, > > Thank you for the interesting contribution. Let me make a few remarks: > > 1. Why did you reduce the matrices studied to binary ones? ("The > (i,j)th element of O(p,ra) is unity if paper i cites reference author > j one or more times, zero otherwise." at > http://samorris.ceat.okstate.edu/web/rxy/default.htm .) Both r and the > cosine are well defined for frequency distributions. > > The cosine between two vectors x(i) and y(i) is defined as: > > cosine(x,y) = Sigma(i) x(i)y(i) / sqrt(Sigma(i) x(i)^2) * > Sigma(i) y(i)^2)) > > For those of you who read this in html: > > In the case of the binary matrix this formula degenerates to the > simpler format that you used: > > cos=n(i,j)/sqrt[n(i)*n(j)] > > SPSS calls this simpler format the "Ochiai". Salton & McGill (1983) > provided the full formula in their "Introduction to Modern Information > Retrieval" (Auckland, etc.: McGraw-Hill). > > There seems no reason to throw away part of the information that is > available in your datasets. I would be curious to see how your curves > would look like using the full data. I expect some effects. > > 2. Why would your reasoning not hold for ACA? For rough-and-ready > purposes, one may wish to use either measure as White (2003) posits. > However, the fundamental points remain the same, isn't it? One could > also have a zero variance in an ACA matrix or not? The problem with > the zeros signalled by Ahlgren et al. (2003) remains also in this > case, isn't it? > > 3. In addition to the technical differences, there may be differences > stemming from the research design that make the researcher decide to > use one or the other measure. For example, in a factor analytic design > one uses Pearson's r. For mapping purposes one may also consider the > Euclidean distance, but this is expected to provide very different > results. The theoretical purposes of the research have first to be > specified, in my opinion. 
> > 4. My interest in this issue is driven by my interest in the evolution > of communication systems. One can expect communication systems to > develop in different phases like a segmentation, stratification, and > differentiation. In a segmented communication system only mutual > relations would count. Euclidean distances may be the right measure. > > In a fully differentiated one, one would expect eigenvector to be > spanned orthogonally at the network level. Here factor analysis > provides us with insights in the structural differentiation. In the > in-between stage a stratified communication system is expected to be > hierarchically organized. The grouping is then reduced to a ranking. > For this case, the cosine seems a good mapping tool since it organized > the "star" of the network in the center of the map (using a > visualization tool). Pearson's r in this case has the disadvantages > mentioned previously during this discussion. > > The Jaccard index seems to operate somewhere between the Euclidean > distance and the cosine. It focusses on segments, but the > interpretation is closer to the cosine than to the Euclidean distance > measure. Thus, I am not sure that one should use this measure in an > evolutionary analysis. > > I mentioned the forthcoming paper of Caroline Wagner and me about > coauthorship relations (http://www.leydesdorff.net/sciencenets ) in > which we showed how the cosine-based analysis and mapping versus the > the Pearson-correlation based factor analysis enabled us to explore > different aspects of the same matrix. These different aspects can be > provided with different interpretations: the hierarchy in the network > and the competitive relations among leading countries, respectively. > But I still have to develop the fundamental argument more systematically. 
> > With kind regards, > > > Loet > ---------------------------------------------------------------------- > -- > Loet Leydesdorff > Amsterdam School of Communications Research (ASCoR) > Kloveniersburgwal 48, 1012 CX Amsterdam > Tel.: +31-20- 525 6598; fax: +31-20- 525 3681 > loet at leydesdorff.net ; > http://www.leydesdorff.net/ > > The Challenge of Scientometrics > ; The > Self-Organization of the Knowledge-Based Society > > > > -----Original Message----- > > From: ASIS&T Special Interest Group on Metrics > > [mailto:SIGMETRICS at listserv.utk.edu] On Behalf Of Steven Morris > > Sent: Tuesday, December 23, 2003 3:26 AM > > To: SIGMETRICS at listserv.utk.edu > > Subject: Re: [SIGMETRICS] White HD "Author cocitation analysis and > > ... > > > > > > Dear colleagues, > > > > Regarding rxy vs. cosine similarity: > > > > When working with a collection of papers downloaded from the Web of > > Science, where a paper to reference author citation matrix can be > > extracted, the calculation of cosine similarity and rxy, the > > correlation coefficient, are both straightforward. Similarity is > > based on the number of times a pair of authors are cited together. N > > is the number of papers in the collection, n(i), n(j) is the number > > of citations received by ref author i and j, n(i,j) is the number of > > papers citing both ref author i and ref author j. The > > correlation coefficient is calculated from > > rxy=[N*n(i,j)-n(i)*n(j)]/sqrt[(N*n(i)-n(i)^2)*(N*n(j)-n(j)^2)] > > while the cosine similarity is calulated using > > s=n(i,j)/sqrt[n(i)*n(j)]. If N is large compared to the > > product of the number of cites received by a pair of authors, > > then rxy and cosine formula give equal results. See > > http://samorris.ceat.okstate.edu/web/rxy/default.htm > > for crossplots of cosine similarity vs. rxy for reference > > authors from several collections of papers. 
> > > > For collections of papers without domininant reference authors there > > is very little difference between cosine and rxy. For collections > > with dominant reference authors that are cited by a large fraction > > of the total number of papers, rxy can be much less than cosine > > similarity. > > > > Correlation coefficient is problematic in this case because it is > > possible for pairs of authors with large co-citation counts to have > > zero rxy. For example, two authors, both cited by half the papers > > in the collection, but cocited by 1/4 of the papers will have a > > correlation coefficient of zero but a cosine similarity of 1/2. > > Also, the correlation coefficient is not defined for any author that > > is cited by all papers in the collection, since that author has zero > > variance. Recall that rxy is cov(x,y)/sqrt[var(x)*var(y)], so > > zero variance drives the denominator to zero in the rxy > > equation, thus undefined rxy. > > > > For this reason it's probably better to use cosine similarity than > > rxy for ACA analysis based on a paper to ref author matrix. > > Converting similarities to distances for clustering is less > > problematic as well. > > > > The situation is different for ACA based on a co-citation count > > matrix. In this case the similarity between two authors is not based > > on how often they are cited together, but whether the two authors > > are co-cited in the same proportions among the other authors in the > > collection. In this case it would seem that rxy would be the > > appropriate measure of similarity to use. > > > > S. 
Morris > > > > > > > > Loet Leydesdorff wrote: > > > > -----Original Message----- > > > > From: ASIS&T Special Interest Group on Metrics > > > > [mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Eugene > > Garfield > > > > Sent: Monday, December 01, 2003 9:57 PM > To: > > > SIGMETRICS at LISTSERV.UTK.EDU > Subject: [SIGMETRICS] White > > HD "Author > > > cocitation analysis > and Pearson's r" Journal of the American > > > Society for > Information Science and Technology 54(13):1250-1259 > > > November 2003, > > > > > > Howard D. White : Howard.Dalby.White at drexel.edu > > > > > > > > TITLE Author cocitation analysis and Pearson's r > > > > > > > > AUTHOR White HD > > > > > > > > JOURNAL JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION > > > > SCIENCE AND TECHNOLOGY 54 (13): 1250-1259 NOV 2003 > > > > > > Dear Howard and colleagues, > > > > > > I read this article with interest and I agree that for most > > practical > > > purposes Pearson's r will do a job similar to Salton's cosine. > > > Nevertheless, the argument of Ahlgren et al. (2002) seems > > convincing > > > to me. Scientometric distributions are often highly skewed and the > > > mean can easily be distorted by the zeros. The cosine > > elegantly solves > > > this problem. > > > > > > A disadvantage of the cosine (in comparison to the r) may > > be that it > > > does not become negative in order to indicate > > dissimilarity. This is > > > particularly important for the factor analysis. I have > > thought about > > > input-ing the cosine matrix into the factor analysis (SPSS > > allows for > > > importing a matrix in this analysis), but that seems a bit tricky. > > > > > > Caroline Wagner and I did a study on coauthorship relations > > entitled > > > "Mapping Global Science using International Coauthorships: A > > > comparison of 1990 and 2000" (Intern. J. 
of Technology and > > > Globalization, > > > forthcoming) in which we used the same matrix for mapping using > > > the cosine (and then Pajek for the visualization) and for the > > > factor analysis using Pearson's r. The results are provided as > > factor plots in > > > the preprint version of the paper at > > > http://www.leydesdorff.net/sciencenets/mapping.pdf . > > > > > > While the cosine maps exhibit the hierarchy by placing the central > > > cluster in the center (including the U.S.A. and some > > Western-European > > > countries), the factor analysis reveals the main structural axes > > > of the system as competitive relations between the U.S.A., U.K., > > > and continental Europe (Germany + Russia). The French system can > > > be considered as a fourth axis. These eigenvectors function as > > > competitors for collaboration with authors from other > > (smaller or more > > > peripheral) countries. > > > > > > Thus, the two measures enable us to show something differently: > > > Salton's cosine exhibits the hierarchy and one might say that the > > > factor analysis on the basis of Pearson's r enables us to show the > > > heterarchy among competing axes in the system. 
> > > > > > With kind regards, > > > > > > Loet > > > > > > > > -------------------------------------------------------------------- > > -- > > > -- > > > Loet Leydesdorff > > > Amsterdam School of Communications Research (ASCoR) > > > Kloveniersburgwal 48, 1012 CX Amsterdam > > > Tel.: +31-20- 525 6598; fax: +31-20- 525 3681 loet at leydesdorff.net > > > ; http://www.leydesdorff.net/ > > > > > > The Challenge of Scientometrics > > > ; The > > > Self-Organization of the Knowledge-Based Society > > > > > > > > > > > > > > > > > > > > > > > > Document type: Article Language: English Cited > > References: > 20 > > > Times Cited: 0 > > > > > Abstract: > > > > In their article "Requirements for a cocitation similarity > > > > measure, with special reference to Pearson's correlation > > > > coefficient," Ahlgren, Jarneving, and Rousseau fault > > > > traditional author cocitation analysis (ACA) for using > > > > Pearson's r as a measure of similarity between authors > because > > > it fails two tests of stability of measurement. The > > > > instabilities arise when rs are recalculated after a first > > > > coherent group of authors has been augmented by a second > > > > coherent group with whom the first has little or no > cocitation. > > > However, AJ&R neither cluster nor map their data > to demonstrate > > > how fluctuations in rs will mislead the > analyst, and the > > > problem they pose is remote from both theory > and practice in > > > traditional ACA. By entering their own rs > into multidimensional > > > scaling and clustering routines, I show > that, despite rs > > > fluctuations, clusters based on it are much > the same for the > > > combined groups as for the separate groups. 
> The combined groups > > > when mapped appear as polarized clumps of > points in > > > two-dimensional space, confirming that differences > between the > > > groups have become much more important than > differences within > > > the groups-an accurate portrayal of what > has happened to the > > > data. Moreover, r produces clusters and > maps very like those > > > based on other coefficients that AJ&R > mention as possible > > > replacements, such as a cosine similarity > measure or a chi > > > square dissimilarity measure. Thus, r > performs well enough for > > > the purposes of ACA. Accordingly, I > argue that qualitative > > > information revealing why authors are > cocited is more important > > > than the cautions proposed in the > AJ&R critique. I include > > > notes on topics such as handling the > diagonal in author > > > cocitation matrices, lognormalizing data, > and testing r for > > > significance. > > > > > KeyWords Plus: > > > > INTELLECTUAL STRUCTURE, SCIENCE > > > > > > > > Addresses: > > > > White HD, Drexel Univ, Coll Informat Sci & Technol, 3152 > > > > Chestnut St, Philadelphia, PA 19104 USA Drexel Univ, Coll > > > > Informat Sci & Technol, Philadelphia, PA 19104 USA > > > > > > > > Publisher: > > > > JOHN WILEY & SONS INC, 111 RIVER ST, HOBOKEN, NJ 07030 USA > > > > > > > > IDS Number: > > > > 730VQ > > > > > > > > > > > > Cited Author Cited Work Volume > > > > Page Year > > > > ID > > > > > > > > AHLGREN P J AM SOC INF SCI TEC 54 > > > > 550 2003 > > > > BAYER AE J AM SOC INFORM SCI 41 > > > > 444 1990 > > > > BORGATTI SP UCINET WINDOWS SOFTW > > > > 2002 > > > > BORGATTI SP WORKSH SUNB 20 INT S > > > > 2000 > > > > DAVISON ML MULTIDIMENSIONAL SCA > > > > 1983 > > > > EOM SB J AM SOC INFORM SCI 47 > > > > 941 1996 > > > > EVERITT B CLUSTER ANAL > > > > 1974 > > > > GRIFFITH BC KEY PAPERS INFORMATI > > > > R6 1980 > > > > HOPKINS FL SCIENTOMETRICS 6 > > > > 33 1984 > > > > HUBERT L BRIT J MATH STAT PSY 29 > > > > 190 1976 > > > > LEYDESDORFF L 
INFORMERICS 87 88 > > > > 105 1988 > > > > MCCAIN KW J AM SOC INFORM SCI 41 > > > > 433 1990 > > > > MCCAIN KW J AM SOC INFORM SCI 37 > > > > 111 1986 > > > > MCCAIN KW J AM SOC INFORM SCI 35 > > > > 351 1984 > > > > MULLINS NC THEORIES THEORY GROU > > > > 1973 > > > > WHITE HD BIBLIOMETRICS SCHOLA > > > > 84 1990 > > > > WHITE HD J AM SOC INF SCI TEC 54 > > > > 423 2003 > > > > WHITE HD J AM SOC INFORM SCI 49 > > > > 327 1998 > > > > WHITE HD J AM SOC INFORM SCI 41 > > > > 430 1990 > > > > WHITE HD J AM SOC INFORM SCI 32 > > > > 163 1981 > > > > > > > > > > > > When responding, please attach my original message > > > > ______________________________________________________________ > > > > _________ > > > > Eugene Garfield, PhD. email: garfield at codex.cis.upenn.edu > > > > home page: www.eugenegarfield.org > > > > Tel: 215-243-2205 Fax 215-387-1266 > > > > President, The Scientist LLC. www.the-scientist.com > > > > Chairman Emeritus, ISI www.isinet.com > > > > Past President, American Society for Information Science and > > > > Technology > > > > (ASIS&T) www.asis.org > > > > ______________________________________________________________ > > > > _________ > > > > > > > > > > > > > > > > ISSN: > > > > 1532-2882 > > > > > > > > > > > > > -- > > --------------------------------------------------------------- > > Steven A. Morris samorri at okstate.edu > > Electrical and Computer Engineering office: 405-744-1662 > > 202 Engineering So. > > Oklahoma State University > > Stillwater, Oklahoma 74078 > > http://samorris.ceat.okstate.edu > > > Dear Dr. Leysesdorff,

The message below was sent as a reply to you on the Sigmetrics mailing list about a week ago. However, I'm not sure if the list server is working at the moment. If you've received this before then please forgive me for having sent it to you twice. 

Very kind regards,

Steven Morris

------------------------------------------------------------------------





Dear Loet,

Thanks very much for your interesting remarks.
In answer to item 1 below, I have always converted the paper-to-reference-author matrix and the paper-to-term matrix to binary matrices so that co-occurrences can be calculated easily by multiplying the matrices by their transposes.  I'd actually never thought of using the cosine formula that you give below.  I did try that calculation on non-binary paper-to-reference-author matrices using:

       cosine(x,y) = Sigma(i) x(i)y(i) / sqrt( Sigma(i) x(i)^2 * Sigma(i) y(i)^2 )

I crossplotted the similarity values thus obtained against "binary" cosine similarity values.  The results can be seen at:
http://samorris.ceat.okstate.edu/web/non_bin_cos/default.htm
There does appear to be a lot of scatter between these two measures, though in most of the paper collections the scatter doesn't appear to be biased off the 1:1 line.  I don't know what effect this difference would have on the clustering of authors. I'm not sure I agree with you that using the binary version of the cosine similarity is "throwing away information."  After all, references are cited multiple times in papers, but the data we have available (from ISI) only show that a reference appeared at least once, yet the data are still very useful.  Granted that knowing the exact number of times an author was cited in a paper adds more information, I'm still not sure that the non-binary cosine formula above is the most appropriate way to exploit that extra information.  Alternative approaches are available, for example the 'overlap' measure.
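
The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the citation counts are made up, and the helper names `cosine` and `binarize` are hypothetical; they are not taken from the analysis behind the crossplots.

```python
import math

def cosine(x, y):
    """Salton's cosine between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if norm == 0:
        raise ValueError("cosine is undefined for an all-zero vector")
    return dot / norm

def binarize(x):
    """Reduce counts to 1 (cited at least once) or 0 (not cited)."""
    return [1 if a > 0 else 0 for a in x]

# Hypothetical data: number of times each of five papers cites two
# reference authors (assumed values, for illustration only).
author_i = [3, 0, 1, 2, 0]
author_j = [1, 1, 0, 4, 0]

non_binary = cosine(author_i, author_j)
binary = cosine(binarize(author_i), binarize(author_j))

print(f"non-binary cosine = {non_binary:.4f}")
print(f"binary cosine     = {binary:.4f}")
```

On 0/1 vectors the same function gives n(i,j)/sqrt[n(i)*n(j)], because squaring a 0 or a 1 leaves it unchanged, so no separate binary formula is needed.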

I have tried using an "overlap" function to compute cocitation counts for cosine calculations.  For a paper, the overlap of ref author i and ref author j is defined as min[m(i), m(j)], where m(i) and m(j) are the number of times author i and author j were cited in the paper, respectively.  This appears to be a reasonable measure of multiple co-citation, as it doesn't give a lot of weight to co-citations with authors that tend to appear many times in papers.  So "overlap cosine similarity" can be calculated using   s(i,j) = sum[overlap(i,j)] / sqrt( n(i)*n(j) ), where the sum is over all papers and n(i) and n(j) are the sums over all papers of the number of citations to authors i and j, respectively.  For the datasets I have, you can see crossplots of "overlap cosine similarity" against "binary cosine similarity" at:
http://samorris.ceat.okstate.edu/web/overlap/default.htm .  These plots show that overlap similarity tends to be a little larger than binary similarity. This may imply that the overlap method generally tends to increase similarity over the binary method, but proportionally, so that there is no effect on distances between authors and thus no effect, bad or good, on clustering.
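
As a rough sketch of the overlap formula given above, assuming made-up per-paper citation counts and a hypothetical function name `overlap_cosine`:

```python
import math

def overlap_cosine(m_i, m_j):
    """Overlap cosine: sum over papers of min[m(i), m(j)],
    divided by sqrt(n(i) * n(j)), where n is the total citation count."""
    overlap = sum(min(a, b) for a, b in zip(m_i, m_j))
    n_i, n_j = sum(m_i), sum(m_j)
    if n_i == 0 or n_j == 0:
        raise ValueError("undefined for an author who is never cited")
    return overlap / math.sqrt(n_i * n_j)

# Hypothetical data: citations to authors i and j in each of five papers.
m_i = [3, 0, 1, 2, 0]
m_j = [1, 1, 0, 4, 0]

s = overlap_cosine(m_i, m_j)
print(f"overlap cosine similarity = {s:.4f}")
```

When every count is 0 or 1, the per-paper minimum is 1 only where both authors are cited, so this expression reduces to the binary cosine n(i,j)/sqrt[n(i)*n(j)].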
 
On point 2 below, the similarity between a pair of authors in a co-citation count matrix is based on whether those two authors are cocited in the same proportions among the other authors.  Correlation seems a natural measure for this, as it is the measure used for estimating linear dependence.  Also, it would seem that negative correlation would be applicable:

Suppose there are two "camps" among a group of 10 authors and that the 1st and 10th authors are the leaders of the two groups respectively.  Assume
two authors have the following co-citation counts:

x = [  1     2     3     4     5     6     7     8     9    10 ]
y = [ 10     9     8     7     6     5     4     3     2     1 ]

So author x is in author 10's camp and author y is in author 1's camp.

In this case rxy = -1, and (1+rxy)/2 gives a similarity of 0,
   while the cosine similarity is s = 0.5714.

So the rxy similarity shows the authors as dissimilar (logical, since they belong to different camps),
  while cosine similarity shows that they are similar.  Wouldn't this type of effect be a problem with using the cosine similarity for co-citation count matrices?
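
The figures in this example can be checked with a short script. This is a minimal sketch using only the two vectors given above; the function names `pearson_r` and `cosine` are hypothetical helpers written for this check.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: cov(x,y) / sqrt[var(x) * var(y)]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def cosine(x, y):
    """Salton's cosine between two count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

x = list(range(1, 11))      # [1, 2, ..., 10]
y = list(range(10, 0, -1))  # [10, 9, ..., 1]

r = pearson_r(x, y)
s = cosine(x, y)
print(f"rxy = {r:.4f}, (1 + rxy)/2 = {(1 + r) / 2:.4f}, cosine = {s:.4f}")
```

The cosine stays well above zero here because all counts are positive: the numerator is 220 and both sums of squares are 385, giving 220/385.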

With correlation there is still the problem of what to do with authors that have zero variance or cocitation count matrices that have large numbers of zeros. 
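
To make the zero-variance problem concrete, here is a small sketch with invented binary citation columns. Returning `None` for an undefined correlation is a choice made for this illustration, not a standard convention, and the function names are hypothetical. The second case reproduces the situation mentioned earlier in this thread: two authors each cited by half the papers and cocited by a quarter of them.

```python
import math

def pearson_r_or_none(x, y):
    """Pearson's r, or None when either vector has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None  # denominator would be zero, so r is undefined
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(vx * vy)

def cosine(x, y):
    """Salton's cosine between two count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

# Hypothetical collection of four papers (1 = paper cites the author).
cited_by_all = [1, 1, 1, 1]   # zero variance
other = [1, 0, 1, 0]
r_undefined = pearson_r_or_none(cited_by_all, other)
c_defined = cosine(cited_by_all, other)

# Each author cited by half the papers, cocited by one quarter.
half_a = [1, 1, 0, 0]
half_b = [1, 0, 1, 0]
r_half = pearson_r_or_none(half_a, half_b)
c_half = cosine(half_a, half_b)

print(r_undefined, round(c_defined, 4))
print(r_half, c_half)
```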

Thanks kindly,

Steven Morris




Loet Leydesdorff wrote:
Dear subscribers. For those of you interested in research on the quality of the medical internet and its evaluation methods, a paper summarizing the work we have carried out over recent years has recently been published in Medical Informatics (Hernández-Borges AA et al. "User preference as quality markers of paediatric web sites". Med Inform Internet Med 2003; 28(3): 183-194, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=14612306&dopt=Abstract). Those interested in a reprint, please e-mail me. Any comments or suggestions will be gratefully received.

Regards,
Angel Hernández-Borges

From samorri at OKSTATE.EDU Wed Jan 7 10:57:18 2004
From: samorri at OKSTATE.EDU (Steven A. Morris)
Date: Wed, 7 Jan 2004 10:57:18 -0500
Subject: White HD "Author cocitation analysis and ...
Message-ID:

Dear Loet,

Thank you for a fruitful discussion. I guess I don't have much to add at this point. To be truthful, in practice I haven't noticed a lot of difference in results when using different similarity measures: rxy, "binary" cosine, "non-binary" cosine, or overlap. This isn't surprising, since authors are usually clustered in a "core and scatter" pattern, composed of small, well-defined "core" groups of closely related authors and a large "scatter" group of authors with sparse and highly overlapping relations. So a map of a collection of authors is like looking down on a pot of chicken and dumplings: it's easy to spot the dumplings (the core groups), but the stew (scatter authors) always looks different depending on how you stirred the pot (rxy, or cosine, or whatever).

The overlap similarity is just another way of defining a co-citation count. I think it was introduced by Salton or one of his co-workers. You can find a good discussion of overlap similarity in: Jones, W. P. and G. W. Furnas (1987). "Pictures of relevance: A geometrical analysis of similarity measures."
Journal of the American Society for Information Science and Technology 38(6): 420-442., This paper also discusses many other similarity measures such as rxy, cosine, dice and so forth, and I think it gives a good discussion of the merits of each type of measure. Thanks kindly, S. Morris On Tue, 6 Jan 2004 20:20:20 +0100, Loet Leydesdorff wrote: >Dear Steven, > >Thank you for communicating these experimental results. They are >interesting. > >It seems to me that you have convincingly shown that the two measures >(the binary and the non-binary one) are different in the case that there >is information available at a measurement scale higher than dichotomous >(e.g., at the interval level). Of course, if one has only binary >information, one can use the binary formulation of the formula, but this >is generated only because the square or the root of one is also one, and >the square or root of zero is also zero. Thus, the cosine is defined >more generally in terms of what you call the non-binary formulation. > >I don't agree with the overlap function. It seems to me most naturally >to return to the original matrix of authors cited as cases and citations >as variables (columns). A cocitation is then the case that two cells are >filled in the same column. One can then compute cosines between authors >as the cases. Choose within SPSS for Analyze > Correlate > Distances and >you find all the options, including cosines between cases. There is no >need for the invention of a new function, in my opinion. > >With kind regards, > > >Loet > > > >Dear Loet, > >Thanks very much for your interesting remarks. >In answer to item 1 below, I have always converted the paper to >reference authors matrix and paper to term matrix to binary matrices so >that co-occurences can be calculated easily by multiplying the matrices >by their transpose. I'd actually never thought of using the cosine >formula that you give below. 
I did try that calculation on non-binary >paper to reference authors matrices using: > > cosine(x,y) = Sigma(i) x(i)y(i) / sqrt(Sigma(i) x(i)^2) * >Sigma(i) y(i)^2)) > >I crossplotted the similarity values thus obtained against "binary" >cosine similarity values. The results can be seen at: >http://samorris.ceat.okstate.edu/web/non_bin_cos/default.htm >There does appear to be a lot of scatter between these two measures, >though in most of the paper collections it doesn't appear to be biased >off the 1:1 line. I don't know what effect this difference would have >on clustering of authors. I'm not sure I agree with you that using the >binary version of the cosine similarity is "throwing away information." > >After all, references are cited multiple times in papers but the data we > >have available (from ISI) only shows that a reference showed up at least > >once, yet the data is still very useful. Granted that knowing the exact > >number of times an author was cited in a paper adds more information, >I'm still not sure that using the non-binary cosine formula above is the > >most appropriate way to exploit that extra information. Alternate >approaches are available, for example, using the 'overlap' measure. > >I have tried using an "overlap" function to compute cocitation counts >for cosine calculations. For a paper the overlap of ref author i and >ref author j is defined as min[m(i), m(j)], m(i) and m(j) are the >number of times author i and author j were cited in the paper >respectively. This appears to be a reasonable measure of multiple >co-citation as it doesn't give a lot of weight to co-citations with >authors that tend to appear many times in papers. So "overlap cosine >similarity" can be calculated using s(i,,j) = sum[overlap(i,j)] / >sqrt( n(i)*n(j)) ) , where the sum is over all papers and n(i) and n(j) >are the sum over all papers of the number of citations to author i and j > >respectively. 
For the datasets I have, you can see crossplots of >"overlap cosine similarity" against "binary cosine similarity at: >http://samorris.ceat.okstate.edu/web/overlap/default.htm . These plots >show that overlap similarity tends to be a little larger than binary >similarity. This may imply the the overlap method generally tends to >increase similarity over the binary method, but proportionally, so that >there is no effect of distances between authors and thus no effect, bad >or good, on clustering. > >On point 2 below, similarities between a pair of authors using a >co-citation count matrix is based on whether those two authors are >cocited in the same proportions among the other authors. Correlation >seems a natural measure for this, as it is the measure used for >estimating linear dependence. Also it would seem that negative >correlation would be applicable: > >Suppose there are two "camps" among a group of 10 authors and that the >1st and 10th authors are the leaders of the two groups respectively. >Assume >two authors have the following co-citation counts: > >x = [ 1 2 3 4 5 6 7 8 9 10 ] >y = [ 10 9 8 7 6 5 4 3 2 1 ] > >so author x is in author 10's camp and author y is in author 1's camp. > >in this case rxy = -1, and (1+rxy)/2 gives a similarity of 0. > while cosine s = 0.5714. as cosine similarity. > >so the rxy similarity shows the authors as disimilar (logical since they > >belong to different camps). > while cosine similarity shows that they are similar. Wouldn't this >type of effect be a problem with using the cosine similarity for >co-citation count matricies? > >With correlation there is still the problem of what to do with authors >that have zero variance or cocitation count matrices that have large >numbers of zeros. > >Thanks kindly, > >Steven Morris > > > > >Loet Leydesdorff wrote: > >> Dear Steve, >> >> Thank you for the interesting contribution. Let me make a few remarks: >> >> 1. Why did you reduce the matrices studied to binary ones? 
("The >> (i,j)th element of O(p,ra) is unity if paper i cites reference author >> j one or more times, zero otherwise." at >> http://samorris.ceat.okstate.edu/web/rxy/default.htm .) Both r and the > >> cosine are well defined for frequency distributions. >> >> The cosine between two vectors x(i) and y(i) is defined as: >> >> cosine(x,y) = Sigma(i) x(i)y(i) / sqrt(Sigma(i) x(i)^2) * >> Sigma(i) y(i)^2)) >> >> For those of you who read this in html: >> >> In the case of the binary matrix this formula degenerates to the >> simpler format that you used: >> >> cos=n(i,j)/sqrt[n(i)*n(j)] >> >> SPSS calls this simpler format the "Ochiai". Salton & McGill (1983) >> provided the full formula in their "Introduction to Modern Information > >> Retrieval" (Auckland, etc.: McGraw-Hill). >> >> There seems no reason to throw away part of the information that is >> available in your datasets. I would be curious to see how your curves >> would look like using the full data. I expect some effects. >> >> 2. Why would your reasoning not hold for ACA? For rough-and-ready >> purposes, one may wish to use either measure as White (2003) posits. >> However, the fundamental points remain the same, isn't it? One could >> also have a zero variance in an ACA matrix or not? The problem with >> the zeros signalled by Ahlgren et al. (2003) remains also in this >> case, isn't it? >> >> 3. In addition to the technical differences, there may be differences >> stemming from the research design that make the researcher decide to >> use one or the other measure. For example, in a factor analytic design > >> one uses Pearson's r. For mapping purposes one may also consider the >> Euclidean distance, but this is expected to provide very different >> results. The theoretical purposes of the research have first to be >> specified, in my opinion. >> >> 4. My interest in this issue is driven by my interest in the evolution >> of communication systems. 
One can expect communication systems to >> develop in different phases like a segmentation, stratification, and >> differentiation. In a segmented communication system only mutual >> relations would count. Euclidean distances may be the right measure. >> >> In a fully differentiated one, one would expect eigenvector to be >> spanned orthogonally at the network level. Here factor analysis >> provides us with insights in the structural differentiation. In the >> in-between stage a stratified communication system is expected to be >> hierarchically organized. The grouping is then reduced to a ranking. >> For this case, the cosine seems a good mapping tool since it organized > >> the "star" of the network in the center of the map (using a >> visualization tool). Pearson's r in this case has the disadvantages >> mentioned previously during this discussion. >> >> The Jaccard index seems to operate somewhere between the Euclidean >> distance and the cosine. It focusses on segments, but the >> interpretation is closer to the cosine than to the Euclidean distance >> measure. Thus, I am not sure that one should use this measure in an >> evolutionary analysis. >> >> I mentioned the forthcoming paper of Caroline Wagner and me about >> coauthorship relations (http://www.leydesdorff.net/sciencenets ) in >> which we showed how the cosine-based analysis and mapping versus the >> the Pearson-correlation based factor analysis enabled us to explore >> different aspects of the same matrix. These different aspects can be >> provided with different interpretations: the hierarchy in the network >> and the competitive relations among leading countries, respectively. >> But I still have to develop the fundamental argument more >systematically. 
>> >> With kind regards, >> >> >> Loet >> ---------------------------------------------------------------------- >> -- >> Loet Leydesdorff >> Amsterdam School of Communications Research (ASCoR) >> Kloveniersburgwal 48, 1012 CX Amsterdam >> Tel.: +31-20- 525 6598; fax: +31-20- 525 3681 >> loet at leydesdorff.net ; >> http://www.leydesdorff.net/ >> >> The Challenge of Scientometrics >> ; The >> Self-Organization of the Knowledge-Based Society >> >> >> > -----Original Message----- >> > From: ASIS&T Special Interest Group on Metrics >> > [mailto:SIGMETRICS at listserv.utk.edu] On Behalf Of Steven Morris >> > Sent: Tuesday, December 23, 2003 3:26 AM >> > To: SIGMETRICS at listserv.utk.edu >> > Subject: Re: [SIGMETRICS] White HD "Author cocitation analysis and >> > ... >> > >> > >> > Dear colleagues, >> > >> > Regarding rxy vs. cosine similarity: >> > >> > When working with a collection of papers downloaded from the Web of >> > Science, where a paper to reference author citation matrix can be >> > extracted, the calculation of cosine similarity and rxy, the >> > correlation coefficient, are both straightforward. Similarity is >> > based on the number of times a pair of authors are cited together. N > >> > is the number of papers in the collection, n(i), n(j) is the number >> > of citations received by ref author i and j, n(i,j) is the number of >> > papers citing both ref author i and ref author j. The >> > correlation coefficient is calculated from >> > rxy=[N*n(i,j)-n(i)*n(j)]/sqrt[(N*n(i)-n(i)^2)*(N*n(j)-n(j)^2)] >> > while the cosine similarity is calulated using >> > s=n(i,j)/sqrt[n(i)*n(j)]. If N is large compared to the >> > product of the number of cites received by a pair of authors, >> > then rxy and cosine formula give equal results. See >> > http://samorris.ceat.okstate.edu/web/rxy/default.htm >> > for crossplots of cosine similarity vs. rxy for reference >> > authors from several collections of papers. 
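[Editorial sketch, not part of the original messages.] The count-based formulas quoted just above for rxy and the cosine can be tried out in a few lines of Python. The function names and all the counts below are made up for illustration. The sketch assumes a binary paper to reference author matrix, so that n(i) is the number of papers citing author i.

```python
import math

def pearson_from_counts(N, n_i, n_j, n_ij):
    """rxy from the counts: N papers, n_i and n_j papers citing each author,
    n_ij papers citing both. Returns None when an author has zero variance."""
    num = N * n_ij - n_i * n_j
    den = math.sqrt((N * n_i - n_i ** 2) * (N * n_j - n_j ** 2))
    if den == 0:
        return None  # author cited by no paper or by every paper
    return num / den

def cosine_from_counts(n_i, n_j, n_ij):
    """Binary cosine similarity s = n(i,j) / sqrt(n(i) * n(j))."""
    return n_ij / math.sqrt(n_i * n_j)

# Hypothetical large collection with two rarely cited authors:
# the two measures come out nearly equal.
r_sparse = pearson_from_counts(10000, 20, 30, 10)
s_sparse = cosine_from_counts(20, 30, 10)

# Hypothetical collection where both authors are cited by half the papers
# and cocited by a quarter of them: rxy vanishes, the cosine does not.
r_half = pearson_from_counts(100, 50, 50, 25)
s_half = cosine_from_counts(50, 50, 25)

# An author cited by every paper has zero variance, so rxy is undefined.
r_all = pearson_from_counts(100, 100, 50, 50)
```

Returning None for the zero-variance case is only one possible convention. The point is that the cosine stays defined in that situation while rxy does not.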
>> > >> > For collections of papers without domininant reference authors there > >> > is very little difference between cosine and rxy. For collections >> > with dominant reference authors that are cited by a large fraction >> > of the total number of papers, rxy can be much less than cosine >> > similarity. >> > >> > Correlation coefficient is problematic in this case because it is >> > possible for pairs of authors with large co-citation counts to have >> > zero rxy. For example, two authors, both cited by half the papers >> > in the collection, but cocited by 1/4 of the papers will have a >> > correlation coefficient of zero but a cosine similarity of 1/2. >> > Also, the correlation coefficient is not defined for any author that > >> > is cited by all papers in the collection, since that author has zero >> > variance. Recall that rxy is cov(x,y)/sqrt[var(x)*var(y)], so >> > zero variance drives the denominator to zero in the rxy >> > equation, thus undefined rxy. >> > >> > For this reason it's probably better to use cosine similarity than >> > rxy for ACA analysis based on a paper to ref author matrix. >> > Converting similarities to distances for clustering is less >> > problematic as well. >> > >> > The situation is different for ACA based on a co-citation count >> > matrix. In this case the similarity between two authors is not based > >> > on how often they are cited together, but whether the two authors >> > are co-cited in the same proportions among the other authors in the > >> > collection. In this case it would seem that rxy would be the >> > appropriate measure of similarity to use. >> > >> > S. 
Morris >> > >> > >> > >> > Loet Leydesdorff wrote: >> > > > -----Original Message----- >> > > > From: ASIS&T Special Interest Group on Metrics >> > > > [mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Eugene >> > Garfield >> > > > Sent: Monday, December 01, 2003 9:57 PM > To: >> > > SIGMETRICS at LISTSERV.UTK.EDU > Subject: [SIGMETRICS] White >> > HD "Author >> > > cocitation analysis > and Pearson's r" Journal of the American >> > > Society for > Information Science and Technology 54(13):1250-1259 > >> > > November 2003, > > >> > > > Howard D. White : Howard.Dalby.White at drexel.edu >> > > > >> > > > TITLE Author cocitation analysis and Pearson's r >> > > > >> > > > AUTHOR White HD >> > > > >> > > > JOURNAL JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION >> > > > SCIENCE AND TECHNOLOGY 54 (13): 1250-1259 NOV 2003 >> > > >> > > Dear Howard and colleagues, >> > > >> > > I read this article with interest and I agree that for most >> > practical >> > > purposes Pearson's r will do a job similar to Salton's cosine. >> > > Nevertheless, the argument of Ahlgren et al. (2002) seems >> > convincing >> > > to me. Scientometric distributions are often highly skewed and the > >> > > mean can easily be distorted by the zeros. The cosine >> > elegantly solves >> > > this problem. >> > > >> > > A disadvantage of the cosine (in comparison to the r) may >> > be that it >> > > does not become negative in order to indicate >> > dissimilarity. This is >> > > particularly important for the factor analysis. I have >> > thought about >> > > input-ing the cosine matrix into the factor analysis (SPSS >> > allows for >> > > importing a matrix in this analysis), but that seems a bit tricky. >> > > >> > > Caroline Wagner and I did a study on coauthorship relations >> > entitled >> > > "Mapping Global Science using International Coauthorships: A >> > > comparison of 1990 and 2000" (Intern. J. 
of Technology and >> > > Globalization, >> > > forthcoming) in which we used the same matrix for mapping using >> > > the cosine (and then Pajek for the visualization) and for the >> > > factor analysis using Pearson's r. The results are provided as >> > factor plots in >> > > the preprint version of the paper at >> > > http://www.leydesdorff.net/sciencenets/mapping.pdf . >> > > >> > > While the cosine maps exhibit the hierarchy by placing the central > >> > > cluster in the center (including the U.S.A. and some >> > Western-European >> > > countries), the factor analysis reveals the main structural axes >> > > of the system as competitive relations between the U.S.A., U.K., >> > > and continental Europe (Germany + Russia). The French system can >> > > be considered as a fourth axis. These eigenvectors function as >> > > competitors for collaboration with authors from other >> > (smaller or more >> > > peripheral) countries. >> > > >> > > Thus, the two measures enable us to show something differently: >> > > Salton's cosine exhibits the hierarchy and one might say that the >> > > factor analysis on the basis of Pearson's r enables us to show the > >> > > heterarchy among competing axes in the system. 
>> > > >> > > With kind regards, >> > > >> > > Loet >> > > >> > > >> > -------------------------------------------------------------------- >> > -- >> > > -- >> > > Loet Leydesdorff >> > > Amsterdam School of Communications Research (ASCoR) >> > > Kloveniersburgwal 48, 1012 CX Amsterdam >> > > Tel.: +31-20- 525 6598; fax: +31-20- 525 3681 loet at leydesdorff.net > >> > > ; http://www.leydesdorff.net/ >> > > >> > > The Challenge of Scientometrics >> > > ; The >> > > Self-Organization of the Knowledge-Based Society >> > > >> > > >> > > >> > > >> > > > >> > > > >> > > > Document type: Article Language: English Cited >> > References: > 20 >> > > Times Cited: 0 > >> > > > Abstract: >> > > > In their article "Requirements for a cocitation similarity > >> > > measure, with special reference to Pearson's correlation > >> > > coefficient," Ahlgren, Jarneving, and Rousseau fault > >> > > traditional author cocitation analysis (ACA) for using > >> > > Pearson's r as a measure of similarity between authors > because >> > > it fails two tests of stability of measurement. The > >> > > instabilities arise when rs are recalculated after a first > >> > > coherent group of authors has been augmented by a second > >> > > coherent group with whom the first has little or no > cocitation. > >> > > However, AJ&R neither cluster nor map their data > to demonstrate > >> > > how fluctuations in rs will mislead the > analyst, and the >> > > problem they pose is remote from both theory > and practice in >> > > traditional ACA. By entering their own rs > into multidimensional > >> > > scaling and clustering routines, I show > that, despite rs >> > > fluctuations, clusters based on it are much > the same for the >> > > combined groups as for the separate groups. 
> The combined groups > >> > > when mapped appear as polarized clumps of > points in >> > > two-dimensional space, confirming that differences > between the >> > > groups have become much more important than > differences within >> > > the groups-an accurate portrayal of what > has happened to the >> > > data. Moreover, r produces clusters and > maps very like those >> > > based on other coefficients that AJ&R > mention as possible >> > > replacements, such as a cosine similarity > measure or a chi >> > > square dissimilarity measure. Thus, r > performs well enough for >> > > the purposes of ACA. Accordingly, I > argue that qualitative >> > > information revealing why authors are > cocited is more important > >> > > than the cautions proposed in the > AJ&R critique. I include >> > > notes on topics such as handling the > diagonal in author >> > > cocitation matrices, lognormalizing data, > and testing r for >> > > significance. > >> > > > KeyWords Plus: >> > > > INTELLECTUAL STRUCTURE, SCIENCE >> > > > >> > > > Addresses: >> > > > White HD, Drexel Univ, Coll Informat Sci & Technol, 3152 >> > > > Chestnut St, Philadelphia, PA 19104 USA Drexel Univ, Coll >> > > > Informat Sci & Technol, Philadelphia, PA 19104 USA >> > > > >> > > > Publisher: >> > > > JOHN WILEY & SONS INC, 111 RIVER ST, HOBOKEN, NJ 07030 USA >> > > > >> > > > IDS Number: >> > > > 730VQ >> > > > >> > > > >> > > > Cited Author Cited Work Volume >> > > > Page Year >> > > > ID >> > > > >> > > > AHLGREN P J AM SOC INF SCI TEC 54 >> > > > 550 2003 >> > > > BAYER AE J AM SOC INFORM SCI 41 >> > > > 444 1990 >> > > > BORGATTI SP UCINET WINDOWS SOFTW >> > > > 2002 >> > > > BORGATTI SP WORKSH SUNB 20 INT S >> > > > 2000 >> > > > DAVISON ML MULTIDIMENSIONAL SCA >> > > > 1983 >> > > > EOM SB J AM SOC INFORM SCI 47 >> > > > 941 1996 >> > > > EVERITT B CLUSTER ANAL >> > > > 1974 >> > > > GRIFFITH BC KEY PAPERS INFORMATI >> > > > R6 1980 >> > > > HOPKINS FL SCIENTOMETRICS 6 >> > > > 33 1984 >> > > > HUBERT L BRIT J 
MATH STAT PSY 29 >> > > > 190 1976 >> > > > LEYDESDORFF L INFORMERICS 87 88 >> > > > 105 1988 >> > > > MCCAIN KW J AM SOC INFORM SCI 41 >> > > > 433 1990 >> > > > MCCAIN KW J AM SOC INFORM SCI 37 >> > > > 111 1986 >> > > > MCCAIN KW J AM SOC INFORM SCI 35 >> > > > 351 1984 >> > > > MULLINS NC THEORIES THEORY GROU >> > > > 1973 >> > > > WHITE HD BIBLIOMETRICS SCHOLA >> > > > 84 1990 >> > > > WHITE HD J AM SOC INF SCI TEC 54 >> > > > 423 2003 >> > > > WHITE HD J AM SOC INFORM SCI 49 >> > > > 327 1998 >> > > > WHITE HD J AM SOC INFORM SCI 41 >> > > > 430 1990 >> > > > WHITE HD J AM SOC INFORM SCI 32 >> > > > 163 1981 >> > > > >> > > > >> > > > When responding, please attach my original message >> > > > ______________________________________________________________ >> > > > _________ >> > > > Eugene Garfield, PhD. email: garfield at codex.cis.upenn.edu >> > > > home page: www.eugenegarfield.org >> > > > Tel: 215-243-2205 Fax 215-387-1266 >> > > > President, The Scientist LLC. www.the-scientist.com >> > > > Chairman Emeritus, ISI www.isinet.com >> > > > Past President, American Society for Information Science and >> > > > Technology >> > > > (ASIS&T) www.asis.org >> > > > ______________________________________________________________ >> > > > _________ >> > > > >> > > > >> > > > >> > > > ISSN: >> > > > 1532-2882 >> > > > >> > > >> > >> > >> > -- >> > --------------------------------------------------------------- >> > Steven A. Morris samorri at okstate.edu >> > Electrical and Computer Engineering office: 405-744-1662 >> > 202 Engineering So. >> > Oklahoma State University >> > Stillwater, Oklahoma 74078 >> > http://samorris.ceat.okstate.edu >> > >> > > > content="text/html;charset=ISO-8859-1"> > > > > > >Dear Dr. Leysesdorff,

The message below was sent as a reply to you on the Sigmetrics mailing list about a week ago. However, I'm not sure if the list server is working at the moment. If you've received this before then please forgive me for having sent it to you twice.

Very kind regards,

Steven Morris

------------------------------------------------------------------------

Dear Loet,

Thanks very much for your interesting remarks.
In answer to item 1 below, I have always converted the paper to reference authors matrix and the paper to term matrix to binary matrices so that co-occurrences can be calculated easily by multiplying the matrices by their transpose. I'd actually never thought of using the cosine formula that you give below. I did try that calculation on non-binary paper to reference authors matrices using:

    cosine(x,y) = Sigma(i) x(i)y(i) / sqrt( Sigma(i) x(i)^2 * Sigma(i) y(i)^2 )
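
[Editorial sketch, not part of the original message.] A minimal Python illustration of the formula above, applied once to raw citation counts and once to the same vectors reduced to 0/1. The two vectors are hypothetical counts for two reference authors across five papers. On binary data the general formula should coincide with n(i,j)/sqrt[n(i)*n(j)].

```python
import math

def cosine(x, y):
    """Cosine between two equal-length vectors of citation counts."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def binarize(v):
    """Reduce counts to 1 (cited at least once) or 0 (not cited)."""
    return [1 if a > 0 else 0 for a in v]

# Hypothetical counts: times each author is cited in each of five papers
x = [3, 0, 1, 2, 0]
y = [1, 1, 0, 4, 0]

full_cos = cosine(x, y)                        # "non-binary" cosine
binary_cos = cosine(binarize(x), binarize(y))  # "binary" cosine

# The same binary value from co-occurrence counts
bx, by = binarize(x), binarize(y)
n_ij = sum(a * b for a, b in zip(bx, by))  # papers citing both authors
n_i, n_j = sum(bx), sum(by)                # papers citing each author
ochiai = n_ij / math.sqrt(n_i * n_j)
```

For these made-up vectors the two values differ (about 0.69 against about 0.67), which is the kind of scatter between the measures discussed in this thread.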

I crossplotted the similarity values thus obtained against "binary" cosine similarity values. The results can be seen at:
http://samorris.ceat.okstate.edu/web/non_bin_cos/default.htm
There does appear to be a lot of scatter between these two measures, though in most of the paper collections it doesn't appear to be biased off the 1:1 line. I don't know what effect this difference would have on clustering of authors. I'm not sure I agree with you that using the binary version of the cosine similarity is "throwing away information." After all, references are cited multiple times in papers but the data we have available (from ISI) only shows that a reference showed up at least once, yet the data is still very useful. Granted that knowing the exact number of times an author was cited in a paper adds more information, I'm still not sure that using the non-binary cosine formula above is the most appropriate way to exploit that extra information. Alternate approaches are available, for example, using the 'overlap' measure.

I have tried using an "overlap" function to compute cocitation counts for cosine calculations. For a paper, the overlap of ref author i and ref author j is defined as min[m(i), m(j)], where m(i) and m(j) are the number of times author i and author j were cited in the paper respectively. This appears to be a reasonable measure of multiple co-citation as it doesn't give a lot of weight to co-citations with authors that tend to appear many times in papers. So "overlap cosine similarity" can be calculated using s(i,j) = sum[overlap(i,j)] / sqrt( n(i)*n(j) ), where the sum is over all papers and n(i) and n(j) are the sum over all papers of the number of citations to author i and j respectively. For the datasets I have, you can see crossplots of "overlap cosine similarity" against "binary cosine similarity" at:
http://samorris.ceat.okstate.edu/web/overlap/default.htm
These plots show that overlap similarity tends to be a little larger than binary similarity. This may imply that the overlap method generally tends to increase similarity over the binary method, but proportionally, so that there is no effect on distances between authors and thus no effect, bad or good, on clustering.
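
[Editorial sketch, not part of the original message.] The overlap definition above can be written out directly. The function name and the per-paper counts are hypothetical and chosen only to show the arithmetic.

```python
import math

def overlap_cosine(m_i, m_j):
    """s(i,j) = sum over papers of min[m(i), m(j)] / sqrt(n(i) * n(j)),
    where n(i) and n(j) are the total citations to each author."""
    overlap = sum(min(a, b) for a, b in zip(m_i, m_j))
    return overlap / math.sqrt(sum(m_i) * sum(m_j))

# Hypothetical counts: times each author is cited in each of five papers
m_i = [2, 0, 1, 2, 0]
m_j = [2, 1, 0, 3, 0]

s_overlap = overlap_cosine(m_i, m_j)

# On 0/1 data, min[m(i), m(j)] is 1 exactly when both authors are cited,
# so the overlap form gives n(i,j) / sqrt(n(i) * n(j)), the binary cosine.
b_i = [1 if a > 0 else 0 for a in m_i]
b_j = [1 if a > 0 else 0 for a in m_j]
s_binary = overlap_cosine(b_i, b_j)
```

With these made-up counts the overlap value (about 0.73) is somewhat larger than the binary value (about 0.67). A single toy example of course says nothing about real collections.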
 
On point 2 below, similarity between a pair of authors using a co-citation count matrix is based on whether those two authors are cocited in the same proportions among the other authors. Correlation seems a natural measure for this, as it is the measure used for estimating linear dependence. Also it would seem that negative correlation would be applicable:

Suppose there are two "camps" among a group of 10 authors and that the 1st and 10th authors are the leaders of the two groups respectively. Assume two authors have the following co-citation counts:

x = [  1   2   3   4   5   6   7   8   9  10 ]
y = [ 10   9   8   7   6   5   4   3   2   1 ]

so author x is in author 10's camp and author y is in author 1's camp.

In this case rxy = -1, and (1+rxy)/2 gives a similarity of 0, while the cosine similarity is s = 0.5714.

So the rxy similarity shows the authors as dissimilar (logical since they belong to different camps), while cosine similarity shows that they are similar. Wouldn't this type of effect be a problem with using the cosine similarity for co-citation count matrices?
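
[Editorial sketch, not part of the original message.] The two-camps example above can be checked with a short Python script. The helper names are illustrative. The vectors x and y are the ones given in the message.

```python
import math

def cosine(x, y):
    """Cosine between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def pearson(x, y):
    """Pearson's r, or None if either vector has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None  # rxy is undefined for a constant vector
    return cov / math.sqrt(vx * vy)

x = list(range(1, 11))      # 1, 2, ..., 10
y = list(range(10, 0, -1))  # 10, 9, ..., 1

r = pearson(x, y)      # perfectly anti-correlated
sim_r = (1 + r) / 2    # rxy rescaled to the range 0..1
s = cosine(x, y)       # 220 / 385

print(round(r, 4), round(sim_r, 4), round(s, 4))
```

The cosine stays well above zero here because all counts are positive, so the two vectors point into the same quadrant even though they run in opposite order.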

With correlation there is still the problem of what to do with authors that have zero variance or cocitation count matrices that have large numbers of zeros.

Thanks kindly,

Steven Morris




Loet Leydesdorff wrote:
Message-ID:

Dear Steven,

Thank you for the reference.

I think that your metaphor of a pot of chicken and dumplings suffices when one is interested in information retrieval (using Bradford's law). However, when the research question is, for example, about the structure of science, the delineations and the parameter choices become of the utmost importance. Since there is a wealth of similarity criteria and clustering algorithms, one would be able to produce almost any representation. But some representations are better than others!

With kind regards,

Loet

_____

Loet Leydesdorff
Amsterdam School of Communications Research (ASCoR)
Kloveniersburgwal 48, 1012 CX Amsterdam
Tel.: +31-20- 525 6598; fax: +31-20- 525 3681
loet at leydesdorff.net ; http://www.leydesdorff.net/

The Challenge of Scientometrics ; The Self-Organization of the Knowledge-Based Society

> -----Original Message-----
> From: Steven A. Morris [mailto:samorri at OKSTATE.EDU]
> Sent: Wednesday, January 07, 2004 4:57 PM
> To: SIGMETRICS at LISTSERV.UTK.EDU; Loet Leydesdorff
> Cc: Steven A. Morris
> Subject: Re: White HD "Author cocitation analysis and ...
>
>
> Dear Loet,
>
> Thank you for a fruitful discussion. I guess I don't have
> much to add at this point. To be truthful, in practice I
> haven't noticed a lot of difference in results when using
> different similarity measures: rxy, "binary" cosine, or
> "non-binary" cosine, or overlap. This isn't surprising, since
> authors are usually clustered in a "core and scatter"
> pattern, comprised of small, well-defined "core" groups of
> closely related authors and a large "scatter" group of
> authors with sparse and highly overlapping relations. So a
> map of a collection of authors is like looking down on a pot
> of chicken and dumplings, it's easy to spot the dumplings
> (the core groups), but the stew (scatter authors) always
> looks different depending on how you stirred the pot (rxy, or
> cosine or whatever)..
> > The overlap similarity is just another way of defining a > co-citation count. I think it was introduced by Salton or one > of his co-workers. You can find a good discussion of overlap > similarity in: Jones, W. P. and G. W. Furnas (1987). > "Pictures of relevance: A geometrical analysis of similarity > measures." Journal of the American Society for Information > Science and Technology 38(6): 420-442., This paper also > discusses many other similarity measures such as rxy, cosine, > dice and so forth, and I think it gives a good discussion of > the merits of each type of measure. > > Thanks kindly, > > S. Morris > > > > > > On Tue, 6 Jan 2004 20:20:20 +0100, Loet Leydesdorff > > wrote: > > >Dear Steven, > > > >Thank you for communicating these experimental results. They are > >interesting. > > > >It seems to me that you have convincingly shown that the two > measures > >(the binary and the non-binary one) are different in the case that > >there is information available at a measurement scale higher than > >dichotomous (e.g., at the interval level). Of course, if one > has only > >binary information, one can use the binary formulation of > the formula, > >but this is generated only because the square or the root of one is > >also one, and the square or root of zero is also zero. Thus, > the cosine > >is defined more generally in terms of what you call the non-binary > >formulation. > > > >I don't agree with the overlap function. It seems to me most > naturally > >to return to the original matrix of authors cited as cases and > >citations as variables (columns). A cocitation is then the case that > >two cells are filled in the same column. One can then > compute cosines > >between authors as the cases. Choose within SPSS for Analyze > > >Correlate > Distances and you find all the options, > including cosines > >between cases. There is no need for the invention of a new > function, in > >my opinion. 
> > > >With kind regards, > > > > > >Loet > > > > > > > >Dear Loet, > > > >Thanks very much for your interesting remarks. > >In answer to item 1 below, I have always converted the paper to > >reference authors matrix and paper to term matrix to binary > matrices so > >that co-occurences can be calculated easily by multiplying > the matrices > >by their transpose. I'd actually never thought of using the cosine > >formula that you give below. I did try that calculation on > non-binary > >paper to reference authors matrices using: > > > > cosine(x,y) = Sigma(i) x(i)y(i) / sqrt(Sigma(i) x(i)^2) * > >Sigma(i) y(i)^2)) > > > >I crossplotted the similarity values thus obtained against "binary" > >cosine similarity values. The results can be seen at: > >http://samorris.ceat.okstate.edu/web/non_bin_cos/default.htm > >There does appear to be a lot of scatter between these two measures, > >though in most of the paper collections it doesn't appear to > be biased > >off the 1:1 line. I don't know what effect this difference > would have > >on clustering of authors. I'm not sure I agree with you that > using the > >binary version of the cosine similarity is "throwing away > information." > > > >After all, references are cited multiple times in papers but > the data > >we > > > >have available (from ISI) only shows that a reference showed up at > >least > > > >once, yet the data is still very useful. Granted that knowing the > >exact > > > >number of times an author was cited in a paper adds more > information, > >I'm still not sure that using the non-binary cosine formula above is > >the > > > >most appropriate way to exploit that extra information. Alternate > >approaches are available, for example, using the 'overlap' measure. > > > >I have tried using an "overlap" function to compute > cocitation counts > >for cosine calculations. 
> >For a paper the overlap of ref author i and ref author j is defined
> >as min[m(i), m(j)], where m(i) and m(j) are the number of times author
> >i and author j were cited in the paper respectively. This appears to
> >be a reasonable measure of multiple co-citation as it doesn't give a
> >lot of weight to co-citations with authors that tend to appear many
> >times in papers. So "overlap cosine similarity" can be calculated
> >using
> >
> >    s(i,j) = sum[overlap(i,j)] / sqrt( n(i)*n(j) )
> >
> >where the sum is over all papers and n(i) and n(j) are the sum over
> >all papers of the number of citations to author i and j respectively.
> >For the datasets I have, you can see crossplots of "overlap cosine
> >similarity" against "binary cosine similarity" at:
> >http://samorris.ceat.okstate.edu/web/overlap/default.htm . These plots
> >show that overlap similarity tends to be a little larger than binary
> >similarity. This may imply that the overlap method generally tends to
> >increase similarity over the binary method, but proportionally, so
> >that there is no effect on distances between authors and thus no
> >effect, bad or good, on clustering.
> >
> >On point 2 below, similarity between a pair of authors using a
> >co-citation count matrix is based on whether those two authors are
> >cocited in the same proportions among the other authors. Correlation
> >seems a natural measure for this, as it is the measure used for
> >estimating linear dependence. Also it would seem that negative
> >correlation would be applicable:
> >
> >Suppose there are two "camps" among a group of 10 authors and that the
> >1st and 10th authors are the leaders of the two groups respectively.
> >Assume two authors have the following co-citation counts:
> >
> >x = [  1   2   3   4   5   6   7   8   9  10 ]
> >y = [ 10   9   8   7   6   5   4   3   2   1 ]
> >
> >so author x is in author 10's camp and author y is in author 1's camp.
> >
> >In this case rxy = -1, and (1+rxy)/2 gives a similarity of 0.
> >   while cosine s = 0.5714 as cosine similarity.
> >
> >So the rxy similarity shows the authors as dissimilar (logical since
> >they belong to different camps),
> >   while cosine similarity shows that they are similar. Wouldn't this
> >type of effect be a problem with using the cosine similarity for
> >co-citation count matrices?
> >
> >With correlation there is still the problem of what to do with authors
> >that have zero variance or cocitation count matrices that have large
> >numbers of zeros.
> >
> >Thanks kindly,
> >
> >Steven Morris
> >
> >
> >Loet Leydesdorff wrote:
> >
> >> Dear Steve,
> >>
> >> Thank you for the interesting contribution. Let me make a few
> >> remarks:
> >>
> >> 1. Why did you reduce the matrices studied to binary ones? ("The
> >> (i,j)th element of O(p,ra) is unity if paper i cites reference author
> >> j one or more times, zero otherwise." at
> >> http://samorris.ceat.okstate.edu/web/rxy/default.htm .) Both r and
> >> the cosine are well defined for frequency distributions.
> >>
> >> The cosine between two vectors x(i) and y(i) is defined as:
> >>
> >>     cosine(x,y) = Sigma(i) x(i)y(i) /
> >>                   sqrt( Sigma(i) x(i)^2 * Sigma(i) y(i)^2 )
> >>
> >> In the case of the binary matrix this formula degenerates to the
> >> simpler format that you used:
> >>
> >>     cos = n(i,j)/sqrt[n(i)*n(j)]
> >>
> >> SPSS calls this simpler format the "Ochiai". Salton & McGill (1983)
> >> provided the full formula in their "Introduction to Modern
> >> Information Retrieval" (Auckland, etc.: McGraw-Hill).
> >>
> >> There seems no reason to throw away part of the information that is
> >> available in your datasets. I would be curious to see how your curves
> >> would look using the full data. I expect some effects.
> >>
> >> 2. Why would your reasoning not hold for ACA? For rough-and-ready
> >> purposes, one may wish to use either measure as White (2003) posits.
> >> However, the fundamental points remain the same, don't they? One
> >> could also have a zero variance in an ACA matrix, couldn't one? The
> >> problem with the zeros signalled by Ahlgren et al. (2003) remains
> >> also in this case, doesn't it?
> >>
> >> 3. In addition to the technical differences, there may be
> >> differences stemming from the research design that make the
> >> researcher decide to use one or the other measure. For example, in a
> >> factor analytic design one uses Pearson's r. For mapping purposes
> >> one may also consider the Euclidean distance, but this is expected
> >> to provide very different results. The theoretical purposes of the
> >> research have first to be specified, in my opinion.
> >>
> >> 4. My interest in this issue is driven by my interest in the
> >> evolution of communication systems. One can expect communication
> >> systems to develop in different phases such as segmentation,
> >> stratification, and differentiation. In a segmented communication
> >> system only mutual relations would count. Euclidean distances may be
> >> the right measure.
> >>
> >> In a fully differentiated one, one would expect eigenvectors to be
> >> spanned orthogonally at the network level. Here factor analysis
> >> provides us with insights into the structural differentiation. In
> >> the in-between stage a stratified communication system is expected
> >> to be hierarchically organized. The grouping is then reduced to a
> >> ranking. For this case, the cosine seems a good mapping tool since
> >> it organizes the "star" of the network in the center of the map
> >> (using a visualization tool). Pearson's r in this case has the
> >> disadvantages mentioned previously during this discussion.
> >>
> >> The Jaccard index seems to operate somewhere between the Euclidean
> >> distance and the cosine. It focusses on segments, but the
> >> interpretation is closer to the cosine than to the Euclidean
> >> distance measure.
> >> Thus, I am not sure that one should use this measure in an
> >> evolutionary analysis.
> >>
> >> I mentioned the forthcoming paper of Caroline Wagner and me about
> >> coauthorship relations (http://www.leydesdorff.net/sciencenets ) in
> >> which we showed how the cosine-based analysis and mapping versus the
> >> Pearson-correlation based factor analysis enabled us to explore
> >> different aspects of the same matrix. These different aspects can be
> >> provided with different interpretations: the hierarchy in the
> >> network and the competitive relations among leading countries,
> >> respectively. But I still have to develop the fundamental argument
> >> more systematically.
> >>
> >> With kind regards,
> >>
> >>
> >> Loet
> >>
> >> -----------------------------------------------------------------
> >> Loet Leydesdorff
> >> Amsterdam School of Communications Research (ASCoR)
> >> Kloveniersburgwal 48, 1012 CX Amsterdam
> >> Tel.: +31-20- 525 6598; fax: +31-20- 525 3681
> >> loet at leydesdorff.net ; http://www.leydesdorff.net/
> >>
> >> The Challenge of Scientometrics ; The Self-Organization of the
> >> Knowledge-Based Society
> >>
> >> > -----Original Message-----
> >> > From: ASIS&T Special Interest Group on Metrics
> >> > [mailto:SIGMETRICS at listserv.utk.edu] On Behalf Of Steven Morris
> >> > Sent: Tuesday, December 23, 2003 3:26 AM
> >> > To: SIGMETRICS at listserv.utk.edu
> >> > Subject: Re: [SIGMETRICS] White HD "Author cocitation analysis and
> >> > ...
> >> >
> >> >
> >> > Dear colleagues,
> >> >
> >> > Regarding rxy vs. cosine similarity:
> >> >
> >> > When working with a collection of papers downloaded from the Web of
> >> > Science, where a paper to reference author citation matrix can be
> >> > extracted, the calculation of cosine similarity and rxy, the
> >> > correlation coefficient, are both straightforward.
> >> > Similarity is based on the number of times a pair of authors are
> >> > cited together. N is the number of papers in the collection, n(i)
> >> > and n(j) are the numbers of citations received by ref authors i
> >> > and j, and n(i,j) is the number of papers citing both ref author i
> >> > and ref author j. The correlation coefficient is calculated from
> >> >
> >> >   rxy = [N*n(i,j)-n(i)*n(j)]/sqrt[(N*n(i)-n(i)^2)*(N*n(j)-n(j)^2)]
> >> >
> >> > while the cosine similarity is calculated using
> >> >
> >> >   s = n(i,j)/sqrt[n(i)*n(j)].
> >> >
> >> > If N is large compared to the product of the number of cites
> >> > received by a pair of authors, then the rxy and cosine formulas
> >> > give nearly equal results. See
> >> > http://samorris.ceat.okstate.edu/web/rxy/default.htm
> >> > for crossplots of cosine similarity vs. rxy for reference authors
> >> > from several collections of papers.
> >> >
> >> > For collections of papers without dominant reference authors
> >> > there is very little difference between cosine and rxy. For
> >> > collections with dominant reference authors that are cited by a
> >> > large fraction of the total number of papers, rxy can be much less
> >> > than cosine similarity.
> >> >
> >> > Correlation coefficient is problematic in this case because it is
> >> > possible for pairs of authors with large co-citation counts to
> >> > have zero rxy. For example, two authors, both cited by half the
> >> > papers in the collection, but cocited by 1/4 of the papers will
> >> > have a correlation coefficient of zero but a cosine similarity of
> >> > 1/2. Also, the correlation coefficient is not defined for any
> >> > author that is cited by all papers in the collection, since that
> >> > author has zero variance. Recall that rxy is
> >> > cov(x,y)/sqrt[var(x)*var(y)], so zero variance drives the
> >> > denominator to zero in the rxy equation, thus undefined rxy.
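The two-author example above can be reproduced directly from the binary formulas. This is a minimal Python sketch, assuming a hypothetical collection of N = 100 papers (the counts 50, 50 and 25 are simply "half, half and a quarter" of that N); the function names `binary_rxy` and `binary_cosine` are illustrative and not taken from any package mentioned in the thread.

```python
import math


def binary_rxy(N, n_i, n_j, n_ij):
    """Correlation from binary occurrence counts.

    rxy = [N*n(i,j) - n(i)*n(j)] / sqrt[(N*n(i) - n(i)^2) * (N*n(j) - n(j)^2)]
    Returns None when either author has zero variance, i.e. is cited by
    all papers or by none, because the denominator is then zero.
    """
    numerator = N * n_ij - n_i * n_j
    denominator_sq = (N * n_i - n_i ** 2) * (N * n_j - n_j ** 2)
    if denominator_sq == 0:
        return None
    return numerator / math.sqrt(denominator_sq)


def binary_cosine(n_i, n_j, n_ij):
    """Cosine similarity s = n(i,j) / sqrt[n(i) * n(j)]."""
    return n_ij / math.sqrt(n_i * n_j)


N = 100  # hypothetical number of papers in the collection

# Both authors cited by half the papers, cocited by a quarter of them.
print(binary_rxy(N, 50, 50, 25))   # 0.0
print(binary_cosine(50, 50, 25))   # 0.5

# An author cited by every paper has zero variance, so rxy is undefined.
print(binary_rxy(N, 100, 50, 50))  # None
```

Returning `None` for the zero-variance case is a design choice made for this sketch; raising an exception or returning NaN would be equally defensible.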
> >> >
> >> > For this reason it's probably better to use cosine similarity than
> >> > rxy for ACA analysis based on a paper to ref author matrix.
> >> > Converting similarities to distances for clustering is less
> >> > problematic as well.
> >> >
> >> > The situation is different for ACA based on a co-citation count
> >> > matrix. In this case the similarity between two authors is not
> >> > based on how often they are cited together, but whether the two
> >> > authors are co-cited in the same proportions among the other
> >> > authors in the collection. In this case it would seem that rxy
> >> > would be the appropriate measure of similarity to use.
> >> >
> >> > S. Morris
> >> >
> >> >
> >> > Loet Leydesdorff wrote:
> >> > > > -----Original Message-----
> >> > > > From: ASIS&T Special Interest Group on Metrics
> >> > > > [mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Eugene
> >> > > > Garfield
> >> > > > Sent: Monday, December 01, 2003 9:57 PM
> >> > > > To: SIGMETRICS at LISTSERV.UTK.EDU
> >> > > > Subject: [SIGMETRICS] White HD "Author cocitation analysis
> >> > > > and Pearson's r" Journal of the American Society for
> >> > > > Information Science and Technology 54(13):1250-1259
> >> > > > November 2003
> >> > > >
> >> > > > Howard D. White : Howard.Dalby.White at drexel.edu
> >> > > >
> >> > > > TITLE Author cocitation analysis and Pearson's r
> >> > > >
> >> > > > AUTHOR White HD
> >> > > >
> >> > > > JOURNAL JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION
> >> > > > SCIENCE AND TECHNOLOGY 54 (13): 1250-1259 NOV 2003
> >> > >
> >> > > Dear Howard and colleagues,
> >> > >
> >> > > I read this article with interest and I agree that for most
> >> > > practical purposes Pearson's r will do a job similar to Salton's
> >> > > cosine. Nevertheless, the argument of Ahlgren et al. (2003)
> >> > > seems convincing to me.
> >> > > Scientometric distributions are often highly skewed and the mean
> >> > > can easily be distorted by the zeros. The cosine elegantly solves
> >> > > this problem.
> >> > >
> >> > > A disadvantage of the cosine (in comparison to the r) may be that
> >> > > it does not become negative in order to indicate dissimilarity.
> >> > > This is particularly important for the factor analysis. I have
> >> > > thought about inputting the cosine matrix into the factor
> >> > > analysis (SPSS allows for importing a matrix in this analysis),
> >> > > but that seems a bit tricky.
> >> > >
> >> > > Caroline Wagner and I did a study on coauthorship relations
> >> > > entitled "Mapping Global Science using International
> >> > > Coauthorships: A comparison of 1990 and 2000" (Intern. J. of
> >> > > Technology and Globalization, forthcoming) in which we used the
> >> > > same matrix for mapping using the cosine (and then Pajek for the
> >> > > visualization) and for the factor analysis using Pearson's r. The
> >> > > results are provided as factor plots in the preprint version of
> >> > > the paper at http://www.leydesdorff.net/sciencenets/mapping.pdf .
> >> > >
> >> > > While the cosine maps exhibit the hierarchy by placing the
> >> > > central cluster in the center (including the U.S.A. and some
> >> > > Western-European countries), the factor analysis reveals the main
> >> > > structural axes of the system as competitive relations between
> >> > > the U.S.A., U.K., and continental Europe (Germany + Russia). The
> >> > > French system can be considered as a fourth axis. These
> >> > > eigenvectors function as competitors for collaboration with
> >> > > authors from other (smaller or more peripheral) countries.
> >> > >
> >> > > Thus, the two measures enable us to show something different:
> >> > > Salton's cosine exhibits the hierarchy and one might say that the
> >> > > factor analysis on the basis of Pearson's r enables us to show
> >> > > the heterarchy among competing axes in the system.
> >> > >
> >> > > With kind regards,
> >> > >
> >> > > Loet
> >> > >
> >> > > ------------------------------------------------------------
> >> > > Loet Leydesdorff
> >> > > Amsterdam School of Communications Research (ASCoR)
> >> > > Kloveniersburgwal 48, 1012 CX Amsterdam
> >> > > Tel.: +31-20- 525 6598; fax: +31-20- 525 3681
> >> > > loet at leydesdorff.net ; http://www.leydesdorff.net/
> >> > >
> >> > > The Challenge of Scientometrics ; The Self-Organization of the
> >> > > Knowledge-Based Society
> >> > >
> >> > > >
> >> > > > Document type: Article   Language: English
> >> > > > Cited References: 20   Times Cited: 0
> >> > > >
> >> > > > Abstract:
> >> > > > In their article "Requirements for a cocitation similarity
> >> > > > measure, with special reference to Pearson's correlation
> >> > > > coefficient," Ahlgren, Jarneving, and Rousseau fault
> >> > > > traditional author cocitation analysis (ACA) for using
> >> > > > Pearson's r as a measure of similarity between authors because
> >> > > > it fails two tests of stability of measurement. The
> >> > > > instabilities arise when rs are recalculated after a first
> >> > > > coherent group of authors has been augmented by a second
> >> > > > coherent group with whom the first has little or no
> >> > > > cocitation. However, AJ&R neither cluster nor map their data
> >> > > > to demonstrate how fluctuations in rs will mislead the
> >> > > > analyst, and the problem they pose is remote from both theory
> >> > > > and practice in traditional ACA.
> >> > > > By entering their own rs into multidimensional scaling and
> >> > > > clustering routines, I show that, despite r's fluctuations,
> >> > > > clusters based on it are much the same for the combined groups
> >> > > > as for the separate groups. The combined groups when mapped
> >> > > > appear as polarized clumps of points in two-dimensional space,
> >> > > > confirming that differences between the groups have become
> >> > > > much more important than differences within the groups, an
> >> > > > accurate portrayal of what has happened to the data. Moreover,
> >> > > > r produces clusters and maps very like those based on other
> >> > > > coefficients that AJ&R mention as possible replacements, such
> >> > > > as a cosine similarity measure or a chi square dissimilarity
> >> > > > measure. Thus, r performs well enough for the purposes of ACA.
> >> > > > Accordingly, I argue that qualitative information revealing
> >> > > > why authors are cocited is more important than the cautions
> >> > > > proposed in the AJ&R critique. I include notes on topics such
> >> > > > as handling the diagonal in author cocitation matrices,
> >> > > > lognormalizing data, and testing r for significance.
> >> > > >
> >> > > > KeyWords Plus:
> >> > > > INTELLECTUAL STRUCTURE, SCIENCE
> >> > > >
> >> > > > Addresses:
> >> > > > White HD, Drexel Univ, Coll Informat Sci & Technol, 3152
> >> > > > Chestnut St, Philadelphia, PA 19104 USA
> >> > > > Drexel Univ, Coll Informat Sci & Technol, Philadelphia, PA
> >> > > > 19104 USA
> >> > > >
> >> > > > Publisher:
> >> > > > JOHN WILEY & SONS INC, 111 RIVER ST, HOBOKEN, NJ 07030 USA
> >> > > >
> >> > > > IDS Number:
> >> > > > 730VQ
> >> > > >
> >> > > > Cited Author    Cited Work             Volume  Page  Year  ID
> >> > > >
> >> > > > AHLGREN P       J AM SOC INF SCI TEC   54      550   2003
> >> > > > BAYER AE        J AM SOC INFORM SCI    41      444   1990
> >> > > > BORGATTI SP     UCINET WINDOWS SOFTW                 2002
> >> > > > BORGATTI SP     WORKSH SUNB 20 INT S                 2000
> >> > > > DAVISON ML      MULTIDIMENSIONAL SCA                 1983
> >> > > > EOM SB          J AM SOC INFORM SCI    47      941   1996
> >> > > > EVERITT B       CLUSTER ANAL                         1974
> >> > > > GRIFFITH BC     KEY PAPERS INFORMATI           R6    1980
> >> > > > HOPKINS FL      SCIENTOMETRICS         6       33    1984
> >> > > > HUBERT L        BRIT J MATH STAT PSY   29      190   1976
> >> > > > LEYDESDORFF L   INFORMERICS 87 88              105   1988
> >> > > > MCCAIN KW       J AM SOC INFORM SCI    41      433   1990
> >> > > > MCCAIN KW       J AM SOC INFORM SCI    37      111   1986
> >> > > > MCCAIN KW       J AM SOC INFORM SCI    35      351   1984
> >> > > > MULLINS NC      THEORIES THEORY GROU                 1973
> >> > > > WHITE HD        BIBLIOMETRICS SCHOLA           84    1990
> >> > > > WHITE HD        J AM SOC INF SCI TEC   54      423   2003
> >> > > > WHITE HD        J AM SOC INFORM SCI    49      327   1998
> >> > > > WHITE HD        J AM SOC INFORM SCI    41      430   1990
> >> > > > WHITE HD        J AM SOC INFORM SCI    32      163   1981
> >> > > >
> >> > > >
> >> > > > When responding, please attach my original message
> >> > > > ____________________________________________________________
> >> > > > Eugene Garfield, PhD. email: garfield at codex.cis.upenn.edu
> >> > > > home page: www.eugenegarfield.org
> >> > > > Tel: 215-243-2205 Fax 215-387-1266
> >> > > > President, The Scientist LLC. www.the-scientist.com
> >> > > > Chairman Emeritus, ISI www.isinet.com
> >> > > > Past President, American Society for Information Science and
> >> > > > Technology (ASIS&T) www.asis.org
> >> > > > ____________________________________________________________
> >> > > >
> >> > > > ISSN:
> >> > > > 1532-2882
> >> > > >
> >> >
> >> > --
> >> > ---------------------------------------------------------------
> >> > Steven A. Morris                      samorri at okstate.edu
> >> > Electrical and Computer Engineering   office: 405-744-1662
> >> > 202 Engineering So.
> >> > Oklahoma State University
> >> > Stillwater, Oklahoma 74078   http://samorris.ceat.okstate.edu
> >> >
> >
> >Dear Dr. Leydesdorff,
> >
> >The message below was sent as a reply to you on the Sigmetrics mailing
> >list about a week ago. However, I'm not sure if the list server is
> >working at the moment. If you've received this before then please
> >forgive me for having sent it to you twice.
> >
> >Very kind regards,
> >
> >Steven Morris
> >
> >-----------------------------------------------------------------------
> >
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From bernies at UILLINOIS.EDU  Tue Jan 13 11:53:21 2004
From: bernies at UILLINOIS.EDU (Sloan, Bernie)
Date: Tue, 13 Jan 2004 10:53:21 -0600
Subject: Qualitative citation analysis?
Message-ID: 

Quick question...

Is anyone aware of any studies of citation analysis from a qualitative
perspective?

By "qualitative citation analysis" I mean looking at how the author(s)
of papers use the citations, rather than simply counting citations.

Thanks!

Bernie Sloan
Senior Library Information Systems Consultant, ILCSO
University of Illinois Office for Planning and Budgeting
616 E. Green Street, Suite 213
Champaign, IL 61820

Phone: (217) 333-4895
Fax: (217) 265-0454
E-mail: bernies at uillinois.edu

From Garfield at CODEX.CIS.UPENN.EDU  Tue Jan 13 14:50:01 2004
From: Garfield at CODEX.CIS.UPENN.EDU (Garfield, Eugene)
Date: Tue, 13 Jan 2004 14:50:01 -0500
Subject: Qualitative citation analysis?
Message-ID: 

I think this paper may be the type you are looking for.

http://www.garfield.library.upenn.edu/papers/libquart66(4)p449y1996.pdf

When to cite? Library Quarterly 1996.

Best wishes. EG

When responding, please attach my original message
__________________________________________________
Eugene Garfield, PhD. email: garfield at codex.cis.upenn.edu

-----Original Message-----
From: Sloan, Bernie [mailto:bernies at UILLINOIS.EDU]
Sent: Tuesday, January 13, 2004 11:53 AM
To: SIGMETRICS at LISTSERV.UTK.EDU
Subject: [SIGMETRICS] Qualitative citation analysis?

Quick question...

Is anyone aware of any studies of citation analysis from a qualitative
perspective?

By "qualitative citation analysis" I mean looking at how the author(s)
of papers use the citations, rather than simply counting citations.

Thanks!

Bernie Sloan
Senior Library Information Systems Consultant, ILCSO
University of Illinois Office for Planning and Budgeting
616 E. Green Street, Suite 213
Champaign, IL 61820

Phone: (217) 333-4895
Fax: (217) 265-0454
E-mail: bernies at uillinois.edu

________________________________________________________________________
This email has been scanned for all viruses by the MessageLabs Email
Security System. For more information on a proactive email security
service working around the clock, around the globe, visit
http://www.messagelabs.com
________________________________________________________________________

From loet at LEYDESDORFF.NET  Tue Jan 13 15:04:41 2004
From: loet at LEYDESDORFF.NET (Loet Leydesdorff)
Date: Tue, 13 Jan 2004 21:04:41 +0100
Subject: Qualitative citation analysis?
In-Reply-To: 
Message-ID: 

Dear Bernie,

I worked extensively with Olga Amsterdamska on this question for the
case of a set of papers in biochemistry. The textual usages of citations
in citing texts were published in: "Citations: Indicators of
Significance," Scientometrics 15(5-6) (1989) 449-471. The results of a
questionnaire among the citing authors in: "Dimensions of Citation
Analysis," Science, Technology and Human Values 15 (1990) 305-335.

With kind regards,

Loet

  _____

Loet Leydesdorff
Amsterdam School of Communications Research (ASCoR)
Kloveniersburgwal 48, 1012 CX Amsterdam
Tel.: +31-20- 525 6598; fax: +31-20- 525 3681
loet at leydesdorff.net ; http://www.leydesdorff.net/

The Challenge of Scientometrics ; The Self-Organization of the
Knowledge-Based Society

> -----Original Message-----
> From: ASIS&T Special Interest Group on Metrics
> [ mailto:SIGMETRICS at listserv.utk.edu] On Behalf Of Sloan, Bernie
> Sent: Tuesday, January 13, 2004 5:53 PM
> To: SIGMETRICS at listserv.utk.edu
> Subject: [SIGMETRICS] Qualitative citation analysis?
>
>
> Quick question...
>
> Is anyone aware of any studies of citation analysis from a
> qualitative perspective?
>
> By "qualitative citation analysis" I mean looking at how the
> author(s) of papers use the citations, rather than simply
> counting citations.
>
> Thanks!
>
> Bernie Sloan
> Senior Library Information Systems Consultant, ILCSO
> University of Illinois Office for Planning and Budgeting
> 616 E. Green Street, Suite 213
> Champaign, IL 61820
>
> Phone: (217) 333-4895
> Fax: (217) 265-0454
> E-mail: bernies at uillinois.edu
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From samorri at OKSTATE.EDU  Tue Jan 13 23:18:33 2004
From: samorri at OKSTATE.EDU (Steven A. Morris)
Date: Tue, 13 Jan 2004 23:18:33 -0500
Subject: Qualitative citation analysis?
Message-ID: 

Bernie,

A few months ago I constructed a timeline of Scientometrics papers and
papers citing Scientometrics that may be of use to you; it has many
papers dealing with your subject of interest. The case study itself is
at:

http://samorris.ceat.okstate.edu/web/sciento/default.asp

If you click on the link labeled "Timeline" it will take you to a
timeline visualization, where papers show as circles in horizontal
tracks. (The clustering of papers into tracks was done using
bibliographic coupling, the papers are plotted by publication date.)

Note that track number 23 is labeled "citation behavior", which I assume
is your topic of interest. (The track labels were generated manually, by
browsing titles of papers).

On the left side of the plot, where the label for track 23 is, you'll
see 6 hyperlinks, P, R, AP, AR, JP and JR. (Papers, References, Paper
authors, Reference authors, paper journals, and reference journals).

If you click on 'P' you'll get a list of papers in track 23, most of
which should be on the topic of citation behavior. You should be able to
browse the titles in this list for papers pertinent to your research
topic.
More useful maybe, would be to click on 'R', which will give you a list of references used by the papers in track 23, ranked by number of citations received. The top ranked references are the "base references" for papers in the track and can be considered as knowledge symbols of the concepts used by papers in the track. The top four cited references are: MORAVCSIK MJ, 1975, SOC STUD SCI, V5, P86 "Some results on the function and quality of citations" CHUBIN DE, 1975, SOC STUD SCI, V5, P423 "Content analysis of references..." GILBERT GN, 1977, SOC STUD SCI, V7, P113 "Referencing as persuasion" SMALL HG, 1978, SOC STUD SCI, V8, P327 "Cited documents as concept symbols" It's nice to see that Small's 1978 paper on references as concept symbols has itself become a concept symbol [of the concept of references as concept symbols :-) ]. I think it would be useful for you to review these highly cited base references before looking for more recent papers. Please let me know if you find this helpful. Steven Morris Oklahoma State University On Tue, 13 Jan 2004 10:53:21 -0600, Sloan, Bernie wrote: >Quick question... > >Is anyone aware of any studies of citation analysis from a qualitative >perspective? > >By "qualitative citation analysis" I mean looking at how the author(s) of >papers use the citations, rather than simply counting citations. > >Thanks! > >Bernie Sloan >Senior Library Information Systems Consultant, ILCSO >University of Illinois Office for Planning and Budgeting >616 E. Green Street, Suite 213 >Champaign, IL 61820 > >Phone: (217) 333-4895 >Fax: (217) 265-0454 >E-mail: bernies at uillinois.edu From subbiah_a at YAHOO.COM Tue Jan 13 23:25:29 2004 From: subbiah_a at YAHOO.COM (=?iso-8859-1?q?Subbiah=20Arunachalam?=) Date: Wed, 14 Jan 2004 04:25:29 +0000 Subject: Qualitative citation analysis? 
In-Reply-To: <004e01c3da10$7d5dd860$1202a8c0@loet>
Message-ID:

A long time ago the late Mike Moravcsik and Poovanalingam Murugesan, two physicists, looked at this question. Blaise Cronin covered this question in his cute little book "The Citation Process", published more than two decades ago. Surely there will be references to such papers in Gene Garfield's Essays of an Information Scientist.

Arun

--- Loet Leydesdorff wrote:
> Dear Bernie,
>
> I worked extensively with Olga Amsterdamska on this question for the
> case of a set of papers in biochemistry. The textual usages of citations
> in citing texts were published in: "Citations: Indicators of
> Significance," Scientometrics 15(5-6) (1989) 449-471. The results of a
> questionnaire among the citing authors in: Dimensions of Citation Analysis,
> Science, Technology and Human Values 15 (1990) 305-335.
>
> With kind regards,
>
> Loet
>
> _____
>
> Loet Leydesdorff
> Amsterdam School of Communications Research (ASCoR)
> Kloveniersburgwal 48, 1012 CX Amsterdam
> Tel.: +31-20- 525 6598; fax: +31-20- 525 3681
> loet at leydesdorff.net ; http://www.leydesdorff.net/
>
> The Challenge of Scientometrics ; The Self-Organization of the Knowledge-Based Society
>
> > -----Original Message-----
> > From: ASIS&T Special Interest Group on Metrics
> > [mailto:SIGMETRICS at listserv.utk.edu] On Behalf Of Sloan, Bernie
> > Sent: Tuesday, January 13, 2004 5:53 PM
> > To: SIGMETRICS at listserv.utk.edu
> > Subject: [SIGMETRICS] Qualitative citation analysis?
> >
> > Quick question...
> >
> > Is anyone aware of any studies of citation analysis from a
> > qualitative perspective?
> >
> > By "qualitative citation analysis" I mean looking at how the
> > author(s) of papers use the citations, rather than simply
> > counting citations.
> >
> > Thanks!
> >
> > Bernie Sloan
> > Senior Library Information Systems Consultant, ILCSO
> > University of Illinois Office for Planning and Budgeting
> > 616 E. Green Street, Suite 213
> > Champaign, IL 61820
> >
> > Phone: (217) 333-4895
> > Fax: (217) 265-0454
> > E-mail: bernies at uillinois.edu
> >

From BKN at DB.DK Wed Jan 14 03:24:21 2004
From: BKN at DB.DK (Kirkegaard Nielsen, Brian)
Date: Wed, 14 Jan 2004 09:24:21 +0100
Subject: SV: [SIGMETRICS] Qualitative citation analysis?
Message-ID:

Dear Bernie

Apart from the references mentioned by others, I believe it could be inspiring to look at two review articles:

Small, H. G. (1982). Citation context analysis. Progress in communication science, 3, 287-310. Norwood, NJ.

Liu, M. (1993). Progress in documentation: The complexities of citation practice: a review of citation studies. Journal of Documentation, 49(4), 370-408.

Let me know if this was helpful in answering your question.

With kind regards

Brian Kirkegaard
Department of Information Studies
Royal School of Library & Information Science, Aalborg Branch
Sohngaardsholmsvej 2, DK-9000 Aalborg, DENMARK
Tel. +45 98157922, Direct Tel. +45 98773049, Fax. +45 98151042
E-mail: bkn at db.dk

-----Original Message-----
From: Sloan, Bernie [mailto:bernies at UILLINOIS.EDU]
Sent: 13 January 2004 17:53
To: SIGMETRICS at LISTSERV.UTK.EDU
Subject: [SIGMETRICS] Qualitative citation analysis?

Quick question...

Is anyone aware of any studies of citation analysis from a qualitative perspective?

By "qualitative citation analysis" I mean looking at how the author(s) of papers use the citations, rather than simply counting citations.

Thanks!
Bernie Sloan
Senior Library Information Systems Consultant, ILCSO
University of Illinois Office for Planning and Budgeting
616 E. Green Street, Suite 213
Champaign, IL 61820

Phone: (217) 333-4895
Fax: (217) 265-0454
E-mail: bernies at uillinois.edu

From pmeindl at CHEM.UTORONTO.CA Wed Jan 14 09:20:27 2004
From: pmeindl at CHEM.UTORONTO.CA (Patricia Meindl)
Date: Wed, 14 Jan 2004 09:20:27 -0500
Subject: SIGMETRICS Digest - 12 Jan 2004 to 13 Jan 2004 (#2004-6)
Message-ID:

Bernie,

A number of people have developed taxonomies for classifying the types of citations:

Brooks, T. A. (1986). Evidence of complex citer motivations. Journal of the American Society for Information Science, 37(1), 34-36.
Chubin, D. E., & Moitra, S. D. (1975). Content-Analysis of References - Adjunct or Alternative to Citation Counting. Social Studies of Science, 5(4), 423-441.
Cronin, B. (1980). Some reflections on citation habits in psychology. Journal of Information Science, 2(6), 309-311.
Frost, C. (1979). The use of citations in literary research: A preliminary classification of citation functions. Library Quarterly, 49, 399-414.
Hooten, P. A. (1991). Frequency and functional use of cited documents in information science. Journal of the American Society for Information Science, 42(6), 397-404.
Moravcsik, M. J., & Murugesan, P. (1975). Some results on function and quality of citations. Social Studies of Science, 5(1), 86-92.
Peritz, B. C. (1983). A classification of citation roles for the social sciences and related fields. Scientometrics, 5(5), 303-312.

As you can tell, I am interested in this topic too.

Patricia Meindl
Chemistry Library
University of Toronto

Automatic digest processor wrote:

>There are 5 messages totalling 431 lines in this issue.
>
>Topics of the day:
>
> 1. Qualitative citation analysis? (5)
>
>----------------------------------------------------------------------
>
>Date: Tue, 13 Jan 2004 10:53:21 -0600
>From: "Sloan, Bernie"
>Subject: Qualitative citation analysis?
> >Quick question... > >Is anyone aware of any studies of citation analysis from a qualitative >perspective? > >By "qualitative citation analysis" I mean looking at how the author(s) of >papers use the citations, rather than simply counting citations. > >Thanks! > >Bernie Sloan >Senior Library Information Systems Consultant, ILCSO >University of Illinois Office for Planning and Budgeting >616 E. Green Street, Suite 213 >Champaign, IL 61820 > >Phone: (217) 333-4895 >Fax: (217) 265-0454 >E-mail: bernies at uillinois.edu > > > > From harnad at ECS.SOTON.AC.UK Wed Jan 14 16:41:32 2004 From: harnad at ECS.SOTON.AC.UK (Stevan Harnad) Date: Wed, 14 Jan 2004 21:41:32 +0000 Subject: What is the threshold for open access Nirvana? In-Reply-To: <92F9C07FBAB86D4D9B76656180AC94770715FFDB@isi-mail.isinet.com> Message-ID: On Wed, 14 Jan 2004, Garfield, Eugene wrote: > You have avoided my main point by regurgitating to me what you have stated > before. However, I appreciate your prompt response. Don't you ever sleep? > When responding, please attach my original message Gene, sorry I passed over your main point! (I am usually accused of not letting anything pass! Maybe it *is* lack of sleep!) Here again is the whole of your original message (to which I replied at: http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/3427.html ). To this first paragraph: > I have generally avoided discussion in this listserv but I think you have > introduced a significant distortion to the discussion by quoting the figure > of 24,000 scientific journals which allegedly produce 2,500,000 articles per > year. I presume someone has estimated the average of 100 articles per year. > A more realistic figure for journals would be ten to fifteen thousand > scientific journals putting aside the crucial question of definition. I replied that the 24K figure comes from ulrichs and that it is not for *scientific* journals, but for *peer-reviewed* journals, both scientific and scholarly. 
(But this was not your main point, apparently.) http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/3427.html Your second paragraph, to which I did not reply first time, was: > If open access is to become viable it seems to me the key factor is the > group of 500 to 1000 highest impact journals which account for a substantial > portion of the significant articles which are published and most cited. > Unless these journals make it possible for authors to self-archive or to be > freely accessible you cannot achieve open access nirvana. One might argue > that once e.g. 50% or more of these most important journals are in the fold > the breakthrough threshold has been reached. Please look at the Romeo Journals Table: http://www.lboro.ac.uk/departments/ls/disresearch/romeo/Romeo%20Publisher%20Policies.htm It shows that 55% of the journals sampled (the Romeo sample was of the top 7000 of the 24,000) are already OA ("gold") journals (about 5%) and 50% are "green" (TA journals that support author self-archiving). http://www.ecs.soton.ac.uk/~harnad/Temp/self-archiving_files/Slide0050.gif An undetermined portion of the remaining 45% will also agree to author self-archiving if asked. (I expect that the rising tide of OA consciousness in the research community today will raise the 55% figure considerably.) I leave it to you to tell us whether the top 500-1000 journals are among the 55% listed as green or gold. But as you see, we are already past 55% overall, which proves only one thing: That the problem is not the publishing community! For although at least 55% of journals are already gold or green/blue, far from 55% of articles are OA! http://www.ecs.soton.ac.uk/~harnad/Temp/self-archiving_files/Slide0049.gif What that means is (1) far from all authors who have a suitable gold journal to publish in are publishing in gold journals, and (2) far from all authors who publish in a green journal are self-archiving their articles. 
(The shortfall is far more striking and ironic in the case of self-archiving, because its ceiling is so much higher.) So what does this say about your suggestion of a 50% "breakthrough threshold"? That the 50% breakthrough point may need to be the percentage of the research community actually grasping the OA that is within their reach, rather than just the percentage of the publishing community that puts it within their reach (in response to the ostensible demand, to publishers, by the research community, for the benefits of open access!) "Petitions, Boycotts, and Liberating the Refereed Literature Online" http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/0933.html http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2053.html http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/3061.html http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/3089.html This is why I have been beating the drums about the need for a systematic policy of open-access provision by institutions and research funders. This natural extension of the "publish or perish" rule is needed to induce the research community to reach for what is in their own best interest, and within its grasp: http://www.ecs.soton.ac.uk/~harnad/Temp/self-archiving_files/Slide0028.gif To your third paragraph: > Since it has been demonstrated that on line access improves both readership > and citation impact we can certainly expect that the vast majority of the > low impact journals would be well advised to make their journals open > access. Whether this increases their impact remains to be seen, but > increased readership or attention seems inevitable. I replied with a list of references on the empirical evidence for the fact that increasing access increases impact -- both download (reading) impact and citation impact (the former coming before the latter, and strongly correlated with it, hence predictive of it). 
http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/3427.html I should have added, though, that as far as I know, no one has reported any evidence suggesting that the impact-enhancing effects of open-access are limited to articles in low-impact journals! *All* articles and *all* authors stand to benefit from open access: There *might* be some ceiling effects there, but I doubt it. There are just too many would-be users at Have-Not institutions worldwide who would read, use and cite your article if only they could access it! http://www.eprints.org/self-faq/#29.Sitting http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/3177.html Best wishes, Stevan Harnad NOTE: A complete archive of the ongoing discussion of providing open access to the peer-reviewed research literature online (1998-2004) is available at the American Scientist Open Access Forum: To join the Forum: http://amsci-forum.amsci.org/archives/American-Scientist-Open-Access-Forum.html Post discussion to: american-scientist-open-access-forum at amsci.org Hypermail Archive: http://www.cogsci.soton.ac.uk/~harnad/Hypermail/Amsci/index.html Unified Dual Open-Access-Provision Policy: BOAI-2 ("gold"): Publish your article in a suitable open-access journal whenever one exists. http://www.earlham.edu/~peters/fos/boaifaq.htm#journals BOAI-1 ("green"): Otherwise, publish your article in a suitable toll-access journal and also self-archive it. http://www.eprints.org/self-faq/ http://www.soros.org/openaccess/read.shtml http://www.ecs.soton.ac.uk/~harnad/Temp/berlin.htm From loet at LEYDESDORFF.NET Thu Jan 15 11:33:20 2004 From: loet at LEYDESDORFF.NET (Loet Leydesdorff) Date: Thu, 15 Jan 2004 17:33:20 +0100 Subject: White HD "Author cocitation analysis and ... In-Reply-To: Message-ID: Dear Steven and colleagues, I now read the paper of William P. Jones and George W. 
Furnas, JASIST, 38(6), 1987, 420-442, and it is really enlightening because they explain the difference between the cosine and the Pearson so that I as a non-mathematician can clearly understand it. The Pearson correlation is just the cosine applied to the vectors after normalization to the mean. Thus, while the cosine can be written:

cos(x,y) = Sigma (x * y) / {sqrt(Sigma x^2) * sqrt(Sigma y^2)}

the Pearson is precisely the same formula with x replaced by {x - mean(x)} and y by {y - mean(y)}:

r(x,y) = Sigma ({x - mean(x)} * {y - mean(y)}) / {sqrt(Sigma {x - mean(x)}^2) * sqrt(Sigma {y - mean(y)}^2)}

I had never seen this connection. (I apologize for the notation in ASCII.)

It follows clearly that the effects of this normalization will be minimal if the distribution is normal, but the more the distribution deviates from normal, the less the mean is meaningful as a parameter, and the cosine then outperforms the Pearson. The authors say it as follows:

"The use of moment normalization in these measures introduces additional potential drawbacks. (...) Moment normalization removes a degree of freedom from the expressive power of query and object vectors."

Does this all lead to a recommendation of using the cosine matrix instead of the Pearson matrix as input to (for example) factor analysis in the case of non-normal distributions? Is there a statistician listening on the list who can answer this question? Or does the factor analysis require the parametric statistics as input? References? (SPSS allows for the input of an external matrix in the multivariate routines.)

With kind regards,

Loet
_____
Loet Leydesdorff
Amsterdam School of Communications Research (ASCoR)
Kloveniersburgwal 48, 1012 CX Amsterdam
Tel.: +31-20- 525 6598; fax: +31-20- 525 3681
loet at leydesdorff.net ; http://www.leydesdorff.net/
The Challenge of Scientometrics ; The Self-Organization of the Knowledge-Based Society
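
The relationship between the two measures described in the message above can be illustrated with a minimal Python sketch. The cocitation counts and the function names `cosine` and `pearson` are made up for illustration and are not taken from Jones and Furnas or from any dataset discussed on the list. The sketch computes Pearson's r as the cosine of the mean-centered vectors, then appends shared zeros (standing in for authors unrelated to either of the two being compared) to show that the cosine is unaffected while the Pearson changes, because the zeros shift the means.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson's r, computed as the cosine of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical cocitation counts of two authors with four other authors.
x = [3, 1, 4, 2]
y = [1, 3, 2, 4]

# The same two authors after three unrelated authors are added to the set;
# neither x nor y is cocited with them, so both vectors gain zeros.
x_padded = x + [0, 0, 0]
y_padded = y + [0, 0, 0]

print(round(cosine(x, y), 4))                # 0.7333
print(round(cosine(x_padded, y_padded), 4))  # 0.7333 (zeros add nothing to any sum)
print(round(pearson(x, y), 4))               # -0.6
print(round(pearson(x_padded, y_padded), 4)) # 0.4909 (the means have shifted)
```

With these particular numbers the sign of r flips when the zeros are added, while the cosine stays where it was. Whether that sensitivity is a defect or a source of information is the point under dispute in this thread; the sketch only shows the arithmetic.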
From notsjb at LSU.EDU Fri Jan 16 12:35:53 2004
From: notsjb at LSU.EDU (Stephen J Bensman)
Date: Fri, 16 Jan 2004 11:35:53 -0600
Subject: Pearson's r and ACA
Message-ID:

Loet and Steven,

Don Kraft forwarded to me your discussion on the utilization of measures in ACA, and I have decided to add my two cents. Please excuse my interference in your discussion, but I have already become involved in this debate in various ways. A comment of mine on this matter will shortly be published in JASIST, so I might as well make clearer some of the reasoning that underlies my position. I am neither an expert in author cocitation analysis (ACA) nor a mathematical expert. In general I may be considered a data person, who uses standard statistical techniques, and it is from this position that I approached the controversy that erupted on the pages of JASIST in the two articles cited below.

In general I favor the Pearson approach that was developed by White over the measures suggested by Ahlgren, Jarneving, and Rousseau. This is for the following reasons.

1) Utilization of the Pearson has been the standard method in ACA, and it has worked quite well until now. There is now a body of research based upon it, to which one can compare one's results to gain insights from the work of others.

2) The Pearson operates within an established system of hypothesis testing. However, for it to operate properly within this system, all the variables have to be normally distributed. This is usually accomplished by testing for the underlying distributions, and, if these are not normal, one performs a mathematical transformation to accomplish this objective--square root if the distributions are Poisson, and logarithmic if the distributions are highly skewed. As in most social and biological research, where the processes are also multiplicative, information science data usually require the logarithmic transformation to meet the linear and additive requirements of the Pearson.
However, as White demonstrates, the Pearson is very distributionally robust. In general, White seems to consider the Pearson as only a sorting mechanism and does not consider tests of significance as important in ACA.

3) The Pearson measures both similarities and dissimilarities, showing similarities as positive correlations and dissimilarities as negative correlations. This is crucial in ACA, whose main purpose is to partition authors into different sets. One can easily see partitions in matrices, because similarities are positive, dissimilarities are negative, and no relationship is zero. Therefore, there is a full scale of measurement. The measures suggested by Ahlgren, Jarneving, and Rousseau measure only similarities on a scale from 0 to 1 with partition somewhat incongruously at 0.5, and it is difficult for me to see the logic of such measures in ACA.

4) The Pearson is sensitive to zeros, and correlations do change when persons not related to members of a given set of persons are added to this set. However, unlike Ahlgren, Jarneving, and Rousseau, who regard this as a major fault of the Pearson and axiomatically posit that the relations must remain invariant, I regard this as a major advantage of the Pearson. First, it is only natural for relations among persons to change when foreign persons are added to their mix. One can think of numerous social situations where this happens. Second, the very changes may be themselves informative and lead to further analyses and understanding of the relationships. Ahlgren, Jarneving, and Rousseau axiomatically block this.

However, there may be situations where such changes in correlations upon the addition of zeros may be detrimental to the proper understanding of the data. This Ahlgren, Jarneving, and Rousseau have adamantly refused to demonstrate, baldly stating to me that their job is only to dream up axioms and does not include testing them against any reality.
In his demonstration White clearly showed that the Pearson leads to exactly the same results as the measures proposed by Ahlgren, Jarneving, and Rousseau, and Rousseau has admitted to me that their measures are not superior to the Pearson. However, White was only working with the same extreme case set up by Ahlgren, Jarneving, and Rousseau, and what really needs to be done is to take a set of data, apply both measures to it, and see how their different actions could lead to different interpretations of the data. If both measures lead to the same result, then there is no case at all for utilizing the measures proposed by Ahlgren, Jarneving, and Rousseau.

I hope you find the above useful. If you have any comments or criticisms, please send them to me. I would be interested in hearing your opinions.

Respectfully,
Stephen J. Bensman
LSU Libraries
Louisiana State University
Baton Rouge, LA
USA

ATTACHMENT BELOW:

(See attached file: Rousseau-White.doc)

Ahlgren, P., Jarneving, B., and Rousseau, R. (2003). Requirements for a cocitation similarity measure, with special reference to Pearson's correlation coefficient. Journal of the American Society for Information Science and Technology, 54, 550-560.

White, H.D. (2003). Author cocitation analysis and Pearson's r. Journal of the American Society for Information Science and Technology, 54, 1250-1259.

From samorri at OKSTATE.EDU Fri Jan 16 15:19:20 2004
From: samorri at OKSTATE.EDU (Steven A. Morris)
Date: Fri, 16 Jan 2004 15:19:20 -0500
Subject: Pearson's r and ACA
Message-ID:

I don't think the attachment letter on Dr. Bensman's posting made it through the listserver. A copy of his letter can be found at:

http://samorris.ceat.okstate.edu/web/Rousseau-White.doc

Stephen,

A couple of quick comments:

1) How is hypothesis testing applied to ACA? Do you mean that after clustering authors, we can apply a hypothesis test to confirm that each author is indeed a member of a specific group?
2) "Utilization of the Pearson has been the standard method in ACA" is actually a good argument, but remember, paradigms were meant to be overthrown. Maybe ACA is going through some sort of "Kuhnian crisis" at the moment. More likely, this discussion is just a "tempest in a teapot." ;-)

3) I wonder if there is some dataset out there where the "true" clustering of authors is known well enough to allow direct comparison of clustering and mapping based on different similarity measures. I think this would help answer the "so what?" question that was posed by Dr. Kraft. It would be nice to have a "chili cookoff" style contest, similar to some of the signal processing contests at certain IEEE conferences, to show off ACA author classification algorithms.

Thanks,

Steven Morris

On Fri, 16 Jan 2004 11:35:53 -0600, Stephen J Bensman wrote:

>Loet and Steven,
>
>Don Kraft forwarded to me your discussion on the utilization of measures in
>ACA, and I have decided to add my two cents. Please excuse my interference
>in your discussion, but I have already become involved in this debate in
>various ways. A comment of mine on this matter will shortly be published
>JASIST, so I might as well make clearer some of the reasoning that
>underlies my position. I am neither an expert in author cocitation
>analysis (ACA)
>nor a mathematical expert. In general I may be considered a data person,
>who
>uses standard statistical techniques, and it is from this position that I
>approached the controvsery that erupted on the pages JASIST in the two
>articles cited below.
>
>In general I favor the Pearson approach that was developed by White over
>the measures suggest by Ahlgren, Jarvening, and Rousseau. This is for the
>following reasons.
>
>1) Utilization of the Pearson has been the standard method in ACA, and it
>has worked quite well until now. There is now a body of research based
>upon it, to which one compare one's results to gain insights from the work
>of others.
> >2) The Pearson operates within an established system of hypothesis >testing. However, for it to operate properly within this system, all the >variables have to be normally distributed. This is usually accomplished by >testing for the underlying distributions, and, if these are not normal, one >performs a mathematical transformation to accomplish this objective--square >root if the distributions are Poisson, and logarithmic, if the >distributions are highly skewed. As in most social and biological >research, where the processes are also multiplicative, information science >data usually requires the logarithmic transformation to meet the linear and >additive requirements of the Pearson. However, as White demonstrates, the >Pearson is very distributionally robust. In general, White seems to >consider the Pearson as only a sorting mechanism and does not consider >tests of significance as important in ACA. > >3) The Pearson measures both similaties and dissimilarities, showing >similarities as positive correlations and dissimilarities as negative >correlations. This is crucial in ACA, whose main purpose is to partition >authors into different sets. One can easily see partitions in matrices, >because similarities are positive, dissimilarities are negative, and no >relationship is zero. Therefore, there is a full scale of measurement. >The measures suggested by Ahlgren, Jarvening, and Rousseau measure only >similarities on a scale from 0 to 1 with partition somewhat incongruously >at 0.5, and it is difficult for me to see the logic of such measures in >ACA. > >4) The Pearson is sensitive to zeros, and correlations do change when >persons not related to members of a given set of persons are added to this >set. 
However, unlike Ahlgren, Jarvening, and Rousseau, who regard this as >a major fault of the Pearson and axiomatically posit that the relations >must remain invariant, I regard this as a major advantage of the Pearson, >First, it is only natural for relations among persons to change when >foreign persons are added to their mix. One can think of numerous social >situations where this happens. Second, the very changes may be themselves >informative and lead to further analyses and understanding of the >relationships. Ahlgren, Jarvening, and Rousseau axiomatically block this. > >However, there may be situations, where such changes in correlations upon >the addition of zeros may be detrimental to the proper understanding of the >data. This Ahlgren, Jarvening, and Rousseau have adamantly refused to >demonstrate, baldly stating to me that their job is only to dream up axioms >and does not include testing them against any reality. In his >demonstration White clearly showed that the Pearson leads to exactly the >same results as the measures proposed by Ahlgren, Jarvening, and Rousseau, >and Rousseau has admitted to me that their measures are not superior to the >Pearson. However, White was only working with the same extreme case set up >by Ahlgren, Jarvening, and Rousseau, and what really needs to be done is to >take a set a data, apply both measures to it, and see how their different >actions could lead to different interpretations of the data. If both >measures lead to the same result, then there is no case at all for >utilizing the measures proposed by Ahlgren, Jarvening, and Rousseau. > >I hope you find the above useful. If you have any comments or criticisms. >please send them to me. I would be interested in hearing your opinions. > >Respectfully, >Stephen J. Bensman >LSU Libraries >Louisiana State University >Baton Rouge, LA >USA > >ATTACHMENT BELOW: > > > >(See attached file: Rousseau-White.doc) > > > > > > > >Ahlgren, P., Jarneving, B., and Rousseau, R. (2003). 
Requirements for a
>cocitation similarity measure, with special reference to Pearson's
>correlation coefficient. Journal of the American Society for Information
>Science and Technology, 54, 550-560.
>
>White, H.D. (2003). Author cocitation analysis and Pearson's r. Journal
>of the American Society for Information Science and Technology, 54,
>1250-1259.

From Chaomei.Chen at CIS.DREXEL.EDU Fri Jan 16 15:33:35 2004
From: Chaomei.Chen at CIS.DREXEL.EDU (Chaomei Chen)
Date: Fri, 16 Jan 2004 15:33:35 -0500
Subject: CiteSpace
Message-ID:

Dear All,

CiteSpace is an experimental Java program that I have been developing for co-citation analysis, especially for visualizing co-citation networks. The underlying techniques implemented in CiteSpace are explained in a PNAS paper published this week:

Chen, C. (2004) Searching for intellectual turning points: Progressive Knowledge Domain Visualization. Proceedings of the National Academy of Sciences of the United States of America (PNAS). http://www.pnas.org/cgi/reprint/0307513100v1.pdf

Currently, it takes citation data in the ISI Export format and generates node-and-link drawings of co-citation networks. A typical way to use it is to slice a time interval into smaller segments and study how co-citation networks over individual time slices are patched together. The PNAS paper illustrates how to find intellectual turning points by visually searching for pivot points in such patched co-citation networks.

While it is still under construction, if anyone here would like to try out some of its main functions at this stage and provide feedback/comments for further improvements, please get in touch with me via email.
For more information, see also http://www.pages.drexel.edu/~cc345/citespace/

Best wishes,
Chaomei Chen
College of Information Science and Technology
Drexel University

From notsjb at LSU.EDU Fri Jan 16 16:36:29 2004
From: notsjb at LSU.EDU (Stephen J Bensman)
Date: Fri, 16 Jan 2004 15:36:29 -0600
Subject: Pearson's r and ACA
Message-ID:

Steven,

As I have pointed out, I am not an expert in ACA. Therefore, my answer may not cover all the possibilities of hypothesis testing in ACA. The main purpose of hypothesis testing in ACA, as near as I can judge, would be to see if the relationships were significant or not. It provides just another basis of judgment on the strength of the relationship. Moreover, another advantage of utilizing the Pearson is that SAS and SPSS automatically give you the test. Therefore, it entails no further effort on your part. However, as I pointed out, Howard White--who, after all, played a major role in developing this technique--does not seem to regard tests of significance as crucial in ACA. His opinion would trump mine, as he is the expert here.

Second, if it were the purpose of Ahlgren, Jarneving, and Rousseau to accomplish a Kuhnian overthrow of the established paradigm, then they did a poor job of it. To accomplish a Kuhnian overthrow, they would not only have had to prove the Pearson incorrect but also to have substituted a better measure for it. They refused to demonstrate how their measures are better, leaving your "tempest in a teapot" hypothesis still standing.

Finally, Don Kraft did not pose the "so what" question. I did. He was only posting it for me, for which he was roundly berated by a reader of this LISTSERV for posting Bensman's "diatribe."

Respectfully,
Steve B.

"Steven A.
Morris" @LISTSERV.UTK.EDU> on 01/16/2004 02:19:20 PM Please respond to ASIS&T Special Interest Group on Metrics Sent by: ASIS&T Special Interest Group on Metrics To: SIGMETRICS at LISTSERV.UTK.EDU cc: (bcc: Stephen J Bensman/notsjb/LSU) Subject: Re: [SIGMETRICS] Pearson's r and ACA I don't think the attachment letter on Dr. Bensman's posting made it through the listerserver. A copy of his letter can be found at: http://samorris.ceat.okstate.edu/web/Rousseau-White.doc Stephen, A couple of quick comments: 1) How is hypothesis testing applied to ACA? Do you mean that after clustering authors, we can apply a hypothesis test to confirm that each author is indeed a member of a specific group? 2) "Utilization of the Pearson has been the standard method in ACA" is actually a good argument, but remember, paradigms were meant to be overthrown. Maybe ACA is going through some sort of "Kuhnian crisis" at the moment. More likely, this discussion is just a "tempest in a teapot." ;-) 3) I wonder if there is some dataset out there where the "true" clustering of authors is known well enough to allow direct comparison of clustering and mapping based on different similarity measures. I think this is would help answer the "so what?" question that was posed by Dr. Kraft. It would be nice to have a "chili cookoff" style contest, similar to some of the signal processing contests at certain IEEE conferences, to show off ACA author classification algorithms. Thanks, Steven Morris On Fri, 16 Jan 2004 11:35:53 -0600, Stephen J Bensman wrote: >Loet and Steven, > >Don Kraft forwarded to me your discussion on the utilization of measures in >ACA, and I have decided to add my two cents. Please excuse my interference >in your discussion, but I have already become involved in this debate in >various ways. A comment of mine on this matter will shortly be published >JASIST, so I might as well make clearer some of the reasoning that >underlies my position. 
I am neither an expert in author cocitation >analysis (ACA) >nor a mathematical expert. In general I may be considered a data person, >who >uses standard statistical techniques, and it is from this position that I >approached the controversy that erupted on the pages of JASIST in the two >articles cited below. > >In general I favor the Pearson approach that was developed by White over >the measures suggested by Ahlgren, Jarneving, and Rousseau. This is for the >following reasons. > >1) Utilization of the Pearson has been the standard method in ACA, and it >has worked quite well until now. There is now a body of research based >upon it, to which one can compare one's results to gain insights from the work >of others. > >2) The Pearson operates within an established system of hypothesis >testing. However, for it to operate properly within this system, all the >variables have to be normally distributed. This is usually accomplished by >testing for the underlying distributions, and, if these are not normal, one >performs a mathematical transformation to accomplish this objective--square >root if the distributions are Poisson, and logarithmic if the >distributions are highly skewed. As in most social and biological >research, where the processes are also multiplicative, information science >data usually require the logarithmic transformation to meet the linear and >additive requirements of the Pearson. However, as White demonstrates, the >Pearson is very distributionally robust. In general, White seems to >consider the Pearson as only a sorting mechanism and does not consider >tests of significance as important in ACA. > >3) The Pearson measures both similarities and dissimilarities, showing >similarities as positive correlations and dissimilarities as negative >correlations. This is crucial in ACA, whose main purpose is to partition >authors into different sets. 
One can easily see partitions in matrices, >because similarities are positive, dissimilarities are negative, and the absence of a >relationship is zero. Therefore, there is a full scale of measurement. >The measures suggested by Ahlgren, Jarneving, and Rousseau measure only >similarities on a scale from 0 to 1 with partition somewhat incongruously >at 0.5, and it is difficult for me to see the logic of such measures in >ACA. > >4) The Pearson is sensitive to zeros, and correlations do change when >persons not related to members of a given set of persons are added to this >set. However, unlike Ahlgren, Jarneving, and Rousseau, who regard this as >a major fault of the Pearson and axiomatically posit that the relations >must remain invariant, I regard this as a major advantage of the Pearson. >First, it is only natural for relations among persons to change when >foreign persons are added to their mix. One can think of numerous social >situations where this happens. Second, the very changes may be themselves >informative and lead to further analyses and understanding of the >relationships. Ahlgren, Jarneving, and Rousseau axiomatically block this. > >However, there may be situations where such changes in correlations upon >the addition of zeros may be detrimental to the proper understanding of the >data. This Ahlgren, Jarneving, and Rousseau have adamantly refused to >demonstrate, baldly stating to me that their job is only to dream up axioms >and does not include testing them against any reality. In his >demonstration White clearly showed that the Pearson leads to exactly the >same results as the measures proposed by Ahlgren, Jarneving, and Rousseau, >and Rousseau has admitted to me that their measures are not superior to the >Pearson. 
However, White was only working with the same extreme case set up >by Ahlgren, Jarneving, and Rousseau, and what really needs to be done is to >take a set of data, apply both measures to it, and see how their different >actions could lead to different interpretations of the data. If both >measures lead to the same result, then there is no case at all for >utilizing the measures proposed by Ahlgren, Jarneving, and Rousseau. > >I hope you find the above useful. If you have any comments or criticisms, >please send them to me. I would be interested in hearing your opinions. > >Respectfully, >Stephen J. Bensman >LSU Libraries >Louisiana State University >Baton Rouge, LA >USA > >ATTACHMENT BELOW: > >(See attached file: Rousseau-White.doc) > >Ahlgren, P., Jarneving, B., and Rousseau, R. (2003). Requirements for a >cocitation similarity measure with special reference to Pearson's >correlation coefficient. Journal of the American Society for Information >Science and Technology, 54, 550-560. > >White, H.D. (2003). Author cocitation analysis and Pearson's r. Journal >of the American Society for Information Science and Technology, 54, >1250-1259. From loet at LEYDESDORFF.NET Fri Jan 16 23:21:52 2004 From: loet at LEYDESDORFF.NET (Loet Leydesdorff) Date: Sat, 17 Jan 2004 05:21:52 +0100 Subject: Pearson's r and ACA In-Reply-To: Message-ID: Dear Stephen and colleagues, I have a higher appreciation of Ahlgren et al.'s contribution because the problem of the non-normality of the distributions is a serious one, particularly if one extends the multivariate perspective with the time-series one. Problems of auto-correlation and auto-covariation make the design almost intractable. Non-parametric statistics can provide a much more transparent solution than making transformations in order to rescue the assumptions of normality. These considerations brought me to an interest in information theory. 
In "The Static and Dynamic Analysis of Network Data Using Information Theory," Social Networks 13 (1991) 301-345 I provided a set of algorithms that enabled me to elaborate in concrete studies collected later in "The Challenge of Scientometrics" (Leiden: DSWO Press/Leiden University, 1995). Thus, my critique of Ahlgren et al. (2003) would be that they do not take the next step and move more definitively into non-parametric statistics. Given the increasing interest in the last decade or so in entropical systems, entropy statistics seems an obvious candidate. The explanatory power of these statistics is considerable. For example, using entropy statistics one can provide an exact solution for the divisive clustering problem. I provide the proof at pp. 166 ff. of the second (2001) edition of "The Challenge of Scientometrics" and apply it there to a small set of chemistry journals. Furthermore, one can allow for asymmetries in distances, while similarity criteria in the parametric tradition are always symmetrical. This is just an example. All measurement can be expressed in bits of information and therefore be compared. With kind regards, Loet From notsjb at LSU.EDU Tue Jan 20 10:40:15 2004 From: notsjb at LSU.EDU (Stephen J Bensman) Date: Tue, 20 Jan 2004 09:40:15 -0600 Subject: Pearson's r and ACA Message-ID: Dear Loet et al. In respect to your suggestion of using nonparametric statistics to handle non-normal distributions, I will answer you only in general terms. I was trained in statistics by an ecologist, who introduced me to biometric statistics. Through him I became intrigued by how biological, social, and information phenomena act precisely in the same way and that biostatistics are therefore the statistics applicable to information science. I became absolutely fascinated by the unity of society and nature in this respect. My first inclination was to use nonparametric statistics to counter the nonnormal distributions, but he just took my model and contemptuously threw it in the waste basket. He insisted that you must use the more powerful parametric statistics whenever possible, using the logarithmic transformation. 
To emphasize his point, he took off a shelf above his desk a little log from his brother's woodlot in Maine, which had a little "n" painted on its end. He slammed it on his desk and stated, "This is my log natural." The use of mathematical transformations to normalize distributions raises some interesting philosophical questions. From the perspective of the normal law of error, biological, social, and information reality makes a person feel that he is caught in a fun house full of distorting mirrors. In order to see and measure error, you have to put on mathematical eye glasses, which transform the reality to that of the perspective of the normal distribution. This makes you wonder--what is actual reality--that of the raw data, or that of the data logarithmically transformed to the requirements of the normal distribution? B. C. Brookes in the article below dealt with this philosophical question, and, basing himself on the psychometric work of Gustav Fechner, Brookes argued that the logarithmic perspective was the proper one for information science. Interestingly enough, John Maynard Keynes in his treatise on probability thought that the lognormal distribution centered on the geometric mean was the proper law of error for society. However, lately I have been switching over to nonparametric techniques for reasons stemming out of what seems to be your main research interest--classifying phenomena into sets or groups with mathematical and statistical techniques such as clustering, factor analysis, etc. Precise mathematical techniques including many statistical ones are really not applicable to information science due to Bradford's Law of Scattering, which causes all information science sets to be fuzzy. Therefore, your sets are always plagued by foreign contaminants that distort estimates of parameters and result in tremendous outliers. 
To counter this, I have been switching to nonparametric techniques like the chi-squared test for homogeneity instead of correlation because of the ability to work within broad categories instead of in terms of precise fits. In other words, one has to use cruder methods to counter the fuzzy outliers unless one can more precisely define set membership. To tell you the honest truth, defining precise sets with mathematical techniques like cluster analysis is probably beyond my mental capacities and will have to be done by the likes of you. All I can say is that you should use whatever works as long as you can explain to laymen like me what does work and why it does work. This would be tremendously helpful. In respect to entropy I did take a fling at this at the end of the article below. I did it on the basis of the theories of the famous French statistician Emile Borel, who postulated total homogeneity and randomness as a function of entropy. According to Borel, tremendously skewed distributions resulting in vast inhomogeneities--like those found in information science--require vast energy inputs, and, as energy inputs decline, the entire system collapses with a declining mean and variance around this mean until the system can be modeled by the Poisson distribution. This is a very interesting way to model obsolescence for purposes of weeding library collections. However, in general, I prefer biological models to physical ones such as Borel's borrowing from thermodynamics. Anyhow, I hope the above did not bore you and that you find the observations useful. Respectfully, Stephen J. Bensman Brookes, Bertram C. 1980a. The foundations of information science, part I: Philosophical aspects. Journal of information science 2: 125-33. Bensman, Stephen J. 2000. Probability Distributions in Library and Information Science: A Historical and Practitioner Viewpoint. Journal of the American Society for Information Science 51: 816-833. 
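The two effects debated in this thread, the sensitivity of Pearson's r to added zeros and the effect of a logarithmic transformation, can be illustrated with a minimal Python sketch. The counts below are hypothetical and are not taken from Ahlgren et al. (2003) or White (2003). The log(1 + x) form is likewise an assumption made here so that zero counts can be transformed, since the thread does not state how zeros were handled.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def log1p_all(values):
    """Apply log(1 + v) so that zero counts stay defined (an assumption)."""
    return [math.log1p(v) for v in values]


# Hypothetical cocitation counts of two authors with six other authors.
author_a = [50, 3, 12, 1, 7, 30]
author_b = [9, 20, 2, 15, 40, 5]

# Ten further authors cocited with neither of the two: paired zeros.
padded_a = author_a + [0] * 10
padded_b = author_b + [0] * 10

r_raw = pearson_r(author_a, author_b)
r_padded = pearson_r(padded_a, padded_b)
r_log = pearson_r(log1p_all(author_a), log1p_all(author_b))
r_log_padded = pearson_r(log1p_all(padded_a), log1p_all(padded_b))

print(f"raw counts:              r = {r_raw:.3f}")
print(f"raw counts + zeros:      r = {r_padded:.3f}")
print(f"log(1+x) counts:         r = {r_log:.3f}")
print(f"log(1+x) counts + zeros: r = {r_log_padded:.3f}")
```

With these made-up counts the sign of r changes from negative to positive once the paired zeros are appended, both before and after the transformation, because the zeros form a cluster at the origin that lowers both means. Whether such a shift is informative or a defect is the question the thread leaves open.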
From Garfield at CODEX.CIS.UPENN.EDU Tue Jan 20 14:49:32 2004 From: Garfield at CODEX.CIS.UPENN.EDU (Garfield, Eugene) Date: Tue, 20 Jan 2004 14:49:32 -0500 Subject: Delayed Recognition discussed by Glanzel et al Message-ID: Wolfgang.Glanzel at econ.kuleuven.ac.be and colleagues Schlemmer and Thijs have published an excellent article on the topic of Delayed Recognition. This is a subject not often discussed in the literature. The abstract and cited references follow. Since the papers by me that were published in Current Contents are not any longer easily available in print I include the URLs below and urge readers of our listserv to add any others that are relevant. Better late than never? On the chance to become highly cited only beyond the standard bibliometric time horizon Glanzel W, Schlemmer B, Thijs B SCIENTOMETRICS 58 (3): 571-586 2003 Document type: Article Language: English Cited References: 12 Times Cited: 0 Find Related Records Explanation Abstract: According to GARFIELD (1980), most scientists can name an example of an important discovery that had little initial impact on contemporary research. And he uses by Mendel's work as a classical example. Delayed recognition is sometimes used by scientists as an argument against citation-based indicators based on citation windows defined for a short- or medium-term initial period beginning with the paper's publication year. This study is focused on a large-scale analysis of the citation history of all papers indexed in the 1980 annual volume of the Science Citation Index. The objective is two-fold, particularly, to analyze whether the share of delayed recognition papers is significant and whether such papers are typical of the work of their authors at that time. In a first step, the background of advanced bibliometric models by Glanzel, Egghe, Rousseau and Burrell of stochastic citation processes and first-citation distributions is described briefly. 
The second part is devoted to the bibliometric analysis of first-citation statistics and of the phenomenon of citation delay. In a third step, finally, delayed reception publications have been studied individually. Their topics and the citation patterns of other papers by the same authors have been studied to uncover principles of regularity or exceptionality of delayed reception publications. KeyWords Plus: SCIENTIFIC LITERATURE, CITATION PROCESSES, STOCHASTIC-MODEL Addresses: Glanzel W, Katholieke Univ Leuven, Steunpunt O&O Stat, Dekenstr 2, B-3000 Louvain, Belgium Katholieke Univ Leuven, Steunpunt O&O Stat, B-3000 Louvain, Belgium Hungarian Acad Sci, Inst Res Org, Budapest, Hungary cited references from WOS _____ BURRELL QL SCIENTOMETRICS 52 3 2001 EGGHE L INTRO INFORMETRICS Q 1990 GARFIELD E CURR CONTENTS 19 5 1980 GLANZEL W INFORM PROCESS MANAG 31 69 1995 GLANZEL W INFORM PROCESS MANAG 28 53 1992 GLANZEL W J INFORM SCI 21 37 1995 GLANZEL W SCIENTOMETRICS 55 335 2002 GLANZEL W SCIENTOMETRICS 30 49 1994 GLANZEL W SCIENTOMETRICS 25 373 1992 ROUSSEAU R INFORMETRICS 87 88 249 1988 ROUSSEAU R SCIENTOMETRICS 30 213 1994 SCHUBERT A CZECH J PHYS 36 121 1986 Essays by E. Garfield from Current Contents available in full text at: www.eugnegarfield.org Premature Discovery or Delayed Recognition -- Why? http://www.garfield.library.upenn.edu/essays/v4p488y1979-80.pdf Delayed Recognition in Scientific Discovery: Citation Frequency Analysis Aids the Search for Case Histories http://www.garfield.library.upenn.edu/essays/v12p154y1989.pdf More Delayed Recognition. Part 2. From Inhibin to Scanning Electron Microscopy http://www.garfield.library.upenn.edu/essays/v13p068y1990.pdf More Delayed Recognition. Part 1. Examples from the Genetics of Color Blindness, the Entropy of Short-Term Memory, Phosphoinositides, and Polymer Rheology http://www.garfield.library.upenn.edu/essays/v12p264y1989.pdf More Delayed Recognition. Part 1. 
Examples from the Genetics of Color Blindness, the Entropy of Short-Term Memory, Phosphoinositides, and Polymer Rheology http://www.garfield.library.upenn.edu/essays/v12p267y1989.pdf Lyme Disease Research Uncovers a Case of Delayed Recognition: Arvid Afzelius and His Successors http://www.garfield.library.upenn.edu/essays/v12p345y1989.pdf Postmature Scientific Discovery and the Sexual Recombination of Bacteria - The Shared Perspectives of a Scientist and a Sociologist http://www.garfield.library.upenn.edu/essays/v12p016y1989.pdf Mapping Cholera Research and the Impact of Shambu Nath De of Calcutta http://www.garfield.library.upenn.edu/essays/v9p103y1986.pdf When responding, please attach my original message __________________________________________________ Eugene Garfield, PhD. email: garfield at codex.cis.upenn.edu home page: www.eugenegarfield.org Tel: 215-243-2205 Fax 215-387-1266 President, The Scientist LLC. www.the-scientist.com 3535 Market St., Phila. PA 19104-3389 Chairman Emeritus, ISI www.isinet.com 3501 Market Street, Philadelphia, PA 19104-3302 Past President, American Society for Information Science and Technology (ASIS&T) www.asis.org ________________________________________________________________________ This email has been scanned for all viruses by the MessageLabs Email Security System. For more information on a proactive email security service working around the clock, around the globe, visit http://www.messagelabs.com ________________________________________________________________________ -------------- next part -------------- An HTML attachment was scrubbed... URL: From Garfield at CODEX.CIS.UPENN.EDU Tue Jan 20 16:16:54 2004 From: Garfield at CODEX.CIS.UPENN.EDU (Garfield, Eugene) Date: Tue, 20 Jan 2004 16:16:54 -0500 Subject: Paper by Mike Thelwall in JASIST Jan. 
15, 2004 about the online i mpact of highly rated scholars Message-ID: e-mail: m.thelwall at wlv.ac.uk or g.harries at wlv.ac.uk _____ Do the Web sites of higher rated scholars have significantly more online impact? Thelwall M, Harries G JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY 55 (2): 149-159 JAN 15 2004 Document type: Review Language: English Cited References: 101 Times Cited: 0 Find Related Records Explanation Abstract: The quality and impact of academic Web sites is of interest to many audiences, including the scholars who use them and Web educators who need to identify best practice. Several large-scale European Union research projects have been funded to build new indicators for online scientific activity, reflecting recognition of the importance of the Web for scholarly communication. In this paper we address the key question of whether higher rated scholars produce higher impact Web sites, using the United Kingdom as a case study and measuring scholars' quality in terms of university-wide average research ratings. Methodological issues concerning the measurement of the online impact are discussed, leading to the adoption of counts of links to a university's constituent single domain Web sites from an aggregated counting metric. The findings suggest that universities with higher rated scholars produce significantly more Web content but with a similar average online impact. Higher rated scholars therefore attract more total links from their peers, but only by being more prolific, refuting earlier suggestions. It can be surmised that general Web publications are very different from scholarly journal articles and conference papers, for which scholarly quality does associate with citation impact. This has important implications for the construction of new Web indicators, for example that online impact should not be used to assess the quality of small groups of scholars, even within a single discipline. 
KeyWords Plus: WORLD-WIDE-WEB, CITATION ANALYSIS, BIBLIOMETRIC METHODS, SEARCH ENGINE, SCIENCE, UNIVERSITY, LINKS, INFORMATION, DEPARTMENTS, COMMUNICATION Addresses: Thelwall M, Wolverhampton Univ, Sch Comp & Informat Technol, Wulfruna St, Wolverhampton WV1 1SB, England Wolverhampton Univ, Sch Comp & Informat Technol, Wolverhampton WV1 1SB, England Publisher: When responding, please attach my original message __________________________________________________ Eugene Garfield, PhD. email: garfield at codex.cis.upenn.edu home page: www.eugenegarfield.org Tel: 215-243-2205 Fax 215-387-1266 President, The Scientist LLC. www.the-scientist.com 3535 Market St., Phila. PA 19104-3389 Chairman Emeritus, ISI www.isinet.com 3501 Market Street, Philadelphia, PA 19104-3302 Past President, American Society for Information Science and Technology (ASIS&T) www.asis.org From KPBACS at ENGR.PSU.EDU Tue Jan 20 16:29:27 2004 From: KPBACS at ENGR.PSU.EDU (Karen P. Brooks) Date: Tue, 20 Jan 2004 16:29:27 -0500 Subject: SIGNOFF SIGMETRICS Message-ID: Please take me off of the SIGMETRICS LISTSERV. I received notification (twice) that I had been removed from the listserv. Today I received three more listserv e-mails, one of which is below. I may be on as KPBACS at PSU.EDU or kpb2 at psu.edu. Please confirm Thank you, Karen Brooks -----Original Message----- From: Garfield, Eugene [mailto:Garfield at CODEX.CIS.UPENN.EDU] Sent: Tuesday, January 20, 2004 4:17 PM To: SIGMETRICS at LISTSERV.UTK.EDU Subject: [SIGMETRICS] Paper by Mike Thelwall in JASIST Jan. 15, 2004 about the online i mpact of highly rated scholars From Garfield at CODEX.CIS.UPENN.EDU Tue Jan 20 17:27:03 2004 From: Garfield at CODEX.CIS.UPENN.EDU (Garfield, Eugene) Date: Tue, 20 Jan 2004 17:27:03 -0500 Subject: FW: Biomedcentral paper on citation of reviews that may interest you. Message-ID: This article is available free of charge at BiomedCentral. 
http://www.biomedcentral.com/1741-7015/1/2/ Research article Systematic reviews: a cross-sectional study of location and citation counts Victor M Montori, Nancy L Wilczynski, Douglas Morgan, R Brian Haynes, the Hedges Team BMC Medicine 2003, 1:2 (24 November 2003) [Abstract] [Full text] http://www.biomedcentral.com/1741-7015/1/2/ 2278 accesses in the last 30 days Abstract Background Systematic reviews summarize all pertinent evidence on a defined health question. They help clinical scientists to direct their research and clinicians to keep updated. Our objective was to determine the extent to which systematic reviews are clustered in a large collection of clinical journals and whether review type (narrative or systematic) affects citation counts. Methods We used hand searches of 170 clinical journals in the fields of general internal medicine, primary medical care, nursing, and mental health to identify review articles (year 2000). We defined 'review' as any full text article that was bannered as a review, overview, or meta-analysis in the title or in a section heading, or that indicated in the text that the intention of the authors was to review or summarize the literature on a particular topic. We obtained citation counts for review articles in the five journals that published the most systematic reviews. Results 11% of the journals concentrated 80% of all systematic reviews. Impact factors were weakly correlated with the publication of systematic reviews (R2 = 0.075, P = 0.0035). There were more citations for systematic reviews (median 26.5, IQR 12 ? 56.5) than for narrative reviews (8, 20, P <.0001 for the difference). Systematic reviews had twice as many citations as narrative reviews published in the same journal (95% confidence interval 1.5 ? 2.7). Conclusions A few clinical journals published most systematic reviews. Authors cited systematic reviews more often than narrative reviews, an indirect endorsement of the 'hierarchy of evidence'. 
From loet at LEYDESDORFF.NET Wed Jan 21 03:11:23 2004 From: loet at LEYDESDORFF.NET (Loet Leydesdorff) Date: Wed, 21 Jan 2004 09:11:23 +0100 Subject: Pearson's r and ACA In-Reply-To: Message-ID: Dear Stephen, Your mail clarifies the incomprehensible use of the statistics in Ahlgren et al. (2003) because they indicate r = 0.89 between the variables "Braun" and "Schubert" while upon computation one only finds r = 0.456. However, they state in very cryptic wording (on p. 555) that they performed a logarithmic transformation and then, indeed, one finds r = 0.89. I understand that this has been done for reasons of significance testing given the assumption of a bivariate normal distribution. (Peter van den Besselaar and I had an exchange in a recent issue of JASIST on significance testing in the case of descriptive statistics versus inferential statistics.) However, these authors do not wish to test for significance. It is confusing. Even more confusing is the statement on p. 556 that the correlation (after logarithmic transformation) would go to r = 0.94 in this case by adding only zeros. The zeros should have no effect after the transformation, shouldn't they? But this is crucial to the argument of the paper. Anyhow, my point was about using information theory. This implies a logarithmic transformation as you wish to emphasize. More importantly, it allows for a unique and exact solution to the problem of the dividedness. I have proven that in "The Challenge of Scientometrics" (Chapter 9, pp. 166 ff. of the 2001 edition).
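The effect of added zeros on Pearson's r, questioned a few sentences earlier, can be checked numerically. What follows is only a minimal Python sketch on invented counts, not the Ahlgren et al. (2003) data; the vectors `braun` and `schubert`, the number of appended zeros, and the helper `pearson_r` are all illustrative assumptions. It compares r on the raw values with r after appending pairs in which both variables are zero.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical co-citation counts of two authors with five other authors.
braun = [10, 4, 7, 1, 3]
schubert = [6, 5, 9, 2, 1]

# Same profiles after adding ten authors co-cited with neither (joint zeros).
braun_padded = braun + [0] * 10
schubert_padded = schubert + [0] * 10

r_raw = pearson_r(braun, schubert)
r_padded = pearson_r(braun_padded, schubert_padded)
print(f"r on raw counts:        {r_raw:.3f}")
print(f"r with ten joint zeros: {r_padded:.3f}")
```

On these invented numbers the padded correlation comes out higher than the raw one: every (0, 0) pair lies below both means and therefore contributes a positive cross-product. Whether such a shift matters depends on the purpose of the analysis.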
I'll apply the information-theoretic algorithm to the matrix under discussion and submit a brief contribution to JASIST on the subject. The decomposition using information theory is not disturbed by outliers because the measure is non-parametric. Perhaps you can do me the favour of explaining the difference in the value of the correlations between Table 8 and Table 9 in Ahlgren et al. (2003). Let us focus on "Braun" and "Schubert" as variables. How did they arrive at r = 0.94? With kind regards, Loet _____ Loet Leydesdorff Amsterdam School of Communications Research (ASCoR) Kloveniersburgwal 48, 1012 CX Amsterdam Tel.: +31-20- 525 6598; fax: +31-20- 525 3681 loet at leydesdorff.net ; http://www.leydesdorff.net/ The Challenge of Scientometrics ; The Self-Organization of the Knowledge-Based Society > -----Original Message----- > From: ASIS&T Special Interest Group on Metrics > [ mailto:SIGMETRICS at LISTSERV.UTK.EDU] On Behalf Of Stephen J Bensman > Sent: Tuesday, January 20, 2004 4:40 PM > To: SIGMETRICS at LISTSERV.UTK.EDU > Subject: [SIGMETRICS] Pearson's r and ACA > > > Dear Loet et al. > > In respect to your suggestion of using nonparametric > statistics to handle > non-normal distributions, I will answer you only in general > terms. I was > trained in statistics by an ecologist, who introduced me to > biometric statistics. Through him I became intrigued by how > biological, social, and information phenomena act precisely > in the same way and that biostatistics are therefore the > statistics applicable to information science. I became > absolutely fascinated by the unity of society and nature in > this respect. My first inclination was to use nonparametric > statistics to counter the nonnormal distributions, but he > just took my model and contemptuously threw it in the waste > basket. He insisted that you must use the more powerful > parametric statistics whenever possible, using the > logarithmic transformation.
To emphasize his point, he took > off a shelf above his desk a little log from his brother's > woodlot in Maine, which had a little "n" painted on its end. > He slammed it on his desk and stated, "This is my log natural." > > The use of mathematical transformations to normalize > distributions raises some interesting philosophical > questions. From the perspective of the normal law of error, > biological, social, and information reality makes a person > feel that he is caught in a fun house full of distorting > mirrors. In order to see and measure error, you have to put > on mathematical eye glasses, which transform the reality to > that of the perspective of the normal distribution. This > makes you wonder--what is actual reality--that of the raw > data, or that of the data logarithmically transformed to the > requirements of the normal distribution? B. C. Brookes in > the article below dealt with this philosophical question, > and, basing himself on the psychometric work of Gustav > Fechner, Brookes argued that the logarithmic perspective was > the proper one for information science. Interestingly enough > John Maynard Keynes in his treatise on probability thought > that the lognormal distribution centered on the geometric > mean was the proper law of error for society. > > However, lately I have been switching over to nonparametric > techniques for reasons stemming out of what seems to be your > main research interest--classifying phenomena into sets or > groups with mathematical and statistical techniques such as > clustering, factor analysis, etc. Precise mathematical > techniques including many statistical ones are really not > applicable to information science due to Bradford's Law of > Scattering, which causes all information science sets to be > fuzzy. Therefore, your sets are always plagued by foreign > contaminants that distort estimates of parameters and result > in tremendous outliers.
To counter this, I have been > switching to nonparametric techniques like the chi-squared > test for homogeneity instead of correlation because of the > ability to work within broad categories instead of in terms of > precise fits. In other words, one has to use cruder methods > to counter the fuzzy outliers unless one can more precisely > define set membership. To tell you the honest truth, > defining precise sets with mathematical techniques like > cluster analysis is probably beyond my mental capacities and > will have to be done by the likes of you. All I can say is > that you should use whatever works as long as you can explain > to laymen like me what does work and why it does work. This > would be tremendously helpful. > > In respect to entropy I did take a fling at this at the end > of the article below. I did it on the basis of the theories > of the famous French statistician Emile Borel, who postulated > total homogeneity and randomness as a function of entropy. > According to Borel, tremendously skewed distributions > resulting in vast inhomogeneities--like those found in > information science--require vast energy inputs, and, as > energy inputs decline, the entire system collapses with a > declining mean and variance around this mean until the system > can be modeled by the Poisson distribution. A very > interesting way to model obsolescence for purposes of weeding > library collections. However, in general, I prefer > biological models to physical ones such as Borel's borrowing > from thermodynamics. > > Anyhow, I hope the above did not bore you and that you find > the observations useful. > > Respectfully, > > Stephen J. Bensman > > Brookes, Bertram C. 1980a. The foundations of information > science, part I: Philosophical aspects. Journal of > information science 2: 125-33. > > Bensman, Stephen J. 2000. Probability Distributions in Library and > Information Science: A Historical and Practitioner Viewpoint.
> Journal of the American Society for Information Science 51: 816-833. From loet at LEYDESDORFF.NET Wed Jan 21 05:20:40 2004 From: loet at LEYDESDORFF.NET (Loet Leydesdorff) Date: Wed, 21 Jan 2004 11:20:40 +0100 Subject: Pearson's r and ACA In-Reply-To: Message-ID: ps. I understood now from a private email of one of the authors that they did not apply the logarithmic transformation to the data (despite some text on p. 555 responding to the referee). The high values for the Pearson are generated by treating the diagonal values as missing data and not as zeros. This is noted on p. 554. Of course, zeros depress the Pearson correlation in the case of otherwise positive values. This explanation completely clarifies the misunderstanding. Perhaps, it is useful in this context to note that the treatment of the main diagonal has been the subject of some early work in scientometrics by Noma and Price. The references are: Noma, Elliott (1982). An Improved Method for Analyzing Square Scientometric Transaction Matrices. Scientometrics 4, 297-316. Price, Derek J. de Solla (1981). The Analysis of Square Matrices of Scientometric Transactions. Scientometrics 3, 55-63. With kind regards, Loet From garfield at CODEX.CIS.UPENN.EDU Thu Jan 22 16:50:33 2004 From: garfield at CODEX.CIS.UPENN.EDU (Eugene Garfield) Date: Thu, 22 Jan 2004 16:50:33 -0500 Subject: Little book, big book: before and after Little science, big science: a review article, Part I and II, JOURNAL OF LIBRARIANSHIP AND INFORMATION SCIENCE 35 (2): 115-125 JUN 2003 and 35 (3): 189-201 SEP 2003 Message-ID: A "MUST READ" FOR FANS OF DEREK JOHN DESOLLA PRICE. IN THE NEAR FUTURE WE ARE POSTING DEREK'S COMPLETE CV AND BIBLIOGRAPHY. _________________________________________________ Eugene Garfield, PhD. email: garfield at codex.cis.upenn.edu Author : J. Furner : jfurner at ucla.edu Author's Website : http://polaris.gseis.ucla.edu/jfurner/jfurner.html List of Publications by J. Furner : http://polaris.gseis.ucla.edu/jfurner/jfurner.html Here is a two-part article that appeared in J. Lib. Info Sci.
These are available as pdf files at : http://polaris.gseis.ucla.edu/jfurner/03jolis-pt1-compact.pdf - Part I http://polaris.gseis.ucla.edu/jfurner/03jolis-pt2-compact.pdf - Part II __________________________________________________________________________________________ full text pdf file : http://polaris.gseis.ucla.edu/jfurner/03jolis-pt1-compact.pdf TITLE Little book, big book: before and after Little science, big science: a review article, Part I AUTHOR Furner J SOURCE JOURNAL OF LIBRARIANSHIP AND INFORMATION SCIENCE 35 (2): 115-125 JUN 2003 Document type: Review Language: English Cited References: 62 Times Cited: 0 Abstract: Since its publication in 1963, Derek Price's Little science, big science (LSBS) has achieved 'citation classic' status. Examination of the genesis of LSBS and the state of the discipline of the history of science in the UK and the USA in the late 1950s demonstrates that Price's ideas were formulated during a pivotal period in the development of socio-historical studies of science. Price's talent for innovation and synthesis at an unsettled but highly charged time, and his appreciation of the pioneering work in science studies of the crystallographer J.D. Bernal, are reflected in the uniquely profound and wide-ranging respects in which LSBS has contributed to the development of scientometric and sociological theory. 
KeyWords Plus: HISTORY, SOCIOLOGY Addresses: Furner J, Univ Calif Los Angeles, Dept Informat Studies, 300 Young Dr N,Mailbox 951520, Los Angeles, CA 90095 USA Univ Calif Los Angeles, Dept Informat Studies, Los Angeles, CA 90095 USA Publisher: SAGE PUBLICATIONS LTD, 6 BONHILL STREET, LONDON EC2A 4PU, ENGLAND IDS Number: 701DU ISSN: 0961-0006 Cited Author Cited Work Volume Page Year ID *ROYAL SOC ROYAL SOC EMP SCI C 1984 *ROYAL SOC ROYAL SOC SCI IINF C 1948 BARBER B SCI SOCIAL ORDER 1952 BEAVER D SOCIOL INQ 48 140 1978 BEAVER DD SCI SOCIAL ORDER 76 371 1985 BEDINI SA ANN I MUSEO STORIA S 9 95 1984 BERNAL JD SCI HIST 1954 BERNAL JD SCI IND 19 CENTURY 1953 BERNAL JD SOCIAL FUNCTION SCI 1939 BRUSH SG SCIENCE 183 1164 1974 BUTTERFIELD H ORIGINS MODERN SCI 1 1949 BUTTERFIELD H WHIG INTERPRETATION 1931 CARR EH WHAT HIST 1961 CONANT JB UNDERSTANDGING SCI H 1947 CRAWFORD S B MED LIB ASS 72 238 1984 GALISON P BIG SCI GROWTH LARGE 1992 GARFIELD E CITATION INDEXING IT 1979 GARFIELD E CURR CONTENTS 5 1982 GARFIELD E CURRENT CONTENTS 43 3 1985 GARFIELD E CURRENT CONTENTS 23 3 1984 GARFIELD E SCIENTOMETRICS 7 487 1985 GRIFFITH BC SCIENTOMETRICS 6 5 1984 HAGSTROM WO SCI COMMUNITY 1965 HALL AR SCI COMMUNITY 75 22 1984 HERNER S SUPPLEMENT DICT AM L 98 1990 HESSEN B SCI CROSS ROADS 149 1931 JUSTICE A 2 C HIST HER SCI TEC 2002 KOCHEN M J AM SOC INFORM SCI 35 147 1984 KOYRE A ETUDES GALILEENNES 1939 KUHN TS ETUDES GALILEENNES 75 29 1984 KUHN TS STRUCTURE SCI REVOLU 1962 LOTKA AJ J WASHINGTON ACADEMY 16 317 1926 MACKAY A SOC STUD SCI 14 315 1984 MERTON RK AM SOCIOL REV 22 635 1957 MERTON RK J LEGAL POLITICAL SO 1 115 1942 MERTON RK SCI TECHNOLOGY SOC 1 1938 MUDDIMAN D 2 C HIST HER SCI TEC 2002 NEEDHAM J HEAVENLY CLOCKWORK G 1960 NEEDHAM J SCI CIVILISATION CHI 1 1954 PRICE DD GEARS GREEKS ANTIKYT 1974 PRICE DD METRIC SCI 69 1978 PRICE DJ 6 C INT HIST SCI AMS 1 413 1950 PRICE DJ ARCH INT HIST SCI 4 85 1951 PRICE DJ EQUATORIE PLANETIS 1955 PRICE DJ SOCIOL SCI 516 1962 PRICE DJD BASIC COLL Q 
4 6 1959 PRICE DJD CURR CONTENTS 29 18 1983 PRICE DJD DISCOVERY 17 240 1956 PRICE DJD LITTLE SCI BIG SCI 1986 PRICE DJD LITTLE SCI BIG SCI 1963 PRICE DJD SCI BABYLON 1961 PRICE DJD SCI SCI 195 1964 PRICE DJD SCIENCE 149 510 1965 ROSE H JD BERNAL LIFE SCI P 132 1999 ROSSITER MW JD BERNAL LIFE SCI P 75 95 1984 SARTON G HIST SCI NEW HUMANIS 1931 SARTON G INTRO HIST SCI 1 1927 SEGLEN PO J AM SOC INFORM SCI 43 628 1992 SNOW CP 2 CULTURES SCI REVOL 1959 TOULMIN S DAEDALUS 106 143 1977 VICKERY B J DOC 54 281 1998 WEINBERG AM SCIENCE 134 161 1961 ____________________________________________________________________________________________ http://polaris.gseis.ucla.edu/jfurner/03jolis-pt2-compact.pdf TITLE Little book, big book: before and after little science, big science: a review article, Part II AUTHOR Furner J SOURCE JOURNAL OF LIBRARIANSHIP AND INFORMATION SCIENCE 35 (3): 189-201 SEP 2003 Document type: Review Language: English Cited References: 74 Times Cited: 0 Abstract: A bibliometric analysis, updating Garfield's previous study of 1985, shows that Derek Price's Little science, big science (LSBS) has been cited on more than 1,500 occasions since its publication in 1963. Content analysis of these citations shows that Price's work has inspired the formation and subsequent development of several distinct communities of scholarly practice, including the history, sociology, politics and 'science' of science. In library and information science, LSBS is best remembered for its model of the exponential growth of scientific literature. Recent scientometric work has demonstrated that other mathematical models may prove a better fit to data on the actual growth of the literature in various fields; of these alternatives, the power model is most commonly invoked as an all-purpose descriptor of growth processes in electronic communication systems such as the Web. Price's work thus retains its relevance, 40 years on, for the webometricians of today. 
KeyWords Plus: SQUARE-ROOT LAW, GROWTH, OBSOLESCENCE, SIZE, COLLABORATION, PRODUCTIVITY, NETWORKS, LIBRARY, IMPACT, NUMBER Addresses: Furner J, Univ Calif Los Angeles, Dept Informat Studies, 300 Young Dr N,Mailbox 951520, Los Angeles, CA 90095 USA Univ Calif Los Angeles, Dept Informat Studies, Los Angeles, CA 90095 USA Publisher: SAGE PUBLICATIONS LTD, 6 BONHILL STREET, LONDON EC2A 4PU, ENGLAND IDS Number: 728VJ ISSN: Cited Author Cited Work Volume Page Year ULRICHS PERIODICALS 1932 ARCHIBALD G SCIENTOMETRICS 20 173 1991 BARABASI AL LINKED NEW SCI NETW 2002 BARR KP J DOC 23 110 1967 BEAVER DD ISIS 76 371 1985 BERNAL JD SOCIAL FUNCTION SCI 1939 BRADFORD SC ENGINEERING-LONDON 137 85 1934 CHUBIN D SOCIOLOGY SCI ANNOTA 1983 COLE FJ SCI PROGR 11 578 1917 COZZENS SE SCIENTOMETRICS 7 431 1985 CRANE D AM SOCIOL REV 30 699 1965 CRANE D INVISIBLE COLL 1972 DREHER C NATION 197 14 1963 EGGHE L INFORM PROCESS MANAG 28 201 1992 EGGHE L J AM SOC INFORM SCI 51 1004 2000 EGGHE L J INFORM SCI 12 193 1986 EGGHE L SCIENTOMETRICS 25 5 1992 ELKANA Y METRIC SCI ADVENT SC 1978 GARFIELD E CURRENT CONTENTS 43 3 1985 GARFIELD E CURRENT CONTENTS 23 3 1984 GARFIELD E SCIENTOMETRICS 7 487 1985 GILBERT GN SCI STUD 4 279 1974 GILBERT GN SCIENTOMETRICS 1 9 1974 GLANZEL W SCIENTOMETRICS 7 211 1985 GOLDBERG S SCIENCE 140 639 1963 GOLDSMITH M SCI SCI 1964 GOTTSCHALK CM AM DOC 14 188 1963 GROSS PLK SCIENCE 66 385 1927 GUPTA BM SCIENTOMETRICS 53 161 2002 HAGSTROM WO ADM SCI Q 9 241 1964 HAGSTROM WO SCI COMMUNITY 1965 HARGENS L SOCIOL INQ 48 121 1978 HESS DJ SCI STUDIES ADV INTR 1997 HOLTON G DAEDALUS 91 362 1962 HUBERMAN BA LAWS WEB PATTERNS EC 2001 HULME EW STAT BIBLIO RELATION 1923 KAPLAN NR J AM SOC INFORM SCI 51 324 2000 LEYDESDORFF LA CHALLENGE SCIENTOMET 1995 LIEVROUW LA SCHOLARLY COMMUNICAT 59 1990 LINE M INT SOC SCI J 28 122 1976 LINE MB J AM SOC INFORM SCI 38 307 1987 LINE MB J DOC 30 283 1974 LOTKA AJ J WASHINGTON ACADEMY 16 317 1926 MEADOWS J WEB KNOWLEDGE FESTSC 87 2000 MERTON RK SCIENCE 
159 56 1968 MOLYNEUX RE GOOD ORDER ESSAYS HO 85 1994 MOLYNEUX RE LIB CULTURE 29 297 1994 MOLYNEUX RE LIBR INFORM SCI RES 8 5 1986 MORAVCSIK MJ RES POLICY 2 266 1973 MULLINS N THEORIES THEORY GROU 1973 NEWMAN MEJ P NATL ACAD SCI USA 98 404 2001 NICHOLLS PT INFORMATION PROCESSI 24 469 1988 OTLET P TRAITE DOCUMENTATION 1934 PERSSON O CURR SCI INDIA 79 590 2000 PRICE DJ ARCH INT HIST SCI 4 85 1951 PRICE DJD AM PSYCHOL 21 1011 1966 PRICE DJD J AM SOC INFORM SCI 27 292 1976 PRICE DJD LITTLE SCI BIG SCI 1986 PRICE DJD SCI BABYLON 1961 PRICE DJD SCIENCE 149 510 1965 RESCHER N SCI PROGR PHILOS ESS 1978 RIDER F SCHOLAR FUTURE RES L 1944 ROUSSEAU JJ SOCIAL CONTRACT 101 1762 TABAH AN INFORM PROCESS MANAG 28 61 1992 TAGUE J LIBR TRENDS 30 125 1981 TODOROV R J INFORM SCI 14 47 1988 VANRAAN AFJ HDB QUANTITATIVE STU 1988 VICKERY BC INFORMATION ENV WORL 101 1990 VICKERY BC SCI COMMUNICATION HI 2000 VLACHY J SCIENTOMETRICS 7 505 1985 WHITE HD J AM SOC INF SCI TEC 52 87 2001 WOLFRAM D INFORMETRICS 89 90 355 1990 WOOTTON CB TRENDS SIZE GROWTH C 1977 ZIMAN J INTRO SCI STUDIES PH 1984 __________________________________________________ Eugene Garfield, PhD. email: garfield at codex.cis.upenn.edu From loet at LEYDESDORFF.NET Fri Jan 23 04:25:18 2004 From: loet at LEYDESDORFF.NET (Loet Leydesdorff) Date: Fri, 23 Jan 2004 10:25:18 +0100 Subject: Chinese Science Citation Index Message-ID: Mapping the Chinese Science Citation Database available at http://www.leydesdorff.net/china01/art/index.htm or http://www.leydesdorff.net/china01/art/china01.pdf Methods developed for mapping the journal structures contained in aggregated journal-journal citations in the Science Citation Index are applied to the Chinese Science Citation Database of the Chinese Academy of Sciences. This database covers 991 journals, of which only 37 had originally English titles. Using factor-analytical and graph-analytical techniques we show that this data is dually structured. 
The main structure is the intellectual organization of the journals in journal groups (as in the international SCI), but the university-based journals provide an institutional layer that orients this structure towards practical ends (e.g., agriculture). The Chinese Science Citation Database exhibits the characteristics of "Mode 2" in the production of scientific knowledge more than its western counterparts. The context of application leads to correlation (interfactorial complexity) among the components. The maps are available at http://www.leydesdorff.net/china01 _____ Loet Leydesdorff Amsterdam School of Communications Research (ASCoR) Kloveniersburgwal 48, 1012 CX Amsterdam Tel.: +31-20- 525 6598; fax: +31-20- 525 3681 loet at leydesdorff.net ; http://www.leydesdorff.net/ The Challenge of Scientometrics ; The Self-Organization of the Knowledge-Based Society From gwhitney at UTK.EDU Fri Jan 23 10:14:17 2004 From: gwhitney at UTK.EDU (Gretchen Whitney) Date: Fri, 23 Jan 2004 10:14:17 -0500 Subject: Sente - search tool for PubMed Message-ID: Sente (sen-TAY) is a new search tool for PubMed, allowing the user to collect citations and abstracts into categories, sort them, filter them, and automatically update searches on a daily basis. Not a new concept, but a new implementation. See http://www.thirdstreetsoftware.com/ MacOS X.2 and above.
--gw <><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> Gretchen Whitney, PhD tel 865.974.7919 School of Information Sciences fax 865.974.4967 University of Tennessee, Knoxville TN 37996 USA gwhitney at utk.edu http://web.utk.edu/~gwhitney/ jESSE:http://web.utk.edu/~gwhitney/jesse.html SIGMETRICS:http://web.utk.edu/~gwhitney/sigmetrics.html <><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> From Chaomei.Chen at CIS.DREXEL.EDU Wed Jan 28 15:03:02 2004 From: Chaomei.Chen at CIS.DREXEL.EDU (Chaomei Chen) Date: Wed, 28 Jan 2004 15:03:02 -0500 Subject: Call for Papers: The 3rd International Symposium on Knowledge Domain Visualization (KDViz'04) Message-ID: Dear All, The 3rd International Symposium on Knowledge Domain Visualization (KDViz'04) is to be held as part of the 8th International Conference on Information Visualization in London, England, July 14-16, 2004. Papers due: March 1, 2004 The official webpage of the symposium: http://www.graphicslink.demon.co.uk/IV04/KDViz.htm Additional information is available at: http://www.pages.drexel.edu/~cc345/kdviz/kdviz04/ Best wishes, Chaomei From kate.mccain at CIS.DREXEL.EDU Wed Jan 28 15:59:17 2004 From: kate.mccain at CIS.DREXEL.EDU (Kate McCain) Date: Wed, 28 Jan 2004 15:59:17 -0500 Subject: Fw: reading versus citing Message-ID: All, This was posted on CHMINF-L today. I thought it might be of interest to people who don't lurk on that listserv. Apologies for any duplication.
Regards, Kate McCain College of Information Science & Technology Drexel University ----- Forwarded by Kate McCain/Drexel_IST on 01/28/2004 03:59 PM ----- Brian Simboli Sent by: CHEMICAL INFORMATION SOURCES DISCUSSION LIST 01/28/2004 03:44 PM Please respond to CHEMICAL INFORMATION SOURCES DISCUSSION LIST To CHMINF-L at LISTSERV.INDIANA.EDU cc Subject Re: reading versus citing Persons interested in the issue that Flora Grabowska refers to below may want to see the following, which Mikhail Simkin brought to my attention yesterday. It is a detailed version of the paper about copied citations: http://arxiv.org/abs/cond-mat/0401529 Brian Simboli Lehigh University --- Flora Grabowska wrote: > Christina's last point is well taken although I missed the posting she > refers to. It may also be that scientists skim/read articles cited. > Last summer an MIT team made CNN news which was not really news > because a Vassar scientist had published the same findings 5 years > earlier *and was cited by the MIT group* in their Nature article. > > see MIT graphics at > > WATER-WALKING HAS HIDDEN DEPTHS > Infographic explains how water striders skim the surface with such > apparent ease. > http://info.nature.com/cgi-bin24/DM/y/eLiL0CgtHx0C30DQe0Ao > > I understand the MIT scientists have since acknowledged that they > essentially re-researched earlier work which at least one of them had > seen but perhaps not digested fully. Bob Suter's earlier work had > even made its way into books on locomotion so it was not exactly obscure! > > Flora > > >> >> Also, it may be worth looking at the work posted here a few weeks ago >> about how scientists only read a small percent of the articles they cite. > >> > >> > >> Christina K. Pikas, MLS >> R.E.
Gibson Library & Information Center >> The Johns Hopkins University Applied Physics Laboratory >> Voice 240.228.4812 (Washington), 443.778.4812 (Baltimore) > >> Fax 443.778.5353 > > >-- > > > ___________________________________________________________________ > Flora Grabowska, Science Librarian phone 845 437 5788 > Box 553 > Vassar College, fax 845 437 5864 > 124 Raymond Ave, e-mail:flgrabowska at vassar.edu > Poughkeepsie, NY 12604-0553 > Vassar College Library Website: http://library.vassar.edu/vcl/index.html -- Brian Simboli Science Librarian Library & Technology Services E.W. Fairchild Martindale 8A East Packer Avenue Bethlehem, PA 18015-3170 (610) 758-5003 E-mail: brs4 at lehigh.edu CHMINF-L Archives (also to join or leave CHMINF-L, etc.) http://listserv.indiana.edu/archives/chminf-l.html Search the CHMINF-L archives at: http://listserv.indiana.edu/cgi-bin/wa?S1=chminf-l Sponsors of CHMINF-L: http://www.indiana.edu/~cheminfo/chminf-l_support.html -------------- next part -------------- An HTML attachment was scrubbed... URL: From PI at DB.DK Fri Jan 30 13:07:26 2004 From: PI at DB.DK (Ingwersen Peter) Date: Fri, 30 Jan 2004 19:07:26 +0100 Subject: ISSI 2005 Call f. Papers Message-ID: Dear colleagues in the Informetric and associated research communities, Below, I have attached the Call for Papers for the ISSI 2005 Conference on research in Informetrics, Scientometrics, Bibliometrics and Webometrics - all fields that are belonging to LIS. ISSI 2005 takes place in: Stockholm, Sweden - July 25-28 - 2005 - Karolinska Institute Full paper submision dealine: 31st January, 2005 Consult also the conference website: http://www.umu.se/inforsk/ISSI2005/ You may distribute to colleagues for whom you find this event relevant. Many regards - Yours Peter Ingwersen, Professor & Programme Chair -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ISSI 2005 Call for papers.doc Type: application/msword Size: 34816 bytes Desc: not available URL: From felix at UGR.ES Fri Jan 30 13:25:50 2004 From: felix at UGR.ES (Félix de Moya Anegón) Date: Fri, 30 Jan 2004 19:25:50 +0100 Subject: Call for Papers: The 3rd International Symposiium on Knowledge Domain Visualization (KDViz'04) In-Reply-To: Message-ID: This is also interesting. Regards. Quoting Chaomei Chen: > Dear All, > > The 3rd International Symposium on Knowledge Domain Visualization > (KDViz'04) is to be held as part of the 8th International Conference on > Information Visualization in London, England, July 14-16, 2004. > > Papers due: March 1, 2004 > > The official webpage of the symposium: > http://www.graphicslink.demon.co.uk/IV04/KDViz.htm > > Additional information is available at: > http://www.pages.drexel.edu/~cc345/kdviz/kdviz04/ > > > Best wishes, > Chaomei > -- ************************************************** FELIX DE MOYA ANEGON VICERRECTOR UNIVERSIDAD DE GRANADA ************************************************** From ruben at UCR.EDU Fri Jan 30 17:12:55 2004 From: ruben at UCR.EDU (ruben urbizagastegui) Date: Fri, 30 Jan 2004 14:12:55 -0800 Subject: Chinese Science Citation Index In-Reply-To: <000f01c3e192$d131c300$1402a8c0@loet> Message-ID: Hi all, Could somebody please help me get a copy of this paper? Kretschmer, Hildrum. Distribution of co-author couples in journals: "continuation" of Lotka's law on the 3rd dimension. In: 8th International Conference on Scientometrics and Informetrics. Proceedings - ISSI-2001 -. BIRG, Univ. New South Wales. Vol. 1, 2001, pp. 317-325. Sydney, NSW, Australia. Thank you very much. 
Ruben -------------------------------- Ruben Urbizagastegui Librarian Science Library University of California Riverside, CA 92517 - 5900 USA From Adrian.Dale at CREATIFICA.COM Sat Jan 31 13:48:25 2004 From: Adrian.Dale at CREATIFICA.COM (Adrian Dale (Journal of Information Science)) Date: Sat, 31 Jan 2004 18:48:25 -0000 Subject: A connectionist and multivariate approach to science maps: SOM and statistics techniques applied to Library & Information Science research Message-ID: We have received the following paper for review: A connectionist and multivariate approach to science maps: SOM and statistics techniques applied to Library & Information Science research Abstract -------- The visualization of scientific field structures is a classic application of scientometric studies. This paper presents a domain analysis of the Library & Information Science discipline based on author cocitation analysis (ACA) and journal cocitation analysis (JCA). The techniques used for map construction are the Self-Organizing Map (SOM) neural algorithm, Ward's clustering method and Multidimensional Scaling (MDS). The results of this study are compared with similar research developed by Howard White and Katherine McCain. We are looking for a panel of suitably qualified referees for this paper - if you feel you have the necessary background and could complete the review in 28 days, please drop me an e-mail and I'll forward the paper. The paper remains the copyright of the author until it is accepted for publication - at which point the copyright is assigned to the Chartered Institute of Library and Information Professionals - CILIP. 
Adrian Dale Editor Journal of Information Science ------------------------------ EMail: Adrian.Dale at Creatifica.com Tel: +44 1933 622624 Fax: +44 870 127 8215 Mobile: +44 7850 570007 Paper: Creatifica House, 21 Water Lane, Chelveston, Wellingborough, Northants, NN9 6AP, UK From j.hartley at PSY.KEELE.AC.UK Fri Jan 2 22:12:33 2004 From: j.hartley at PSY.KEELE.AC.UK (James Hartley) Date: Sat, 3 Jan 2004 03:12:33 +0000 Subject: 2 new papers Message-ID: Colleagues may be interested in two new papers. One is on my research on structured abstracts, and one is on new ways of making academic articles easier to read. The abstracts are attached. Copies of both papers are available from me. Season's greetings Jim James Hartley School of Psychology Keele University Staffordshire http://www.keele.ac.uk/psychology/people/hartleyjames/ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Abstracts for RHE.doc Type: application/msword Size: 27136 bytes Desc: Abstracts for RHE.doc URL: