Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches

Katy Borner katy at INDIANA.EDU
Sun Mar 20 23:47:32 EDT 2011

Dear all,
many of the datasets used in the study below were made available at in support of replicability and to inspire future 

On 3/18/2011 10:07 AM, Kevin Boyack wrote:
> Dear Colleagues,
> Several months ago we published an article in JASIST ("Co-citation 
> analysis, bibliographic coupling, and direct citation: Which citation 
> approach represents the research front most accurately?") that was 
> only part of a much larger study in which we compared the map results 
> of three citation approaches, nine text approaches, and one hybrid 
> approach on a single corpus.
> An article with the results from the nine text-based approaches was 
> published yesterday in PLoS ONE. This new article is the result of a 
> collaborative effort between SciTech Strategies, Katy Börner's team at 
> Indiana University, David Newman at UC Irvine, André Skupin at SDSU, 
> and Bob Schijvenaars at Collexis.
> With best regards,
> Kevin Boyack and Dick Klavans
> SciTech Strategies, Inc.
> -------------------------------------------
> *Clustering More than Two Million Biomedical Publications: Comparing 
> the Accuracies of Nine Text-Based Similarity Approaches*
> PLoS ONE (article freely available online at 
> *Background:* We investigate the accuracy of different similarity 
> approaches for clustering over two million biomedical documents. 
> Clustering large sets of text documents is important for a variety of 
> information needs and applications such as collection management and 
> navigation, summary and analysis. The few comparisons of clustering 
> results from different similarity approaches have focused on small 
> literature sets and have given conflicting results. Our study was 
> designed to seek a robust answer to the question of which similarity 
> approach would generate the most coherent clusters of a biomedical 
> literature set of over two million documents.
> *Methodology:* We used a corpus of 2.15 million recent (2004-2008) 
> records from MEDLINE, and generated nine different document-document 
> similarity matrices from information extracted from their 
> bibliographic records, including titles, abstracts and subject 
> headings. The nine approaches were comprised of five different 
> analytical techniques with two data sources. The five analytical 
> techniques are cosine similarity using term frequency-inverse document 
> frequency vectors (tf-idf cosine), latent semantic analysis (LSA), 
> topic modeling, and two Poisson-based language models -- BM25 and PMRA 
> (PubMed Related Articles). The two data sources were a) MeSH subject 
> headings, and b) words from titles and abstracts. Each similarity 
> matrix was filtered to keep the top-n highest similarities per 
> document and then clustered using a combination of graph layout and 
> average-link clustering. Cluster results from the nine similarity 
> approaches were compared using (1) within-cluster textual coherence 
> based on the Jensen-Shannon divergence, and (2) two concentration 
> measures based on grant-to-article linkages indexed in MEDLINE.
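
To make the first steps of the methodology concrete, here is a minimal sketch of tf-idf cosine similarity followed by top-n filtering. It is not the code used in the study: the function names, the toy documents, the whitespace tokenization, the unsmoothed log idf weighting, and n=1 are all assumptions chosen for illustration. The brute-force pairwise loop is only workable for tiny inputs; a corpus of two million records would need an inverted index or a similar approach.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Raw term frequency times log inverse document frequency (an assumed weighting).
        vectors.append({t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_neighbors(vectors, n):
    """For each document, keep only its n most similar other documents."""
    result = {}
    for i, u in enumerate(vectors):
        sims = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        # Highest similarity first; ties broken by document index.
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        result[i] = [(j, s) for s, j in sims[:n] if s > 0]
    return result


# Toy stand-ins for title/abstract text (hypothetical data).
docs = [
    "gene expression in breast cancer".split(),
    "breast cancer gene mutation".split(),
    "protein folding simulation".split(),
    "protein structure folding".split(),
]
neighbors = top_n_neighbors(tfidf_vectors(docs), n=1)
for doc_id, kept in neighbors.items():
    print(doc_id, kept)
```

The filtered neighbor lists form a sparse similarity graph, which is the kind of input a graph-layout and clustering step would then consume.
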
> *Conclusions:* PubMed's own related article approach (PMRA) generated 
> the most coherent and most concentrated cluster solution of the nine 
> text-based similarity approaches tested, followed closely by the BM25 
> approach using titles and abstracts. Approaches using only MeSH 
> subject headings were not competitive with those based on titles and 
> abstracts.
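
The textual coherence mentioned in the abstract is based on the Jensen-Shannon divergence between word distributions. The sketch below shows only the divergence itself, using base-2 logarithms so that values fall between 0 and 1. How the study turns it into a per-cluster coherence score is not described in this message, so the helper names, the toy cluster, and the comparison of one document against the pooled cluster distribution are assumptions for illustration only.

```python
import math
from collections import Counter


def word_distribution(tokens, vocabulary):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocabulary)
    if total == 0:
        raise ValueError("no vocabulary words found in tokens")
    return [counts[t] / total for t in vocabulary]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by 1 with base-2 logs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical cluster of three short documents.
cluster = [
    "breast cancer gene expression".split(),
    "breast cancer gene mutation".split(),
    "cancer gene therapy".split(),
]
vocabulary = sorted({t for doc in cluster for t in doc})
pooled = word_distribution([t for doc in cluster for t in doc], vocabulary)

# Divergence of each document from the pooled cluster distribution;
# lower values mean the document's wording is closer to the cluster's.
for doc in cluster:
    print(round(js_divergence(word_distribution(doc, vocabulary), pooled), 3))
```
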

Katy Borner
Victor H. Yngve Professor of Information Science
Director, CI for Network Science Center,
Curator, Mapping Science exhibit,

School of Library and Information Science, Indiana University
Wells Library 021, 1320 E. Tenth Street, Bloomington, IN 47405, USA
Phone: (812) 855-3256  Fax: -6166
