Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches

Katy Borner katy at INDIANA.EDU
Sun Mar 20 23:47:32 EDT 2011


Dear all,
Many of the datasets used in the study below have been made available at
http://sts.cns.iu.edu to support replicability and to inspire future
comparisons.
k


On 3/18/2011 10:07 AM, Kevin Boyack wrote:
> Dear Colleagues,
>
> Several months ago we published an article in JASIST ("Co-citation 
> analysis, bibliographic coupling, and direct citation: Which citation 
> approach represents the research front most accurately?") that was 
> only part of a much larger study in which we compared the map results 
> of three citation approaches, nine text approaches, and one hybrid 
> approach on a single corpus.
>
> An article with the results from the nine text-based approaches was 
> published yesterday in PLoS ONE. This new article is the result of a 
> collaborative effort between SciTech Strategies, Katy Börner's team at 
> Indiana University, David Newman at UC Irvine, André Skupin at SDSU, 
> and Bob Schijvenaars at Collexis.
>
> With best regards,
>
> Kevin Boyack and Dick Klavans
>
> SciTech Strategies, Inc.
>
> -------------------------------------------
>
> *Clustering More than Two Million Biomedical Publications: Comparing 
> the Accuracies of Nine Text-Based Similarity Approaches*
>
> PLoS ONE (article freely available online at 
> http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0018029) 
>
>
> *Background:* We investigate the accuracy of different similarity 
> approaches for clustering over two million biomedical documents. 
> Clustering large sets of text documents is important for a variety of 
> information needs and applications such as collection management and 
> navigation, summary and analysis. The few comparisons of clustering 
> results from different similarity approaches have focused on small 
> literature sets and have given conflicting results. Our study was 
> designed to seek a robust answer to the question of which similarity 
> approach would generate the most coherent clusters of a biomedical 
> literature set of over two million documents.
>
> *Methodology:* We used a corpus of 2.15 million recent (2004-2008) 
> records from MEDLINE, and generated nine different document-document 
> similarity matrices from information extracted from their 
> bibliographic records, including titles, abstracts and subject 
> headings. The nine approaches combined five different 
> analytical techniques with two data sources. The five analytical 
> techniques are cosine similarity using term frequency-inverse document 
> frequency vectors (tf-idf cosine), latent semantic analysis (LSA), 
> topic modeling, and two Poisson-based language models -- BM25 and PMRA 
> (PubMed Related Articles). The two data sources were a) MeSH subject 
> headings, and b) words from titles and abstracts. Each similarity 
> matrix was filtered to keep the top-n highest similarities per 
> document and then clustered using a combination of graph layout and 
> average-link clustering. Cluster results from the nine similarity 
> approaches were compared using (1) within-cluster textual coherence 
> based on the Jensen-Shannon divergence, and (2) two concentration 
> measures based on grant-to-article linkages indexed in MEDLINE.
>
> *Conclusions:* PubMed's own related article approach (PMRA) generated 
> the most coherent and most concentrated cluster solution of the nine 
> text-based similarity approaches tested, followed closely by the BM25 
> approach using titles and abstracts. Approaches using only MeSH 
> subject headings were not competitive with those based on titles and 
> abstracts.
>
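
As a rough illustration of three steps named in the Methodology paragraph
above (tf-idf cosine similarity, keeping the top-n highest similarities per
document, and the Jensen-Shannon divergence that underlies the coherence
measure), here is a minimal sketch in plain Python. It is not the code used
in the study. The function names, the tf-idf weighting (raw count times
log(N/df)), the base-2 logarithm, and the toy documents are all assumptions
made for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized doc."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term count times log(N / df).
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_similar(vectors, n):
    """For each document index, keep its n most similar other documents."""
    result = {}
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        # Highest similarity first; ties broken by document index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[i] = [(j, s) for s, j in scored[:n]]
    return result


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two term-probability dicts."""
    terms = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl_to_mixture(a):
        # KL(A || M); terms with zero probability in A contribute nothing.
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0.0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Hypothetical toy corpus: three tiny "documents" already split into words.
docs = [
    ["gene", "expression", "cancer"],
    ["gene", "expression", "tumor"],
    ["protein", "folding", "kinetics"],
]
vectors = tfidf_vectors(docs)
neighbors = top_n_similar(vectors, n=1)
```

The all-pairs loop in top_n_similar is quadratic in the number of documents,
so it only suits toy inputs; a corpus of 2.15 million records would need a
sparse or indexed approach instead. The sketch also stops at the filtered
similarity lists and does not attempt the graph layout and average-link
clustering step.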

-- 
Katy Borner
Victor H. Yngve Professor of Information Science
Director, CI for Network Science Center, http://cns.slis.indiana.edu
Curator, Mapping Science exhibit, http://scimaps.org

School of Library and Information Science, Indiana University
Wells Library 021, 1320 E. Tenth Street, Bloomington, IN 47405, USA
Phone: (812) 855-3256  Fax: -6166

