Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches

Fri Mar 18 10:07:22 EDT 2011

Dear Colleagues,

Several months ago we published an article in JASIST (“Co-citation analysis,
bibliographic coupling, and direct citation: Which citation approach
represents the research front most accurately?”) that was only part of a
much larger study in which we compared the map results of three citation
approaches, nine text approaches, and one hybrid approach on a single
corpus.

An article with the results from the nine text-based approaches was
published yesterday in PLoS ONE. This new article is the result of a
collaborative effort between SciTech Strategies, Katy Börner’s team at
Indiana University, David Newman at UC Irvine, André Skupin at SDSU, and Bob
Schijvenaars at Collexis.

With best regards,

Kevin Boyack and Dick Klavans

SciTech Strategies, Inc.

-------------------------------------------

Clustering More than Two Million Biomedical Publications: Comparing the
Accuracies of Nine Text-Based Similarity Approaches

PLoS ONE (article freely available online at
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0018029) 

Background: We investigate the accuracy of different similarity approaches
for clustering over two million biomedical documents. Clustering large sets
of text documents is important for a variety of information needs and
applications such as collection management and navigation, summary, and
analysis. The few comparisons of clustering results from different
similarity approaches have focused on small literature sets and have given
conflicting results. Our study was designed to seek a robust answer to the
question of which similarity approach would generate the most coherent
clusters of a biomedical literature set of over two million documents.

Methodology: We used a corpus of 2.15 million recent (2004-2008) records
from MEDLINE, and generated nine different document-document similarity
matrices from information extracted from their bibliographic records,
including titles, abstracts and subject headings. The nine approaches
combined five different analytical techniques with two data sources. The
five analytical techniques were cosine similarity using term
frequency-inverse document frequency vectors (tf-idf cosine), latent
semantic analysis (LSA), topic modeling, and two Poisson-based language
models – BM25 and PMRA (PubMed Related Articles). The two data sources were
a) MeSH subject headings, and b) words from titles and abstracts. Each
similarity matrix was filtered to keep the top-n highest similarities per
document and then clustered using a combination of graph layout and
average-link clustering. Cluster results from the nine similarity approaches
were compared using (1) within-cluster textual coherence based on the
Jensen-Shannon divergence, and (2) two concentration measures based on
grant-to-article linkages indexed in MEDLINE.
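
For readers who want a concrete picture of three of the steps named above
(tf-idf cosine similarity, top-n filtering, and the Jensen-Shannon
divergence), here is a minimal Python sketch. It is not the pipeline used in
the study: the toy documents, the function names, the log(N/df) idf variant,
and the whitespace tokenization are all assumptions made for illustration,
and the study's coherence measure involves more than the bare divergence
shown here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized doc.

    Assumption: raw term counts times log(N / df), one common tf-idf variant.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: c * math.log(n_docs / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_n_neighbors(vectors, n):
    """For each document, keep only its n most similar other documents."""
    result = []
    for i, u in enumerate(vectors):
        sims = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        sims.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by index
        result.append([(j, s) for s, j in sims[:n]])
    return result


def word_distribution(tokens):
    """Relative word frequencies of a token list."""
    total = len(tokens)
    return {t: c / total for t, c in Counter(tokens).items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two word distributions."""
    terms = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl(a):
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical toy corpus: two related titles and one unrelated title.
docs = [
    "gene expression in tumor cells".split(),
    "tumor gene expression profiling".split(),
    "cardiac surgery patient outcomes".split(),
]
vectors = tfidf_vectors(docs)
neighbors = top_n_neighbors(vectors, n=1)

# Divergence of one document's word distribution from a two-document "cluster".
cluster_dist = word_distribution(docs[0] + docs[1])
divergence = js_divergence(word_distribution(docs[0]), cluster_dist)
```

In this toy corpus the two tumor-related titles select each other as nearest
neighbors, while the cardiac title shares no terms with either. With log base
2 the divergence lies between 0 (identical distributions) and 1 (no words in
common), so lower values indicate a document that is textually closer to its
cluster.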

Conclusions: PubMed’s own related article approach (PMRA) generated the most
coherent and most concentrated cluster solution of the nine text-based
similarity approaches tested, followed closely by the BM25 approach using
titles and abstracts. Approaches using only MeSH subject headings were not
competitive with those based on titles and abstracts.
