Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches
Katy Borner
katy at INDIANA.EDU
Sun Mar 20 23:47:32 EDT 2011
Dear all,
Many of the datasets used in the study below have been made available at
http://sts.cns.iu.edu to support replicability and to inspire future
comparisons.
k
On 3/18/2011 10:07 AM, Kevin Boyack wrote:
> Dear Colleagues,
>
> Several months ago we published an article in JASIST ("Co-citation
> analysis, bibliographic coupling, and direct citation: Which citation
> approach represents the research front most accurately?") that was
> only part of a much larger study in which we compared the map results
> of three citation approaches, nine text approaches, and one hybrid
> approach on a single corpus.
>
> An article with the results from the nine text-based approaches was
> published yesterday in PLoS ONE. This new article is the result of a
> collaborative effort between SciTech Strategies, Katy Börner's team at
> Indiana University, David Newman at UC Irvine, André Skupin at SDSU,
> and Bob Schijvenaars at Collexis.
>
> With best regards,
>
> Kevin Boyack and Dick Klavans
>
> SciTech Strategies, Inc.
>
> -------------------------------------------
>
> *Clustering More than Two Million Biomedical Publications: Comparing
> the Accuracies of Nine Text-Based Similarity Approaches*
>
> PLoS ONE (article freely available online at
> http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0018029)
>
>
> *Background:* We investigate the accuracy of different similarity
> approaches for clustering over two million biomedical documents.
> Clustering large sets of text documents is important for a variety of
> information needs and applications such as collection management and
> navigation, summary and analysis. The few comparisons of clustering
> results from different similarity approaches have focused on small
> literature sets and have given conflicting results. Our study was
> designed to seek a robust answer to the question of which similarity
> approach would generate the most coherent clusters of a biomedical
> literature set of over two million documents.
>
> *Methodology:* We used a corpus of 2.15 million recent (2004-2008)
> records from MEDLINE, and generated nine different document-document
> similarity matrices from information extracted from their
> bibliographic records, including titles, abstracts and subject
> headings. The nine approaches comprised five different
> analytical techniques with two data sources. The five analytical
> techniques are cosine similarity using term frequency-inverse document
> frequency vectors (tf-idf cosine), latent semantic analysis (LSA),
> topic modeling, and two Poisson-based language models -- BM25 and PMRA
> (PubMed Related Articles). The two data sources were a) MeSH subject
> headings, and b) words from titles and abstracts. Each similarity
> matrix was filtered to keep the top-n highest similarities per
> document and then clustered using a combination of graph layout and
> average-link clustering. Cluster results from the nine similarity
> approaches were compared using (1) within-cluster textual coherence
> based on the Jensen-Shannon divergence, and (2) two concentration
> measures based on grant-to-article linkages indexed in MEDLINE.
>
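To make the first technique in the methodology concrete, here is a minimal sketch of tf-idf cosine similarity followed by top-n filtering. It is not the study's code. The whitespace tokenisation, the smoothed idf formula, the function names, and the sample titles are all assumptions made for illustration; the abstract does not state the value of n or the preprocessing that was used.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalised tf-idf vector (dict: term -> weight) per document."""
    tokenised = [doc.lower().split() for doc in docs]  # assumption: whitespace tokens
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Assumption: smoothed idf, log((1 + N) / (1 + df)) + 1.
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def top_n_similarities(docs, n=2):
    """For each document index, keep its n most similar other documents."""
    vectors = tfidf_vectors(docs)
    result = {}
    for i, u in enumerate(vectors):
        scored = [(j, cosine(u, v)) for j, v in enumerate(vectors) if j != i]
        scored = [(j, s) for j, s in scored if s > 0.0]  # drop unrelated pairs
        scored.sort(key=lambda pair: (-pair[1], pair[0]))  # highest similarity first
        result[i] = scored[:n]
    return result

# Hypothetical titles, used only to exercise the functions.
SAMPLE_DOCS = [
    "gene expression in cancer cells",
    "cancer cells gene mutation",
    "graph layout clustering algorithm",
    "clustering algorithm for document graph",
]

if __name__ == "__main__":
    for doc_id, neighbours in top_n_similarities(SAMPLE_DOCS, n=1).items():
        print(doc_id, neighbours)
```

This all-pairs loop is quadratic in the number of documents, so it only suits small examples. A corpus of two million records would need an inverted index or a similar sparse approach.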
> *Conclusions:* PubMed's own related article approach (PMRA) generated
> the most coherent and most concentrated cluster solution of the nine
> text-based similarity approaches tested, followed closely by the BM25
> approach using titles and abstracts. Approaches using only MeSH
> subject headings were not competitive with those based on titles and
> abstracts.
>
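The abstract says within-cluster textual coherence was based on the Jensen-Shannon divergence. As a rough illustration of that quantity, the sketch below computes the divergence between each document's word distribution and the word distribution of its cluster, then averages over the cluster. The function names, the whitespace tokenisation, and the use of base-2 logarithms are assumptions; the abstract does not say how the study converted divergences into a coherence score.

```python
import math
from collections import Counter

def word_distribution(tokens):
    """Relative word frequencies for a non-empty token list (dict: word -> prob)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_divergence(p, m):
    """Kullback-Leibler divergence KL(p || m) in bits; m must cover p's support."""
    return sum(pw * math.log2(pw / m[w]) for w, pw in p.items() if pw > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two word distributions, in [0, 1] bits."""
    vocab = set(p) | set(q)
    # Mixture distribution: the pointwise average of p and q.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def mean_divergence_from_cluster(cluster_docs):
    """Average divergence of each document from its cluster's pooled distribution."""
    tokenised = [doc.lower().split() for doc in cluster_docs]  # assumption
    pooled = word_distribution([t for tokens in tokenised for t in tokens])
    divergences = [jensen_shannon(word_distribution(tokens), pooled)
                   for tokens in tokenised]
    return sum(divergences) / len(divergences)

if __name__ == "__main__":
    # Hypothetical clusters: one topically tight, one mixed.
    tight = ["gene expression cancer", "cancer gene expression"]
    mixed = ["gene expression cancer", "graph layout clustering"]
    print(mean_divergence_from_cluster(tight))
    print(mean_divergence_from_cluster(mixed))
```

Under this sketch a lower average divergence means the documents in a cluster share more of their vocabulary, so a topically tight cluster scores lower than a mixed one.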
--
Katy Borner
Victor H. Yngve Professor of Information Science
Director, CI for Network Science Center, http://cns.slis.indiana.edu
Curator, Mapping Science exhibit, http://scimaps.org
School of Library and Information Science, Indiana University
Wells Library 021, 1320 E. Tenth Street, Bloomington, IN 47405, USA
Phone: (812) 855-3256 Fax: -6166