Cohen, KB; Johnson, HL; Verspoor, K; Roeder, C; Hunter, LE. 2010. The structural and content aspects of abstracts versus bodies of full text journal articles are different. BMC BIOINFORMATICS 11: art. no.-492
Eugene Garfield
garfield at CODEX.CIS.UPENN.EDU
Fri Dec 31 10:55:17 EST 2010
Author Full Name(s): Cohen, K. Bretonnel; Johnson, Helen L.; Verspoor, Karin;
Roeder, Christophe; Hunter, Lawrence E.
Language: English
Document Type: Article
KeyWords Plus: INFORMATION EXTRACTION; BIOLOGY
Abstract: Background: An increase in work on the full text of journal articles
and the growth of PubMedCentral have the opportunity to create a major
paradigm shift in how biomedical text mining is done. However, until now there
has been no comprehensive characterization of how the bodies of full text
journal articles differ from the abstracts that until now have been the subject
of most biomedical text mining research.
Results: We examined the structural and linguistic aspects of abstracts and
bodies of full text articles, the performance of text mining tools on both, and
the distribution of a variety of semantic classes of named entities between
them. We found marked structural differences, with longer sentences in the
article bodies and much heavier use of parenthesized material in the bodies
than in the abstracts. We found content differences with respect to linguistic
features. Three out of four of the linguistic features that we examined were
statistically significantly differently distributed between the two genres. We
also found content differences with respect to the distribution of semantic
features. There were significantly different densities per thousand words for
three out of four semantic classes, and clear differences in the extent to which
they appeared in the two genres. With respect to the performance of text
mining tools, we found that a mutation finder performed equally well in both
genres, but that a wide variety of gene mention systems performed much
worse on article bodies than they did on abstracts. POS tagging was also more
accurate in abstracts than in article bodies.
Conclusions: Aspects of structure and content differ markedly between article
abstracts and article bodies. A number of these differences may pose problems
as the text mining field moves more into the area of processing full-text
articles. However, these differences also present a number of opportunities for
the extraction of data types, particularly that found in parenthesized text, that
is present in article bodies but not in article abstracts.
Addresses: [Cohen, K. Bretonnel; Johnson, Helen L.; Verspoor, Karin; Roeder,
Christophe; Hunter, Lawrence E.] Univ Colorado, Sch Med, Dept Pharmacol, Ctr
Computat Pharmacol, Aurora, CO USA; [Cohen, K. Bretonnel] Univ Colorado,
Dept Linguist, Boulder, CO 80309 USA
Reprint Address: Cohen, KB, Univ Colorado, Sch Med, Dept Pharmacol, Ctr
Computat Pharmacol, Aurora, CO USA.
E-mail Address: kevin.cohen at gmail.com
ISSN: 1471-2105
DOI: 10.1186/1471-2105-11-492
fulltext: http://www.biomedcentral.com/1471-2105/11/492/abstract