Cohen, KB; Johnson, HL; Verspoor, K; Roeder, C; Hunter, LE. 2010. The structural and content aspects of abstracts versus bodies of full text journal articles are different. BMC BIOINFORMATICS 11: art. no.-492

Eugene Garfield garfield at CODEX.CIS.UPENN.EDU
Fri Dec 31 10:55:17 EST 2010



Author Full Name(s): Cohen, K. Bretonnel; Johnson, Helen L.; Verspoor, Karin; 
Roeder, Christophe; Hunter, Lawrence E.
Language: English
Document Type: Article
KeyWords Plus: INFORMATION EXTRACTION; BIOLOGY

Abstract: Background: An increase in work on the full text of journal articles 
and the growth of PubMedCentral have the potential to create a major 
paradigm shift in how biomedical text mining is done. However, until now there 
has been no comprehensive characterization of how the bodies of full text 
journal articles differ from the abstracts that until now have been the subject 
of most biomedical text mining research.
Results: We examined the structural and linguistic aspects of abstracts and 
bodies of full text articles, the performance of text mining tools on both, and 
the distribution of a variety of semantic classes of named entities between 
them. We found marked structural differences, with longer sentences in the 
article bodies and much heavier use of parenthesized material in the bodies 
than in the abstracts. We found content differences with respect to linguistic 
features. Three out of four of the linguistic features that we examined were 
statistically significantly differently distributed between the two genres. We 
also found content differences with respect to the distribution of semantic 
features. There were significantly different densities per thousand words for 
three out of four semantic classes, and clear differences in the extent to which 
they appeared in the two genres. With respect to the performance of text 
mining tools, we found that a mutation finder performed equally well in both 
genres, but that a wide variety of gene mention systems performed much 
worse on article bodies than they did on abstracts. POS tagging was also more 
accurate in abstracts than in article bodies.
Conclusions: Aspects of structure and content differ markedly between article 
abstracts and article bodies. A number of these differences may pose problems 
as the text mining field moves more into the area of processing full-text 
articles. However, these differences also present a number of opportunities for 
the extraction of data types, particularly those found in parenthesized text, that 
are present in article bodies but not in article abstracts.

Addresses: [Cohen, K. Bretonnel; Johnson, Helen L.; Verspoor, Karin; Roeder, 
Christophe; Hunter, Lawrence E.] Univ Colorado, Sch Med, Dept Pharmacol, Ctr 
Computat Pharmacol, Aurora, CO USA; [Cohen, K. Bretonnel] Univ Colorado, 
Dept Linguist, Boulder, CO 80309 USA

Reprint Address: Cohen, KB, Univ Colorado, Sch Med, Dept Pharmacol, Ctr 
Computat Pharmacol, Aurora, CO USA.

E-mail Address: kevin.cohen at gmail.com
ISSN: 1471-2105
DOI: 10.1186/1471-2105-11-492
fulltext: http://www.biomedcentral.com/1471-2105/11/492/abstract


