Sun, BJ et al. 2011. Identifying, Indexing, and Ranking Chemical Formulae and Chemical Names in Digital Documents. ACM TRANSACTIONS ON INFORMATION SYSTEMS 29 (2): art. no.-12

Thu Jun 2 15:24:23 EDT 2011

Sun, BJ; Mitra, P; Giles, CL; Mueller, KT. 2011. Identifying, Indexing, and 
Ranking Chemical Formulae and Chemical Names in Digital Documents. ACM 
TRANSACTIONS ON INFORMATION SYSTEMS 29 (2): art. no.-12.

Author Full Name(s): Sun, Bingjun; Mitra, Prasenjit; Giles, C. Lee; Mueller, Karl 
T.
Language: English
Document Type: Article

Author Keywords: Algorithms; Design; Experimentation; Documentation; 
Chemical name; chemical formula; entity extraction; conditional random fields; 
support vector machines; independent frequent subsequence; hierarchical text 
segmentation; index pruning; query models; similarity search; ranking
KeyWords Plus: PROBABILISTIC FUNCTIONS; COMPUTER TRANSLATION; 
MARKOV CHAINS; IUPAC; TEXT; IDENTIFICATION; NOMENCLATURE; 
RECOGNITION; INFORMATION; ALGORITHM

Abstract: End-users utilize chemical search engines to search for chemical 
formulae and chemical names. Chemical search engines identify and index 
chemical formulae and chemical names appearing in text documents to support 
efficient search and retrieval in the future. Identifying chemical formulae and 
chemical names in text automatically has been a hard problem that has met 
with varying degrees of success in the past. We propose algorithms for 
chemical formula and chemical name tagging using Conditional Random Fields 
(CRFs) and Support Vector Machines (SVMs) that achieve higher accuracy 
than existing (published) methods. After chemical entities have been identified 
in text documents, they must be indexed. To support user-provided 
search queries that require a partial match on a chemical name segment 
used as a keyword or on a partial chemical formula, all possible (or a 
significant number of) subformulae of formulae that appear in any document 
and all possible subterms (e.g., "methyl") of chemical names (e.g., "methylethyl 
ketone") must be indexed. Indexing all possible subformulae and subterms 
results in an exponential increase in the storage and memory requirements as 
well as the time taken to process the indices. We propose techniques to prune 
the indices significantly without reducing the quality of the returned results 
significantly. Finally, we propose multiple query semantics to allow users to 
pose different types of partial search queries for chemical entities. We 
demonstrate empirically that our search engines improve the relevance of the 
returned results for search queries involving chemical entities.

Addresses: [Sun, Bingjun] Penn State Univ, Dept Comp Sci & Engn, University 
Pk, PA 16802 USA; [Mitra, Prasenjit; Giles, C. Lee] Penn State Univ, Coll 
Informat Sci & Technol, University Pk, PA 16802 USA; [Mueller, Karl T.] Penn 
State Univ, Dept Chem, University Pk, PA 16802 USA
Reprint Address: Sun, BJ, Penn State Univ, Dept Comp Sci & Engn, University 
Pk, PA 16802 USA.

E-mail Address: sunbingjun at gmail.com; pmitra at ist.psu.edu; giles at ist.psu.edu; 
ktm2 at psu.edu
ISSN: 1046-8188
DOI: 10.1145/1961209.1961215
URL: http://portal.acm.org/citation.cfm?id=1961215&dl=ACM&coll=DL&CFID=26992261&CFTOKEN=66190610