Sun, BJ et al. 2011. Identifying, Indexing, and Ranking Chemical Formulae and Chemical Names in Digital Documents. ACM TRANSACTIONS ON INFORMATION SYSTEMS 29 (2): art. no.-12
Eugene Garfield
garfield at CODEX.CIS.UPENN.EDU
Thu Jun 2 15:24:23 EDT 2011
Sun, BJ; Mitra, P; Giles, CL; Mueller, KT. 2011. Identifying, Indexing, and
Ranking Chemical Formulae and Chemical Names in Digital Documents. ACM
TRANSACTIONS ON INFORMATION SYSTEMS 29 (2): art. no.-12.
Author Full Name(s): Sun, Bingjun; Mitra, Prasenjit; Giles, C. Lee; Mueller, Karl
T.
Language: English
Document Type: Article
Author Keywords: Algorithms; Design; Experimentation; Documentation;
Chemical name; chemical formula; entity extraction; conditional random fields;
support vector machines; independent frequent subsequence; hierarchical text
segmentation; index pruning; query models; similarity search; ranking
KeyWords Plus: PROBABILISTIC FUNCTIONS; COMPUTER TRANSLATION;
MARKOV CHAINS; IUPAC; TEXT; IDENTIFICATION; NOMENCLATURE;
RECOGNITION; INFORMATION; ALGORITHM
Abstract: End-users utilize chemical search engines to search for chemical
formulae and chemical names. Chemical search engines identify and index
chemical formulae and chemical names appearing in text documents to support
efficient search and retrieval in the future. Identifying chemical formulae and
chemical names in text automatically has been a hard problem that has met
with varying degrees of success in the past. We propose algorithms for
chemical formula and chemical name tagging using Conditional Random Fields
(CRFs) and Support Vector Machines (SVMs) that achieve higher accuracy
than existing (published) methods. After chemical entities have been identified
in text documents, they must be indexed. In order to support user-provided
search queries that require a partial match between the chemical name
segment used as a keyword or a partial chemical formula, all possible (or a
significant number of) subformulae of formulae that appear in any document
and all possible subterms (e.g., "methyl") of chemical names (e.g., "methylethyl
ketone") must be indexed. Indexing all possible subformulae and subterms
results in an exponential increase in the storage and memory requirements as
well as the time taken to process the indices. We propose techniques to prune
the indices significantly without reducing the quality of the returned results
significantly. Finally, we propose multiple query semantics to allow users to
pose different types of partial search queries for chemical entities. We
demonstrate empirically that our search engines improve the relevance of the
returned results for search queries involving chemical entities.
Addresses: [Sun, Bingjun] Penn State Univ, Dept Comp Sci & Engn, University
Pk, PA 16802 USA; [Mitra, Prasenjit; Giles, C. Lee] Penn State Univ, Coll
Informat Sci & Technol, University Pk, PA 16802 USA; [Mueller, Karl T.] Penn
State Univ, Dept Chem, University Pk, PA 16802 USA
Reprint Address: Sun, BJ, Penn State Univ, Dept Comp Sci & Engn, University
Pk, PA 16802 USA.
E-mail Address: sunbingjun at gmail.com; pmitra at ist.psu.edu; giles at ist.psu.edu;
ktm2 at psu.edu
ISSN: 1046-8188
DOI: 10.1145/1961209.1961215
URL: http://portal.acm.org/citation.cfm?id=1961215&dl=ACM&coll=DL&CFID=26992261&CFTOKEN=66190610