Geng J. and Yang J. AUTOBIB: Automatic Extraction of Bibliographic Information on the Web. Proceedings of IDEAS, pp. 193-204, 2004.

Eugene Garfield garfield at CODEX.CIS.UPENN.EDU
Thu Apr 14 17:09:10 EDT 2005


Junfei Geng : geng at cs.duke.edu
Jun  Yang   : junyang at cs.duke.edu


TITLE : AUTOBIB: Automatic Extraction of Bibliographic Information on the Web


International Database Engineering and Applications Symposium (IDEAS'04)
July 07 - 09, 2004
Coimbra, Portugal

pp. 193-204

ABSTRACT:

The Web has greatly facilitated access to information. However, information
presented in HTML is mainly intended to be browsed by humans, and the
problem of automatically extracting such information remains an important
and challenging task. In this work, we focus on building a system called
AUTOBIB to automate extraction of bibliographic information on the Web. We
use a combination of bootstrapping, statistical, and heuristic methods to
achieve a high degree of automation. To set up extraction from a new site,
we only need to provide a few lines of code specifying how to download pages
containing bibliographic information. We do not need to be concerned with
each site’s presentation format, and the system can cope with changes in the
presentation format without human intervention.

AUTOBIB bootstraps itself with a small seed database of structured
bibliographic records. For each bibliographic Web site, we identify segments
within its pages that represent bibliographic records, using
state-of-the-art record-boundary discovery techniques. Next, we find matches
for some of these "raw records" in the seed database using a set of
heuristics. These matches serve as a training set for a parser based on the
Hidden Markov Model (HMM), which is then used to parse the rest of the raw
records into structured records. We have found an effective HMM structure
with special states that correspond to delimiters and HTML tags in raw
records. Experiments demonstrate that for our application, this HMM
structure achieves high success rates without the complexity of previously
proposed structures.
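
The HMM parsing step described above can be illustrated with a minimal
sketch. This is not the AUTOBIB implementation: the state names (AUTHOR,
TITLE, YEAR, DELIM, TAG), the coarse token features, the add-one smoothing,
and the helper names feature, train, and parse are all assumptions chosen
for illustration, and the training records are made-up toy data. The sketch
only shows the general idea of giving delimiters and HTML tags their own
states, estimating probabilities from matched records, and labeling a new
raw record with Viterbi decoding.

```python
import math
from collections import Counter, defaultdict

DELIMS = {",", ".", ":", ";"}  # assumed delimiter set (hypothetical)

def feature(token):
    """Map a token to a coarse feature so unseen words still get scored."""
    if token in DELIMS:
        return token
    if token.startswith("<") and token.endswith(">"):
        return token.lower()  # HTML tag
    if token.isdigit() and len(token) == 4:
        return "YEAR"
    return "CAP" if token[0].isupper() else "WORD"

def train(labeled_records, k=1.0):
    """Estimate smoothed log-probabilities from (token, state) sequences."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for record in labeled_records:
        prev = None
        for token, state in record:
            if prev is None:
                start[state] += 1
            else:
                trans[prev][state] += 1
            emit[state][feature(token)] += 1
            prev = state
    states = sorted(emit)
    vocab = {f for s in states for f in emit[s]}
    n_s, n_v = len(states), len(vocab) + 1  # +1 slot for unseen features

    def logp(count, total, n):
        return math.log((count + k) / (total + k * n))

    return {
        "states": states,
        "start": {s: logp(start[s], sum(start.values()), n_s) for s in states},
        "trans": {a: {b: logp(trans[a][b], sum(trans[a].values()), n_s)
                      for b in states} for a in states},
        "emit": {s: {f: logp(emit[s][f], sum(emit[s].values()), n_v)
                     for f in vocab} for s in states},
        "unk": {s: logp(0, sum(emit[s].values()), n_v) for s in states},
    }

def parse(tokens, model):
    """Viterbi decoding: return the most likely state for each token."""
    states = model["states"]

    def e(s, f):
        return model["emit"][s].get(f, model["unk"][s])

    feats = [feature(t) for t in tokens]
    score = {s: model["start"][s] + e(s, feats[0]) for s in states}
    backptrs = []
    for f in feats[1:]:
        new_score, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: score[p] + model["trans"][p][s])
            new_score[s] = score[best] + model["trans"][best][s] + e(s, f)
            ptr[s] = best
        score = new_score
        backptrs.append(ptr)
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy training set standing in for raw records matched against the seed database.
TRAINING = [
    [("Geng", "AUTHOR"), (",", "DELIM"), ("Yang", "AUTHOR"), (".", "DELIM"),
     ("<i>", "TAG"), ("Automatic", "TITLE"), ("extraction", "TITLE"),
     ("</i>", "TAG"), (".", "DELIM"), ("2004", "YEAR")],
    [("Smith", "AUTHOR"), (".", "DELIM"), ("<i>", "TAG"), ("Parsing", "TITLE"),
     ("records", "TITLE"), ("</i>", "TAG"), (".", "DELIM"), ("2001", "YEAR")],
]

model = train(TRAINING)
raw = ["Lee", ",", "Kim", ".", "<i>", "Learning", "models", "</i>", ".", "1999"]
labels = parse(raw, model)
```

Here labels holds one state per token of the raw record. Grouping
consecutive tokens that share a state (and dropping DELIM and TAG tokens)
would yield the fields of a structured record.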


