Papers from Proceedings of the National Academy of Sciences of the USA 101 (Suppl) April 6 2004

Wed Jun 2 16:51:18 EDT 2004

All of the following articles are available in full text at:
http://www.pnas.org/content/vol101/suppl_1/

TITLE:          Extracting knowledge from the World Wide Web (Article,
                English)
AUTHOR:         Henzinger, M; Lawrence, S
SOURCE:         PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE
                UNITED STATES OF AMERICA 101 (SUPPL). APR 6 2004.
                p.5186-5191 NATL ACAD SCIENCES, WASHINGTON

ABSTRACT:       The World Wide Web provides an unprecedented opportunity
to automatically analyze a large sample of interests and activity in the
world. We discuss methods for extracting knowledge from the web by
randomly sampling and analyzing hosts and pages, and by analyzing the
link structure of the web and how links accumulate over time. A variety
of interesting and valuable information can be extracted, such as the
distribution of web pages over domains, the distribution of interest in
different areas, communities related to different topics, the nature of
competition in different categories of sites, and the degree of
communication between different communities or countries.

AUTHOR ADDRESS: M Henzinger, Google Inc, 2400 Bayshore Pkwy, Mountain View,
                CA 94043 USA

--------------------------------------------------------------------------
TITLE:          Mapping knowledge domains: Characterizing PNAS (Article,
                English)
AUTHOR:         Boyack, KW
SOURCE:         PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE
                UNITED STATES OF AMERICA 101 (SUPPL). APR 6 2004.
                p.5192-5199 NATL ACAD SCIENCES, WASHINGTON

ABSTRACT:       A review of data mining and analysis techniques that can
be used for the mapping of knowledge domains is given. Literature mapping
techniques can be based on authors, documents, journals, words, and/or
indicators. Most mapping questions are related to research assessment or
to the structure and dynamics of disciplines or networks. Several mapping
techniques are demonstrated on a data set comprising 20 years of papers
published in PNAS. Data from a variety of sources are merged to provide
unique indicators of the domain bounded by PNAS. By using funding source
information and citation counts, it is shown that, on an aggregate basis,
papers funded jointly by the U.S. Public Health Service (which includes
the National Institutes of Health) and non-U.S. government sources
outperform papers funded by other sources, including by the U.S. Public
Health Service alone. Grant data from the National Institute on Aging
show that, on average, papers from large grants are cited more than those
from small grants, with performance increasing with grant amount. A map
of the highest performing papers over the 20-year period was generated by
using citation analysis. Changes and trends in the subjects of highest
impact within the PNAS domain are described. Interactions between topics
over the most recent 5-year period are also detailed.

AUTHOR ADDRESS: KW Boyack, Sandia Natl Labs, Computat Comp Informat & Math
                Ctr, POB 5800, Albuquerque, NM 87185 USA

--------------------------------------------------------------------------
TITLE:          Coauthorship networks and patterns of scientific
                collaboration (Article, English)
AUTHOR:         Newman, MEJ
SOURCE:         PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE
                UNITED STATES OF AMERICA 101 (SUPPL). APR 6 2004.
                p.5200-5205 NATL ACAD SCIENCES, WASHINGTON

ABSTRACT:       By using data from three bibliographic databases in
biology, physics, and mathematics, respectively, networks are constructed
in which the nodes are scientists, and two scientists are connected if
they have coauthored a paper. We use these networks to answer a broad
variety of questions about collaboration patterns, such as the numbers of
papers authors write, how many people they write them with, what the
typical distance between scientists is through the network, and how
patterns of collaboration vary between subjects and over time. We also
summarize a number of recent results by other authors on coauthorship
patterns.

AUTHOR ADDRESS: MEJ Newman, Univ Michigan, Ctr Study Complex Syst, Ann
                Arbor, MI 48109 USA
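
Illustrative sketch (not from the paper): the network construction the
abstract describes can be tried on a toy input. The paper list, the author
labels, and the function names below are assumptions made up for this
example; the paper's own databases and methods are not reproduced here.

```python
from collections import Counter, defaultdict, deque

def build_coauthor_graph(papers):
    """Map each author to the set of people they have coauthored with."""
    graph = defaultdict(set)
    for authors in papers:
        unique = set(authors)
        for a in unique:
            # Sole authors still get a node, with no neighbours.
            graph[a] |= unique - {a}
    return graph

def distance(graph, source, target):
    """Shortest number of coauthorship hops (BFS); None if unreachable."""
    if source not in graph or target not in graph:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == target:
            return hops
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None

# Hypothetical toy data: each inner list is the author list of one paper.
papers = [["A", "B"], ["B", "C"], ["C", "D", "E"], ["F"]]
graph = build_coauthor_graph(papers)

# Number of papers per author and number of distinct collaborators.
papers_per_author = Counter(a for authors in papers for a in set(authors))
collaborators = {a: len(nbrs) for a, nbrs in graph.items()}
```

With this toy input, distance(graph, "A", "E") follows the chain A, B, C, E,
and author "F" is unreachable from everyone else.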

--------------------------------------------------------------------------
TITLE:          Tracking evolving communities in large linked networks
                (Article, English)
AUTHOR:         Hopcroft, J; Khan, O; Kulis, B; Selman, B
SOURCE:         PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE
                UNITED STATES OF AMERICA 101 (SUPPL). APR 6 2004.
                p.5249-5253 NATL ACAD SCIENCES, WASHINGTON

ABSTRACT:       We are interested in tracking changes in large-scale data
by periodically creating an agglomerative clustering and examining the
evolution of clusters (communities) over time. We examine a large real-
world data set: the NEC CiteSeer database, a linked network of >250,000
papers. Tracking changes over time requires a clustering algorithm that
produces clusters stable under small perturbations of the input data.
However, small perturbations of the CiteSeer data lead to significant
changes to most of the clusters. One reason for this is that the order in
which papers within communities are combined is somewhat arbitrary.
However, certain subsets of papers, called natural communities,
correspond to real structure in the CiteSeer database and thus appear in
any clustering. By identifying the subset of clusters that remain stable
under multiple clustering runs, we get the set of natural communities
that we can track over time. We demonstrate that such natural communities
allow us to identify emerging communities and track temporal changes in
the underlying structure of our network data.

AUTHOR ADDRESS: B Selman, Cornell Univ, Dept Comp Sci, Ithaca, NY 14853 USA
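
Illustrative sketch (not from the paper): one simple way to keep only the
clusters that reappear across several clustering runs. The overlap score,
the 0.7 threshold, and the toy clusterings are assumptions chosen for this
example and should not be read as the authors' exact criterion.

```python
def match(c1, c2):
    """Overlap score in [0, 1]: the smaller of the two coverage fractions."""
    if not c1 or not c2:
        return 0.0
    common = len(c1 & c2)
    return min(common / len(c1), common / len(c2))

def stable_clusters(reference, perturbed_runs, threshold=0.7):
    """Keep reference clusters that have a close match in every other run."""
    stable = []
    for cluster in reference:
        best_per_run = (
            max((match(cluster, other) for other in run), default=0.0)
            for run in perturbed_runs
        )
        if all(score >= threshold for score in best_per_run):
            stable.append(cluster)
    return stable

# Hypothetical clusterings of papers 1..10 from three runs.
reference = [{1, 2, 3, 4}, {5, 6, 7}, {8, 9}]
run_a = [{1, 2, 3, 4, 10}, {5, 6}, {7, 8, 9}]
run_b = [{1, 2, 3}, {4, 5, 6, 7}, {8, 9}]

natural = stable_clusters(reference, [run_a, run_b])
```

Only the first reference cluster survives here; the other two drift below
the threshold in run_a once paper 7 changes sides.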

--------------------------------------------------------------------------
TITLE:          Evolution of document networks (Article, English)
AUTHOR:         Menczer, F
SOURCE:         PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE
                UNITED STATES OF AMERICA 101 (SUPPL). APR 6 2004.
                p.5261-5265 NATL ACAD SCIENCES, WASHINGTON

ABSTRACT:       How does a network of documents grow without centralized
control? This question is becoming crucial as we try to explain the
emergent scale-free topology of the World Wide Web and use link analysis
to identify important information resources. Existing models of growing
information networks have focused on the structure of links but neglected
the content of nodes. Here I show that the current models fail to
reproduce a critical characteristic of information networks, namely the
distribution of textual similarity among linked documents. I propose a
more realistic model that generates links by using both popularity and
content. This model yields remarkably accurate predictions of both degree
and similarity distributions in networks of web pages and scientific
literature.

AUTHOR ADDRESS: F Menczer, Indiana Univ, Sch Informat, Bloomington, IN
                47408 USA

--------------------------------------------------------------------------
TITLE:          The simultaneous evolution of author and paper networks
                (Article, English)
AUTHOR:         Borner, K; Maru, JT; Goldstone, RL
SOURCE:         PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE
                UNITED STATES OF AMERICA 101 (SUPPL). APR 6 2004.
                p.5266-5273 NATL ACAD SCIENCES, WASHINGTON

ABSTRACT:       There has been a long history of research into the
structure and evolution of mankind's scientific endeavor. However, recent
progress in applying the tools of science to understand science itself
has been unprecedented because only recently has there been access to
high-volume and high-quality data sets of scientific output (e.g.,
publications, patents, grants) and computers and algorithms capable of
handling this enormous stream of data. This article reviews major work on
models that aim to capture and recreate the structure and dynamics of
scientific evolution. We then introduce a general process model that
simultaneously grows coauthor and paper citation networks. The
statistical and dynamic properties of the networks generated by this
model are validated against a 20-year data set of articles published in
PNAS. Systematic deviations from a power law distribution of citations to
papers are well fit by a model that incorporates a partitioning of
authors and papers into topics, a bias for authors to cite recent papers,
and a tendency for authors to cite papers cited by papers that they have
read. In this TARL model (for topics, aging, and recursive linking), the
number of topics is linearly related to the clustering coefficient of the
simulated paper citation network.

AUTHOR ADDRESS: K Borner, Indiana Univ, Sch Lib & Informat Sci,
                Bloomington, IN 47405 USA

--------------------------------------------------------------------------
TITLE:          Crossmaps: Visualization of overlapping relationships in
                collections of journal papers (Article, English)
AUTHOR:         Morris, SA; Yen, GG
SOURCE:         PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE
                UNITED STATES OF AMERICA 101 (SUPPL). APR 6 2004.
                p.5291-5296 NATL ACAD SCIENCES, WASHINGTON

ABSTRACT:       A crossmapping technique is introduced for visualizing
multiple and overlapping relations among entity types in collections of
journal articles. Groups of entities from two entity types are
crossplotted to show correspondence of relations. For example, author
collaboration groups are plotted on the x axis against groups of
papers (research fronts) on the y axis. At the intersection of each
author group/research front pair, a circular symbol is plotted whose size
is proportional to the number of times that authors in the group appear
as authors in papers in the research front. Entity groups are found by
agglomerative hierarchical clustering using conventional similarity
measures. Crossmaps comprise a simple technique that is particularly
suited to showing overlap in relations among entity groups. Particularly
useful crossmaps are: research fronts against base reference clusters,
research fronts against author collaboration groups, and research fronts
against term co-occurrence clusters. When exploring the knowledge domain
of a collection of journal papers, it is useful to have several crossmaps
of different entity pairs, complemented by research front timelines and
base reference cluster timelines.

AUTHOR ADDRESS: SA Morris, Oklahoma State Univ, 202 Engn S, Stillwater, OK
                74078 USA
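
Illustrative sketch (not from the paper): the counts behind a crossmap of
author groups against research fronts. It assumes the two groupings have
already been produced by some clustering step; every identifier and data
value below is hypothetical.

```python
from collections import Counter

def crossmap_counts(papers, author_group, research_front):
    """Count author appearances for each (author group, research front).

    papers:         {paper_id: list of author names}
    author_group:   {author name: group label}
    research_front: {paper_id: front label}
    """
    counts = Counter()
    for paper_id, authors in papers.items():
        front = research_front.get(paper_id)
        if front is None:
            continue  # paper not assigned to any front
        for author in authors:
            group = author_group.get(author)
            if group is not None:
                counts[(group, front)] += 1
    return counts

# Hypothetical toy collection.
papers = {"p1": ["ann", "bob"], "p2": ["ann", "cy"], "p3": ["dee"]}
author_group = {"ann": "G1", "bob": "G1", "cy": "G2", "dee": "G2"}
research_front = {"p1": "F1", "p2": "F1", "p3": "F2"}

counts = crossmap_counts(papers, author_group, research_front)
```

Each nonzero entry of counts would correspond to one plotted symbol, with
the count setting the symbol size.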

--------------------------------------------------------------------------
TITLE:          User-controlled mapping of significant literatures
                (Article, English)
AUTHOR:         White, HD; Lin, X; Buzydlowski, JW; Chen, CM
SOURCE:         PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE
                UNITED STATES OF AMERICA 101 (SUPPL). APR 6 2004.
                p.5297-5302 NATL ACAD SCIENCES, WASHINGTON

ABSTRACT:       We apply a version of our web-based literature-mapping
system to PNAS for 1971-2002, as indexed by the National Library of
Medicine and the Institute for Scientific Information. Given a single
input term from a user, a medical subject heading, a cocited author, or a
cocited journal, PNASLINK rapidly displays views in which that term and
the other 24 terms that most frequently co-occur with it in a
bibliographic database are interrelated in ways suggesting fruitful
combinations for document retrieval. The interrelationships are produced
by two algorithms, pathfinder networks and Kohonen-style self-organizing
maps. PNASLINK displays are themselves interactive interfaces that can
retrieve documents from digital libraries (e.g., PNAS Online). This style
of visualizing knowledge domains is called "localized" because it does
not attempt to map the indexing of literatures in full but concentrates
on the top terms in an "associative thesaurus" reflecting user interests.
It also permits swift remappings, as the user recognizes terms worth
pursuing. PNASLINK is illustrated with maps drawn from the literature of
population genetics. Some comparative and evaluative comments are added,
one from a domain expert indicating that the face validity of the system
may be tempered by insufficient specificity in the indexing terms being
mapped.

AUTHOR ADDRESS: HD White, Drexel Univ, Coll Informat Sci & Technol,
                Philadelphia, PA 19104 USA
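
Illustrative sketch (not from the paper): picking the terms that most often
co-occur with a seed term, the step that precedes any mapping. The record
format and the function name are assumptions; this is not PNASLINK code and
it does not build pathfinder networks or self-organizing maps.

```python
from collections import Counter

def top_cooccurring(records, seed, k=24):
    """Return up to k (term, count) pairs that co-occur with the seed term.

    records: iterable of term lists, one list per indexed document.
    """
    counts = Counter()
    for terms in records:
        unique = set(terms)
        if seed in unique:
            counts.update(unique - {seed})
    return counts.most_common(k)

# Hypothetical indexing terms for four documents.
records = [
    ["genetics", "population", "selection"],
    ["genetics", "population"],
    ["genetics", "drift", "population"],
    ["selection", "drift"],
]

top = top_cooccurring(records, "genetics", k=1)
```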

--------------------------------------------------------------------------
TITLE:          Searching for intellectual turning points: Progressive
                knowledge domain visualization (Article, English)
AUTHOR:         Chen, CM
SOURCE:         PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE
                UNITED STATES OF AMERICA 101 (SUPPL). APR 6 2004.
                p.5303-5310 NATL ACAD SCIENCES, WASHINGTON

ABSTRACT:       This article introduces a previously undescribed method for
progressively visualizing the evolution of a knowledge domain's
cocitation network. The method first derives a sequence of cocitation
networks from a series of equal-length time interval slices. These time-
registered networks are merged and visualized in a panoramic view in such
a way that intellectually significant articles can be identified based on
their visually salient features. The method is applied to a cocitation
study of the superstring field in theoretical physics. The study focuses
on the search for articles that triggered two superstring revolutions.
Visually salient nodes in the panoramic view are identified, and the
nature of their intellectual contributions is validated by leading
scientists in the field. The analysis has demonstrated that a search for
intellectual turning points can be narrowed down to visually salient
nodes in the visualized network. The method provides a promising way to
simplify otherwise cognitively demanding tasks to a search for landmarks,
pivots, and hubs.

AUTHOR ADDRESS: CM Chen, Drexel Univ, Coll Informat Sci & Technol, 3141
                Chestnut St, Philadelphia, PA 19104 USA
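
Illustrative sketch (not from the paper): deriving one cocitation network
per equal-length time slice and then combining them. The input format, the
function names, and the merge rule (summing pair counts) are assumptions
for this example; the paper's own merging and visualization steps are not
reproduced.

```python
from collections import Counter
from itertools import combinations

def cocitation_slices(papers, start, end, slice_len):
    """Build a cocitation count table for each time slice.

    papers: list of (publication year, list of cited references).
    Returns a list of (slice start year, Counter of reference pairs).
    """
    slices = []
    for lo in range(start, end + 1, slice_len):
        hi = lo + slice_len
        pairs = Counter()
        for year, refs in papers:
            if lo <= year < hi:
                # Two references are cocited when one paper cites both.
                for a, b in combinations(sorted(set(refs)), 2):
                    pairs[(a, b)] += 1
        slices.append((lo, pairs))
    return slices

def merge_slices(slices):
    """Assumed merge rule: add up pair counts over all slices."""
    merged = Counter()
    for _, pairs in slices:
        merged.update(pairs)
    return merged

# Hypothetical citing papers.
papers = [
    (1984, ["X", "Y"]),
    (1985, ["X", "Y", "Z"]),
    (1995, ["Y", "Z"]),
]
slices = cocitation_slices(papers, 1984, 1995, 6)
merged = merge_slices(slices)
```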
