[Asis-l] JASIST TOC Vol 54, #8
Richard Hill
rhill at asis.org
Thu Jun 5 09:43:09 EDT 2003
JASIST, Journal of the American Society for Information Science and Technology
Volume 54, Issue 8, 2003
[Note: at the end of this message are URLs for viewing contents of JASIST
from past issues. Below, the contents of Bert Boyce's "In This Issue" have
been cut into the Table of Contents.]
CONTENTS
EDITORIAL
IN THIS ISSUE
Bert L. Boyce
704
RESEARCH
Graph Structure in Three National Academic Webs: Power Laws with Anomalies
Mike Thelwall and David Wilkinson
Published online 16 April 2003
706
Thelwall and Wilkinson use crawls of university web sites in the
UK, Australia, and New Zealand to collect all links targeting university
web sites in the same country, which they then use to create a graph
structure for study. Using Broder's study as a model, they identify a
strongly connected component, SCC, where one could start anywhere in the
set and reach every other page, and an Out component whose pages can be
reached from all strongly connected pages but provide no link back to that
set. The other components in the Broder model are not accessible except
with access to a major search engine database. In-link and out-link counts
for all three university systems, in both the Out and SCC components,
appear linear when graphed on logarithmic scales, which would indicate that
power laws, and a success-breeds-success phenomenon, are generally in
effect. However, automatically generated pages, non-HTML web pages, and
large resource-driven sites all were associated with anomalies in this
observation.
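As an illustration of the two components named above, here is a minimal sketch, not the authors' procedure. It assumes a hypothetical toy link list and a seed page known to lie in the strongly connected set. The SCC is the set of pages that can both reach and be reached from the seed, and Out is whatever the seed reaches that cannot link back.

```python
from collections import defaultdict


def reachable(start, adjacency):
    """Return every node reachable from start, including start itself."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def scc_and_out(edges, seed):
    """Return (SCC containing seed, Out component of that SCC)."""
    forward = defaultdict(set)   # page -> pages it links to
    backward = defaultdict(set)  # page -> pages linking to it
    for src, dst in edges:
        forward[src].add(dst)
        backward[dst].add(src)
    descendants = reachable(seed, forward)
    ancestors = reachable(seed, backward)
    scc = descendants & ancestors   # mutually reachable with the seed
    out = descendants - scc         # reachable from the SCC, no path back
    return scc, out


# Hypothetical toy link graph: a, b, c form a cycle; d and e hang off it.
edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("c", "d"), ("d", "e"), ("f", "a")]
scc, out = scc_and_out(edges, "a")
print(sorted(scc), sorted(out))
```

Page f links into the cycle but is not reachable from it, so it falls in neither set here.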
Efficient Single-Pass Index Construction for Text Databases
Steffen Heinz and Justin Zobel
Published online 16 April 2003
713
Heinz and Zobel review file inversion processes for the creation
of text indices and suggest an efficient single-pass approach. Complete
in-memory indexing remains impractical for very large files. Current rapid
algorithms require that the entire vocabulary of the collection be kept in
memory. The proposed approach creates inverted files in memory for sequences
of documents until memory resources are exhausted, then transfers the
lexicon and inverted file in lexicographical order to disk for subsequent
merger. Each term is assigned a dynamic in-memory bit vector that
accumulates postings in a compressed d-gap format. The lexicon is
maintained in a burst trie file structure where leaves are containers of
strings with common prefixes. Performance on five-gigabyte to twenty-gigabyte
files is fifteen to twenty percent faster than a sort-based approach.
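To make the d-gap idea concrete, here is a small sketch of building one in-memory run. It is an assumption-laden illustration, not the authors' code: variable-byte coding stands in for whatever compression scheme the paper uses, a plain dictionary stands in for the burst trie, and the function names and sample documents are hypothetical. Each posting is stored as the gap from the previous document number for that term, and the run is emitted in lexicographical term order, ready to be written to disk and merged later.

```python
def vbyte_encode(number):
    """Variable-byte code: 7 data bits per byte, high bit set on the last byte."""
    chunks = []
    while True:
        chunks.insert(0, number % 128)
        if number < 128:
            break
        number //= 128
    chunks[-1] += 128
    return bytes(chunks)


def build_run(documents):
    """Build one sorted run from (doc_id, text) pairs with increasing doc_id >= 1."""
    postings = {}  # term -> bytearray of compressed d-gaps
    last_doc = {}  # term -> last document number seen for that term
    for doc_id, text in documents:
        for term in set(text.lower().split()):
            gap = doc_id - last_doc.get(term, 0)
            postings.setdefault(term, bytearray()).extend(vbyte_encode(gap))
            last_doc[term] = doc_id
    # Lexicographical order makes the later merge of runs a sequential pass.
    return sorted((term, bytes(data)) for term, data in postings.items())


def decode_postings(data):
    """Turn compressed d-gaps back into absolute document numbers."""
    doc_ids, number, current = [], 0, 0
    for byte in data:
        if byte < 128:
            number = number * 128 + byte
        else:
            number = number * 128 + (byte - 128)
            current += number
            doc_ids.append(current)
            number = 0
    return doc_ids


# Hypothetical documents; a real run would end when memory is exhausted.
docs = [(1, "burst trie index"), (2, "inverted index"), (300, "index merge")]
run = dict(build_run(docs))
print(decode_postings(run["index"]))
```

Merging several such runs from disk is omitted here.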
Automatic Construction of English/Chinese Parallel Corpora
Christopher C. Yang and Kar Wing Li
Published online 16 April 2003
730
Yang and Li describe the automatic matching of English and Chinese
document titles, by character and word matching based upon the study of web
pages within a site where some pages exist separately in each language.
Word and character alignment is followed by redundancy resolution, and then
title alignment takes place. English words are translated into Chinese
character string words by dictionary lookup and the various possibilities
matched with the Chinese titles using the longest common sequence of
characters. Using Hong Kong Special Administrative Region government press
releases and releases from the Hong Kong and Shanghai Banking Corporation,
they find 31,567 in the Chinese language and 30,810 in English, but only
23,701 released in parallel. There are no links between the versions. With
Recall as the number of system correct matches over the actual matches in
the file, and Precision the number of correct system matches over the
number of system matches, a test yields Precision in the range of .998 to
1.00 and recall from .806 to .948. Thus links to parallel documents in the
other language could quite likely be automatically generated.
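The matching and evaluation measures described above can be sketched as follows. This is an illustration under assumptions rather than the authors' system: the "longest common sequence" is taken to mean the standard longest common subsequence, the example titles are hypothetical, and Precision and Recall follow the definitions given in the abstract.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two character strings."""
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def precision_recall(system_pairs, actual_pairs):
    """Precision = correct / system matches; Recall = correct / actual matches."""
    correct = len(system_pairs & actual_pairs)
    return correct / len(system_pairs), correct / len(actual_pairs)


# Hypothetical candidate translation compared with a Chinese title.
candidate = "香港政府新聞公報"
title = "香港特別行政區政府新聞公報"
print(lcs_length(candidate, title))

# Hypothetical (english_id, chinese_id) alignments.
system = {(1, 1), (2, 2), (3, 4)}
actual = {(1, 1), (2, 2), (3, 3), (5, 5)}
print(precision_recall(system, actual))
```

A full aligner would score every candidate translation against every title and keep the best match above some threshold; that step is left out.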
Mining Longitudinal Web Queries: Trends and Patterns
Peiling Wang, Michael W. Berry, and Yiheng Yang
Published online 16 April 2003
743
Wang, Berry and Yang log hit counts and date-stamped queries to
the University of Tennessee website for a four-year period, as entered
through the SWISH search engine as Boolean statements where spaces were
considered to be AND operators. Queries were parsed into words and word
pairs of adjacent words or words separated by one other word; 94% of
queries contained three words or fewer. URLs were not parsed but treated as
unusual queries. Null outputs exceed 30%. Queries averaged 2 words or 13
characters. The number of queries and the vocabulary used grow over time but
the vocabulary is relatively small and includes a large number (26%) of
misspelled words and personal names. Log plots of frequencies and ranks for
both all words and words with unique frequencies overlap in the upper
portion, which follows a quadratic polynomial, and diverge in the lower
portion, where the all-word line becomes linear. Topics and search behavior
vary little over the four-year period. Websites could be improved by adding
content identified from queries.
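The parsing rule described above (single words, plus pairs of words that are adjacent or separated by one other word) might look like this. It is only a sketch: the function name and sample query are hypothetical, and the authors' handling of case, punctuation, and URLs is not reproduced.

```python
def words_and_pairs(query):
    """Split a query into words and ordered pairs at distance one or two."""
    words = query.lower().split()
    pairs = []
    for i, word in enumerate(words):
        for j in (i + 1, i + 2):  # adjacent, or one word in between
            if j < len(words):
                pairs.append((word, words[j]))
    return words, pairs


# Hypothetical query; spaces act as AND operators in the logged searches.
words, pairs = words_and_pairs("Football ticket office")
print(words)
print(pairs)
```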
Students' Conceptual Structure, Search Process, and Outcome While Preparing
a Research Proposal: A Longitudinal Case Study
Mikko Pennanen and Pertti Vakkari
Published online 16 April 2003
759
Pennanen and Vakkari use 22 undergraduate psychology students
doing Boolean searches on PsycINFO, a system with which they were
unfamiliar, to investigate the relationship between their conceptual
structure of their topic and their search process, and whether these
relations vary depending upon their stage in the Kuhlthau model. Students
searched both at the beginning and the end of their construction of a
proposal, and each search was preceded and followed by an interview. The
thought process during search was vocalized and recorded, and transaction
logs were also retained. They recorded the number of concepts used by a
student, the proportion of sub-concepts included, and the proportion of
concepts expressed in query terms. Retrieved useful references were
recorded. The two main tactics used by the subjects were the adding of a
conjoined term, and the replacement of an existing term with another. The
students were able to translate into query terms only slightly more than
half the concepts they identified. The subjects advanced significantly in
terms of the Kuhlthau model between their search sessions. Their conceptual
structure was richer, search terms used increased, but references accepted
as useful decreased. The proportion of concepts articulated in the query
correlated significantly with the number of useful references found.
Information Science Abstracts: Tracking the Literature of Information
Science. Part 2: A New Taxonomy For Information Science
Donald T. Hawkins, Signe E. Larson and Bari Q. Caton
Published online 16 April 2003
771
Using 3000 abstracts from Information Science Abstracts (ISA), Hawkins,
Larson, and Caton test the validity of a new ISA classification structure
for information science, leading to the revision and fine-tuning of the
structure. The structure was produced by collecting terms from available
vocabularies grouped into 13 main headings. Each abstract was given only
one classification number representing a main heading and a single
sub-heading by each of the researchers. A review of the distribution of
abstracts over sections suggested combining some closely related
categories, and the presence of unclassifiable abstracts pointed to
gaps in coverage. Only in 19% of the cases did all three disagree on the
assignment of a main heading. A second test with 1265 abstracts showed that
the abstracts were well distributed over what were now 11 main sections.
Low posted sub-headings were examined but retained as growing areas. The
taxonomy is included as an appendix.
Improving the Search Environment: Informed Decision Making in the Search
for Statistical Information
Stephanie W. Haas
Published online 16 April 2003
782
Studying the Bureau of Labor Statistics' LABSTAT database, Haas
looks for searching decision points at which assistance for the searcher
may be of value. A searcher has some measure of both search and domain
knowledge and will need some knowledge of the way information is provided
to make effective decisions. Transition points are identified between user
vocabulary and Bureau of Labor Statistics (BLS) concepts, between BLS
concepts and BLS data and information products, and between these products
and the actual query. This suggests the need for help in concept
definition, ambiguity resolution, and synonym usage which is not uncommon
in retrieval systems, but also assistance in the choice of products through
a matrix of specifiable categories with available surveys and series. The
need to express a query for a chosen survey/series suggests a need for
variable displays, information on the interaction of variable-value
choices, and warnings of unusual situations.
BRIEF COMMUNICATION
Using the Mann-Whitney Test on Informetric Data
John C. Huber and Roland Wagner-Döbler
Published online 16 April 2003
798
Huber and Wagner-Döbler demonstrate a relatively simple procedure
for implementing, using a spreadsheet, a Mann-Whitney test of the
difference of two bibliometric samples which will take into account the
large number of ties normally present in such data. Sources with the same
count of publications are assigned the same rank where the value is the
median of the number of such sources in both samples. The lower the p-level,
the stronger the evidence that the samples are from different distributions. It
is thus possible to determine if a change in productivity is due to factors
beyond the change in number of sources. However, small samples with small
differences will appear to be from the same distribution, and larger
samples are necessary to overcome the effect of multiple ties.
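For readers who prefer code to a spreadsheet, the following sketch shows one common way to run a Mann-Whitney test when ties are frequent. It is not the authors' procedure: it assumes the textbook midrank (average rank) treatment of ties and the tie-corrected normal approximation without a continuity correction, and the publication counts are hypothetical.

```python
import math
from collections import Counter


def midranks(values):
    """Map each distinct value to the average of the ranks it occupies."""
    counts = Counter(values)
    ranks, next_rank = {}, 1
    for value in sorted(counts):
        ties = counts[value]
        ranks[value] = next_rank + (ties - 1) / 2
        next_rank += ties
    return ranks, counts


def mann_whitney(sample_a, sample_b):
    """Return (U for sample_a, z score, two-sided p) with a tie correction.

    Fails with a division by zero if every value in both samples is the same.
    """
    n1, n2 = len(sample_a), len(sample_b)
    n = n1 + n2
    ranks, counts = midranks(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[v] for v in sample_a)
    u = rank_sum_a - n1 * (n1 + 1) / 2
    # Ties shrink the variance of U; this term is zero when there are none.
    tie_term = sum(t ** 3 - t for t in counts.values()) / (n * (n - 1))
    variance = n1 * n2 / 12 * ((n + 1) - tie_term)
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u, z, p


# Hypothetical publication counts per source in two periods.
period_1 = [1, 1, 2, 3]
period_2 = [1, 2, 4, 5]
u, z, p = mann_whitney(period_1, period_2)
print(u, round(z, 3), round(p, 3))
```

With samples this small the normal approximation is rough, which echoes the abstract's caution about small samples.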
BOOK REVIEW
Patents, Citations & Innovations: A Window on the Knowledge Economy, by
Adam B. Jaffe and Manuel Trajtenberg
Reviewed by Chaomei Chen
Published online 16 April 2003
802
LETTERS TO THE EDITOR
Empirical Evidence of Self-Organization: A Rejoinder
Loet Leydesdorff
804
Arguments for Epistemology in Information Science
Birger Hjørland
805
------------------------------------------------------
The ASIS web site <http://www.asis.org/Publications/JASIS/tocs.html>
contains the Table of Contents and brief abstracts as above from January
1993 (Volume 44) to date.
The John Wiley Interscience site <http://www.interscience.wiley.com>
includes issues from 1986 (Volume 37) to date. Guests have access only to
tables of contents and abstracts. Registered users of the Interscience
site have access to the full text of these issues and to preprints.
Executive Director
American Society for Information Science and Technology
1320 Fenwick Lane, Suite 510
Silver Spring, MD 20910
FAX: (301) 495-0810
PHONE: (301) 495-0900
http://www.asis.org