[Asis-l] JASIST TOC Vol 54, #8

Richard Hill rhill at asis.org
Thu Jun 5 09:43:09 EDT 2003


JASIST, Journal of the American Society for Information Science and Technology
Volume 54, Issue 8, 2003

[Note: at the end of this message are URLs for viewing contents of JASIST 
from past issues.  Below, the contents of Bert Boyce’s “In this Issue” have 
been cut into the Table of Contents.]

CONTENTS

EDITORIAL

IN THIS ISSUE
Bert L. Boyce
704

RESEARCH

Graph Structure in Three National Academic Webs: Power Laws with Anomalies
Mike Thelwall and David Wilkinson
Published online 16 April 2003
706
         Thelwall and Wilkinson use crawls of university web sites in the 
UK, Australia, and New Zealand to collect all links targeted at 
same-country university web sites, which they then use to create a graph 
structure for study. Using Broder's study as a model, they identify a 
strongly connected component, SCC, where one could start anywhere in the 
set and reach every other page, and an Out component whose pages can be 
reached from all strongly connected pages but provide no link back to that 
set. The other components in the Broder model are not accessible except 
with access to a major search engine database. Inlink and outlink counts 
for all three university systems, in both the Out and SCC components, 
display a linear form when graphed logarithmically, which would indicate 
that power laws, and a success-breeds-success phenomenon, are generally in 
effect. However, automatically generated pages, non-HTML web pages, and 
large resource-driven sites all were associated with anomalies in this 
observation.
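
The SCC and Out components described above can be illustrated with a small 
sketch. It is not the authors' procedure: the link list, the page names, and 
the function names are hypothetical, and the seed page is assumed to lie 
inside the strongly connected component of interest.

```python
from collections import defaultdict

def reachable(start, adj):
    """Return the set of nodes reachable from start by following adj."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def scc_and_out(links, seed):
    """Split pages into the SCC containing seed and its Out component."""
    fwd = defaultdict(set)
    rev = defaultdict(set)
    for src, dst in links:
        fwd[src].add(dst)
        rev[dst].add(src)
    forward = reachable(seed, fwd)    # pages the seed can reach
    backward = reachable(seed, rev)   # pages that can reach the seed
    scc = forward & backward          # mutually reachable pages
    out = forward - scc               # reachable, but with no path back
    return scc, out

# Hypothetical toy link list (not data from the paper).
links = [("a", "b"), ("b", "c"), ("c", "a"),
         ("c", "d"), ("d", "e"), ("f", "a")]
scc, out = scc_and_out(links, "a")
print(sorted(scc), sorted(out))
```

In this toy graph page "f" links into the SCC but cannot be reached from 
it, so it falls in neither set.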

Efficient Single-Pass Index Construction for Text Databases
Steffen Heinz and Justin Zobel
Published online 16 April 2003
713
         Zobel and Heinz review file inversion processes for the creation 
of text indices and suggest an efficient single-pass approach. Complete 
in-memory indexing remains impractical for very large files. Current rapid 
algorithms require that the entire vocabulary of the collection be kept in 
memory. The proposed approach creates inverted files in memory for 
sequences of documents until memory resources are exhausted, then transfers 
the lexicon and inverted file in lexicographical order to disk for 
subsequent merging. Each term is assigned a dynamic in-memory bit vector 
that accumulates postings in a compressed d-gap format. The lexicon is 
maintained in a burst trie structure whose leaves are containers of 
strings with common prefixes. Performance on five-gigabyte to 
twenty-gigabyte files is fifteen to twenty percent faster than a sort-based 
approach.
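
To make the d-gap idea concrete, here is a minimal sketch that stores the 
differences between successive document numbers instead of the numbers 
themselves. The variable-byte code used to compress the gaps is an 
assumption chosen for brevity; the summary above does not say which code 
the authors use, and the function names and posting list are hypothetical.

```python
def to_dgaps(doc_ids):
    """Turn ascending document numbers into gaps between neighbours."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def from_dgaps(gaps):
    """Rebuild document numbers by accumulating the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

def vbyte_encode(numbers):
    """Variable-byte code: 7 data bits per byte, high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)   # low 7 bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)       # final byte of this number
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, value, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(value | ((byte & 0x7F) << shift))
            value, shift = 0, 0
        else:
            value |= byte << shift
            shift += 7
    return numbers

# Hypothetical posting list for one term.
postings = [3, 7, 8, 200, 1000]
gaps = to_dgaps(postings)
encoded = vbyte_encode(gaps)
print(gaps, len(encoded))
```

Small gaps fit in a single byte each, which is why frequent terms with 
dense posting lists compress well under this kind of scheme.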


Automatic Construction of English/Chinese Parallel Corpora
Christopher C. Yang and Kar Wing Li
Published online 16 April 2003
730
         Yang and Li describe the automatic matching of English and Chinese 
document titles, by character and word matching based upon the study of web 
pages within a site where some pages exist separately in each language. 
Word and character alignment is followed by redundancy resolution, and then 
title alignment takes place. English words are translated into Chinese 
character-string words by dictionary lookup, and the various possibilities 
are matched with the Chinese titles using the longest common subsequence of 
characters. Using Hong Kong Special Administrative Region government press 
releases and releases from the Hong Kong and Shanghai Banking Corporation, 
they find 31,567 in the Chinese language and 30,810 in English, but only 
23,701 released in parallel. There are no links between the versions. With 
Recall as the number of system correct matches over the actual matches in 
the file, and Precision the number of correct system matches over the 
number of system matches, a test yields Precision in the range of .998 to 
1.00 and recall from .806 to .948. Thus links to parallel documents in the 
other language could quite likely be automatically generated.
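
The character matching and the two evaluation measures can be sketched as 
follows. This is an illustrative reading of the summary, not the authors' 
system: it assumes the matching score is the length of the longest common 
subsequence, and the title strings and match sets are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def precision_recall(system_matches, actual_matches):
    """Precision and recall as defined in the summary above."""
    correct = len(system_matches & actual_matches)
    precision = correct / len(system_matches) if system_matches else 0.0
    recall = correct / len(actual_matches) if actual_matches else 0.0
    return precision, recall

# Hypothetical candidate translation and Chinese title.
score = lcs_length("香港政府", "香港特區政府")

# Hypothetical sets of (English id, Chinese id) title pairs.
system = {(1, 1), (2, 2), (3, 9)}
actual = {(1, 1), (2, 2), (3, 3), (4, 4)}
precision, recall = precision_recall(system, actual)
print(score, precision, recall)
```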

Mining Longitudinal Web Queries: Trends and Patterns
Peiling Wang, Michael W. Berry, and Yiheng Yang
Published online 16 April 2003
743
         Wang, Berry and Yang log hit counts and date stamped queries to 
the University of Tennessee website for a four year period as entered 
through the SWISH search engine as Boolean statements where spaces were 
considered to be AND operators. Queries were parsed into words and word 
pairs of adjacent words or words separated by one other word (94% of 
queries contained three words or fewer). URLs were not parsed but treated 
as unusual queries. Null outputs exceeded 30%. Queries averaged 2 words or 13 
characters. Number of queries and the vocabulary used grow over time but 
the vocabulary is relatively small and includes a large number (26%) of 
misspelled words and personal names. Log plots of frequencies and ranks, 
both for all words and for words with unique frequencies, overlap in the 
upper portion, which is a quadratic polynomial, and diverge in the lower 
portion, where the all-word line becomes linear. Topics and search behavior 
vary little over the four-year period. Websites could be improved by adding 
content identified from queries.
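
The parsing step described above, words plus pairs of words that are 
adjacent or separated by one other word, might look like the sketch below. 
Lowercasing and whitespace splitting are assumptions, and the sample query 
is hypothetical.

```python
def words_and_pairs(query):
    """Split a query into words and ordered pairs at distance 1 or 2."""
    words = query.lower().split()
    pairs = []
    for i, word in enumerate(words):
        for offset in (1, 2):   # adjacent, or one word in between
            if i + offset < len(words):
                pairs.append((word, words[i + offset]))
    return words, pairs

# Hypothetical query.
words, pairs = words_and_pairs("Football ticket office")
print(words, pairs)
```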

Students' Conceptual Structure, Search Process, and Outcome While Preparing 
a Research Proposal: A Longitudinal Case Study
Mikko Pennanen and Pertti Vakkari
Published online 16 April 2003
759
         Pennanen and Vakkari use 22 undergraduate psychology students 
doing Boolean searches on PsycINFO, a system with which they were 
unfamiliar, to investigate the relationship between their conceptual 
structure of their topic and their search process, and whether these 
relations vary depending upon their stage in the Kuhlthau model. Students 
searched both at the beginning and the end of their construction of a 
proposal, and each search was preceded and followed by an interview. The 
thought process during search was vocalized and recorded, and transaction 
logs were also retained. They recorded the number of concepts used by a 
student, the proportion of sub-concepts included, and the proportion of 
concepts expressed in query terms. Retrieved useful references were 
recorded. The two main tactics used by the subjects were the adding of a 
conjoined term, and the replacement of an existing term with another. The 
students were able to translate into query terms only slightly more than 
half the concepts they identified. The subjects advanced significantly in 
terms of the Kuhlthau model between their search sessions. Their conceptual 
structure was richer, search terms used increased, but references accepted 
as useful decreased. The proportion of concepts articulated in the query 
correlated significantly with the number of useful references found.

Information Science Abstracts: Tracking the Literature of Information 
Science. Part 2: A New Taxonomy For Information Science
Donald T. Hawkins, Signe E. Larson and Bari Q. Caton
Published online 16 April 2003
771
         Using 3000 Information Science Abstracts abstracts, Hawkins, 
Larson, and Caton test the validity of a new ISA classification structure 
for information science leading to the revision and fine-tuning of the 
structure. The structure was produced by collecting terms from available 
vocabularies grouped into 13 main headings. Each abstract was given only 
one classification number representing a main heading and a single 
sub-heading by each of the researchers. A review of the distribution of 
abstracts over sections suggested combining some closely related 
categories, and the presence of unclassifiable abstracts pointed to 
gaps in coverage. Only in 19% of the cases did all three disagree on the 
assignment of a main heading. A second test with 1265 abstracts showed that 
the abstracts were well distributed over what were now 11 main sections. 
Low posted sub-headings were examined but retained as growing areas. The 
taxonomy is included as an appendix.

Improving the Search Environment: Informed Decision Making in the Search 
for Statistical Information
Stephanie W. Haas
Published online 16 April 2003
782
         Studying the Bureau of Labor Statistics' LABSTAT database, Haas 
looks for searching decision points at which assistance for the searcher 
may be of value. A searcher has some measure of both search and domain 
knowledge and will need some knowledge of the way information is provided 
to make effective decisions. Transition points are identified between user 
vocabulary and Bureau of Labor Statistics (BLS) concepts, between BLS 
concepts and BLS data and information products, and between these products 
and the actual query. This suggests the need for help in concept 
definition, ambiguity resolution, and synonym usage which is not uncommon 
in retrieval systems, but also assistance in the choice of products through 
a matrix of specifiable categories with available surveys and series. The 
need to express a query for a chosen survey/series suggests a need for 
variable displays, information on the interaction of variable-value 
choices, and warnings of unusual situations.

BRIEF COMMUNICATION

Using the Mann-Whitney Test on Informetric Data
John C. Huber and Roland Wagner-Döbler
Published online 16 April 2003
798
         Huber and Wagner-Döbler demonstrate a relatively simple procedure 
for implementing, using a spreadsheet, a Mann-Whitney test of the 
difference of two bibliometric samples which will take into account the 
large number of ties normally present in such data. Sources with the same 
count of publications are assigned the same rank where the value is the 
median of the number of such sources in both samples. The lower the p-level 
the higher the probability the samples are from different distributions. It 
is thus possible to determine if a change in productivity is due to factors 
beyond the change in number of sources. However, small samples with small 
differences will appear to be from the same distribution, and larger 
samples are necessary to overcome the effect of multiple ties.
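
A tie-aware Mann-Whitney calculation of the kind summarized above can be 
sketched in a few lines. This is not the authors' spreadsheet: it assumes 
the standard midrank treatment of ties and the tie-corrected normal 
approximation without a continuity correction, and the two samples of 
publication counts are hypothetical.

```python
import math
from collections import Counter

def mann_whitney_ties(x, y):
    """Return (U for x, z, two-sided p) using midranks for tied values."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    counts = Counter(list(x) + list(y))

    # Every source with the same count gets the middle of the ranks
    # that its tie group occupies in the pooled ordering.
    midrank = {}
    start = 1
    for value in sorted(counts):
        t = counts[value]
        midrank[value] = start + (t - 1) / 2
        start += t

    rank_sum = sum(midrank[v] for v in x)
    u = rank_sum - n1 * (n1 + 1) / 2

    # The variance of U shrinks as tie groups grow.
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:           # all values identical: nothing to compare
        return u, 0.0, 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, z, p

# Hypothetical publication counts per source in two samples.
sample_a = [1, 1, 2, 3]
sample_b = [1, 2, 2, 5]
u, z, p = mann_whitney_ties(sample_a, sample_b)
print(u, z, p)
```

With samples this small and this heavily tied the normal approximation is 
rough, which echoes the caution above about small samples.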


BOOK REVIEW

Patents, Citations & Innovations: A Window on the Knowledge Economy, by 
Adam B. Jaffe and Manuel Trajtenberg
Reviewed by Chaomei Chen
Published online 16 April 2003
802


LETTERS TO THE EDITOR
“Empirical Evidence of Self-Organization”: A Rejoinder
Loet Leydesdorff
804

Arguments for Epistemology in Information Science
Birger Hjørland
805
------------------------------------------------------
The ASIS web site <http://www.asis.org/Publications/JASIS/tocs.html> 
contains the Table of Contents and brief abstracts as above from January 
1993 (Volume 44) to date.

The John Wiley Interscience site <http://www.interscience.wiley.com> 
includes issues from 1986 (Volume 37) to date.  Guests have access only to 
tables of contents and abstracts.  Registered users of the interscience 
site have access to the full text of these issues and to preprints.



Executive Director
American Society for Information Science and Technology
1320 Fenwick Lane, Suite 510
Silver Spring, MD  20910
FAX: (301) 495-0810
PHONE: (301) 495-0900

http://www.asis.org 



