[Asis-l] JASIST TOC: Vol 55, No 3
Richard Hill
rhill at asis.org
Thu Dec 11 11:34:59 EST 2003
Journal of the American Society for Information Science and Technology
Volume 55, Number 3. February 1, 2004
[Note: at the end of this message are URLs for viewing the contents of JASIST
from past issues. Below, the contents of Bert Boyce's In This Issue column have
been merged into the Table of Contents.]
-------------
CONTENTS
EDITORIAL
In This Issue
Bert R. Boyce
187
RESEARCH
Arabic Morphological Analysis Techniques: A Comprehensive Survey
Imad A. Al-Sughaiyer and Ibrahim A. Al-Kharashi
Published online 20 November 2003
189
Al-Sughaiyer and Al-Kharashi provide definitions of standard
linguistic terms as they are used in Arabic analysis and identify efficiency,
compactness, bi-directionality, success rate, and retrieval performance as
measures of the effectiveness of morphological analysis algorithms. After a
review of the Arabic morphological analysis literature, they suggest the
approaches may be classified as table lookup (large construction demands and
space requirements), linguistic (requiring a large number of lists and the
removal of affixes by trial and error), combinatorial (large space and time
requirements), or rule based (the authors' choice), and they present a summary
of the work in each area. The majority of work is linguistic in nature, but
little comparison among existing approaches has been carried out, and
evaluation of the proposed algorithms is weak. Use of a word's root (the
single basic morpheme) in an Arabic index leads to invalid conflation.
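The affix-stripping step on which the linguistic approaches rely can be illustrated with a minimal sketch; the transliterated prefix and suffix lists and the toy root lexicon below are invented for the example and are not drawn from any of the surveyed algorithms.

```python
# Toy affix-stripping analyzer: remove candidate prefixes and suffixes by trial
# and error until a known root is found. All lists here are illustrative only.

PREFIXES = ["al", "wa", "bi", "li"]      # hypothetical transliterated prefixes
SUFFIXES = ["at", "an", "in", "un"]      # hypothetical transliterated suffixes
KNOWN_ROOTS = {"ktb", "drs", "qra"}      # toy root lexicon

def find_root(word):
    """Return a known root reachable by stripping one prefix and/or one suffix."""
    candidates = {word}
    for prefix in PREFIXES:
        if word.startswith(prefix):
            candidates.add(word[len(prefix):])
    for candidate in list(candidates):
        for suffix in SUFFIXES:
            if candidate.endswith(suffix):
                candidates.add(candidate[: -len(suffix)])
    for candidate in candidates:
        if candidate in KNOWN_ROOTS:
            return candidate
    return None

print(find_root("alktbat"))  # -> "ktb" under these toy lists
```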
Predicting Library of Congress Classifications From Library of
Congress Subject Headings
Eibe Frank and Gordon W. Paynter
Published online 28 October 2003
214
Frank and Paynter attempt to assign LC Classification number
ranges to INFOMINE documents based on their assigned LCSH headings in order
to provide a better browsing capability. Since they claim that
retrospective assignment of a class number is logistically impossible for
those librarians that already assign terms from LCSH to INFOMINE records,
they have devised a machine-learning technique to create the classification
number where, rather than creating virtual LCSH documents to represent each
LC class and using similarity measures to assign documents, they use a
support vector machine classifier to determine which of the top 21 nodes is
most likely and classifiers at each successive level until a leaf is
reached or the classifier chooses itself. They utilize LCSH terms without
subdivisions, and also make use of intervals from the LCC outline available
on the Pharos Web site, both processed and extracted from existing MARC
records to create a training set. The training set of 868,836 records was
drawn from the UC Riverside library catalog with 50,000 items reserved for
testing and the remainder used for training at different levels. Accuracy
increases with training set size but returns diminish. Accuracy increases
from 32% to 55% as the training set size increases from 10,000 to 800,000.
Less than 7% of errors are due to the classifier terminating too early or
too late. With the large test set 80% of the first array classification
decisions are correct and 16% at the seventh level. The learning algorithm
scales at the order of n instances to the 1.7, and test processing proceeds
at a rate of 21 instances per second.
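The first-level decision in such a scheme can be sketched as follows; this is not the authors' code, and the toy records, headings, and class labels are invented for illustration. The full method would repeat the classification at each level of the LCC hierarchy until a leaf is reached.

```python
# Minimal sketch: predict a top-level LCC class from a record's LCSH headings
# with a linear SVM over a bag-of-words representation of the heading text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy records: LCSH headings joined into one string, plus a top-level LCC class.
records = [
    ("Information storage and retrieval systems", "Z"),
    ("Machine learning. Support vector machines", "Q"),
    ("Library science. Cataloging", "Z"),
    ("Computer algorithms. Data structures (Computer science)", "Q"),
]
texts, labels = zip(*records)

model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(texts, labels)

# Predict a class for an unseen record's headings.
print(model.predict(["Subject headings, Library of Congress"]))
```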
A Nonlinear Model of Information-Seeking Behavior
Allen Foster
Published online 11 November 2003
228
Foster disagrees with the conventional view of information
seeking as a linear process of identifiable stages and iterations,
particularly as applied to interdisciplinary information-seeking behavior,
and he proposes a nonlinear model based on identifying the processes,
contexts, and behaviors of such interdisciplinary activity and their
relationships. In-depth structured interviews conducted in the workplace
environment were used to collect data on searching examples provided by the
subjects. Subjects were purposively selected from multiple faculties at the
University of Sheffield for their interdisciplinary research and then used as
the kernel of a snowball sample, resulting in 45 participants from diverse
faculties. Transcript coding took place in multiple iterations, and the final
results were confirmed by participant review. Activities viewed in conjunction
with time lines did not support a linear stages model. The new model groups
activities into three core categories: opening (moving from orientation to
actual searching), orientation (identifying existing research and a direction
for the search), and consolidation (refining and knowing when to stop). These
operate within the boundaries of an external context, which incorporates
social and organizational influences, time, project and access constraints,
and navigational issues. Within the external context lies an internal context,
unique to each individual, which incorporates the individual's experience,
prior knowledge, and feelings. Four cognitive approaches were identified:
flexible and adaptable to other cultures, an open approach with no prior
framework, a nomadic approach that actively seeks diverse ways of access, and
a holistic approach that attempts to bring diverse areas together. Interaction
among the core activities was cumulative, reiterative, holistic, and
context-bound.
Indicators of Accuracy for Answers to Ready Reference Questions
on the Internet
Martin Frické and Don Fallis
Published online 19 November 2003
238
Frické and Fallis explore the validity of proposed indicators of
the accuracy of ready reference information found on Web sites. Using 49 of
the 60 questions previously used by Connell and Tipple, AltaVista searches
were run to identify potential answer sites; the first five sites that
actually answered each question were chosen, evaluated for answer accuracy,
and checked for the presence of indicators of accuracy. This was followed by
a Google search to yield these and at most five additional sites. Each site
was manually scored as completely accurate, partially accurate, partially
inaccurate, or completely inaccurate, and checked for owner entity type,
recency of update, presence of advertising, a copyright claim, appeal to
authority, and the presence of any awards for quality, as well as its ranked
position in the search engine results, its Google PageRank (0-10), and the
number of in-links found with the AltaVista link command. Contingency tables
were formed and chi-square tests used to detect possible associations;
likelihood ratios for the presence and absence of individual indicators and
indicator pairs were also computed. Of 300 sites that answered the questions,
214 were judged completely accurate and only 25 inaccurate. High display
position, high Google PageRank, currency, a copyright claim, and in-link
count all yield a chi-square probability of less than
.05, suggesting a relationship to accuracy.
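The style of test described can be illustrated with a minimal sketch; the counts below are invented, not the study's data, and the 2x2 layout (one indicator, accurate vs. inaccurate answers) is only the simplest case of the contingency tables used.

```python
# Toy 2x2 contingency table: indicator present/absent vs. answer accurate/inaccurate,
# with a hand-computed chi-square statistic and a likelihood ratio for the indicator.

#                  accurate  inaccurate
table = [[180, 10],   # indicator present (e.g., a copyright claim)
         [34, 15]]    # indicator absent

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
total = sum(row_totals)

# Chi-square: sum of (observed - expected)^2 / expected over all four cells.
chi2 = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / total) ** 2
    / (row_totals[i] * col_totals[j] / total)
    for i in range(2) for j in range(2)
)
print(f"chi-square = {chi2:.2f}  (> 3.84 means p < .05 with 1 degree of freedom)")

# Likelihood ratio: P(indicator present | accurate) / P(indicator present | inaccurate).
lr = (table[0][0] / col_totals[0]) / (table[0][1] / col_totals[1])
print(f"likelihood ratio = {lr:.2f}")
```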
The Effects of Domain Knowledge on Search Tactic Formulation
Barbara M. Wildemuth
Published online 13 November 2003
246
Wildemuth is interested in whether a growing understanding of the
knowledge domain covered by a database affects the sequence of search moves
(tactics) used by medical students searching that database. Two random
samples were drawn from entering medical school classes, excluding students
with advanced science degrees and those whose undergraduate degree was in
microbiology, the topic of the database. Each sample was asked to address six
specific clinical problems involving several specific questions: first, prior
to any instruction in microbiology, resulting in a 12.6% success rate; second,
directly after the microbiology course, resulting in a 48.1% success rate;
and finally, six months after the course, achieving a 27.3% success rate. In
each instance subjects were asked to respond from their own knowledge and
then to search the database for a question to which they had given an
incorrect response. The nearly 1,300 searches were recorded in transaction
logs and hand coded according to an adaptation of the Shute & Smith scheme,
incorporating beginning moves, reduction moves, expansion moves, and term
replacement. A transition matrix showing the frequency of transitions from
each coded move to every other coded move was created and used to build a
graphic representation of transitions accounting for at least 1% of all
occurrences. Maximal repeating patterns of moves were also extracted and the
most frequently occurring retained. The most common pattern was the entry of
a new concept followed by the addition of one or more concepts prior to
display. The number of moves decreased with experience, and database use
increased performance at all three levels of experience.
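Building such a transition count from coded search logs can be sketched in a few lines; the move codes and the sample sequences below are invented for illustration and are not Wildemuth's data or the Shute & Smith codes themselves.

```python
# Count move-to-move transitions across coded searches and report each
# transition's share of all occurrences (keeping those at or above 1%).
from collections import Counter
from itertools import pairwise  # Python 3.10+

coded_searches = [
    ["NEW_CONCEPT", "ADD_CONCEPT", "DISPLAY"],
    ["NEW_CONCEPT", "ADD_CONCEPT", "ADD_CONCEPT", "DISPLAY"],
    ["NEW_CONCEPT", "REPLACE_TERM", "DISPLAY"],
]

transitions = Counter(t for search in coded_searches for t in pairwise(search))
total = sum(transitions.values())

for (src, dst), count in transitions.most_common():
    share = count / total
    if share >= 0.01:
        print(f"{src:>13} -> {dst:<13} {share:.0%}")
```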
A Graph Model for E-Commerce Recommender Systems
Zan Huang, Wingyan Chung, and Hsinchun Chen
Published online 14 November 2003
259
Huang, Chung, and Chen are interested in maximizing the value of
the product and usage information available from online transactions, both
for those who supply material and for those who interact with it. Such
information needs to be represented in a flexible manner, since different
recommendation approaches are typically used to build recommender systems
that find associations between users and items and use the discovered
associations to recommend additional items to previous users. A two-layer
graph model is implemented, with users and items as nodes in separate layers
and transactions and similarities as links; link weights record the relative
similarity between nodes. If links in the item layer are activated, the
approach is content-based; if links in the user layer and between layers are
activated, the approach is collaborative; activating all links gives a hybrid
approach. A direct retrieval approach retrieves items similar to those used
previously by a user or by similar users. A collaborative recommendation
forms a list of similar users, by either past common item selections or
common demographics, and recommends that list's past selections. An
association-mining method was used with the three approaches, each generating
a different set of association rules, with transitive rules in a Hopfield net
available as an option to overcome sparse user ratings. Testing on a Chinese
online bookstore data set provided records for 9,695 books, 2,000 customers,
and 18,771 transactions. Books and customers were described as feature
vectors and similarity measures computed; customers' purchase lists were
halved so that the second half could be predicted from the first, allowing
recall- and precision-type measures. Pairwise t-tests were then applied. The
hybrid approach was the best performer, but the spreading-activation approach
did not significantly outperform the association-mining approach or direct
retrieval.
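The hold-out evaluation idea can be sketched with a toy co-purchase recommender; the customers, books, and the simple co-occurrence scoring below are invented for illustration and are not the paper's graph model or its bookstore data set.

```python
# Hold out the second half of each customer's purchase list, recommend items
# that co-occur with the first half in other customers' transactions, and
# score recall of the held-out items.
from collections import Counter

purchases = {                      # customer -> ordered purchase list (invented)
    "c1": ["b1", "b2", "b3", "b4"],
    "c2": ["b1", "b3", "b5", "b6"],
    "c3": ["b2", "b4", "b6", "b7"],
}

def recommend(observed, other_purchases, k=3):
    """Rank unseen items by how often they co-occur with the observed items."""
    scores = Counter()
    for items in other_purchases.values():
        if set(items) & set(observed):
            for item in items:
                if item not in observed:
                    scores[item] += 1
    return [item for item, _ in scores.most_common(k)]

recalls = []
for customer, items in purchases.items():
    half = len(items) // 2
    observed, held_out = items[:half], set(items[half:])
    others = {c: i for c, i in purchases.items() if c != customer}
    recommended = recommend(observed, others)
    recalls.append(len(held_out & set(recommended)) / len(held_out))

print(f"mean recall = {sum(recalls) / len(recalls):.2f}")
```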
BOOK REVIEWS
Mining the Web: Discovering Knowledge From Hypertext Data, by
Soumen Chakrabarti
Chaomei Chen
Published online 20 November 2003
275
The Library's Legal Answer Book, by Mary Minow and Tomas A.
Lipinski
Kenneth Einar Himma
Published online 18 November 2003
276
The Internet in Everyday Life, edited by Barry Wellman and
Caroline Haythornthwaite
Pramod K. Nayar
Published online 14 November 2003
278
------------------------------------------------------
The ASIS web site <http://www.asis.org/Publications/JASIS/tocs.html>
contains the Table of Contents and brief abstracts as above from January
1993 (Volume 44) to date.
The John Wiley Interscience site <http://www.interscience.wiley.com>
includes issues from 1986 (Volume 37) to date. Guests have access only to
tables of contents and abstracts. Registered users of the Interscience
site have access to the full text of these issues and to preprints.
Richard Hill
Executive Director
American Society for Information Science and Technology
1320 Fenwick Lane, Suite 510
Silver Spring, MD 20910
FAX: (301) 495-0810
PHONE: (301) 495-0900
http://www.asis.org