[Asis-l] JASIST TOC: Vol 55, No 3
Richard Hill
rhill at asis.org
Thu Dec 11 11:34:59 EST 2003
Journal of the American Society for Information Science and Technology
Volume 55, Number 3. February 1, 2004
[Note: at the end of this message are URLs for viewing the contents of JASIST
from past issues. Below, the contents of Bert Boyce's In This Issue column have
been merged into the Table of Contents.]
-------------
CONTENTS
EDITORIAL
In This Issue
Bert R. Boyce
187
RESEARCH
Arabic Morphological Analysis Techniques: A Comprehensive Survey
Imad A. Al-Sughaiyer and Ibrahim A. Al-Kharashi
Published online 20 November 2003
189
Al-Sughaiyer and Al-Kharashi provide definitions of standard
linguistic terms as they are used in Arabic analysis and identify efficiency,
compactness, bi-directionality, success rate, and retrieval performance as
measures of the effectiveness of morphological analysis algorithms. After a
review of the Arabic morphological analysis literature, they suggest the
approaches may be classified as table lookup (large construction demands and
space requirements), linguistic (requiring a large number of lists and the
removal of affixes by trial and error), combinatorial (large space and time
requirements), or rule based (the authors' choice), and they present a summary
of the work in each area. The majority of work is linguistic in nature, but
little comparison among existing approaches has been carried out, and
evaluation of the proposed algorithms is weak. Use of a word's root (the
single basic morpheme) in an Arabic index leads to invalid conflation.
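The affix-stripping step on which the linguistic approaches rely can be illustrated with a minimal sketch; the transliterated prefix and suffix lists and the toy root lexicon below are invented for the example and are not drawn from any of the surveyed algorithms.

```python
# Toy affix-stripping analyzer: remove candidate prefixes and suffixes by trial
# and error until a known root is found. All lists here are illustrative only.

PREFIXES = ["al", "wa", "bi", "li"]      # hypothetical transliterated prefixes
SUFFIXES = ["at", "an", "in", "un"]      # hypothetical transliterated suffixes
KNOWN_ROOTS = {"ktb", "drs", "qra"}      # toy root lexicon

def find_root(word):
    """Return a known root reachable by stripping one prefix and/or one suffix."""
    candidates = {word}
    for prefix in PREFIXES:
        if word.startswith(prefix):
            candidates.add(word[len(prefix):])
    for candidate in list(candidates):
        for suffix in SUFFIXES:
            if candidate.endswith(suffix):
                candidates.add(candidate[: -len(suffix)])
    for candidate in candidates:
        if candidate in KNOWN_ROOTS:
            return candidate
    return None

print(find_root("alktbat"))  # -> "ktb" under these toy lists
```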
Predicting Library of Congress Classifications From Library of
Congress Subject Headings
Eibe Frank and Gordon W. Paynter
Published online 28 October 2003
214
Frank and Paynter attempt to assign LC Classification number
ranges to INFOMINE documents based on their assigned LCSH headings in order
to provide a better browsing capability. Since they claim that
retrospective assignment of a class number is logistically impossible for
those librarians that already assign terms from LCSH to INFOMINE records,
they have devised a machine-learning technique to create the classification
number where, rather than creating virtual LCSH documents to represent each
LC class and using similarity measures to assign documents, they use a
support vector machine classifier to determine which of the top 21 nodes is
most likely and classifiers at each successive level until a leaf is
reached or the classifier chooses itself. They utilize LCSH terms without
subdivisions, and also make use of intervals from the LCC outline available
on the Pharos Web site, both processed and extracted from existing MARC
records to create a training set. The training set of 868,836 records was
drawn from the UC Riverside library catalog with 50,000 items reserved for
testing and the remainder used for training at different levels. Accuracy
increases with training set size but returns diminish. Accuracy increases
from 32% to 55% as the training set size increases from 10,000 to 800,000.
Less than 7% of errors are due to the classifier terminating too early or
too late. With the large test set 80% of the first array classification
decisions are correct and 16% at the seventh level. The learning algorithm
scales at the order of n instances to the 1.7, and test processing proceeds
at a rate of 21 instances per second.
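The first-level decision in such a scheme can be sketched as follows; this is not the authors' code, and the toy records, headings, and class labels are invented for illustration. The full method would repeat the classification at each level of the LCC hierarchy until a leaf is reached.

```python
# Minimal sketch: predict a top-level LCC class from a record's LCSH headings
# with a linear SVM over a bag-of-words representation of the heading text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy records: LCSH headings joined into one string, plus a top-level LCC class.
records = [
    ("Information storage and retrieval systems", "Z"),
    ("Machine learning. Support vector machines", "Q"),
    ("Library science. Cataloging", "Z"),
    ("Computer algorithms. Data structures (Computer science)", "Q"),
]
texts, labels = zip(*records)

model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(texts, labels)

# Predict a class for an unseen record's headings.
print(model.predict(["Subject headings, Library of Congress"]))
```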
A Nonlinear Model of Information-Seeking Behavior
Allen Foster
Published online 11 November 2003
228
Foster disagrees with the conventional view of information
seeking as a linear process of identifiable stages and iterations,
particularly as applied to interdisciplinary information-seeking behavior,
and he proposes a nonlinear model based on identifying the processes,
contexts, and behaviors of such interdisciplinary activity and their
relationships. In-depth structured interviews conducted in the workplace
environment were used to collect data on searching examples provided by the
subjects. Subjects were purposively selected from multiple faculties at the
University of Sheffield for their interdisciplinary research and then used as
the kernel of a snowball sample, resulting in 45 participants from diverse
faculties. Transcript coding took place in multiple iterations, and the final
results were confirmed by participant review. Activities viewed in conjunction
with time lines did not support a linear stages model. The new model groups
activities into three core categories: opening (moving from orientation to
actual searching), orientation (identifying existing research and a direction
for the search), and consolidation (refining and knowing when to stop). These
operate within the boundaries of an external context, which incorporates
social and organizational influences, time, project and access constraints,
and navigational issues. Within the external context lies an internal context,
unique to each individual, which incorporates the individual's experience,
prior knowledge, and feelings. Four cognitive approaches were identified:
flexible and adaptable to other cultures, an open approach with no prior
framework, a nomadic approach that actively seeks diverse ways of access, and
a holistic approach that attempts to bring diverse areas together. Interaction
among the core activities was cumulative, reiterative, holistic, and
context-bound.
Indicators of Accuracy for Answers to Ready Reference Questions
on the Internet
Martin Frické and Don Fallis
Published online 19 November 2003
238
Frické and Fallis explore the validity of proposed indicators of
the accuracy of ready reference information found on Web sites. Using 49 of
the 60 questions previously used by Connell and Tipple, AltaVista searches
were run to identify potential answer sites; the first five sites that
actually answered each question were chosen, evaluated for answer accuracy,
and checked for the presence of indicators of accuracy. This was followed by
a Google search to yield these and at most five additional sites. Each site
was manually scored as completely accurate, partially accurate, partially
inaccurate, or completely inaccurate, and checked for owner entity type,
recency of update, presence of advertising, a copyright claim, appeal to
authority, and the presence of any awards for quality, as well as its ranked
position in the search engine results, its Google PageRank (0-10), and the
number of in-links found with the AltaVista link command. Contingency tables
were formed and chi-square tests used to detect possible associations;
likelihood ratios for the presence and absence of individual indicators and
indicator pairs were also computed. Of 300 sites that answered the questions,
214 were judged completely accurate and only 25 inaccurate. High display
position, high Google PageRank, currency, a copyright claim, and in-link
count all yield a chi-square probability of less than
.05, suggesting a relationship to accuracy.
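The style of test described can be illustrated with a minimal sketch; the counts below are invented, not the study's data, and the 2x2 layout (one indicator, accurate vs. inaccurate answers) is only the simplest case of the contingency tables used.

```python
# Toy 2x2 contingency table: indicator present/absent vs. answer accurate/inaccurate,
# with a hand-computed chi-square statistic and a likelihood ratio for the indicator.

#                  accurate  inaccurate
table = [[180, 10],   # indicator present (e.g., a copyright claim)
         [34, 15]]    # indicator absent

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
total = sum(row_totals)

# Chi-square: sum of (observed - expected)^2 / expected over all four cells.
chi2 = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / total) ** 2
    / (row_totals[i] * col_totals[j] / total)
    for i in range(2) for j in range(2)
)
print(f"chi-square = {chi2:.2f}  (> 3.84 means p < .05 with 1 degree of freedom)")

# Likelihood ratio: P(indicator present | accurate) / P(indicator present | inaccurate).
lr = (table[0][0] / col_totals[0]) / (table[0][1] / col_totals[1])
print(f"likelihood ratio = {lr:.2f}")
```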
The Effects of Domain Knowledge on Search Tactic Formulation
Barbara M. Wildemuth
Published online 13 November 2003
246
Wildemuth is interested in whether a growing understanding of the
knowledge domain covered by a database affects the sequence of search moves
(tactics) used by medical students searching that database. Two random
samples were drawn from entering medical school classes, excluding students
with advanced science degrees and those whose undergraduate degree was in
microbiology, the topic of the database. Each sample was asked to address six
specific clinical problems involving several specific questions: first, prior
to any instruction in microbiology, resulting in a 12.6% success rate; second,
directly after the microbiology course, resulting in a 48.1% success rate;
and finally, six months after the course, achieving a 27.3% success rate. In
each instance subjects were asked to respond from their own knowledge and
then to search the database for a question to which they had given an
incorrect response. The nearly 1,300 searches were recorded in transaction
logs and hand coded according to an adaptation of the Shute & Smith scheme,
incorporating beginning moves, reduction moves, expansion moves, and term
replacement. A transition matrix showing the frequency of transitions from
each coded move to every other coded move was created and used to build a
graphic representation of transitions accounting for at least 1% of all
occurrences. Maximal repeating patterns of moves were also extracted and the
most frequently occurring retained. The most common pattern was the entry of
a new concept followed by the addition of one or more concepts prior to
display. The number of moves decreased with experience, and database use
increased performance at all three levels of experience.
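Building such a transition count from coded search logs can be sketched in a few lines; the move codes and the sample sequences below are invented for illustration and are not Wildemuth's data or the Shute & Smith codes themselves.

```python
# Count move-to-move transitions across coded searches and report each
# transition's share of all occurrences (keeping those at or above 1%).
from collections import Counter
from itertools import pairwise  # Python 3.10+

coded_searches = [
    ["NEW_CONCEPT", "ADD_CONCEPT", "DISPLAY"],
    ["NEW_CONCEPT", "ADD_CONCEPT", "ADD_CONCEPT", "DISPLAY"],
    ["NEW_CONCEPT", "REPLACE_TERM", "DISPLAY"],
]

transitions = Counter(t for search in coded_searches for t in pairwise(search))
total = sum(transitions.values())

for (src, dst), count in transitions.most_common():
    share = count / total
    if share >= 0.01:
        print(f"{src:>13} -> {dst:<13} {share:.0%}")
```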
A Graph Model for E-Commerce Recommender Systems
Zan Huang, Wingyan Chung, and Hsinchun Chen
Published online 14 November 2003
259
Huang, Chung, and Chen are interested in maximizing the value of
the product and usage information available from online transactions, both
for those who supply material and for those who interact with it. Such
information needs to be represented in a flexible manner, since different
recommendation approaches are typically used to build recommender systems
that find associations between users and items and use the discovered
associations to recommend additional items to previous users. A two-layer
graph model is implemented, with users and items as nodes in separate layers
and transactions and similarities as links; link weights record the relative
similarity between nodes. If links in the item layer are activated, the
approach is content-based; if links in the user layer and between layers are
activated, the approach is collaborative; activating all links gives a hybrid
approach. A direct retrieval approach retrieves items similar to those used
previously by a user or by similar users. A collaborative recommendation
forms a list of similar users, by either past common item selections or
common demographics, and recommends that list's past selections. An
association-mining method was used with the three approaches, each generating
a different set of association rules, with transitive rules in a Hopfield net
available as an option to overcome sparse user ratings. Testing on a Chinese
online bookstore data set provided records for 9,695 books, 2,000 customers,
and 18,771 transactions. Books and customers were described as feature
vectors and similarity measures computed; customers' purchase lists were
halved so that the second half could be predicted from the first, allowing
recall- and precision-type measures. Pairwise t-tests were then applied. The
hybrid approach was the best performer, but the spreading-activation approach
did not significantly outperform the association-mining approach or direct
retrieval.
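The hold-out evaluation idea can be sketched with a toy co-purchase recommender; the customers, books, and the simple co-occurrence scoring below are invented for illustration and are not the paper's graph model or its bookstore data set.

```python
# Hold out the second half of each customer's purchase list, recommend items
# that co-occur with the first half in other customers' transactions, and
# score recall of the held-out items.
from collections import Counter

purchases = {                      # customer -> ordered purchase list (invented)
    "c1": ["b1", "b2", "b3", "b4"],
    "c2": ["b1", "b3", "b5", "b6"],
    "c3": ["b2", "b4", "b6", "b7"],
}

def recommend(observed, other_purchases, k=3):
    """Rank unseen items by how often they co-occur with the observed items."""
    scores = Counter()
    for items in other_purchases.values():
        if set(items) & set(observed):
            for item in items:
                if item not in observed:
                    scores[item] += 1
    return [item for item, _ in scores.most_common(k)]

recalls = []
for customer, items in purchases.items():
    half = len(items) // 2
    observed, held_out = items[:half], set(items[half:])
    others = {c: i for c, i in purchases.items() if c != customer}
    recommended = recommend(observed, others)
    recalls.append(len(held_out & set(recommended)) / len(held_out))

print(f"mean recall = {sum(recalls) / len(recalls):.2f}")
```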
BOOK REVIEWS
Mining the Web: Discovering Knowledge From Hypertext Data, by
Soumen Chakrabarti
Chaomei Chen
Published online 20 November 2003
275
The Library's Legal Answer Book, by Mary Minow and Tomas A.
Lipinski
Kenneth Einar Himma
Published online 18 November 2003
276
The Internet in Everyday Life, edited by Barry Wellman and
Caroline Haythornthwaite
Pramod K. Nayar
Published online 14 November 2003
278
------------------------------------------------------
The ASIS web site <http://www.asis.org/Publications/JASIS/tocs.html>
contains the Table of Contents and brief abstracts as above from January
1993 (Volume 44) to date.
The John Wiley Interscience site <http://www.interscience.wiley.com>
includes issues from 1986 (Volume 37) to date. Guests have access only to
tables of contents and abstracts. Registered users of the Interscience
site have access to the full text of these issues and to preprints.
Richard Hill
Executive Director
American Society for Information Science and Technology
1320 Fenwick Lane, Suite 510
Silver Spring, MD 20910
FAX: (301) 495-0810
PHONE: (301) 495-0900
http://www.asis.org