[Asis-l] JASIST Volume 55, Number 7

Tue Mar 30 10:53:59 EST 2004

Journal of the American Society for Information Science and Technology
Volume 55, Number 7  May 2004

[Note: at the end of this message are URLs for viewing contents of JASIST
from past issues.  Below, the contents of Bert Boyce's "In this Issue" have
been cut into the Table of Contents.]

CONTENTS

EDITORIAL
In This Issue
Bert R. Boyce	
563

RESEARCH
Information Retrieval by Metabrowsing
F. Wiesman, H.J. van den Herik, and A. Hasman
Published online 17 February 2004	
565
	Wiesman, van den Herik, and Hasman consider six difficulties in
information retrieval: expression of information need, communication of that
need to the system, implicit inter-human communication, indexing
consistency, reliability of retrieved items, and the need of the searcher
for five distinct knowledge types (system procedural, domain, search
strategy, indexing policy, and search tactics). Since humans are good at
recognizing relevance but not at describing it, browsing can overcome these
difficulties. In particular they suggest metabrowsing, the browsing of
information about documents’ domain, contents, location, and relations to
other documents, rather than of the documents themselves. They represent
domain with a simplified version of the Unified Medical Language System and
use 36,000 1995 Medline records for documents, each linked to the domain
file by its assigned primary or secondary index terms. A key term is
chosen from an alphabetical list, its preferred term is substituted, and a
window opened around this preferred term in the domain. Related terms may be
added to this window with arcs indicating the relation type and clickable
definitions. A new screen will give sub-terms of the chosen term and links
to documents so indexed. The document’s other terms can be displayed or its
content presented. Bookmarking, backtracking, and a history list provide for
reorientation, if needed. A test group of 24 second- and fourth-year medical
students used the system and WinSpires on the same file with three questions
designed by domain experts who also evaluated the retrieved documents.
Overall, there was no significant difference in effectiveness or user
satisfaction, and the system was less efficient for fourth-year students, who
also were more satisfied with WinSpires.

Improving Performance of Text Categorization by Combining Filtering and
Support Vector Machines
Irene Díaz, José Ranilla, Elena Montañes, Javier Fernández, and Elías F.
Combarro
Published online 20 February 2004	
579
	Díaz et alia believe text categorization, the automatic
classification of documents reduced to weighted stem counts and, in this
case, assigned to categories by a Support Vector Machine (SVM), can be
improved by feature reduction techniques despite the SVM’s unique capability
of handling large feature spaces. They compare the effect of term frequency,
inverse document frequency, and information gain as reduction techniques on
expert-classed collections: the Reuters-21578 corpus and three subsets of
the Ohsumed Medline collection, using fixed training sets and parameters for
the SVM. They define precision as the number of true positives over the sum
of the number of true and false positives, and recall as the number of true
positives over the sum of the number of true positives and false negatives,
and use van Rijsbergen’s combined measure with equal weights. The filtering
has no effect on precision, but all methods provide a significant
improvement in recall, and thus the combined measure, over unreduced text.
Information gain is the best performer at aggressive filtering levels. 
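
The measures defined above can be sketched in a few lines of Python. This is a minimal illustration that assumes raw counts of true positives, false positives, and false negatives are available; the function name `prf` and the sample counts are hypothetical and not taken from the article. With equal weights, the combined measure reduces to the harmonic mean of precision and recall.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, f1) from raw classification counts."""
    # Precision: true positives over all items assigned to the category.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: true positives over all items that belong to the category.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Equal-weight combined measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for a single category.
precision, recall, f1 = prf(tp=8, fp=2, fn=8)
print(precision, recall, f1)
```

Because the combined measure is a harmonic mean, a gain in recall at unchanged precision raises it, which is the pattern the abstract reports.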

A Formal Knowledge Management Ontology: Conduct, Activities, Resources, and
Influences
C.W. Holsapple and K.D. Joshi
Published online 25 February 2004	
593
	Holsapple and Joshi develop an ontology, or set of definitions and
axioms, which can be used to characterize knowledge management as a
discipline. The goal is to identify and express the knowledge manipulation
activities that fall within that domain. They begin by setting the
conditions for their design, namely, that their result occurs in business
settings, describes KM phenomena, and captures concepts at two or more
levels of detail. Then, they collected KM case studies, surveys, and
articles as a source for terminology, and chose terms via multiple
iterations until their satisfaction as to helpfulness, comprehensiveness,
and unification was attained. Through questionnaires put to a panel of
31 KM researchers and practitioners, the four initial components of their
framework (conduct, resources, knowledge manipulation, and KM influences)
were reviewed for completeness, accuracy, clarity, and conciseness and the
whole reviewed for utility, comprehensiveness, unification, and limitations.
The resulting revision along with a summary of comments was again sent to
the panel, evaluated by questionnaire, and the process repeated until no
further revision occurred. Ninety-four percent of panelists were at least
moderately satisfied with the ontology. Eighty-one percent felt the ontology
was at least moderately successful in terms of providing a unified and
comprehensive view. Sixty percent considered the result to be either helpful
or extremely helpful to researchers, and 70% felt it was at least moderately
helpful to practitioners. 

An Entropy-Based Interpretation of Retrieval Status Value-Based Retrieval,
and Its Application to the Computation of Term and Query Discrimination
Value
Sándor Dominich, Júlia Góth, Tamás Kiezer, and Zoltán Szlávik
Published online 5 February 2004	
613
	Dominich et alia show that any Retrieval Status Value (RSV) based
retrieval model can be seen as a probability space where the amount of
associated Shannon-type information is decreased by retrieval operations,
that is to say, as an Uncertainty Decreasing Operation (UDO) probability
space. Thus, a term’s discrimination value can be based upon its reduction
of the UDO space entropy, rather than upon its reduction of Euclidean space
as in the vector space model, and term discrimination values become
available to any RSV system. The term discrimination values (TDV) for an 82
document ADI test collection that gave 915 terms, stop listed and
Porter stemmed, were computed by each method. About half the terms using UDO
have a 100% TDV, and each such term has a positive vector space based TDV
indicating agreement on good discrimination. Most of the terms with UDO
based TDV between 80% and 100% have positive vector space based TDVs, while
those between 40% and 80% have near-zero vector space discrimination values.
UDO may be used to compute a discrimination value for queries, and such
values were computed for 35 ADI test queries. The fewer relevant answers a
query has, the higher its discrimination value was found to be, except for
query 27 where all terms have very high document frequencies and the query
is extremely general. Retrieval tests on ADI using both weights indicate
that UDO weights enhance precision at recall levels above 50%, but perform
equally at lower recall levels. Tests on three additional databases of
various similarity measures show that dot product reduces entropy to the
greatest extent and that cosine produces the least entropy reduction. The
use of normalized frequency weighting reduces entropy to the greatest
extent, while lack of normalization gave the least entropy reduction. UDO is
faster and less restrictive. 
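
As a rough illustration of the entropy view described above, the sketch below normalizes a list of retrieval status values into a probability distribution and computes its Shannon entropy. Normalizing by the sum of the RSVs, the function name `rsv_entropy`, and the sample values are all assumptions made for illustration; the article's exact construction of the UDO probability space may differ.

```python
import math


def rsv_entropy(rsvs):
    """Shannon entropy in bits of RSVs normalized to sum to 1 (assumed scheme)."""
    total = sum(rsvs)
    if total <= 0:
        return 0.0
    # Zero-valued RSVs contribute nothing to the entropy and are skipped.
    probs = [v / total for v in rsvs if v > 0]
    return sum(-p * math.log2(p) for p in probs)


# Hypothetical RSVs for four documents.
uniform = rsv_entropy([1.0, 1.0, 1.0, 1.0])  # no document stands out
peaked = rsv_entropy([4.0, 0.0, 0.0, 0.0])   # one document dominates
print(uniform, peaked)
```

Under this reading, a term or query that concentrates retrieval status on few documents yields lower entropy than one that spreads it evenly.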

The Effects of Fitness Functions on Genetic Programming-Based Ranking
Discovery for Web Search
Weiguo Fan, Edward A. Fox, Praveen Pathak, and Harris Wu
Published online 17 February 2004	
628
	Fan et alia find fitness function design important in the
improvement of Genetic Programming based ranking functions for Web
retrieval. Candidate ranking functions are represented as individuals in a
GP population tree structure and evolved to find those with better fitness
values. Average precision, which does not preserve rank order information,
has been the reasonably effective common fitness function, but other
possibilities may improve performance. The ideal utility function preserves
rank order information and is non-linear with high values for documents
ranked at the top of the list and quickly losing value as the rank
increases. Four functions are designed to meet these requirements. Chang and
Kwok and Lopez-Pujalte et alia each provide functions that preserve rank
order information with the Lopez-Pujalte function incorporating negative
values for nonrelevant documents. As an experimental baseline, the Okapi
BM25 ranking formula is used with the TREC 10GB collection of 1.69 million
documents and 100 queries from TREC 9 and TREC 10 in a vector space format.
The fitness function in use had a noticeable effect on performance with
three of the new functions showing strong improvement. 
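
To make the fitness functions discussed above more concrete, the sketch below computes average precision for a ranked list and, beside it, a hypothetical rank-discounted utility that gives high credit to relevant documents near the top and rapidly less further down. The `1 / rank**2` discount and both function names are illustrative assumptions; they are not the four functions designed in the article, nor those of Chang and Kwok or Lopez-Pujalte et alia.

```python
def average_precision(ranked_rel, total_relevant):
    """Average precision; ranked_rel is a list of 0/1 flags in rank order."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            score += hits / rank  # precision at this relevant document
    return score / total_relevant if total_relevant else 0.0


def discounted_utility(ranked_rel):
    """Hypothetical utility: each relevant document earns 1 / rank**2."""
    return sum(rel / rank**2 for rank, rel in enumerate(ranked_rel, start=1))


# Two toy rankings with the same relevant documents in different positions.
top_heavy = [1, 1, 0, 0]
bottom_heavy = [0, 0, 1, 1]
print(average_precision(top_heavy, 2), average_precision(bottom_heavy, 2))
print(discounted_utility(top_heavy), discounted_utility(bottom_heavy))
```

In a genetic programming loop, a function of this kind would score each candidate ranking function on training queries, and the higher-scoring candidates would be kept for the next generation.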

Query Association Surrogates for Web Search
Falk Scholer, Hugh E. Williams, and Andrew Turpin
Published online 25 February 2004	
637
	Scholer, Williams, and Turpin construct document surrogates by
supplementing existing document texts with terms from queries that returned
these documents among the top N (thirty-nine) of a retrieved list based upon
the Okapi BM25 similarity measure, and limiting such supplementation to M
(nineteen) queries per document. When the set limits are reached, new query
terms with higher similarity measures can supplant those in existence.
However, only terms that appear in the document as well as the associated
query may be added to the surrogate, so that it is the weight of these terms
that changes in the document surrogate. They also create surrogates that are
a set of such query terms without the original document surrogate. The 1.69
million Web documents of TREC WT10g make up the experimental collection,
which is searched for title word strings (stop listed but not stemmed) from
50 queries each from TREC-9 and TREC-2001 without relevance feedback.
Queries for creation of supplements came from some 900,000 logged Excite
queries. Query association improved mean average precision by 4.3%, and mean
average precision at 10 by 7%. Adding anchor terms has no effect on queries
that did well, but it reduces performance of those below the baseline
even further. Query term surrogates without full text are 6% less effective
under average precision at 10 than text alone. Query associations did not
appear helpful for named page finding, and a dynamic parameter setting for M
and N does not lead to improvement. 
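
The association step described above might be sketched as follows. This is an illustrative reading, not the authors' implementation: the `associate` function, the heap-based storage, and the toy scores are assumptions, and in practice the similarity scores would come from a ranker such as Okapi BM25. Each document keeps at most M associated queries; once that limit is reached, a new query replaces the weakest one only if its similarity is higher. The further restriction that only query terms already present in the document enter the surrogate is omitted here.

```python
import heapq
from collections import defaultdict


def associate(associations, query, ranked, n, m):
    """Attach `query` to each of its top-n documents.

    ranked: list of (doc_id, similarity), sorted by descending similarity.
    associations: dict mapping doc_id to a min-heap of (similarity, query).
    At most m queries are kept per document; when the heap is full, the
    lowest-scoring query is evicted if the new one scores higher.
    """
    for doc_id, score in ranked[:n]:
        heap = associations[doc_id]
        if len(heap) < m:
            heapq.heappush(heap, (score, query))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, query))


# Toy run with hypothetical queries and scores, using n=1 and m=2.
associations = defaultdict(list)
associate(associations, "cheap flights", [("d1", 2.0), ("d2", 1.0)], n=1, m=2)
associate(associations, "flight deals", [("d1", 3.0)], n=1, m=2)
associate(associations, "air travel", [("d1", 2.5)], n=1, m=2)
kept = sorted(q for _, q in associations["d1"])
print(kept)
```

A min-heap keeps the weakest association at the front, so the check against the current minimum and the replacement are both cheap.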

BOOK REVIEWS
A History of Online Information Services, 1963–1976, by Charles P. Bourne
and Trudi Bellardo Hahn
Derek G. Smith
Published online 6 February 2004	
651

Research Questions for the Twenty-first Century, edited by Mary Jo Lynch
Lydia Eato Harris
Published online 23 February 2004	
652

-------------------                                                      
The ASIS web site <http://www.asis.org/Publications/JASIS/tocs.html>
contains the Table of Contents and brief abstracts as above from January
1993 (Volume 44) to date.

The John Wiley Interscience site <http://www.interscience.wiley.com>
includes issues from 1986 (Volume 37) to date.  Guests have access only to
tables of contents and abstracts.  Registered users of the interscience site
have access to the full text of these issues and to preprints.

------------
Richard Hill
Executive Director
American Society for Information Science and Technology
1320 Fenwick Lane, Silver Spring, MD  20910 
FAX: (301) 495-0810
Voice: (301) 495-0900
www.asis.org