[Sigia-l] Comparing search engines

Thu Jun 19 12:24:04 EDT 2003

I spent a little time recently looking into the use of measures for
evaluating search engines, and I have a couple of notes to pass along in
the hope they might be helpful:

(1) It's hard to get users to make any evaluations based on either
precision or recall (the "Cranfield" measures), since these presuppose
an awareness of the size, scope, and contents of the underlying
database.  How would you know whether you were getting "good recall"
unless you already knew what was in there to begin with?  To a certain
degree and in certain circumstances, a user can make rough judgments
about precision or recall (e.g. "I know that there is a document in
here that I'm not finding"), but asking them to regularly make
judgments of that nature is problematic.
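
To make the dependency concrete, here is a minimal sketch in Python of
how the two measures are usually computed.  The function name and the
document IDs are made up for illustration.  Note that recall needs the
complete set of relevant documents in the database, which is exactly
what an ordinary user does not have.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document IDs the engine returned (hypothetical data)
    relevant:  ALL relevant document IDs in the database, which must
               be known in advance (the hard part for a real user)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents actually returned

    # Precision: what fraction of the returned documents are relevant?
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of all relevant documents were returned?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative IDs only: 4 documents returned, 5 relevant in total.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d9", "d10"]
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.5 0.4
```

A user looking at the four results can roughly estimate precision by
eye, but has no way to know about d7, d9, and d10 without the full
relevance list.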

(2) Recall and precision are widely used in TREC and as part of other
research studies, and have proven effective over many years of research
in improving results in test conditions.  But these measures are NOT
used as evaluation tools by the manufacturers of search engines.  This
is actually a pretty significant gap, since it means that the things
researchers are paying attention to may have little influence on the
design of the search tools users will employ.

(3) User judgment about results is notoriously inconsistent.  One of the
biggest problems is that users tend to be satisfied with results that
are "good enough", i.e. that have *some* relevance to the task at hand,
even if only tangentially so, even when more "relevant" results exist in
the system.  This is not to blame the user for being ignorant, but
rather to point out that judging results by relevance is complex,
depends on many factors, and is therefore hard to generalize about.

(4) Some additional considerations that might factor into an
evaluation, however, include the system's ability to remove duplicate
entries from the result set and its ability to help users find
previously unknown items within the database.  More practically focused
measures might be how well the search engine handles "error" cases
(zero results, errors in query entry), and whether the results display
is clear (i.e. whether it accurately reflects the underlying content,
regardless of whether the results are relevant).
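
As a rough illustration of how a couple of these could be turned into
numbers, the sketch below (Python) computes the share of duplicate
entries in a result list and the share of queries that return nothing.
The function names and sample data are hypothetical, and treating two
entries as duplicates only when their IDs match is a simplifying
assumption; real duplicate detection would likely compare content.

```python
def duplicate_rate(result_ids):
    """Fraction of entries in one result list that repeat an earlier ID."""
    if not result_ids:
        return 0.0
    return 1 - len(set(result_ids)) / len(result_ids)


def zero_result_rate(results_by_query):
    """Fraction of queries for which the engine returned nothing."""
    if not results_by_query:
        return 0.0
    empty = sum(1 for ids in results_by_query.values() if not ids)
    return empty / len(results_by_query)


# Made-up sample: four queries and the IDs each one returned.
sample = {
    "annual report": ["a", "b", "a", "c"],
    "anual report": [],  # misspelled query, no results
    "travel policy": ["d"],
    "expense form": ["e", "f"],
}
print(duplicate_rate(sample["annual report"]))  # 0.25
print(zero_result_rate(sample))                 # 0.25
```

Neither number requires knowing the full contents of the database,
which is what makes measures like these easier to collect than recall.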

This is a fascinating and wide topic, one on which the jury is still out
to a great extent. I would love to hear more from anyone who can speak
authoritatively about these questions. 

CH

--------------

Charles Hanson 
[information architect]
[sbi-razorfish]

212.798.7922 office
212.966.6915 fax