Aphinyanaphongs Y, Statnikov A, Aliferis CF "A comparison of citation metrics to machine learning filters for the identification of high quality MEDLINE documents" Journal of the American Medical Informatics Association 13(4):446-455 July-August 2006.

Eugene Garfield garfield at CODEX.CIS.UPENN.EDU
Wed Sep 27 14:34:30 EDT 2006


E-mail Addresses: C.F. Aliferis: constantin.aliferis at vanderbilt.edu


Title: A comparison of citation metrics to machine learning filters for the 
identification of high quality MEDLINE documents 

Author(s): Aphinyanaphongs Y, Statnikov A, Aliferis CF 

Source: JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION 13 (4): 
446-455 JUL-AUG 2006 

Document Type: Article 
Language: English 
Cited References: 37      Times Cited: 0   
     
Abstract: 
Objective: The present study explores the discriminatory performance of 
existing and novel gold-standard-specific machine learning (GSS-ML) focused 
filter models (i.e., models built specifically for a retrieval task and a 
gold standard against which they are evaluated) and compares their 
performance to citation count and impact factors, and non-specific machine 
learning (NS-ML) models (i.e., models built for a different task and/or 
different gold standard). 

Design: Three gold standard corpora were constructed using the SSOAB 
bibliography, the ACPJ-cited treatment articles, and the ACPJ-cited 
etiology articles. Citation counts and impact factors were obtained for 
each article. Support vector machine models were used to classify the 
articles using combinations of content, impact factors, and citation counts 
as predictors.

Measurements: Discriminatory performance was estimated using the area under 
the receiver operating characteristic curve and n-fold cross-validation.
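
As a rough illustration of the measurement described above, the area under 
the ROC curve can be computed as the fraction of (positive, negative) article 
pairs that a score ranks correctly, and then averaged over n held-out folds. 
The sketch below is not the authors' code. The function names, the 
round-robin fold assignment, and the toy citation counts and labels are all 
assumptions made for illustration, and the score being evaluated is a fixed 
one (such as a citation count), so no model is trained inside the folds.

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney pair count.

    scores: one number per article (higher = ranked as more likely positive).
    labels: 1 if the article is in the gold standard, 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Count correctly ordered (positive, negative) pairs; ties count half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, n_folds):
    """Split indices into n folds, round-robin within each class.

    Hypothetical helper: keeps both classes present in every fold as long
    as each class has at least n_folds members.
    """
    folds = [[] for _ in range(n_folds)]
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        for k, i in enumerate(members):
            folds[k % n_folds].append(i)
    return folds


def mean_fold_auc(scores, labels, n_folds):
    """Average the AUC of a fixed score over n held-out folds."""
    folds = stratified_folds(labels, n_folds)
    per_fold = [
        auc([scores[i] for i in fold], [labels[i] for i in fold])
        for fold in folds
    ]
    return sum(per_fold) / len(per_fold)


# Toy data (invented): citation counts and gold-standard membership.
citation_counts = [12, 3, 7, 0, 5, 1, 9, 2]
in_gold_standard = [1, 0, 1, 0, 0, 0, 1, 1]

overall_auc = auc(citation_counts, in_gold_standard)
folded_auc = mean_fold_auc(citation_counts, in_gold_standard, n_folds=2)
```

For a learned filter such as a support vector machine, the same fold loop 
would train on the remaining folds and score only the held-out articles 
before calling auc.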

Results: For all three gold standards and tasks, GSS-ML filters 
outperformed citation count, impact factors, and NS-ML filters. 
Combinations of content with impact factor or citation count produced no or 
negligible improvements to the GSS machine learning filters.

Conclusions: These experiments provide evidence that when building 
information retrieval filters focused on a retrieval task and corresponding 
gold standard, the filter models have to be built specifically for this 
task and gold standard. Under those conditions, machine learning filters 
outperform standard citation metrics. Furthermore, citation counts and 
impact factors add marginal value to discriminatory performance. Previous 
research that claimed better performance of citation metrics than machine 
learning in one of the corpora examined here is attributed to using machine 
learning filters built for a different gold standard and task.

KeyWords Plus: DETECTING CLINICALLY SOUND; OPTIMAL SEARCH STRATEGIES; TEXT 
CATEGORIZATION; RETRIEVAL 

Addresses: Aliferis CF (reprint author), Vanderbilt Univ, Dept Biomed 
Informat, Eskind Biomed Lib, Discovery Syst Lab, Room 412,2209 Garland Ave, 
Nashville, TN 37232 USA
Vanderbilt Univ, Dept Biomed Informat, Eskind Biomed Lib, Discovery Syst 
Lab, Nashville, TN 37232 USA 

E-mail Addresses: constantin.aliferis at vanderbilt.edu 

Publisher: ELSEVIER SCIENCE INC, 360 PARK AVE SOUTH, NEW YORK, NY 
10010-1710 USA 
Subject Category: COMPUTER SCIENCE, INFORMATION SYSTEMS; COMPUTER SCIENCE, 
INTERDISCIPLINARY APPLICATIONS; INFORMATION SCIENCE & LIBRARY SCIENCE; 
MEDICAL INFORMATICS 
IDS Number: 064EU 

ISSN: 1067-5027 


CITED REFERENCES :
ACP J 131 : A15 1999   
 LIBSVM LIB SUPPORT V : 2005   
 PUBMED : 2005   

 ALIFERIS C
P AMIA S WASH DC : 2003   

 ALIFERIS CF
METMBS : 371 1908   

 APHINYANAPHONGS Y
Text categorization models for high-quality article retrieval in internal 
medicine
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION 12 : 207 2005  
 
 APHINYANAPHONGS Y
MEDINFO : 2004  
 
 BAEZAYATES R
MODERN INFORMATION R : 1999   

 BERNSTAM EV
J AM MED INFORM ASS : 2005   

 DELONG ER
COMPARING THE AREAS UNDER 2 OR MORE CORRELATED RECEIVER OPERATING 
CHARACTERISTIC CURVES - A NONPARAMETRIC APPROACH
BIOMETRICS 44 : 837 1988   

 DUDA S
AMIA S WASH D C : 2005   

 DUDOIT S
126 UC BERK DIV BIOS : 2003  
 
 DUMAIS S
P ACM CIKM98 NOV : 1998   

 FAWCETT T
HPL20034 : 2003   

 GARFIELD E
CAN CITATION INDEXIN : 1965   

 GARFIELD E
INT J CLIN HLTH PSYC 3 : 363 2003   

 GARFIELD E
SCI PUBL POLICY 19 : 321 1992   

 GUYON I
Gene selection for cancer classification using support vector machines
MACHINE LEARNING 46 : 389 2002   

 HAND DJ
A simple generalisation of the area under the ROC curve for multiple class 
classification problems
MACHINE LEARNING 45 : 171 2001 
  
 HAYNES RB
DEVELOPING OPTIMAL SEARCH STRATEGIES FOR DETECTING CLINICALLY SOUND STUDIES 
IN MEDLINE
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION 1 : 447 1994  
 
 HSU CW
PRACTICAL GUIDE SUPP : 2005   

 JENKINS M
HLTH INFO LIB J 21 : 148 2004   

 JOACHIMS T
LEARNING CLASSIFY TE : 2002   

 KLEINBERG
P ACM SIAM S DISCR A : 1997   

 LEOPOLD E
Text categorization with support vector machines. How to represent texts in 
input space ?
MACHINE LEARNING 46 : 423 2002 
  
 PAGANO M
PRINCIPLES BIOSTATIS : 2000 
  
 PAGE L
PAGERANK CITATION RA : 1998  
 
 PORTER MF
AN ALGORITHM FOR SUFFIX STRIPPING
PROGRAM-AUTOMATED LIBRARY AND INFORMATION SYSTEMS 14 : 130 1980 
  
 PROVOST F
ICML 98 15 INT C MAC : 1998   

 SALTON G
TERM-WEIGHTING APPROACHES IN AUTOMATIC TEXT RETRIEVAL
INFORMATION PROCESSING & MANAGEMENT 24 : 513 1988 
 
 SCHEFFER T
ERROR ESTIMATION MOD : 1999   

 SUN A
ICDM : 2001   

 TSAMARDINOS I
AI STAT : 2003   

 VAPNIK V
STAT LEARNING THEORY : 1998  
 
 WEISS S
COMPUTER SYSTEMS LEA : 1991   

 WILCZYNSKI NL
Optimal search strategies for detecting clinically sound prognostic studies 
in EMBASE: An analytic survey
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION 12 : 481 2005  
 
 YANG Y
22 ANN ACM C RES DEV : 1999   


