Kishida K. "Techniques of document clustering: A review" LIBRARY AND INFORMATION SCIENCE (49). 2003. p.33-75 MITA SOC LIBRARY INFORMATION SCIENCE, TOKYO
Eugene Garfield
garfield at CODEX.CIS.UPENN.EDU
Wed Feb 23 15:18:13 EST 2005
E-mail: K. Kishida: kishida at surugadai.ac.jp
TITLE: Techniques of document clustering: A review (Review,
English)
AUTHOR: Kishida, K
SOURCE: LIBRARY AND INFORMATION SCIENCE (49). 2003. p.33-75 MITA
SOC LIBRARY INFORMATION SCIENCE, TOKYO
SEARCH TERM(S): SMALL H rauth; J DOC* rwork; J INF SCI rwork;
SCIENTOMETR* rwork
KEYWORDS+: INFORMATION-RETRIEVAL; CLASSIFICATION; ORGANIZATION;
COLLECTIONS; ALGORITHMS; SIMILARITY; TEXT; COMPUTATION;
DATABASES; SCIENCE
ABSTRACT: Document clustering is widely recognized as a useful tool
for information retrieval, organizing web documents, text mining,
and so on. The purpose of this paper is to review various document
clustering techniques and to discuss research issues for enhancing
the effectiveness or efficiency of clustering methods. We explore the
extensive literature on non-hierarchical methods (single-pass methods),
hierarchical methods (single-link, complete-link, etc.), dimensionality
reduction methods (LSI, principal component analysis, etc.),
probabilistic methods, data mining techniques, and so on. In particular,
this paper focuses on typical techniques, such as the k-means algorithm,
the leader-follower algorithm, the self-organizing map (SOM), single- or
complete-link methods, bisecting k-means methods, latent semantic
indexing (LSI), the Gaussian mixture model, and so on. After reviewing the
techniques and algorithms, we discuss research issues in document
clustering: computational complexity, feature extraction (selection of
words), methods for defining term weights and similarity, and evaluation
of results.
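
The single-pass (leader-follower) approach named in the abstract can be
sketched roughly as follows. This is an illustration only, not code from
the reviewed paper: the function names, the smoothed tf-idf weighting, the
cosine similarity measure, the running-mean centroid update, and the
threshold value are all assumptions chosen for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenised documents into L2-normalised tf-idf dictionaries."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumed weighting) keeps every weight positive.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass_cluster(vectors, threshold=0.2):
    """Scan the documents once, in input order. Each document joins the
    most similar existing cluster, or starts a new cluster when no
    centroid reaches the similarity threshold."""
    centroids, sizes, labels = [], [], []
    for vec in vectors:
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is None or sims[best] < threshold:
            centroids.append(dict(vec))
            sizes.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Move the centroid to the mean of its members so far.
            k, c = sizes[best], centroids[best]
            for t in set(c) | set(vec):
                c[t] = (c.get(t, 0.0) * k + vec.get(t, 0.0)) / (k + 1)
            sizes[best] = k + 1
            labels.append(best)
    return labels


# Toy corpus (hypothetical): two topics with disjoint vocabularies.
docs = [
    ["cluster", "document", "retrieval"],
    ["neural", "map", "training"],
    ["document", "retrieval", "cluster", "index"],
    ["map", "neural", "network"],
]
labels = single_pass_cluster(tfidf_vectors(docs), threshold=0.2)
print(labels)
```

Because the scan is single-pass, the result depends on document order and
on the threshold; raising the threshold yields more, smaller clusters.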
Addresses: Kishida K (reprint author), Surugadai Univ, 698 Azu, Hanno,
Saitama Japan
Surugadai Univ, Hanno, Saitama Japan
Publisher: MITA SOC LIBRARY INFORMATION SCIENCE, KEIO UNIV 2-15-45 MITA,
SCHOOL LIBRARY INFO SCIENCE, MINATO-KU, TOKYO, 108-8345, JAPAN
Subject Category: INFORMATION SCIENCE & LIBRARY SCIENCE
IDS Number: 881IO
ISSN: 0373-4447
CITED REFERENCES :
CR *NIST, TOP DET TRACK TDT
*NTCIR, NTCIR NII NACSIS TES
*TREC, TEXT RETR C TREC
ANDO RK, 2000, P ACM SIGIR ATH GREE, P216
ANDO RK, 2001, P 24 ANN INT ACM SIG, P154
AZCARRAGA AP, 2001, P 10 C INF KNOWL MAN, P41
BEIL F, 2002, P 8 INT C KNOWL DISC, P436
BOLEY D, 1999, ARTIF INTELL REV, V13, P365
BOTE VPG, 2002, INFORM PROCESS MANAG, V38, P79
BUCKLEY C, 1994, P TREC 2, P45
CAN F, 1984, J AM SOC INFORM SCI, V35, P268
CAN F, 1985, J AM SOC INFORM SCI, V36, P3
CAN F, 1987, J AM SOC INFORM SCI, V38, P171
CAN F, 1990, ACM T DATABASE SYST, V15, P483
CAN F, 1993, ACM T INFORM SYST, V11, P143
CAN F, 1995, INFORM SCI, V84, P101
CHEN H, 1994, COMMUN ACM, V37, P56
CROFT WB, 1977, J AM SOC INFORM SCI, V28, P341
CROUCH DB, 1975, INFORMATION PROCESSI, V11, P11
CUGINI J, 1997, CODATA EUR AM WORKSH
CUTTING DR, 1992, P 15 ANN INT ACM SIG, P318
DEERWESTER S, 1990, J AM SOC INFORM SCI, V41, P391
DHILLON I, 2004, SURVEY TEXT MINING, P73
DUDA RO, 2001, PATTERN CLASSIFICATI
ELHAMDOUCHI A, 1987, J INF SCI, V13, P361
FRANZ M, 2001, P P RES DEV INF RETR, P310
FRIGUI H, 2004, SURVEY TEXT MINING, P45
FUNG BCM, 2003, SIAM INT C DAT MIN
GOLUB GH, 1996, MATRIX COMPUTATIONS
GRIFFITHS A, 1984, J DOC, V40, P175
HAN J, 2001, DATA MINING CONCEPTS
HARDING AF, 1980, J AM SOC INFORM SCI, V31, P298
HATZIVASSILOGLO.V, 2000, P 23 ACM SIGIR C RES, P224
HAVRE S, 2002, IEEE T VIS COMPUT GR, V8, P9
HEARST MA, 1996, P 19 ANN INT ACM SIG, P76
HOFMANN T, 1999, P IJCAI 99
HONKELA T, 1996, P ICNN96 INT C NEURA, P56
ISHIKAWA Y, 2001, LNCS, V2163, P325
JAIN AK, 1999, ACM COMPUT SURV, V31, P3
JARDINE N, 1971, INFORMATION STORAGE, V7, P217
JONES JP, 1995, J MARK COMMUN, V1, P1
KASKI S, 1998, P IJCNN 98 INT JOINT, V1, P413
KASKI S, 1999, P ICANN99 9 INT C AR, V2, P940
KOBAYASHI M, 2004, SURVEY TEXT MINING C, P103
KOGAN J, 2001, P WORKSH TEXT MIN 1, P47
KOHONEN T, 2000, IEEE T NEURAL NETWOR, V11, P574
KOLATCH E, 2001, CLUSTERING ALGORITHM
KORFHAGE RR, 1997, INFORMATION STORAGE
LAGUS K, 1999, P ICANN99 9 INT C AR, V1, P371
LAGUS K, 2000, A61 HELS U TECH LAB
LARSEN B, 1999, P 5 ACM SIGKDD INT C, P16
LIN X, 1991, P 14 ANN INT ACM SIG, P262
LIN X, 1997, J AM SOC INFORM SCI, V48, P40
LIU X, 2002, P 25 ANN INT ACM SIG, P191
MILLER NE, 1998, IEEE VISUALIZATION 9, P189
MURESAN G, 2001, LNCS, V2163, P438
MURTAGH F, 1984, INFORMATION PROCESSI, V20, P611
NILSSON M, 2002, INFORM RETRIEVAL, V5, P311
OLSEN KA, 1993, INFORMATION PROCESSI, V29, P66
ORWIG RE, 1997, J AM SOC INFORM SCI, V48, P157
PANTEL P, 2002, P 25 INT ACM SIGIR C, P199
PAPKA R, 2000, ADV INFORMATION RETR, P97
RASMUSSEN E, 1992, INFORMATION RETRIEVA, P419
ROUSSINOV DG, 2001, INFORM PROCESS MANAG, V37, P789
SAHAMI M, 1998, P 3 ACM INT C DIG LI, P200
SALTON G, 1983, INTRO MODERN INFORMA
SCHUTZE H, 1997, P ACM SIGIR C, P74
SIBSON R, 1973, COMPUT J, V16, P30
SILVERSTEIN C, 1997, P 20 INT ACM SIGIR C, P60
SLONIM N, 2000, P 23 ANN INT ACM SIG, P267
SMALL H, 1997, SCIENTOMETRICS, V38, P275
SMALL H, 1999, J AM SOC INFORM SCI, V50, P799
SMEATON AF, 1998, 20 BCS IRGS C INF RE
STEINBACH M, 2000, KDD WORKSH TEXT MIN
VANHULLE, 2000, FAITHFULL REPRESENTA
VANRIJSBERGEN CJ, 1973, J DOC, V29, P251
VANRIJSBERGEN CJ, 1974, INFORM STORAGE RETR, V10, P1
VANRIJSBERGEN CJ, 1975, INFORM PROCESS MANAG, V11, P171
VANRIJSBERGEN CJ, 1979, INFORMATION RETRIEVA
VOORHEES EM, 1986, INFORM PROCESS MANAG, V22, P465
WANG Y, 2002, P 11 ACM IN C INF KN, P499
WEISS S, 2000, RC21684 IBM RES
WILLETT P, 1980, J INFORM SCI, V2, P223
WILLETT P, 1981, INFORMATION PROCESSI, V17, P53
WILLETT P, 1988, INFORMATION PROCESSI, V24, P577
WISE JA, 1999, J AM SOC INFORM SCI, V50, P1224
XU JX, 2000, ADV INFORMATION RETR, P151
XU W, 2003, P 26 ANN INT ACM SIG, P267
YU CT, 1974, J AM SOC INFORM SCI, V25, P218
ZAMIR O, 1998, P 21 ANN INT ACM SIG, P45
ZAMIR O, 1999, 8 INT WORLD WIDE WEB
ZHAO Y, 2002, P INT C INF KNOWL MA, P515