Kishida K. "Techniques of document clustering: A review" LIBRARY AND INFORMATION SCIENCE (49). 2003. p.33-75 MITA SOC LIBRARY INFORMATION SCIENCE, TOKYO

Eugene Garfield garfield at CODEX.CIS.UPENN.EDU
Wed Feb 23 15:18:13 EST 2005


E-mail: K. Kishida: kishida at surugadai.ac.jp


TITLE:          Techniques of document clustering: A review (Review,
                English)
AUTHOR:         Kishida, K

SOURCE:         LIBRARY AND INFORMATION SCIENCE (49). 2003. p.33-75 MITA
                SOC LIBRARY INFORMATION SCIENCE, TOKYO

SEARCH TERM(S):  SMALL H  rauth; J DOC*  rwork; J INF SCI  rwork;
                 SCIENTOMETR*  rwork

KEYWORDS+:       INFORMATION-RETRIEVAL; CLASSIFICATION; ORGANIZATION;
                COLLECTIONS; ALGORITHMS; SIMILARITY; TEXT; COMPUTATION;
                DATABASES; SCIENCE

ABSTRACT:       Document clustering is widely recognized as a useful
technique for information retrieval, organizing web documents, text
mining, and related tasks. The purpose of this paper is to review various
document clustering techniques and to discuss research issues for
enhancing the effectiveness or efficiency of clustering methods. We
explore the extensive literature on non-hierarchical methods (single-pass
methods), hierarchical methods (single-link, complete-link, etc.),
dimensionality reduction methods (LSI, principal component analysis,
etc.), probabilistic methods, data mining techniques, and others. In
particular, this paper focuses on typical techniques such as the k-means
algorithm, the leader-follower algorithm, the self-organizing map (SOM),
single- and complete-link methods, bisecting k-means methods, latent
semantic indexing (LSI), and the Gaussian mixture model. After reviewing
the techniques and algorithms, we discuss research issues in document
clustering: computational complexity, feature extraction (selection of
words), methods for defining term weights and similarity, and evaluation
of results.


Addresses: Kishida K (reprint author), Surugadai Univ, 698 Azu, Hanno,
Saitama Japan
Surugadai Univ, Hanno, Saitama Japan

Publisher: MITA SOC LIBRARY INFORMATION SCIENCE, KEIO UNIV 2-15-45 MITA,
SCHOOL LIBRARY INFO SCIENCE, MINATO-KU, TOKYO, 108-8345, JAPAN
Subject Category: INFORMATION SCIENCE & LIBRARY SCIENCE

IDS Number: 881IO
ISSN: 0373-4447

CITED REFERENCES:
   *NIST, TOP DET TRACK TDT
   *NTCIR, NTCIR NII NACSIS TES
   *TREC, TEXT RETR C TREC
   ANDO RK, 2000, P ACM SIGIR ATH GREE, P216
   ANDO RK, 2001, P 24 ANN INT ACM SIG, P154
   AZCARRAGA AP, 2001, P 10 C INF KNOWL MAN, P41
   BEIL F, 2002, P 8 INT C KNOWL DISC, P436
   BOLEY D, 1999, ARTIF INTELL REV, V13, P365
   BOTE VPG, 2002, INFORM PROCESS MANAG, V38, P79
   BUCKLEY C, 1994, P TREC 2, P45
   CAN F, 1984, J AM SOC INFORM SCI, V35, P268
   CAN F, 1985, J AM SOC INFORM SCI, V36, P3
   CAN F, 1987, J AM SOC INFORM SCI, V38, P171
   CAN F, 1990, ACM T DATABASE SYST, V15, P483
   CAN F, 1993, ACM T INFORM SYST, V11, P143
   CAN F, 1995, INFORM SCI, V84, P101
   CHEN H, 1994, COMMUN ACM, V37, P56
   CROFT WB, 1977, J AM SOC INFORM SCI, V28, P341
   CROUCH DB, 1975, INFORMATION PROCESSI, V11, P11
   CUGINI J, 1997, CODATA EUR AM WORKSH
   CUTTING DR, 1992, P 15 ANN INT ACM SIG, P318
   DEERWESTER S, 1990, J AM SOC INFORM SCI, V41, P391
   DHILLON I, 2004, SURVEY TEXT MINING, P73
   DUDA RO, 2001, PATTERN CLASSIFICATI
   ELHAMDOUCHI A, 1987, J INF SCI, V13, P361
   FRANZ M, 2001, P P RES DEV INF RETR, P310
   FRIGUI H, 2004, SURVEY TEXT MINING, P45
   FUNG BCM, 2003, SIAM INT C DAT MIN
   GOLUB GH, 1996, MATRIX COMPUTATIONS
   GRIFFITHS A, 1984, J DOC, V40, P175
   HAN J, 2001, DATA MINING CONCEPTS
   HARDING AF, 1980, J AM SOC INFORM SCI, V31, P298
   HATZIVASSILOGLO.V, 2000, P 23 ACM SIGIR C RES, P224
   HAVRE S, 2002, IEEE T VIS COMPUT GR, V8, P9
   HEARST MA, 1996, P 19 ANN INT ACM SIG, P76
   HOFMANN T, 1999, P IJCAI 99
   HONKELA T, 1996, P ICNN96 INT C NEURA, P56
   ISHIKAWA Y, 2001, LNCS, V2163, P325
   JAIN AK, 1999, ACM COMPUT SURV, V31, P3
   JARDINE N, 1971, INFORMATION STORAGE, V7, P217
   JONES JP, 1995, J MARK COMMUN, V1, P1
   KASKI S, 1998, P IJCNN 98 INT JOINT, V1, P413
   KASKI S, 1999, P ICANN99 9 INT C AR, V2, P940
   KOBAYASHI M, 2004, SURVEY TEXT MINING C, P103
   KOGAN J, 2001, P WORKSH TEXT MIN 1, P47
   KOHONEN T, 2000, IEEE T NEURAL NETWOR, V11, P574
   KOLATCH E, 2001, CLUSTERING ALGORITHM
   KORFHAGE RR, 1997, INFORMATION STORAGE
   LAGUS K, 1999, P ICANN99 9 INT C AR, V1, P371
   LAGUS K, 2000, A61 HELS U TECH LAB
   LARSEN B, 1999, P 5 ACM SIGKDD INT C, P16
   LIN X, 1991, P 14 ANN INT ACM SIG, P262
   LIN X, 1997, J AM SOC INFORM SCI, V48, P40
   LIU X, 2002, P 25 ANN INT ACM SIG, P191
   MILLER NE, 1998, IEEE VISUALIZATION 9, P189
   MURESAN G, 2001, LNCS, V2163, P438
   MURTAGH F, 1984, INFORMATION PROCESSI, V20, P611
   NILSSON M, 2002, INFORM RETRIEVAL, V5, P311
   OLSEN KA, 1993, INFORMATION PROCESSI, V29, P66
   ORWIG RE, 1997, J AM SOC INFORM SCI, V48, P157
   PANTEL P, 2002, P 25 INT ACM SIGIR C, P199
   PAPKA R, 2000, ADV INFORMATION RETR, P97
   RASMUSSEN E, 1992, INFORMATION RETRIEVA, P419
   ROUSSINOV DG, 2001, INFORM PROCESS MANAG, V37, P789
   SAHAMI M, 1998, P 3 ACM INT C DIG LI, P200
   SALTON G, 1983, INTRO MODERN INFORMA
   SCHUTZE H, 1997, P ACM SIGIR C, P74
   SIBSON R, 1973, COMPUT J, V16, P30
   SILVERSTEIN C, 1997, P 20 INT ACM SIGIR C, P60
   SLONIM N, 2000, P 23 ANN INT ACM SIG, P267
   SMALL H, 1997, SCIENTOMETRICS, V38, P275
   SMALL H, 1999, J AM SOC INFORM SCI, V50, P799
   SMEATON AF, 1998, 20 BCS IRGS C INF RE
   STEINBACH M, 2000, KDD WORKSH TEXT MIN
   VANHULLE, 2000, FAITHFULL REPRESENTA
   VANRIJSBERGEN CJ, 1973, J DOC, V29, P251
   VANRIJSBERGEN CJ, 1974, INFORM STORAGE RETR, V10, P1
   VANRIJSBERGEN CJ, 1975, INFORM PROCESS MANAG, V11, P171
   VANRIJSBERGEN CJ, 1979, INFORMATION RETRIEVA
   VOORHEES EM, 1986, INFORM PROCESS MANAG, V22, P465
   WANG Y, 2002, P 11 ACM IN C INF KN, P499
   WEISS S, 2000, RC21684 IBM RES
   WILLETT P, 1980, J INFORM SCI, V2, P223
   WILLETT P, 1981, INFORMATION PROCESSI, V17, P53
   WILLETT P, 1988, INFORMATION PROCESSI, V24, P577
   WISE JA, 1999, J AM SOC INFORM SCI, V50, P1224
   XU JX, 2000, ADV INFORMATION RETR, P151
   XU W, 2003, P 26 ANN INT ACM SIG, P267
   YU CT, 1974, J AM SOC INFORM SCI, V25, P218
   ZAMIR O, 1998, P 21 ANN INT ACM SIG, P45
   ZAMIR O, 1999, 8 INT WORLD WIDE WEB
   ZHAO Y, 2002, P INT C INF KNOWL MA, P515


