He X, Zha HY, Ding CHQ, Simon HD "Web document clustering using hyperlink structures" COMPUTATIONAL STATISTICS & DATA ANALYSIS 41 (1): 19-45 NOV 28 2002

Eugene Garfield garfield at CODEX.CIS.UPENN.EDU
Wed Dec 18 16:11:46 EST 2002


Xiaopeng HE : {xhe,zha}@cse.psu.edu


Title     Web document clustering using hyperlink structures
Author    He X, Zha HY, Ding CHQ, Simon HD
Journal   COMPUTATIONAL STATISTICS & DATA ANALYSIS 41 (1): 19-45 NOV 28 2002

 Document type: Article    Language : English
 Cited References : 35     Times Cited: 0


Abstract:
With the exponential growth of information on the World Wide Web, there is
great demand for developing efficient methods for effectively organizing the
large amount of retrieved information. Document clustering plays an
important role in information retrieval and taxonomy management for the Web.
In this paper we examine three clustering methods: K-means, multi-level
METIS, and the recently developed normalized-cut method using a new approach
of combining textual information, hyperlink structure and co-citation
relations into a single similarity metric. We found the normalized-cut
method with the new similarity metric is particularly effective, as
demonstrated on three datasets of web query results. We also explore some
theoretical connections between the normalized-cut method and the K-means
method. (C) 2002 Elsevier Science B.V. All rights reserved.
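
A minimal sketch of the idea summarized in the abstract: blend text, hyperlink
and co-citation scores into one similarity matrix, then split the documents in
two with a normalized-cut style spectral step. The abstract does not give the
paper's actual metric, so the linear mix, the weights, the function names
(combine_similarity, normalized_cut_bipartition) and the toy data below are all
assumptions made for illustration, not the authors' formulation.

```python
import math

def combine_similarity(text_sim, link, cocite, weights=(0.5, 0.3, 0.2)):
    """Blend three symmetric n x n score matrices into one similarity matrix.
    ASSUMPTION: a plain weighted sum; the paper's real metric may differ."""
    a, b, c = weights
    n = len(text_sim)
    return [[a * text_sim[i][j] + b * link[i][j] + c * cocite[i][j]
             for j in range(n)] for i in range(n)]

def normalized_cut_bipartition(W, iters=200):
    """Two-way split from the second eigenvector of D^-1/2 W D^-1/2,
    found by power iteration. Requires every row sum of W to be positive."""
    n = len(W)
    d = [sum(row) for row in W]                # weighted degrees
    s = [math.sqrt(x) for x in d]
    # Shift to (I + D^-1/2 W D^-1/2) / 2 so all eigenvalues lie in [0, 1].
    M = [[0.5 * ((1.0 if i == j else 0.0) + W[i][j] / (s[i] * s[j]))
          for j in range(n)] for i in range(n)]
    total = math.sqrt(sum(d))
    v1 = [x / total for x in s]                # known top eigenvector
    v = [(-1.0) ** i * (i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(iters):
        dot = sum(p * q for p, q in zip(v, v1))
        v = [p - dot * q for p, q in zip(v, v1)]   # remove the top eigenvector
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(p * p for p in v))
        v = [p / length for p in v]
    z = [v[i] / s[i] for i in range(n)]        # undo the D^1/2 scaling
    return [1 if x >= 0 else 0 for x in z]     # the sign decides the cluster

def same_topic(i, j):
    """Toy ground truth: documents 0-2 share one topic, 3-5 another."""
    return (i < 3) == (j < 3)

n = 6
text_sim = [[1.0 if i == j else (0.9 if same_topic(i, j) else 0.1)
             for j in range(n)] for i in range(n)]
link = [[1.0 if i != j and same_topic(i, j) else 0.0
         for j in range(n)] for i in range(n)]
cocite = [row[:] for row in link]              # toy co-citation scores

W = combine_similarity(text_sim, link, cocite)
labels = normalized_cut_bipartition(W)
print(labels)
```

On this toy input the first three documents land in one cluster and the last
three in the other. Which cluster is called 0 or 1 is arbitrary.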

Author Keywords:
World Wide Web, graph partitioning, Cheeger constant, clustering method,
K-means method, normalized cut method, eigenvalue decomposition,
link structure, similarity metric

KeyWords Plus:
GRAPHS, EIGENVECTORS, ALGORITHM, MATRICES

Addresses:
Penn State Univ, Dept Comp Sci & Engn, University Pk, PA 16802 USA
Univ Calif Berkeley, Lawrence Berkeley Lab, NERSC Div, Berkeley, CA 94720 USA

Publisher:
ELSEVIER SCIENCE BV, AMSTERDAM

IDS Number:
615NV

ISSN:
0167-9473


 Cited Author            Cited Work                Volume      Page   Year

 ANICK PG              P 7 INT ACM SIGIR C                    349      1994
 BHARAT K              P 7 INT WORLD WID WE                   379      1998
 CHAKRABARTI S         COMPUT NETWORKS ISDN          30        65      1998
 CHAKRABARTI S         COMPUTER                      32        60      1999
 CHEEGER J             LOWER BOUND SMALLEST                            1970
 CHUNG FRK             SPECTRAL GRAPH THEOR                            1997
 CROFT WB              PROVIDING GOVT INFOR                    95
 DONATH W              IBM TECHNICAL DISCLO          15       938      1972
 EFTHIMIADIS EN        P 16 INT C ASS COMP                    146      1993
 EVERITT B             CLUSTER ANAL                                    1993
 FIEDLER M             CZECH MATH J                  25       619      1975
 FIEDLER M             CZECH MATH J                  23       298      1973
 FLAKE GW              EFFICIENT IDENTIFICA                   150      2000
 FRIEZE A              FAST MONTE CARLOL ME                            2000
 GIBSON D              P 9 ACM C HYP HYP                      225      1998
 GOLUB G               MATRIX COMPUTATIONS                             1989
 GORDON AD             CLASSIFICATION                                  1981
 HEARST MA             P SIGIR 96                             246      1996
 HENDRICKSON B         SIAM J SCI COMPUT             16       452      1995
 KARYPIS G             METISASTERIX SOFTWAR
 KLEINBERG J           P ACM SIAM S DISCR A                   668      1998
 KLEINBERG JM          P 5 ANN INT COMP COM                    26      1999
 KUMAR R               P 25 INT C VER LARG                    639      1999
 LARSON R              P 59 ANN M AM SOC IN                    71      1996
 LI YH                 IEEE INTERNET COMPUT           2        24      1998
 MOHAR B               DISCRETE MATH                109       171      1992
 PIROLLI P             P ACM C HUM FACT COM                   118      1996
 PORTER MF             PROGRAM                       14       130      1980
 POTHEN A              SIAM J MATRIX ANAL A          11       430      1990
 RIJSBERGERN CJV       INFORMATION RETRIEVA                            1979
 SHI JB                PROC CVPR IEEE                         731      1997
 SMALL H               J AM SOC INFORM SCI           24       265      1973
 SPIELMAN DA           IN PRESS P 37 ANN IE                    96      1996
 WILLETT P             INFORMATION PROCESSI          24       577      1988
 ZAMIR O               GROUPER DYNAMIC CLUS                            1999


