
K-groups: A Generalization of K-means by Energy Distance

by Songzi Li




Institution: Bowling Green State University
Department: Statistics
Degree: PhD
Year: 2015
Keywords: Statistics; K-groups; K-means; Clustering analysis
Record ID: 2060467
Full text PDF: http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1428583805


Abstract

We propose two distribution-based clustering algorithms called K-groups. Our algorithms group observations into one cluster if they are from a common distribution. Energy distance is a non-negative measure of the distance between distributions that is based on Euclidean distances between random observations; it is zero if and only if the distributions are identical. We use energy distance to measure the statistical distance between two clusters, and search for the partition that maximizes the total between-cluster energy distance. To implement our algorithms, we apply a version of Hartigan and Wong's idea of moving one point at a time, and generalize this idea to moving any m points. We also prove that K-groups is a generalization of the K-means algorithm: K-means is a limiting case of K-groups, and in that case the two share a common objective function and updating formula. K-means is one of the best-known clustering algorithms. From previous research, it is known that K-means has several disadvantages. K-means performs poorly when clusters are skewed or overlapping. K-means cannot handle categorical data, because the mean is not a good estimate of the center. K-means cannot be applied when the dimension exceeds the sample size. Our K-groups methods provide a practical and effective solution to these problems. Simulation studies on the performance of clustering algorithms for univariate and multivariate mixture distributions are presented. Four validation indices (diagonal, Kappa, Rand and corrected Rand) are reported for each example in the simulation study. Results of the empirical studies show that both K-groups algorithms perform as well as K-means when clusters are well-separated and spherically shaped, but K-groups algorithms perform better than K-means when clusters are skewed or overlapping. K-groups algorithms are more robust than K-means with respect to outliers.
Results are presented for three multivariate data sets: wine cultivars, dermatology diseases and oncology cases. In our real data examples, the performance of both K-groups algorithms is better than the performance of K-means in each case.
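
To make the notion of energy distance between two clusters concrete, the following is a minimal sketch of the standard unweighted two-sample energy statistic, 2 E|X - Y| - E|X - X'| - E|Y - Y'|, with each expectation replaced by an average of pairwise Euclidean distances. It is not the thesis implementation: the function names `mean_distance` and `energy_distance` are hypothetical, and the exact weighting and the point-moving updates used by K-groups are not given in this abstract, so they are not reproduced here.

```python
import math


def mean_distance(a, b):
    """Average Euclidean distance over all pairs (p, q), p in a and q in b."""
    total = sum(math.dist(p, q) for p in a for q in b)
    return total / (len(a) * len(b))


def energy_distance(x, y):
    """Unweighted sample energy distance between two clusters of points.

    Each cluster is a non-empty list of equal-length coordinate tuples.
    Assumption: this is the plain V-statistic form; any sample-size
    weighting used in the thesis is not applied here.
    """
    return 2.0 * mean_distance(x, y) - mean_distance(x, x) - mean_distance(y, y)


# Two clusters with identical points have energy distance zero.
same = [(0.0,), (1.0,)]
print(energy_distance(same, same))

# A cluster far from another has a strictly positive energy distance.
left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 4.0)]
print(energy_distance(left, right))
```

In a clustering context, a statistic of this kind would be evaluated between each pair of candidate clusters, and a partition would be preferred when the total between-cluster value is larger.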