Efficient k-means++ approximation with MapReduce
Xu, Yujie; Qu, Wenyu; Li, Zhiyang; et al.Min, Geyong; Li, Keqiu; Liu, Zhaobin
Date: 10 November 2014
Article
Journal
IEEE Transactions on Parallel and Distributed Systems
Publisher
Institute of Electrical and Electronics Engineers (IEEE)
Publisher DOI
Abstract
k-means is undoubtedly one of the most popular clustering algorithms owing to its simplicity and efficiency. However, this algorithm is highly sensitive to the chosen initial centers and thus a proper initialization is crucial for obtaining an ideal solution. To address this problem, k-means++ is proposed to sequentially choose the ...
k-means is undoubtedly one of the most popular clustering algorithms owing to its simplicity and efficiency. However, this algorithm is highly sensitive to the chosen initial centers and thus a proper initialization is crucial for obtaining an ideal solution. To address this problem, k-means++ is proposed to sequentially choose the centers so as to achieve a solution that is provably close to the optimal one. However, due to its weak scalability, k-means++ becomes inefficient as the size of data increases. To improve its scalability and efficiency, this paper presents MapReduce k-means++ method which can drastically reduce the number of MapReduce jobs by using only one MapReduce job to obtain k centers. The k-means++ initialization algorithm is executed in the Mapper phase and the weighted k-means++ initialization algorithm is run in the Reducer phase. As this new MapReduce k-means++ method replaces the iterations among multiple machines with a single machine, it can reduce the communication and I/O costs significantly. We also prove that the proposed MapReduce k-means++ method obtains O(α2) approximation to the optimal solution of k-means. To reduce the expensive distance computation of the proposed method, we further propose a pruning strategy that can greatly avoid a large number of redundant distance computations. Extensive experiments on real and synthetic data are conducted and the performance results indicate that the proposed MapReduce k-means++ method is much more efficient and can achieve a good approximation.
Computer Science
Faculty of Environment, Science and Economy
Item views 0
Full item downloads 0