Efficient k-means++ approximation with MapReduce

Xu, Yujie; Qu, Wenyu; Li, Zhiyang; Min, Geyong; Li, Keqiu; Liu, Zhaobin

dc.contributor.author	Xu, Yujie
dc.contributor.author	Qu, Wenyu
dc.contributor.author	Li, Zhiyang
dc.contributor.author	Min, Geyong
dc.contributor.author	Li, Keqiu
dc.contributor.author	Liu, Zhaobin
dc.date.accessioned	2016-03-07T09:42:51Z
dc.date.accessioned	2016-03-22T16:54:29Z
dc.date.issued	2014-11-10
dc.description.abstract	k-means is undoubtedly one of the most popular clustering algorithms owing to its simplicity and efficiency. However, this algorithm is highly sensitive to the chosen initial centers and thus a proper initialization is crucial for obtaining an ideal solution. To address this problem, k-means++ is proposed to sequentially choose the centers so as to achieve a solution that is provably close to the optimal one. However, due to its weak scalability, k-means++ becomes inefficient as the size of data increases. To improve its scalability and efficiency, this paper presents MapReduce k-means++ method which can drastically reduce the number of MapReduce jobs by using only one MapReduce job to obtain k centers. The k-means++ initialization algorithm is executed in the Mapper phase and the weighted k-means++ initialization algorithm is run in the Reducer phase. As this new MapReduce k-means++ method replaces the iterations among multiple machines with a single machine, it can reduce the communication and I/O costs significantly. We also prove that the proposed MapReduce k-means++ method obtains O(α2) approximation to the optimal solution of k-means. To reduce the expensive distance computation of the proposed method, we further propose a pruning strategy that can greatly avoid a large number of redundant distance computations. Extensive experiments on real and synthetic data are conducted and the performance results indicate that the proposed MapReduce k-means++ method is much more efficient and can achieve a good approximation.	en_GB
dc.description.sponsorship	This work was supported by the National Science Foundation for Distinguished Young Scholars of China under Grant No. of 61225010, National Nature Science Foundation of China (Nos. 61173162, 61173165, 61370199, 61300187, 61300189, and 61370198), New Century Excellent Talents (No. NCET-10-0095), the Fundamental Research Funds for the Central Universities (Nos. 2013QN044 and 2012TD008).	en_GB
dc.identifier.citation	Vol. 25, iss. 12, pp. 3135 - 3144	en_GB
dc.identifier.doi	10.1109/TPDS.2014.2306193
dc.identifier.uri	http://hdl.handle.net/10871/20806
dc.language.iso	en	en_GB
dc.publisher	Institute of Electrical and Electronics Engineers (IEEE)	en_GB
dc.relation.replaces	http://hdl.handle.net/10871/20517	en_GB
dc.relation.replaces	10871/20517	en_GB
dc.rights	This is the author accepted manuscript. The final version is available from Institute of Electrical and Electronics Engineers (IEEE) via the DOI in this record.	en_GB
dc.subject	scalability	en_GB
dc.subject	k-means++	en_GB
dc.subject	k-means	en_GB
dc.subject	approximation	en_GB
dc.subject	MapReduce	en_GB
dc.subject	Clustering algorithms	en_GB
dc.title	Efficient k-means++ approximation with MapReduce	en_GB
dc.type	Article	en_GB
dc.date.available	2016-03-07T09:42:51Z
dc.date.available	2016-03-22T16:54:29Z
dc.identifier.issn	1045-9219
dc.description	Published	en_GB
dc.description	Journal Article	en_GB
dc.identifier.journal	IEEE Transactions on Parallel and Distributed Systems	en_GB

Files in this item

Name:: Paper2014_9 Efficient means ...
Size:: 3.209Mb
Format:: PDF

View/Open

This item appears in the following Collection(s)

Computer Science

Show simple item record

Show Statistical Information