Show simple item record

dc.contributor.authorXu, Yujie
dc.contributor.authorQu, Wenyu
dc.contributor.authorLi, Zhiyang
dc.contributor.authorMin, Geyong
dc.contributor.authorLi, Keqiu
dc.contributor.authorLiu, Zhaobin
dc.date.accessioned2016-03-07T09:42:51Z
dc.date.accessioned2016-03-22T16:54:29Z
dc.date.issued2014-11-10
dc.description.abstractk-means is undoubtedly one of the most popular clustering algorithms owing to its simplicity and efficiency. However, this algorithm is highly sensitive to the chosen initial centers and thus a proper initialization is crucial for obtaining an ideal solution. To address this problem, k-means++ is proposed to sequentially choose the centers so as to achieve a solution that is provably close to the optimal one. However, due to its weak scalability, k-means++ becomes inefficient as the size of data increases. To improve its scalability and efficiency, this paper presents MapReduce k-means++ method which can drastically reduce the number of MapReduce jobs by using only one MapReduce job to obtain k centers. The k-means++ initialization algorithm is executed in the Mapper phase and the weighted k-means++ initialization algorithm is run in the Reducer phase. As this new MapReduce k-means++ method replaces the iterations among multiple machines with a single machine, it can reduce the communication and I/O costs significantly. We also prove that the proposed MapReduce k-means++ method obtains O(α2) approximation to the optimal solution of k-means. To reduce the expensive distance computation of the proposed method, we further propose a pruning strategy that can greatly avoid a large number of redundant distance computations. Extensive experiments on real and synthetic data are conducted and the performance results indicate that the proposed MapReduce k-means++ method is much more efficient and can achieve a good approximation.en_GB
dc.description.sponsorshipThis work was supported by the National Science Foundation for Distinguished Young Scholars of China under Grant No. of 61225010, National Nature Science Foundation of China (Nos. 61173162, 61173165, 61370199, 61300187, 61300189, and 61370198), New Century Excellent Talents (No. NCET-10-0095), the Fundamental Research Funds for the Central Universities (Nos. 2013QN044 and 2012TD008).en_GB
dc.identifier.citationVol. 25, iss. 12, pp. 3135 - 3144en_GB
dc.identifier.doi10.1109/TPDS.2014.2306193
dc.identifier.urihttp://hdl.handle.net/10871/20806
dc.language.isoenen_GB
dc.publisherInstitute of Electrical and Electronics Engineers (IEEE)en_GB
dc.relation.replaceshttp://hdl.handle.net/10871/20517en_GB
dc.relation.replaces10871/20517en_GB
dc.rightsThis is the author accepted manuscript. The final version is available from Institute of Electrical and Electronics Engineers (IEEE) via the DOI in this record.en_GB
dc.subjectscalabilityen_GB
dc.subjectk-means++en_GB
dc.subjectk-meansen_GB
dc.subjectapproximationen_GB
dc.subjectMapReduceen_GB
dc.subjectClustering algorithmsen_GB
dc.titleEfficient k-means++ approximation with MapReduceen_GB
dc.typeArticleen_GB
dc.date.available2016-03-07T09:42:51Z
dc.date.available2016-03-22T16:54:29Z
dc.identifier.issn1045-9219
dc.descriptionPublisheden_GB
dc.descriptionJournal Articleen_GB
dc.identifier.journalIEEE Transactions on Parallel and Distributed Systemsen_GB


Files in this item

This item appears in the following Collection(s)

Show simple item record