Bulletin of Surveying and Mapping ›› 2019, Vol. 0 ›› Issue (12): 96-100. doi: 10.13474/j.cnki.11-2246.2019.0394


Spark-based parallel optimization of large-scale spectral clustering

LÜ Honglin1, YIN Qingshan1,2

  1. Liaoning University of International Business and Economics, Dalian 116052, China;
  2. College of Mining Engineering, Jilin University, Changchun 130000, China
  • Received: 2019-06-24; Revised: 2019-10-30; Published: 2020-01-03

Abstract: To address the long computation time and heavy resource consumption that existing spectral clustering algorithms incur on large-scale datasets, an improved parallel spectral clustering algorithm based on Spark is proposed. Parallel one-way iteration avoids repeated computation when the similarity matrix is built; parallel position transformation, scalar multiplication replacement, and distance scaling reduce resource occupancy; and approximate eigenvectors further reduce the amount of computation. The experimental results verify the effectiveness of the approximate eigenvectors and show good clustering performance and scalability on large-scale datasets.

Key words: large-scale spectral clustering, approximate eigenvector, Spark parallel computing, K-means distance calculation, optimization
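
The abstract does not spell out how one-way iteration is carried out, so the following is only a minimal single-machine sketch of the general idea it names: because a similarity matrix is symmetric, each pair of points needs to be evaluated once and the result can be mirrored. The Gaussian kernel, the `sigma` parameter, the zero diagonal, and the function names are all assumptions made for illustration; they are not taken from the paper, and the paper's Spark implementation may differ.

```python
import math
from itertools import combinations


def gaussian_similarity(x, y, sigma=1.0):
    """Gaussian (RBF) similarity between two points (assumed kernel)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def similarity_matrix_one_way(points, sigma=1.0):
    """Build a symmetric similarity matrix, evaluating each pair once.

    Returns the matrix and the number of kernel evaluations performed.
    The diagonal is left at 0.0 (an assumption; some formulations use 1.0).
    """
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    evaluations = 0
    # Visit only pairs with i < j, then mirror the value into (j, i).
    for i, j in combinations(range(n), 2):
        s = gaussian_similarity(points[i], points[j], sigma)
        w[i][j] = s
        w[j][i] = s
        evaluations += 1
    return w, evaluations


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
    matrix, count = similarity_matrix_one_way(pts)
    print(count)
    for row in matrix:
        print(row)
```

For n points this performs n(n-1)/2 kernel evaluations instead of the n² needed when every entry is computed independently.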
