Spark parallel optimization large-scale spectral clustering

doi:10.13474/j.cnki.11-2246.2019.0394

Abstract

Existing spectral clustering algorithms incur high computation time and heavy resource usage on large-scale datasets. To address this, an improved parallel optimization algorithm for spectral clustering based on Spark is proposed. Repeated computation of the same data in the similarity matrix is avoided by parallel one-way iteration; resource usage is reduced by parallel position transformation, scalar multiplication replacement, and distance scaling; and the amount of computation is further reduced by using approximate eigenvectors. The experimental results verify the effectiveness of the approximate eigenvectors and show good clustering performance and scalability on large-scale datasets.
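The abstract does not spell out the one-way iteration, so the following is only a minimal single-machine sketch of one plausible reading: because the similarity matrix is symmetric, each pair of points is evaluated once (only for j > i) and the value is mirrored. The Gaussian kernel, the `sigma` parameter, the zero diagonal, and the function name `gaussian_similarity` are assumptions made for illustration and are not taken from the paper; the Spark parallelization itself is not shown.

```python
import math


def gaussian_similarity(points, sigma=1.0):
    """Build a symmetric similarity matrix, evaluating each pair only once.

    Assumption: a Gaussian kernel exp(-d^2 / (2 * sigma^2)); the paper's
    actual similarity function is not given in the abstract.
    """
    n = len(points)
    # Diagonal is left at 0.0 (no self-similarity), an illustrative choice.
    sim = [[0.0] * n for _ in range(n)]
    pair_evaluations = 0
    for i in range(n):
        # One-way iteration: visit only pairs with j > i.
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            w = math.exp(-sq_dist / (2.0 * sigma ** 2))
            sim[i][j] = w
            sim[j][i] = w  # mirror the value instead of recomputing it
            pair_evaluations += 1
    return sim, pair_evaluations


# Hypothetical sample data, not from the paper.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
sim, evaluations = gaussian_similarity(points)
```

For n points this evaluates n(n-1)/2 kernel values rather than n², which is the kind of saving that avoiding repeated calculation in the similarity matrix refers to.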

Key words: large-scale spectral clustering, approximate eigenvector, Spark parallel computing, K-means distance calculation, optimization

CLC Number: P208

LÜ Honglin, YIN Qingshan. Spark parallel optimization large-scale spectral clustering[J]. Bulletin of Surveying and Mapping, 2019(12): 96-100.
