Bulletin of Surveying and Mapping ›› 2019, Vol. 0 ›› Issue (12): 96-100. doi: 10.13474/j.cnki.11-2246.2019.0394

• Academic Research •

Spark parallel optimization large-scale spectral clustering

LÜ Honglin1, YIN Qingshan1,2

  1. Liaoning University of International Business and Economics, Dalian 116052, China;
  2. College of Mining Engineering, Jilin University, Changchun 130000, China
  • Received: 2019-06-24 Revised: 2019-10-30 Published: 2020-01-03
  • About the author: LÜ Honglin (1967-), male, master's degree, professor; research interests: computer technology and applications. E-mail: lu_ljuan@163.com
  • Supported by:
    Doctoral Research Start-up Fund of Liaoning University of International Business and Economics (2019XJLXBSJJ002); Scientific Research Project of the Education Department of Liaoning Province (ldxy2017008)

Abstract: Existing parallel spectral clustering algorithms for large-scale datasets are slow and consume large amounts of resources. To address this, an improved, parallel-optimized spectral clustering algorithm is proposed on top of Spark, a parallel framework that supports both batch processing and graph computation. The algorithm uses parallel one-way iteration to avoid repeated computation of data when building the similarity matrix. It reduces resource usage through parallel position transformation, scalar-multiplication replacement, and distance scaling, and it further reduces the amount of computation by substituting approximate eigenvectors. The experimental results verify the effectiveness of the approximate eigenvectors and show good clustering performance and scalability on large-scale datasets.

Key words: large-scale spectral clustering, approximate eigenvector, Spark parallel computing, K-means distance calculation, optimization
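
The abstract does not spell out how the one-way iteration works. One plausible reading, assumed here, is that it exploits the symmetry of the similarity matrix: each unordered pair of points is evaluated once, and the result is mirrored. The sketch below shows that idea on a single machine in plain Python. The Gaussian kernel, the zero diagonal, the parameter `sigma`, and the function names `gaussian_similarity` and `similarity_matrix_one_way` are illustrative assumptions, not taken from the paper, and the sketch does not use Spark.

```python
import math


def gaussian_similarity(x, y, sigma=1.0):
    """Gaussian (RBF) similarity between two points (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def similarity_matrix_one_way(points, sigma=1.0):
    """Build a symmetric similarity matrix, evaluating each pair only once.

    Returns the matrix and the number of kernel evaluations performed.
    The diagonal is left at 0.0, a common convention in spectral clustering.
    """
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    evaluations = 0
    for i in range(n):
        # Only visit j > i; the entry (j, i) is filled by symmetry.
        for j in range(i + 1, n):
            s = gaussian_similarity(points[i], points[j], sigma)
            w[i][j] = s
            w[j][i] = s
            evaluations += 1
    return w, evaluations


# Two tight groups of points, far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
W, evals = similarity_matrix_one_way(points)
print(evals)
```

For these four points the loop performs 6 kernel evaluations, one per unordered pair, rather than the 16 a full double loop would need. In a distributed setting the outer loop over `i` would presumably be partitioned across workers, but how the paper does this is not stated in the abstract.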

CLC number: