A Chinese addresses matching method based on the pseudo-semantic model

doi:10.13474/j.cnki.11-2246.2022.0085

Abstract

Address elements can be written in many ways, including abbreviations and logograms, which makes address matching difficult, especially for Chinese addresses. One important family of address matching methods relies on similarity measures. However, traditional similarity measures focus on overlapping characters and cannot handle abbreviated or logographic variants of an address element. The other major family is based on deep learning, but it is difficult to generate the large number of training samples that such methods require. In this paper, a bidirectional long short-term memory network with a conditional random field (BiLSTM-CRF) is applied to Chinese address segmentation. A new similarity measure, named pseudo-semantic similarity, is then constructed to address the problem of abbreviations and logograms. According to the current results, pseudo-semantic similarity outperforms other similarity models in the matching process, with recall and precision both reaching 0.9 on the test set. The sample cases show that pseudo-semantic similarity can recognize abbreviations and logograms of address elements.
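The limitation of overlap-based similarity noted above can be illustrated with a minimal sketch. It assumes a character-level Jaccard measure, chosen here only for illustration; the abstract does not say which overlap measure the traditional methods use, and the function name `char_jaccard` and the example strings are hypothetical.

```python
def char_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the sets of characters in two strings.

    Illustrative stand-in for a character-overlap similarity;
    not the measure used in the paper.
    """
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical address elements, for illustration only.
full_name = "北京大学"     # full form
abbreviation = "北大"      # common abbreviation of the same element
different = "北京大道"     # a different element sharing most characters

same_score = char_jaccard(full_name, abbreviation)
other_score = char_jaccard(full_name, different)

print(f"full vs abbreviation: {same_score:.2f}")
print(f"full vs different:    {other_score:.2f}")
```

In this sketch the abbreviation scores lower against the full form than an unrelated element does, which is the kind of failure a similarity measure aware of abbreviations would need to avoid.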

Key words: BiLSTM-CRF; resolution of address elements; pseudo-semantic model; address matching; address standardization

CLC Number: P208

YU Ting, WANG Duo, CHEN Qin. A Chinese addresses matching method based on the pseudo-semantic model[J]. Bulletin of Surveying and Mapping, 2022(3): 101-106.
