Bulletin of Surveying and Mapping (测绘通报) ›› 2022, Vol. 0 ›› Issue (3): 101-106. doi: 10.13474/j.cnki.11-2246.2022.0085

• Academic Research •

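A Chinese address matching method based on a pseudo-semantic similarity model

YU Ting (郁汀)1,2, WANG Duo (王铎)2, CHEN Qin (陈钦)1

  1. The Third Research Institute of the Ministry of Public Security, Shanghai 200031;
  2. Fudan University, Shanghai 200433
  • Received: 2021-03-16 Revised: 2022-01-21 Online: 2022-03-25 Published: 2022-04-01
  • About the author: YU Ting (1990-), male, Ph.D., assistant researcher; research interest: address matching. E-mail: rainydaily0@163.com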

A Chinese addresses matching method based on the pseudo-semantic model

YU Ting1,2, WANG Duo2, CHEN Qin1   

  1. The Third Research Institute of the Ministry of Public Security, Shanghai 200031, China;
  2. Fudan University, Shanghai 200433, China
  • Received: 2021-03-16 Revised: 2022-01-21 Online: 2022-03-25 Published: 2022-04-01

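Abstract (Chinese): In address matching, traditional similarity models are strongly affected by the number of overlapping characters, so mismatches are common when address element units are written in abbreviated or shortened form. Deep learning methods need a large number of samples, but the sheer volume and varied forms of address data make generating those samples too costly. To address these problems, this paper first applies a model based on a conditional random field and a bidirectional long short-term memory network (BiLSTM-CRF) to segment addresses, and then builds a pseudo-semantic similarity to match address elements level by level. In tests on address data from public security operations, the model handles non-standard address descriptions such as abbreviations and shortened forms well, with every reference metric above 0.9.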

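Keywords (Chinese): conditional random field and bidirectional long short-term memory network (BiLSTM-CRF); address element parsing; pseudo-semantic similarity; address matching; address standardization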

Abstract: Because address elements can be written in many ways, including abbreviations and shortened forms, address matching is a difficult task, especially for Chinese addresses. One important family of address matching methods relies on similarity measures. However, traditional similarity measures focus on overlapping characters and cannot handle abbreviated or shortened elements. Another useful family is based on deep learning, but it is difficult to generate a large number of training samples. In this paper, a bidirectional long short-term memory network with a conditional random field (BiLSTM-CRF) is applied to segment Chinese addresses. A new similarity measure, called pseudo-semantic similarity, is then constructed to handle abbreviated and shortened address elements. In the current results, the pseudo-semantic similarity performs better than other similarity models in the matching process, and its recall and precision both reach 0.9 on the test set. The test samples show that the pseudo-semantic similarity can recognize abbreviated and shortened forms of address elements.

Key words: BiLSTM-CRF; resolution of address elements; pseudo-semantic model; address matching; address standardization
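
The failure mode the abstract attributes to character-overlap similarity can be illustrated with a small sketch. This is not the paper's pseudo-semantic similarity, whose definition is not given on this page. The function names, the hospital names, and the ordered-subsequence check are all assumptions chosen for illustration only. The sketch shows that a character-set Jaccard score can rank a different place above a valid abbreviation of the same place, while a simple order-preserving check still recognizes the abbreviation.

```python
def char_jaccard(a: str, b: str) -> float:
    """Character-set Jaccard similarity: |A & B| / |A | B|."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def is_ordered_abbreviation(short: str, full: str) -> bool:
    """True if every character of `short` appears in `full` in the same order.

    A hypothetical stand-in for abbreviation handling, NOT the paper's model.
    """
    remaining = iter(full)
    # `ch in remaining` advances the iterator, so order is enforced.
    return all(ch in remaining for ch in short)


# Hypothetical address elements (not taken from the paper's data set).
full_name = "上海市第一人民医院"   # standard form
abbreviation = "市一医院"          # shortened form of the same place
other_place = "上海市第六人民医院"  # a different place, one character apart

score_abbr = char_jaccard(full_name, abbreviation)
score_other = char_jaccard(full_name, other_place)

# Overlap alone prefers the wrong candidate.
print(score_abbr < score_other)
# The order-preserving check accepts the abbreviation and rejects the other place.
print(is_ordered_abbreviation(abbreviation, full_name))
print(is_ordered_abbreviation(other_place, full_name))
```

A subsequence check on its own would accept many false abbreviations in real data, so it is only meant to make the abstract's motivation concrete, not to serve as a matching method.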

CLC number: