A Domain Term Extraction Method Based on Text Features and Compound Statistics
Domain Term Extraction Method Based on Hierarchical Combination Strategy for Chinese Web Documents
【Abstract】 Extracting Chinese domain terms is an important task in text knowledge mining. Traditional Chinese domain term extraction relies mainly on manual work, which is time-consuming and labor-intensive. The automatic extraction methods currently under study fall into three groups: dictionary-based, rule-based, and statistics-based. Because of the complexity of Chinese natural language, each of these has limitations, such as domain-specific user dictionaries and rules that are slow to update and insufficient consideration of text features, and extraction performance suffers as a result. To address this problem, this paper proposes a Chinese domain term extraction method based on text features and compound statistics. After a coarse-grained screening of the words in a Chinese document, the method takes into account the text features of each candidate term (part of speech, length, and boundary words), constructs statistics such as information entropy and TF-IDF, and computes a comprehensive weight. Candidate terms whose comprehensive weight exceeds a set threshold are extracted as the final domain terms. Experimental results show that the method achieves good precision, recall, and F-measure on the test corpus.
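The weighting-and-threshold step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: the smoothed TF-IDF form, the use of the smaller of the left and right boundary entropies, the linear mix controlled by a hypothetical `alpha`, and the function names themselves. The part-of-speech and length screening is omitted, and candidates are treated as single tokens of an already segmented document.

```python
import math
from collections import Counter


def entropy(counter):
    """Shannon entropy (base 2) of the distribution given by a Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def boundary_entropy(term, corpus):
    """Smaller of the left- and right-neighbour entropies of `term`.

    Assumption: a term that appears in varied contexts on both sides
    is more likely to be a free-standing unit.
    """
    left, right = Counter(), Counter()
    for doc in corpus:
        for i, tok in enumerate(doc):
            if tok == term:
                if i > 0:
                    left[doc[i - 1]] += 1
                if i < len(doc) - 1:
                    right[doc[i + 1]] += 1
    return min(entropy(left), entropy(right))


def tfidf(term, doc, corpus):
    """TF-IDF with a smoothed IDF (an assumed variant, not the paper's)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1
    return tf * idf


def extract_terms(candidates, doc, corpus, alpha=0.5, threshold=0.5):
    """Keep candidates whose combined weight exceeds `threshold`.

    `alpha` and `threshold` are hypothetical parameters; the linear
    combination is one possible way to form a composite weight.
    """
    selected = []
    for term in candidates:
        weight = (alpha * tfidf(term, doc, corpus)
                  + (1 - alpha) * boundary_entropy(term, corpus))
        if weight > threshold:
            selected.append(term)
    return selected


# Toy segmented corpus: each document is a list of tokens.
corpus = [["a", "x", "b", "x", "c"], ["d", "y", "e"]]
terms = extract_terms(["a", "x"], corpus[0], corpus)
```

In this toy corpus, "x" occurs twice with different neighbours on each side, so both its TF-IDF and its boundary entropy are higher than those of "a", and only "x" passes the threshold. In practice the two statistics live on different scales, so some normalization before mixing would likely be needed.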
【Key words】 Chinese domain term; text mining; natural language processing; text feature
- 【Source】 Journal of Northwestern Polytechnical University (西北工业大学学报), 2017, Issue 04
- 【Classification No.】 TP391.1
- 【Times cited】 13
- 【Downloads】 217