面向技术问题的中文专利聚类分析

Chinese Patent Clustering Analysis for Technical Problems

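【Author】 Li Wei (李伟)

【Advisor】 Ma Jianhong (马建红)

【Author Information】 Hebei University of Technology, Master of Engineering (professional degree), 2018, Master's thesis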

【摘要】 对专利信息进行抽取并构建相应的模型,有助于把握相关技术知识,从而推动相应领域产品的创新发展。而目前对中文专利信息的分析研究主要集中在专利标题、分类号、摘要等内容,缺乏对专利主体内容的分析,导致专利分析有一定的局限性。本文扩展了专利分析领域,面向技术问题进行中文专利聚类分析,可以更为清晰地了解专利创新过程中面临的问题与解决的问题,从而为新的创新提供思路,推动创新的进程,并进一步提高后续专利分析的效率。本文将专利背景技术文本作为技术问题抽取对象,通过总结句式特点及构建问题词典匹配获取问题描述句,并借鉴观点信息抽取方法实现中文专利技术问题抽取,将技术问题的抽取分为抽取问题词、问题对象、问题单元三个过程。先利用关联规则与条件随机场模型迭代抽取构建问题词集,再联合无监督方法与有监督方法,将无监督方法的结果作为有监督方法的输入抽取问题对象,最后将问题单元的抽取视为序列标注问题,以抽取的问题词、问题对象为特征,建立多特征模板,抽取获得扩展的问题单元四元组。并以抽取的中文专利技术问题为语料,改进传统的K-means文本聚类算法,提出一种基于相似中心的文本聚类算法——cK-means,面向技术问题对中文专利进行聚类分析,聚类构建的专利问题模型为今后专利推荐系统的设计与实现奠定了基础。实验证明本文的技术问题抽取方法减少了人工标注量并取得了可观的效果,利用多特征模板抽取问题单元的F1值达到了81.76%,而且本文构建的专利问题模型比传统构建模型的稳定性与准确性均有所提升,取得的F1值与准确率均提升约10%,所以本文方法对专利技术问题的分析具有一定的研究意义。

【Abstract】 Extracting patent information and building corresponding models helps capture the relevant technical knowledge and thereby promotes product innovation in the corresponding fields. Current research on Chinese patent information focuses mainly on titles, classification numbers, and abstracts. It lacks analysis of the main body of the patent, which limits patent analysis. This thesis extends patent analysis by clustering Chinese patents according to the technical problems they address. This gives a clearer view of the problems faced and solved during patent innovation, which provides ideas for new innovations, advances the innovation process, and improves the efficiency of subsequent patent analysis. The background-art text of each patent serves as the source for technical problem extraction. Problem description sentences are obtained by summarizing sentence-pattern characteristics and by matching against a constructed problem dictionary. Drawing on opinion information extraction methods, technical problem extraction is divided into three steps: extracting problem words, problem objects, and problem units. First, association rules and a conditional random field (CRF) model are applied iteratively to build the problem word set. Next, unsupervised and supervised methods are combined to extract problem objects, with the output of the unsupervised method used as input to the supervised method. Finally, problem unit extraction is treated as a sequence labeling task: with the extracted problem words and problem objects as features, multi-feature templates are built and extended problem-unit quadruples are extracted. Using the extracted technical problems as the corpus, the thesis improves the traditional K-means text clustering algorithm and proposes cK-means, a text clustering algorithm based on similar centers, to cluster Chinese patents by technical problem. The resulting patent problem model lays a foundation for the design and implementation of future patent recommendation systems. Experiments show that the extraction method reduces the amount of manual annotation and achieves good results: problem unit extraction with multi-feature templates reaches an F1 of 81.76%. The patent problem model is also more stable and more accurate than the traditionally built model, with F1 and accuracy both improving by about 10%. The method therefore has research value for the analysis of patent technical problems.
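The abstract does not spell out how cK-means selects its similar centers, so the following is only a minimal sketch of the traditional K-means text clustering baseline that the thesis sets out to improve, not the author's algorithm. The problem phrases, the seed indices, and every function name are hypothetical and chosen for illustration. The phrases are whitespace-tokenized English, whereas real Chinese problem text would first need a word segmenter.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a whitespace-tokenized phrase."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Centroid: per-term average over the vectors of one cluster."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()}


def kmeans(texts, seeds, max_iter=20):
    """Plain K-means with cosine similarity.

    `seeds` holds the indices of the texts used as initial centers
    (a hypothetical, fixed initialization so the run is deterministic).
    """
    vectors = [vectorize(t) for t in texts]
    centers = [dict(vectors[i]) for i in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each phrase joins its most similar center.
        new_labels = [
            max(range(len(centers)), key=lambda k: cosine(v, centers[k]))
            for v in vectors
        ]
        if new_labels == labels:  # converged: assignments unchanged
            break
        labels = new_labels
        # Update step: recompute each non-empty cluster's centroid.
        for k in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = mean_vector(members)
    return labels


# Hypothetical technical-problem phrases, not taken from the thesis corpus.
problems = [
    "battery overheating during charging",
    "battery capacity low",
    "battery overheating risk",
    "gear noise high",
    "gear wear severe",
    "gear noise under load",
]
labels = kmeans(problems, seeds=[0, 3])
print(labels)
```

Because the result of plain K-means depends on the initial centers, the seeds are fixed here. That sensitivity is the usual motivation for refining how centers are chosen, which is the part of the pipeline the thesis reports changing.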
