基于SAO结构的科技文本挖掘方法及应用研究

Research on SAO-Based Science and Technology Text Mining and Its Application

【Author】 Yang Chao (杨超)

【Advisor】 Zhu Donghua (朱东华)

【Author information】 Beijing Institute of Technology, Management Science and Engineering, 2016, PhD

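【Summary】 Science and technology text mining has become an important method in technology development decision-making and plays an increasingly important role in promoting knowledge sharing and innovation. However, current science and technology text mining concentrates on identifying isolated topic words or phrases. It does not explicitly identify the relationships among topic terms, and it is prone to semantic ambiguity caused by homographs and synonyms. This dissertation studies science and technology text mining methods built on the Subject-Action-Object (SAO) structure. It addresses three main problems: extraction of SAO semantic structures, identification and classification of technology topics in science and technology texts, and analysis of technology topic development trends. To that end it investigates an SAO structure extraction model, an SAO-based topic model, an SAO-based core technological component identification model, and an SAO network-based technology development trend model. The work is interdisciplinary research spanning science and technology management and text mining. Its main contributions are as follows:

(1) A hierarchical, parse tree-based SAO structure identification method. Building on earlier rule-based identification of relationships between named entities, the dissertation establishes a hierarchical, parse tree-based SAO structure identification method. The method has three parts: 1) an SAO core component identification model based on Term Clumping, which ensures that the identified SAO structures are strongly related to the target topic; 2) a layered, parse tree-based SAO extraction model, which safeguards the recall and precision of SAO extraction; 3) an SAO evaluation model that integrates the term frequency and document frequency indicators of each SAO component, in order to identify core SAO structures. A case study verifies the accuracy of SAO identification. The method ensures that the extracted SAO structures fall within the target topic and supports ranking SAO structures by importance.

(2) An SAO-based LDA (Latent Dirichlet Allocation) topic model. The dissertation identifies "problem & solution" (P&S) patterns from SAO structures, proposes a "bag of P&S" assumption, and on that basis builds an SAO-based LDA topic model. The resulting topic model identifies topic structure effectively and improves considerably on the traditional LDA model in topic distinctiveness and semantic disambiguation.

(3) A requirement-oriented core technological component identification model based on SAO structures. To understand and monitor the core technological components (technical processes, operating methods, functions, material preparation, and so on) of different requirement topics, the dissertation builds a requirement-oriented core technological component identification model based on SAO structures: 1) it identifies concepts such as "function", "operation", and "relationship" in science and technology texts; 2) it defines an importance indicator and an innovativeness indicator, and judges the importance and innovativeness of technological components from frequency statistics, component correlation, and component life cycle analysis, in order to screen important components; 3) it designs a component-requirement relevance indicator to compute how relevant each technological component is to a requirement, and thereby identifies core technological components. The method has advantages in accurately describing complete technical details and can determine the technical requirement that corresponds to each core technological component.

(4) An SAO network-based technology development trend analysis model. To compute the relationship strength between two actors, the dissertation proposes an SAO network construction method based on "Subject(node)-Action(edge)-Object(node)" links. The development trend of emerging technologies is then analyzed in terms of in-degree and out-degree, key Actions, the actors' Burt constraint, the evolution of the node degree distribution, and the shift of the network's structural center. The model can identify core technologies and requirements, identify the specific content and strength of relationships between actors, and be applied to technology competitive advantage analysis. The empirical study presented provides evidence for validating the effectiveness of the new methods.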
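The SAO evaluation model in contribution (1) is described only as combining term frequency and document frequency indicators of each SAO component. The following is a minimal sketch of one way such a ranking could work, not the dissertation's actual formula. The function name `rank_saos`, the smoothed inverse document frequency, and the choice to sum the three component weights are all assumptions made for illustration.

```python
import math
from collections import Counter


def rank_saos(docs):
    """Rank SAO triples by a hypothetical TF-IDF-style score.

    docs: list of documents, each a list of (subject, action, object) tuples.
    Returns a list of (sao, score) pairs, highest score first.
    Assumption: a triple's score is the sum of its three component weights.
    """
    n_docs = len(docs)
    tf = Counter()  # total occurrences of each component in the corpus
    df = Counter()  # number of documents containing each component
    for doc in docs:
        components = [c for sao in doc for c in sao]
        tf.update(components)
        df.update(set(components))

    def weight(component):
        # Smoothed inverse document frequency (assumed form), always positive.
        idf = math.log((1 + n_docs) / (1 + df[component])) + 1.0
        return tf[component] * idf

    scores = {}
    for doc in docs:
        for sao in doc:
            scores[sao] = sum(weight(c) for c in sao)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    corpus = [
        [("dye", "absorb", "light"), ("electrode", "collect", "electron")],
        [("dye", "absorb", "light")],
        [("film", "coat", "electrode")],
    ]
    for sao, score in rank_saos(corpus):
        print(sao, round(score, 3))
```

Note that this sketch treats a word identically whether it appears as a subject, an action, or an object; a role-aware variant would key the counters on (role, word) pairs instead.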

【Abstract】 Science and technology text mining has become a key method in technology development decision-making and plays an important role in promoting knowledge sharing and innovation. At present, however, science and technology text mining focuses mainly on topic words and phrases, cannot clearly identify the relationships between them, and suffers from ambiguous interpretations caused by homonyms and synonyms. This dissertation studies Subject-Action-Object (SAO)-based text mining methods, focusing on semantic structure extraction, topic identification and classification, and technology trend analysis. The work comprises four models: an SAO extraction model, an SAO-based topic model, an SAO-based core technological component identification model, and an SAO network-based technology trend analysis model. It is interdisciplinary research spanning technology management and text mining. The main contributions are as follows:

(1) A hierarchical, parse tree-based SAO identification method. The method builds on earlier rule-based identification of relationships between named entities and includes three parts: 1) to ensure that the SAO structures correlate strongly with the topic, an SAO component identification model based on term clumping and co-word analysis; 2) a parse tree-based hierarchical SAO extraction model to ensure the recall and precision of SAO extraction; and 3) a Term Frequency-Inverse Document Frequency (TF-IDF)-based SAO weighting model that ranks SAO structures so that key SAOs can be selected. A case study verifies the accuracy of SAO identification. The proposed method ensures that the SAO structures fall within the scope of the target topic and supports weighting of the SAO structures.

(2) An SAO-based LDA model. The model involves: 1) identifying and exploring the problem & solution patterns embodied in SAO structures; 2) proposing a "bag-of-SAO" assumption; and 3) building an SAO-based LDA (Latent Dirichlet Allocation) model on that assumption. The proposed topic model identifies topic structure effectively and achieves substantial improvement in topic recognition and semantic disambiguation compared with the traditional LDA model.

(3) A requirement-oriented core technological component identification model based on SAO structures. To understand and monitor the core technological components of a technology (e.g., technology process, operation method, function, and material preparation), this dissertation proposes a requirement-oriented identification model in which: 1) a syntax-based approach identifies the SAO structures that describe function, relationship, and operation in specified topics; 2) an "importance indicator" and an "innovation indicator" are built from frequency statistics, technological component correlation, and technological component life cycle analysis to judge the importance and innovativeness of technological components and to screen them; and 3) a "relevance indicator" measures the relevance of technological components to requirements and is used to identify the core technological components. The proposed method can describe complete technical details accurately and determine the technical requirement that corresponds to each core technological component.

(4) An SAO network-based technology trend model. The model draws on actor-network theory. An SAO network is built from "Subject(node)-Action(edge)-Object(node)" links, after which the relationship strength between actors is calculated. The development trend of an emerging technology is analyzed with five indicators: in- and out-degree, key action, "Burt constraint", evolution of the node "degree distribution", and network center deviation. The model can identify core technologies and requirements, identify the details and strength of relationships between actors, and support technology competitive advantage analysis. An empirical study demonstrates the proposed methods.
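The "Subject(node)-Action(edge)-Object(node)" construction in contribution (4) can be sketched in a few lines. The example below is a hedged illustration, not the dissertation's implementation: the function names are hypothetical, edge weights are assumed to be simple co-occurrence counts, and the constraint follows the standard textbook definition of Burt's constraint on symmetrized tie strengths, which may differ from the variant the author used.

```python
from collections import defaultdict


def build_sao_network(saos):
    """Map each (subject, object) pair to a dict of action -> count."""
    edges = defaultdict(lambda: defaultdict(int))
    for subject, action, obj in saos:
        edges[(subject, obj)][action] += 1
    return {pair: dict(actions) for pair, actions in edges.items()}


def degrees(edges):
    """Weighted in-degree and out-degree of every node."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for (subject, obj), actions in edges.items():
        weight = sum(actions.values())
        out_deg[subject] += weight
        in_deg[obj] += weight
    return dict(in_deg), dict(out_deg)


def burt_constraint(edges):
    """Burt's constraint per node, using symmetrized edge weights."""
    tie = defaultdict(lambda: defaultdict(float))
    for (subject, obj), actions in edges.items():
        if subject == obj:
            continue  # self-loops carry no brokerage information
        weight = sum(actions.values())
        tie[subject][obj] += weight
        tie[obj][subject] += weight
    tie = {node: dict(nbrs) for node, nbrs in tie.items()}

    def p(i, j):
        # Share of node i's total tie strength invested in node j.
        return tie[i].get(j, 0.0) / sum(tie[i].values())

    constraint = {}
    for i in tie:
        total = 0.0
        for j in tie[i]:
            # Indirect investment in j through each other contact q of i.
            indirect = sum(p(i, q) * p(q, j) for q in tie[i] if q != j)
            total += (p(i, j) + indirect) ** 2
        constraint[i] = total
    return constraint


if __name__ == "__main__":
    # Toy triples, invented for illustration only.
    triples = [("H", "use", "a"), ("H", "use", "a"), ("H", "coat", "b")]
    network = build_sao_network(triples)
    print(degrees(network))
    print(burt_constraint(network))
```

A lower constraint value indicates an actor that bridges otherwise unconnected neighbors, which is why the hub of a star scores lower than its leaves.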

  • 【Classification code】 TP391.1; G254
  • 【Citation count】 1
  • 【Download count】 383