基于领域知识库的简历信息抽取系统的设计与实现

Design and Implementation of Resume Information Extraction System Based on Domain Knowledge Base

【Author】 张博

【Advisor】 徐塞虹

【Author information】 Beijing University of Posts and Telecommunications, Computer Technology, 2018, Master's

【摘要】 简历是求职者对自身情况所做的书面介绍,尽管在结构上具有一定的特点,在内容上存在一定的规范,但是形式多样。对于招聘者来说,通过人工的方式阅读、记录和筛选简历,往往耗费巨大的工作量。因此需要利用信息抽取技术从自由格式的简历文本中抽取出结构化的有价值信息,能够极大地简化简历分析工作,从而围绕简历中的实体和事件信息构造有效的人才库,方便进行简历的筛选、检索以及人才匹配。本论文根据实际需要明确简历抽取的功能和非功能性需求,对系统架构和功能模块进行设计,深入研究简历信息抽取的技术解决方案,实现了一个基于领域知识库的简历信息抽取系统,主要完成了以下几方面工作:(1)从维基百科、招聘网站等互联网资源中采集信息进行整理,构建简历信息抽取相关的企业名特征库、等价名称库等领域知识库。(2)采用触发词匹配算法并结合Word2vec词向量扩展触发词库,实现了按照结构特征的简历信息分块。对于不含有触发词特征的简历,通过将简历句子表示为特征向量,利用SVM分类算法实现按照内容特征的简历分块。(3)对比分析了基于领域知识的条件随机场模型(CRF),隐马尔可夫模型(HMM)和最大熵模型(ME)在简历命名实体识别中的原理和应用效果,使用最优的统计模型实现各类简历块中的实体信息抽取。(4)提出了简历信息抽取回溯策略,采用基于领域知识库的规则匹配方法对统计模型实体识别的结果进行二次抽取,同时在识别出的部分实体序列中鉴别出事件信息。(5)利用Elasticsearch分布式检索引擎实现了对简历抽取结果的快速筛选和查询。除此之外,使用Zend框架,Echarts等WEB相关技术将各个功能整合到系统中,实现了简历信息抽取的可视化操作。本文在上述工作的基础上,对简历信息抽取系统进行了一系列功能和性能测试,结果显示系统能够实现自动从简历文本中抽取生成结构化信息并建立求职者数据库,并且对于大多数实体均能达到预期的抽取效果,说明了本文中提出的简历分块方案和实体抽取方案的有效性。同时系统为用户提供的简历管理、筛选和检索等功能,也显著提高了简历处理的效率,使其具有了更好的实用价值。

【Abstract】 A resume is a job seeker's written description of their own background. Although resumes share certain structural features and content conventions, they vary widely in form. For recruiters, reading, recording, and filtering resumes manually often costs a tremendous amount of work. It is therefore necessary to use information extraction technology to extract structured, valuable information from free-form resume text. Doing so greatly simplifies resume analysis and makes it possible to build an effective talent pool around the entity and event information in resumes, which in turn facilitates resume filtering, searching, and talent matching. Following a brief introduction to the related information extraction technology, this thesis clarifies the requirements and functional design of resume extraction according to actual needs, studies the core technical solutions for resume information extraction in depth, and implements a complete resume information extraction system. The main work is as follows: (1) Information is collected from Internet resources such as Wikipedia and recruitment websites and collated to build domain knowledge bases, including a company-name knowledge base and an equivalent-name knowledge base. (2) A trigger-word matching algorithm, with the trigger-word lexicon expanded using Word2vec word vectors, is used to segment resumes into blocks according to structural features. For resumes that do not contain trigger words, sentences are represented as feature vectors and an SVM classifier is used to segment the resume according to content features. (3) The principles and performance of the Hidden Markov Model (HMM), the Maximum Entropy model (ME), and the Conditional Random Field model (CRF), each incorporating domain knowledge, are compared for named entity recognition in resumes, and the best-performing statistical model is selected to extract entity information from each category of resume block. (4) A backtracking strategy for resume information extraction is proposed: a rule-matching method based on the domain knowledge bases performs a second extraction pass over the entity recognition results of the statistical model, and event information is identified in some of the recognized entity sequences. (5) The Elasticsearch distributed search engine is used to filter and search the resume extraction results. In addition, the Zend framework, Echarts, and other web technologies are used to implement data visualization of resume information extraction and other business-layer functions, giving the system more practical value and enabling recruiters to handle resumes efficiently. On the basis of this work, a series of functional and performance tests were carried out on the system. The results show that the system can automatically extract structured information from resume text and build a job seeker database, and that it achieves the expected extraction results for most entities, which demonstrates the effectiveness of the proposed resume segmentation and entity extraction schemes. The system also provides users with resume management, filtering, and retrieval functions that improve the efficiency of resume processing.
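
The trigger-word segmentation in step (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the lexicon `TRIGGERS`, the block labels, and the function `segment_resume` are hypothetical names chosen for illustration, and the Word2vec-based lexicon expansion and the SVM fallback for resumes without trigger words are not shown.

```python
from typing import Dict, List

# Hypothetical trigger-word lexicon: block label -> section-heading words.
# The thesis expands such a lexicon with Word2vec; that step is omitted here.
TRIGGERS: Dict[str, List[str]] = {
    "education": ["education", "academic background"],
    "work": ["work experience", "employment history"],
    "skills": ["skills", "technical skills"],
}


def segment_resume(
    text: str, triggers: Dict[str, List[str]] = TRIGGERS
) -> Dict[str, List[str]]:
    """Assign each non-empty line to the block opened by the latest trigger line."""
    blocks: Dict[str, List[str]] = {"basic": []}  # lines before any trigger word
    current = "basic"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Normalize a possible heading: lowercase, drop a trailing colon.
        key = line.lower().rstrip(":")
        label = next(
            (lab for lab, words in triggers.items() if key in words), None
        )
        if label is not None:
            # A trigger line opens a new block and is not stored as content.
            current = label
            blocks.setdefault(current, [])
        else:
            blocks[current].append(line)
    return blocks
```

Each block produced this way would then be passed to the entity recognizer for its category, so a line under "Education" is labeled with education-related entity types rather than work-related ones.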

  • 【Classification number】TP391.1
  • 【Citations】7
  • 【Downloads】469