基于人机协同的医学文献信息抽取关键技术及系统研发
Research and Development of Key Technologies and Systems for Medical Literature Information Extraction Based on Human-Computer Collaboration
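【Author】 王国栋 (Wang Guodong);
【Advisor】 周雪忠 (Zhou Xuezhong);
【Author information】 Beijing Jiaotong University, Software Engineering (professional degree), 2021, Master's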
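【Summary】 With the spread of Internet technology and the rapid growth in the number of medical publications, the volume of medical literature data is growing explosively. Most of this data is stored in unstructured form, which makes it hard to extract and expensive to annotate by hand. Abstracts carry the key information of a medical paper, so extracting important evidence-based medicine data from large numbers of abstracts, and analyzing it to develop and synthesize new drugs for treating disease, is becoming increasingly important. Named entity recognition (NER) for medical literature, a basic and important natural language processing task, extracts normalized entities from unstructured medical literature, and these entities can be used to build medical knowledge graphs. A medical literature annotation system based on human-computer collaboration can extract information from large amounts of medical data in a short time using only a small amount of manual annotation, which improves efficiency and supports downstream work such as data mining. This thesis first proposes a BERT-BiLSTM-CRF model for NER in medical literature and then develops the "Human-Computer Collaborative Medical Literature Annotation System" for efficient annotation of evidence-based medicine literature. The work consists of two parts:

(1) A named entity recognition model based on BERT-BiLSTM-CRF. The pre-trained BERT model is one of the strongest deep learning models to emerge in recent years. The thesis first uses BERT to train word vectors, encoding the input sequence with a bidirectional Transformer to obtain a representation that relates the words at any two positions in a sentence. On the NCBI-disease standard dataset, BERT alone reaches an F1 of 75.91% on the entity recognition task; adding bidirectional LSTM and CRF layers, which perform further feature extraction on the BERT output vectors, raises F1 by 0.58 percentage points. Two traditional machine learning models and five deep learning models were applied to entity recognition in PubMed evidence-based medical literature, and their results were compared and analyzed. The final score of each model is the average over three runs. The comparison shows that BERT-BiLSTM-CRF performs best, with an F1 of 76.49% on the NCBI-disease standard dataset and 74.18% on the PubMed disease and symptom corpus.

(2) The Human-Computer Collaborative Medical Literature Annotation System, developed for rapid structured extraction from evidence-based medical literature. The system combines manual annotation with intelligent (model-assisted) annotation and provides three kinds of management. Model management covers common NER models: after annotating by hand, annotators can choose the algorithm and dataset to train on, set parameters such as the learning rate, train the model, and test its performance. Literature management lets users search medical literature online by given conditions, store the results in structured form, and manage the search subject terms, which improves the experience of researchers. Dataset management turns the annotations for a specific task into standard datasets for training and testing models. With very little manual effort, the system can quickly extract information from large amounts of evidence-based medical data, reducing the cost of manual annotation along with the labor, material resources, and time required. The thesis presents the workflow design and functional design of the system, together with the part of the development work carried out by the author.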
【Abstract】 With the increasing popularity of Internet technology and the rapid growth of medical literature, the volume of medical literature data is exploding. Most of this data is stored in unstructured form, which makes it hard to extract and expensive to annotate manually. In medical literature, abstracts record the important information, so it is becoming more and more important to extract evidence-based medical data from large numbers of abstracts and to analyze it in order to develop and synthesize new drugs to treat diseases. Named entity recognition for medical literature, a basic and important task in natural language processing, extracts normalized entities from unstructured medical literature; these entities can be used to build medical knowledge graphs. A medical literature annotation system based on human-computer collaboration can extract information from a large amount of medical data in a short time with a small amount of manual annotation, improving efficiency and supporting downstream data mining. We first propose a BERT-BiLSTM-CRF model for named entity recognition in medical literature, and then develop the "Human-Computer Collaborative Medical Literature Annotation System" for efficient annotation of evidence-based medicine literature. The work consists of two parts:

(1) A BERT-BiLSTM-CRF based named entity recognition model for medical literature. The pre-trained BERT model is an excellent deep learning model that has emerged in recent years. We first use BERT to train word vectors, encoding the input sequence with a bidirectional Transformer to obtain a representation that relates the words at any two positions in a sentence. The BERT model achieves an F1 of 75.91% on the NCBI-disease standard dataset for the entity recognition task. Adding bidirectional LSTM and CRF layers, which perform further feature extraction on the vectors produced by BERT, improves F1 by 0.58 percentage points. Two traditional machine learning models and five deep learning models are applied to entity recognition in PubMed evidence-based medical literature, and the experimental results are compared and analyzed. The final result for each model is the average over three runs. The comparison shows that BERT-BiLSTM-CRF performs best, with an F1 of 76.49% on the NCBI-disease standard dataset and 74.18% on the PubMed disease and symptom corpus.

(2) The "Human-Computer Collaborative Medical Literature Annotation System", developed for rapid structured extraction from evidence-based medical literature. The system combines manual annotation with intelligent annotation and manages common named entity recognition models, so that annotators can train the models and test their performance after annotating. It can also be used to create standard datasets for training and testing models once a specific task has been labeled. The system can quickly extract information from a large amount of evidence-based medical data with little manual effort, reducing the cost of manual annotation and the human resources, material resources, and time required. This thesis presents the process design and functional design of the system and the part of the development work carried out by the author.
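The reported scores are F1 values averaged over three runs per model, and the two NCBI-disease figures are consistent with each other: 75.91 + 0.58 = 76.49. The record does not say how F1 was computed, so the following is only a minimal sketch under stated assumptions: entity-level, exact-match scoring over BIO tags (the common convention for NER benchmarks), with stray `I-` tags ignored. The function names `extract_entities`, `entity_f1`, and `mean_f1` are hypothetical and are not taken from the thesis.

```python
from statistics import mean


def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence.

    `end` is exclusive. An I- tag that does not continue an open span of
    the same type is ignored (an assumption; some scorers are more lenient).
    """
    spans = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold_sentences, pred_sentences):
    """Micro-averaged exact-match F1 over a list of tagged sentences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)       # spans with identical boundaries and type
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_f1(gold_sentences, runs):
    """Average F1 over several independent runs (three in the thesis)."""
    return mean([entity_f1(gold_sentences, preds) for preds in runs])


# Toy data, invented for illustration only.
gold = [["B-Disease", "I-Disease", "O"], ["O", "B-Disease"]]
runs = [
    [["B-Disease", "I-Disease", "O"], ["O", "B-Disease"]],  # all correct
    [["B-Disease", "O", "O"], ["O", "B-Disease"]],          # one boundary error
    [["O", "O", "O"], ["O", "O"]],                          # nothing predicted
]
print(mean_f1(gold, runs))
```

Exact-match scoring counts a prediction with a wrong boundary as both a false positive and a false negative, which is why the second toy run loses precision and recall at once.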
【Key words】 named entity recognition; medical literature; deep neural networks; human-computer collaboration; text mining;
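- 【Online publication contributor】 Beijing Jiaotong University 【Online publication year and issue】 2022, Issue 02
- 【Classification code】 R-05; TP391.1
- 【Times cited】 1
- 【Downloads】 139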