基于Android的众包文本标注系统的设计与实现
The Design and Implementation of Crowdsourcing Platform for Text Labeling System Based on Android
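【Author】 Kong Min (孔敏);
【Advisor】 Fu Xiao (伏晓);
【Author information】 Nanjing University, Master of Engineering (Software Engineering) (professional degree), 2019, Master's thesis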
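【Abstract (translated from the Chinese)】 Text is the most basic form of information, and natural language processing (NLP) techniques can be used to analyze and process massive amounts of text data. The first prerequisite for intelligent, automatic information processing is labeled data that can serve as a training set for a model. Labeling text data is therefore a problem that must be solved before NLP algorithms can be studied. Because text processing algorithms are diverse and examine text from different angles, multiple types of text labeling are needed. This thesis surveys the state of data labeling platforms in China and abroad and observes that current platforms support many labeling types but that few are specialized for text. Drawing on the large user base, high efficiency, and low cost of crowdsourcing platforms, it argues that building a crowdsourcing-based text labeling system is both necessary and feasible as an effective solution to the text labeling problem. The thesis designs and implements such a system. The system is divided into three modules: task publishing, task execution, and task management. Labeling work is organized as tasks, and tasks are divided into different types. In the task publishing module, the user selects a task type and then uploads the text to be labeled as a file. In the task execution module, users label text data in different ways by selecting file content, choosing labels, drawing connecting lines, and dragging text. In the task management module, users can view the tasks they have published or taken part in. The back end is built on the Spring Boot framework, and the front end presents data in Android mobile pages. The system designs and implements six types of text labeling, delivers the expected functionality, and can later be extended with new labeling types. This thesis mainly describes the design and implementation of three of these labeling types. The system aims to provide high-quality, varied, and reliable labeled datasets for all natural language processing algorithms, using reliable data to improve training accuracy, shorten the preparation time before training, and advance natural language processing technology.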
【Abstract】 Text is the most basic form of information, and natural language processing technology can be used to analyze and process massive amounts of text data. The first condition for processing information intelligently and automatically is to have labeled text data that can be used as the training set for the data model. Labeling text data has therefore become a problem to be solved before natural language processing algorithms can be studied. Because there are many kinds of text processing algorithms and text must be studied from different angles, multiple types of text labeling are needed. This thesis summarizes the development status of data labeling platforms at home and abroad. Current platforms offer many kinds of data labeling, but few are professional text labeling platforms. Crowdsourcing platforms, in turn, offer a large user base, high efficiency, and low cost. On this basis, the thesis argues for the necessity and feasibility of constructing a crowdsourcing-based text labeling system to solve the data labeling problem effectively. This thesis designs and implements a text labeling system based on a crowdsourcing platform. The system is divided into three modules: a task publishing module, a task executing module, and a task management module. In this system, text labeling work is task-based, and text labeling tasks are divided into different types. In the task publishing module, users select a text labeling type and then upload the text to be labeled to the system as a file. In the task executing module, users perform different types of text labeling through different operations, such as selecting file content, selecting labels, drawing connecting lines, and dragging text. In the task management module, users can view the tasks that they have published or participated in. The system's back end is built with the Spring Boot framework, and the front end uses Android mobile pages to display data. The system implements six types of text labeling and delivers the expected functions, and it can be extended with new text labeling types later. The system is dedicated to providing high-quality, multi-category, and reliable labeled data sets for all natural language processing algorithms, improving the accuracy of algorithm training by using reliable data, reducing the preparation time required before training, and promoting the development of natural language processing technology.
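The three-module split described in the abstract can be illustrated with a minimal, framework-free Java sketch. Everything in it is an assumption made for illustration: the class, method, and field names (`LabelingSketch`, `TaskBoard`, `publish`, `annotate`, `tasksFor`) do not come from the thesis, and the `Interaction` enum mirrors only the four interaction styles the abstract mentions, not the six labeling types, which the abstract does not name. The real system uses Spring Boot and Android. This sketch uses an in-memory list instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: none of these names come from the thesis.
class LabelingSketch {

    // Interaction styles mentioned in the abstract (not the six task types).
    enum Interaction { SELECT_TEXT, CHOOSE_LABEL, CONNECT_LINE, DRAG_TEXT }

    // One labeled span produced by a worker while executing a task.
    static final class Annotation {
        final String worker;
        final int start;    // inclusive character offset into the task text
        final int end;      // exclusive character offset into the task text
        final String label;

        Annotation(String worker, int start, int end, String label) {
            this.worker = worker;
            this.start = start;
            this.end = end;
            this.label = label;
        }
    }

    // A labeling task: the uploaded text plus the annotations collected so far.
    static final class Task {
        final int id;
        final String publisher;
        final Interaction interaction;
        final String text;
        final List<Annotation> annotations = new ArrayList<>();

        Task(int id, String publisher, Interaction interaction, String text) {
            this.id = id;
            this.publisher = publisher;
            this.interaction = interaction;
            this.text = text;
        }
    }

    // In-memory stand-in for the back end; one method per module.
    static final class TaskBoard {
        private final List<Task> tasks = new ArrayList<>();

        // Task publishing: pick an interaction type and upload the text.
        Task publish(String publisher, Interaction interaction, String text) {
            Task task = new Task(tasks.size() + 1, publisher, interaction, text);
            tasks.add(task);
            return task;
        }

        // Task execution: a worker attaches a label to a span of the text.
        Annotation annotate(Task task, String worker, int start, int end, String label) {
            if (start < 0 || end > task.text.length() || start >= end) {
                throw new IllegalArgumentException("span out of range");
            }
            Annotation annotation = new Annotation(worker, start, end, label);
            task.annotations.add(annotation);
            return annotation;
        }

        // Task management: tasks the user published or contributed to.
        List<Task> tasksFor(String user) {
            List<Task> result = new ArrayList<>();
            for (Task task : tasks) {
                boolean involved = task.publisher.equals(user);
                for (Annotation annotation : task.annotations) {
                    if (annotation.worker.equals(user)) {
                        involved = true;
                    }
                }
                if (involved) {
                    result.add(task);
                }
            }
            return result;
        }
    }

    public static void main(String[] args) {
        TaskBoard board = new TaskBoard();
        Task task = board.publish("alice", Interaction.SELECT_TEXT,
                "Nanjing University is in Nanjing.");
        Annotation a = board.annotate(task, "bob", 0, 18, "ORG");
        System.out.println(task.text.substring(a.start, a.end) + " -> " + a.label);
        System.out.println("Tasks involving bob: " + board.tasksFor("bob").size());
    }
}
```

Storing annotations as character offsets rather than copied strings is one possible design choice. It keeps each label tied to an exact position in the uploaded text, which is typically what span-based training data requires.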
【Key words】 Crowdsourcing; Text Annotation; Information Extraction; Relationship Extraction; Spring Boot; Android;
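- 【Online publication contributor】 Nanjing University 【Online publication year and issue】 2019, Issue 07
- 【Classification number】 TP391.1
- 【Times cited】 2
- 【Downloads】 320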