Research on English-Chinese Translation Based on Improved seq2seq Model
【Author】 Liu Jie
【Supervisor】 Li Shijun
【Author Information】 Wuhan University, Software Engineering, 2018, Master's thesis
【Abstract】 Machine translation is an important topic in the field of natural language processing, with great research value and broad commercial application prospects. The best-performing methods in machine translation today are the neural machine translation models first proposed in 2014, the most popular of which is the attention-based seq2seq model. However, existing seq2seq models are mainly optimized and evaluated on the Indo-European language family, with few optimizations targeting Chinese, and they do not take the transformation of syntax between different languages into account. Targeting the characteristics of Chinese, this paper uses different text preprocessing and embedding-layer parameter initialization methods, and improves the structure of the seq2seq model by adding a conversion layer for syntactic change between the encoder and the decoder. The main work of this paper is as follows:

1. We propose a different text preprocessing method. In natural language processing tasks, unstructured text data must first be converted into a computer-recognizable format through preprocessing. The traditional Chinese preprocessing method in translation systems converts Chinese sentences into word sequences through word segmentation, but this approach depends on segmentation accuracy and can produce an overly large Chinese vocabulary. Given the characteristics of Chinese, such as its large character inventory, high per-character information entropy, and strong ideographic capacity, this paper proposes a preprocessing method that uses named entity recognition to convert Chinese sentences into character + named-entity sequences (a tokenization sketch appears after the abstract). Experiments show that this preprocessing reduces the translation model's parameter scale and training time by more than 18% on the English-Chinese translation task while improving translation performance by 0.3-0.5 BLEU.

2. We propose a different embedding-layer parameter initialization method. The embedding layer is the first layer of a neural network model for text processing; it converts preprocessed character sequences into sequences of numerical vectors to support subsequent computation. The choice of parameter initialization method in deep learning is crucial to where the model converges, and existing translation models usually use pre-trained word embeddings as the initial values of the embedding-layer parameters. However, a translation system needs word embeddings for two different languages, and pre-trained embeddings are trained on separate monolingual corpora, so the embeddings of the two languages are not semantically aligned. This paper therefore proposes that, in the English-Chinese translation model, the English side use GloVe to initialize the embedding-layer parameters while the Chinese side use random initialization (an initialization sketch appears after the abstract). Experiments show that models trained with this initialization gain 0.3-0.6 BLEU in translation performance on small and medium-sized corpora.

3. We improve the seq2seq architecture by proposing a conversion layer. In existing seq2seq models, the encoder compresses the source-language sequence into a representation vector, which is then used directly as the initial state of the decoder to generate the target-language sequence. This structure does not account for syntactic differences between languages. This paper therefore improves the seq2seq architecture by adding a conversion layer for syntactic change between the encoder and the decoder; the layer consists of two feed-forward neural network layers, a residual connection, and a layer normalization layer (a sketch of this layer appears after the abstract). Experiments show that a seq2seq model with the conversion layer gains 0.7-1.0 BLEU in translation performance.
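A minimal sketch of the character + named-entity tokenization from contribution 1, in Python. The span format `(start, end, label)` and the function name are assumptions for illustration; the abstract does not name a specific NER tool, so `entities` stands in for whatever a Chinese NER component would return.

```python
def char_plus_ner_tokenize(sentence, entities):
    """Split `sentence` into single characters, keeping NER spans whole.

    entities: non-overlapping (start, end, label) spans, sorted by start.
    """
    tokens, pos = [], 0
    for start, end, label in entities:
        tokens.extend(sentence[pos:start])   # plain chars, one token each
        tokens.append(sentence[start:end])   # entity kept as one token
        pos = end
    tokens.extend(sentence[pos:])
    return tokens

# Example: a person span and an organization span are kept intact.
sentence = "李石君在武汉大学工作"
entities = [(0, 3, "PER"), (4, 8, "ORG")]
print(char_plus_ner_tokenize(sentence, entities))
# ['李石君', '在', '武汉大学', '工', '作']
```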
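A sketch of the asymmetric embedding initialization from contribution 2, assuming PyTorch. The `glove` word-to-vector dict and the `en_vocab` word-to-index dict are hypothetical inputs; loading GloVe vectors from disk is omitted, and PyTorch's default `nn.Embedding` initialization plays the role of the random Chinese-side initialization.

```python
import torch
import torch.nn as nn

def build_embeddings(en_vocab, zh_vocab_size, glove, dim=300):
    """English side: rows copied from pre-trained GloVe vectors, with
    words missing from GloVe left at their random defaults.
    Chinese side: plain random initialization, per the abstract."""
    en_emb = nn.Embedding(len(en_vocab), dim)
    with torch.no_grad():
        for word, idx in en_vocab.items():
            if word in glove:
                en_emb.weight[idx] = torch.as_tensor(glove[word],
                                                     dtype=torch.float32)
    zh_emb = nn.Embedding(zh_vocab_size, dim)  # default random init
    return en_emb, zh_emb
```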
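A sketch of the conversion layer from contribution 3, again assuming PyTorch. The abstract fixes the composition (two feed-forward layers, a residual connection, one normalization layer) but not the hidden width or activation; the 4x expansion and ReLU below are common defaults, not the thesis's stated choices.

```python
import torch.nn as nn

class ConversionLayer(nn.Module):
    """Two feed-forward layers wrapped in a residual connection,
    followed by layer normalization. Sits between the encoder's
    final state and the decoder's initial state."""
    def __init__(self, d_model, d_hidden=None):
        super().__init__()
        d_hidden = d_hidden or 4 * d_model        # assumed width
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),                            # assumed activation
            nn.Linear(d_hidden, d_model),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, encoder_state):
        # Residual around the feed-forward block, then normalize;
        # the output initializes the decoder state.
        return self.norm(encoder_state + self.ff(encoder_state))
```

Because the input and output widths are both d_model, the layer drops in between any encoder and decoder without changing their interfaces, adding only the parameters of the two linear maps.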
【Key words】 Deep learning; Neural machine translation; Seq2seq model; Attention mechanism; Named entity recognition;
- 【Online Publication Contributor】 Wuhan University 【Online Publication Issue】 2019, Issue 06
- 【CLC Number】 TP391.1
- 【Cited By】 1
- 【Downloads】 290