Application of Sub-word Segmentation in Mongolian-Chinese Neural Machine Translation
(Original title: 子字粒度切分在蒙汉神经机器翻译中的应用)
【Abstract】 In Mongolian-Chinese neural machine translation, the scarcity of corpora causes severe data sparseness, which substantially degrades translation quality. This paper studies the application of sub-word segmentation to Mongolian-Chinese neural machine translation models. The Byte Pair Encoding (BPE) algorithm is used to keep the segmentation granularity between characters and words, splitting low-frequency words into relatively high-frequency sub-word units. This alleviates data sparseness and improves model robustness more efficiently under limited data and hardware resources. Experiments show that applying sub-word segmentation to two network models raises the BLEU score by 4.81 and 2.96, respectively, and that the reduction in training time becomes more pronounced as the corpus grows, indicating that sub-word segmentation helps improve Mongolian-Chinese neural machine translation.
【Key words】 Mongolian-Chinese neural machine translation; data sparseness; sub-word segmentation
- 【Source】 中文信息学报 (Journal of Chinese Information Processing), 2019, Issue 01
- 【Classification No.】 TP391.2
- 【Citations】 19
- 【Downloads】 233
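
The abstract describes using BPE to split low-frequency words into relatively high-frequency sub-word units. The following is a minimal sketch of that general idea, not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, and the toy word counts are all assumptions made for illustration, and a real system would run on Mongolian and Chinese text with a much larger number of merges.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} dictionary."""
    # Start from characters, plus a hypothetical end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a (possibly unseen) word by replaying the merges in learned order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Toy counts, invented for illustration only.
    toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    rules = learn_bpe(toy_counts, num_merges=10)
    # "lowest" never appears in the toy counts, yet it can still be
    # represented with sub-word units learned from the other words.
    print(segment("lowest", rules))
```

The number of merges controls the granularity: zero merges leaves pure characters, while a very large number approaches whole words. That is the sense in which the segmentation sits between character level and word level.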