古汉语预训练语言模型研究与应用
Research and Application of Pre-trained Language Model for Classical Chinese
【Author】 Zhou Bo (周波)
【Supervisor】 Zhang Yin (张寅)
【Author Information】 Zhejiang University, Computer Technology, 2023, Master's degree
【摘要 (Abstract, translated)】 Classical Chinese (including literary prose, poetry, and verse) is the crystallization of thousands of years of Chinese culture. Classical texts are difficult to understand, and the number of ancient books handed down is enormous; organizing them and uncovering their value requires a great deal of work. It is therefore necessary to introduce efficient NLP (Natural Language Processing) techniques to process, understand, and study such documents. Pre-trained language models have achieved great success in NLP for languages including English and modern Chinese, but Classical Chinese differs considerably from modern Chinese in style, so general-purpose modern Chinese pre-trained language models are not suitable. This thesis proposes WYWLM (Wen Yan Wen Language Model), which is trained on a large-scale corpus with multiple pre-training techniques targeting characteristic features of Classical Chinese: short sentences, concise wording, regular text structure, and frequent quotations. Specifically, this work proposes a new contrastive-learning pre-training task that uses a dictionary as a medium, allowing the model to exploit massive amounts of modern Chinese text and learn better character and word representations; it introduces a style bridging decoder to strengthen the language model and bridge the gap between Classical and modern Chinese; and it injects knowledge into the language model using a Classical Chinese dictionary containing character/word definitions and sources. Evaluation benchmarks play an important role in pre-trained language model research, allowing researchers to measure model performance and clarify directions for improvement; however, no existing benchmark is suitable for Classical Chinese. To enable researchers to evaluate pre-trained language models against a unified standard, this thesis proposes WYWEB (Wen Yan Wen Evaluation Benchmark), an NLP evaluation benchmark dedicated to Classical Chinese. It comprises eight tasks covering sentence classification, sequence labeling, reading comprehension, and machine translation, so that researchers working on Classical Chinese pre-trained language models can evaluate their models' capabilities under a unified standard. Several Classical Chinese pre-trained models, together with WYWLM, were evaluated on WYWEB. The results show that WYWEB can assess pre-trained model performance along multiple dimensions, and WYWLM achieved the best score, indicating that the pre-training methods designed here for Classical Chinese are effective. These techniques will serve as backend components of a Chinese classics reader, exposing a RESTful interface to users. Building on this research, the WYWEB dataset and the WYWLM model will be open-sourced as a contribution to the Classical Chinese NLP research community.
【Abstract】 Ancient Chinese (Literary Chinese) is the epitome of thousands of years of Chinese culture. Classical literary texts are difficult to understand, and the number of ancient texts passed down is huge, requiring a great deal of effort to sort and uncover their value. Therefore, it is necessary to introduce efficient NLP technology to process, understand, and research these types of documents. Current pre-trained language models in the field of NLP, including English and Chinese, have achieved tremendous success. However, Classical Chinese writing differs significantly from modern Chinese, making general-purpose modern Chinese pre-trained language models unsuitable. This thesis proposes WYWLM (Wen Yan Wen Language Model), which employs various pre-training techniques on large-scale corpora to address the characteristics of Classical Chinese, such as short sentences, concise vocabulary, organized text, and frequent quotations. In this work, a new pre-training task based on contrastive learning is introduced, using dictionaries as a medium to leverage vast amounts of modern Chinese text, allowing the model to learn better representations of Chinese characters and words. A style bridging decoder is introduced to enhance the language model and bridge the gap between Classical Chinese and modern Chinese. Additionally, a Classical Chinese dictionary containing character/word definitions and sources is used to incorporate knowledge into the language model. Evaluation benchmarks such as GLUE, SuperGLUE, and CLUE play an important role in pre-trained language model research by allowing researchers to assess the performance of their language models. However, current evaluation benchmarks are not suitable for Classical Chinese (wenyanwen, 文言文). To enable researchers in this field to evaluate pre-trained language models using a standardized framework, this thesis proposes a dedicated Natural Language Processing (NLP) evaluation benchmark called WYWEB (Wen Yan Wen Evaluation Benchmark). WYWEB consists of eight tasks, including sentence classification, sequence labeling, reading comprehension, and machine translation, enabling researchers in the Classical Chinese domain to assess the capabilities of their models against a unified standard. Multiple pre-trained models for Classical Chinese, along with WYWLM, were evaluated on WYWEB, and the results show that WYWEB can be used to evaluate pre-trained models' performance along multiple dimensions. WYWLM achieved the best score, demonstrating that the pre-training method designed for Classical Chinese is effective. These techniques will serve as a backend support component for a classical literature reader and provide RESTful interfaces to users. Based on this research, the WYWEB dataset and WYWLM model will be open-sourced, contributing to the Classical Chinese NLP research community.
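The abstract describes a contrastive pre-training task that uses a dictionary as a medium to align character/word representations with their definitions. The thesis does not publish its implementation here, so the following is only a minimal sketch of the general idea under assumptions of our own: character embeddings and the embeddings of their dictionary definitions are treated as positive pairs, other definitions in the batch serve as negatives, and an InfoNCE-style loss pulls positives together. All names (`info_nce_loss`, `char_emb`, `def_emb`) and shapes are illustrative, not the thesis's actual code.

```python
import numpy as np

def info_nce_loss(char_emb, def_emb, temperature=0.07):
    """InfoNCE-style contrastive loss (illustrative, not the thesis code).

    char_emb, def_emb: arrays of shape (batch, dim); row i of def_emb is
    the embedding of the dictionary definition for the character/word in
    row i of char_emb. Off-diagonal rows act as in-batch negatives.
    """
    # L2-normalize so the dot product becomes cosine similarity
    char_emb = char_emb / np.linalg.norm(char_emb, axis=1, keepdims=True)
    def_emb = def_emb / np.linalg.norm(def_emb, axis=1, keepdims=True)
    logits = char_emb @ def_emb.T / temperature  # (batch, batch) similarity
    # Numerically stable log-softmax over each row; the diagonal entry is
    # the positive (correct definition) for that row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy check: well-aligned character/definition pairs should yield a
# lower loss than randomly paired ones.
rng = np.random.default_rng(0)
chars = rng.normal(size=(4, 8))
aligned_defs = chars + 0.01 * rng.normal(size=(4, 8))  # near-duplicates
random_defs = rng.normal(size=(4, 8))
loss_aligned = info_nce_loss(chars, aligned_defs)
loss_random = info_nce_loss(chars, random_defs)
print(loss_aligned, loss_random)
```

In an actual pre-training setup the two embedding matrices would come from the language model's encoder (one view encoding the character in context, the other encoding its dictionary definition), and the loss would be minimized jointly with the other pre-training objectives.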
【Key words】 Pre-trained Language Model; Evaluation Benchmark; Ancient Chinese; Natural Language Processing
- 【Online Publisher】 Zhejiang University 【Online Publication Year/Issue】 2024, No. 07
- 【CLC Number】 TP391.1; H109.2