基于唐诗语料词的提取与统计分析的研究
The Research on Extraction and Statistics Analysis of Corpus Words Based on Tang Poem
【Author】 Liu Jie (刘杰)
【Advisor】 Wei Xiaohui (魏晓辉)
【Author information】 Jilin University, Software Engineering, 2006, Master's degree
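【Abstract (from the Chinese)】 This thesis studies the extraction and statistical analysis of "words" based on a Tang poetry corpus. It relies mainly on corpus-based statistical methods. Statistics serves as a tool for uncovering linguistic phenomena hidden in language examples, and introducing it gives a relatively objective standard for deciding what counts as a "word" in Tang poetry and for judging the semantic relations among words. Working from a corpus of the Three Hundred Tang Poems, the thesis extracts the words in the poems and then analyzes the poetic vocabulary statistically. Its main contribution is a multi-dimensional statistical model for discovering unregistered (out-of-vocabulary) words, based on frequency, relative co-occurrence, and insertion rate. The model improves on the traditional mutual information model to account for the heavy use of multi-character words in Chinese, raising both the precision and the recall of automatic statistical word extraction. The thesis first briefly reviews the current state of corpus development and of computational linguistics. It then adopts corpus-based statistical methods, refines the information-theoretic concept of mutual information, and proposes a three-dimensional statistical word-discovery model based on co-occurrence degree, binding strength, and insertion probability. According to the experiments, this improved model greatly increases the accuracy of word extraction from the Tang poetry corpus. To segment the corpus, the thesis combines the inherent characteristics of Tang poetry with modern Chinese word segmentation techniques, which achieves relatively high efficiency. It also presents a statistical analysis of co-occurring words and antithetical (parallel) words in the corpus. This part is only a first step, and the author hopes to pursue it further in future work.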
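The abstract names the traditional mutual information model as the baseline that the thesis improves on, without giving formulas. As a rough illustration of that baseline only, the sketch below scores adjacent character pairs by pointwise mutual information, log2(P(xy) / (P(x) P(y))). The function name `bigram_pmi`, the `min_count` threshold, and the four sample lines are assumptions made for this example. It does not implement the thesis's three-dimensional model, whose co-occurrence, binding-strength, and insertion measures are not defined in the abstract.

```python
import math
from collections import Counter


def bigram_pmi(lines, min_count=2):
    """Score adjacent character pairs by pointwise mutual information.

    Hypothetical helper for illustration; the thesis's own scoring
    functions are not given in the abstract.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for line in lines:
        char_counts.update(line)
        # Every pair of neighbouring characters within one line.
        pair_counts.update(line[i:i + 2] for i in range(len(line) - 1))

    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())

    scores = {}
    for pair, count in pair_counts.items():
        if count < min_count:
            continue  # frequency filter: skip pairs seen too rarely
        p_xy = count / total_pairs
        p_x = char_counts[pair[0]] / total_chars
        p_y = char_counts[pair[1]] / total_chars
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores


# Four well-known five-character lines, used here as toy input only.
toy_lines = ["床前明月光", "举头望明月", "明月松间照", "海上生明月"]
scores = bigram_pmi(toy_lines)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

With input this small, only pairs that recur across lines pass the frequency filter. A real run would need the full corpus and, following the abstract, further measures beyond mutual information.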
【Abstract】 Chinese poetry has a long history and holds a most important place in the development of Chinese culture. It is a popular literary form that stays close to the spoken language. Tang poetry inherited the strengths of earlier poetry and carried them forward, so research on Tang poems has always been one of the active areas of Sinology. Traditional poetry studies depend on the cultural learning of the researcher, who grasps the connotations of the works. This approach is well suited to explaining the aesthetics of a work, but it struggles when we attempt cross-sectional and longitudinal analysis of the overall language picture. Even in the information age, ancient texts are regrettably still studied by hand, which is inefficient and wastes labor. Applying computers to the study of ancient texts is therefore a general trend among researchers. Corpora, which form the basis of linguistic study, are widely applied in this work and can help overcome many of these problems. Applying computing to poetry research opens a new area of computational linguistics. The theories, methods, and techniques established by modern Chinese information processing can be applied in depth to the study of ancient texts. Because a corpus provides abundant language material and statistical data, the inaccuracy that a researcher's subjectivity introduces into qualitative description can be avoided. The qualitative statistics and
- 【Online publication contributor】 Jilin University 【Online publication issue】 2007, No. 05
- 【Classification number】 TP391.1
- 【Citations】 5
- 【Downloads】 692