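A Large-Scale Survey of Automatic Recognition Methods and Features of “Bu + Adjective” (不形) Phrases
【Author】 Yan Wei (颜伟);
【Advisor】 Song Rou (宋柔);
【Author information】 Beijing Language and Culture University, Linguistics and Applied Linguistics, 2005, Master's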
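【Abstract, translated from the Chinese】 The “Bu + adjective” (不形) phrase, formed by combining 不 (bu, “not”) with an adjective, is the most important negative form of the adjective and a fairly complex phenomenon in Modern Chinese. The combination of 不 with an adjective is constrained and influenced by many factors. Traditional linguistics has studied adjectives extensively, but dedicated studies of the combination of 不 with adjectives are comparatively few, their surveys are small in scale, and they concentrate on syntactic meaning. On the one hand, Chinese teaching, Chinese linguistics, and Chinese information processing all need a larger-scale survey of the formal features of “Bu + adjective” combinations. On the other hand, no results have yet been reported on the automatic machine recognition of “Bu + adjective” phrases, although such recognition is a prerequisite for any large-scale survey and will also play an important role in processing Chinese with statistical methods. The research in this thesis has four parts. The first part studies methods for automatically recognizing “Bu + adjective” phrases; the second compiles distribution statistics for these phrases in several text corpora; the third surveys their formal features (linear adjacency features and grammatical function features) in a large-scale corpus. All three parts combine machine and human work: software performs retrieval and counting as accurately as possible while essentially guaranteeing recall, and a human then filters and organizes the output and summarizes rules. The last part examines the similarities and differences between “Bu + adjective” phrases and adjectives in their linear adjacency features and grammatical features, and the characteristics of the “Bu + adjective + noun” pattern. Recognition was carried out on a corpus of contemporary mainland Chinese fiction of more than 80 million characters. The method guarantees fairly high recall. As for precision, simple rules cannot yet fully disambiguate words that belong to more than one word class, but lexicalized rules (词例化规则) nevertheless gave fairly satisfactory results. The distribution statistics are based on five corpora totaling roughly 350 million characters: contemporary mainland fiction, fiction by modern masters, Hong Kong and Taiwan fiction, classical fiction, and the People's Daily. The surveys of linear adjacency features and grammatical function features, and the examination of the “Bu + adjective + noun” pattern, all rest on the contemporary mainland fiction corpus of nearly 80 million characters, a scale not previously attempted in traditional linguistics. The results provide reliable data for related research and also suggest new angles from which to study adjectives and adjective phrases.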
【Abstract】 Among the negative forms of the adjective, Badj phrases, which consist of the Chinese character Bu (不) and an adjective, are the most important. They are also a complicated phenomenon in Modern Chinese. The combination of Bu and adjectives is influenced by many factors. Chinese linguists have done much research on adjectives, but there are relatively few dedicated studies of the combination of Bu and adjectives. On the one hand, many fields, including Chinese teaching, Chinese linguistics, and Chinese language processing, need a study of the formal features of this combination. On the other hand, there are no reports on the automatic recognition of Badj phrases, although it is the foundation of any large-scale study and is also very important for studying Chinese with statistical methods. There are four parts in this thesis. Part one covers automatic recognition methods for Badj phrases. Part two covers the distribution of Badj phrases in several different corpora. Part three covers the formal features of Badj phrases, including linear adjacency features and syntactic features. The last part covers the similarities and differences between adjectives and Badj phrases. The study of automatic recognition methods is based on the Contemporary Mainland Writers' Works Corpus, which has more than 80 million characters. Our methods largely solve the automatic recognition problem, though several problems remain, such as disambiguating words that belong to more than one word class. Our distribution statistics for Badj phrases are based on five corpora totaling about 0.35 billion characters. The study of linear adjacency features is also based on the Contemporary Mainland Writers' Works Corpus, and no one has previously done comparable work. One purpose of our work is to show a new method of studying the Chinese language. Yan Wei (Linguistics and Applied Linguistics), directed by Professor Song Rou
【Key words】 Badj phrases; automatic recognition methods; distribution statistics; linear adjacency features;
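- 【Online publication contributor】 Beijing Language and Culture University 【Online publication year and issue】 2005, Issue 05
- 【Classification code】 H042
- 【Download count】 136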
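
The abstract describes recognition that favors recall and then uses lexicalized rules to discard false matches, followed by distribution counts per corpus. The thesis's actual rules and lexicon are not given in this record, so the following is only a minimal sketch of that general idea. The adjective list `ADJECTIVES`, the exclusion table `BLOCKED_CONTEXTS`, and the function names `find_badj` and `count_badj` are all hypothetical, chosen for illustration.

```python
from collections import Counter

# Hypothetical toy lexicon; the thesis's real adjective list is not given here.
ADJECTIVES = ("高兴", "容易", "好", "多")

# Hypothetical lexicalized exclusions, keyed by adjective. Each entry is a
# (left context, right context) pair; a match in that context is rejected.
BLOCKED_CONTEXTS = {
    "好": [("", "意思")],  # 不好意思 is a fixed expression, not 不 + 好
    "多": [("差", "")],    # 差不多 is a single word, not 不 + 多
}


def find_badj(text):
    """Return (offset, phrase) pairs for 不 + adjective candidates in text."""
    by_length = sorted(ADJECTIVES, key=len, reverse=True)  # longest match first
    hits = []
    start = text.find("不")
    while start != -1:
        rest = text[start + 1:]
        adj = next((a for a in by_length if rest.startswith(a)), None)
        if adj is not None:
            end = start + 1 + len(adj)
            # Reject the candidate if any listed context surrounds it.
            blocked = any(
                text[:start].endswith(left) and text[end:].startswith(right)
                for left, right in BLOCKED_CONTEXTS.get(adj, [])
            )
            if not blocked:
                hits.append((start, "不" + adj))
        start = text.find("不", start + 1)
    return hits


def count_badj(corpora):
    """Count recognized phrases per corpus; corpora maps a name to its text."""
    return {
        name: Counter(phrase for _, phrase in find_badj(text))
        for name, text in corpora.items()
    }


sample = "他不高兴,我也不好意思说,其实差不多,这并不容易,味道不好。"
print([phrase for _, phrase in find_badj(sample)])
```

On the sample sentence the sketch keeps 不高兴, 不容易, and the final 不好, and rejects the matches inside 不好意思 and 差不多. A real system would work on segmented, part-of-speech-tagged text and would still need the human filtering step the abstract mentions.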