Research on Encoding Conversion and Proofreading Methods for Mongolian Corpora (original title: 蒙古文语料编码转换与校对方法研究)
【Author】 乌云塔娜;
【Author information】 Inner Mongolia University, Computer Technology (professional degree), 2018, Master's thesis
【Abstract】 In the information age, the dissemination of information and the sharing of resources have become electronic and networked, and most information is disseminated and shared as text. Mongolian-language information must adapt to this development as well. As Mongolian information processing has developed, several Mongolian encodings have appeared, such as Saiyin, Mengkeli, Ming'antu, and Intelligent Coding. The fonts of these encodings assign different codes to the Mongolian glyphs, and the encodings are mutually incompatible: if the matching Mongolian font is not installed, Mongolian text on a computer displays as garbled characters and cannot be used. As a result, Mongolian information resources cannot be disseminated, shared, or studied. The most effective solution is encoding conversion, that is, converting everything into one unified encoding.

This thesis has two parts: encoding conversion and encoding proofreading. The conversion part first analyzes and compares in detail two widely used encodings, Mengkeli and Intelligent Coding, together with the Mongolian international standard encoding. It then converts Mengkeli and Intelligent Coding text into the Mongolian international standard encoding. The conversion method is based on the Mongolian presentation-form (variant glyph) character set and on the rules for using control characters. During conversion, the encoding type is first determined from the code range and from the positional forms that codes take within a word. Once the type is known, Mengkeli text is converted with the Mengkeli-to-standard algorithm, and Intelligent Coding text is converted with the Intelligent-Coding-to-standard algorithm.

Nonstandard Mongolian encodings such as Mengkeli and Intelligent Coding are glyph-based (shape) codes, whereas the standard encoding is a phonetic code. Because these codes do not correspond one-to-one with the international standard encoding, conversion involves many uncertain cases, cannot be fully correct, and produces erroneous codes. Keyboard input also introduces encoding errors. Text in the Mongolian international standard encoding, whether produced by conversion or by typing, therefore needs to be proofread.

The proofreading method in this thesis is based on the Mongolian rule of harmony between masculine and feminine vowels. The rule is that masculine and feminine vowels must not be mixed within one word: if the first vowel in a word is masculine, every later vowel in that word must also be masculine, and if the first vowel is feminine, every later vowel must also be feminine. A code that violates the rule is replaced by the corresponding correct code. In the implementation, one algorithm decides whether the current code is a vowel or a consonant; a second decides whether a vowel is masculine or feminine; a third obtains the first vowel of the word; a fourth classifies each later vowel as correct or erroneous; and finally the word-proofreading algorithm replaces the erroneous codes with the correct ones.
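The vowel-harmony check described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes that the standard encoding is the Unicode Mongolian block, that A, O, U are masculine and E, OE, UE are feminine, that I and EE are neutral and never trigger a correction, and that an erroneous vowel is replaced by its opposite-gender counterpart (A/E, O/OE, U/UE). All function names are hypothetical.

```python
# Mongolian vowels in the Unicode Mongolian block (U+1820..U+1827):
# A, E, I, O, U, OE, UE, EE.
A, E, I, O, U, OE, UE, EE = (chr(c) for c in range(0x1820, 0x1828))

MASCULINE = {A, O, U}
FEMININE = {E, OE, UE}
VOWELS = MASCULINE | FEMININE | {I, EE}  # I and EE treated as neutral (assumption)

# Assumed pairing of each vowel with its opposite-gender counterpart.
TO_MASCULINE = {E: A, OE: O, UE: U}
TO_FEMININE = {masc: fem for fem, masc in TO_MASCULINE.items()}


def is_vowel(ch):
    """Return True if the code point is one of the eight vowel letters."""
    return ch in VOWELS


def vowel_gender(ch):
    """Return 'masculine', 'feminine', or None for neutral vowels and consonants."""
    if ch in MASCULINE:
        return "masculine"
    if ch in FEMININE:
        return "feminine"
    return None


def first_gendered_vowel(word):
    """Return the first masculine or feminine vowel in the word, or None."""
    for ch in word:
        if vowel_gender(ch) is not None:
            return ch
    return None


def proofread_word(word):
    """Replace vowels whose gender disagrees with the first gendered vowel."""
    first = first_gendered_vowel(word)
    if first is None:
        return word  # no gendered vowel, nothing to check
    fix = TO_MASCULINE if vowel_gender(first) == "masculine" else TO_FEMININE
    return "".join(fix.get(ch, ch) for ch in word)


# Usage: NA + A + MA + E mixes a masculine and a feminine vowel,
# so the trailing E is replaced by A.
fixed = proofread_word("\u1828\u1820\u182e\u1821")
print([hex(ord(c)) for c in fixed])  # ['0x1828', '0x1820', '0x182e', '0x1820']
```

A purely rule-based pass like this would also rewrite words that legitimately break vowel harmony, such as loanwords and compounds, so in practice it would likely need an exception list or a dictionary check.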
【Key words】 Mongolian character; Mongolian coding; code conversion; proofreading;
- 【Online publication contributor】 Inner Mongolia University 【Online publication year and issue】 2019, Issue 02
- 【Classification code】 H212