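Algorithm of n-gram Statistics for Arbitrary n and Knowledge Acquisition Based on Statistics
(Original title: 大规模汉语语料库中任意n的n-gram统计算法及知识获取方法; literally, "An n-gram statistics algorithm for arbitrary n in large-scale Chinese corpora and a knowledge acquisition method")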
【Abstract】 This paper proposes and implements an algorithm for character-level and word-level n-gram statistics with arbitrary n over a large-scale Chinese corpus. In a single run, the algorithm counts all character-level and word-level n-grams of every order up to a chosen n (n = 256 in this paper). It reduces the exponential space cost of conventional n-gram counting to a linear one that does not depend on the n-gram order. Building on these statistics, the paper also computes the information entropy of Chinese and studies knowledge acquisition at the character and word levels. The algorithm and the results of this study have been applied in a machine translation system developed by the authors.
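The abstract does not describe how the algorithm works internally, so the following is only a minimal sketch of one well-known way to count n-grams of every order up to a limit with an index that grows linearly with the corpus: sort the suffix start positions once, then read off runs of equal prefixes. It is not presented as the paper's method. The names `ngram_counts` and `entropy` are hypothetical, and the naive sort key (slicing each suffix) is a simplification that a large-scale implementation would replace with in-place comparison.

```python
import math


def ngram_counts(text, max_n):
    """Yield (gram, count) for every n-gram with 1 <= n <= max_n.

    `text` is a str (character level) or a tuple of tokens (word level).
    The persistent index holds one integer per position, regardless of max_n.
    """
    # Sort suffix start positions by their first max_n symbols.
    order = sorted(range(len(text)), key=lambda i: text[i:i + max_n])

    # open_grams[k] = [prefix of length k + 1 of the previous suffix, run count]
    open_grams = []
    for i in order:
        suffix = text[i:i + max_n]

        # How many prefixes does this suffix share with the previous one?
        keep = 0
        while (keep < len(open_grams) and keep < len(suffix)
               and open_grams[keep][0] == suffix[:keep + 1]):
            keep += 1

        # Prefixes that are not shared have ended their run: emit them.
        for gram, count in open_grams[keep:]:
            yield gram, count
        del open_grams[keep:]

        # Shared prefixes occur once more; new prefixes start a run at 1.
        for slot in open_grams:
            slot[1] += 1
        for n in range(keep + 1, len(suffix) + 1):
            open_grams.append([suffix[:n], 1])

    # Emit whatever is still open after the last suffix.
    for gram, count in open_grams:
        yield gram, count


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts)
```

For example, `dict(ngram_counts("abab", 2))` gives counts for `a`, `b`, `ab`, and `ba` in one scan, and `entropy(c for g, c in ngram_counts(text, 1))` estimates the character-level entropy of `text`. Because sorting places all occurrences of a prefix next to each other, the scan needs only `max_n` open counters at a time, which is why the working state does not grow with the order being counted.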
【Key words】 n-gram; statistics; information entropy; knowledge acquisition
- 【Source】 情报学报 (Journal of the China Society for Scientific and Technical Information), 1997, Issue 01
- 【Classification No.】 TP391.1
- 【Times cited】 17
- 【Downloads】 346