A CHINESE DOCUMENT CATEGORIZATION SYSTEM WITHOUT DICTIONARY SUPPORT AND SEGMENTATION PROCESSING
【Abstract】 This paper reports a Chinese document categorization system that requires neither dictionary support nor word segmentation. The system classifies documents using N-gram information rather than Chinese words, which frees categorization from its dependence on dictionaries and segmentation processing and makes it independent of domain and time. An open architecture makes the system easy to extend functionally and to improve in performance. Test results show that the system achieves satisfactory categorization performance.
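The idea of classifying on N-gram information without a dictionary or word segmenter can be illustrated with a minimal sketch. It is not the system described in the paper: the function `char_ngrams`, the class `NgramNaiveBayes`, the choice of N up to 2, the Laplace smoothing constant, and the toy training sentences are all assumptions made for illustration, and no feature selection step is included.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_max=2):
    """Return all character n-grams of length 1..n_max, ignoring whitespace.

    No dictionary or word segmentation is involved: the text is simply
    scanned with a sliding window over its characters.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(chars) - n + 1)]


class NgramNaiveBayes:
    """Hypothetical multinomial naive Bayes over character n-grams."""

    def __init__(self, n_max=2, alpha=1.0):
        self.n_max = n_max          # largest n-gram length (assumption)
        self.alpha = alpha          # Laplace smoothing constant (assumption)
        self.class_docs = Counter()
        self.gram_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            grams = char_ngrams(text, self.n_max)
            self.class_docs[label] += 1
            self.gram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n_max)
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.gram_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)  # log prior
            for g in grams:
                if g in self.vocab:  # n-grams never seen in training are skipped
                    score += math.log((counts[g] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Toy data, invented for this example only.
    docs = ["股票市场上涨", "银行利率下调", "足球比赛开始", "篮球联赛冠军"]
    labels = ["finance", "finance", "sports", "sports"]
    model = NgramNaiveBayes().fit(docs, labels)
    print(model.predict("足球联赛"))
```

Because the features are raw character sequences, new terms need no dictionary update before they can contribute to classification, which is the property the abstract describes as domain and time independence.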
【Key words】 Chinese text categorization; N-gram information; feature selection; Bayesian classification; kNN method
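【Funding】 China Postdoctoral Science Foundation; National "863" High-Tech Research and Development Program of China (863-306-ZT04-02-2); National Natural Science Foundation of China (60003016)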
- 【Source】 Journal of Computer Research and Development (计算机研究与发展), 2001, Issue 07
- 【Classification number】 TP391.1
- 【Times cited】 84
- 【Downloads】 333