A Bootstrapping-Based Text Categorization Model (基于Bootstrapping的文本分类模型)
Semi-Supervised Text Categorization Using Bootstrapping
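【Abstract (translated from the Chinese)】 This paper proposes a bootstrapping-based text categorization model. The model uses a maximum entropy model as its classifier. Starting from a small seed set, it automatically learns additional texts as new seed samples, and repeats this process to improve the text categorization performance of the maximum entropy classifier. The paper introduces a weight factor that adjusts the weight of the new seed samples during classifier training. Experimental results show that, given the same manually labeled training corpus, the bootstrapping-based model has a clear advantage over the conventional text categorization model: using a seed training set of only 100 documents per category, it reaches an F1 of 70.56%, which is 4.70% higher than the conventional model. With an appropriate weight factor, the model further improves the training of the classifier.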
【Abstract】 This paper proposes a semi-supervised text categorization model based on bootstrapping. The system uses a maximum entropy model as the text classifier. Starting from a small set of seed training samples, it automatically labels samples drawn from an unlabeled collection and adds them as new seed training samples. A weight factor adjusts the weight of the new seed samples in subsequent training. Experimental results show that, given the same labeled documents, the proposed system outperforms the conventional system: it achieves an F1 of 70.56% using only 100 labeled documents per category, 4.7% higher than the conventional system, and it matches the conventional system's performance with 50% or fewer of the training samples. The results also show that the weight factor improves performance.
【Key words】 computer application; Chinese information processing; text categorization; maximum entropy; weight factor
- 【Source】 Journal of Chinese Information Processing (中文信息学报), 2005, Issue 02
- 【Classification code】 TP391.1
- 【Times cited】 23
- 【Downloads】 525
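
The bootstrapping loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how new seed samples are selected, what value the weight factor takes, or how the maximum entropy model is trained. The confidence `threshold`, the `weight_factor` value of 0.5, the number of `rounds`, the plain gradient-ascent trainer, and all function names below are assumptions made for illustration only.

```python
import math
from collections import defaultdict

def predict_proba(w, tokens, labels):
    """Maximum entropy (softmax) class probabilities for a bag of tokens."""
    scores = {y: sum(w.get((y, t), 0.0) for t in tokens) for y in labels}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train_maxent(samples, labels, epochs=200, lr=0.5):
    """Fit the model by gradient ascent on the weighted log-likelihood.

    samples: list of (tokens, label, weight) triples.
    Returns a dict mapping (label, token) -> parameter value.
    """
    w = defaultdict(float)
    total = sum(weight for _, _, weight in samples)
    for _ in range(epochs):
        grad = defaultdict(float)
        for tokens, gold, weight in samples:
            probs = predict_proba(w, tokens, labels)
            for y in labels:
                # Observed minus expected feature count, scaled by sample weight.
                delta = weight * ((1.0 if y == gold else 0.0) - probs[y])
                for t in tokens:
                    grad[(y, t)] += delta
        for key, g in grad.items():
            w[key] += lr * g / total
    return w

def bootstrap(seeds, unlabeled, labels,
              weight_factor=0.5, threshold=0.8, rounds=3):
    """Grow the training set from hand-labeled seeds (all defaults are assumed).

    seeds: list of (tokens, label) pairs, trained with weight 1.0.
    unlabeled: list of token lists.
    Automatically labeled samples are added with weight `weight_factor`.
    """
    train = [(tokens, y, 1.0) for tokens, y in seeds]
    pool = list(unlabeled)
    model = train_maxent(train, labels)
    for _ in range(rounds):
        remaining, added = [], 0
        for tokens in pool:
            probs = predict_proba(model, tokens, labels)
            best = max(probs, key=probs.get)
            if probs[best] >= threshold:
                # Confident prediction: promote to a down-weighted seed sample.
                train.append((tokens, best, weight_factor))
                added += 1
            else:
                remaining.append(tokens)
        pool = remaining
        if added == 0:
            break  # nothing new was learned in this round
        model = train_maxent(train, labels)
    return model, train

# Toy usage with invented data.
labels = ["sports", "finance"]
seeds = [(["match", "goal", "team"], "sports"),
         (["stock", "market", "bank"], "finance")]
unlabeled = [["team", "goal", "coach"], ["bank", "stock", "loan"]]
model, train = bootstrap(seeds, unlabeled, labels)
```

In this sketch the weight factor enters training as a per-sample multiplier on the log-likelihood gradient, so automatically labeled documents influence the classifier less than hand-labeled seeds. That is one plausible reading of "adjusting the weight of new seed samples", not a confirmed detail of the paper.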