基于OCR与词形状编码的英文扫描文档检索
Scanned English Document Retrieval Based on OCR and Word Shape Coding
【Abstract】 Two commonly used approaches to scanned document retrieval are analyzed: retrieval based on optical character recognition (OCR) and retrieval based on word shape coding. A strategy that combines the two approaches according to recognition confidence is proposed. In addition, a word shape coding method based on typographic characteristics and stroke features is presented; it is highly tolerant of font variation. Keyword retrieval experiments comparing the various indexing methods show that the proposed method performs favorably.
【Key words】 Word Shape Coding; Optical Character Recognition (OCR); Evaluation of Recognition Confidence; Document Retrieval
【Funding】 Supported by the National Natural Science Foundation of China (No. 60602031)
- 【Source】 模式识别与人工智能 (Pattern Recognition and Artificial Intelligence), 2009, Issue 03
- 【Classification No.】 TP391.41
- 【Times cited】 10
- 【Downloads】 352