An End-to-End Incomplete Supervision Method for Table Cell Classification

【Authors】 HAO Xinyu; ZHOU Jiantao; WANG Hao

【Affiliations】 College of Computer Science, Inner Mongolia University; Engineering Research Center of Ecological Big Data (Inner Mongolia University), Ministry of Education

【Abstract】 In the era of big data, explosively growing unstructured data contains a large amount of valuable information, and identifying and extracting that information is becoming increasingly important. Tables are a typical form of unstructured data with high value density. To identify the functional structure of tables while improving the generality of the model and the usability of its results, this paper proposes an end-to-end incomplete supervision method for table cell classification. A feature selection scheme based on visually visible features is designed to improve generality, and a rule-based automatic correction algorithm is proposed to improve cell classification performance. Users can then correct the results further, and the corrected results are fed back into model training as additional training data, improving the model's adaptability to different scenarios. Finally, the method is implemented as an end-to-end tool, which improves convenience and allows the corrected data to be exported directly for downstream tasks. Experimental results show that the proposed method outperforms the baseline method on multiple metrics and improves the usability of the results to a certain extent.

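【Funding】 Supported by the National Natural Science Foundation of China (No. 62162046); the Inner Mongolia Science and Technology Key Project (No. 2021GG0155); the Major Project of the Inner Mongolia Natural Science Foundation (No. 2019ZD15); and the Inner Mongolia Natural Science Foundation (No. 2019GG372)
  • 【Source】 Computer & Digital Engineering (计算机与数字工程), 2023, Issue 01
  • 【Classification Code】 TP311.13
  • 【Downloads】 9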