A Content Extraction Method for Web News and Blogs

【Authors】 WANG Jinlin; FANG Binxing; YU Haining; MA Xueyang

【Affiliation】 School of Computer Science and Technology, Harbin Institute of Technology

【Abstract】 The Web has profoundly changed social life, and news and blog sites, as representative sources of information, give people a convenient way to obtain it. In practical Web analysis, irrelevant information such as advertisements and article recommendations hinders the extraction of the main content from news and blog pages. This paper proposes CEVC, a content extraction method for news and blogs that does not rely on extraction templates. After defining valid characters, CEVC recursively computes over the DOM tree of a web page to determine the most representative child node, which is taken as the main content node. Chinese and English web pages were selected as the data set, and performance metrics for news and blog content extraction were defined. Comparative experiments show that CEVC outperforms existing methods in Web content extraction.
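
The abstract gives only the outline of the method: define valid characters, compute recursively over the DOM tree, and select the most representative child node. The following Python sketch illustrates that general idea and should not be read as the authors' CEVC algorithm. Everything specific in it is an assumption made for illustration: valid characters are taken to be alphanumeric characters (which includes CJK characters under `str.isalnum`) outside `script` and `style` elements, and a child is treated as "most representative" when it holds at least a `dominance` share of its parent's valid characters. The names `Node`, `TreeBuilder`, `valid_chars`, `main_content_node`, and `extract` are hypothetical.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}  # assumption: text in these never counts
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # no closing tag


class Node:
    """A minimal DOM-like node (hypothetical helper, not from the paper)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = []  # text chunks that sit directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text.append(data)


def valid_chars(node):
    """Recursively count valid characters in the subtree rooted at node."""
    if node.tag in SKIP_TAGS:
        return 0
    own = sum(ch.isalnum() for chunk in node.text for ch in chunk)
    return own + sum(valid_chars(child) for child in node.children)


def main_content_node(node, dominance=0.6):
    """Descend while a single child holds most of the valid characters."""
    total = valid_chars(node)
    best = max(node.children, key=valid_chars, default=None)
    if best is None or total == 0 or valid_chars(best) < dominance * total:
        return node
    return main_content_node(best, dominance)


def extract(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return main_content_node(builder.root)


SAMPLE = """<html><body>
<div id="nav"><a href="/">Home</a> <a href="/ads">Sponsored</a></div>
<div id="post"><p>First paragraph of the story with enough words in it.</p>
<p>Second paragraph of the story, also reasonably long.</p>
<p>Third paragraph that closes the story.</p></div>
<div id="related"><a href="/x">Another article</a></div>
</body></html>"""

node = extract(SAMPLE)
print(node.tag, node.attrs.get("id"))  # prints: div post
```

In this toy page the descent stops at the `post` container because none of its paragraphs alone reaches the assumed 0.6 share, while the navigation and recommendation blocks are left behind higher up the tree. The sketch recomputes counts at every level for brevity; a practical version would cache them per node.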

【Key words】 Web analytics; content extraction; DOM tree
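【Funding】 National Key Research and Development Program of China (2016QY03D0501, 2017YFB0803300); National Natural Science Foundation of China (61601146, 61732022); Sichuan Science and Technology Program (2019YFSY0049)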
  • 【Source】 Intelligent Computer and Applications, 2020, Issue 07
  • 【Classification No.】 TP393.092
  • 【Downloads】 68