基于文本及HTML标签密度的网页正文提取

Text Extraction Method Based on Page Text and HTML Tag Density

【Author】 YANG Dawei; WANG Shinian; BAO Liyan; YAO Hongli; LIU Chang (Shenyang Ligong University)

【Abstract】 Most news and information web pages contain content unrelated to the article body, such as recommendations and advertisements. This noise substantially interferes with extracting the body text. The existing extraction method based on text and symbol density (TSD) does not consider the influence of paragraph tags on extraction quality. To address this shortcoming, this paper proposes a web page body-text extraction method based on text and HTML tag density (TTD). By statistically analyzing a page's text content and tags, the method extracts the body text quickly. It is suitable for common news and information websites and generalizes well. Experiments show that the method is considerably more accurate than currently common methods and is highly practical.

【Key words】 tag density; HTML tag; web page; text extraction

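【Funding】 Scientific Research Fund Project of the Education Department of Liaoning Province (LG201915); Shenyang Ligong University Research and Innovation Team Building Program (SYLUTD202105)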

【Source】 沈阳理工大学学报 (Journal of Shenyang Ligong University), 2022, Issue 04

【Classification number】 TP391.1
【Downloads】 164
