Text Extraction Method Based on Page Text and HTML Tag Density
【Abstract】 Most news web pages contain content unrelated to the article body, such as recommendations and advertisements, and this noise interferes considerably with extracting the body text. The text extraction method based on text and symbol density (TSD) does not account for the effect of paragraph tags on extraction quality. To address this, the paper proposes a web page body-text extraction method based on text and HTML tag density (TTD). By statistically analyzing a page's text content and tags, the method extracts the body text quickly; it is suited to common news websites and generalizes well. Experiments show that it is considerably more accurate than commonly used existing methods, making it highly practical.
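The abstract does not give the TTD scoring formula, so the following is only a rough sketch of the general idea of density-based extraction: score each element by how much text it holds per tag, and favor elements that contain paragraph tags. The score `(text / tags) * (1 + paragraphs)`, the names `Node`, `TreeBuilder`, and `best_block`, and the `SAMPLE` page are all assumptions made for illustration, not the authors' method.

```python
from html.parser import HTMLParser

# Tags with no closing tag, and tags whose content is never body text.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "wbr", "source"}
SKIP_TAGS = {"script", "style"}


class Node:
    """One element of a simplified DOM tree."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.own_text = 0  # characters of text directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a Node tree, ignoring void elements and script/style content."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        # Close the nearest open element with this tag name, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if not self.skip_depth:
            self.current.own_text += len(data.strip())


def best_block(html):
    """Return the element with the highest illustrative density score."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best_score, best_node = 0.0, None

    def visit(node):
        nonlocal best_score, best_node
        text, tags, paras = node.own_text, 1, int(node.tag == "p")
        for child in node.children:
            t, g, p = visit(child)
            text, tags, paras = text + t, tags + g, paras + p
        # Hypothetical score: text per tag, boosted by paragraph count.
        score = (text / tags) * (1 + paras)
        if node is not builder.root and score > best_score:
            best_score, best_node = score, node
        return text, tags, paras

    visit(builder.root)
    return best_node


PARA = "Sentence of article text. " * 4
SAMPLE = f"""
<html><body>
<div id="nav"><a href="/a">Home</a><a href="/b">News</a><a href="/c">Ads</a></div>
<div id="content"><p>{PARA}</p><p>{PARA}</p><p>{PARA}</p></div>
<div id="rec"><ul><li><a href="/x">Related story</a></li>
<li><a href="/y">Sponsored</a></li></ul></div>
</body></html>
"""

print(best_block(SAMPLE).attrs.get("id"))  # prints "content"
```

On this sample page the link-heavy navigation and recommendation blocks score low because they hold little text per tag, while the block of consecutive paragraphs scores highest. A real extractor would also need thresholds and handling for malformed markup, which this sketch leaves out.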
- 【Source】 Journal of Shenyang Ligong University (沈阳理工大学学报), 2022, Issue 04
- 【Classification Code】 TP391.1
- 【Downloads】 164