Research on Noise Reduction and Duplicate Webpage Removal Methods for an Accident News Corpus

【Author】 Luo Yonglian (罗永莲)

【Advisor】 Zhang Yongkui (张永奎)

【Author information】 Shanxi University, Computer Application Technology, 2005, Master's

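【Summary】 For news webpages, what should be extracted is the page's main content, yet a page contains a great deal of noise in addition to that content. The noise and the main content are usually interwoven within the page structure built by HTML. Because HTML is a presentation-oriented language, information about the structure of a page's content is hard to recover once editing is finished. We also observe, however, that webpages contain rich HTML markup and that emergency-event news (accident news) has characteristics of its own. Building on earlier research, we therefore mine the structural features of web pages, make full use of HTML tags and of the features of emergency-event news, and, focusing on how page editors visually style text, carry out an exploratory study of extracting a page's title, body text, publication date, and other content. In web search results, users often receive redundant pages with identical content, many of which result from reposting between websites. Such pages not only waste storage but also cause considerable inconvenience for information retrieval and other text processing. Based on the time sensitivity (perishability) of emergency events, this thesis partitions pages into "groups" by publication date and, after noise removal, extracts information from specific regions of each page to perform deduplication, which greatly reduces computation time and improves deduplication accuracy. Building on the classical TFIDF (Term Frequency Inverse Document Frequency) weighting method, and by analyzing how event news pages are duplicated and how different feature units differ in their ability to represent a text, we adopt mixed character and word features to represent texts effectively, and we analyze and improve the weight calculation accordingly. The main contributions of this thesis are: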

【Abstract】 For a news webpage, what should be extracted is its main content, but the page also contains a large amount of noise besides that content. Usually both are unified in the structure constructed by HTML. HTML is a language of visual presentation, and it is very difficult to recover information about the structure of a webpage after editing is finished. At the same time, we find that webpages contain abundant HTML tags and that accident news has its own characteristics, so, building on previous research, we mine the web page structure and make full use of HTML tags. We study the extraction of the webpage title, body text, publication date, and so on from the perspective of how the editor styles the text. Because of reposting between websites, users often get redundant pages with the same content in web search results. These pages not only waste storage resources but also bring a great deal of inconvenience to information retrieval and other text processing. The main work of this thesis is to divide pages into groups by publication date, based on the perishability of accident events, and to delete duplicated webpages by extracting information from specific areas after noise reduction. On the basis of the classical TFIDF (Term Frequency Inverse Document Frequency) method, we adopt mixed character and word features to represent text
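
The deduplication idea summarized in the abstract (group pages by publication date, then compare only within each group using TFIDF-weighted mixed character and word features) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the function names, the smoothed IDF formula, the cosine similarity measure, the 0.9 threshold, and the use of whitespace-delimited tokens as a stand-in for a real Chinese word segmenter. The thesis's own improved weighting is not reproduced.

```python
import math
from collections import Counter, defaultdict


def mixed_features(text):
    """Count character unigrams plus word tokens (hypothetical feature scheme).

    Whitespace splitting stands in for a real word segmenter (assumption).
    """
    chars = ["c:" + ch for ch in text if not ch.isspace()]
    words = ["w:" + tok for tok in text.split()]
    return Counter(chars + words)


def tfidf_vectors(docs):
    """Build sparse TFIDF vectors with a smoothed IDF (assumed formula)."""
    counts = [mixed_features(d) for d in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each feature counted once per document
    n = len(docs)
    return [
        {f: tf * (math.log((1 + n) / (1 + df[f])) + 1.0) for f, tf in c.items()}
        for c in counts
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def deduplicate(pages, threshold=0.9):
    """Keep the first page of each near-duplicate cluster within a date group.

    pages: list of (page_id, publish_date, main_text) tuples, where main_text
    is assumed to be the already noise-free body text.
    """
    groups = defaultdict(list)
    for page_id, date, text in pages:
        groups[date].append((page_id, text))

    kept = []
    for date in sorted(groups):
        members = groups[date]
        vectors = tfidf_vectors([text for _, text in members])
        kept_vectors = []
        for (page_id, _), vec in zip(members, vectors):
            # Pages are compared only against pages kept from the same date.
            if all(cosine(vec, kv) < threshold for kv in kept_vectors):
                kept.append(page_id)
                kept_vectors.append(vec)
    return kept


pages = [
    ("a", "2005-01-01", "fire breaks out in factory downtown"),
    ("b", "2005-01-01", "fire breaks out in factory downtown"),
    ("c", "2005-01-01", "earthquake strikes coastal region overnight"),
    ("d", "2005-01-02", "fire breaks out in factory downtown"),
]
print(deduplicate(pages))  # → ['a', 'c', 'd']
```

Page "d" survives even though its text matches "a", because in this sketch comparisons never cross date groups; that restriction is what shrinks the number of pairwise comparisons.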

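  • 【Online publication contributor】 Shanxi University
  • 【Online publication year and issue】 2005, Issue 07
  • 【Classification number】 TP393.092
  • 【Times cited】 1
  • 【Downloads】 268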