Research and Realization of Web Information Extraction Based on DOM
【Author】 Li Meng (李猛)
【Author information】 Dalian University of Technology, Control Theory and Control Engineering, 2008, Master's thesis
【Abstract】 The Internet has become an important channel for spreading and sharing information worldwide, but the explosive growth of its data makes it increasingly difficult for users to find the information they need. How to extract useful information from the vast amount of Web data has therefore become a research focus. Many Web information extraction methods have been proposed at home and abroad in recent years, each addressing particular aspects of the problem. Although they achieve good results overall, they still require too many sample pages and too much manual work. To address these shortcomings, this thesis proposes a semi-automatic method for Web information extraction. The main contributions are as follows.
First, similar pages are acquired by combining URL structure comparison with the Simple_Tree_Matching algorithm. The hyperlinks that a Web crawler collects within a site are pre-filtered by URL comparison, which discards pages that do not satisfy the matching condition. The remaining pages are then post-filtered with Simple_Tree_Matching to obtain the final set of similar pages. Page similarity is thus measured on both the URL and the actual page structure, which compensates for the weakness of acquiring similar pages by URL alone.
Second, a DOM-based extraction method is proposed that obtains the target information by comparing the data items labeled by the user with the data items in a test page. The sample page is first parsed and the features of the data items of interest are extracted. When a test page is input, the features of the labeled items are compared with those of all data items in the test page, and the most similar items are extracted as the result. This overcomes the poor adaptability of traditional DOM-based extraction methods to changes in page structure.
Third, a detection strategy is proposed for pages with multiple records, especially pages whose number of records is not fixed. The similarity matrices between the user-labeled records and the records in the test page are calculated, and the changes in these matrices are used to locate the boundaries between records and thereby obtain all records, which reduces the difficulty of extraction.
Finally, based on the above analysis, a prototype DOM-based Web information extraction system is designed and implemented. The system provides a visual, interactive user interface that is easy to operate, and it completes the extraction task by combining different functional modules. Experiments on the IMDB, RISE and EXALG datasets show that, when trained on a single sample page, the proposed method extracts data from Web pages effectively. The system still performs well even when some items are missing from the pages.
【Key words】 Web Information Extraction; DOM; Characteristic Comparison; Detection Strategy;
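The two-stage similar-page filter described in the abstract (URL comparison first, Simple_Tree_Matching second) might be sketched roughly as below. This is an illustrative sketch and not the thesis implementation: the `Node` class, the URL matching rule in `url_structure_match` (same host, same path depth, same query keys), and the normalization used in `similarity` are all assumptions introduced here, since the abstract does not specify them. Only the recursive dynamic program in `simple_tree_matching` follows the standard published form of the algorithm.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag name plus ordered children."""
    tag: str
    children: list[Node] = field(default_factory=list)

    def size(self) -> int:
        """Number of nodes in the subtree rooted here."""
        return 1 + sum(child.size() for child in self.children)


def url_structure_match(a: str, b: str) -> bool:
    """Assumed pre-filter rule: same host, same path depth, same query keys."""
    ua, ub = urlsplit(a), urlsplit(b)
    path_a = [seg for seg in ua.path.split("/") if seg]
    path_b = [seg for seg in ub.path.split("/") if seg]
    keys_a = sorted(p.split("=")[0] for p in ua.query.split("&") if p)
    keys_b = sorted(p.split("=")[0] for p in ub.query.split("&") if p)
    return (ua.netloc == ub.netloc
            and len(path_a) == len(path_b)
            and keys_a == keys_b)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the maximum order-preserving matching between two trees."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j]: best matching between the first i children of a
    # and the first j children of b.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i][j - 1],
                table[i - 1][j],
                table[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1  # +1 counts the matched roots


def similarity(a: Node, b: Node) -> float:
    """Assumed normalization: matched nodes over the average tree size."""
    return 2 * simple_tree_matching(a, b) / (a.size() + b.size())


def page(items: int) -> Node:
    """Build a small example page whose list holds `items` entries."""
    return Node("html", [Node("body", [
        Node("div", [Node("h1"), Node("p")]),
        Node("ul", [Node("li") for _ in range(items)]),
    ])])
```

A crawler would apply `url_structure_match` to every collected hyperlink and compute `similarity` only for the pages that pass, keeping those above some threshold. The threshold is a tuning parameter and is not given in the abstract.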