智能Web广告爬虫系统研究

Research on Intelligent Web Advertising Crawler System

【Author】 Li Ding (李定)

【Advisor】 Ye Yunming (叶允明)

【Author Information】 Harbin Institute of Technology, Computer Science and Technology, 2013, Master's thesis

【Abstract】 In recent years, as the Internet has come to shape people's daily lives ever more deeply, it has become a major advertising medium alongside television and newspapers. Thanks to its wide coverage and strong interactivity, web advertising has attracted a large number of advertisers to market online. A very large volume of advertising data is published on the Internet, and collecting it is worthwhile, yet no collector currently targets web advertising data. This thesis proposes and designs a web advertising crawler system dedicated to collecting advertising data from the Internet. The work covers three aspects:

(1) A crawling strategy for web advertising data. The strategy computes a weight for each URL seed and schedules downloads in order of weight. Taking into account the types of advertisement the crawler targets and the ways web ads are delivered, the thesis proposes a weight calculation method for downloaded pages and one for seed links: the weight of a downloaded page is computed first, then combined with global statistical knowledge to compute the weights of seed links.

(2) An extraction method for web advertising information, designed after observing and analyzing the advertising data in a large number of web pages of different types. Exploiting the locality and aggregation that ads exhibit within a page, the method uses a clustering algorithm to group all hyperlinks in a page into hyperlink blocks, then applies heuristic rules to determine the class of each block. If a block is judged to be an advertising block, the ads in it are extracted.

(3) An intelligent web advertising crawler system, designed and implemented on the basis of the results above. Starting from preset URL seeds, the system automatically downloads web pages from the Internet and then extracts the advertising data they contain.

Experimental results show that the system's crawling strategy collects advertising data more efficiently than breadth-first and depth-first strategies, and that the extraction algorithm extracts ads from web pages accurately.
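The weight-ordered crawling idea in item (1) can be illustrated with a small priority-queue frontier. This is only a minimal sketch under stated assumptions: the abstract does not give the thesis's weight formulas, so `link_weight`, its `alpha` parameter, and the `site_ad_rate` statistic are hypothetical placeholders, and the class name `WeightedFrontier` and the example URLs are invented for illustration.

```python
import heapq
import itertools


class WeightedFrontier:
    """URL frontier that always yields the highest-weight URL next."""

    def __init__(self):
        self._heap = []                    # entries: (-weight, tie_breaker, url)
        self._seen = set()                 # every URL that was ever enqueued
        self._counter = itertools.count()  # preserves insertion order on ties

    def push(self, url, weight):
        """Enqueue a URL once; return False if it was already enqueued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the weight is negated to pop the largest first.
        heapq.heappush(self._heap, (-weight, next(self._counter), url))
        return True

    def pop(self):
        """Remove and return (url, weight) for the highest-weight URL."""
        neg_weight, _, url = heapq.heappop(self._heap)
        return url, -neg_weight

    def __len__(self):
        return len(self._heap)


def link_weight(page_weight, site_ad_rate, alpha=0.5):
    """Hypothetical link weight: a blend of the weight of the downloaded
    page the link was found on and a global statistic. This is an assumed
    stand-in, not the formula used in the thesis."""
    return alpha * page_weight + (1 - alpha) * site_ad_rate


frontier = WeightedFrontier()
frontier.push("http://example.com/news", link_weight(0.2, 0.1))
frontier.push("http://example.com/shop", link_weight(0.9, 0.7))
frontier.push("http://example.com/about", link_weight(0.1, 0.1))

# Drain the frontier: URLs come out in descending order of weight.
order = [frontier.pop()[0] for _ in range(len(frontier))]
print(order)
```

Unlike a breadth-first queue or a depth-first stack, the download order here depends only on the computed weights, so links expected to lead to ad-rich pages are fetched first regardless of when they were discovered.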

  • 【Classification Code】 TP393.09; TP391.1
  • 【Times Cited】 4
  • 【Downloads】 178