基于XML技术的Web信息提取和集成

Web Information Extraction and Integration Based on XML

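【Summary】 1. Introduction. With the rapid development of the Internet, the volume of information on the Web is growing explosively. As data volumes surge, the Web grows rapidly in scale, and demand for applications that use online information keeps rising, the existing approaches of browsing web pages through links and searching by keyword can no longer meet users' information needs. Although users can obtain massive amounts of data, …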

【Abstract】 Web information is expanding rapidly with the dramatic growth of the Internet. The accessible Web information that users are interested in usually resides in HTML documents rather than in databases. These documents are unstructured or semi-structured and lack patterns and metadata, which makes them inconvenient for information integration, information exchange, Web knowledge discovery, exact Web information queries, and similar tasks. This paper presents a new Web information extraction and integration methodology based on XML. It has four phases. The first creates an XML schema for each class of HTML documents and maps it to a database schema; the second extracts patterns from sample HTML documents and builds a template for them; the third selects the appropriate template and uses it to extract useful content from HTML documents; the last integrates the content into a database. The approach has been implemented in COMMIX. It achieves good effectiveness by taking both universality and precision into account.
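
The third and fourth phases described in the abstract (applying a template to a page, then loading the result into a database) can be illustrated with a minimal sketch. This is not the COMMIX implementation: the sample page, the `BOOK_TEMPLATE` mapping, the `book` table, and the `extract` and `integrate` functions are all hypothetical, and the page is assumed to be already cleaned into well-formed XML. It uses only the Python standard library, whose `xml.etree.ElementTree` module supports a limited subset of XPath.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample page; real HTML would first need cleaning into well-formed XML.
SAMPLE_PAGE = """
<html><body>
  <div class="book">
    <h1>XML Basics</h1>
    <span class="author">A. Writer</span>
    <span class="price">25.00</span>
  </div>
</body></html>
"""

# Hypothetical template: field name -> path in ElementTree's XPath subset.
BOOK_TEMPLATE = {
    "title": ".//div[@class='book']/h1",
    "author": ".//span[@class='author']",
    "price": ".//span[@class='price']",
}


def extract(page: str, template: dict) -> dict:
    """Apply each path in the template and collect the text of the first match."""
    root = ET.fromstring(page.strip())
    record = {}
    for field, path in template.items():
        node = root.find(path)
        has_text = node is not None and node.text
        record[field] = node.text.strip() if has_text else None
    return record


def integrate(conn: sqlite3.Connection, record: dict) -> None:
    """Store one extracted record in a (hypothetical) relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS book (title TEXT, author TEXT, price REAL)"
    )
    conn.execute("INSERT INTO book VALUES (:title, :author, :price)", record)


record = extract(SAMPLE_PAGE, BOOK_TEMPLATE)
conn = sqlite3.connect(":memory:")
integrate(conn, record)
rows = conn.execute("SELECT title, author, price FROM book").fetchall()
print(record)
print(rows)
```

Keeping the template as plain data separate from the extraction code mirrors the abstract's split between building a template for a class of documents and applying it to individual pages. How COMMIX actually represents its templates is not stated in this record.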

【Key words】 Information extraction; Pattern extraction; Information integration; XML; XPath; Wrapper

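【Funding】 Supported by the National Key Basic Research Program of China (973 Program, No. G1999032705) and the 863 Program major database project "XML-based Data Integration, Sharing, and Exchange" (No. 2002AA4Z3440)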

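【Proceedings title】 Proceedings of the 20th National Database Conference (Research Reports volume)

【Conference name】 The 20th National Database Conference

【Conference date】 2003-10-10
【Conference venue】 Changsha, Hunan, China
【Classification number】 TP311.10

【Organizer】 Database Technical Committee of the China Computer Federation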
