Clustering XML Documents Based on a Structural Summary Tree
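【Abstract (translated from the Chinese)】 An effective method for representing the structural information of XML documents is proposed. The structural information of each XML document is encoded in a numerically labeled structural summary tree (SST). A structural distance is defined on this encoding, and a genetic algorithm is used to cluster the XML documents. Experiments show that the method achieves high classification accuracy, is easy to implement, and requires no prior DTD knowledge.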
【Abstract】 This paper proposes an approach for calculating the structural similarity between XML documents. The structural information of an XML document is captured in a structural summary tree (SST). Encoding each element as a number turns the SST into a digit-labeled tree. After normalization, the numbers at the different tree levels are concatenated into a vector, so each XML document is represented as an m-dimensional vector. A clustering algorithm based on a genetic algorithm (GA) is adopted because it can give good results irrespective of the starting configuration. Experimental results show the effectiveness and scalability of the approach.
【Key words】 XML; information retrieval; document clustering; GA; SST (structural summary tree)
- 【Source】 应用科学学报 (Journal of Applied Sciences), 2005, Issue 01
- 【Classification number】 TP393
- 【Times cited】 9
- 【Downloads】 158
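
The abstract describes the pipeline only in outline, so the following Python sketch is one possible reading of it, not the authors' implementation. It assumes that an SST is obtained by merging sibling elements that share a tag, that tags receive integer codes in order of first appearance across the corpus, that each tree level contributes one vector component (the sum of its tag codes divided by the sum of all codes), and that the structural distance is Euclidean. All function names are hypothetical, and the GA clustering step is not sketched.

```python
import math
import xml.etree.ElementTree as ET


def merge(dst, src):
    """Recursively merge summary subtree src into dst (both {tag: children})."""
    for tag, sub in src.items():
        merge(dst.setdefault(tag, {}), sub)


def summary_tree(elem):
    """Assumed SST construction: siblings with the same tag collapse into one node."""
    children = {}
    for child in elem:
        merge(children.setdefault(child.tag, {}), summary_tree(child))
    return children


def levels(tree):
    """Return the set of tags found at each depth of a summary tree."""
    out, frontier = [], [tree]
    while frontier:
        tags, nxt = set(), []
        for node in frontier:
            for tag, sub in node.items():
                tags.add(tag)
                nxt.append(sub)
        if tags:
            out.append(tags)
        frontier = nxt
    return out


def encode_corpus(xml_texts, m):
    """Map each document to an m-dimensional vector (assumed encoding)."""
    codes, per_doc = {}, []
    for text in xml_texts:
        root = ET.fromstring(text)
        lv = levels({root.tag: summary_tree(root)})
        for tags in lv:
            for tag in sorted(tags):
                # Integer code by first appearance; an illustrative choice.
                codes.setdefault(tag, len(codes) + 1)
        per_doc.append(lv)
    total = sum(codes.values())  # normalization constant (assumption)
    vectors = []
    for lv in per_doc:
        vec = [0.0] * m  # levels deeper than m are ignored, missing ones stay 0
        for depth, tags in enumerate(lv[:m]):
            vec[depth] = sum(codes[t] for t in tags) / total
        vectors.append(vec)
    return vectors


def structural_distance(u, v):
    """Euclidean distance between two document vectors (assumed definition)."""
    return math.dist(u, v)
```

Under these assumptions, two documents that differ only in how often an element repeats share one SST and therefore one vector. Summing codes per level can map different tag sets to the same value, so a real encoding would need a collision-free scheme.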