Design of Citation Coupling Algorithm Combining Paper Citation Characteristics and Distributed Retrieval (original title: 结合论文施引特征和分布式检索技术的引文耦合度算法设计)
【Abstract】 Computing the full set of citation coupling relationships in a large-scale scientific literature knowledge base is extremely expensive, which has hindered the use of citation coupling knowledge services in many business scenarios. This paper proposes a full citation coupling degree algorithm suited to large-scale literature knowledge bases. It uses citing characteristics to filter out invalid combinations that have no coupling relationship, avoids generating a sparse matrix during computation, and introduces multi-pattern matching to bring the overall time complexity to O(n log z). In production, the algorithm is implemented on a distributed search engine cluster. On the National Science and Technology Library's database of 36 million scientific and technical papers, multiple groups of experiments compared the method with the traditional citation coupling method and generated coupling degree data for 659 million paper pairs. This data supports the library's citation coupling knowledge service and verifies the accuracy and practicality of the method.
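The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a plain in-memory inverted index from each cited work to the papers citing it, and it leaves out both the multi-pattern matching step and the distributed search engine cluster. The function name `coupling_strengths` and the sample identifiers are hypothetical. Because pairs are enumerated only among papers that share at least one reference, pairs with zero coupling are never generated and no sparse n × n matrix is built.

```python
from collections import defaultdict
from itertools import combinations


def coupling_strengths(references):
    """Count shared references for every pair of papers that has any.

    references: dict mapping a paper id to an iterable of cited-work ids.
    Returns a dict mapping (paper_a, paper_b), with paper_a < paper_b,
    to the number of references the two papers have in common.
    """
    # Inverted index: cited work -> set of papers that cite it.
    cited_by = defaultdict(set)
    for paper, refs in references.items():
        for ref in set(refs):  # ignore duplicate entries in a reference list
            cited_by[ref].add(paper)

    # Only papers sharing a cited work are ever paired, so uncoupled
    # pairs (the zero cells of a full matrix) are skipped entirely.
    strengths = defaultdict(int)
    for citing_papers in cited_by.values():
        for a, b in combinations(sorted(citing_papers), 2):
            strengths[(a, b)] += 1
    return dict(strengths)


# Hypothetical sample data: P3 shares no reference with any other paper.
refs = {
    "P1": ["R1", "R2", "R3"],
    "P2": ["R2", "R3", "R4"],
    "P3": ["R5"],
    "P4": ["R3"],
}
result = coupling_strengths(refs)
```

In this sample, P1 and P2 share R2 and R3, P4 shares R3 with each of them, and P3 never appears in the output because it is filtered out before any pair is formed.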
【Key words】 bibliographic coupling; distributed search engine; sparse matrix; citation characteristics; multi-pattern matching
- 【Source】 Journal of Chinese Computer Systems (小型微型计算机系统), 2025, Issue 02
- 【Classification No.】 TP391.3