A Parallel Checkpointing System Based on Communication System Support
【Abstract】 In large-scale cluster environments, checkpoint and recovery mechanisms are an indispensable fault-tolerance technique. This paper proposes a method that relies on the reliability mechanisms of the cluster communication system to capture the global state of the communication system without global synchronization, and uses this method to implement a non-blocking coordinated parallel checkpointing and recovery system that is transparent to applications. With support from the underlying communication system, the system reduces both the implementation complexity and the execution overhead of parallel checkpointing, making it suitable for large-scale cluster applications.
【Key words】 Cluster communication system; Parallel checkpointing; Fault tolerance
【Funding】 Chinese Academy of Sciences research project on key technologies for next-generation clusters (KGCX2-SW-116)
- 【Source】 Computer Engineering (计算机工程), 2007, Issue 05
- 【Classification code】 TN914; TP338
- 【Citation count】 4
- 【Download count】 75