Design of Cluster System Fault Management Architecture Based on Autonomic Computing
【Author】 LI Jing, LIU Hongwei, DONG Jian, SHU Yanjun (Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001)
【Affiliation】 School of Computer Science and Technology, Harbin Institute of Technology
【Abstract】 As computer technology advances and systems keep growing in scale, managing and maintaining high-availability cluster systems has become increasingly complex. To provide a stable computing environment and to detect and locate latent faults promptly, a proactive fault management method for cluster computing systems is proposed. The paper first analyzes the concepts and techniques of autonomic computing and then, building on an analysis of the management requirements of cluster computing environments, proposes a rule-based autonomic fault management software architecture. Based on the characteristics of cluster systems, the method adopts hierarchical management and designs a local fault management module (LFM) and a global fault management module (GFM), describing the internal functional structure of each in detail.
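The abstract describes the LFM/GFM split only at the architectural level. The Python sketch below illustrates one way a rule-based, two-level scheme of this kind could work, assuming node metrics arrive as name-to-value maps: each node's LFM evaluates its rules, handles faults it can recover from locally, and escalates the rest to a GFM, which correlates reports across nodes. The rule format, metric names, thresholds, actions, and the correlation policy are all hypothetical illustrations, not the paper's actual design.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical sketch: rule names, metrics, thresholds and actions are
# illustrative assumptions, not taken from the paper.
Metrics = Dict[str, float]


@dataclass
class Rule:
    name: str
    condition: Callable[[Metrics], bool]  # predicate over one node's metrics
    local: bool                           # True: LFM handles it; False: escalate to GFM
    action: str = ""                      # local recovery action (used only if local)


@dataclass
class FaultReport:
    node: str
    rule: str


class GlobalFaultManager:
    """Cluster-level manager: correlates fault reports escalated by LFMs."""

    def __init__(self, correlation_threshold: int = 2) -> None:
        # Assumed policy: the same fault on >= threshold nodes suggests a
        # shared cause (e.g. network or storage) rather than one bad node.
        self.correlation_threshold = correlation_threshold
        self.reports: List[FaultReport] = []

    def report(self, r: FaultReport) -> None:
        self.reports.append(r)

    def decide(self) -> List[Tuple[str, str]]:
        # Group distinct reporting nodes by rule name.
        nodes_by_rule: Dict[str, List[str]] = {}
        for r in self.reports:
            nodes = nodes_by_rule.setdefault(r.rule, [])
            if r.node not in nodes:
                nodes.append(r.node)
        decisions = []
        for rule, nodes in sorted(nodes_by_rule.items()):
            if len(nodes) >= self.correlation_threshold:
                decisions.append((rule, "check shared component"))
            else:
                decisions.append((rule, "isolate " + ",".join(nodes)))
        return decisions


class LocalFaultManager:
    """Node-level manager: applies rules locally, escalates the rest."""

    def __init__(self, node: str, rules: List[Rule], gfm: GlobalFaultManager) -> None:
        self.node = node
        self.rules = rules
        self.gfm = gfm
        self.taken: List[str] = []  # local actions performed on this node

    def check(self, metrics: Metrics) -> None:
        for rule in self.rules:
            if not rule.condition(metrics):
                continue
            if rule.local:
                self.taken.append(rule.action)
            else:
                self.gfm.report(FaultReport(self.node, rule.name))


# Usage with made-up rules and metric readings.
RULES = [
    Rule("process_down", lambda m: m.get("svc_alive", 1.0) == 0.0, local=True,
         action="restart service"),
    Rule("disk_errors", lambda m: m.get("disk_err", 0.0) > 10.0, local=False),
    Rule("net_loss", lambda m: m.get("pkt_loss", 0.0) > 0.05, local=False),
]

gfm = GlobalFaultManager()
n1 = LocalFaultManager("node1", RULES, gfm)
n2 = LocalFaultManager("node2", RULES, gfm)
n1.check({"svc_alive": 0.0, "disk_err": 42.0, "pkt_loss": 0.1})
n2.check({"pkt_loss": 0.2})
print(n1.taken)      # ['restart service']
print(gfm.decide())  # [('disk_errors', 'isolate node1'), ('net_loss', 'check shared component')]
```

In this sketch, keeping locally recoverable faults inside the LFM limits traffic to the GFM; only faults that need a cluster-wide view (such as the same symptom appearing on several nodes) are escalated.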
【Key words】 cluster fault management; autonomic computing; hierarchical management
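- 【Proceedings】 Proceedings of the 14th National Conference on Fault-Tolerant Computing (CFTC’2011)
- 【Conference】 The 14th National Conference on Fault-Tolerant Computing (CFTC’2011)
- 【Date】 2011-07-30
- 【Location】 Beijing, China
- 【CLC Number】 TP315
- 【Organizer】 Technical Committee on Fault-Tolerant Computing, China Computer Federation (CCF)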