Research on a Vertex-Centric Distributed Graph Neural Network Framework
【Author】 Liu Xiao
【Supervisor】 Gao Shang
【Author Information】 Jilin University, Master of Engineering (professional degree), 2023, Master's degree
【Abstract】 Graph Neural Networks (GNNs) are a class of algorithms that apply deep neural networks to learning on graph data. By incorporating graph broadcast operations into deep learning algorithms, GNNs allow both the structural information and the vertex information of a graph to be learned, making up for the weakness of traditional deep neural networks on non-Euclidean data. GNNs are typically applied in distributed systems and large-scale data scenarios. However, no mainstream deep learning framework currently provides good support for the special storage requirements of graph data or for message passing on graphs during GNN training, which holds back the further application of GNN algorithms to large-scale graph data. Many works have explored the design of distributed GNN frameworks based on the characteristics of graph-structured data and of GNN computation, and several system implementations now exist. Building on this prior work, this thesis analyzes existing large-scale GNN systems from multiple perspectives and evaluates several open-source systems experimentally. The summary and experimental analysis show that, because of the dependencies among vertices in GNN computation, the mini-batch parallel training methods adopted by existing systems suffer from high memory complexity and from accuracy loss caused by neighbor sampling; this hurts system scalability and model effectiveness and limits the further development of GNNs.

To address these problems, the thesis briefly reviews the development of GNNs and summarizes the challenges that must be solved when building a GNN framework. An analysis of typical GNN models shows that both the forward computation and the backward propagation of a GNN can be decomposed into two parts: a deep neural network (DNN) part and a graph propagation part. By recursively decomposing the computation between vertices, the update of each vertex can be viewed as an independent sample that obtains the features of its neighbor vertices through communication and aggregates them, which resolves the dependencies between vertices. On this basis, the thesis proposes a full-batch gradient descent parallel training method for GNNs that uses communication between vertices to resolve dependencies. It requires no neighbor sampling, greatly reduces storage complexity, preserves model accuracy, and reduces redundant computation. This parallel method requires communication between vertices during computation, which current systems that express neural network computation as computation graphs do not support. Therefore, based on the full-batch gradient descent parallel method, the thesis designs and implements GFrame, a distributed graph neural network framework. GFrame combines the functions a DNN framework needs to execute neural networks, such as tensor abstraction and automatic differentiation, expresses the GNN model on a graph engine, and exploits the graph engine's optimized partitioning and communication to support efficient distributed GNN training.

The framework consists of two main modules, computation and communication. Unlike other systems, GFrame couples a graph engine with a DNN framework; accordingly, based on the point-to-point communication of the graph engine and the full-batch parallel training method, the thesis designs and implements GFrame's computation framework and communication framework. In the computation framework, the tensor computation is implemented with the DNN framework, while the graph propagation is completed with the distributed graph engine; backward propagation is recursively decomposed and computed with the chain rule. The communication framework implements parameter synchronization and graph-engine communication for fetching feature data. Finally, GFrame is compared with existing open-source frameworks; analysis and experiments show that it achieves good results in performance, storage complexity, and other aspects. On a single machine, the training accuracy of the framework is essentially the same as reported in the original papers of the models, and its single-machine performance compares well with existing single-machine systems. Thanks to the design and optimization of the framework, its memory overhead on dense graphs is the smallest among all the open-source systems evaluated; the largest memory consumption among the other systems is nearly 10 times that of the framework. In the distributed setting, the framework trains models with a smaller accuracy error than the open-source system Euler and with better performance than AliGraph.
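As an illustration of the decomposition described in the abstract, and assuming a GCN-style layer (the thesis speaks of typical GNN models in general, so the concrete formula below is only an example), a layer H^{(l+1)} = \sigma(\hat{A} H^{(l)} W^{(l)}) splits into a graph propagation part and a DNN part:

  Z^{(l)} = \hat{A} H^{(l)}                 (graph propagation part)
  H^{(l+1)} = \sigma( Z^{(l)} W^{(l)} )     (DNN part)

where \hat{A} = \tilde{D}^{-1/2} (A + I) \tilde{D}^{-1/2} is the normalized adjacency matrix. Per vertex the propagation step is z_v^{(l)} = \sum_{u \in N(v) \cup \{v\}} \hat{a}_{vu} h_u^{(l)}, i.e. each vertex fetches its neighbors' features through communication and aggregates them, which is the per-vertex view used by the full-batch parallel method. The backward pass decomposes in the same way via the chain rule: with G = \partial L / \partial H^{(l+1)} and S = Z^{(l)} W^{(l)},

  \partial L / \partial W^{(l)} = (Z^{(l)})^{\top} (G \odot \sigma'(S))                      (DNN part)
  \partial L / \partial H^{(l)} = \hat{A}^{\top} (G \odot \sigma'(S)) (W^{(l)})^{\top}       (graph propagation part)

so the gradient computation again reduces to a dense step followed by a sparse propagation over the (reversed) edges, which each vertex can perform by exchanging gradients with its neighbors.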
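The per-vertex update can also be sketched in code. The following is a minimal single-process NumPy sketch of the "fetch neighbor features through communication, aggregate, then apply the DNN part" pattern; all names (gather_neighbor_features, gnn_layer, the toy graph) are hypothetical illustrations rather than GFrame's actual API, and the communication step is simulated by a local dictionary lookup.

import numpy as np

def gather_neighbor_features(v, features, adjacency):
    # Stand-in for the graph-engine communication step: fetch the feature
    # vectors of v's neighbors (in a distributed setting they may live on
    # remote partitions).
    return [features[u] for u in adjacency[v]]

def gnn_layer(features, adjacency, weight):
    # Full-batch layer: every vertex aggregates its own and its neighbors'
    # features (mean aggregation here) and applies a shared dense transform.
    new_features = {}
    for v in adjacency:
        neigh = gather_neighbor_features(v, features, adjacency)
        agg = np.mean(np.vstack([features[v]] + neigh), axis=0)  # graph propagation part
        new_features[v] = np.maximum(agg @ weight, 0.0)          # DNN part (linear + ReLU)
    return new_features

if __name__ == "__main__":
    adjacency = {0: [1, 2], 1: [0], 2: [0]}   # toy undirected graph
    rng = np.random.default_rng(0)
    features = {v: rng.normal(size=4) for v in adjacency}
    weight = rng.normal(size=(4, 2))
    print(gnn_layer(features, adjacency, weight))

In the framework described by the abstract, the loop over vertices would run in parallel over graph partitions and the gather step would be a graph-engine message exchange; because every vertex participates in every step, no neighbor sampling is needed and no sampling-induced accuracy loss is introduced.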
【Key words】 GNN; large-scale graph data; distributed systems; deep learning;
- 【Online Publication Contributor】 Jilin University 【Online Publication Year/Issue】 2024, Issue 02
- 【CLC Number】 TP183