
Research on AUV Motion Planning Method Based on Maximum Entropy Deep Reinforcement Learning

【Author】 Yu Xin

【Advisor】 Sun Yushan

【Author Information】 Harbin Engineering University, Design and Manufacture of Ships and Marine Structures, 2022, Master's degree

【Abstract】 This research explores how an autonomous underwater vehicle (AUV) can rely on global path information and local information obtained by sensors to make decisions efficiently and quickly in an unknown, complex environment, so as to avoid dense obstacles of various shapes and reach the specified target location while meeting various constraints, completing the motion planning task. Aiming at the problems of poor exploration ability, single strategy, high training cost and sparse rewards in AUV motion planning tasks, an end-to-end motion planning system based on a deep reinforcement learning algorithm is proposed. To solve these problems and improve AUV motion planning, the following work is carried out: (1) Considering the multiple constraints of system dynamics, sensor performance, obstacle collision range and ocean current interference, the complex motion planning problem is formulated. Based on a neural network model, an end-to-end state-to-action motion planning architecture is constructed, and a state space based on position, velocity and obstacle information is determined. A simple sonar model is built to realize local obstacle avoidance, and the sonar dead-zone problem is studied. The AUV's action space is then determined, and the action values output by the neural network are clipped and linearly transformed. (2) A motion planning system based on the Soft Actor-Critic (SAC) algorithm is designed; the maximum entropy method increases the randomness of the policy, thereby enhancing the AUV's ability to explore the environment. To address sparse environmental rewards, the motion planning task is decomposed and a comprehensive external reward function is designed, which guides the AUV toward the target point while constraining its navigation state and optimizing navigation distance and time. (3) Since learning a policy from scratch in reinforcement learning is difficult and time-consuming, generative adversarial imitation learning (GAIL) is introduced to assist AUV training, using expert policies to guide the AUV's learning. On this basis, a combined SAC-GAIL algorithm is proposed, trained on a mixture of GAIL internal reward signals and external reward signals, which reduces the cost of interaction between the AUV and the environment. By coordinating the weights of the internal and external rewards, the GAIL reward signal guides the AUV's navigation and encourages it to discover external environmental rewards. (4) Visual simulation is carried out in Unity3D: a randomly distributed dense obstacle environment is constructed, the episode termination procedure during training is determined, and appropriate reward values and algorithm parameters are selected. For single-target and multi-target tasks, motion planning systems based on the PPO, SAC and SAC-GAIL algorithms are trained respectively, and the training results are analyzed. Using the trained policies, target point sequences are randomly generated, and the algorithms are tested and compared. Good results are obtained, verifying the effectiveness and stability of the proposed algorithm and demonstrating its advantages.
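The maximum-entropy objective that the abstract refers to in (2) is the standard SAC formulation (given here in its textbook form, not as the thesis's exact notation): the policy maximizes expected return plus an entropy bonus,

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
\Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]
```

where $\alpha$ is the temperature coefficient trading off reward against policy randomness; a larger $\alpha$ yields a more stochastic policy and hence stronger exploration, which is the exploration-enhancing mechanism the abstract describes.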
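The clipping and linear transformation of network outputs mentioned in (1) can be sketched as follows. This is a minimal illustration, not the thesis's code: the actuator names and bounds (`low`, `high`) are hypothetical, assuming the policy network emits values nominally in [-1, 1].

```python
import numpy as np

def scale_action(raw, low, high):
    """Clip a raw policy output to [-1, 1], then map it linearly
    onto the actuator range [low, high]."""
    raw = np.clip(raw, -1.0, 1.0)
    return low + 0.5 * (raw + 1.0) * (high - low)

# Hypothetical AUV command bounds: surge thrust (N) and rudder angle (rad).
low = np.array([0.0, -0.5])
high = np.array([50.0, 0.5])
cmd = scale_action(np.array([0.2, -1.4]), low, high)  # cmd == [30.0, -0.5]
```

Note that the out-of-range second component (-1.4) is first clipped to -1.0, so the commanded rudder angle saturates at its lower bound rather than exceeding it.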
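The internal/external reward mixing in (3) amounts to a weighted blend of the GAIL discriminator reward and the environment reward. A minimal sketch, assuming a single scalar weight (the name `lam` and its value are illustrative; the thesis tunes this balance rather than fixing it):

```python
def mixed_reward(r_external, r_gail, lam=0.3):
    """Blend the environment (external) reward with the GAIL
    imitation (internal) reward using weight lam in [0, 1]."""
    return (1.0 - lam) * r_external + lam * r_gail
```

With `lam` near 1 the agent mostly imitates the expert; lowering `lam` over training shifts emphasis toward the external reward, letting the imitation signal bootstrap exploration until environment rewards are discovered.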

【CLC Number】 P752; TP18; TP242