Path Planning of Manipulator Based on Deep Reinforcement Learning
【Author】 Yang Yang;
【Supervisor】 Ma Jiachen;
【Author Information】 Harbin Institute of Technology, Control Engineering (Professional Degree), 2023, Master's
【Abstract】 With the rapid development of intelligent manufacturing in the textile industry, robotic arms are progressively replacing manual labor on automated textile production lines, accurately performing many heavy, repetitive operations. In the unstructured, dynamic spatial environment of an automated textile-robot production line, path planning for a manipulator in a confined space is a technical difficulty that must be overcome. Taking this as its entry point, this thesis introduces deep reinforcement learning into manipulator path planning in confined spaces, aiming to overcome the limitation that conventional manipulators can only operate by preset procedures in unknown, dynamic, and unstructured scenarios; improving the dynamic-environment adaptability of manipulators in intelligent textile manufacturing through learning-based methods therefore has significant research and application value. First, for the path-planning problem of a six-axis manipulator in a confined space with obstacles, i.e., given a start position and a goal position in a specific environment, the manipulator must plan a collision-free path within its restricted workspace, a path-planning method based on the Deep Deterministic Policy Gradient (DDPG) algorithm is proposed. To address the slow or non-convergent training caused by sparse rewards, a composite reward function is designed. A simulation environment is built on the V-REP platform, and simulation experiments verify the feasibility of the method. Second, because DDPG samples experiences with equal probability and ignores the difference in value between samples, prioritized experience replay is introduced in place of uniform experience replay, assigning each sample a sampling probability according to the magnitude of its temporal-difference (TD) error. Since prioritized experience replay changes the expected gradient and thus biases the predictions of the trained model, a prioritized-experience-replay method with an improved loss function is proposed, using the Huber loss and clipped sample priorities to suppress the bias that TD-error outliers introduce into training. In addition, a noisy network replaces the Ornstein-Uhlenbeck (OU) process in DDPG to provide stable and reliable exploration during training. Combining these two improvements with DDPG, simulation experiments show that the improved algorithm raises the average cumulative return by 37% and speeds up learning by 32%, verifying its superiority. Finally, for obstacle-avoidance path planning of a manipulator in a confined space, a path-planning method based on the Proximal Policy Optimization (PPO) algorithm, a stochastic-policy-gradient method, is proposed. Generalized State-Dependent Exploration (gSDE) provides more stable, lower-variance exploration within each episode; combining gSDE with PPO, the resulting PPO-gSDE algorithm improves the average cumulative return by 25% over the original algorithm, effectively enhancing performance.
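The record does not state the exact terms of the composite reward function, so the Python sketch below only illustrates the general pattern the abstract describes: a dense progress term combined with a per-step cost, a collision penalty, and a goal bonus that together replace a purely sparse reward. All names and coefficients (composite_reward, reached_tol, the penalty values) are hypothetical.

```python
import numpy as np

def composite_reward(ee_pos, goal_pos, prev_dist, collided, reached_tol=0.02):
    """Hypothetical dense shaping of a sparse reach/collision reward.

    ee_pos, goal_pos : end-effector / target positions (3-vectors)
    prev_dist        : distance to the goal at the previous step
    collided         : whether the arm hit an obstacle (or itself) this step
    """
    dist = float(np.linalg.norm(np.asarray(ee_pos) - np.asarray(goal_pos)))
    reward = prev_dist - dist          # progress term: positive when moving toward the goal
    reward -= 0.01                     # small per-step cost to encourage short paths
    if collided:
        reward -= 1.0                  # collision penalty
    if dist < reached_tol:
        reward += 10.0                 # terminal bonus for reaching the goal
    return reward, dist
```

A denser signal of this kind is the usual remedy for the sparse-reward stalling that the abstract attributes to the original setup.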
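A minimal sketch of prioritized experience replay with clipped priorities and a Huber (smooth-L1) critic loss, the combination the abstract credits with suppressing TD-error outliers. It assumes a simple list-based buffer rather than the sum-tree used in efficient implementations, and names such as PrioritizedReplayBuffer and priority_clip are illustrative, not taken from the thesis.

```python
import numpy as np
import torch
import torch.nn.functional as F

class PrioritizedReplayBuffer:
    """Proportional prioritized replay with clipped priorities (simplified, list-based)."""

    def __init__(self, capacity, alpha=0.6, priority_clip=1.0):
        self.capacity = capacity
        self.alpha = alpha
        self.priority_clip = priority_clip   # upper bound on |TD error| used as priority
        self.data, self.priorities = [], []
        self.pos = 0

    def add(self, transition):
        max_p = max(self.priorities, default=1.0)   # new samples get the current max priority
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(max_p)
        else:
            self.data[self.pos] = transition
            self.priorities[self.pos] = max_p
            self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        p = np.asarray(self.priorities) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=p)
        weights = (len(self.data) * p[idx]) ** (-beta)   # importance-sampling correction
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], torch.as_tensor(weights, dtype=torch.float32)

    def update_priorities(self, idx, td_errors):
        for i, delta in zip(idx, td_errors):
            # clip priorities so a few outlier TD errors do not dominate sampling
            self.priorities[i] = min(abs(float(delta)), self.priority_clip) + 1e-6

def critic_loss(q_pred, q_target, weights):
    # Huber (smooth L1) loss bounds the gradient of outlier TD errors;
    # the importance-sampling weights correct the bias introduced by prioritization.
    per_sample = F.smooth_l1_loss(q_pred, q_target, reduction="none")
    return (weights * per_sample).mean()
```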
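The noisy network that replaces OU exploration noise is commonly implemented as a linear layer with learnable noise scales (factorised-Gaussian NoisyNet). The PyTorch sketch below shows one standard formulation of such a layer; it is not the thesis's exact network.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Factorised-Gaussian noisy linear layer (NoisyNet-style exploration)."""

    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features),
                                               sigma0 / math.sqrt(in_features)))
        self.mu_b = nn.Parameter(torch.empty(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,),
                                               sigma0 / math.sqrt(in_features)))
        self.register_buffer("eps_in", torch.zeros(in_features))
        self.register_buffer("eps_out", torch.zeros(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.mu_w, -bound, bound)
        nn.init.uniform_(self.mu_b, -bound, bound)
        self.reset_noise()

    @staticmethod
    def _scale(x):
        # f(x) = sign(x) * sqrt(|x|), as used for factorised Gaussian noise
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):
        # the learned sigmas stay; only the sampled noise is refreshed
        self.eps_in.normal_()
        self.eps_out.normal_()

    def forward(self, x):
        w = self.mu_w + self.sigma_w * torch.outer(self._scale(self.eps_out),
                                                   self._scale(self.eps_in))
        b = self.mu_b + self.sigma_b * self._scale(self.eps_out)
        return F.linear(x, w, b)
```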
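gSDE resamples its noise parameters only once per episode (or every few steps) and makes the exploration offset a function of the state features, which is the source of the lower per-episode variance the abstract mentions. The NumPy sketch below, with hypothetical names (GSDEExploration, resample), illustrates just this mechanism, not the full PPO-gSDE integration.

```python
import numpy as np

class GSDEExploration:
    """Minimal sketch of generalized state-dependent exploration (gSDE)."""

    def __init__(self, feature_dim, action_dim, log_std=-0.5, rng=None):
        self.sigma = np.exp(log_std) * np.ones((feature_dim, action_dim))
        self.rng = rng or np.random.default_rng()
        self.resample()

    def resample(self):
        # one draw per episode: theta_eps ~ N(0, sigma^2), elementwise
        self.theta_eps = self.rng.normal(0.0, self.sigma)

    def action(self, mean_action, features):
        # the exploration offset depends on the state features,
        # not on fresh per-step noise, so it stays smooth within an episode
        return mean_action + features @ self.theta_eps
```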
【Key words】 robotic arm; path planning; deep reinforcement learning; deep deterministic policy gradient; proximal policy optimization;
- 【Online Publication Contributor】 Harbin Institute of Technology 【Online Publication Year/Issue】 2025, No. 04
- 【CLC Number】 TP241; TP18