基于Transformer的状态-动作-奖赏预测表征学习

State-Action-Reward Prediction Representation Learning Based on Transformer

【Authors】 LIU Min-Song; ZHU Yuan-Heng; ZHAO Dong-Bin

【Corresponding Author】 ZHU Yuan-Heng

【Affiliations】 State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

【Abstract】 To improve the performance and sample efficiency of complex continuous control tasks with high-dimensional action spaces, this paper proposes a Transformer-based state-action-reward prediction representation learning framework (TSAR). Specifically, TSAR formulates a Transformer-based sequence prediction task that fuses state, action, and reward information. The task preprocesses sequence data with random masking and maximizes the mutual information between the state features predicted from the masked sequence and the actual target state features, so that state and action representations are learned jointly. To further strengthen the relevance of these representations to the reinforcement learning (RL) policy, TSAR introduces action prediction and reward prediction as additional learning constraints that guide state and action representation learning. TSAR also explicitly incorporates the state and action representations into the optimization of the RL policy, significantly enhancing how much the representations facilitate policy learning. Experimental results demonstrate that, in nine challenging and difficult DMControl environments, TSAR surpasses existing state-of-the-art methods in both performance and sample efficiency.
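
The following is a minimal PyTorch sketch of the kind of objective the abstract describes: a Transformer encoder over fused state-action-reward tokens, random masking of timesteps, an InfoNCE-style mutual-information loss between predicted and target state features, and auxiliary action- and reward-prediction losses. All module names, dimensions, masking choices, equal loss weighting, and the detached-target trick are illustrative assumptions, not details taken from the paper; the actual method may differ (for example, it may use a momentum target encoder or a different mutual-information estimator).

import torch
import torch.nn as nn
import torch.nn.functional as F

class TSARSketch(nn.Module):
    # Transformer over fused state-action-reward tokens with random masking.
    def __init__(self, state_dim, action_dim, embed_dim=128, n_layers=2, n_heads=4):
        super().__init__()
        self.state_enc = nn.Linear(state_dim, embed_dim)     # state representation
        self.action_enc = nn.Linear(action_dim, embed_dim)   # action representation
        self.reward_enc = nn.Linear(1, embed_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.state_head = nn.Linear(embed_dim, embed_dim)    # predicts masked state features
        self.action_head = nn.Linear(embed_dim, action_dim)  # auxiliary action prediction
        self.reward_head = nn.Linear(embed_dim, 1)           # auxiliary reward prediction

    def forward(self, states, actions, rewards, mask):
        # states: (B, T, state_dim); actions: (B, T, action_dim);
        # rewards: (B, T, 1); mask: (B, T) bool, True = masked timestep.
        s = self.state_enc(states)
        targets = s.detach()  # target state features (detached here to avoid collapse)
        tokens = s + self.action_enc(actions) + self.reward_enc(rewards)  # fused s-a-r token
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        h = self.transformer(tokens)
        return self.state_head(h), self.action_head(h), self.reward_head(h), targets

def info_nce(pred, target, temperature=0.1):
    # InfoNCE lower bound on the mutual information between predicted and
    # target state features; other masked steps in the batch act as negatives.
    pred, target = F.normalize(pred, dim=-1), F.normalize(target, dim=-1)
    logits = pred @ target.t() / temperature
    return F.cross_entropy(logits, torch.arange(len(pred), device=pred.device))

# One illustrative training step on random data.
B, T, state_dim, action_dim = 8, 10, 24, 6
model = TSARSketch(state_dim, action_dim)
states, actions = torch.randn(B, T, state_dim), torch.randn(B, T, action_dim)
rewards = torch.randn(B, T, 1)
mask = torch.rand(B, T) < 0.5          # random masking of timesteps

s_pred, a_pred, r_pred, s_tgt = model(states, actions, rewards, mask)
m = mask.reshape(-1)
flat = lambda x: x.reshape(B * T, -1)  # collapse batch and time for masked selection
loss = (info_nce(flat(s_pred)[m], flat(s_tgt)[m])         # masked state-feature prediction
        + F.mse_loss(flat(a_pred)[m], flat(actions)[m])   # action-prediction constraint
        + F.mse_loss(flat(r_pred)[m], flat(rewards)[m]))  # reward-prediction constraint
loss.backward()

In the full method, the state and action encoders trained this way would also be incorporated into the RL policy's optimization; that integration step is omitted from this sketch.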

【Funding】 Supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA27030400), the National Natural Science Foundation of China (62136008, 62293541), and the Beijing Natural Science Foundation (4232056)
  • 【Source】 Acta Automatica Sinica (自动化学报), 2025, Issue 01
  • 【CLC Number】 TP391.41; TP18