
激励学习的若干新算法及其理论研究

Some New Algorithms of Reinforcement Learning and Their Theoretical Study

【Author】 殷苌茗 (Yin Changming)

【Supervisor】 王汉兴 (Wang Hanxing)

【Author Information】 Shanghai University, Operations Research and Cybernetics, 2006, Doctoral dissertation

【Abstract】 This dissertation consists broadly of two parts. The first part presents several new algorithms for reinforcement learning, aimed at alleviating problems faced by existing algorithms such as the curse of dimensionality and slow computation. The second part studies theoretical questions concerning the optimality equations and optimal solutions of reinforcement learning based on the notion of risk sensitivity.

The dissertation first proposes a new reinforcement learning algorithm, which we call the forgetting algorithm. Its basic idea rests on the following observation: previous reinforcement learning algorithms determine the update length over the state value-function space only through short-term memory of how many times each state has been visited. Whether a lookup table or a function approximator is used, these algorithms must store the value function of every state. For problems with large state spaces this leads to memory requirements that grow exponentially, and hence to an exponential slowdown in computation. Sutton and others have considered this problem, but no satisfactory solution has been obtained. Motivated by these considerations, we introduce basic principles of forgetting from the psychology of memory into value-function reinforcement learning, in particular into the well-known SARSA(λ) algorithm, and obtain a class of forgetting algorithms suited to value-function reinforcement learning.

We then propose a reinforcement learning algorithm based on utility clustering, whose model uses some concepts from POMDPs. Much effort in reinforcement learning has been devoted to decomposing large state spaces effectively, and the U-Tree algorithm is one such method. However, the greatest weaknesses of U-Tree are the arbitrariness of fringe-node generation and the computational burden of its statistical tests, and none of its extensions has resolved them. This dissertation presents a new utility-clustering reinforcement learning algorithm, which we call U-Clustering. It requires no fringe-node generation or testing at all, thereby overcoming the fatal weakness of U-Tree mentioned above. The new algorithm first clusters instances according to the observed action values along the instance chain, then performs feature selection for each cluster, and finally compresses the selected features; the compressed features become the new leaf nodes. Experiments and simulations show that the new algorithm is more effective than the standard U-Tree algorithm.

To address environments with large state spaces whose states are only partially observable, the dissertation proposes a dynamic merge algorithm for solving partially observable Markov decision processes (POMDPs). Using the notion of a region, a region system is built over the environment's state space, and the agent pursues its optimal objective independently on each region, as if several agents were working in parallel. The optimal value functions of the components are then combined in a prescribed way to yield the optimal solution of the POMDP. We also analyze the complexity of the algorithm and carry out simulation experiments. Simulations on New York Driving show that the dynamic merge algorithm is very effective for problems with large state spaces.

The dissertation further proposes a generalized-average algorithm for risk-sensitive reinforcement learning, which gains robustness by potentially sacrificing optimality of the solution. Robustness becomes an important issue when there is a mismatch between the theoretical model and the actual physical system, when the real system is non-stationary, or when the availability of control actions changes over time. We replace the max (or sup) operator with a generalized averaging operator, study the resulting dynamic programming algorithms for reinforcement learning, and discuss their convergence, with the aim of improving the robustness of reinforcement learning algorithms.

We also propose a risk-sensitive evolutionary policy iteration algorithm and discuss the optimality of its policies. When the curse of dimensionality arises, for example when the action space of a Markov decision process is extremely large, carrying out policy improvement with ordinary policy iteration (PI) or value iteration (VI) is impractical. Our algorithm addresses how to obtain an optimal or good policy when the state space is relatively small but the action space is very large. During policy improvement it does not maximize the value function over the entire action space; instead it handles the policy directly through policy switching.

Another main contribution is a preliminary study of the optimality equations and the optimality of solutions for multi-time-scale risk-sensitive Markov decision processes. Because agents must adapt to increasingly complex environments, large-scale control problems are becoming ever more important in practice. We adopt a more realistic model of a complex environment, the multi-time-scale Markov decision process, and, using the notion of risk sensitivity, introduce for the first time the concept of a multi-time-scale risk-sensitive Markov decision process; this is an entirely new problem. Using utility functions and risk sensitivity, we discuss the two-time-scale risk-sensitive Markov decision problem, give the general form of its Bellman optimal control equation, prove that the optimal value function satisfies this equation, and obtain several related results.

The problems discussed in this dissertation are current hot and difficult topics in reinforcement learning. Reinforcement learning has become one of the most important branches of control theory and artificial intelligence research, and many problems remain open. At the end of the dissertation we outline directions for future research on these problems, in the hope of stimulating further work.
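As an illustration of the generalized-average idea, the sketch below (not taken from the dissertation: the toy MDP, the power-mean form of the operator, and the parameter p are assumptions for demonstration only) runs a value-iteration-style backup in which the max operator is replaced by a generalized (power) mean; as p grows the operator approaches max, so p trades optimality against robustness.

import numpy as np

# Toy MDP (assumed for demonstration): 4 states, 2 actions, nonnegative rewards,
# so the power mean is well defined on the backed-up action values.
n_states, n_actions, gamma, p = 4, 2, 0.9, 10.0
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # R[s, a]

def generalized_mean(q, p):
    """Power mean of nonnegative values; approaches max(q) as p -> infinity."""
    return np.mean(q ** p) ** (1.0 / p)

V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * P @ V            # Q[s, a] = R[s, a] + gamma * E[V(s') | s, a]
    V_new = np.array([generalized_mean(Q[s], p) for s in range(n_states)])
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

print("values under the generalized-mean backup:", np.round(V, 3))
print("greedy policy w.r.t. these values:", Q.argmax(axis=1))

Smaller values of p average over more of the action values instead of committing to the single best one, which is the robustness-for-optimality trade-off the generalized-average algorithm studies.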
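For orientation on the risk-sensitive theory in the second part, one standard finite-horizon form of the Bellman recursion under an exponential utility in the ordinary single-time-scale case (the two-time-scale equation established in the dissertation is more involved and is not reproduced here) is

\[ V_t(s) = \max_{a \in A}\Big\{\, r(s,a) + \tfrac{1}{\beta}\,\log \sum_{s' \in S} P(s' \mid s, a)\, e^{\beta V_{t+1}(s')} \Big\}, \qquad V_T(s) \equiv 0, \]

where \beta \neq 0 is the risk-sensitivity parameter: \beta < 0 gives risk-averse behaviour, \beta > 0 risk-seeking behaviour, and the limit \beta \to 0 recovers the usual risk-neutral (expected-value) Bellman equation.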

  • 【Online Publication Submitter】 Shanghai University
  • 【Online Publication Year and Issue】 2008, Issue 04