[Reinforcement Learning] Policy Gradient Methods

Date: 2019-11-06

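The previous post covered how to approximate the state-value function or the action-value function:
\[ V_{\theta}(s)\approx V^{\pi}(s) \\ Q_{\theta}(s, a)\approx Q^{\pi}(s, a) \]
Once the value function or the action-value function has been approximated with machine learning methods, control can be carried out with a policy derived from it, such as \(\epsilon\)-greedy.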
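As a small illustration of the value-based control step described above, here is a minimal sketch of \(\epsilon\)-greedy action selection. The function name `epsilon_greedy` and the action values in `q_values` are made up for the example; they are not taken from the original post.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action index from a vector of (approximate) action values."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: greedy action

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.5, 0.2])  # hypothetical Q_theta(s, .) for a single state
actions = [epsilon_greedy(q_values, 0.1, rng) for _ in range(1000)]
greedy_fraction = float(np.mean(np.array(actions) == 1))
```

The policy here is only implicit: it exists solely as a rule applied to the learned values, which is the point the following paragraphs contrast with learning the policy directly.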

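Recall the learning goal in RL: an agent interacts with the environment so as to maximise cumulative reward. Since what we ultimately want is a policy for interacting with the environment, can we learn that policy directly? By comparison, first approximating the value function and then controlling with a greedy policy is a rather roundabout route.
That is the topic of this post: how to learn the policy directly. In mathematical form:
\[\pi_{\theta}(s, a) = P[a | s, \theta]\]

Algorithms of this kind are called Policy Gradient (PG) methods.

As before, this post deals with model-free reinforcement learning.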

Value-Based vs. Policy-Based RL

Value-Based:

Learns a value function
Implicit policy, e.g. \(\epsilon\)-greedy

Policy-Based:

No value function
Learns the policy directly

Actor-Critic:

Learns a value function
Learns a policy

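The relationship among the three can be pictured as follows:
[Figure: relationship among Value-Based, Policy-Based, and Actor-Critic methods]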

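With the difference between Value-Based and Policy-Based methods in mind, consider the advantages and disadvantages of Policy-Based RL:

Advantages:

Better convergence properties
More effective in high-dimensional or continuous action spaces
Can learn stochastic policies

Disadvantages:

Typically converges to a local optimum rather than the global optimum
Evaluating a policy is typically inefficient and has high variance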

Policy Search

We first define the objective function.

Policy Objective Functions

Goal: given a policy \(\pi_{\theta}(s, a)\) with parameters \(\theta\), find the best \(\theta\).
But how do we measure the quality of the policy \(\pi_{\theta}(s, a)\) under different parameters?

For episodic tasks we can use the start value:
\[J_1(\theta)=V^{\pi_{\theta}}(s_1)=E_{\pi_{\theta}}[v_1]\]
For continuing tasks we can use the average value:
\[J_{avV}(\theta)=\sum_{s}d^{\pi_{\theta}}(s)V^{\pi_{\theta}}(s)\]
or the average reward per time step:
\[J_{avR}(\theta)=\sum_{s}d^{\pi_{\theta}}(s)\sum_{a}\pi_{\theta}(s, a)R_s^a\]
where \(d^{\pi_{\theta}}(s)\) is the stationary distribution of the Markov chain induced by \(\pi_{\theta}\).
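To make the average-reward objective concrete, the sketch below computes the stationary distribution \(d^{\pi_{\theta}}\) and \(J_{avR}\) for a tiny MDP. The transition tensor `P`, reward table `R`, and policy table `pi` are invented for illustration and are not from the original post.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers are illustrative assumptions).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])   # P[s, a, s'] transition probabilities
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # R[s, a] expected immediate reward
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])                # pi[s, a], a fixed stochastic policy

# State-to-state transition matrix of the Markov chain induced by the policy.
P_pi = np.einsum("sa,sat->st", pi, P)

# Stationary distribution: solve d^T P_pi = d^T together with sum(d) = 1.
n = P_pi.shape[0]
A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
d, *_ = np.linalg.lstsq(A, b, rcond=None)

# J_avR = sum_s d(s) sum_a pi(s, a) R[s, a]
J_avR = float(np.sum(d[:, None] * pi * R))
```

Changing `pi` changes both the inner average over actions and the distribution `d` itself, which is why the gradient of this objective is not obvious at first sight.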

Policy Optimisation

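With the objective defined, policy-based RL becomes a standard optimisation problem: find \(\theta\) that maximises \(J(\theta)\).
There are many optimisation methods. Some do not rely on gradients (gradient-free):

Hill climbing
Simulated annealing
Evolutionary algorithms
...

In general, though, when gradient information is available, gradient-based methods tend to be more efficient:

Gradient descent
Conjugate gradient
Quasi-Newton methods
...

This post focuses on the gradient descent family; since the aim is to maximise \(J(\theta)\), the parameters move along the gradient, i.e. gradient ascent.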

Policy Gradient Theorem

Assume the policy \(\pi_{\theta}\) is differentiable wherever it is non-zero and that the gradient \(\triangledown_{\theta}\pi_{\theta}(s, a)\) is known. Define \(\triangledown_{\theta}\log\pi_{\theta}(s, a)\) as the score function. The two are related by the likelihood-ratio identity:
\[\triangledown_{\theta}\pi_{\theta}(s, a) = \pi_{\theta}(s, a) \frac{\triangledown_{\theta}\pi_{\theta}(s, a)}{\pi_{\theta}(s, a)}=\pi_{\theta}(s, a)\triangledown_{\theta}\log\pi_{\theta}(s, a)\]
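The identity above can be checked numerically. The sketch below assumes a softmax policy over linear features, \(\pi_{\theta}(s, a) \propto e^{\phi(s, a)^T\theta}\), which is one common parameterisation and not something the post prescribes; the feature matrix `phi` and parameters `theta` are random placeholders. For this policy the score function is \(\phi(s, a) - E_{\pi_{\theta}}[\phi(s, \cdot)]\).

```python
import numpy as np

def softmax_policy(theta, phi):
    """Action probabilities for one state; phi has shape (n_actions, n_features)."""
    logits = phi @ theta
    logits -= logits.max()               # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def score(theta, phi, a):
    """grad_theta log pi_theta(s, a) for the softmax policy."""
    pi = softmax_policy(theta, phi)
    return phi[a] - pi @ phi             # phi(s, a) - E_pi[phi(s, .)]

rng = np.random.default_rng(0)
phi = rng.normal(size=(3, 4))            # hypothetical features: 3 actions, 4 features
theta = rng.normal(size=4)
a = 1

# Left-hand side: gradient of pi_theta(s, a) by central finite differences.
eps = 1e-6
grad_pi_fd = np.array([
    (softmax_policy(theta + eps * e, phi)[a]
     - softmax_policy(theta - eps * e, phi)[a]) / (2 * eps)
    for e in np.eye(4)
])

# Right-hand side: pi_theta(s, a) * grad log pi_theta(s, a).
grad_pi_identity = softmax_policy(theta, phi)[a] * score(theta, phi, a)
```

The two vectors agree up to finite-difference error, which is what makes it possible to estimate gradients from samples drawn from the policy itself.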
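Next, consider a one-step MDP and apply the policy gradient to it. \(\pi_{\theta}(s, a)\) is a function of the parameters \(\theta\) that gives the probability \(p(a|s,\theta)\). Starting in a state \(s\) drawn from the distribution \(d(s)\), the agent takes a single step, receives reward \(r=R_{s, a}\), and the episode ends. The contribution of action \(a\) to the expected reward is \(\pi_{\theta}(s, a)R_{s, a}\), the expected reward in state \(s\) is \(\sum_{a\in A}\pi_{\theta}(s, a)R_{s, a}\), and the expected reward under the policy and its gradient are: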
\[ J(\theta)=E_{\pi_{\theta}}[r] = \sum_{s\in S}d(s)\sum_{a\in A}\pi_{\theta}(s, a)R_{s, a}\\ \triangledown_{\theta}J(\theta) = \color{Red}{\sum_{s\in S}d(s)\sum_{a\in A}\pi_{\theta}(s, a)}\triangledown_{\theta}\log\pi_{\theta}(s, a)R_{s, a}=E_{\pi_{\theta}}[\triangledown_{\theta}\log\pi_{\theta}(s, a)r] \]
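The expectation form of the gradient means it can be estimated by sampling. The following sketch compares the exact gradient with a Monte-Carlo estimate in the simplest case, assuming a single state and a tabular softmax policy with one preference per action; the reward vector `R` is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([1.0, 0.0, 2.0])   # hypothetical rewards R_{s,a} for the single state
theta = np.zeros(3)             # one preference per action

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

pi = policy(theta)

# Exact gradient: sum_a pi(a) * grad log pi(a) * R_a, with grad log pi(a) = onehot(a) - pi.
exact_grad = pi * (R - pi @ R)

# Sample-based estimate: average of score * reward over actions drawn from pi.
n_samples = 100_000
actions = rng.choice(3, size=n_samples, p=pi)
scores = np.eye(3)[actions] - pi                       # shape (n_samples, 3)
mc_grad = (scores * R[actions, None]).mean(axis=0)
```

Only sampled actions and their rewards enter `mc_grad`; the reward table is never differentiated, which is why the same estimator works in the model-free setting.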

Now consider a multi-step MDP. Replacing the immediate reward \(r\) with \(Q^{\pi_{\theta}}(s, a)\) gives, for any differentiable policy, the policy gradient:
\[\triangledown_{\theta}J(\theta) = E_{\pi_{\theta}}[\triangledown_{\theta}\log\pi_{\theta}(s, a)Q^{\pi_{\theta}}(s, a)]\]

Statement of the Theorem

For any differentiable policy \(\pi_{\theta}(s, a)\) and any of the policy objective functions \(J = J_1, J_{avR}, ...\), the policy gradient is:
\[\triangledown_{\theta}J(\theta) = E_{\pi_{\theta}}[\triangledown_{\theta}\log\pi_{\theta}(s, a)Q^{\pi_{\theta}}(s, a)]\]

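Monte-Carlo Policy Gradient (REINFORCE)

The Monte-Carlo policy gradient algorithm, known as REINFORCE:

Samples episodes and uses them to update the parameters;
Updates the parameters by stochastic gradient ascent;
Uses the return \(v_t\) as an unbiased sample of \(Q^{\pi_{\theta}}(s_t, a_t)\).

The update is therefore \(\Delta\theta_t = \alpha \triangledown_{\theta}\log\pi_{\theta}(s_t, a_t)v_t\), applied at every time step \(t\) of every sampled episode.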
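The REINFORCE update can be sketched on a toy problem. Everything about the environment below is an assumption made for illustration: a four-state corridor with the goal at the right end, a reward of -1 per step, a tabular softmax policy, undiscounted returns, and hand-picked values for the step size and episode count.

```python
import numpy as np

N_STATES, GOAL, MAX_STEPS = 4, 3, 50   # hypothetical corridor: states 0..3, goal at 3

def step(s, a):
    """Action 0 moves left (blocked at state 0), action 1 moves right; each step costs -1."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    return s_next, -1.0, s_next == GOAL

def policy(theta, s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def run_episode(theta, rng):
    s, trajectory = 0, []
    for _ in range(MAX_STEPS):
        a = rng.choice(2, p=policy(theta, s))
        s_next, r, done = step(s, a)
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    return trajectory

def reinforce(n_episodes=2000, alpha=0.02, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, 2))
    for _ in range(n_episodes):
        v = 0.0
        for s, a, r in reversed(run_episode(theta, rng)):
            v += r                          # return v_t from time step t onwards
            grad_log = -policy(theta, s)
            grad_log[a] += 1.0              # score of the tabular softmax policy
            theta[s] += alpha * grad_log * v
    return theta

theta = reinforce()
p_right = np.array([policy(theta, s)[1] for s in range(GOAL)])
```

After training, `p_right` holds the probability of moving right in each non-terminal state. Because every return here is negative, each sampled action is pushed down in proportion to how bad its return was, and the policy improves only through the differences between actions; this is one way to see the high variance discussed next.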

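Actor-Critic Policy Gradient

The Monte-Carlo policy gradient has high variance. Instead of estimating the action-value function Q from returns, we can use a critic to estimate it:
\[Q_w(s, a)\approx Q^{\pi_{\theta}}(s, a)\]
This is the well-known Actor-Critic algorithm, which maintains two sets of parameters:

Critic: updates the action-value function parameters \(w\)
Actor: updates the policy parameters \(\theta\) in the direction suggested by the critic

Actor-Critic follows an approximate policy gradient: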
\[ \triangledown_\theta J(\theta)\approx E_{\pi_{\theta}}[\triangledown_{\theta}\log \pi_{\theta}(s, a)Q_w(s, a)]\\ \Delta\theta = \alpha\triangledown_\theta\log\pi_{\theta}(s,a)Q_w(s,a) \]

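The critic is essentially doing policy evaluation: how good is policy \(\pi_{\theta}\) for the current parameters \(\theta\)?
Policy evaluation was covered in earlier posts: MC, TD, TD(\(\lambda\)), and value function approximation. A simple Actor-Critic algorithm uses action-value function approximation for the critic with the simplest choice, a linear function \(Q_w(s, a) = \phi(s, a)^T w\). The critic updates \(w\) by linear TD(0), and the actor updates \(\theta\) by the policy gradient.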
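A minimal sketch of this action-value Actor-Critic with a linear critic follows. The corridor environment, the one-hot features `phi`, the softmax actor, and the step sizes `alpha` and `beta` are all assumptions chosen for illustration rather than details from the post.

```python
import numpy as np

N_STATES, N_ACTIONS, GOAL = 4, 2, 3    # hypothetical corridor, goal at state 3

def step(s, a):
    """Action 0 moves left (blocked at 0), action 1 moves right; each step costs -1."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    return s_next, -1.0, s_next == GOAL

def phi(s, a):
    """One-hot feature vector for the pair (s, a)."""
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0
    return x

def policy(theta, s):
    prefs = np.array([phi(s, b) @ theta for b in range(N_ACTIONS)])
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

def score(theta, s, a):
    pi = policy(theta, s)
    return phi(s, a) - sum(pi[b] * phi(s, b) for b in range(N_ACTIONS))

def qac(n_episodes=3000, alpha=0.02, beta=0.1, gamma=1.0, max_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(N_STATES * N_ACTIONS)   # actor parameters
    w = np.zeros(N_STATES * N_ACTIONS)       # critic parameters, Q_w = phi^T w
    for _episode in range(n_episodes):
        s = 0
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        for _t in range(max_steps):
            s_next, r, done = step(s, a)
            q_sa = phi(s, a) @ w
            if done:
                a_next, q_next = None, 0.0   # terminal state has value 0
            else:
                a_next = rng.choice(N_ACTIONS, p=policy(theta, s_next))
                q_next = phi(s_next, a_next) @ w
            delta = r + gamma * q_next - q_sa           # TD error for the critic
            theta += alpha * score(theta, s, a) * q_sa  # actor: policy gradient step
            w += beta * delta * phi(s, a)               # critic: linear TD(0) step
            if done:
                break
            s, a = s_next, a_next
    return theta, w

theta, w = qac()
p_right = np.array([policy(theta, s)[1] for s in range(GOAL)])
```

Unlike REINFORCE, the parameters are updated at every step using the critic's current estimate instead of waiting for the full return, which trades variance for bias.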

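Approximating the action-value function in Actor-Critic introduces bias into the policy gradient. However, when the following two conditions hold, the policy gradient is exact:

The value function approximator is compatible with the policy: \(\triangledown_w Q_w(s,a) = \triangledown_\theta\log\pi_{\theta}(s,a)\)
The value function parameters \(w\) minimise the mean-squared error: \(\epsilon = E_{\pi_{\theta}}[(Q^{\pi_{\theta}}(s, a) - Q_w(s,a))^2]\)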

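Advantage Function

In addition, subtracting a baseline function \(B(s)\) from the policy gradient can reduce variance without changing the expectation. Showing that the expectation is unchanged amounts to showing that the baseline term has zero expectation:
\[ \begin{aligned} E_{\pi_{\theta}}[\triangledown_\theta\log\pi_{\theta}(s,a)B(s)] &=\sum_{s\in S}d^{\pi_{\theta}}(s)\sum_{a\in A} \triangledown_\theta\pi_{\theta}(s, a)B(s)\\ &=\sum_{s\in S}d^{\pi_{\theta}}(s)B(s)\triangledown_\theta\sum_{a\in A}\pi_{\theta}(s,a )\\ &= 0 \end{aligned} \]
The last step holds because \(\sum_{a\in A}\pi_{\theta}(s,a) = 1\) for every \(\theta\), so its gradient is zero.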
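The effect of a baseline can be verified by exact enumeration on a small example. The sketch assumes a tabular softmax policy and uses random placeholder values for the state distribution `d` and the action values `Q`; only the structure of the computation mirrors the derivation above.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 3, 4
theta = rng.normal(size=(S, A))
pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
d = rng.dirichlet(np.ones(S))             # stand-in for the state distribution d(s)
Q = rng.normal(loc=10.0, size=(S, A))     # hypothetical action values
B = (pi * Q).sum(axis=1)                  # baseline B(s) = sum_a pi(s, a) Q(s, a)

def mean_and_variance(weights):
    """Exact mean and total variance of score(s, a) * weights[s, a] under d(s) pi(s, a)."""
    mean = np.zeros((S, A))
    second_moment = 0.0
    for s in range(S):
        for a in range(A):
            g = np.zeros((S, A))
            g[s] = -pi[s]
            g[s, a] += 1.0                # score of the tabular softmax policy
            g *= weights[s, a]
            p = d[s] * pi[s, a]
            mean += p * g
            second_moment += p * np.sum(g ** 2)
    return mean, second_moment - np.sum(mean ** 2)

baseline_mean, _ = mean_and_variance(np.broadcast_to(B[:, None], (S, A)))
grad_q, var_q = mean_and_variance(Q)
grad_adv, var_adv = mean_and_variance(Q - B[:, None])
```

`baseline_mean` is zero up to rounding, `grad_q` and `grad_adv` coincide, and `var_adv` is much smaller than `var_q` in this example because the action values share a large common offset that the baseline removes.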

The state-value function \(V^{\pi_{\theta}}(s)\) is a good baseline. The policy gradient can therefore be rewritten using the advantage function \(A^{\pi_{\theta}}(s,a)\):
\[ A^{\pi_{\theta}}(s,a)=Q^{\pi_{\theta}}(s,a)-V^{\pi_{\theta}}(s)\\ \triangledown_\theta J(\theta)=E_{\pi_{\theta}}[\triangledown_\theta\log\pi_{\theta}(s,a)A^{\pi_{\theta}}(s,a)] \]

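Let \(V^{\pi_{\theta}}(s)\) be the true value function. TD learning uses the Bellman equation to approach the true values, with TD error \(\delta^{\pi_{\theta}}=r+\gamma V^{\pi_{\theta}}(s') - V^{\pi_{\theta}}(s)\). This TD error is an unbiased estimate of the advantage function, since \(E_{\pi_{\theta}}[\delta^{\pi_{\theta}} | s, a] = Q^{\pi_{\theta}}(s, a) - V^{\pi_{\theta}}(s) = A^{\pi_{\theta}}(s, a)\). We can therefore use the TD error to compute the policy gradient: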
\[\triangledown_\theta J(\theta)=E_{\pi_{\theta}}[\triangledown_\theta\log\pi_{\theta}(s,a)\delta^{\pi_{\theta}}]\]
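With this approach the critic only needs to approximate the state-value function, so a single set of critic parameters suffices; the actor is still needed to represent the policy. For more on the advantage function, see [3].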
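The claim that the TD error estimates the advantage without bias can be checked on a small MDP whose true value function is computable. The MDP below (`P`, `R`, `pi`, `gamma`) is invented for the example, and rewards are assumed to be deterministic given \((s, a)\).

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])   # P[s, a, s']
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # R[s, a]
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])                # pi[s, a]

# True value functions of the policy from the Bellman equation.
P_pi = np.einsum("sa,sat->st", pi, P)
R_pi = (pi * R).sum(axis=1)
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
Q = R + gamma * P @ V                      # Q[s, a] = R[s, a] + gamma * E[V(s')]
A = Q - V[:, None]                         # advantage function

# Sample TD errors for one fixed (s, a) pair and average them.
s, a = 0, 1
s_next = rng.choice(2, size=200_000, p=P[s, a])
delta = R[s, a] + gamma * V[s_next] - V[s]
```

The sample mean of `delta` approaches `A[s, a]`. In practice the true \(V^{\pi_{\theta}}\) is unknown and is replaced by an approximation, at which point the estimate is no longer exactly unbiased.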

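To summarise, the policy gradient algorithms in this post all take the form \(E_{\pi_{\theta}}[\triangledown_\theta\log\pi_{\theta}(s,a)\,\cdot\,]\) and differ only in the term that weights the score function: the return \(v_t\) (REINFORCE), the critic's estimate \(Q_w(s,a)\) (Actor-Critic), the advantage \(A^{\pi_{\theta}}(s,a)\) (advantage Actor-Critic), or the TD error \(\delta^{\pi_{\theta}}\) (TD Actor-Critic).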

Reference

[1] Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2018
[2] David Silver's Homepage
[3] Advantage Learning