TensorFlow Code for DDPG in Reinforcement Learning

Date: 2019-11-13

Tags: reinforcement learning, DDPG, TensorFlow, code

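Deep Deterministic Policy Gradient is abbreviated DDPG. It builds on DPG and borrows from the DQN implementation in order to make neural-network training more stable. DDPG creates two networks, a target network and an eval network, and also uses experience replay. "Deep" refers to using deep neural networks as function approximators; the replay buffer and the two-network structure are what allow those networks to learn effectively.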

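In the two-network structure, parameters are copied from the eval network to the target network at regular intervals. Traditional DQN usually uses a 'hard' update of the target-net parameters: every fixed number of steps, the eval-net parameters are assigned to the target-net in full. DDPG can instead use a 'soft' update: at every step the target-net parameters are moved a little toward the eval-net parameters. Experiments show that this kind of update noticeably improves the stability of learning.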
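
The difference between the two update modes can be sketched in plain Python. This is only an illustration under stated assumptions: the parameters are a short list of floats rather than network weights, `HARD_INTERVAL` and the fixed drift of the eval parameters are hypothetical, and `TAU` reuses the value from the hyperparameters in this post.

```python
TAU = 0.01           # soft-update rate (same value as in the hyperparameters)
HARD_INTERVAL = 100  # hypothetical copy interval for the hard update

def hard_update(source):
    """Hard mode: overwrite the target parameters with a copy of the eval parameters."""
    return list(source)

def soft_update(target, source, tau=TAU):
    """Soft mode: move every target parameter a fraction tau toward the eval parameter."""
    return [(1 - tau) * t + tau * s for t, s in zip(target, source)]

eval_params = [0.0, 0.0]
hard_target = [0.0, 0.0]
soft_target = [0.0, 0.0]

for step in range(1, 251):
    # Pretend that training shifts the eval parameters by a fixed amount each step.
    eval_params = [p + 0.01 for p in eval_params]
    if step % HARD_INTERVAL == 0:
        hard_target = hard_update(eval_params)   # jumps every HARD_INTERVAL steps
    soft_target = soft_update(soft_target, eval_params)  # drifts every step

print("eval       :", eval_params[0])
print("hard target:", hard_target[0])
print("soft target:", soft_target[0])
```

The hard target stays frozen between copies and then jumps, while the soft target follows the eval parameters smoothly with a lag.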

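The key points of DDPG are:

DDPG can be seen as a combination of three methods: Nature DQN, Actor-Critic and DPG.
The Critic takes both the state and the action as input.
The Actor no longer updates itself from its own loss function and the reward. Following DPG, it is updated with the gradient of the Critic's Q value with respect to the action.
It borrows from Nature DQN: a replay buffer, random sampling and target networks are added, and the target ("real") Q value is computed jointly by the two target networks (target Actor and target Critic).
The target networks are updated softly: their parameters are changed slowly on every batch.
The idea of ε-greedy exploration is carried over to continuous actions by adding noise from an Ornstein-Uhlenbeck process to the action.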
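
The exploration noise mentioned in the last point can be sketched as a discretised Ornstein-Uhlenbeck process. This is a minimal sketch, not the code of the referenced repository: the class name `OUNoise`, the helper `noisy_action`, and the values of `theta`, `sigma` and `dt` are illustrative assumptions.

```python
import random

class OUNoise:
    """Discretised Ornstein-Uhlenbeck process (hypothetical helper):
    x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
    """

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.x = mu  # start at the long-run mean

    def sample(self):
        drift = self.theta * (self.mu - self.x) * self.dt          # pull back toward mu
        diffusion = self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x

def noisy_action(action, noise, bound):
    """Add temporally correlated noise to a scalar action and clip to [-bound, bound]."""
    return max(-bound, min(bound, action + noise.sample()))

noise = OUNoise(seed=0)
samples = [noise.sample() for _ in range(1000)]
actions = [noisy_action(1.9, noise, bound=2.0) for _ in range(100)]
```

Unlike independent Gaussian noise, consecutive samples are correlated, so the perturbed action drifts instead of jittering around the policy output.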

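DDPG does not compute a separate loss for the actor directly; it uses the critic's evaluation of the actor as the loss. One way to understand this: the actor's goal is to produce an action with as high a Q value as possible, so the actor's loss can be read simply as "the larger the Q value fed back, the smaller the loss; the smaller the Q value, the larger the loss".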

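In actor(θ), the gradient of the action with respect to the parameters is da/dθ, and in the critic the gradient of Q with respect to the action is dq/da. The gradient of the actor loss with respect to θ is therefore -(dq/da * da/dθ). The sign is negative because the optimizer minimizes the loss, whereas our goal is to maximize the Q value.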
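
The chain rule above can be checked numerically on a toy problem. Everything here is an assumption made for illustration: a scalar linear actor a = θ·s and a hand-written critic q(s, a) = -(a - 2s)², neither of which comes from the referenced code.

```python
def actor(theta, s):
    """Toy deterministic policy: a = theta * s."""
    return theta * s

def critic(s, a):
    """Toy Q function whose best action is a = 2 * s."""
    return -(a - 2.0 * s) ** 2

def actor_loss_grad(theta, s):
    """Analytic gradient of the actor loss -q with respect to theta: -(dq/da * da/dtheta)."""
    a = actor(theta, s)
    dq_da = -2.0 * (a - 2.0 * s)   # gradient of Q with respect to the action
    da_dtheta = s                  # gradient of the action with respect to theta
    return -(dq_da * da_dtheta)

def numeric_loss_grad(theta, s, eps=1e-6):
    """Central finite difference of the loss -q(s, actor(theta, s))."""
    def loss(th):
        return -critic(s, actor(th, s))
    return (loss(theta + eps) - loss(theta - eps)) / (2 * eps)

theta, s = 0.5, 1.5
analytic = actor_loss_grad(theta, s)
numeric = numeric_loss_grad(theta, s)

# One gradient-descent step on the loss should raise the Q value.
q_before = critic(s, actor(theta, s))
theta_new = theta - 0.1 * analytic
q_after = critic(s, actor(theta_new, s))
print(analytic, numeric, q_before, q_after)
```

Minimizing -q with an ordinary optimizer is thus the same as ascending the Q value through the critic.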

For the DDPG code, refer to this implementation: https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/9_Deep_Deterministic_Policy_Gradient_DDPG/DDPG.py

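1. Define the hyperparameters

We first define the hyperparameters, such as the size of the replay buffer and the learning rates of the two networks: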

MAX_EPISODES = 200
MAX_EP_STEPS = 200
LR_A = 0.001    # learning rate for actor
LR_C = 0.002    # learning rate for critic
GAMMA = 0.9     # reward discount
TAU = 0.01      # soft replacement
MEMORY_CAPACITY = 10000
BATCH_SIZE = 32
RENDER = False
ENV_NAME = 'Pendulum-v0'

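2. Define the network inputs

The placeholders we need are the current state S, the next state S', and the corresponding reward R. The action A is produced by the Actor, so it does not need a placeholder of its own: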

self.S = tf.placeholder(tf.float32, [None, s_dim], 's')
self.S_ = tf.placeholder(tf.float32,  [None, s_dim], 's_')
self.R = tf.placeholder(tf.float32, [None, 1], 'r')

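3. Build the two networks

Both networks are two-layer fully connected networks. The Actor outputs a concrete action, and the Critic outputs a concrete Q value.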

def _build_a(self, s, scope, trainable):
  # Actor: state -> deterministic action
  with tf.variable_scope(scope):
    net = tf.layers.dense(s, 30, activation=tf.nn.relu, name='l1', trainable=trainable)
    # tanh bounds the raw output to [-1, 1]
    a = tf.layers.dense(net, self.a_dim, activation=tf.nn.tanh, name='a', trainable=trainable)
    # rescale to the environment's action range [-a_bound, a_bound]
    return tf.multiply(a, self.a_bound, name='scaled_a')

def _build_c(self, s, a, scope, trainable):
  # Critic: (state, action) -> Q(s, a)
  with tf.variable_scope(scope):
    n_l1 = 30
    w1_s = tf.get_variable('w1_s', [self.s_dim, n_l1], trainable=trainable)
    w1_a = tf.get_variable('w1_a', [self.a_dim, n_l1], trainable=trainable)
    b1 = tf.get_variable('b1', [1, n_l1], trainable=trainable)
    # the first layer combines the state input and the action input
    net = tf.nn.relu(tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1)
    return tf.layers.dense(net, 1, trainable=trainable)  # Q(s,a)
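
To make the Actor's output range concrete, here is a framework-free sketch of the same two-layer forward pass. It assumes Pendulum-like sizes (3 state dimensions, 1 action dimension, action bound 2.0) and random, untrained weights; the helper names `dense` and `actor_forward` are hypothetical.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def dense(x, weights, bias, activation):
    """Fully connected layer: activation(x @ W + b), with W indexed as W[input][output]."""
    return [
        activation(sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j])
        for j in range(len(bias))
    ]

def actor_forward(s, params, a_bound):
    """Hidden ReLU layer, tanh output layer, then rescaling to [-a_bound, a_bound]."""
    hidden = dense(s, params["w1"], params["b1"], relu)
    squashed = dense(hidden, params["w2"], params["b2"], math.tanh)
    return [a_bound * v for v in squashed]

rng = random.Random(0)
s_dim, n_hidden, a_dim, a_bound = 3, 30, 1, 2.0   # assumed Pendulum-like sizes
params = {
    "w1": [[rng.gauss(0, 1) for _ in range(n_hidden)] for _ in range(s_dim)],
    "b1": [0.0] * n_hidden,
    "w2": [[rng.gauss(0, 1) for _ in range(a_dim)] for _ in range(n_hidden)],
    "b2": [0.0] * a_dim,
}

action = actor_forward([0.5, -0.2, 1.0], params, a_bound)
print(action)
```

Whatever the weights are, tanh followed by the multiplication keeps the action inside the valid range, so no extra clipping of the network output is needed.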

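4. Soft parameter update

As you can see, this is a soft parameter update: at every step we start from the current target-net parameters and change them slightly, blending in a small amount of the eval-net parameters.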

# networks parameters
self.ae_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval')
self.at_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target')
self.ce_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/eval')
self.ct_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target')
 
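# target net replacement: move each target variable a fraction TAU toward its eval counterpart.
# Actor and Critic are zipped separately: a single four-way zip stops at the shorter list
# and would skip a Critic variable when the two networks have different variable counts.
self.soft_replace = (
    [tf.assign(t, (1 - TAU) * t + TAU * e) for t, e in zip(self.at_params, self.ae_params)] +
    [tf.assign(t, (1 - TAU) * t + TAU * e) for t, e in zip(self.ct_params, self.ce_params)]
)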
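
To get a feel for how slowly the target follows with TAU = 0.01, the sketch below holds the eval parameter fixed and counts soft updates. It is a scalar illustration only; the helper `steps_to_close` is hypothetical.

```python
TAU = 0.01

def steps_to_close(fraction, tau=TAU):
    """Number of soft updates until the target has closed `fraction` of its gap to a fixed eval value.

    Each update multiplies the remaining gap by (1 - tau).
    """
    steps, gap = 0, 1.0
    while gap > 1.0 - fraction:
        gap *= 1.0 - tau
        steps += 1
    return steps

# Direct simulation: target starts at 0, eval parameter is held at 1.
target, eval_value = 0.0, 1.0
for _ in range(100):
    target = (1 - TAU) * target + TAU * eval_value

print("target after 100 updates:", target)
print("updates to close half the gap:", steps_to_close(0.5))
```

The remaining gap shrinks geometrically, so a smaller TAU gives a more stable but more slowly moving target.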

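5. Define the losses of the two networks

Updating the actor network is also simple. We first need the derivative of the critic network's output with respect to the action a, where a is the action the actor network estimates from the state s.

In other words, we first let the actor network estimate the action, then differentiate the critic network's output q with respect to that estimated action.

We then multiply this gradient by the gradient of the actor network's output with respect to the actor network's weights, which gives the final gradient.

The losses of the two networks were described in detail earlier; the code here simply implements that reasoning.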

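# TD target: reward plus the discounted Q value given by the target networks
q_target = self.R + GAMMA * q_
# in the feed_dict for the td_error, self.a should be replaced by the actions stored in memory
td_error = tf.losses.mean_squared_error(labels=q_target, predictions=q)
# the critic minimizes the TD error and updates only the Critic/eval variables
self.ctrain = tf.train.AdamOptimizer(LR_C).minimize(td_error, var_list=self.ce_params)
a_loss = - tf.reduce_mean(q)    # maximize the q
# the actor minimizes -Q and updates only the Actor/eval variables
self.atrain = tf.train.AdamOptimizer(LR_A).minimize(a_loss, var_list=self.ae_params)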
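
The two losses can be worked through by hand on a tiny batch. The numbers below are made up for illustration, and the helper functions are hypothetical stand-ins for the TensorFlow operations.

```python
GAMMA = 0.9  # reward discount, as in the hyperparameters

def td_targets(rewards, next_q, gamma=GAMMA):
    """q_target = r + gamma * q_, where q_ comes from the target networks."""
    return [r + gamma * q_next for r, q_next in zip(rewards, next_q)]

def mean_squared_error(labels, predictions):
    """Critic loss: mean of (q_target - q)^2 over the batch."""
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)

def actor_loss(q_values):
    """Actor loss: negative mean Q, so minimizing it maximizes Q."""
    return -sum(q_values) / len(q_values)

rewards = [-1.0, -0.5, 0.0]   # made-up batch of rewards
next_q = [-2.0, -1.0, -3.0]   # made-up Q values from the target critic for (s_, target action)
q = [-3.0, -1.0, -2.0]        # made-up Q values from the eval critic for (s, a)

targets = td_targets(rewards, next_q)
critic_loss = mean_squared_error(targets, q)
a_loss = actor_loss(q)
print(targets, critic_loss, a_loss)
```

In the TensorFlow code, `var_list` restricts each optimizer to its own eval network, so the critic loss never changes the actor and the actor loss never changes the critic.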

6. Learning

We first draw a batch of data from the replay buffer, then train our Actor and Critic.

def learn(self):
  # soft target replacement
  self.sess.run(self.soft_replace)
  # sample a batch of row indices; this assumes the memory has already been filled
  indices = np.random.choice(MEMORY_CAPACITY, size=BATCH_SIZE)
  bt = self.memory[indices, :]
  # each row is laid out as [s, a, r, s_]
  bs = bt[:, :self.s_dim]
  ba = bt[:, self.s_dim: self.s_dim + self.a_dim]
  br = bt[:, -self.s_dim - 1: -self.s_dim]
  bs_ = bt[:, -self.s_dim:]
  # update the actor, then the critic with the actions stored in memory
  self.sess.run(self.atrain, {self.S: bs})
  self.sess.run(self.ctrain, {self.S: bs, self.a: ba, self.R: br, self.S_: bs_})
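
The slicing in learn relies on every stored row having the layout [s, a, r, s_]. The sketch below applies the same index arithmetic to a single plain-Python row, assuming 3 state dimensions and 1 action dimension; `split_transition` is a hypothetical helper.

```python
def split_transition(row, s_dim, a_dim):
    """Split one stored row [s, a, r, s_] with the same slices used on the batch."""
    s = row[:s_dim]
    a = row[s_dim:s_dim + a_dim]
    r = row[-s_dim - 1:-s_dim]   # the single reward sits just before the next state
    s_next = row[-s_dim:]
    return s, a, r, s_next

s_dim, a_dim = 3, 1   # assumed Pendulum-like sizes
row = [0.1, 0.2, 0.3,  1.5,  -0.7,  0.4, 0.5, 0.6]   # s, a, r, s_
s, a, r, s_next = split_transition(row, s_dim, a_dim)
print(s, a, r, s_next)
```

Slicing the reward as a range rather than a single index keeps it as a column of shape [batch, 1], which matches the R placeholder.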

7. Store experience

Store s, a, r, s_ in the in-memory array.

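def store_transition(self, s, a, r, s_):
  # pack one transition into a single row: [s, a, r, s_]
  transition = np.hstack((s, a, [r], s_))
  index = self.pointer % MEMORY_CAPACITY  # replace the old memory with new memory
  self.memory[index, :] = transition
  self.pointer += 1  # advance the write position for the next transition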
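
The wrap-around behaviour of the pointer can be seen in a small standalone sketch. The class `ReplayBuffer` is hypothetical and stores tuples in a Python list instead of rows in a NumPy array; unlike the code above, its `sample` method draws only from slots that have already been written.

```python
import random

class ReplayBuffer:
    """Fixed-size ring buffer: once full, new transitions overwrite the oldest ones."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.memory = [None] * capacity
        self.pointer = 0                 # total number of transitions stored so far
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        index = self.pointer % self.capacity   # wrap around when the buffer is full
        self.memory[index] = (s, a, r, s_next)
        self.pointer += 1

    def sample(self, batch_size):
        filled = min(self.pointer, self.capacity)   # ignore slots not written yet
        return [self.memory[self.rng.randrange(filled)] for _ in range(batch_size)]

buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.store([t], [0.0], float(t), [t + 1])

stored_rewards = sorted(transition[2] for transition in buffer.memory)
batch = buffer.sample(2)
print(stored_rewards, batch)
```

With a capacity of 3 and 5 stored transitions, the two oldest entries have been overwritten and only the three most recent remain.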