Why do policy gradient methods have high variance?

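Policy gradient methods

In policy gradient methods, the objective is to maximize the expected discounted reward accumulated over an entire episode:

$$\max_\theta \; \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$$

Since:

$$\nabla_\theta \mathbb{E}[f(x)] = \nabla_\theta \int p_\theta(x) f(x)\,dx = \int p_\theta(x) \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)} f(x)\,dx = \int p_\theta(x) \nabla_\theta \log p_\theta(x) f(x)\,dx = \mathbb{E}\left[f(x) \nabla_\theta \log p_\theta(x)\right]$$

and:

$$\nabla_\theta \log p_\theta(\tau) = \nabla \log\Big(\mu(s_0) \prod_{t=0}^{T-1} \cdots$$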
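The identity $\nabla_\theta \mathbb{E}[f(x)] = \mathbb{E}[f(x) \nabla_\theta \log p_\theta(x)]$ can be checked numerically. The following is a minimal sketch on a toy problem that is an assumption for illustration only, not part of the original post: $x \sim \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, so that $\mathbb{E}[f(x)] = \theta^2 + 1$ and the exact gradient is $2\theta$. The function name `score_function_samples` is hypothetical. The sketch also measures how much the individual single-sample estimates scatter around their mean.

```python
import random
import statistics


def score_function_samples(theta, n, seed=0):
    """Return n single-sample estimates f(x) * d/dtheta log p_theta(x).

    Toy setup (an assumption for illustration): x ~ N(theta, 1), f(x) = x^2.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = rng.gauss(theta, 1.0)   # draw x ~ N(theta, 1)
        f = x * x                   # toy "reward" f(x) = x^2
        score = x - theta           # d/dtheta log N(x; theta, 1) = x - theta
        samples.append(f * score)
    return samples


theta = 1.0
grads = score_function_samples(theta, n=100_000)

# Averaging the samples gives the Monte Carlo estimate of the gradient.
grad_estimate = statistics.fmean(grads)
# Sample variance of the individual (single-sample) estimates.
grad_variance = statistics.variance(grads)

print(f"estimated gradient:  {grad_estimate:.3f}  (exact: {2 * theta:.3f})")
print(f"per-sample variance: {grad_variance:.1f}")
```

In this toy setup the average lands close to the exact gradient $2\theta = 2$, while the variance of a single sample is $\theta^4 + 14\theta^2 + 15 = 30$ at $\theta = 1$, i.e. a standard deviation several times larger than the quantity being estimated. The estimator is unbiased, but each individual sample is a very noisy signal.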