Reinforcement Learning Example 2: MDP

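The red block moves to the yellow block; black blocks are obstacles. A Markov chain algorithm predicts the best path. The value function is the reward r plus the discounted value of the ending state. SARSA stands for state, action, reward, next state, and next action. It is known as an on-policy reinforcement learning algorithm.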
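To make the SARSA idea concrete, here is a minimal sketch in Python. Everything specific in it is an assumption made for illustration and does not come from the original example: the 4x4 grid layout, the reward values (+1.0 at the goal, -0.04 per step), the hyperparameters, and the helper names `step`, `choose`, `sarsa`, and `greedy_path`. Each update moves Q(s, a) toward the reward plus the discounted value of the next state-action pair, which is what makes the method on-policy.

```python
import random

# Hypothetical grid: 'S' = start (red block), 'G' = goal (yellow block),
# '#' = obstacle (black block). The layout is an assumption for illustration.
GRID = [
    "S..#",
    ".#..",
    "...#",
    "#..G",
]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
START, GOAL = (0, 0), (3, 3)


def step(state, action):
    """Move one cell; bumping into a wall or obstacle leaves the state unchanged."""
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < len(GRID) and 0 <= c < len(GRID[0])) or GRID[r][c] == "#":
        r, c = state
    done = (r, c) == GOAL
    reward = 1.0 if done else -0.04  # assumed reward scheme
    return (r, c), reward, done


def choose(q, state, epsilon, rng):
    """Epsilon-greedy choice of an action index under the current Q table."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    values = [q.get((state, a), 0.0) for a in range(len(ACTIONS))]
    return values.index(max(values))


def sarsa(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0, max_steps=200):
    """Tabular SARSA: update Q(s, a) from the tuple (s, a, r, s', a')."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        s = START
        a = choose(q, s, epsilon, rng)
        for _ in range(max_steps):
            s2, r, done = step(s, ACTIONS[a])
            a2 = choose(q, s2, epsilon, rng)  # next action from the same policy
            # Target = reward + discounted value of the next state-action pair.
            target = r if done else r + gamma * q.get((s2, a2), 0.0)
            old = q.get((s, a), 0.0)
            q[(s, a)] = old + alpha * (target - old)
            s, a = s2, a2
            if done:
                break
    return q


def greedy_path(q, max_steps=50):
    """Follow the greedy policy from START, stopping at GOAL or after max_steps."""
    path, s = [START], START
    for _ in range(max_steps):
        if s == GOAL:
            break
        values = [q.get((s, a), 0.0) for a in range(len(ACTIONS))]
        s, _, _ = step(s, ACTIONS[values.index(max(values))])
        path.append(s)
    return path
```

Calling `greedy_path(sarsa())` returns the list of cells the learned policy visits from the start. With different rewards, grid layouts, or fewer episodes, the learned path may differ or fail to reach the goal.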