Didi at KDD 2018: Order Dispatching with Reinforcement Learning

Date: 2019-12-04

Tags: Didi, kdd2018, kdd, reinforcement learning

Plain-language walkthrough

Offline learning

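In essence, every moment and every location is discretized into a spatiotemporal grid cell. From the dispatch records (including drivers who took part in dispatching but received no order), the expected revenue from each cell until the end of the day is computed.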

Key question: how is the expected revenue computed?

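Dynamic programming approach: suppose the time steps span [0, T). First compute the expected revenue of every grid cell at time T-1 (future revenue is 0 at that point, so only the immediate revenue counts), which amounts to averaging the immediate revenue. Then compute the expected revenue of every grid cell at time T-2, and so on back to time 0.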

This yields, for every spatiotemporal grid cell, the expected revenue from that cell until the end of the day.
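The backward pass described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name, the transition tuple layout `(t, grid, reward, t_next, grid_next)`, and the toy data are all assumptions made here. Idle drivers appear as transitions with reward 0, the trip reward is credited in full at the start of the trip (a simplification), and states that were never observed are treated as worth 0.

```python
from collections import defaultdict


def evaluate_value_function(transitions, horizon, gamma=1.0):
    """Estimate V[(t, grid)] by a backward pass over t = horizon-1, ..., 0.

    transitions: iterable of (t, grid, reward, t_next, grid_next) with t < t_next.
    Idle drivers are recorded with reward 0 (hypothetical data layout).
    """
    by_time = defaultdict(list)
    for t, grid, reward, t_next, grid_next in transitions:
        by_time[t].append((grid, reward, t_next, grid_next))

    value = {}
    for t in range(horizon - 1, -1, -1):
        totals = defaultdict(float)
        counts = defaultdict(int)
        for grid, reward, t_next, grid_next in by_time[t]:
            # States at or past the horizon, or never observed, count as 0.
            future = value.get((t_next, grid_next), 0.0)
            totals[grid] += reward + gamma ** (t_next - t) * future
            counts[grid] += 1
        for grid in totals:
            # The value of a cell is the mean return over its transitions.
            value[(t, grid)] = totals[grid] / counts[grid]
    return value


# Toy records for a "day" of three time steps and two grid cells.
records = [
    (2, "A", 10.0, 3, "B"),  # order served at the last step
    (2, "A", 0.0, 3, "A"),   # idle driver
    (1, "A", 6.0, 2, "A"),
    (1, "B", 0.0, 2, "A"),
    (0, "B", 4.0, 1, "A"),
    (0, "B", 0.0, 1, "B"),
]
values = evaluate_value_function(records, horizon=3)
print(values[(0, "B")])
```

With `gamma=1.0` the value of a cell is simply the average revenue earned from that cell until the end of the day, which matches the interpretation given in the text.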

Key point: why is the value function obtained this way reasonable?

The resultant value function captures spatiotemporal patterns of both the demand side and the supply side. To make it clearer, as a special case, when using no discount and an episode-length of a day, the state-value function in fact corresponds to the expected revenue that this driver will earn on average from the current time until the end of the day.

Online planning

The match score between an order and a driver is described by a formula with the following properties:

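The higher the price, the higher the match score.
The higher the value of the current location, the lower the match score.
The higher the value of the future location, the higher the match score.
Pickup distance enters implicitly: the longer it is, the later the estimated arrival time, the smaller the discount factor, and the lower the match score.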
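One form of the score that is consistent with these four properties is an advantage-style expression. The symbols below are assumptions introduced for this sketch: $V$ is the learned state-value function, $s_i$ is driver $i$'s current state, $s'_{ij}$ is the state the driver reaches after serving order $j$, $\Delta t_j$ is the number of time steps until the order is completed (pickup plus trip), $\gamma \in (0, 1]$ is the discount factor, and $R_\gamma(j)$ is the (discounted) price of order $j$.

```latex
A(i, j) = \gamma^{\Delta t_j} \, V(s'_{ij}) - V(s_i) + R_\gamma(j)
```

Reading it term by term: a larger $R_\gamma(j)$ raises the score, a larger $V(s_i)$ lowers it, a larger $V(s'_{ij})$ raises it, and a longer pickup increases $\Delta t_j$, which shrinks $\gamma^{\Delta t_j}$ and therefore lowers the score when $\gamma < 1$.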

The matching is solved with the KM (Kuhn-Munkres) algorithm.
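To make the matching step concrete, here is a minimal pure-Python sketch of the Kuhn-Munkres (Hungarian) algorithm in its potentials-based form, applied to a small driver-by-order score matrix. The function name and the toy scores are hypothetical. The sketch also assumes there are no more drivers than orders and that every driver is assigned an order; a production dispatcher would need to handle unmatched drivers and the opposite case.

```python
def max_weight_matching(score):
    """Return (total, pairs) maximizing the summed match score.

    score[i][j] is the score of assigning driver i to order j.
    Assumes len(score) <= len(score[0]) (no more drivers than orders).
    """
    n, m = len(score), len(score[0])
    inf = float("inf")
    # Work on negated scores (minimization). Index 0 is a sentinel.
    u = [0.0] * (n + 1)    # row (driver) potentials
    v = [0.0] * (m + 1)    # column (order) potentials
    match = [0] * (m + 1)  # match[j] = driver assigned to order j, 0 = free
    way = [0] * (m + 1)    # previous column on the augmenting path
    for i in range(1, n + 1):
        match[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = match[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = -score[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift potentials so at least one new edge becomes tight.
            for j in range(m + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:
                break
        # Flip the assignments along the augmenting path.
        while j0:
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
    pairs = sorted((match[j] - 1, j - 1) for j in range(1, m + 1) if match[j])
    total = sum(score[i][j] for i, j in pairs)
    return total, pairs


# Toy example: 3 drivers (rows) and 3 orders (columns).
scores = [
    [4.0, 1.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 2.0, 2.0],
]
print(max_weight_matching(scores))  # (11.0, [(0, 0), (1, 2), (2, 1)])
```

Note that the greedy choice of giving each driver its best order would conflict here (drivers 0 and 2 both prefer order 0), which is why a global assignment method is used.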

Evaluation

A/B test design

we adopted a customized A/B testing design that splits traffic according to large time slices (three or six hours). For example, a three-hour split sets the first three hours in Day 1 to run variant A and the next three hours for variant B. The order is then reversed for Day 2. Such experiments will last for two weeks to eliminate the daily difference. We select large time slices to observe long-term impacts generated by order dispatch approaches.
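The time-slice split can be illustrated with a small helper. This is only a sketch: the function name is hypothetical, days are numbered from 1, and it assumes the variants keep alternating slice by slice through the whole day, which the quoted passage implies but does not state outright.

```python
def variant_for(day, hour, slice_hours=3):
    """Return "A" or "B" for a 1-based day and an hour in 0..23.

    Odd days start with A and even days start with B, so the order of
    the slices is reversed from one day to the next.
    """
    slice_index = hour // slice_hours
    starts_with_a = day % 2 == 1
    is_even_slice = slice_index % 2 == 0
    return "A" if is_even_slice == starts_with_a else "B"


# Day 1 with three-hour slices: hours 0-2 run A, hours 3-5 run B, and so on.
print([variant_for(1, h) for h in range(0, 12, 3)])  # ['A', 'B', 'A', 'B']
# Day 2 reverses the order.
print([variant_for(2, h) for h in range(0, 12, 3)])  # ['B', 'A', 'B', 'A']
```

Over a two-week run every hour of the day is served by each variant on an equal number of days, which is what lets the comparison average out daily differences.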

Observed gains

the performance improvement brought by the MDP method is consistent in all cities, with gains in global GMV and completion rate ranging from 0.5% to 5%. Consistent to the previous discoveries, the MDP method achieved its best performance gain in cities with high order-driver ratios. Meanwhile, the averaged dispatch time was nearly identical to the baseline method, indicating little sacrifice in user experience.

Value function visualization

Framing the problem as reinforcement learning

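The spatiotemporal grid cell is defined as the state, dispatching an order or not dispatching one is defined as the action, and the expected revenue of a state is defined as the state-value function.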

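The goal of reinforcement learning is to find the optimal policy, which is equivalent to finding the optimal value function. What is distinctive about the dispatch setting is that each driver is modeled as an agent, yet the decisions are made by the platform. The drivers therefore have no policy of their own, or, put differently, the dispatch mechanism unifies every driver's policy into maximizing the platform's expected revenue. Within the reinforcement learning framework, offline learning and online planning can thus be viewed as the two steps of policy iteration: learning updates the value function, and planning performs the policy update. On closer inspection, though, this framing still feels somewhat forced.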