Recall@k, Precision@K, MAP, MRR

Recall@k

Let {(ci, ri), 1 <= i <= n} be the list of n context-response pairs from the test set. For each context ci, we create a set of m alternative responses: one is the actual response ri, and the m-1 others are sampled at random from the same corpus. The m alternative responses are then ranked by the conversational model's output score, and Recall@k measures how often the correct response appears in the top k results of this ranked list. Recall@k is often used to evaluate retrieval models because several responses may be equally "correct" for a given context.
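The per-context computation can be sketched as follows, assuming the model outputs one score per candidate response (the scores below are illustrative, not from any real model):

```python
def recall_at_k(scores, true_index, k):
    """Return 1 if the true response is ranked in the top k, else 0.

    scores: the model's score for each of the m candidate responses
    true_index: position of the actual response ri among the candidates
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return int(true_index in ranked[:k])

# m = 10 candidates; the actual response is candidate 0 and scores highest
scores = [0.9, 0.1, 0.4, 0.8, 0.2, 0.3, 0.5, 0.6, 0.7, 0.05]
print(recall_at_k(scores, 0, 1))   # 1: the true response is ranked first
print(recall_at_k(scores, 4, 2))   # 0: candidate 4 is not in the top 2
```

Averaging this 0/1 value over all n test contexts gives the reported Recall@k.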

Precision@K

- Set a rank threshold K
- Compute the fraction of relevant documents in the top K
- Documents ranked lower than K are ignored

Ex: [figure: a ranked list of results marked relevant / non-relevant]
P@3 of 2/3
P@4 of 2/4
P@5 of 3/5
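A minimal Precision@K sketch; the relevance pattern below (relevant at ranks 1, 3 and 5) is an assumption chosen to reproduce the numbers in the example:

```python
def precision_at_k(relevance, k):
    """Fraction of relevant results among the top k; lower ranks are ignored.

    relevance: binary relevance judgments of the ranked results, best first.
    """
    return sum(relevance[:k]) / k

# Assumed relevance pattern: relevant at ranks 1, 3 and 5
rel = [1, 0, 1, 0, 1]
print(precision_at_k(rel, 3))  # P@3 = 2/3
print(precision_at_k(rel, 4))  # P@4 = 2/4
print(precision_at_k(rel, 5))  # P@5 = 3/5
```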

Mean Average Precision

For a single query, average precision (AP) averages the precision values taken at each rank k where a relevant document appears:

AP = (1/R) * sum over relevant ranks k of P@k

where R is the number of relevant documents for the query. MAP is then the mean of AP over all queries Q:

MAP = (1/|Q|) * sum over queries q of AP(q)
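A sketch of MAP in code, with assumed toy data. Here AP normalizes by the number of relevant documents that appear in the ranking; some definitions divide by the total number of relevant documents for the query instead.

```python
def average_precision(relevance):
    """AP: mean of P@k over the ranks k where a relevant document appears."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)  # P@k at this relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """MAP: mean of the per-query average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Two toy queries (assumed data):
# query 1: AP = (1/1 + 2/3 + 3/5) / 3;  query 2: AP = (1/2 + 2/4) / 2
print(mean_average_precision([[1, 0, 1, 0, 1], [0, 1, 0, 1]]))
```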

Mean Reciprocal Rank (MRR)

The reciprocal rank of a query is 1/rank of the first relevant result (0 if no result is relevant), and MRR is the mean of this value over all queries:

MRR = (1/|Q|) * sum over queries i of 1/rank_i
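A sketch of MRR under the same binary-relevance representation used above (the three toy queries are assumed data):

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0 if none is relevant."""
    for k, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / k
    return 0.0

def mean_reciprocal_rank(rankings):
    """MRR: mean reciprocal rank over all queries."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)

# First relevant result at ranks 1, 3 and 2, so MRR = (1 + 1/3 + 1/2) / 3
rankings = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
print(mean_reciprocal_rank(rankings))
```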