【手撕 - 天然语言处理】手撕 TextRank（02）大佬是怎么实现 C++ 版的

时间 2019-12-07

标签手撕 - 天然语言处理 textrank 大佬怎么实现 c++ 栏目 C&C++ 繁體版

原文原文链接

做者：LogMnode

本文原载于 https://segmentfault.com/u/logm/articles ，不容许转载~git

1. 源码来源

comoody 大佬的源码：https://github.com/comoody/TextRank.gitgithub

本文对应的源码版本：Commits on Oct 23, 2018, 9736be10593b99adc1ea614c5d83f1bfeca17b94web

lostfish 大佬的源码：https://github.com/lostfish/textrank.git算法

本文对应的源码版本：Commits on Sep 29, 2016, e89084374ae0e08490c9cc0fa79f8ae4bb10ad5bsegmentfault

TextRank 论文地址：https://www.aclweb.org/anthology/W04-3252dom

2. 概述

C++ 版本的 TextRank 尚未发现点赞超级多的代码，这里我找了两个不一样的实现来分析。ide

在上一篇博客：TextRank Python 版本，咱们知道，看 TextRank 的源码有两个重点须要看，重点1：句子与句子的类似度是如何计算的；重点2：PageRank的实现。函数

这里，考虑到篇幅，我直接给出对应的函数所在的位置。优化

3. comoody 大佬的源码

先看看大佬是怎么计算句子与句子之间的类似度的。配合我写的那几行中文注释，应该很容易看懂。大体和论文里的公式是一致的，可是分母和论文的公式不同。具体为何用这个分母，我就不得而知了。

// 文件：src/TextRanker.cpp
// 行数：153
float TextRanker::getSimilarity(std::string a, std::string b) const
{
    // no two equivalent sentences should ever be compared, but this logic is included just in case
    if(a == b)
        return 0.f;

    // 大小写转换 -> 分词 -> 把词放到 set 里
    std::transform(a.begin(), a.end(), a.begin(), ::tolower);
    std::vector<std::string> aWords = stringSplit(a, ' ');  
    std::set<std::string> aWordSet;
    for(auto word : aWords)
        aWordSet.insert(word);

    std::transform(b.begin(), b.end(), b.begin(), ::tolower);
    std::vector<std::string> bWords = stringSplit(b, ' ');
    std::set<std::string> bWordSet;
    for(auto word : bWords)
        bWordSet.insert(word);

    // 求两个 set 的交集，至关于两个句子共有的词语
    std::vector<std::string> commonWords;
    std::set_intersection
    (
        aWordSet.begin(),
        aWordSet.end(),
        bWordSet.begin(),
        bWordSet.end(),
        std::back_inserter(commonWords)
    );

    // 这个分母和论文的公式不同，论文里是带 log 的
    float avgWords = (aWords.size() + bWords.size()) / 2;
    return commonWords.size() / avgWords;
}

而后，咱们来看看 PageRank 的实现。哦！看来 comoody 大佬没有使用 PageRank 的公式，而是在图中随机游走，某个点游走通过的次数越多，这个点得分越高。

讲道理，这种随机游走的效果和用 PageRank 的公式求得的效果是差很少的。了解 PageRank 的同窗应该清楚，PageRank 这个算法的本质，就是模拟用户在网页间的随机游走。

// 文件：src/TextRanker.cpp
// 行数：77

// does a single walk that visits sentences around the graph according to probabilities defined in the similarity matrix
// after each iteration inside the walk, there is a 1 - kNewWalkThreshold probability that the walk will end and this
// method will return
// during the walk, the visits map is updated accoridingly
void TextRanker::doSentenceGraphWalk
(
    const FloatMatrix& similarityMatrix,
    const std::vector<std::string>& sentences,
    std::map<std::string, int>& visits
) const
{
    const int kDim = sentences.size();
    bool continueWalk = true;
    // start walk at a random node in the sentence graph
    int curSentenceIndex = rand() % kDim;
    while (continueWalk)
    {
        // visit the curSentence
        std::string curSentence = sentences[curSentenceIndex];
        visits[curSentence]++;

        // the row of the similarity matrix corresponding to the current sentence represents the probabilites
        // of transferring to all ofther sentences from the current sentence
        std::vector<float> probabilites;
        std::copy_if
        (
            similarityMatrix[curSentenceIndex].begin(),
            similarityMatrix[curSentenceIndex].end(),
            std::back_inserter(probabilites),
            [](float f) { return f != 0.f; }
        );

        if (probabilites.size() == 0)
            break; // no possible neighbor to visit

        // normalize probabilites
        float sum = std::accumulate(probabilites.begin(), probabilites.end(), 0.f);
        std::transform(probabilites.begin(), probabilites.end(), probabilites.begin(), [sum](float probability)
        {
            return probability / sum;
        });

        // stack probabilites
        std::vector<float> probabilityDistribution;
        for(std::vector<float>::iterator j = probabilites.begin(); j < probabilites.end(); j++)
            probabilityDistribution.push_back(std::accumulate(probabilites.begin(), j+1, 0.f));

        // get a random number betweeon 0 and 1
        float selector = (rand() % 1000) / 1000.f;

        int selectedIndex = 0;
        while(probabilityDistribution[selectedIndex] <= selector)
            selectedIndex++;

        // the selected index maps to a probability from a distribution with all 0 entries removed
        // iterate through the original probabilites to map back the selected index to its true index from the list
        int trueIndex = 0;
        int nonZeroCount = 0;
        do
        {
            if(similarityMatrix[curSentenceIndex][trueIndex] != 0)
                nonZeroCount++;
        }
        while((nonZeroCount < selectedIndex + 1) && ++trueIndex);

        // update the curSentence index so that it can be visited in the next iter of the current walk
        curSentenceIndex = trueIndex;

        // randomly test for the end of a walk, if the random number is above kNkNewWalkThreshold, start a new walk
        float newWalkSelector = (rand() % 1000) / 1000.f;
        if(newWalkSelector > kNewWalkThreshold)
            continueWalk = false;
    }
}

4. lostfish 大佬的源码

咱们再来看看 lostfish 大佬的源码。一样的，先看看是如何计算句子与句子之间类似度的。

emmm. 也是和论文公式有些不同：统计两个句子共同出现的词语时，没有对每一个句子先作一次去重。

// 文件：src/sentence_rank.cpp
// 行数：47
double SentenceRank::CalcDist(const vector<string> &token_vec1, const vector<string> &token_vec2) 
{
    size_t both_num = 0;
    size_t n1 = token_vec1.size();
    size_t n2 = token_vec2.size();
    if (n1 < 2 || n2 < 2)
        return 0;

    // 统计两句话共同出现的词语数
    for (size_t i = 0; i < n1; ++i)
    {
        const string &token = token_vec1[i];
        for(size_t j = 0; j < n2; ++j)
        {
            if (token == token_vec2[j])
            {
                both_num++;
                break;
            }
        }
    }

    double dist = both_num / ( log(n1) + log(n2) );
    return dist;
}

再来看看 PageRank 的实现吧。和论文里给出的公式同样。我写了一些注释，应该仍是容易看懂的。

void SentenceRank::CalcSentenceScore(map<size_t, double> &score_map)
{
    score_map.clear();

    // initialize
    size_t n = m_sentence_vec.size();
    for (size_t id = 0; id < n; ++id)
        score_map.insert(make_pair(id, 1.0));

    // iterate
    for (size_t i = 0; i < m_max_iter_num; ++i)
    {
        double max_delta = 0;
        map<size_t, double> new_score_map; // 这里记录了每一个句子的得分

        for (size_t id1 = 0; id1 < n; ++id1)
        {
            double new_score = 1 - m_d; // 当前迭代轮次中，每一个句子的得分，这里先把论文公式的左边部分加上

            // 计算论文公式的右边部分
            double sum_weight = 0.0;
            for (size_t id2 = 0; id2 < n; ++id2)
            {
                if (id1 == id2 || m_out_sum_map[id2] < 1e-6)
                    continue;
                double weight = GetWeight(id2, id1);    // 节点2 -> 节点1 的权重
                sum_weight += weight/m_out_sum_map.at(id2)*score_map[id2];
            }
            new_score += m_d * sum_weight;  // 把论文公式的右边部分加上
            new_score_map.insert(make_pair(id1, new_score));

            // 监测每两轮迭代之间score的差值，差值足够小就不用继续迭代
            double delta = fabs(new_score - score_map[id1]);
            max_delta = max(max_delta, delta);
        }
        score_map = new_score_map;
        if (max_delta < m_least_delta)
        {
            break;
        }
    }
}

5. 总结

两位大佬实现的 TextRank C++ 版本，仍是有必定的改进空间：

comoody 大佬使用的节点随机游走的方式可行，用在我的项目上是能够的；可是工程上来讲，rand() 函数致使每次结果都是随机的，没法复现，是比较致命的。

lostfish 大佬的 PageRank 部分，和论文是一致的，可是运算过程还有一些优化空间。

另外，两位大佬计算句子与句子之间类似度的函数与论文有点不一样，固然这个计算类似度的方法不是定死的，是能够本身创做的，好比用word2vec等等。

我打算本身也实现一个 C++ 的版本。