Information Extraction from MIT

Date: 2019-11-24

MITIE

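MITIE is an information extraction library and toolset released by MIT's NLP team. It is free and state-of-the-art, and currently includes named entity extraction and binary relation detection, along with tools for training custom extractors and relation detectors.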

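MITIE's core is written in C++ and built on top of dlib, a high-performance machine learning library. The MIT team provides several pre-trained models covering English, Spanish, and German, all trained on large corpora. There is no Chinese model among them, so we have to train that one ourselves.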

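Although MITIE is written in C++, it also provides APIs for other languages. My own projects often mix Java and Python, so I simply compile MITIE into a shared library and call it from each language, which is very convenient.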

Why MITIE Exists

Here is what the people at the MIT lab have to say:

I work at a lab and there are a lot of cool things about my job. In fact, I could go on all day about it, but in this post I want to talk about one thing in particular, which is that we recently got funded by the program to make an open source natural language processing library focused on information extraction.

Why make such a thing when there are already open source libraries out there for this (e.g. OpenNLP, NLTK, Stanford IE, etc.)? Well, if you look around you quickly find out that everything which exists is either expensive, not state-of-the-art, or GPL licensed. If you wanted to use this kind of NLP tool in a non-GPL project then you are either out of luck, have to pay a lot of money, or settle for something of low quality. Well, not anymore! We just released the first version of our MIT Information Extraction library which is built using state-of-the-art statistical machine learning tools.

How to Use It

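Take entity extraction as an example. For convenience you can use the models that MITIE provides; otherwise you need to train your own. Download them from github.com/mit-nlp/MIT…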

Then create a test.txt file containing the text to test:

I met with john becker at HBU.
The other day at work I saw Brian Smith from CMU.

Finally, write the following code:

#include <mitie/named_entity_extractor.h>
#include <mitie/conll_tokenizer.h>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

using namespace std;
using namespace mitie;

// Reads a text file and splits it into tokens with MITIE's CoNLL-style tokenizer.
std::vector<string> tokenize_file (
    const string& filename
)
{
    ifstream fin(filename.c_str());
    if (!fin)
    {
        cout << "Unable to load input text file" << endl;
        exit(EXIT_FAILURE);
    }
    conll_tokenizer tok(fin);
    std::vector<string> tokens;
    string token;
    while(tok(token))
        tokens.push_back(token);

    return tokens;
}


int main(int argc, char** argv)
{
    try
    {
        if (argc != 3)
        {
            printf("You must give a MITIE ner model file as the first command line argument\n");
            printf("followed by a text file to process.\n");
            return EXIT_FAILURE;
        }
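        // Each MITIE model file starts with a string naming the serialized
        // class, followed by the extractor itself.
        string classname;
        named_entity_extractor ner;
        dlib::deserialize(argv[1]) >> classname >> ner;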

        const std::vector<string> tagstr = ner.get_tag_name_strings();
        cout << "The tagger supports "<< tagstr.size() <<" tags:" << endl;
        for (unsigned int i = 0; i < tagstr.size(); ++i)
            cout << " " << tagstr[i] << endl;

        std::vector<string> tokens = tokenize_file(argv[2]);

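        // predict() fills three parallel vectors: chunks holds half-open token
        // ranges [first, second), chunk_tags holds indices into tagstr, and
        // chunk_scores holds a confidence score (larger means more confident).
        std::vector<pair<unsigned long, unsigned long> > chunks;
        std::vector<unsigned long> chunk_tags;
        std::vector<double> chunk_scores;

        ner.predict(tokens, chunks, chunk_tags, chunk_scores);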

        cout << "\nNumber of named entities detected: " << chunks.size() << endl;
        for (unsigned int i = 0; i < chunks.size(); ++i)
        {
            cout << " Tag " << chunk_tags[i] << ": ";
            cout << "Score: " << fixed << setprecision(3) << chunk_scores[i] << ": ";
            cout << tagstr[chunk_tags[i]] << ": ";
            for (unsigned long j = chunks[i].first; j < chunks[i].second; ++j)
                cout << tokens[j] << " ";
            cout << endl;
        }

        return EXIT_SUCCESS;
    }
    catch (std::exception& e)
    {
        cout << e.what() << endl;
        return EXIT_FAILURE;
    }
}

Running it produces:

The tagger supports 4 tags:
   PERSON
   LOCATION
   ORGANIZATION
   MISC

Number of named entities detected: 4
   Tag 0: Score: 1.532: PERSON: john becker
   Tag 2: Score: 0.340: ORGANIZATION: HBU
   Tag 0: Score: 1.652: PERSON: Brian Smith
   Tag 2: Score: 0.471: ORGANIZATION: CMU
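
The printing loop in the program treats each chunk as a half-open range of token indices, which is easy to get wrong by one. As a minimal sketch that needs no MITIE at all, the helper below joins the tokens of one chunk into a single string. The name chunk_text is hypothetical and not part of MITIE; it only mirrors what the loop over chunks[i].first and chunks[i].second does.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not a MITIE function): joins the tokens covered by
// one chunk into a single space-separated string.
// The chunk is a half-open range [first, second) of token indices, the same
// convention the printing loop in the example program relies on.
std::string chunk_text(
    const std::vector<std::string>& tokens,
    const std::pair<unsigned long, unsigned long>& chunk
)
{
    // Precondition: the range must lie inside the token vector.
    assert(chunk.first <= chunk.second && chunk.second <= tokens.size());

    std::string text;
    for (unsigned long j = chunk.first; j < chunk.second; ++j)
    {
        if (j > chunk.first)
            text += ' ';   // separator between tokens, none before the first
        text += tokens[j];
    }
    return text;
}
```

For instance, with a hand-typed token list {"I", "met", "with", "john", "becker", "at", "HBU", "."} (an assumption about how the first test sentence is tokenized) and the chunk (3, 5), chunk_text returns "john becker". The end index 5 is excluded, so "at" is not part of the entity.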

Training a Chinese Model

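The main task is to train the word vector features, because the named entity model and the relation model are both built on top of them. MITIE provides a tool for this. You can use cmake to generate a Visual Studio project, but there is usually no need to change the code, so it is enough to build the tool with cmake and use it as is. The main steps are: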

D:\MITIE\tools\wordrep>mkdir build
D:\MITIE\tools\wordrep>cd build
D:\MITIE\tools\wordrep\build>cmake ..
D:\MITIE\tools\wordrep\build>cmake --build . --config Release

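You also need to collect a large amount of text, for example from Wikipedia and Baidu Baike. For how to process such data, see the earlier article 《如何使用中文维基百科语料》 (How to Use the Chinese Wikipedia Corpus).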

Then training can begin. The e option generates all the models we need, and data is the directory containing the corpus.

wordrep -e data

if (parser.option("e"))
{
    count_words(parser);
    word_vects(parser);
    basic_morph(parser);
    cca_morph(parser);
    return 0;
}

Calling from Java and Python

For both languages the key step is building the shared library, which cmake again makes easy. Go to the mitielib directory:

D:\MITIE\mitielib>mkdir build
D:\MITIE\mitielib>cd build
D:\MITIE\mitielib\build>cmake ..
D:\MITIE\mitielib\build>cmake --build . --config Release --target install

This generates the required libraries:

-- Install configuration: "Release"
-- Installing: D:/MITIE/mitielib/msvcp140.dll
-- Installing: D:/MITIE/mitielib/vcruntime140.dll
-- Installing: D:/MITIE/mitielib/concrt140.dll
-- Installing: D:/MITIE/mitielib/mitie.lib
-- Installing: D:/MITIE/mitielib/mitie.dll

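After that, Python can call MITIE easily. Java needs a similar step, but its build also requires SWIG. The build produces the following library and jar package, after which Java can call MITIE just as easily.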

-- Install configuration: "Release"
-- Installing: D:/MITIE/mitielib/java/../javamitie.dll
-- Installing: D:/MITIE/mitielib/java/../javamitie.jar
-- Up-to-date: D:/MITIE/mitielib/java/../msvcp140.dll
-- Up-to-date: D:/MITIE/mitielib/java/../vcruntime140.dll
-- Up-to-date: D:/MITIE/mitielib/java/../concrt140.dll

GitHub

A text analysis project that uses MITIE: github.com/sea-boat/Te…
