Chinese Short-Text Classification with BERT

Date: 2019-11-07


1. Environment setup

This experiment was run on Ubuntu 18.04.3 LTS (kernel 4.15.0-29-generic, GNU/Linux).

1.1 Check the CUDA version

cat /usr/local/cuda/version.txt

Output:

CUDA Version 10.0.130

1.2 Check the cuDNN version

cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

Output:

#define CUDNN_MINOR 6

#define CUDNN_PATCHLEVEL 3

#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
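The CUDNN_VERSION macro packs the three numbers into a single integer. The sketch below mirrors that arithmetic in Python and also unpacks it again. The CUDNN_MAJOR line is not part of the output shown above, so the value 7 used here is only an assumed value for illustration.

```python
def cudnn_version(major, minor, patchlevel):
    """Mirror the CUDNN_VERSION macro from cudnn.h."""
    return major * 1000 + minor * 100 + patchlevel


def split_cudnn_version(version):
    """Recover (major, minor, patchlevel) from the packed integer."""
    major, rest = divmod(version, 1000)
    minor, patchlevel = divmod(rest, 100)
    return major, minor, patchlevel


# CUDNN_MAJOR = 7 is an assumption; MINOR = 6 and PATCHLEVEL = 3 come from the output above.
print(cudnn_version(7, 6, 3))     # 7603
print(split_cudnn_version(7603))  # (7, 6, 3)
```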

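If CUDA and cuDNN are not installed, download the versions that match your GPU model from the official website and install them.

1.3 Install tensorflow-gpu

Install tensorflow-gpu in a virtual environment created with Anaconda (the Anaconda installation steps are not covered here).

Create the virtual environment

The virtual environment is named tensorflow:

conda create -n tensorflow python=3.7.1

Activate the virtual environment

The same command activates the environment in later sessions:

source activate tensorflow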

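Install tensorflow-gpu

Running pip install tensorflow-gpu directly is not recommended because the download is slow. Downloading from the Douban mirror is much faster: https://pypi.doubanio.com/simple/tensorflow-gpu/

Find the wheel that matches your setup (cp37 means Python 3.7), then install it with pip install:

pip install https://pypi.doubanio.com/packages/15/21/17f941058556b67ce6d1e3f0e0932c9c2deaf457e3d45eecd93f2c20827d/tensorflow_gpu-1.14.0rc1-cp37-cp37m-manylinux1_x86_64.whl

I chose the Linux build of tensorflow-gpu 1.14.0 for Python 3.7 (note that the wheel in the command above is the 1.14.0rc1 release candidate). BERT requires tensorflow-gpu 1.11.0 or later. Version 2.0 is not recommended: it seems to have changed some APIs, so the BERT code would have to be modified by hand.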

Environment test

In the tensorflow virtual environment, run python to start the interpreter, then enter import tensorflow and check that it imports successfully.
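The import test can be combined with the version constraints mentioned above (at least 1.11.0, below 2.0). This is only a minimal sketch: the helper names parse_version, is_supported and check_tensorflow are made up for illustration, and the only TensorFlow attribute it relies on is tf.__version__.

```python
import importlib
import re


def parse_version(version):
    """Turn '1.14.0rc1' into (1, 14, 0), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_supported(version):
    """True for the range recommended in this article: >= 1.11.0 and < 2.0."""
    return (1, 11, 0) <= parse_version(version) < (2, 0, 0)


def check_tensorflow():
    """Try to import tensorflow and report what was found."""
    try:
        tf = importlib.import_module("tensorflow")
    except Exception as exc:  # ImportError, or a broken CUDA/cuDNN setup
        return {"installed": False, "version": None,
                "supported": False, "error": str(exc)}
    version = tf.__version__
    return {"installed": True, "version": version,
            "supported": is_supported(version), "error": None}


if __name__ == "__main__":
    print(check_tensorflow())
```

If "installed" is False, the "error" field usually points to the cause, for example a missing CUDA library.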

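2. Preparation

2.1 Download a pretrained model

Bert-base Chinese

BERT-wwm: released by the Joint Laboratory of HIT and iFLYTEK Research. It performs somewhat better than Bert-base Chinese (the download is hosted on iFLYTEK Cloud, password: mva8; unfortunately, by the time I finished training with wwm, the submission channel had already closed).

bert_model.ckpt: the checkpoint from which the model variables are loaded

vocab.txt: the vocabulary used for Chinese text during training

bert_config.json: the parameters used when training BERT, some of which can optionally be adjusted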

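2.2 Data preparation

1) Convert your dataset to the following format: the first column is the label, the second column is the text, and the two are separated by a tab (if the test set has no labels, keep only the text column). Name the training, validation and test files train.tsv, val.tsv and test.tsv respectively. The files must be encoded as UTF-8 (without BOM).

2) Create a data folder and put these three files in it.

3) Unpack the pretrained model into a new folder named chinese.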
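The file format from step 1 can be produced with a few lines of Python. This is a sketch under assumptions: write_tsv and clean_field are hypothetical helpers, the sample rows are made-up placeholders, and the demo writes into a temporary directory rather than the real data folder.

```python
import os
import tempfile


def clean_field(value):
    """Collapse tabs and line breaks so each sample stays on one line."""
    return " ".join(str(value).split())


def write_tsv(path, rows):
    """Write rows (tuples of fields) as tab-separated UTF-8 without a BOM."""
    # encoding="utf-8" writes no BOM ("utf-8-sig" would add one).
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for row in rows:
            f.write("\t".join(clean_field(field) for field in row) + "\n")


# Made-up placeholder samples; the labels are taken from the label list in this article.
train_rows = [("Age", "年龄18至65岁"), ("Consent", "自愿签署知情同意书")]
test_rows = [("无法配合完成随访者",)]  # unlabeled test set: text column only

demo_dir = tempfile.mkdtemp()
write_tsv(os.path.join(demo_dir, "train.tsv"), train_rows)
write_tsv(os.path.join(demo_dir, "test.tsv"), test_rows)
print(sorted(os.listdir(demo_dir)))  # ['test.tsv', 'train.tsv']
```

To create the real files, point the paths at the data folder and pass your own rows.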

2.3 Code changes

Two changes to run_classifier.py in the BERT source are needed.

1) Add our task class to run_classifier.py

Add your own task class, modeled on the other Processor classes.

# Custom Processor class
class MyProcessor(DataProcessor):

  def __init__(self):
    self.labels = ['Addictive Behavior',
                   'Address',
                   'Age',
                   'Alcohol Consumer',
                   'Allergy Intolerance',
                   'Bedtime',
                   'Blood Donation',
                   'Capacity',
                   'Compliance with Protocol',
                   'Consent',
                   'Data Accessible',
                   'Device',
                   'Diagnostic',
                   'Diet',
                   'Disabilities',
                   'Disease',
                   'Education',
                   'Encounter',
                   'Enrollment in other studies',
                   'Ethical Audit',
                   'Ethnicity',
                   'Exercise',
                   'Gender',
                   'Healthy',
                   'Laboratory Examinations',
                   'Life Expectancy',
                   'Literacy',
                   'Multiple',
                   'Neoplasm Status',
                   'Non-Neoplasm Disease Stage',
                   'Nursing',
                   'Oral related',
                   'Organ or Tissue Status',
                   'Pharmaceutical Substance or Drug',
                   'Pregnancy-related Activity',
                   'Receptor Status',
                   'Researcher Decision',
                   'Risk Assessment',
                   'Sexual related',
                   'Sign',
                   'Smoking Status',
                   'Special Patient Characteristic',
                   'Symptom',
                   'Therapy or Surgery']

  def get_train_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "val.tsv")), "val")

  def get_test_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

  def get_labels(self):
    return self.labels

  def _create_examples(self, lines, set_type):
    examples = []
    for (i, line) in enumerate(lines):
      guid = "%s-%s" % (set_type, i)
      if set_type == "test":
        # My test set has no labels, so test is handled separately:
        # the label is set to an arbitrary placeholder. It must be one
        # of the existing class labels, otherwise predict raises a
        # KeyError. If your test set has labels, drop this branch and
        # handle all three sets the same way.
        text_a = tokenization.convert_to_unicode(line[0])
        label = "Address"
      else:
        text_a = tokenization.convert_to_unicode(line[1])
        label = tokenization.convert_to_unicode(line[0])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples
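Since a label outside the list leads to a KeyError, it can help to check the labelled files against the list before training. The sketch below is one possible check, not part of run_classifier.py; find_unknown_labels is a hypothetical helper, and the label list and rows are shortened, made-up examples.

```python
from collections import Counter


def find_unknown_labels(rows, labels):
    """Count labels in (label, text) rows that are missing from the label list."""
    known = set(labels)
    return Counter(row[0] for row in rows if row[0] not in known)


labels = ["Address", "Age", "Consent"]  # shortened list for illustration
rows = [("Age", "..."), ("age", "..."), ("Consent", "...")]  # made-up rows
print(find_unknown_labels(rows, labels))  # Counter({'age': 1})
```

The comparison is case-sensitive, so a label such as 'age' is reported even though 'Age' is in the list.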

2) Update the processors dictionary

def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)
  processors = {
      "cola": ColaProcessor,
      "mnli": MnliProcessor,
      "mrpc": MrpcProcessor,
      "xnli": XnliProcessor,
      "mytask": MyProcessor,  # add your own Processor to the dictionary
  }
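The dictionary key is what the --task_name flag has to match. As far as I know, run_classifier.py lower-cases the flag value before the lookup, so the key should be written in lowercase; it is worth confirming this in your copy of the script. The sketch below imitates that lookup with stand-in classes; select_processor is a made-up name.

```python
class ColaProcessor:  # stand-in for the class in run_classifier.py
    pass


class MyProcessor:  # stand-in for the custom class defined earlier
    pass


processors = {"cola": ColaProcessor, "mytask": MyProcessor}


def select_processor(task_name):
    """Look up and instantiate the processor for a task name."""
    key = task_name.lower()
    if key not in processors:
        raise ValueError("Task not found: %s" % key)
    return processors[key]()


print(type(select_processor("MyTask")).__name__)  # MyProcessor
```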

3. Getting started

3.1 Configure the training script

Create and run a file named run.sh:

python run_classifier.py \
  --data_dir=data \
  --task_name=mytask \
  --do_train=true \
  --do_eval=true \
  --vocab_file=chinese/vocab.txt \
  --bert_config_file=chinese/bert_config.json \
  --init_checkpoint=chinese/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=8 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=out

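Fine-tuning takes some time. My training set has 20,000 samples and my validation set has 8,000; on a 2080Ti GPU the run takes about 20 minutes. If GPU memory is limited, remember to reduce max_seq_length and train_batch_size accordingly.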
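To see how train_batch_size affects the length of a run, the number of optimizer steps can be estimated ahead of time. The sketch assumes the formula I believe run_classifier.py uses (steps = examples / batch size * epochs, truncated to an integer, with a warmup_proportion default of 0.1); check it against your copy of the script. training_steps is a made-up helper name.

```python
def training_steps(num_examples, batch_size, num_epochs, warmup_proportion=0.1):
    """Estimate total and warmup step counts for a fine-tuning run."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


# 20,000 training samples, as in this article
print(training_steps(20000, 8, 3.0))  # (7500, 750)
print(training_steps(20000, 4, 3.0))  # (15000, 1500)
```

Halving the batch size to save memory doubles the number of steps, so the run can be expected to take longer.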

3.2 Prediction

Create and run test.sh (note: init_checkpoint is the location of the model you produced in the previous step):

python run_classifier.py \
  --task_name=mytask \
  --do_predict=true \
  --data_dir=data \
  --vocab_file=chinese/vocab.txt \
  --bert_config_file=chinese/bert_config.json \
  --init_checkpoint=out \
  --max_seq_length=128 \
  --output_dir=out

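When prediction finishes, test_results.tsv is written to the out directory. Each row of the file corresponds to one sample in your test set, and each column holds the probability of one class, in the order of the label list defined earlier. For example, row 5, column 8 is the probability that the 5th sample belongs to the 8th class.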

3.3 Processing the prediction results

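Because the predictions are probabilities, they need post-processing: take the largest value in each row as the prediction and convert it to the corresponding label.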

import pandas as pd

data_dir = "C:\\test_results.tsv"

labels = ['Addictive Behavior',
          'Address',
          'Age',
          'Alcohol Consumer',
          'Allergy Intolerance',
          'Bedtime',
          'Blood Donation',
          'Capacity',
          'Compliance with Protocol',
          'Consent',
          'Data Accessible',
          'Device',
          'Diagnostic',
          'Diet',
          'Disabilities',
          'Disease',
          'Education',
          'Encounter',
          'Enrollment in other studies',
          'Ethical Audit',
          'Ethnicity',
          'Exercise',
          'Gender',
          'Healthy',
          'Laboratory Examinations',
          'Life Expectancy',
          'Literacy',
          'Multiple',
          'Neoplasm Status',
          'Non-Neoplasm Disease Stage',
          'Nursing',
          'Oral related',
          'Organ or Tissue Status',
          'Pharmaceutical Substance or Drug',
          'Pregnancy-related Activity',
          'Receptor Status',
          'Researcher Decision',
          'Risk Assessment',
          'Sexual related',
          'Sign',
          'Smoking Status',
          'Special Patient Characteristic',
          'Symptom',
          'Therapy or Surgery']

# Read test_results.tsv with pandas, using the labels as column names
data_df = pd.read_table(data_dir, sep="\t", names=labels, encoding="utf-8")

label_test = []
for i in range(data_df.shape[0]):
    # Append the name of the column that holds the row's maximum value
    label_test.append(data_df.loc[i, :].idxmax())
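For a quick check on a small example, the same row-wise argmax can be written with the standard library alone. This is a sketch, not the script used in this article: predict_labels is a hypothetical helper, the label list is shortened, and the probabilities are made up.

```python
import csv
import io


def predict_labels(tsv_text, labels):
    """Return the most probable label for each row of tab-separated probabilities."""
    predictions = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # skip empty lines
        probs = [float(value) for value in row]
        if len(probs) != len(labels):
            raise ValueError("expected %d columns, got %d" % (len(labels), len(probs)))
        best = max(range(len(probs)), key=probs.__getitem__)
        predictions.append(labels[best])
    return predictions


labels = ["Address", "Age", "Consent"]        # shortened list for illustration
sample = "0.1\t0.7\t0.2\n0.6\t0.3\t0.1\n"      # made-up probabilities
print(predict_labels(sample, labels))          # ['Age', 'Address']
```

The column-count check catches the case where the label list and the model output have drifted out of sync.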