Computing precision, recall, and F1 for multi-class text classification with BERT

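Pre-trained BERT models have achieved state-of-the-art results on many NLP tasks. For text classification you can use BERT directly as the classifier, or you can feed the output of its last layer as word embeddings into a custom classification model (text-CNN, for example). In short, as long as you have enough compute, most problems can be handled with BERT. On a real project at my company, a multi-class (61 classes) text classification problem, it gave very good results.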

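The task here is a multi-class text classification problem with an extremely imbalanced class distribution: some classes have only a handful or a dozen samples, others have several thousand. Without any handling of the imbalance and without BERT, the F1 score was only 50%. Still without imbalance handling but with BERT, F1 reached 65%. Getting an F1 score out of the BERT code, however, ran into a few problems.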

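TensorFlow's built-in metrics only cover the binary case for precision, recall, and F1, and run_classifier.py in the BERT source code trains and evaluates the model through the Estimator API. High-level APIs like this greatly limit how flexibly the code can be modified. Fortunately, the TensorFlow source contains a function that computes a confusion matrix and also returns an update operation. Note that this is not the same as tf.confusion_matrix(). The relevant part of run_classifier.py, with the confusion matrix added, looks like this: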

elif mode == tf.estimator.ModeKeys.EVAL:

    def metric_fn(per_example_loss, label_ids, logits, num_labels):
        predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
        accuracy = tf.metrics.accuracy(
            labels=label_ids, predictions=predictions)
        # `metrics` is a Python module of our own, introduced below
        conf_mat = metrics.get_metrics_ops(label_ids, predictions, num_labels)
        loss = tf.metrics.mean(values=per_example_loss)
        return {
            "eval_accuracy": accuracy,
            "eval_cm": conf_mat,
            "eval_loss": loss,
        }

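All evaluation metrics are computed inside this function, and every value in the returned dictionary must be a tuple. Take accuracy as an example: tf.metrics.accuracy returns an (accuracy, update_op) tuple, whereas tf.confusion_matrix, mentioned above, returns only the confusion matrix. So here we use an internal function instead, imported as follows: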

from tensorflow.python.ops.metrics_impl import _streaming_confusion_matrix
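
Estimator asks for a (value, update_op) pair because evaluation runs batch by batch: the update op folds each batch into a running total, and the value is read once at the end. The following is a minimal pure-Python sketch of that pattern, not TensorFlow's implementation. The class name StreamingConfusionMatrix and its methods are hypothetical and exist only to illustrate the idea.

```python
class StreamingConfusionMatrix:
    """Hypothetical stand-in for a (value, update_op) metric pair."""

    def __init__(self, num_labels):
        self.num_labels = num_labels
        # total[i][j] counts examples with true label i predicted as j
        self.total = [[0] * num_labels for _ in range(num_labels)]

    def update(self, labels, predictions):
        """Plays the role of update_op: fold one batch into the total."""
        for true_label, predicted in zip(labels, predictions):
            self.total[true_label][predicted] += 1

    def value(self):
        """Plays the role of the value tensor: read the running total."""
        return [row[:] for row in self.total]


cm = StreamingConfusionMatrix(num_labels=3)
# Two evaluation batches, processed one after the other
cm.update(labels=[0, 1, 2, 2], predictions=[0, 1, 2, 1])
cm.update(labels=[0, 0, 1], predictions=[0, 2, 1])
print(cm.value())
```

A plain confusion-matrix function corresponds to calling update once and reading the result, which is why it cannot summarize an evaluation set that spans several batches.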

_streaming_confusion_matrix returns a (confusion_matrix, update_op) tuple. We create a new file, metrics.py, with the following code:

import numpy as np
import tensorflow as tf
from tensorflow.python.ops.metrics_impl import _streaming_confusion_matrix


def get_metrics_ops(labels, predictions, num_labels):
    # Get the confusion matrix and its update_op. The confusion matrix
    # has to be converted to a tensor here.
    cm, op = _streaming_confusion_matrix(labels, predictions, num_labels)
    tf.logging.info(type(cm))
    tf.logging.info(type(op))
    return (tf.convert_to_tensor(cm), op)


def get_metrics(conf_mat, num_labels):
    # conf_mat is the confusion matrix as a numpy array (rows: true labels,
    # columns: predictions). Compute precision, recall and F1 from it.
    precisions = []
    recalls = []
    for i in range(num_labels):
        tp = conf_mat[i][i].sum()
        col_sum = conf_mat[:, i].sum()  # everything predicted as class i
        row_sum = conf_mat[i].sum()     # everything that truly is class i

        precision = tp / col_sum if col_sum > 0 else 0
        recall = tp / row_sum if row_sum > 0 else 0

        precisions.append(precision)
        recalls.append(recall)

    # Macro average over classes; F1 is derived from the averaged values
    pre = sum(precisions) / len(precisions)
    rec = sum(recalls) / len(recalls)
    # Guard against division by zero when no prediction is correct
    f1 = 2 * pre * rec / (pre + rec) if (pre + rec) > 0 else 0
    return pre, rec, f1
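
As a sanity check of the get_metrics logic, here is a small worked example in plain Python on a made-up 3-class confusion matrix (rows are true labels, columns are predictions). It also computes the mean of the per-class F1 scores, which is another common definition of macro F1. get_metrics instead derives F1 from the macro-averaged precision and recall, and the two generally disagree. The function name macro_scores and the matrix values are illustrative assumptions.

```python
def macro_scores(conf_mat):
    """Return (macro precision, macro recall, F1 of the macro averages,
    mean of per-class F1) for a square confusion matrix given as lists."""
    n = len(conf_mat)
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = conf_mat[i][i]
        col_sum = sum(conf_mat[r][i] for r in range(n))  # predicted as i
        row_sum = sum(conf_mat[i])                        # truly i
        p = tp / col_sum if col_sum > 0 else 0.0
        r = tp / row_sum if row_sum > 0 else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if (p + r) > 0 else 0.0)
    pre = sum(precisions) / n
    rec = sum(recalls) / n
    f1_of_averages = 2 * pre * rec / (pre + rec) if (pre + rec) > 0 else 0.0
    mean_of_f1s = sum(f1s) / n
    return pre, rec, f1_of_averages, mean_of_f1s


# Made-up counts: class 1 is often confused with class 0
example = [
    [5, 1, 0],
    [2, 3, 1],
    [0, 0, 4],
]
pre, rec, f1_of_averages, mean_of_f1s = macro_scores(example)
print("precision=%.4f recall=%.4f" % (pre, rec))
print("F1 from averaged P/R=%.4f, mean per-class F1=%.4f"
      % (f1_of_averages, mean_of_f1s))
```

When comparing against numbers from another tool, it is worth checking which of the two F1 definitions that tool reports.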

The values of the dictionary returned in the first snippet above can be retrieved in the following part of the main function in run_classifier.py:

    if FLAGS.do_eval:
        eval_examples = processor.get_dev_examples(FLAGS.data_dir)
        num_actual_eval_examples = len(eval_examples)
        if FLAGS.use_tpu:
            # TPU requires a fixed batch size for all batches, therefore the number
            # of examples must be a multiple of the batch size, or else examples
            # will get dropped. So we pad with fake examples which are ignored
            # later on. These do NOT count towards the metric (all tf.metrics
            # support a per-instance weight, and these get a weight of 0.0).
            while len(eval_examples) % FLAGS.eval_batch_size != 0:
                eval_examples.append(PaddingInputExample())

        eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record")
        file_based_convert_examples_to_features(
            eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file)

        tf.logging.info("***** Running evaluation *****")
        tf.logging.info(" Num examples = %d (%d actual, %d padding)",
                        len(eval_examples), num_actual_eval_examples,
                        len(eval_examples) - num_actual_eval_examples)
        tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size)

        # This tells the estimator to run through the entire set.
        eval_steps = None
        # However, if running eval on the TPU, you will need to specify the
        # number of steps.
        if FLAGS.use_tpu:
            assert len(eval_examples) % FLAGS.eval_batch_size == 0
            eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size)

        eval_drop_remainder = True if FLAGS.use_tpu else False
        eval_input_fn = file_based_input_fn_builder(
            input_file=eval_file,
            seq_length=FLAGS.max_seq_length,
            is_training=False,
            drop_remainder=eval_drop_remainder)

        # `result` is the dictionary returned by metric_fn
        result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)

        output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
        with tf.gfile.GFile(output_eval_file, "w") as writer:
            tf.logging.info("***** Eval results *****")
            # The confusion matrix is now available as a numpy array; call the
            # function in metrics.py to get precision, recall and F1.
            pre, rec, f1 = metrics.get_metrics(result["eval_cm"], len(label_list))
            tf.logging.info("eval_precision: {}".format(pre))
            tf.logging.info("eval_recall: {}".format(rec))
            tf.logging.info("eval_f1: {}".format(f1))
            tf.logging.info("eval_accuracy: {}".format(result["eval_accuracy"]))
            tf.logging.info("eval_loss: {}".format(result["eval_loss"]))

            np.save("conf_mat.npy", result["eval_cm"])

Once the confusion matrix has been obtained with the code above, calling get_metrics from metrics.py yields the precision, recall, and F1 values.
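
Because the classes are heavily imbalanced, the macro averages can hide which classes drag the score down. One possible follow-up is to list per-class precision, recall, and support with the weakest classes first. The sketch below assumes the saved matrix has been loaded and converted to nested lists, for example with np.load("conf_mat.npy").tolist(). The helper per_class_report, the label names, and the sample matrix are all hypothetical.

```python
def per_class_report(conf_mat, label_list):
    """Return one (label, precision, recall, support) tuple per class,
    sorted by recall so the weakest classes come first."""
    n = len(label_list)
    rows = []
    for i, label in enumerate(label_list):
        tp = conf_mat[i][i]
        predicted = sum(conf_mat[r][i] for r in range(n))
        support = sum(conf_mat[i])  # number of true examples of this class
        precision = tp / predicted if predicted > 0 else 0.0
        recall = tp / support if support > 0 else 0.0
        rows.append((label, precision, recall, support))
    return sorted(rows, key=lambda row: row[2])


# Made-up matrix: "rare" has only 4 evaluation examples
labels = ["common", "medium", "rare"]
conf_mat = [
    [90, 8, 2],
    [10, 30, 0],
    [3, 0, 1],
]
report = per_class_report(conf_mat, labels)
for label, precision, recall, support in report:
    print("%-8s precision=%.2f recall=%.2f support=%d"
          % (label, precision, recall, support))
```

Reading the support column next to recall makes it easier to tell whether a low score comes from a class with very few evaluation examples.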