Knowledge distillation aims to let a small model learn the knowledge of a large model; put simply, we want the student model's output to approximate (fit) the teacher model's output. The key word is "fit": we need to define a way to measure how close the student model's output is to the teacher model's output, which in the end is just a loss function.
Why do we need knowledge distillation? Because large models are too slow at inference to be deployed in industry, while a small model trained directly from scratch performs noticeably worse.
Below I introduce four popular distillation papers. I have hands-on experience with all four, and I hope this helps you.
Hinton introduced knowledge distillation in the paper Distilling the Knowledge in a Neural Network. There is already plenty of material about it online, so I will just give a brief summary.
The loss function: $$Loss = aL_{soft} + (1-a)L_{hard}$$
Here \(L_{soft}\) is the cross entropy between the StudentModel's output and the TeacherModel's output, and \(L_{hard}\) is the cross entropy between the StudentModel's output and the ground-truth labels.
A bit more on \(L_{soft}\). The TeacherModel's output goes through a softmax, and the exponential widens the gaps between classes, so the final output looks very much like a one-hot vector, which is not helpful for the StudentModel to learn from. We therefore want a softer output, which means modifying the softmax with a temperature \(T\):
$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
Clearly, the larger \(T\) is, the softer the output. After this change, compared with the original softmax, the gradient is effectively scaled by \(1/T^2\), so \(L_{soft}\) needs to be multiplied by \(T^2\) to stay on the same scale as \(L_{hard}\).
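To make the two terms concrete, here is a minimal PyTorch sketch of the Hinton-style loss (a sketch only: the function name and the `temperature`/`alpha` arguments are my own, and the soft term is written as a KL divergence, which matches the soft cross-entropy formulation up to a constant):

import torch
import torch.nn.functional as F

def hinton_distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.9):
    # L_soft: match the softened teacher distribution, scaled by T^2 to keep the gradient magnitude
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    l_soft = F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    # L_hard: ordinary cross entropy against the ground-truth labels
    l_hard = F.cross_entropy(student_logits, labels)
    return alpha * l_soft + (1 - alpha) * l_hard

# toy usage with random logits
loss = hinton_distillation_loss(torch.randn(8, 10), torch.randn(8, 10), torch.randint(0, 10, (8,)))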
The overall framework of the algorithm is shown below (image from https://blog.csdn.net/nature553863/article/details/80568658).
When it comes to distilling BERT, the first idea that comes to mind is to use a fine-tuned BERT as the TeacherModel to train a StudentModel, and that is exactly what TinyBert does. The next question is what model to use as the StudentModel. There have already been several attempts: some people use a BiLSTM, but most stick with BERT, just a smaller one than the original. In TinyBert, the StudentModel is a small BERT with reduced embedding size, hidden size and number of hidden layers.
How should the StudentModel be initialized? The simplest option is random initialization, but a pretrained model works better, so we need a pretrained StudentModel. TinyBert's approach is to distill a pretrained StudentModel from a pretrained BERT.
OK, that covers the essentials of TinyBert. To summarize, TinyBert proceeds in two steps: first, general distillation, which uses a pretrained BERT as the teacher to distill a pretrained small BERT; second, task-specific distillation, which uses the fine-tuned BERT as the teacher to distill the small BERT on the downstream task.
Now let's talk about TinyBert's loss function.
The formula is as follows:
$$\mathcal{L}_{model} = \sum_{m=0}^{M+1} \lambda_m \mathcal{L}_{layer}\big(f_m^{S}(x),\, f_{g(m)}^{T}(x)\big)$$
To explain this formula: \(M\) is the number of the student's Transformer layers, \(m=0\) denotes the embedding layer and \(m=M+1\) the prediction layer; \(g(m)\) is the layer-mapping function, meaning the \(m\)-th student layer learns from the \(g(m)\)-th teacher layer; \(\lambda_m\) is the weight of that layer's loss. \(\mathcal{L}_{layer}\) is an MSE loss for the embedding, hidden-state and attention outputs, and a soft cross entropy for the prediction layer.
One more note: during distillation, the hidden-layer distillation (i.e. \(m \le M\)) is performed first, and only then the distillation for \(m = M+1\).
To sum up, and to help with intuition: when TinyBERT distills, the StudentModel learns not only the TeacherModel's final-layer output but also the outputs of several intermediate layers. In other words, a given hidden layer of the StudentModel learns from the outputs of certain hidden layers of the TeacherModel. The distillation granularity is quite fine; I would call it LayerBasedDistillation.
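As a rough illustration of this layer-based idea, here is a short sketch (not the official TinyBERT code): it assumes a 3-layer student and a 12-layer teacher with the same hidden size (real TinyBERT learns an extra projection matrix when the sizes differ), and uses a uniform layer mapping such as g(m) = 4m purely as an example:

import torch
import torch.nn.functional as F

def layer_based_distillation_loss(student_out, teacher_out, temperature=1.0):
    # student_out / teacher_out: dicts with "hidden_states" (embeddings at index 0),
    # "attentions" and "logits", as produced by a BERT-like model
    M = len(student_out["attentions"])          # number of student Transformer layers, e.g. 3
    N = len(teacher_out["attentions"])          # number of teacher Transformer layers, e.g. 12
    g = lambda m: m * (N // M)                  # uniform layer mapping, e.g. g(m) = 4m

    # m = 0: embedding-layer loss
    loss = F.mse_loss(student_out["hidden_states"][0], teacher_out["hidden_states"][0])
    # m = 1..M: hidden-state and attention losses against the mapped teacher layers
    for m in range(1, M + 1):
        loss = loss + F.mse_loss(student_out["hidden_states"][m], teacher_out["hidden_states"][g(m)])
        loss = loss + F.mse_loss(student_out["attentions"][m - 1], teacher_out["attentions"][g(m) - 1])
    # m = M + 1: prediction-layer loss (soft cross entropy), run after the hidden-layer stage
    soft_targets = F.softmax(teacher_out["logits"] / temperature, dim=-1)
    log_probs = F.log_softmax(student_out["logits"] / temperature, dim=-1)
    loss = loss + (-(soft_targets * log_probs).sum(dim=-1).mean())
    return loss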
Having covered TinyBert, let's talk about DistilBert. DistilBert is much simpler than TinyBert, so I will keep it brief. DistilBert uses a pretrained BERT as the TeacherModel to train a StudentModel, where the StudentModel is simply a BERT with fewer layers. Note that the resulting DistilBERT is still a pretrained model, so for a specific downstream task you still need to fine-tune it on task data; that is plain fine-tuning, with no further distillation needed as an aid. HuggingFace already provides several distilled pretrained models that you can use as drop-in replacements for BERT.
DistilBERT's loss function: \(L_{ce} + L_{mlm} + L_{cos}\), where \(L_{ce}\) is the soft-target cross entropy against the teacher's output distribution, \(L_{mlm}\) is the usual masked-language-modeling loss, and \(L_{cos}\) is a cosine embedding loss that aligns the student's hidden states with the teacher's.
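A quick sketch of how these three terms could be computed (my own function and variable names, not the official HuggingFace training script; it assumes the student and teacher share the same hidden size):

import torch
import torch.nn.functional as F

def distilbert_style_loss(student_logits, teacher_logits, mlm_labels,
                          student_hidden, teacher_hidden, temperature=2.0):
    vocab = student_logits.size(-1)
    # L_ce: soft-target loss against the softened teacher distribution (written as KL divergence)
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    l_ce = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    # L_mlm: standard masked-language-modeling loss on the ground-truth masked tokens
    l_mlm = F.cross_entropy(student_logits.view(-1, vocab), mlm_labels.view(-1), ignore_index=-100)
    # L_cos: cosine embedding loss pulling the student's hidden states towards the teacher's
    s = student_hidden.view(-1, student_hidden.size(-1))
    t = teacher_hidden.view(-1, teacher_hidden.size(-1))
    l_cos = F.cosine_embedding_loss(s, t, target=torch.ones(s.size(0), device=s.device))
    return l_ce + l_mlm + l_cos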
Strictly speaking this one is not knowledge distillation, but it does shrink the model, and the idea is similar in spirit to TinyBERT and DistillBERT, so I will cover it here. The idea is very elegant: it trains by randomly replacing several layers of the large model with a single layer of the small model. An example: suppose the large model is input->tfc1->tfc2->tfc3->tfc4->tfc5->tfc6->output, and we define a small model input->sfc1->sfc2->sfc3->output. During training we still run the large model, except that at each step (tfc1,tfc2), (tfc3,tfc4), (tfc5,tfc6) are randomly replaced by sfc1, sfc2, sfc3, and the replacement probability grows as training progresses, so that by the end we are effectively training a small model.
Here is a figure to help you understand.
The approach is elegant, and the authors provide source code; I strongly recommend giving it a try.
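To make the replacement scheme concrete, here is a minimal sketch of the idea (this is not the authors' implementation; the class name and the linear schedule for the replacement probability are my own choices):

import random
import torch.nn as nn

class ModuleReplacingEncoder(nn.Module):
    # teacher_blocks: list of lists of large-model layers, e.g. [[tfc1, tfc2], [tfc3, tfc4], [tfc5, tfc6]]
    # student_layers: their single-layer substitutes, e.g. [sfc1, sfc2, sfc3]
    def __init__(self, teacher_blocks, student_layers):
        super().__init__()
        self.teacher_blocks = nn.ModuleList(nn.ModuleList(block) for block in teacher_blocks)
        self.student_layers = nn.ModuleList(student_layers)
        self.replace_prob = 0.0   # raised towards 1.0 as training progresses

    def forward(self, x):
        for block, student_layer in zip(self.teacher_blocks, self.student_layers):
            if self.training and random.random() < self.replace_prob:
                x = student_layer(x)        # this block is replaced by the small substitute module
            else:
                for layer in block:         # this block runs the original large-model layers
                    x = layer(x)
        return x

# during training, something like: encoder.replace_prob = min(1.0, global_step / total_steps)

Once the replacement probability reaches 1, only the small substitute modules are exercised, which is why the end result is effectively a small model.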
A newly published paper, also on BERT distillation. Let me briefly summarize its three innovations:
Here is a figure to aid understanding:
I have implemented a PyTorch-based knowledge distillation framework; feel free to give it a try. The framework abstracts multi-layer model distillation as far as possible and can implement algorithms such as TinyBERT and DistillBERT. While maintaining it I found that knowledge distillation is still not mature and new distillation algorithms keep appearing, so there is no way to design a single unified framework that covers them all. I have therefore slightly restructured the library and split it into two parts:
You are welcome to contribute example code for new knowledge distillation algorithms. Examples should be as concise and easy to follow as possible and easy to run, ideally for algorithms whose authors have not released source code. Project links:
Pypi:https://pypi.org/project/KnowledgeDistillation/
Github:https://github.com/DunZhang/KnowledgeDistillation
Here is an example of using the multi-layer-model distillation framework: distilling a 12-layer BERT into a 3-layer BERT with TinyBERT's loss functions. The code is complete and can be run directly, with no external data required:
# import packages
import torch
import logging
import numpy as np
from transformers import BertModel, BertConfig
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
from knowledge_distillation import KnowledgeDistiller, MultiLayerBasedDistillationLoss
from knowledge_distillation import MultiLayerBasedDistillationEvaluator

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

# Some global variables
train_batch_size = 40
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
learning_rate = 1e-5
num_epoch = 10

# define student and teacher model
# Teacher Model
bert_config = BertConfig(num_hidden_layers=12, hidden_size=60, intermediate_size=60,
                         output_hidden_states=True, output_attentions=True)
teacher_model = BertModel(bert_config)
# Student Model
bert_config = BertConfig(num_hidden_layers=3, hidden_size=60, intermediate_size=60,
                         output_hidden_states=True, output_attentions=True)
student_model = BertModel(bert_config)

### Train data loader
input_ids = torch.LongTensor(np.random.randint(100, 1000, (100000, 50)))
attention_mask = torch.LongTensor(np.ones((100000, 50)))
token_type_ids = torch.LongTensor(np.zeros((100000, 50)))
train_data = TensorDataset(input_ids, attention_mask, token_type_ids)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=train_batch_size)

### Train data adaptor
### It is a function that turns batch_data (from train_dataloader) into the inputs of teacher_model and student_model
### You can define your own train_data_adaptor. Remember the inputs are device and batch_data.
### The output is either a dict or a tuple, but it must be consistent with your model's input
def train_data_adaptor(device, batch_data):
    batch_data = tuple(t.to(device) for t in batch_data)
    batch_data_dict = {"input_ids": batch_data[0],
                       "attention_mask": batch_data[1],
                       "token_type_ids": batch_data[2], }
    # In this case, the teacher and student use the same input
    return batch_data_dict, batch_data_dict

### The loss model is the key part of this framework.
### We have already provided a general loss model for distilling multiple bert layers
### In most cases, you can directly use this model.
#### First, we should define a distill_config which indicates how to compute the loss between teacher and student.
#### distill_config is a list object; each item indicates how to calculate one loss term.
#### It also defines which output of which layer is used to calculate that loss.
#### It should be consistent with your output_adaptor
distill_config = [
    # means: compute a loss between the teacher's and the student's embedding_layer embeddings
    {"teacher_layer_name": "embedding_layer", "teacher_layer_output_name": "embedding",
     "student_layer_name": "embedding_layer", "student_layer_output_name": "embedding",
     "loss": {"loss_function": "mse_with_mask", "args": {}}, "weight": 1.0
     },
    # means: compute a loss between the teacher's bert_layer12 hidden_states and the student's bert_layer3 hidden_states
    {"teacher_layer_name": "bert_layer12", "teacher_layer_output_name": "hidden_states",
     "student_layer_name": "bert_layer3", "student_layer_output_name": "hidden_states",
     "loss": {"loss_function": "mse_with_mask", "args": {}}, "weight": 1.0
     },
    {"teacher_layer_name": "bert_layer12", "teacher_layer_output_name": "attention",
     "student_layer_name": "bert_layer3", "student_layer_output_name": "attention",
     "loss": {"loss_function": "attention_mse_with_mask", "args": {}}, "weight": 1.0
     },
    {"teacher_layer_name": "pred_layer", "teacher_layer_output_name": "pooler_output",
     "student_layer_name": "pred_layer", "student_layer_output_name": "pooler_output",
     "loss": {"loss_function": "mse", "args": {}}, "weight": 1.0
     },
]

### teacher_output_adaptor and student_output_adaptor
### In most cases, a model's output is a tuple object. However, in our package, the output needs to be a dict object,
### like: { "layer_name": {"output_name": value} .... }
### Hence, the output adaptor turns your model's output into a dict-object output
### In my case, the teacher and student can use the same adaptor
def output_adaptor(model_output):
    last_hidden_state, pooler_output, hidden_states, attentions = model_output
    output = {"embedding_layer": {"embedding": hidden_states[0]}}
    for idx in range(len(attentions)):
        output["bert_layer" + str(idx + 1)] = {"hidden_states": hidden_states[idx + 1],
                                               "attention": attentions[idx]}
    output["pred_layer"] = {"pooler_output": pooler_output}
    return output

# loss_model
loss_model = MultiLayerBasedDistillationLoss(distill_config=distill_config,
                                             teacher_output_adaptor=output_adaptor,
                                             student_output_adaptor=output_adaptor)
# optimizer
param_optimizer = list(student_model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = torch.optim.Adam(params=optimizer_grouped_parameters, lr=learning_rate)

# evaluator
# This is a basic evaluator; it can print the loss value and save models
# You can define your own evaluator class that implements the interface IEvaluator
evaluator = MultiLayerBasedDistillationEvaluator(save_dir="save_model", save_step=1000, print_loss_step=20)

# Get a KnowledgeDistiller
distiller = KnowledgeDistiller(teacher_model=teacher_model, student_model=student_model,
                               train_dataloader=train_dataloader, dev_dataloader=None,
                               train_data_adaptor=train_data_adaptor, dev_data_adaptor=None,
                               device=device, loss_model=loss_model, optimizer=optimizer,
                               evaluator=evaluator, num_epoch=num_epoch)
# start distillation
distiller.distillate()
Let me also introduce another knowledge distillation library, TextBrewer, developed by HIT (Harbin Institute of Technology). Compared with my library it implements more algorithms and runs more stably; I recommend using it.
GitHub: https://github.com/airaria/TextBrewer
Here, too, is a complete runnable example that does not require any external data:
import torch
import numpy as np
import pickle
import textbrewer
from textbrewer import GeneralDistiller
from textbrewer import TrainingConfig, DistillationConfig
from transformers import BertConfig, BertModel
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

## define models
bert_config = BertConfig(num_hidden_layers=12, output_hidden_states=True, output_attentions=True)
teacher_model = BertModel(bert_config).to(device)
bert_config = BertConfig(num_hidden_layers=3, output_hidden_states=True, output_attentions=True)
student_model = BertModel(bert_config).to(device)

# optimizer
param_optimizer = list(student_model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = torch.optim.Adam(params=optimizer_grouped_parameters, lr=2e-5)

### data
input_ids = torch.LongTensor(np.random.randint(100, 1000, (100000, 64)))
attention_mask = torch.LongTensor(np.ones((100000, 64)))
token_type_ids = torch.LongTensor(np.zeros((100000, 64)))
train_data = TensorDataset(input_ids, attention_mask, token_type_ids)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=16)

# Define an adaptor for translating the model inputs and outputs
# into the data format the distiller needs
# (the keys appear to be fixed?)
def bert_adaptor(batch, model_outputs):
    last_hidden_state, pooler_output, hidden_states, attentions = model_outputs
    hidden_states = list(hidden_states)
    hidden_states.append(pooler_output)
    output = {"inputs_mask": batch[1],
              "attention": attentions,
              "hidden": hidden_states}
    return output

# Training configuration
train_config = TrainingConfig(gradient_accumulation_steps=1,
                              ckpt_frequency=10,
                              ckpt_epoch_frequency=1,
                              log_dir='logs',
                              output_dir='saved_models',
                              device='cuda')
# Distillation configuration
# Matching different layers of the student and the teacher
# Important: this defines how the distillation is done
# Custom loss functions are not supported
# A CLS loss is not supported either, but you can force one in as a hidden loss
distill_config = DistillationConfig(
    intermediate_matches=[
        {'layer_T': 0, 'layer_S': 0, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},  # embedding loss
        {'layer_T': 4, 'layer_S': 1, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},  # hidden loss
        {'layer_T': 8, 'layer_S': 2, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},
        {'layer_T': 12, 'layer_S': 3, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},
        {'layer_T': 3, 'layer_S': 0, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1},  # attention loss
        {'layer_T': 7, 'layer_S': 1, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1},
        {'layer_T': 11, 'layer_S': 2, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1},
        {'layer_T': 12, 'layer_S': 3, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},  # effectively the CLS loss
    ]
)

# Build distiller
distiller = GeneralDistiller(
    train_config=train_config, distill_config=distill_config,
    model_T=teacher_model, model_S=student_model,
    adaptor_T=bert_adaptor, adaptor_S=bert_adaptor)

# Start!
# The callback can be used to evaluate on a dev set
# Note that what gets saved is the state_dict
with distiller:
    distiller.train(optimizer=optimizer, scheduler=None, dataloader=train_dataloader,
                    num_epochs=10, callback=None)
There are many other ways to speed up BERT that I will not go into; if you are interested, look into:
This article may be reposted, but please cite the source: