[FAQ] Local Training and Prediction: Frequently Asked Questions

Introduction

In the final part of the user guide, we have collected the questions that commonly come up when using PaddlePaddle. The posts in this part are:

2.22: [FAQ] Model Configuration

2.23: [FAQ] Parameter Settings

2.24: [FAQ] Local Training and Prediction

2.25: [FAQ] Cluster Training and Prediction

2.26: How to Contribute Code

2.27: How to Contribute Documentation

Local Training and Prediction FAQ

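|1. How to reduce memory usage

Training a neural network is inherently demanding on both host memory and GPU memory, often consuming tens of GB of host memory and several GB of GPU memory. PaddlePaddle's memory usage falls into the following main categories:

DataProvider buffer pool memory (host memory only)
Neuron activation memory (host memory and GPU memory)
Parameter memory (host memory and GPU memory)
Other miscellaneous memory

Here, miscellaneous memory refers to memory that PaddlePaddle itself uses for things such as string allocation and temporary variables; it is not considered further below.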

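A. Reducing DataProvider buffer pool memory

PyDataProvider loads data asynchronously and shuffles by randomly picking samples directly from memory. That is, samples flow from the data files into an in-memory pool, and from the pool into PaddlePaddle training.

Shrinking this memory pool therefore reduces memory usage and also shortens the data-loading phase before training starts. However, the pool effectively determines the granularity of the shuffle. So if you shrink the pool and still need the data to be random, it is best to shuffle the data file each time before reading it. Possible code: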

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

from paddle.trainer.PyDataProvider2 import *


@provider(min_pool_size=0, ...)
def process(settings, filename):
    # Shuffle the file on disk before reading it.
    os.system('shuf %s > %s.shuf' % (filename, filename))
    with open('%s.shuf' % filename, 'r') as f:
        for line in f:
            # get_sample_from_line is user-defined parsing, not shown here.
            yield get_sample_from_line(line)

Doing this can greatly reduce memory usage and may also speed up training. For detailed documentation, see api_pydataprovider2.
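To see why the pool size sets the shuffle granularity, here is a minimal pure-Python sketch of a bounded shuffle pool. It is not PaddlePaddle's implementation; pool_shuffle and its parameters are hypothetical names used only for illustration.

```python
import random


def pool_shuffle(samples, pool_size, seed=None):
    """Yield samples in a locally shuffled order using a bounded pool.

    Hypothetical helper for illustration; not part of PaddlePaddle.
    """
    rng = random.Random(seed)
    pool = []
    for sample in samples:
        pool.append(sample)
        if len(pool) >= pool_size:
            # Emit a randomly chosen sample once the pool is full.
            idx = rng.randrange(len(pool))
            pool[idx], pool[-1] = pool[-1], pool[idx]
            yield pool.pop()
    # Drain whatever is left in random order.
    rng.shuffle(pool)
    while pool:
        yield pool.pop()


if __name__ == "__main__":
    ordered = list(range(20))
    # With a pool of one sample the input order is unchanged.
    print(list(pool_shuffle(ordered, pool_size=1)))
    # With a pool of eight, samples only move within a small window.
    print(list(pool_shuffle(ordered, pool_size=8, seed=0)))
```

With a small pool a sample can only move a short distance from its original position, which is why shuffling the file itself becomes necessary once the pool is shrunk.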

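B. Neuron activation memory

During training, a neural network temporarily stores some data for every activation, such as the neuron activation values. During backpropagation, this data is used to update the parameters. The memory it takes depends mainly on two things: the batch size and the length of each sequence. In other words, it is proportional to the number of time steps contained in each mini-batch.

So there are two approaches:

Reduce the batch size, i.e. set settings(batch_size=1000) in the network configuration to a smaller value. Note that batch size is itself a hyperparameter of the network, so reducing it may affect training results.
Reduce the sequence length, or simply discard very long sequences. For example, if most sequences in a dataset have length 100-200 but there is suddenly one of length 10000, it can easily exceed the memory limit, especially in RNNs such as LSTM.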
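The effect of a single outlier sequence on the number of time steps in a mini-batch can be sketched as follows. The helper names and the length threshold of 200 are assumptions chosen for this illustration; they are not PaddlePaddle APIs.

```python
def total_time_steps(batch):
    """Number of time steps in a mini-batch of sequences."""
    return sum(len(seq) for seq in batch)


def drop_long_sequences(sequences, max_len):
    """Keep only sequences whose length does not exceed max_len."""
    return [seq for seq in sequences if len(seq) <= max_len]


if __name__ == "__main__":
    # 99 sequences of length 150 plus a single outlier of length 10000.
    batch = [[0] * 150 for _ in range(99)] + [[0] * 10000]
    print(total_time_steps(batch))                            # 24850
    print(total_time_steps(drop_long_sequences(batch, 200)))  # 14850
```

In this example the one long sequence accounts for about 40% of the time steps in the batch, so filtering it out in the data provider removes a large share of the activation memory.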

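C. Parameter memory

PaddlePaddle supports many optimization algorithms (optimizers), and different optimizers need different amounts of memory. For example, adadelta needs memory equal to roughly 5 times the size of the weight parameters. If the saved model directory is 100M, this optimizer needs at least 500M of memory.

Consider using an optimizer with lower memory needs, such as momentum.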
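The estimate above is a simple multiplication, sketched here. The factor of about 5 for adadelta comes from the text; the function name is hypothetical, and factors for other optimizers are not given in this article.

```python
def optimizer_memory_mb(model_size_mb, factor):
    """Rough memory need: saved model size times an optimizer-specific factor."""
    return model_size_mb * factor


if __name__ == "__main__":
    # 100M model trained with adadelta (factor of roughly 5, per the text).
    print(optimizer_memory_mb(100, 5))  # 500
```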

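|2. How to speed up training

Speeding up PaddlePaddle training can be approached from the following angles:

Reduce data loading time
Speed up the training computation
Use distributed training to harness more computing resources

A. Reducing data loading time

When using pydataprovider, you can shrink the buffer pool and also enable the in-memory cache, which greatly speeds up data loading. Shrinking the DataProvider buffer pool works on the same principle as shrinking it to reduce memory usage, described earlier.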

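import os

from paddle.trainer.PyDataProvider2 import *


@provider(min_pool_size=0, ...)
def process(settings, filename):
    # Shuffle the file on disk before reading it.
    os.system('shuf %s > %s.shuf' % (filename, filename))
    with open('%s.shuf' % filename, 'r') as f:
        for line in f:
            # get_sample_from_line is user-defined parsing, not shown here.
            yield get_sample_from_line(line)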

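The @provider interface also has a cache parameter that controls the caching method. When it is set to CacheType.CACHE_PASS_IN_MEM, the data generated during the first pass (one pass means going through all the training data once) is cached in memory; in later passes, data is no longer read from the Python side but directly from the in-memory cache. This also greatly reduces data loading time.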
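The idea behind pass-level caching can be sketched in plain Python. This is only an illustration of the behavior described above, not PaddlePaddle's implementation; PassCache and slow_reader are hypothetical names.

```python
class PassCache(object):
    """Cache the samples of the first pass and replay them afterwards.

    Hypothetical illustration of pass-level caching; not PaddlePaddle code.
    """

    def __init__(self, reader):
        self._reader = reader  # callable returning an iterator of samples
        self._cache = None     # filled after the first complete pass

    def __call__(self):
        if self._cache is not None:
            # Later passes: serve samples straight from memory.
            for sample in self._cache:
                yield sample
            return
        # First pass: read from the source and remember every sample.
        seen = []
        for sample in self._reader():
            seen.append(sample)
            yield sample
        self._cache = seen


if __name__ == "__main__":
    calls = []

    def slow_reader():
        calls.append(1)  # count how often the source is actually read
        for i in range(5):
            yield i

    cached = PassCache(slow_reader)
    for _ in range(3):   # three passes over the data
        list(cached())
    print(len(calls))    # the source was read only once
```

The trade-off is that a full pass of data stays resident in memory, so this speeds up loading at the cost of memory.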

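B. Speeding up the training computation

PaddlePaddle supports sparse training. Sparse training requires the training feature to be one of sparse_binary_vector, sparse_vector, or integer_value. In addition, the layer that interacts with this training data must have its Parameter set to sparse update mode, i.e. sparse_update=True.

Here we take a simple word2vec language model as an example: the two words before and the two words after a word are used to predict the word in the middle. The DataProvider for this task is: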

from paddle.trainer.PyDataProvider2 import *

DICT_DIM = 3000


@provider(input_types=[integer_sequence(DICT_DIM), integer_value(DICT_DIM)])
def process(settings, filename):
    with open(filename) as f:
        # yield word ids to predict inner word id
        # such as [28, 29, 10, 4], 4
        # It means the sentence is 28, 29, 4, 10, 4.
        # read_next_from_file is user-defined parsing, not shown here.
        yield read_next_from_file(f)

The configuration for this task is:

...  # the settings and the data provider definition are omitted.

DICT_DIM = 3000  # dictionary dimension.

word_ids = data_layer('word_ids', size=DICT_DIM)

emb = embedding_layer(
    input=word_ids, size=256, param_attr=ParamAttr(sparse_update=True))
emb_sum = pooling_layer(input=emb, pooling_type=SumPooling())
predict = fc_layer(input=emb_sum, size=DICT_DIM, act=Softmax())
outputs(
    classification_cost(
        input=predict, label=data_layer('label', size=DICT_DIM)))
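Why sparse updates help can be sketched in plain Python: only the embedding rows whose word ids appear in the mini-batch are touched, instead of all DICT_DIM rows. The sketch assumes plain SGD and uses a hypothetical helper name; it is not how PaddlePaddle implements sparse_update internally.

```python
def sparse_sgd_update(table, word_ids, grad_rows, lr):
    """Update only the embedding rows referenced by word_ids.

    table:     list of rows (one list of floats per word id)
    word_ids:  ids that appeared in the mini-batch
    grad_rows: one gradient row per entry of word_ids
    """
    for word_id, grad in zip(word_ids, grad_rows):
        row = table[word_id]
        for j, g in enumerate(grad):
            row[j] -= lr * g
    return set(word_ids)  # ids of the rows that were touched


if __name__ == "__main__":
    dict_dim, emb_size = 3000, 4  # small embedding size for the demo
    table = [[0.0] * emb_size for _ in range(dict_dim)]
    grads = [[1.0] * emb_size for _ in range(4)]
    touched = sparse_sgd_update(table, [28, 29, 10, 4], grads, lr=0.1)
    print(len(touched), "of", dict_dim, "rows updated")
```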

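C. Using more computing resources

Using more computing resources can be done in the following ways:

Single-machine CPU training

Use multi-threaded training: set the command-line argument trainer_count.

Single-machine GPU training

Train on a GPU: set the command-line argument use_gpu.

Train on multiple GPUs: set the command-line arguments use_gpu and trainer_count.

Multi-machine training

See cluster_train.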

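|3. How to specify GPU devices

For example, on a machine with 4 GPUs numbered from 0, to use GPUs 2 and 3:

Method 1: use the CUDA_VISIBLE_DEVICES (link: http://www.acceleware.com/blog/cudavisibledevices-masking-gpus) environment variable to select specific GPUs.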

env CUDA_VISIBLE_DEVICES=2,3 paddle train --use_gpu=true --trainer_count=2

Method 2: specify it with the command-line argument --gpu_id.

paddle train --use_gpu=true --trainer_count=2 --gpu_id=2
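When training is launched from a script, Method 1 can be sketched as below. The helper name is hypothetical, only the flags quoted above are used, and nothing is executed unless you pass the result to subprocess yourself.

```python
import os


def build_train_command(gpu_ids):
    """Return (argv, env) for training on the given physical GPU ids.

    Only flags quoted in the text are used; nothing is executed here.
    """
    env = dict(os.environ)
    # Mask all other GPUs; CUDA renumbers the visible ones starting from 0.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    argv = [
        "paddle", "train",
        "--use_gpu=true",
        "--trainer_count=%d" % len(gpu_ids),
    ]
    return argv, env


if __name__ == "__main__":
    argv, env = build_train_command([2, 3])
    print(env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
    # To launch for real: subprocess.call(argv, env=env)
```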

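|4. How to call the infer interface to output prediction results from multiple layers

Pass the layers to output as the output_layer argument of the paddle.inference.Inference() interface, as follows: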

inferer = paddle.inference.Inference(output_layer=[layer1, layer2], parameters=parameters)

Then specify which field to output. Taking the value field as an example:

out = inferer.infer(input=data_batch, field=["value"])

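Note that:

If 2 layers are specified as output layers, the result actually wanted is two matrices;
Suppose the output A of the first layer is an N1 * M1 matrix and the output B of the second layer is an N2 * M2 matrix;
By default, paddle.v2 concatenates A and B horizontally; when N1 and N2 differ, the following error is raised: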

ValueError: all the input array dimensions except for the concatenation axis must match exactly

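The concatenation fails because the output matrices of the layers have different heights. This commonly happens when:

outputting a sequence layer and a non-sequence layer at the same time;
multiple output layers process sequences of different lengths.

In this case, set flatten_result=False when calling the infer interface to skip the concatenation step. The return value of infer is then a Python list:

the number of elements in the list equals the number of output layers in the network;
each element of the list is the output matrix of one layer, of type numpy ndarray;
the height of each layer's output matrix equals the number of samples for non-sequence input, and the total number of elements in the input sequences for sequence input; the width equals the size of the layer in the configuration.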
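The difference between the two modes can be mimicked in plain Python, with matrices represented as lists of rows. collect_outputs is a hypothetical stand-in written for this illustration; it is not the paddle.v2 implementation, and its error message differs from the numpy message quoted above.

```python
def collect_outputs(layer_outputs, flatten_result=True):
    """Mimic combining per-layer output matrices (lists of rows).

    Hypothetical stand-in for illustration; not the paddle.v2 implementation.
    """
    if not flatten_result:
        # One matrix per output layer; heights are allowed to differ.
        return list(layer_outputs)
    heights = set(len(m) for m in layer_outputs)
    if len(heights) > 1:
        raise ValueError("cannot concatenate: row counts differ %s"
                         % sorted(heights))
    # Horizontal concatenation: join the i-th rows of every matrix.
    return [sum(rows, []) for rows in zip(*layer_outputs)]


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]   # N1 = 2, M1 = 2
    b = [[5], [6], [7]]    # N2 = 3, M2 = 1
    try:
        collect_outputs([a, b])
    except ValueError as err:
        print("flatten failed:", err)
    print(len(collect_outputs([a, b], flatten_result=False)))  # 2
```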

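|5. How to get the output of a layer during training

In event_handler, event.gm.getLayerOutputs("layer_name") returns the output of the layer named layer_name in the model configuration for the forward pass of the current mini-batch. The returned values are all of type numpy.ndarray and can be used for things such as computing custom evaluation metrics. For example: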

import logging

import numpy as np
import paddle.v2 as paddle

logger = logging.getLogger(__name__)


def score_diff(right_score, left_score):
    # Mean absolute difference between the two score arrays.
    return np.average(np.abs(right_score - left_score))


def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        # Only fetch layer outputs every 25 batches to limit the overhead.
        if event.batch_id % 25 == 0:
            diff = score_diff(
                event.gm.getLayerOutputs("right_score")["right_score"][
                    "value"],
                event.gm.getLayerOutputs("left_score")["left_score"][
                    "value"])
            logger.info(("Pass %d Batch %d : Cost %.6f, "
                         "average absolute diff scores: %.6f") %
                        (event.pass_id, event.batch_id, event.cost, diff))

Note: this method cannot retrieve the contents of a step inside paddle.layer.recurrent_group, but it can retrieve the output of paddle.layer.recurrent_group.

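|6. How to get parameter weights and gradients during training

In some cases, getting the weights (also called parameters) of the current mini-batch lets you inspect concrete values during training and makes it easier to troubleshoot and locate problems quickly. You can print them in event_handler (note: use paddle.event.EndForwardBackward so that the values are also available when training on GPU). Example code: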

...
parameters = paddle.parameters.create(cost)
...


def event_handler(event):
    if isinstance(event, paddle.event.EndForwardBackward):
        if event.batch_id % 25 == 0:
            for p in parameters.keys():
                logger.info("Param %s, Grad %s",
                            parameters.get(p), parameters.get_grad(p))

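Note: both "getting the output of a layer during training" and "getting parameter weights and gradients during training" copy data from C++ to numpy during training, which hurts training performance. Do not use them in performance-sensitive training scenarios.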


*Original post. All rights reserved; reproduction without permission is prohibited.

*Paddle staff on duty: wangp


This article was shared from CSDN - 飞桨PaddlePaddle.