2018-04-24


[Deep Learning Tricks Series] batch, batch_size, batch normalization

Introduction

“you want zero-mean unit-variance activations? just make them so.”

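BN enforces normalization directly, rather than relying on a carefully designed activation function, so that intermediate-layer activations follow N(0,1).

Why apply BN before the activation?

Because what we want is for the input to the activation function to follow N(0,1), not its output.

With BN, do we still need to normalize the input data?

Generally no, since BN re-normalizes the activations inside the network.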

Train mode:
- 𝜇, 𝜎 are functions of 𝑥; backprop gradients
Test mode:
- 𝜇, 𝜎 are pre-computed on training set (by running average, or post-processing after training)
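
The two modes can be sketched in a few lines of plain Python. This is a toy illustration, not any library's implementation: the class name `BatchNorm1D` and its arguments are made up for this note, the learnable scale and offset (gamma, beta) are left out, and no gradients are computed.

```python
import math


class BatchNorm1D:
    """Toy batch normalization over a batch of feature vectors (lists of floats)."""

    def __init__(self, num_features, momentum=0.99, eps=1e-3):
        self.momentum = momentum
        self.eps = eps
        self.moving_mean = [0.0] * num_features
        self.moving_var = [1.0] * num_features

    def __call__(self, batch, training):
        columns = list(zip(*batch))  # one tuple per feature
        if training:
            # Train mode: mean and variance are computed from the current batch.
            means = [sum(c) / len(c) for c in columns]
            variances = [sum((v - m) ** 2 for v in c) / len(c)
                         for c, m in zip(columns, means)]
            # Update the running averages that test mode will use later.
            mom = self.momentum
            self.moving_mean = [mom * old + (1 - mom) * new
                                for old, new in zip(self.moving_mean, means)]
            self.moving_var = [mom * old + (1 - mom) * new
                               for old, new in zip(self.moving_var, variances)]
        else:
            # Test mode: use the statistics accumulated during training.
            means, variances = self.moving_mean, self.moving_var
        return [[(v - mu) / math.sqrt(var + self.eps)
                 for v, mu, var in zip(row, means, variances)]
                for row in batch]


bn = BatchNorm1D(num_features=2, momentum=0.9)
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
train_out = bn(batch, training=True)   # normalized with this batch's statistics
test_out = bn(batch, training=False)   # normalized with the running averages
```

After a single training step the running averages are still far from the batch statistics, so `test_out` differs from `train_out`; they only agree once the running averages have had time to settle.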

Normalization in DNNs

Whitening
-

ResNet uses Batch Normalization

Examples

BN in ResNet

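# Excerpt from a ResNet block: filters1, bn_axis, input_tensor and the name bases come from the enclosing function.
from keras.layers import Conv2D, BatchNormalization, Activation

x = Conv2D(filters1, (1, 1), name=conv_name_base + '2a')(input_tensor)
x = BatchNormalization(axis=bn_axis, name=bn_name_base + '2a')(x)  # BN comes before the activation
x = Activation('relu')(x)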

BN in RNNs

In RNNs

In the Transformer

There is no BN, only LN (layer normalization). Why?
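
One commonly cited reason (my own reading, the notes above leave the question open) is that LN computes its statistics per sample, so the result does not depend on the batch size or on what else is in the batch. The sketch below only illustrates that difference in normalization axes; the function names are made up for this note and learnable parameters are omitted.

```python
import math


def normalize(values, eps=1e-5):
    """Shift and scale a list of floats to zero mean and (near) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(batch):
    """Normalize each feature across the samples of the batch."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def layer_norm(batch):
    """Normalize each sample across its own features."""
    return [normalize(row) for row in batch]


sample = [1.0, 2.0, 6.0]
small_batch = [sample, [0.0, 0.0, 0.0]]
large_batch = [sample, [0.0, 0.0, 0.0], [9.0, 9.0, 9.0]]

# LN output for `sample` does not depend on the other samples in the batch.
ln_same = layer_norm(small_batch)[0] == layer_norm(large_batch)[0]
# BN output for `sample` changes when the batch composition changes.
bn_same = batch_norm(small_batch)[0] == batch_norm(large_batch)[0]
```

Here `ln_same` is true and `bn_same` is false: the same sample is normalized identically by LN in both batches, but differently by BN.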

batch_size

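To fully exploit large clusters and speed up training, people keep increasing the batch size: a larger batch size lets us scale up the number of GPUs without reducing the compute load on each GPU.

Tuning experience

The elusive batch size

Common rules of thumb:

Speed:
1. The larger the batch size, the higher the throughput (more samples processed per hour).
2. Throughput has a ceiling: compute resources eventually saturate.
Accuracy:
1. The larger the batch size, the worse the generalization tends to be (worse results on the test set).
2. With a very small batch size (e.g. 1, i.e. pure SGD), the model may struggle to converge or even diverge.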

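Measurements

Language model

My tests

Batch_size	Emb_size	N_unit	Epoch	Elapsed	min/epoch	TrainPPL	TestPPL	wps (train)	GPU memory	Comments
512	512	512	20	11h	33	213	363,313	90k-150k	2G
2048	512	512	20	6.5h	19.5	350,208	514,300	170k-220k	4G	Larger batch speeds up training; convergence on the test set is uncertain
4096	512	512	20	6h	18	438,208	687,332	170k-220k	8G	Speed reaches its ceiling; compute resources are probably saturated
512	1024	2048	13	23h	106	190,105	236,185	30k-35k	4G	Larger emb_size and num_unit improve results
2048	1024	2048	15	23.5h	94	229,102	284,185	35k-40k	8G	Larger batch, marginal speedup

This confirms rules 1.1 and 1.2 (the two speed rules).
It does not confirm rule 2 (accuracy).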
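
The speed rules can be checked directly from the table. The snippet below recomputes min/epoch from the Epoch and Elapsed columns and derives the speedup of each small-model run (Emb_size=512, N_unit=512) over the batch size 512 baseline. The helper names are made up for this note; the numbers are copied from the table.

```python
# (batch_size, emb_size, n_unit, epochs, elapsed_hours) copied from the table above.
runs = [
    (512, 512, 512, 20, 11.0),
    (2048, 512, 512, 20, 6.5),
    (4096, 512, 512, 20, 6.0),
    (512, 1024, 2048, 13, 23.0),
    (2048, 1024, 2048, 15, 23.5),
]


def minutes_per_epoch(epochs, elapsed_hours):
    return elapsed_hours * 60 / epochs


def speedup_over_baseline(run, baseline):
    """Ratio of the baseline's per-epoch time to this run's per-epoch time."""
    return minutes_per_epoch(*baseline[3:]) / minutes_per_epoch(*run[3:])


small_model = runs[:3]
speedups = {run[0]: round(speedup_over_baseline(run, small_model[0]), 2)
            for run in small_model}
# speedups -> {512: 1.0, 2048: 1.69, 4096: 1.83}
```

A 4x larger batch gives about a 1.69x speedup, and an 8x larger batch only about 1.83x, which is consistent with compute saturating.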

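Machine translation - transformer

Results reported in the paper Training Tips for the Transformer Model:

base model:
- A large batch size gives both higher throughput and faster convergence. So for the Transformer, the larger the batch size the better, as long as it does not run out of memory (OOM).
- Confirms rule 1 (speed); contradicts rule 2.1 (larger batches generalize worse)
big model:
- batch_size=1450 works well, while with 1400 quality suddenly degrades after 2 hours. A likely cause is that a batch that is too small gives noisy gradient estimates, so training diverges. The divergence may be temporary (1400) or unrecoverable (1000). Another possible factor is that the big model may be harder to initialize.
- Confirms rules 1 and 2
Recommendation: for the Transformer, set the batch size as large as you can.

Results of the transformer-**base** model at different batch sizes (single GPU)
Results of the transformer-**big** model at different batch sizes (single GPU)

My tests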

…

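Summary - challenges

Large batch sizes cost accuracy

Increasing the batch size too far causes a clear loss of accuracy! With a large batch size (relative to the number of training samples), sample randomness decreases and the descent direction stabilizes, so training moves from SGD toward full-batch GD. The model then more easily converges to a local optimum near the initial point, which cancels out the benefit of the extra compute. Increasing the batch size without losing accuracy is the primary challenge facing the Jizhi (机智) team.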
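
The claim that sample randomness decreases with batch size can be illustrated with a small simulation. This is a sketch on a made-up toy problem (fitting a single scalar), not a reproduction of any experiment above: it measures how much the mini-batch gradient varies across random batches for several batch sizes.

```python
import random
import statistics

random.seed(0)

# Hypothetical toy problem: fit a scalar w to minimize mean((w - y_i)^2).
# The per-sample gradient at w is 2 * (w - y_i).
targets = [random.gauss(0.0, 1.0) for _ in range(10_000)]
w = 0.5


def minibatch_gradient(batch_size):
    batch = random.sample(targets, batch_size)
    return sum(2 * (w - y) for y in batch) / batch_size


def gradient_std(batch_size, trials=200):
    """Spread of the mini-batch gradient estimate across random batches."""
    return statistics.pstdev([minibatch_gradient(batch_size) for _ in range(trials)])


noise = {b: gradient_std(b) for b in (1, 16, 256, 4096)}
```

The spread in `noise` shrinks as the batch size grows, roughly with the square root of the batch size, so the update direction becomes more and more like the full-batch gradient.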

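TODO: test the effect of a large batch_size

Solutions

To improve scalability at large batch sizes, the Jizhi team represents training data and parameters as half-precision floats, which reduces both computation and bandwidth requirements. However, half-precision representation inevitably lowers the accuracy the model converges to.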
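
One way half precision can hurt convergence is that small weight updates get rounded away. The sketch below shows this with Python's `struct` module, which can round a float to IEEE 754 half or single precision. The weight and update values are hypothetical, and this is only an illustration of the rounding effect, not the Jizhi team's training setup.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]


weight = 1.0
update = 1e-4  # hypothetical small gradient step

fp16_result = to_fp16(to_fp16(weight) + to_fp16(update))
fp32_result = to_fp32(to_fp32(weight) + to_fp32(update))

update_lost_in_fp16 = fp16_result == weight   # the step is rounded away
update_kept_in_fp32 = fp32_result != weight   # the step survives in fp32
```

Near 1.0 the spacing between adjacent half-precision values is about 0.001, so a step of 0.0001 disappears entirely, while single precision still registers it.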

My tests - machine translation

API

keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001,...)

Arguments

axis: Integer, the axis that should be normalized
  (typically the features axis). For instance, after a Conv2D layer
  with data_format="channels_first", set axis=1 in BatchNormalization.
momentum: Momentum for the moving mean and the moving variance.
epsilon: Small float added to variance to avoid dividing by zero.
center: If True, add offset of beta to normalized tensor. If False, beta is ignored.
scale: If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling will be done by the next layer.
beta_initializer: Initializer for the beta weight.
gamma_initializer: Initializer for the gamma weight.
moving_mean_initializer: Initializer for the moving mean.
moving_variance_initializer: Initializer for the moving variance.
beta_regularizer: Optional regularizer for the beta weight.
gamma_regularizer: Optional regularizer for the gamma weight.
beta_constraint: Optional constraint for the beta weight.
gamma_constraint: Optional constraint for the gamma weight.
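
To get a feel for the `momentum` argument, the sketch below applies the exponential moving average update, moving = momentum * moving + (1 - momentum) * batch_value, to a constant stream of batch means. The helper name and the constant value 5.0 are made up for this note; the starting value of 0 assumes a zero-initialized moving mean.

```python
def moving_average(batch_values, momentum, initial=0.0):
    """Exponential moving average: moving = momentum * moving + (1 - momentum) * value."""
    moving = initial
    for value in batch_values:
        moving = momentum * moving + (1.0 - momentum) * value
    return moving


# Hypothetical case: every batch reports the same mean, 5.0.
true_mean = 5.0
after_100_steps = {m: moving_average([true_mean] * 100, momentum=m)
                   for m in (0.9, 0.99)}
```

With momentum=0.9 the moving mean has essentially reached 5.0 after 100 steps, while with momentum=0.99 it is still only about 63% of the way there. So with the default momentum and a short training run, the test-mode statistics may lag noticeably behind the batch statistics.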

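Examples

On June 25, 2018, after achieving some results in Dota 2 5v5, OpenAI reported that it trained with batch size=100W (1 million), and batch_size=800W (8 million) for 1v1; training time was measured in weeks.
Tencent's internal game AI also suffers from poor convergence accuracy at large batch sizes and from slow training; currently, with batch_size above 10K it fails to converge to the baseline accuracy.
Tencent trained ResNet-50 on the ImageNet dataset in 6.6 minutes with batch_size=65536.
Transformer translation models: batch_size=2048 or 4096

ResNet-50 training software/hardware configuration and performance on each platform
Note: the baseline accuracy at batch size 256 is 75.3%.

Are batch sizes in the millions trained on GPUs? Is there enough GPU memory? How much would be needed?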

。。。。

When GPU memory is insufficient, can each card hold only part of the parameters?

...

Source code implementation

References

Batch normalization layer (Ioffe and Szegedy, 2014)
Batch Normalization: Accelerating Deep Network.. 2015
Training ImageNet in 4 minutes! Tencent sets a record | 机器之心 (Synced)
The LARS algorithm is related to batch_size
- Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour - Facebook
- ImageNet Training in 24 Minutes - UC Berkeley
- ImageNet Training by CPU: AlexNet in 11 Minutes and ResNet-50 in 48 Minutes, popular-science article | Sohu