
ICCV2019_Slimmable:(US-Nets)Universally Slimmable Networks and Improved Training Techniques

2022-09-02 15:02:48

Tags: Slimmable, Training, sub-network, Improved, BN, US, width, networks

Institute: University of Illinois at Urbana-Champaign
Author: Jiahui Yu, Thomas Huang
GitHub: https://github.com/JiahuiYu/slimmable_networks

Introduction

　　The original Slimmable Networks switch the network width among a predefined set of widths.

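　　=> Motivation: can a single neural network run at arbitrary width? The authors argue that a wider network should perform no worse than its slimmer sub-networks, and that the residual error is bounded by the following inequality:

　　$$|y^n - y^{k+1}| \le |y^n - y^k| \le |y^n - y^{k_0}|$$

　　where $y^k = \sum_{i=1}^{k} w_i x_i$ is the result of aggregating the first $k$ channels, and $k_0$ is a fixed hyper-parameter.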

　　Challenges:

　　First, how to deal with neural networks with batch normalization? (BN design)

　　Second, how to train US-Nets efficiently? (training)

　　Third, compared with training individual networks, what else can we explore in US-Nets to improve overall performance?

　　The BN layers raise two problems:

　　First, accumulating independent BN statistics of all sub-networks in a US-Net during training is computationally intensive and inefficient.

　　Second, if in each iteration we only update some sampled sub-networks, then these BN statistics are insufficiently accumulated thus inaccurate, leading to much worse accuracy in our experiments.

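　　To address these problems, the authors make the following contributions:

　　(1) They train networks that can run at arbitrary width.

　　(2) They propose two training techniques (the sandwich rule and inplace distillation).

　　(3) They run experiments and ablation studies on image classification, image super-resolution, and reinforcement learning.

　　(4) They further study several hyper-parameters: the width lower bound k0, the width divisor d, the number of widths sampled per iteration n, and the size of the subset used for BN post-statistics.

　　(5) They further propose that each layer can use its own width ratio.

　　(6) They lay the groundwork for follow-up work (one-shot architecture search).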

Related Work

　　Slimmable Networks.

　　Knowledge Distillation: transfer the learned knowledge from a pretrained network to a new one by training it with predicted features, soft targets, or both.

Method

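　　Feature aggregation at an output neuron:

　　$$y = \sum_{i=1}^{n} w_i x_i$$

　　where $n$ is the number of input channels. With $y^k = \sum_{i=1}^{k} w_i x_i$, the residual error $\delta$ between the fully aggregated feature $y^n$ and the partially aggregated feature $y^k$ is bounded as:

　　$$|y^n - y^{k+1}| \le |y^n - y^k| \le |y^n - y^{k_0}| \tag{3}$$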

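　　Equation (3) suggests that a slimmable network can run at any width in the interval $[k_0, n]$ (US-Nets), and that in principle the bounded inequality applies to any neural network, regardless of which normalization layer it uses.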
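　　To make the notation concrete, here is a minimal sketch of partial aggregation and the residual errors. The weights and inputs are toy values chosen so that later channels contribute less; the inequality is a property the authors argue for in trained networks, not something that holds for arbitrary weights. The function names are hypothetical.

```python
def partial_aggregate(weights, inputs, k):
    """Return y^k, the weighted sum over the first k input channels."""
    return sum(w * x for w, x in zip(weights[:k], inputs[:k]))


def residual_errors(weights, inputs, k0):
    """Return [(k, |y^n - y^k|)] for k = k0, ..., n."""
    n = len(weights)
    y_full = partial_aggregate(weights, inputs, n)
    return [
        (k, abs(y_full - partial_aggregate(weights, inputs, k)))
        for k in range(k0, n + 1)
    ]


# Toy values (assumption): channel contributions shrink with the channel index.
weights = [0.5, 0.25, 0.125, 0.0625]
inputs = [1.0, 1.0, 1.0, 1.0]
errors = residual_errors(weights, inputs, k0=1)

for k, delta in errors:
    print(f"k={k}: residual error {delta}")
```

　　With these toy values the residual shrinks as k grows and reaches zero at k = n, which is the behaviour the bounded inequality describes.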

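　　During training, a BN layer normalizes the current mini-batch as

　　$$\hat{y} = \gamma \frac{y - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta \tag{4}$$

　　where $\mu_B$ and $\sigma_B^2$ are the mean and variance of the current mini-batch, $\epsilon$ is a small constant that prevents division by zero, and $\gamma$ and $\beta$ are the learnable scale and bias. The global feature mean and variance are updated with a moving average:

　　$$\mu_t = m\,\mu_{t-1} + (1-m)\,\mu_B, \qquad \sigma_t^2 = m\,\sigma_{t-1}^2 + (1-m)\,\sigma_B^2 \tag{5}$$

　　where $m$ is the momentum and $t$ is the training iteration index. After $T$ training iterations, inference uses these global statistics:

　　$$\hat{y}_{test} = \gamma^* \frac{y_{test} - \mu_T}{\sqrt{\sigma_T^2 + \epsilon}} + \beta^* \tag{6}$$

　　where $\gamma^*$ and $\beta^*$ are the optimized parameters. Equation (6) can be rewritten as

　　$$\hat{y}_{test} = \gamma' y_{test} + \beta', \qquad \gamma' = \frac{\gamma^*}{\sqrt{\sigma_T^2 + \epsilon}}, \qquad \beta' = \beta^* - \gamma' \mu_T \tag{7}$$

　　Besides the moving average of Equation (5), the statistics can also be computed with an exact average:

　　$$m = \frac{t-1}{t}, \qquad \mu_t = m\,\mu_{t-1} + (1-m)\,\mu_B, \qquad \sigma_t^2 = m\,\sigma_{t-1}^2 + (1-m)\,\sigma_B^2 \tag{8}$$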
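　　The rewriting of the inference-time BN into a single scale and shift can be checked numerically. The sketch below is for one scalar channel; the function names and the numeric values are assumptions for illustration only.

```python
import math


def bn_inference(y, mean, var, gamma, beta, eps=1e-5):
    """Inference-time BN using the stored global mean and variance."""
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_bn(mean, var, gamma, beta, eps=1e-5):
    """Fold BN into y_hat = scale * y + shift."""
    scale = gamma / math.sqrt(var + eps)
    shift = beta - scale * mean
    return scale, shift


# Hypothetical statistics and parameters for one channel.
stats = dict(mean=0.3, var=2.0, gamma=1.5, beta=-0.2)
scale, shift = fold_bn(**stats)

for y in (-1.0, 0.0, 2.5):
    # Both forms should agree up to floating-point rounding.
    print(y, bn_inference(y, **stats), scale * y + shift)
```

　　Because the folded form depends on the parameters only through the stored statistics, changing the width only requires swapping in the statistics for that width.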

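　　In practice, the authors simply compute the BN statistics for each width after training (all that text for this? Consider it a review of BN theory), because a randomly sampled subset of the training set already gives an accurate estimate of the statistics.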
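　　A minimal sketch of this post-training calibration for one channel at one width, assuming the activations of a few calibration batches are already available as plain lists. The function names are hypothetical, and averaging per-batch variances follows the exact-average update rather than a pooled variance.

```python
def exact_average(values):
    """Running average with momentum m = (t - 1) / t, i.e. the arithmetic mean."""
    running = 0.0
    for t, value in enumerate(values, start=1):
        m = (t - 1) / t
        running = m * running + (1 - m) * value
    return running


def calibrate_bn(batches):
    """Estimate BN mean and variance from a list of activation batches."""
    means, variances = [], []
    for batch in batches:
        mu = sum(batch) / len(batch)
        var = sum((v - mu) ** 2 for v in batch) / len(batch)
        means.append(mu)
        variances.append(var)
    return exact_average(means), exact_average(variances)


# Toy activations (assumption): two calibration batches of two samples each.
bn_mean, bn_var = calibrate_bn([[1.0, 3.0], [2.0, 6.0]])
print(bn_mean, bn_var)
```

　　In a real US-Net this would be repeated once per width of interest after training, with the network weights frozen.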

　　For training, the authors propose the sandwich rule and inplace distillation.

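　　The sandwich rule: in each training iteration, sample n-2 sub-networks at random from the width multiplier range [0.25, 1.0]×, then add the largest and smallest sub-networks to get n sub-networks in total. The performance of any sampled sub-network lies between that of the 0.25× and 1.0× sub-networks.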
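　　The sampling step can be sketched as follows. The function name and the choice to return the largest width first are assumptions; putting the largest width first is convenient because its predictions are needed by the other widths during inplace distillation.

```python
import random


def sample_widths(n, min_width=0.25, max_width=1.0, rng=random):
    """Sandwich rule: largest and smallest widths plus n - 2 random widths."""
    if n < 2:
        raise ValueError("the sandwich rule needs at least two widths")
    middle = [rng.uniform(min_width, max_width) for _ in range(n - 2)]
    return [max_width, min_width] + middle


widths = sample_widths(4, rng=random.Random(0))
print(widths)
```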

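　　The sandwich rule gives better convergence and better overall performance. Its advantages are:

　　1. By training the largest and smallest sub-networks and tracking their validation errors, we effectively obtain the upper and lower bounds of performance.

　　2. Training the largest sub-network is required for inplace distillation.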

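　　Inplace Distillation: the predictions of the largest sub-network during training are used as training labels for the other sub-networks, while the largest sub-network itself is trained on the ground truth.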

　　Image classification: the soft probabilities predicted by the largest width are used as labels, with cross entropy as the objective function.

　　Image super-resolution: the predicted high-resolution patches are used as labels, with either ℓ1 or ℓ2 as the training objective.

　　Reinforcement learning: the policy predicted by the model at the largest width is used for roll-outs.

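　　The authors also tried combining the predicted labels with the ground-truth labels as training labels for the sub-networks, but this performed worse.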
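　　For the classification case, the distillation loss is a cross entropy against soft targets. The sketch below uses plain Python lists as logits; the function names and numbers are hypothetical, and the teacher probabilities are treated as constants (in a framework such as PyTorch this would usually mean detaching them from the graph, which is an assumption about the implementation).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(student_logits, teacher_probs):
    """Cross entropy of the student prediction against soft teacher labels."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Largest width acts as the teacher; its output is a fixed label here.
teacher_logits = [2.0, 0.5, -1.0]
teacher_probs = softmax(teacher_logits)

loss_matching = soft_cross_entropy(teacher_logits, teacher_probs)
loss_uniform = soft_cross_entropy([0.0, 0.0, 0.0], teacher_probs)
print(loss_matching, loss_uniform)
```

　　A student that reproduces the teacher's distribution gets the lowest possible loss (the teacher's entropy), so the loss pushes each sub-network toward the largest width's prediction.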

　　Training procedure:
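　　The overall loop can be sketched on a toy problem. This is not the authors' code: it assumes a single linear "layer" whose sub-network of width k uses the first k input channels, a squared-error loss, and hand-written gradients. It only illustrates the structure described above: train the largest width on the ground truth, train the smallest and the randomly sampled widths on the largest width's prediction, accumulate all gradients, then update once.

```python
import random


def forward(weights, x, k):
    """Prediction of the sub-network that uses only the first k channels."""
    return sum(w * v for w, v in zip(weights[:k], x[:k]))


def sandwich_channel_counts(n, k0, num_widths, rng):
    """Smallest width plus num_widths - 2 random widths (largest handled separately)."""
    return [k0] + [rng.randint(k0, n) for _ in range(num_widths - 2)]


def train_step(weights, x, target, channel_counts, lr=0.01):
    """One update with gradients accumulated over all sampled widths."""
    n = len(weights)
    grads = [0.0] * n
    # Largest width: squared error against the ground-truth target.
    teacher = forward(weights, x, n)
    for i in range(n):
        grads[i] += 2.0 * (teacher - target) * x[i]
    # Other widths: squared error against the teacher output, held constant.
    for k in channel_counts:
        err = forward(weights, x, k) - teacher
        for i in range(k):
            grads[i] += 2.0 * err * x[i]
    return [w - lr * g for w, g in zip(weights, grads)]


rng = random.Random(0)
true_weights = [1.0, 0.5, 0.25, 0.125]  # toy ground-truth model (assumption)
weights = [0.0] * 4
for _ in range(3000):
    x = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    target = forward(true_weights, x, 4)
    counts = sandwich_channel_counts(4, k0=1, num_widths=4, rng=rng)
    weights = train_step(weights, x, target, counts)

probe = [1.0, 1.0, 1.0, 1.0]
initial_error = abs(forward([0.0] * 4, probe, 4) - forward(true_weights, probe, 4))
final_error = abs(forward(weights, probe, 4) - forward(true_weights, probe, 4))
print(initial_error, final_error)
```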

Experiments

　　ImageNet Classification:

　　Image Super-Resolution.

　　The authors attribute the drop in performance to using hyper-parameters tuned for the individual models rather than for US-Nets.

　　Deep Reinforcement Learning

Ablation Study

　　The Sandwich Rule:

　　The sandwich rule has better performance on average, with good accuracy at both the smallest and the largest width.

　　Training the smallest width matters more than training the largest.

　　Inplace Distillation:

　　Post-Statistics of Batch Normalization:

　　Width Lower Bound

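　　Width Divisor d: MobileNets floor the channel number approximately as $\lfloor nr/d \rfloor \cdot d$, where $n$ is the base channel number, $r$ is the width multiplier, and $d$ is the divisor.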
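　　As a small sketch of the flooring rule (the function name is hypothetical, and real implementations may add extra safeguards such as a minimum channel count that are not shown here):

```python
def floor_channels(base_channels, width_mult, divisor=8):
    """Floor base_channels * width_mult down to a multiple of divisor."""
    return int(base_channels * width_mult // divisor) * divisor


for mult in (0.25, 0.3, 0.5, 1.0):
    print(mult, floor_channels(96, mult))
```

　　A larger divisor means fewer distinct channel counts per layer, so several nearby width multipliers map to the same sub-network.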

　　Number of Sampled Widths Per Iteration n

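Comments: this paper addresses the fixed-width limitation of the ICLR'19 Slimmable Networks, and mainly shows that a network can run at arbitrary width. The sandwich rule and inplace distillation give follow-up work a training recipe to build on.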

Source: https://www.cnblogs.com/huang-hz/p/16647511.html