[Paper Reading, CVPR 2020, Object Tracking] Siamese Box Adaptive Network for Visual Tracking

2021-04-13 19:01:15


Introduction

Paper: Siamese Box Adaptive Network for Visual Tracking

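The idea of this paper overlaps somewhat with SiamCAR: both observe that trackers in the SiamRPN family require the anchor bbox parameters to be set in advance, and that tuning these parameters takes considerable effort. Motivated by this, the paper uses a fully convolutional network (FCN) to regress the target's bbox directly, trained end to end.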


Main Content

[Figure: SiamBAN network architecture]

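The figure above shows the SiamBAN network architecture from the paper. Unlike the anchor-based SiamRPN family of trackers, the model splits tracking into a classification task (the Cls Module branch outputs the scores that indicate the target) and a regression task (the Reg Module branch outputs the offsets that describe the target's location).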

Siamese Network Backbone

SiamBAN uses ResNet-50 as the backbone network, modified as follows:

We remove the downsampling operations from the last two convolution blocks. In order to improve the receptive field, we use atrous convolution. In addition, inspired by multi-grid methods, we adopt different atrous rates in our model. Specifically, we set the stride to 1 in the conv4 and conv5 blocks, the atrous rate to 2 in the conv4 block, and the atrous rate to 4 in the conv5 block. We add a $1 \times 1$ convolution to reduce the output feature channels to 256, and use only the features of the template branch's center $7 \times 7$ region.
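The last step of the quote, keeping only the central $7 \times 7$ region of the template feature, can be sketched in a few lines of NumPy. This is only an illustration: the helper name `center_crop` and the $15 \times 15$ input size are assumptions made here, not values stated in this post.

```python
import numpy as np

def center_crop(feature, size=7):
    """Keep the central size x size window of a (C, H, W) feature map."""
    _, h, w = feature.shape
    top = (h - size) // 2
    left = (w - size) // 2
    return feature[:, top:top + size, left:left + size]

# Hypothetical 256-channel template feature; the 15 x 15 spatial size is an assumption.
template_feature = np.arange(256 * 15 * 15, dtype=np.float32).reshape(256, 15, 15)
cropped = center_crop(template_feature)
print(cropped.shape)  # (256, 7, 7)
```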

Box Adaptive Head

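In the Cls Module output, each location holds a two-dimensional foreground-background classification score; in the Reg Module output, each location holds a four-dimensional vector of position offsets. In formulas: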

$$
\begin{array}{l}
P_{w \times h \times 2}^{cls}=[\varphi(x)]_{cls} \star[\varphi(z)]_{cls} \\
P_{w \times h \times 4}^{reg}=[\varphi(x)]_{reg} \star[\varphi(z)]_{reg}
\end{array}
$$

where $\star$ denotes the convolution operation with $[\varphi(z)]_{cls}$ or $[\varphi(z)]_{reg}$ as the convolution kernel, $P^{cls}_{w \times h \times 2}$ denotes the classification map, and $P^{reg}_{w \times h \times 4}$ denotes the regression map.
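To make the $\star$ operation concrete, here is a minimal NumPy sketch in which the template feature is slid over the search feature as a kernel. It assumes a channel-wise (depth-wise) correlation, which the text above does not specify, and it omits the later projection to 2 or 4 output channels; the function name `depthwise_xcorr` and the toy shapes are illustrative only.

```python
import numpy as np

def depthwise_xcorr(search, kernel):
    """Slide each template channel over the matching search channel.

    search: (C, Hx, Wx), kernel: (C, Hz, Wz) -> (C, Hx-Hz+1, Wx-Wz+1)
    """
    c, hx, wx = search.shape
    _, hz, wz = kernel.shape
    out = np.zeros((c, hx - hz + 1, wx - wz + 1), dtype=search.dtype)
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            # Window of the search feature currently under the template kernel.
            window = search[:, i:i + hz, j:j + wz]
            out[:, i, j] = np.sum(window * kernel, axis=(1, 2))
    return out

# Toy inputs (assumed sizes): all-ones features make the result easy to check.
search_feature = np.ones((4, 9, 9), dtype=np.float32)
template_kernel = np.ones((4, 3, 3), dtype=np.float32)
response = depthwise_xcorr(search_feature, template_kernel)
print(response.shape)     # (4, 7, 7)
print(response[0, 0, 0])  # 9.0
```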

Each location $(i,j)$ on the classification map and regression map is mapped to the location $(p_i,p_j)$ in the input search region by:

$$
[p_i, p_j]=\left[\left\lfloor\frac{w_{im}}{2}\right\rfloor+\left(i-\left\lfloor\frac{w}{2}\right\rfloor\right) \times s,\ \left\lfloor\frac{h_{im}}{2}\right\rfloor+\left(j-\left\lfloor\frac{h}{2}\right\rfloor\right) \times s\right]
$$

where $w_{im}$ and $h_{im}$ represent the width and height of the input search patch, and $s$ represents the total stride of the network.
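The mapping is plain integer arithmetic, so a small Python sketch may help. The function name `map_to_search` and the example sizes (a $25 \times 25$ map, a $255 \times 255$ search patch, total stride 8) are assumptions chosen for illustration; this post does not state them.

```python
def map_to_search(i, j, w, h, w_im, h_im, s):
    """Map location (i, j) on a w x h output map to search-patch coordinates."""
    p_i = w_im // 2 + (i - w // 2) * s
    p_j = h_im // 2 + (j - h // 2) * s
    return p_i, p_j

# Assumed sizes: 25 x 25 map, 255 x 255 search patch, total stride 8.
center = map_to_search(12, 12, 25, 25, 255, 255, 8)
corner = map_to_search(0, 0, 25, 25, 255, 255, 8)
print(center)  # (127, 127)
print(corner)  # (31, 31)
```

With these sizes the central map location lands on the center of the search patch, and each step on the map moves $s$ pixels in the patch.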

Multi-level Prediction

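Features from shallow layers help localize the target more precisely, while features from deep layers carry rich semantic information and make the model more robust to changes in target appearance.

The paper therefore makes a prediction from each of the last three stages of ResNet-50 and then "fuses" the predictions with weights: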

$$
P_{w \times h \times 2}^{cls-all}=\sum_{l=3}^{5} \alpha_{l} P_{l}^{cls}
$$

$$
P_{w \times h \times 4}^{reg-all}=\sum_{l=3}^{5} \beta_{l} P_{l}^{reg}
$$

where $\alpha_l$ and $\beta_l$ are the weights corresponding to each map and are optimized together with the network.
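The fusion is a weighted sum over the three levels. The sketch below uses fixed example weights purely for illustration; in the paper the weights are learned with the network, and how they are parameterized is not described in this post. The name `fuse_levels` is hypothetical.

```python
import numpy as np

def fuse_levels(maps, weights):
    """Weighted sum of per-level maps that all share the same shape."""
    stacked = np.stack(maps, axis=0)  # (L, w, h, k)
    weights = np.asarray(weights, dtype=stacked.dtype)
    # Reshape weights to (L, 1, 1, ...) so they broadcast over each map.
    weights = weights.reshape((-1,) + (1,) * (stacked.ndim - 1))
    return np.sum(weights * stacked, axis=0)

# Three toy classification maps (levels 3, 4, 5) with constant values 1, 2, 3.
cls_maps = [np.full((25, 25, 2), v, dtype=np.float64) for v in (1.0, 2.0, 3.0)]
alpha = [0.2, 0.3, 0.5]  # example weights only, not learned values
cls_all = fuse_levels(cls_maps, alpha)
print(cls_all.shape)  # (25, 25, 2)
```

Every entry of `cls_all` here equals 0.2 * 1 + 0.3 * 2 + 0.5 * 3 = 2.3. The same function would apply to the regression maps with the $\beta_l$ weights.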

Ground-truth and Loss

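In this section the authors mainly borrow practices from anchor-free object detectors. For more background, see the article 目标检测：Anchor-Free时代 (Object Detection: The Anchor-Free Era).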

The authors denote the width, height, top-left corner, center, and bottom-right corner of the ground-truth bounding box by $g_w$, $g_h$, $(g_{x_1},g_{y_1})$, $(g_{x_c},g_{y_c})$, and $(g_{x_2},g_{y_2})$, respectively.

They then define two ellipses, $E_1$ and $E_2$:

$$
E_1:\quad \frac{\left(p_{i}-g_{x_{c}}\right)^{2}}{\left(\frac{g_{w}}{2}\right)^{2}}+\frac{\left(p_{j}-g_{y_{c}}\right)^{2}}{\left(\frac{g_{h}}{2}\right)^{2}}=1
$$

$$
E_2:\quad \frac{\left(p_{i}-g_{x_{c}}\right)^{2}}{\left(\frac{g_{w}}{4}\right)^{2}}+\frac{\left(p_{j}-g_{y_{c}}\right)^{2}}{\left(\frac{g_{h}}{4}\right)^{2}}=1
$$

If a location $(p_i,p_j)$ falls inside $E_2$, it is labeled as a positive sample; if it falls outside $E_1$, it is labeled as a negative sample; if it falls between $E_1$ and $E_2$, it is ignored.
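The labeling rule can be sketched with NumPy as follows. The encoding (1 for positive, 0 for negative, -1 for ignored) and the handling of points exactly on an ellipse boundary are assumptions made for this example, as is the name `ellipse_labels`.

```python
import numpy as np

def ellipse_labels(p_i, p_j, g_xc, g_yc, g_w, g_h):
    """Return 1 (positive), 0 (negative) or -1 (ignored) for each location."""
    dx2 = (np.asarray(p_i, dtype=float) - g_xc) ** 2
    dy2 = (np.asarray(p_j, dtype=float) - g_yc) ** 2
    e1 = dx2 / (g_w / 2) ** 2 + dy2 / (g_h / 2) ** 2  # > 1 means outside E1
    e2 = dx2 / (g_w / 4) ** 2 + dy2 / (g_h / 4) ** 2  # < 1 means inside E2
    labels = np.full(e1.shape, -1, dtype=int)         # default: ignored
    labels[e1 > 1] = 0                                # outside E1: negative
    labels[e2 < 1] = 1                                # inside E2: positive
    return labels

# Toy box centered at (100, 100) with width 40 and height 20.
xs = np.array([100, 115, 130])
ys = np.array([100, 100, 100])
labels = ellipse_labels(xs, ys, g_xc=100, g_yc=100, g_w=40, g_h=20)
print(labels)  # [ 1 -1  0]
```

The three test points sit at the box center (positive), between the two ellipses (ignored), and outside the larger ellipse (negative).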

The bbox regression labels are computed as:

$$
\begin{array}{l}
d_{l}=p_{i}-g_{x_{1}} \\
d_{t}=p_{j}-g_{y_{1}} \\
d_{r}=g_{x_{2}}-p_{i} \\
d_{b}=g_{y_{2}}-p_{j}
\end{array}
$$

where $d_l$, $d_t$, $d_r$, $d_b$ are the distances from the location to the four sides of the bounding box.
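As a worked example of these four distances, here is a short Python sketch. The function name `regression_targets` and the numbers are made up for illustration.

```python
def regression_targets(p_i, p_j, g_x1, g_y1, g_x2, g_y2):
    """Distances from location (p_i, p_j) to the left, top, right, bottom sides."""
    d_l = p_i - g_x1
    d_t = p_j - g_y1
    d_r = g_x2 - p_i
    d_b = g_y2 - p_j
    return d_l, d_t, d_r, d_b

# Toy ground-truth box (80, 90) to (120, 110) and a location inside it.
targets = regression_targets(105, 98, 80, 90, 120, 110)
print(targets)  # (25, 8, 15, 12)
```

Note that $d_l + d_r$ recovers the box width (40) and $d_t + d_b$ the box height (20), whichever location inside the box is used.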

The final loss function has two parts, a cross-entropy loss for classification and an IoU loss for the regressed box:

$$
L=\lambda_{1} L_{cls}+\lambda_{2} L_{reg}
$$

$$
L_{IoU}=1-IoU
$$
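Because both the predicted and the ground-truth box are expressed as distances from the same location, the IoU can be computed directly from the two distance vectors. The sketch below is one possible way to do this, assuming all distances are non-negative (the location lies inside both boxes); the name `iou_loss` and the small `eps` term are choices made for this example, not details from the paper.

```python
def iou_loss(pred, target, eps=1e-9):
    """1 - IoU for two boxes given as (d_l, d_t, d_r, d_b) from one location."""
    pl, pt, pr, pb = pred
    tl, tt, tr, tb = target
    pred_area = (pl + pr) * (pt + pb)
    target_area = (tl + tr) * (tt + tb)
    # Both boxes contain the location, so the overlap extends min(...) to each side.
    w_inter = min(pl, tl) + min(pr, tr)
    h_inter = min(pt, tt) + min(pb, tb)
    inter = w_inter * h_inter
    union = pred_area + target_area - inter
    return 1.0 - inter / (union + eps)  # eps guards against a zero-area union

loss_same = iou_loss((25, 8, 15, 12), (25, 8, 15, 12))
loss_small = iou_loss((10, 10, 10, 10), (20, 20, 20, 20))
```

For identical boxes the loss is essentially 0. In the second case the prediction covers a quarter of the target, so the IoU is 0.25 and the loss is about 0.75.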

Training and Inference

The model is trained on the ImageNet VID, YouTube-BoundingBoxes, COCO, ImageNet DET, GOT10k, and LaSOT datasets.

At inference time, the target's position is computed as:

$$
\begin{array}{l}
p_{x_{1}}=p_{i}-d_{l}^{reg} \\
p_{y_{1}}=p_{j}-d_{t}^{reg} \\
p_{x_{2}}=p_{i}+d_{r}^{reg} \\
p_{y_{2}}=p_{j}+d_{b}^{reg}
\end{array}
$$

where $d_{l}^{reg}$, $d_{t}^{reg}$, $d_{r}^{reg}$ and $d_{b}^{reg}$ denote the prediction values of the regression map, and $\left(p_{x_{1}}, p_{y_{1}}\right)$ and $\left(p_{x_{2}}, p_{y_{2}}\right)$ are the top-left corner and bottom-right corner of the prediction box.
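Decoding is the inverse of the label computation: the predicted distances are subtracted from or added to the location. A minimal Python sketch follows, with a hypothetical function name `decode_box` and made-up numbers; how the best location is selected from the classification map is not covered here.

```python
def decode_box(p_i, p_j, d_l, d_t, d_r, d_b):
    """Turn a location and its four predicted distances into corner coordinates."""
    p_x1 = p_i - d_l
    p_y1 = p_j - d_t
    p_x2 = p_i + d_r
    p_y2 = p_j + d_b
    return p_x1, p_y1, p_x2, p_y2

# Toy prediction at location (105, 98).
box = decode_box(105, 98, 25, 8, 15, 12)
print(box)  # (80, 90, 120, 110)
```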

Experimental Results

[Figure: experimental results]

Summary

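Overall, the main highlight of this paper is predicting the bbox by regression, which removes the complicated hyperparameter settings that SiamRPN required. The paper also clearly borrows from object detection algorithms, so detection papers are well worth reading and learning from!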

Source: https://blog.csdn.net/qq_39621037/article/details/115675588
