[Paper Notes] VL-BERT: PRE-TRAINING OF GENERIC VISUAL-LINGUISTIC REPRESENTATIONS

2021-12-28 16:04:15


For tasks at the intersection of vision and language, there lacks such pre-trained generic feature representations.

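Motivation: this paper is close in spirit to the "unified" line of work. It aims to train a generic representation model that can adapt to a wide range of downstream tasks.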

Introduction

To better exploit the generic representation, we pre-train VL-BERT at both large visual-linguistic corpus and text-only datasets. The pre-training loss on the visual-linguistic corpus is incurred via predicting randomly masked words or RoIs. Such pre-training sharpens the capability of VL-BERT in aggregating and aligning visual-linguistic clues. While the loss on the text-only corpus is of the standard MLM loss in BERT, improving the generalization on long and complex sentences.

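This paper is very similar to the original BERT, and there are many comparable works, so I have left a fair amount of its content out of these notes.

One thing worth mentioning: the pre-training corpus contains not only bimodal (image-text) data but also text-only data. The text-only data is there to improve the model's handling of long, complex sentences.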


Related Work

The authors of ViLBERT claim that such two-stream design is superior than a single-stream unified model.

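The paper draws a line between two classes of models here:

Two-in-one models such as LXMERT are called two-stream.
Models like the one in this paper are called single-stream unified.
The authors argue that the single-stream unified design has more freedom: placing no restriction on the scope or form of attention is the better choice.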


There are three noticeable differences between VL-BERT and other concurrent works in pre-training.

(1) We found the task of Sentence-Image Relationship Prediction used in all of the other concurrent works is of no help in pre-training visual-linguistic representations.
(2) We pre-train VL-BERT on both visual-linguistic and text-only datasets.
(3) In VL-BERT, the parameters of Fast R-CNN, deriving the visual features, are also updated. To avoid visual clue leakage in the pre-training task of Masked RoI Classification with Linguistic Clues, the masking operation is conducted on the input raw pixels, other than the feature maps produced by layers of convolution.

Many of these choices are the exact opposite of what the few papers I have read so far do:

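The Sentence-Image Relationship Prediction pre-training task is dropped, on the grounds that it has no practical benefit. (The other papers presumably ran ablations on it, though.)
The purpose of the text-only data is easy to understand.
The parameters of the object detection network, Fast R-CNN, are also updated. (In quite a few other papers this step is data preprocessing and does not take part in training.)
When masking the image, the features are not masked; instead, the corresponding pixel region of the original image is set to zero. (This is the exact opposite of what LXMERT does.)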
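
To make the last point concrete, here is a minimal sketch of pixel-level masking in plain Python. The function name `mask_roi_pixels`, the grayscale nested-list image, and the half-open `(x1, y1, x2, y2)` box convention are my own assumptions for illustration, not the paper's code. As I understand it, zeroing the raw pixels before the detector runs removes the region's content from every feature computed afterwards, whereas masking only that RoI's feature could leave the same pixels visible through overlapping RoIs.

```python
def mask_roi_pixels(image, box):
    """Return a copy of `image` with the pixels inside `box` set to zero.

    image: list of rows, each row a list of pixel values (grayscale, assumed).
    box:   (x1, y1, x2, y2) with x2 and y2 exclusive (assumed convention).
    """
    x1, y1, x2, y2 = box
    masked = [row[:] for row in image]  # copy so the input stays untouched
    for y in range(y1, y2):
        for x in range(x1, x2):
            masked[y][x] = 0
    return masked


image = [[1, 1, 1],
         [1, 1, 1],
         [1, 1, 1]]
masked_image = mask_roi_pixels(image, (1, 0, 3, 2))
# masked_image -> [[1, 0, 0], [1, 0, 0], [1, 1, 1]]
```

The masked image, not the original, would then be passed to the detector to extract RoI features.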

VL-BERT

[Image]

It is worth noting that the input formats vary for different visual-linguistic tasks (e.g., <Caption,Image> for image captioning, and <Question, Answer, Image> for VQA and VCR ).

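Note that the input differs by task, so the way the elements are arranged changes slightly as well. Thanks to the Transformer's insensitivity to positional information, the arrangement has little effect, and setting up the Position Embedding and Segment Embedding properly is enough to handle it.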
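
A rough sketch of how one input could be laid out for different formats. The helper `build_input` is hypothetical, and the layout (`[CLS]`, text segments each closed by `[SEP]`, then `[IMG]` elements, then `[END]`; zero-based positions; `[END]` placed in segment C) is my reading of the paper's figure rather than the official code.

```python
def build_input(text_segments, num_rois):
    """Lay out tokens, segment ids, and position ids for one example.

    text_segments: one or two token lists, e.g. [caption] or [question, answer].
    num_rois:      number of image regions appended after the text.
    """
    if not 1 <= len(text_segments) <= 2:
        raise ValueError("expected one or two text segments")

    tokens, segments, positions = ["[CLS]"], ["A"], [0]
    # Text: segment A for the first piece, B for the second; positions count up.
    for seg_name, seg_tokens in zip("AB", text_segments):
        for tok in list(seg_tokens) + ["[SEP]"]:
            tokens.append(tok)
            segments.append(seg_name)
            positions.append(len(positions))

    # Image regions: segment C, all sharing one position (they have no order).
    roi_position = len(positions)
    for _ in range(num_rois):
        tokens.append("[IMG]")
        segments.append("C")
        positions.append(roi_position)

    tokens.append("[END]")
    segments.append("C")
    positions.append(roi_position + 1)
    return tokens, segments, positions


tokens, segments, positions = build_input([["what", "color", "?"], ["red"]], num_rois=2)
# tokens    -> ['[CLS]', 'what', 'color', '?', '[SEP]', 'red', '[SEP]', '[IMG]', '[IMG]', '[END]']
# segments  -> ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C']
# positions -> [0, 1, 2, 3, 4, 5, 6, 7, 7, 8]
```

A caption-only input is the same call with a single text segment, in which case segment B never appears.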


For each input element, its embedding feature is the summation of four types of embedding, namely, token embedding, visual feature embedding, segment embedding, and sequence position embedding.

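The four embeddings cover text, image, segment, and position:
Token Embedding: the text part is no different from BERT; every image element uses [IMG].
Visual Feature Embedding: for visual elements this is the RoI feature; for text elements it is the feature of the whole image. Note that this feature is concatenated with a geometry (location) feature and then passed through a linear transformation to give the final Visual Feature Embedding. The geometry feature is obtained by applying sine and cosine functions to a 4-d vector of box coordinates, following Relation Networks. Another key point: if the input is text-only data, the corresponding Visual Feature Embedding is a learnable embedding.
Segment Embedding: three types, A, B, and C. A and B mark text and C marks the image. A and B exist to tell two input text segments apart; normally A alone is enough.
Position Embedding: the text part is the same as in BERT; since visual elements have no natural order, they are all given the same position. (This does not seem entirely reasonable to me.)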
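
The list above can be sketched in plain Python as follows. This is a simplified illustration under my own assumptions: the function names, the tiny dimensions, the `wavelength` value, and the exact sinusoid formula are hypothetical stand-ins for the Relation Networks style encoding, not the paper's implementation.

```python
import math


def geometry_embedding(box, image_w, image_h, dim=16, wavelength=1000.0):
    """Encode a box as sines and cosines of its normalized corner coordinates."""
    if dim % 8 != 0:
        raise ValueError("dim must be a multiple of 8 (4 coords x sin/cos)")
    x1, y1, x2, y2 = box
    coords = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    n_freq = dim // 8
    out = []
    for c in coords:
        for k in range(n_freq):
            freq = wavelength ** (-k / n_freq)  # assumed frequency schedule
            out.append(math.sin(c * freq))
            out.append(math.cos(c * freq))
    return out


def linear(weights, vec):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def element_embedding(token_emb, appearance, geometry, proj, segment_emb, position_emb):
    """Sum the four embeddings for one input element."""
    # Visual Feature Embedding = linear(concat(appearance, geometry)).
    visual_emb = linear(proj, appearance + geometry)
    return [t + v + s + p
            for t, v, s, p in zip(token_emb, visual_emb, segment_emb, position_emb)]


geo = geometry_embedding((0, 0, 10, 10), image_w=10, image_h=10, dim=8)
proj = [[1, 0] + [0] * 8,   # toy 2 x 10 projection matrix
        [0, 1] + [0] * 8]
emb = element_embedding([1.0, 1.0], [2.0, 3.0], geo, proj, [10.0, 10.0], [100.0, 100.0])
```

For a text element, `appearance` would be the whole-image feature; for text-only data, the entire `visual_emb` would be replaced by a learnable vector.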


Task #1: Masked Language Modeling with Visual Clues
Task #2: Masked RoI Classification with Linguistic Clues

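Two pre-training tasks:

MLM: only the text is masked and then predicted. (This is very widely used, so I won't go into it.)
MRC: only image regions are masked, and the model predicts their class. The masking is applied to the original image, and the classification result from Fast R-CNN is used as the ground truth.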
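
A minimal sketch of the two objectives, assuming a BERT-style 15% masking probability and a softmax classifier over object categories. The helper names and the toy logits are made up for illustration; the only point is that MLM predicts the hidden word while MRC is scored against the detector's predicted category.

```python
import math
import random


def choose_masked(num_elements, prob, rng):
    """Pick indices to mask, each selected independently with probability `prob`."""
    return [i for i in range(num_elements) if rng.random() < prob]


def mask_tokens(tokens, masked_indices):
    """Replace the selected words with [MASK] (MLM input)."""
    masked = set(masked_indices)
    return ["[MASK]" if i in masked else tok for i, tok in enumerate(tokens)]


def softmax_cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


rng = random.Random(0)

# MLM: hide some words; the model must recover them from text plus image.
words = ["kitten", "drinking", "from", "bottle"]
mlm_input = mask_tokens(words, choose_masked(len(words), prob=0.15, rng=rng))

# MRC: for a masked RoI, the target is the category the detector predicted.
roi_logits = [2.0, 0.5, -1.0]   # toy classifier output for one masked RoI
detector_label = 0              # category index from the detector (assumed)
mrc_loss = softmax_cross_entropy(roi_logits, detector_label)
```

The same `choose_masked` helper can select which RoIs to mask; those regions would then have their pixels zeroed in the raw image before feature extraction.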

Experiments

VCR Task

[Image]

VQA Task

[Image]

REFERRING EXPRESSION COMPREHENSION

[Image]

Ablation Study

[Image]

Source: https://blog.csdn.net/gjh1716718326/article/details/122190180