BERT-based Vision-Language Multimodal Networks

Multimodal tasks

VQA
A question about a given image is posed in natural language, and the model must answer it.
Image-text retrieval
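Cross-modal image-text retrieval: given a language description, select the matching image from a set of candidates, or vice versa, given an image, select the matching text. Datasets include MSCOCO and Flickr30K.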
VCR, Visual Commonsense Reasoning
I do not fully understand this task yet, so the description is quoted here for now.
Given an image, the VCR task presents two problems, visual question answering (Q→A) and answer justification (QA→R), both posed as multiple-choice problems. The holistic setting (Q→AR) requires both the chosen answer and then the chosen rationale to be correct. The Visual Commonsense Reasoning (VCR) dataset consists of 290k multiple-choice QA problems derived from 110k movie scenes. Different from the VQA dataset, VCR integrates object tags into the language, providing direct grounding supervision, and explicitly excludes referring expressions.
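To make the three VCR settings concrete, here is a minimal sketch of how the accuracies could be computed. The function name `vcr_accuracies` and the list-of-indices input format are assumptions for illustration, not the official evaluation script. Q→AR counts a problem as correct only when both the answer and the rationale are right.

```python
from typing import List, Tuple


def vcr_accuracies(
    pred_answers: List[int],
    pred_rationales: List[int],
    gold_answers: List[int],
    gold_rationales: List[int],
) -> Tuple[float, float, float]:
    """Return (Q->A, QA->R, Q->AR) accuracy over a set of multiple-choice problems.

    Each list holds one chosen (or correct) option index per problem.
    This helper is hypothetical; it only illustrates the definitions.
    """
    n = len(gold_answers)
    if n == 0 or not (len(pred_answers) == len(pred_rationales) == len(gold_rationales) == n):
        raise ValueError("all inputs must be non-empty and of equal length")

    answer_ok = [p == g for p, g in zip(pred_answers, gold_answers)]
    rationale_ok = [p == g for p, g in zip(pred_rationales, gold_rationales)]
    # Holistic setting: answer AND rationale must both be correct.
    both_ok = [a and r for a, r in zip(answer_ok, rationale_ok)]

    return sum(answer_ok) / n, sum(rationale_ok) / n, sum(both_ok) / n


# Four toy problems: 3 answers right, 2 rationales right, 1 problem fully right.
print(vcr_accuracies([0, 2, 1, 3], [1, 0, 2, 2], [0, 2, 3, 3], [1, 1, 2, 0]))
# -> (0.75, 0.5, 0.25)
```

Note that Q→AR can never exceed the smaller of the other two scores, which is why it is the strictest of the three numbers.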

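Text. Same as BERT: tokenize, then do an embedding lookup (emb_lookup).
Image. Use Faster R-CNN to select a number of ROIs (Regions of Interest); each ROI comes with a bounding box and a feature vector. These play the same roles as a token and its position on the text side, so they can be fed into the rest of the network.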
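The analogy between the two input types can be sketched in a few lines of plain Python. Everything here is a toy assumption: the sizes `HIDDEN` and `FEAT_DIM`, the tiny `VOCAB`, the random weights, and the helper names. Real models use learned parameters and detector features that are far larger. The point is only that both modalities end up as sequences of vectors of the same width.

```python
import random
from typing import List

random.seed(0)
HIDDEN = 4    # toy hidden size (assumption)
FEAT_DIM = 6  # toy ROI feature size (assumption)


def rand_matrix(rows: int, cols: int) -> List[List[float]]:
    """Random weights standing in for learned parameters."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear(x: List[float], w: List[List[float]]) -> List[float]:
    """Project x (length in_dim) with w (in_dim x out_dim)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def add(a: List[float], b: List[float]) -> List[float]:
    return [x + y for x, y in zip(a, b)]


# Text side: token embedding lookup plus position embedding, as in BERT.
VOCAB = {"[CLS]": 0, "a": 1, "dog": 2, "[SEP]": 3}
token_table = rand_matrix(len(VOCAB), HIDDEN)
position_table = rand_matrix(8, HIDDEN)


def embed_text(tokens: List[str]) -> List[List[float]]:
    return [add(token_table[VOCAB[t]], position_table[i]) for i, t in enumerate(tokens)]


# Image side: the ROI feature plays the role of the token,
# the box coordinates play the role of the position.
w_feat = rand_matrix(FEAT_DIM, HIDDEN)
w_box = rand_matrix(4, HIDDEN)


def embed_regions(features: List[List[float]], boxes: List[List[float]]) -> List[List[float]]:
    return [add(linear(f, w_feat), linear(b, w_box)) for f, b in zip(features, boxes)]


text_emb = embed_text(["[CLS]", "a", "dog", "[SEP]"])
roi_features = rand_matrix(3, FEAT_DIM)  # stand-in for detector output
roi_boxes = [[0.1, 0.2, 0.5, 0.9], [0.4, 0.1, 0.8, 0.6], [0.0, 0.0, 1.0, 1.0]]  # normalized x1, y1, x2, y2
region_emb = embed_regions(roi_features, roi_boxes)
print(len(text_emb), len(region_emb), len(region_emb[0]))
```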

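All of these works build multimodal models on top of BERT. They fuse the vision and text modalities in one of two ways: single-stream or two-stream.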
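The difference between the two fusion styles can be illustrated with a stripped-down attention function. This is a rough sketch under strong assumptions: no learned projections, a single head, a single layer, and no residuals or feed-forward blocks. The two-stream layout shown (separate self-attention per modality, then cross-attention in both directions) is one common arrangement, not necessarily the exact design of any particular paper.

```python
import math
from typing import List, Tuple

Vec = List[float]
Seq = List[Vec]


def attention(queries: Seq, keys: Seq, values: Seq) -> Seq:
    """Scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out


def single_stream(text: Seq, vision: Seq) -> Seq:
    """Concatenate both modalities and run one joint self-attention."""
    joint = text + vision
    return attention(joint, joint, joint)


def two_stream(text: Seq, vision: Seq) -> Tuple[Seq, Seq]:
    """Encode each modality separately, then exchange information via cross-attention."""
    text_h = attention(text, text, text)
    vision_h = attention(vision, vision, vision)
    text_out = attention(text_h, vision_h, vision_h)    # text attends to vision
    vision_out = attention(vision_h, text_h, text_h)    # vision attends to text
    return text_out, vision_out


text = [[1.0, 0.0], [0.0, 1.0]]
vision = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]
joint_out = single_stream(text, vision)
text_out, vision_out = two_stream(text, vision)
print(len(joint_out), len(text_out), len(vision_out))
```

In the single-stream case every token can attend to every region from the first layer on; in the two-stream case the modalities only meet in the cross-attention step.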

A representative of the single-stream approach.
Note that on the vision side, the token and position inputs are the same for every region; they are only placeholders.
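A small sketch of what such placeholder inputs might look like. The token name `[IMG]`, the shared position index, and the `build_inputs` helper are all assumptions made for illustration; the note above only says that the vision-side token and position are identical across regions.

```python
from typing import Dict, List

IMG_TOKEN = "[IMG]"  # hypothetical placeholder token shared by all regions
IMG_POSITION = 0     # hypothetical shared position index, chosen arbitrarily


def build_inputs(text_tokens: List[str], num_regions: int) -> Dict[str, list]:
    """Build the token and position lists for a text sequence followed by image regions."""
    tokens = text_tokens + [IMG_TOKEN] * num_regions
    # Text positions count up as in BERT; every region reuses one position.
    positions = list(range(len(text_tokens))) + [IMG_POSITION] * num_regions
    return {"tokens": tokens, "positions": positions}


inputs = build_inputs(["[CLS]", "a", "dog", "[SEP]"], num_regions=3)
print(inputs["tokens"])
print(inputs["positions"])
```

Since these two inputs carry no per-region information, what distinguishes one region from another presumably has to come from the ROI feature vector and its box.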

Image-text retrieval and VCR.

A representative of the two-stream approach.

It uses as many as five pre-training tasks.

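Whether single-stream or two-stream works better is still unclear. Papers advocating single-stream report comparisons in which single-stream beats two-stream, yet two-stream papers report comparisons with single-stream models in which two-stream wins. Which one is better, or whether the answer depends on the specific task, will need more rigorous comparative experiments to settle.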

PaperWeekly WeChat official account, "Applications of BERT in the Multimodal Domain" (BERT在多模态领域中的应用)
Paper: Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
Paper: LXMERT: Learning Cross-Modality Encoder Representations from Transformers
LXMERT source code link