Overcoming Language Priors in Visual Question
Answering with Adversarial Regularization

Modern Visual Question Answering (VQA) models have been shown to rely
heavily on superficial correlations between question and answer words learned
during training – e.g. overwhelmingly reporting the type of room as kitchen or
the sport being played as tennis, irrespective of the image. Most alarmingly, this
shortcoming is often not well reflected during evaluation because the same strong
priors exist in test distributions; however, a VQA system that fails to ground
questions in image content would likely perform poorly in real-world settings.
In this work, we present a novel regularization scheme for VQA that reduces
this effect. We introduce a question-only model that takes as input the question
encoding from the VQA model and must leverage language biases in order to
succeed. We then pose training as an adversarial game between the VQA model
and this question-only adversary – discouraging the VQA model from capturing
language biases in its question encoding. Further, we leverage this question-only
model to estimate the increase in model confidence after considering the image,
which we maximize explicitly to encourage visual grounding. Our approach is a
model agnostic training procedure and simple to implement. We show empirically
that it can improve performance significantly on a bias-sensitive split of the VQA
dataset for multiple base models – achieving state-of-the-art on this task. Further,
on standard VQA tasks, our approach shows significantly less drop in accuracy
compared to existing bias-reducing VQA models.
The core problem addressed here is strong language bias: for many questions, a model can answer correctly from the question alone, without ever looking at the image. The paper introduces a separate question-only model over the question encoding and uses it as an adversary during training, deliberately driving this question-only model's performance down so that the question encoding avoids absorbing dataset-wide language biases.
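The adversarial objective described above can be sketched as a single scalar loss. This is a minimal illustration, not the paper's exact formulation: `lambda_q` and `lambda_h` are assumed weighting hyperparameters, and the gradient-reversal layer from the paper is approximated here simply by subtracting the question-only loss.

```python
import math

def cross_entropy(probs, target):
    # Negative log-likelihood of the ground-truth answer index.
    return -math.log(probs[target])

def confidence_gain(vqa_probs, qonly_probs, target):
    # How much more confident the full image+question model is in the
    # correct answer than the question-only adversary; the paper
    # maximizes a quantity of this kind to encourage visual grounding.
    return math.log(vqa_probs[target]) - math.log(qonly_probs[target])

def adversarial_vqa_loss(vqa_probs, qonly_probs, target,
                         lambda_q=1.0, lambda_h=1.0):
    # The VQA model minimizes its own answer loss; the question-only
    # branch's loss enters with a negative sign (standing in for
    # gradient reversal), and the confidence gain is maximized,
    # hence also subtracted.
    l_vqa = cross_entropy(vqa_probs, target)
    l_q = cross_entropy(qonly_probs, target)
    gain = confidence_gain(vqa_probs, qonly_probs, target)
    return l_vqa - lambda_q * l_q - lambda_h * gain
```

When the full model is more confident in the right answer than the question-only adversary, the gain term pushes the total loss down, rewarding answers that actually depend on the image.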
NIPS 2018 Visual Question Answering reading guide
Chain of Reasoning for Visual Question Answering
Reasoning plays an essential role in Visual Question Answering (VQA). Multi-step
and dynamic reasoning is often necessary for answering complex questions. For
example, a question “What is placed next to the bus on the right of the picture?”
talks about a compound object “bus on the right,” which is generated by the
relation <bus, on the right of, picture>. Furthermore, a new relation including this
compound object <sign, next to, bus on the right> is then required to infer the
answer. However, previous methods support either one-step or static reasoning,
without updating relations or generating compound objects. This paper proposes
a novel reasoning model for addressing these problems. A chain of reasoning
(CoR) is constructed for supporting multi-step and dynamic reasoning on changed
relations and objects. In detail, iteratively, the relational reasoning operations form
new relations between objects, and the object refining operations generate new
compound objects from relations. We achieve new state-of-the-art results on four
publicly available datasets. The visualization of the chain of reasoning illustrates
how CoR progressively generates new compound objects that lead, step by step, to the answer.
Dynamic multi-step reasoning, with the steps linked together into a chain.
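The alternating relation-forming and object-refining steps can be illustrated with a toy symbolic version of the paper's running example. This is only a sketch of the data flow, not the neural model: in CoR, relations and compound objects are learned feature representations, not strings.

```python
def compound(subject, predicate, obj):
    # Object refining: collapse a relation triple into a new compound
    # object, e.g. <bus, on the right of, picture> -> "bus on the right
    # of picture", which later relations can refer to as a single unit.
    return f"{subject} {predicate} {obj}"

def chain_of_reasoning(steps):
    # Each step forms a relation over the objects available so far and
    # refines it into a compound object that feeds the next step; None
    # as the object slot means "use the compound from the previous step".
    result = None
    for subject, predicate, obj in steps:
        obj = result if obj is None else obj
        result = compound(subject, predicate, obj)
    return result
```

For the question "What is placed next to the bus on the right of the picture?", two chained steps reproduce the inference path described in the abstract.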
Learning to Specialize with Knowledge Distillation for Visual Question Answering
Visual Question Answering (VQA) is a notoriously challenging problem because it
involves various heterogeneous tasks defined by questions within a unified framework.
Learning specialized models for individual types of tasks is intuitively
attractive but surprisingly difficult; it is not straightforward to outperform a naïve
independent-ensemble approach. We present a principled algorithm to learn specialized
models with knowledge distillation under a multiple choice learning (MCL)
framework, where training examples are assigned dynamically to a subset of models
for updating network parameters. The assigned and non-assigned models are
learned to predict ground-truth answers and imitate their own base models before
specialization, respectively. Our approach alleviates the limitation of data
deficiency in existing MCL frameworks, and allows each model to learn its own
specialized expertise without forgetting general knowledge. The proposed framework
is model-agnostic and applicable to any tasks other than VQA, e.g., image
classification with a large number of labels but few per-class examples, which
is known to be difficult under existing MCL schemes. Our experimental results
indeed demonstrate that our method outperforms other baselines for VQA and
image classification.
The paper argues that earlier attempts to split work across multiple specialized models underperform because each model sees too little data; it uses knowledge distillation to compensate for this data deficiency, which is why it achieves better results.
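The assignment rule described in the abstract can be sketched as follows. This is a simplified illustration under assumed conventions, not the paper's training code: `k` is the number of models assigned per example, and the non-assigned models pay a KL-style imitation loss toward their own pre-specialization base model instead of the ground-truth loss.

```python
import math

def mcl_distill_losses(task_losses, model_probs, base_probs, k=1):
    # Multiple choice learning with distillation (sketch): the k models
    # with the lowest task loss on this example are "assigned" to it and
    # keep their ground-truth loss; every other model instead imitates
    # its frozen base model, so it specializes without forgetting.
    ranked = sorted(range(len(task_losses)), key=lambda i: task_losses[i])
    assigned = set(ranked[:k])
    losses = []
    for i, task_loss in enumerate(task_losses):
        if i in assigned:
            losses.append(task_loss)
        else:
            # KL divergence from the base model's predictive distribution
            # to the current model's, used as the imitation loss.
            imit = sum(b * math.log(b / max(p, 1e-12))
                       for b, p in zip(base_probs[i], model_probs[i])
                       if b > 0)
            losses.append(imit)
    return losses, assigned
```

The distillation term is what distinguishes this from plain MCL: non-assigned models still receive a useful learning signal on every example, which mitigates the data-deficiency problem the commentary mentions.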
Learning Conditioned Graph Structures for Interpretable Visual Question Answering
Visual Question Answering is a challenging problem requiring a combination of
concepts from Computer Vision and Natural Language Processing. Most existing
approaches use a two-stream strategy, computing image and question features
that are subsequently merged using a variety of techniques. Nonetheless, very few
rely on higher level image representations, which can capture semantic and spatial
relationships. In this paper, we propose a novel graph-based approach for Visual
Question Answering. Our method combines a graph learner module, which learns
a question specific graph representation of the input image, with the recent concept
of graph convolutions, aiming to learn image representations that capture question
specific interactions. We test our approach on the VQA v2 dataset using a simple
baseline architecture enhanced by the proposed graph learner module. We obtain
promising results with 66.18% accuracy and demonstrate the interpretability of the
proposed method. Code can be found at github.com/aimbrain/vqa-project.
How is the graph structure brought in? Detected object regions become the graph's nodes, and edge weights are computed conditioned on the question. A nice idea.
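The question-conditioned edges can be sketched as follows. This is an illustrative simplification, not the paper's graph learner: the "question-modulated inner product" score below is an assumed stand-in for the learned pairwise scoring function.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def question_conditioned_adjacency(obj_feats, q_feat):
    # Nodes are detected object features; the edge weight between two
    # objects is a softmax over question-modulated similarities, so the
    # same image yields a different graph for a different question.
    n = len(obj_feats)

    def score(a, b):
        # Inner product of the two object features, reweighted
        # dimension-wise by the question feature (assumed form).
        return sum(fa * fb * fq for fa, fb, fq in zip(a, b, q_feat))

    adj = []
    for i in range(n):
        row = [score(obj_feats[i], obj_feats[j]) for j in range(n)]
        adj.append(softmax(row))
    return adj
```

Each row of the adjacency matrix is a distribution over neighbors, which is what a graph convolution layer would then aggregate over.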
Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding
We marry two powerful ideas: deep representation learning for visual recognition
and language understanding, and symbolic program execution for reasoning. Our
neural-symbolic visual question answering (NS-VQA) system first recovers a
structural scene representation from the image and a program trace from the
question. It then executes the program on the scene representation to obtain an
answer. Incorporating symbolic structure as prior knowledge offers three unique
advantages. First, executing programs on a symbolic space is more robust to long
program traces; our model can solve complex reasoning tasks better, achieving an
accuracy of 99.8% on the CLEVR dataset. Second, the model is more data- and
memory-efficient: it performs well after learning from a small amount of training
data; it can also encode an image into a compact representation, requiring less
storage than existing methods for offline question answering. Third, symbolic
program execution offers full transparency to the reasoning process; we are thus
able to interpret and diagnose each execution step.
Symbolic program execution for reasoning: logical inference is carried out by running small program fragments step by step. Personally, it feels somewhat like a decision tree for classification: all the cases are enumerated in advance and handled one by one.
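The "execute a program on a scene representation" idea can be made concrete with a toy interpreter. This is a sketch under assumed conventions, not NS-VQA's actual DSL: the scene is a list of attribute dicts, and `filter` / `count` / `query` are illustrative op names.

```python
def run_program(scene, program):
    # Execute a sequence of symbolic ops on a structural scene
    # representation; each op transforms the current state, which starts
    # as the full object set and may end as a count or attribute value.
    state = scene
    for op, arg in program:
        if op == "filter":
            # Keep only objects whose attribute matches the given value.
            key, value = arg
            state = [o for o in state if o.get(key) == value]
        elif op == "count":
            # Reduce the current object set to its size.
            state = len(state)
        elif op == "query":
            # Read one attribute of the (unique) remaining object;
            # unpacking fails loudly if the set is not a singleton.
            (obj,) = state
            state = obj[arg]
        else:
            raise ValueError(f"unknown op: {op}")
    return state
```

Because every step is an explicit, inspectable transformation of the state, this directly shows the transparency advantage the abstract describes: each execution step can be interpreted and diagnosed.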
