Weekly Literature Reading Report
2020/06/08-2020/06/12 week 1
Paper 1
Tensor Fusion Network for Multimodal Sentiment Analysis
Issues that need resolving
Multimodal fusion and representation
Key insight
This paper fuses three modalities via a three-fold Cartesian (outer) product, a method called the Tensor Fusion Network (TFN). The resulting 3D tensor explicitly contains unimodal, bimodal, and trimodal dynamics and is then fed to a fully connected deep neural network for prediction. The tensor not only preserves intra-modality information but also captures inter-modality interactions.
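The fusion step above can be sketched in a few lines: appending a constant 1 to each unimodal embedding before the outer product is what makes the tensor retain unimodal and bimodal terms alongside the trimodal ones. A minimal numpy sketch (embedding dimensions are illustrative, not the paper's):

```python
import numpy as np

# Hypothetical unimodal embeddings (dimensions chosen for illustration).
z_language = np.random.randn(4)
z_visual   = np.random.randn(3)
z_acoustic = np.random.randn(2)

# Append a constant 1 to each vector so the outer product keeps
# unimodal and bimodal terms alongside the trimodal interactions.
zl = np.concatenate([z_language, [1.0]])   # shape (5,)
zv = np.concatenate([z_visual,   [1.0]])   # shape (4,)
za = np.concatenate([z_acoustic, [1.0]])   # shape (3,)

# Three-fold Cartesian (outer) product -> 3D fusion tensor.
fusion = np.einsum('i,j,k->ijk', zl, zv, za)   # shape (5, 4, 3)

# Flatten and feed to a fully connected network (not shown here).
features = fusion.reshape(-1)                  # shape (60,)
```

Note that the slice `fusion[:, -1, -1]` recovers the language embedding (with its trailing 1), which is exactly how the tensor keeps intra-modality information.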
The net structure
Dataset and experiment result: CMU-MOSI(Multimodal Opinion Sentiment Intensity)
TFN outperforms the other baselines. When only a single modality is used, language outperforms the other modalities.
Paper 2
Learning Representations from Imperfect Time Series Data via Tensor Rank Regularization
Issues that need resolving
In practice, multimodal data is often imperfect due to noisy or unreliable modalities, missing entries, or noise corruption.
Key insight
The authors observe that high-dimensional multimodal time-series data often exhibit correlations across time and modalities, which lead to low-rank tensor representations. However, noise or missing values break these correlations and result in tensor representations of higher rank. The paper designs a model (T2FN) that learns such tensor representations and effectively regularizes their rank.
The net structure
The rank of an order-M tensor X (the CP rank) is the minimum number of rank-1 tensors that sum to it: rank(X) = min{ r : X = Σ_{i=1}^{r} w_i^(1) ⊗ ... ⊗ w_i^(M) }.
However, computing this rank is NP-hard for tensors of order >= 3; fortunately, there exist efficiently computable upper bounds.
This upper bound is used as a regularization term during training.
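The report does not reproduce the paper's exact bound. As one standard, efficiently computable stand-in (not necessarily the bound T2FN uses), the nuclear norm of a tensor's mode unfoldings is a common convex surrogate for rank, and it behaves as the paper describes: a clean low-rank tensor scores low, a noise-corrupted one scores higher.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization: move `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def nuclear_norm_surrogate(tensor):
    """Sum of singular values of each mode unfolding -- an efficiently
    computable convex surrogate for tensor rank.
    (Illustrative stand-in; not necessarily the exact bound used by T2FN.)"""
    return sum(np.linalg.norm(unfold(tensor, m), ord='nuc')
               for m in range(tensor.ndim))

# A rank-1 tensor vs. the same tensor corrupted by noise.
a, b, c = np.random.randn(5), np.random.randn(4), np.random.randn(3)
low_rank = np.einsum('i,j,k->ijk', a, b, c)
noisy = low_rank + 0.5 * np.random.randn(5, 4, 3)

# During training, such a surrogate would be added to the task loss
# as a regularizer, pushing the learned tensor toward low rank.
```

For a rank-1 tensor every unfolding is a rank-1 matrix, so each unfolding's nuclear norm equals the tensor's Frobenius norm; noise spreads energy across many singular values and inflates the surrogate.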
Dataset and experiment result
CMU-MOSI. The authors add two types of noise to the data: random noise and structured noise.
- The experiments show that the tensor's rank varies with the input noise: the more corrupted the input's structure, the higher the rank of the resulting tensor.
- T2FN outperforms other baselines on binary classification when the input is an imperfect time series.
- Noise is added at both training and testing time.
Issues still not resolved
What about other settings, such as multi-class classification or regression?
Inspiration
Mathematically, multimodal representation rests on high-dimensional data representation and fusion, so basic mathematical theory may help.
Paper 3
Multimodal Transformer for Unaligned Multimodal Language Sequences
Issues that need resolving
- inherent data non-alignment due to variable sampling rates for the sequences of each modality;
- long-range dependencies between elements across modalities.
Key insight
At the heart of the model is directional pairwise cross-modal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another.
Inspiration from the standard Transformer network (Vaswani et al., 2017).
Classical cross-modal alignment, on the other hand, can be expressed as a special case: a step-diagonal cross-modal attention matrix (i.e., monotonic attention).
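The directional mechanism above can be sketched as standard scaled dot-product attention where queries come from one modality and keys/values from another, so the two sequences may have different lengths and need no alignment. A minimal numpy sketch (projection matrices are random stand-ins for learned parameters; dimensions are illustrative):

```python
import numpy as np

def crossmodal_attention(x_q, x_kv, d_k=8, seed=0):
    """Directional cross-modal attention (minimal sketch): queries come from
    the target modality, keys/values from the source modality, so the two
    sequences may have different lengths (no alignment required)."""
    rng = np.random.default_rng(seed)
    # Hypothetical projection matrices; learned in the real model.
    Wq = rng.standard_normal((x_q.shape[1], d_k))
    Wk = rng.standard_normal((x_kv.shape[1], d_k))
    Wv = rng.standard_normal((x_kv.shape[1], d_k))
    Q, K, V = x_q @ Wq, x_kv @ Wk, x_kv @ Wv
    scores = Q @ K.T / np.sqrt(d_k)                 # (len_q, len_kv)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over source steps
    return weights @ V                              # (len_q, d_k)

# A 10-step language sequence attends to audio sampled at a
# different rate (25 steps) -- the unaligned case.
lang = np.random.randn(10, 300)
audio = np.random.randn(25, 74)
out = crossmodal_attention(lang, audio)             # shape (10, 8)
```

Each output step is a mixture over all source-modality time steps, which is how the model handles both non-alignment and long-range cross-modal dependencies.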
Dataset and experiment result
CMU-MOSI, CMU-MOSEI, IEMOCAP
Unaligned dataset: each modality is sampled at a different frequency.
Unresolved issue
All modalities are time series, what if some are static?
Paper 4
Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities
Issues that need resolving
Multimodal representation
Key insight
The paper proposes learning robust joint representations by translating between modalities, based on the key insight that translation from a source to a target modality provides a way of learning joint representations using only the source modality as input.
Translation from a source modality S to a target modality T yields an intermediate representation that captures the joint information of S and T.
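The cyclic-translation idea can be sketched with toy linear "translators" standing in for the paper's seq2seq models (all names and dimensions here are illustrative, not the paper's architecture): translate S to T, cycle back from the predicted T to S, train on both translation losses, and use the intermediate encoding as the joint representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear "translators" standing in for seq2seq models.
d_s, d_t, d_joint = 6, 4, 5
enc_st = rng.standard_normal((d_s, d_joint))   # source -> joint representation
dec_st = rng.standard_normal((d_joint, d_t))   # joint  -> target modality
enc_ts = rng.standard_normal((d_t, d_joint))   # predicted target -> joint
dec_ts = rng.standard_normal((d_joint, d_s))   # joint  -> source (cycle back)

x_s = rng.standard_normal(d_s)                 # source-modality input
x_t = rng.standard_normal(d_t)                 # target-modality ground truth

joint = x_s @ enc_st                 # intermediate joint representation
t_hat = joint @ dec_st               # forward translation S -> T
s_hat = (t_hat @ enc_ts) @ dec_ts    # cyclic translation back to S

# Training would minimize both translation losses; prediction reads
# off `joint`, so at test time only the source modality is required.
forward_loss = np.mean((t_hat - x_t) ** 2)
cycle_loss = np.mean((s_hat - x_s) ** 2)
```

The design point the sketch illustrates is robustness: since `joint` is computed from the source modality alone, the target modality can be missing at test time.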
Inspiration from
Seq2Seq models for unsupervised representation learning.
Network Structure
Dataset: CMU-MOSI, ICT-MMMO, YouTube