Weekly Literature Reading Report
2020/06/08-2020/06/12 week 1
Paper 1
Tensor Fusion Network for Multimodal Sentiment Analysis
Issues that need resolving
Multimodal fusion and representation
Key insight
This paper fuses three modalities via a three-fold Cartesian (outer) product, a method called the Tensor Fusion Network (TFN). The resulting 3D tensor explicitly contains unimodal, bimodal, and trimodal dynamics and is then fed to a fully connected deep neural network for prediction. The tensor not only preserves intra-modality information but also captures inter-modality interactions.
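The fusion step above can be sketched in a few lines: appending a constant 1 to each unimodal embedding before the outer product is what makes the tensor retain unimodal and bimodal terms alongside the trimodal ones. A minimal numpy sketch (embedding dimensions are illustrative, not the paper's):

```python
import numpy as np

# Hypothetical unimodal embeddings (dimensions chosen for illustration).
z_language = np.random.randn(4)
z_visual   = np.random.randn(3)
z_acoustic = np.random.randn(2)

# Append a constant 1 to each vector so the outer product keeps
# unimodal and bimodal terms alongside the trimodal interactions.
zl = np.concatenate([z_language, [1.0]])   # shape (5,)
zv = np.concatenate([z_visual,   [1.0]])   # shape (4,)
za = np.concatenate([z_acoustic, [1.0]])   # shape (3,)

# Three-fold Cartesian (outer) product -> 3D fusion tensor.
fusion = np.einsum('i,j,k->ijk', zl, zv, za)   # shape (5, 4, 3)

# Flatten and feed to a fully connected network (not shown here).
features = fusion.reshape(-1)                  # shape (60,)
```

Note that the slice `fusion[:, -1, -1]` recovers the language embedding (with its trailing 1), which is exactly how the tensor keeps intra-modality information.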
The net structure
Dataset and experiment result: CMU-MOSI(Multimodal Opinion Sentiment Intensity)
TFN outperforms the other baselines. When only a single modality is used, language outperforms the other modalities.
Paper 2
Learning Representations from Imperfect Time Series Data via Tensor Rank Regularization
Issues that need resolving
In practice, multimodal data is often imperfect due to noisy or unreliable modalities, missing entries, or noise corruption.
Key insight
The authors observe that high-dimensional multimodal time-series data often exhibit correlations across time and modalities, which lead to low-rank tensor representations. However, noise or missing values break these correlations and result in tensor representations of higher rank. The paper designs a model (T2FN) that learns such tensor representations and effectively regularizes their rank.
The net structure
The rank of an order-M tensor X (the CP rank) is the minimum number of rank-1 tensors that sum to it: rank(X) = min{ r : X = Σ_{i=1}^{r} w_i^(1) ⊗ ... ⊗ w_i^(M) }.
However, computing this rank is NP-hard for tensors of order >= 3; fortunately, there exist efficiently computable upper bounds.
This upper bound is used as a regularization term during training.
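The report does not reproduce the paper's exact bound. As one standard, efficiently computable stand-in (not necessarily the bound T2FN uses), the nuclear norm of a tensor's mode unfoldings is a common convex surrogate for rank, and it behaves as the paper describes: a clean low-rank tensor scores low, a noise-corrupted one scores higher.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization: move `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def nuclear_norm_surrogate(tensor):
    """Sum of singular values of each mode unfolding -- an efficiently
    computable convex surrogate for tensor rank.
    (Illustrative stand-in; not necessarily the exact bound used by T2FN.)"""
    return sum(np.linalg.norm(unfold(tensor, m), ord='nuc')
               for m in range(tensor.ndim))

# A rank-1 tensor vs. the same tensor corrupted by noise.
a, b, c = np.random.randn(5), np.random.randn(4), np.random.randn(3)
low_rank = np.einsum('i,j,k->ijk', a, b, c)
noisy = low_rank + 0.5 * np.random.randn(5, 4, 3)

# During training, such a surrogate would be added to the task loss
# as a regularizer, pushing the learned tensor toward low rank.
```

For a rank-1 tensor every unfolding is a rank-1 matrix, so each unfolding's nuclear norm equals the tensor's Frobenius norm; noise spreads energy across many singular values and inflates the surrogate.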
Dataset and experiment result
CMU-MOSI. The authors add two types of noise to the data: random noise and structured noise.
- The experiments show that the tensor's rank varies with the input noise: the more corrupted the input's structure, the higher the rank of the resulting tensor.
- T2FN outperforms other baselines on binary classification when the input is an imperfect time series.
- Noise is added at both training and testing time.
Issues still not resolved
What about other settings, such as multi-class classification or regression?
Inspiration
Mathematically, multimodal representation rests on high-dimensional data representation and fusion, so basic mathematical theory may help.
Paper 3
Multimodal Transformer for Unaligned Multimodal Language Sequences
Issues that need resolving
- inherent data non-alignment due to variable sampling rates for the sequences of each modality;
- long-range dependencies between elements across modalities.
Key insight
At the heart of the model is directional pairwise cross-modal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another.
Inspiration from the standard Transformer network (Vaswani et al., 2017).
Classical cross-modal alignment, on the other hand, can be expressed as a special case: a step-diagonal cross-modal attention matrix (i.e., monotonic attention).
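The directional mechanism above can be sketched as standard scaled dot-product attention where queries come from one modality and keys/values from another, so the two sequences may have different lengths and need no alignment. A minimal numpy sketch (projection matrices are random stand-ins for learned parameters; dimensions are illustrative):

```python
import numpy as np

def crossmodal_attention(x_q, x_kv, d_k=8, seed=0):
    """Directional cross-modal attention (minimal sketch): queries come from
    the target modality, keys/values from the source modality, so the two
    sequences may have different lengths (no alignment required)."""
    rng = np.random.default_rng(seed)
    # Hypothetical projection matrices; learned in the real model.
    Wq = rng.standard_normal((x_q.shape[1], d_k))
    Wk = rng.standard_normal((x_kv.shape[1], d_k))
    Wv = rng.standard_normal((x_kv.shape[1], d_k))
    Q, K, V = x_q @ Wq, x_kv @ Wk, x_kv @ Wv
    scores = Q @ K.T / np.sqrt(d_k)                 # (len_q, len_kv)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over source steps
    return weights @ V                              # (len_q, d_k)

# A 10-step language sequence attends to audio sampled at a
# different rate (25 steps) -- the unaligned case.
lang = np.random.randn(10, 300)
audio = np.random.randn(25, 74)
out = crossmodal_attention(lang, audio)             # shape (10, 8)
```

Each output step is a mixture over all source-modality time steps, which is how the model handles both non-alignment and long-range cross-modal dependencies.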
Dataset and experiment result
CMU-MOSI, CMU-MOSEI, IEMOCAP
Unaligned dataset: each modality is sampled at a different frequency.
Unresolved issue
All modalities are time series, what if some are static?
Paper 4
Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities
Issues that need resolving
Multimodal representation
Key insight
The paper proposes learning robust joint representations by translating between modalities, based on the key insight that translation from a source to a target modality provides a way of learning joint representations using only the source modality as input.
Translation from a source modality S to a target modality T yields an intermediate representation that captures the joint information of S and T.
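The cyclic-translation idea can be sketched with toy linear "translators" standing in for the paper's seq2seq models (all names and dimensions here are illustrative, not the paper's architecture): translate S to T, cycle back from the predicted T to S, train on both translation losses, and use the intermediate encoding as the joint representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear "translators" standing in for seq2seq models.
d_s, d_t, d_joint = 6, 4, 5
enc_st = rng.standard_normal((d_s, d_joint))   # source -> joint representation
dec_st = rng.standard_normal((d_joint, d_t))   # joint  -> target modality
enc_ts = rng.standard_normal((d_t, d_joint))   # predicted target -> joint
dec_ts = rng.standard_normal((d_joint, d_s))   # joint  -> source (cycle back)

x_s = rng.standard_normal(d_s)                 # source-modality input
x_t = rng.standard_normal(d_t)                 # target-modality ground truth

joint = x_s @ enc_st                 # intermediate joint representation
t_hat = joint @ dec_st               # forward translation S -> T
s_hat = (t_hat @ enc_ts) @ dec_ts    # cyclic translation back to S

# Training would minimize both translation losses; prediction reads
# off `joint`, so at test time only the source modality is required.
forward_loss = np.mean((t_hat - x_t) ** 2)
cycle_loss = np.mean((s_hat - x_s) ** 2)
```

The design point the sketch illustrates is robustness: since `joint` is computed from the source modality alone, the target modality can be missing at test time.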
Inspiration from
Seq2Seq models for unsupervised representation learning.
Network Structure
Dataset: CMU-MOSI, ICT-MMMO, YouTube