Disjoint Mapping Network for Cross-modal Matching of Voices and Faces
Summary
- In this paper, the authors propose DIMNets, a framework that formulates cross-modal matching of voices and faces without mapping voices to faces directly.
- In this framework, multiple kinds of label information provided through covariates can be used.
Research Objective
This paper focuses on devising computational mechanisms for cross-modal matching of voice recordings and images of the speakers' faces.
Background and Problems
- Background
- The vocal tract that generates the voice also shapes the face, and humans have been shown to be able to associate voices of unknown individuals with pictures of their faces [5].
- Problem Statement
- The specific problem setting is one in which we have an existing database of samples of people's voices and images of their faces.
- We must automatically and accurately determine which voices match which faces.
Related work
- Nagrani et al. [14] formulate the mapping as a binary task: given a voice recording, one must select the speaker's face from a pair of face images (or the reverse).
Method(s)
- Objective
- Using an effective way to fuse multimodal representations will make the method's Top-N accuracy higher.
- Methods
- Use Bag-of-Words to extract features from textual descriptions, and adopt CNNs to extract raw features from items (i.e., videos) in the framework.
- Use three kinds of autoencoders [45] to reduce dimensionality and learn features: (i) undercomplete autoencoders, (ii) sparse autoencoders, and (iii) denoising autoencoders.
- Three different fusion architectures are presented.
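The dimensionality-reduction step above can be sketched with a minimal undercomplete autoencoder. This is a toy linear sketch with assumed layer sizes and learning rate, not the paper's implementation; the input rows stand in for raw item features (e.g., from BoW or a CNN):

```python
import numpy as np

def train_undercomplete_autoencoder(X, hidden_dim=8, epochs=200, lr=0.01, seed=0):
    """Toy linear undercomplete autoencoder: the bottleneck (hidden_dim <
    n_features) forces a compressed representation of the item features."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(0.0, 0.1, (d, hidden_dim))
    W_dec = rng.normal(0.0, 0.1, (hidden_dim, d))
    for _ in range(epochs):
        H = X @ W_enc                       # encode to the bottleneck
        err = H @ W_dec - X                 # reconstruction error
        # gradients of the mean squared reconstruction loss
        g_dec = (H.T @ err) / n
        g_enc = (X.T @ (err @ W_dec.T)) / n
        W_dec -= lr * g_dec
        W_enc -= lr * g_enc
    return W_enc

# Rows play the role of raw item features.
X = np.random.default_rng(1).normal(size=(100, 32))
F = X @ train_undercomplete_autoencoder(X)  # low-dimensional item representation
print(F.shape)  # (100, 8)
```

The sparse and denoising variants differ only in adding a sparsity penalty on the hidden codes or corrupting the input before encoding, respectively.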
- Recommender module:
- build an item-item similarity matrix B by exploiting the new representation of items given by the F matrix (the new representation for items obtained by fusing modalities using autoencoders).
- different from SSLIM (a traditional recommender): first compute the item similarity matrix B from F, then use line 7 of the algorithm to compute argmin S (R, F, and g are calculated during training).
- test result: give the Top-N items from argmin(S) as the recommendation result.
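The similarity-and-ranking idea above can be illustrated with a small sketch. Cosine similarity and the helper names here are my assumptions for illustration; the paper instead learns S via a SLIM-style optimization:

```python
import numpy as np

def item_similarity(F):
    """Cosine item-item similarity B from fused item representations F
    (one row per item) -- a simple stand-in for the paper's matrix B."""
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    Fn = F / np.maximum(norms, 1e-12)
    return Fn @ Fn.T

def recommend_top_n(R_user, B, n=3):
    """Score unseen items by their similarity to the user's consumed items,
    then return the Top-N item indices. R_user: binary history vector."""
    scores = B @ R_user              # aggregate similarity to the history
    scores[R_user > 0] = -np.inf     # never re-recommend already-seen items
    return np.argsort(-scores)[:n]

# Four hypothetical items in a 2-D fused feature space.
F = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
B = item_similarity(F)
recs = recommend_top_n(np.array([1.0, 0.0, 0.0, 0.0]), B, n=2)
print(recs)  # item 1 (most similar to the consumed item 0) ranks first
```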
Evaluation
- Datasets:
- MovieLens dataset: (i) ML-1M, and (ii) ML-10M
- Vine dataset
- Metrics: Normalized Discounted Cumulative Gain at top-N (NDCG@N)
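NDCG@N itself can be computed as follows; this is the standard formulation with binary relevance, not code from the paper:

```python
import math

def ndcg_at_n(ranked_items, relevant, n):
    """NDCG@N: discounted gain of the top-N ranked list, normalized by
    the gain of an ideal ranking. Binary relevance is assumed."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked_items[:n])
              if item in relevant)
    ideal_hits = min(len(relevant), n)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# A relevant item at rank 3 instead of rank 2 costs some gain.
print(ndcg_at_n(["a", "b", "c"], {"a", "c"}, n=3))  # ≈ 0.92
```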
- Baselines: 10 baselines
- Results: too long…
- Analysis:
- In this section, the authors analyze the impact of the fusion architecture, the impact of the autoencoder type, and the overall performance.
Conclusion
- main contributions
- present three different architectures to learn multimodal representations of items.
- conduct several experiments to analyze different aspects of our framework using three real-world datasets.
- weak point
- not reflected in the paper.
- further work
- investigate different Deep Learning architectures as well as other feature representations.
- study how other modalities, such as audio, may impact the quality of suggested items.
- consider other recommendation domains, such as social networks, products, and music content.
References (optional)
Inspiration for me
- In this paper, the authors use a dedicated section to introduce fundamental concepts relevant to the work. I should pay more attention to this: such a section helps readers understand the paper.
- I should study how to write related work. The authors organize their related-work section in a logical way; mine is less logical.
- I need to add curve figures to my experiments.
- Add more citations; don't be shy.