Disjoint Mapping Network for Cross-modal Matching of Voices and Faces
Summary
- In this paper, the authors propose DIMNets, a framework that formulates cross-modal matching of voices and faces without mapping voices to faces directly.
- In this framework, multiple kinds of label information provided through covariates can be used.
Research Objective
This paper focuses on devising computational mechanisms for cross-modal matching of voice recordings and images of the speakers' faces.
Background and Problems
- Background
- The vocal tract that generates the voice also shapes the face, and humans have been shown to be able to associate voices of unknown individuals with pictures of their faces [5].
- Problem Statement
- The specific problem setting is one in which we have an existing database of samples of people's voices and images of their faces.
- We must automatically and accurately determine which voices match which faces.
Related work
- Nagrani et al. [14] formulate the mapping as a binary task: given a voice recording, one must select the speaker's face from a pair of face images (or the reverse).
Method(s)
- Objective
- Using an effective way to fuse multimodal representations will make the method's Top-N accuracy higher.
- Methods
- Use Bag-of-Words to extract features from textual descriptions, and adopt CNNs to extract raw features from items (i.e., videos) in the framework.
- Use three kinds of autoencoders [45] to reduce dimensionality and learn features: (i) undercomplete autoencoders, (ii) sparse autoencoders, and (iii) denoising autoencoders.
- Three different fusion architectures are presented.
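The dimensionality-reduction step above can be sketched with a minimal undercomplete autoencoder. This is a toy linear sketch with assumed layer sizes and learning rate, not the paper's implementation; the input rows stand in for raw item features (e.g., from BoW or a CNN):

```python
import numpy as np

def train_undercomplete_autoencoder(X, hidden_dim=8, epochs=200, lr=0.01, seed=0):
    """Toy linear undercomplete autoencoder: the bottleneck (hidden_dim <
    n_features) forces a compressed representation of the item features."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(0.0, 0.1, (d, hidden_dim))
    W_dec = rng.normal(0.0, 0.1, (hidden_dim, d))
    for _ in range(epochs):
        H = X @ W_enc                       # encode to the bottleneck
        err = H @ W_dec - X                 # reconstruction error
        # gradients of the mean squared reconstruction loss
        g_dec = (H.T @ err) / n
        g_enc = (X.T @ (err @ W_dec.T)) / n
        W_dec -= lr * g_dec
        W_enc -= lr * g_enc
    return W_enc

# Rows play the role of raw item features.
X = np.random.default_rng(1).normal(size=(100, 32))
F = X @ train_undercomplete_autoencoder(X)  # low-dimensional item representation
print(F.shape)  # (100, 8)
```

The sparse and denoising variants differ only in adding a sparsity penalty on the hidden codes or corrupting the input before encoding, respectively.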
- Recommender module:
- build an item-item similarity matrix B by exploiting the new representation of items given by the F matrix (the new representation for items obtained by fusing modalities using autoencoders).
- different from SSLIM (a traditional recommender): first compute the item similarity matrix B from F, then use line 7 of the algorithm to compute argmin S (R, F, and g are calculated during training).
- test result: give the Top-N items from argmin(S) as the recommendation result.
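The similarity-and-ranking idea above can be illustrated with a small sketch. Cosine similarity and the helper names here are my assumptions for illustration; the paper instead learns S via a SLIM-style optimization:

```python
import numpy as np

def item_similarity(F):
    """Cosine item-item similarity B from fused item representations F
    (one row per item) -- a simple stand-in for the paper's matrix B."""
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    Fn = F / np.maximum(norms, 1e-12)
    return Fn @ Fn.T

def recommend_top_n(R_user, B, n=3):
    """Score unseen items by their similarity to the user's consumed items,
    then return the Top-N item indices. R_user: binary history vector."""
    scores = B @ R_user              # aggregate similarity to the history
    scores[R_user > 0] = -np.inf     # never re-recommend already-seen items
    return np.argsort(-scores)[:n]

# Four hypothetical items in a 2-D fused feature space.
F = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
B = item_similarity(F)
recs = recommend_top_n(np.array([1.0, 0.0, 0.0, 0.0]), B, n=2)
print(recs)  # item 1 (most similar to the consumed item 0) ranks first
```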
Evaluation
- Datasets:
- MovieLens dataset: (i) ML-1M, and (ii) ML-10M
- Vine dataset
- Metrics: Normalized Discounted Cumulative Gain at top-N (NDCG@N)
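NDCG@N itself can be computed as follows; this is the standard formulation with binary relevance, not code from the paper:

```python
import math

def ndcg_at_n(ranked_items, relevant, n):
    """NDCG@N: discounted gain of the top-N ranked list, normalized by
    the gain of an ideal ranking. Binary relevance is assumed."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked_items[:n])
              if item in relevant)
    ideal_hits = min(len(relevant), n)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# A relevant item at rank 3 instead of rank 2 costs some gain.
print(ndcg_at_n(["a", "b", "c"], {"a", "c"}, n=3))  # ≈ 0.92
```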
- Baselines: 10 baselines
- Results: too long…
- Analysis:
- In this section, the authors analyze the impact of the fusion architecture, the impact of the autoencoder type, and the overall performance.
Conclusion
- main contributions
- present three different architectures to learn multimodal representations of items.
- conduct several experiments to analyze different aspects of our framework using three real-world datasets.
- weak point
- not reflected in the paper.
- further work
- investigate different Deep Learning architectures as well as other feature representations.
- study how other modalities, such as audio, may impact the quality of suggested items.
- consider other recommendation domains, such as social networks, products, and music content.
References (optional)
Inspiration for me
- In this paper, the authors use a dedicated section to introduce fundamental concepts relevant to the work. I should pay more attention to this: such a section helps readers understand the paper.
- I should study how to write related work. The authors organize their related-work section in a logical way; mine is less logical.
- I need to add curve figures to my experiments.
- Add more citations; don't be shy.