Paper review: Learning Discriminative Joint Embeddings for Efficient Face and Voice Association

On Learning Associations of Faces and Voices

Conference: SIGIR ’20

The author

We examine whether faces and voices encode redundant identity information, and measure to what extent.

Background
- Many cognitive studies have shown that humans can hear the voice of a known individual and form a mental picture of that person's face, and vice versa.
Brief introduction to previous methods
- DIMNet [13] uses a multi-task classification network to learn common embeddings, but it does not fully consider the high-level semantic correlations between faces and voices.
Problem Statement
- The performance of existing methods is far from expectations. To the best of the authors' knowledge, automatic face and voice association is still at an early stage of study.

An end-to-end deep correlated network is proposed to learn discriminative joint embeddings for face-voice associations.

not reflect

Methods
- 1. The network consists of three parts: a face subnetwork, a voice subnetwork, and a shared FC structure. Specifically, the face and voice subnetworks, with independent network parameters, learn the high-level correlated and modality-specific features of faces and voices. The shared FC structure, whose parameters are the same for both modalities, jointly learns a shared latent space to bridge the semantic gap between face and voice.
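The three-part layout described above can be sketched in a few lines. This is only an illustration under loud assumptions: single linear layers with ReLU stand in for the real face and voice subnetworks, and the class name `TwoBranchEmbedder`, the helper names, and all dimensions are hypothetical, not taken from the paper.

```python
import random


def make_linear(n_in, n_out, rng):
    """Return an (n_out x n_in) weight matrix with small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def linear_relu(weights, x):
    """Apply y = relu(W x) to a feature vector x."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


class TwoBranchEmbedder:
    """Toy stand-in: two modality-specific branches feeding one shared layer."""

    def __init__(self, face_dim, voice_dim, hidden_dim, embed_dim, seed=0):
        rng = random.Random(seed)
        # Independent parameters for each modality (assumed: one layer each).
        self.face_w = make_linear(face_dim, hidden_dim, rng)
        self.voice_w = make_linear(voice_dim, hidden_dim, rng)
        # One set of shared FC parameters used by both modalities.
        self.shared_w = make_linear(hidden_dim, embed_dim, rng)

    def embed_face(self, x):
        return linear_relu(self.shared_w, linear_relu(self.face_w, x))

    def embed_voice(self, x):
        return linear_relu(self.shared_w, linear_relu(self.voice_w, x))


model = TwoBranchEmbedder(face_dim=8, voice_dim=6, hidden_dim=5, embed_dim=4)
face_embedding = model.embed_face([1.0] * 8)
voice_embedding = model.embed_voice([1.0] * 6)
```

Both embeddings come out with the same dimensionality because they pass through the same `shared_w`, which is what lets faces and voices be compared in one latent space.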


1. Loss: the overall loss function is defined by combining the ranking constraint, the identity constraint, and the center constraint.
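1. Online Hard Negative Mining: there are far more negative samples than positive ones. Training on all of them would certainly cause problems, so only a subset is used, namely the hardest negatives.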
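To make the hard-negative idea concrete, here is a minimal sketch. It assumes a triplet-style margin ranking loss with Euclidean distance, which may differ from the paper's exact formulation. The function name `hardest_negative_ranking_loss`, the margin value, and the toy vectors are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hardest_negative_ranking_loss(anchor, positive, negatives, margin=0.5):
    """Margin ranking loss that keeps only the hardest negative.

    The hardest negative is the non-matching sample closest to the anchor;
    all easier negatives are ignored. (Assumed form, not the paper's formula.)
    """
    d_pos = euclidean(anchor, positive)
    d_neg = min(euclidean(anchor, n) for n in negatives)
    return max(0.0, margin + d_pos - d_neg)


# Toy 2-D embeddings: one voice, its matching face, and three other faces.
voice = [0.0, 0.0]
matching_face = [0.1, 0.0]
other_faces = [[0.3, 0.0], [2.0, 2.0], [5.0, 1.0]]

loss = hardest_negative_ranking_loss(voice, matching_face, other_faces)
easy_only_loss = hardest_negative_ranking_loss(voice, matching_face, other_faces[1:])
```

In this toy case only the face at `[0.3, 0.0]` contributes to `loss`. With that face removed, the remaining negatives are already farther away than the margin requires, so `easy_only_loss` is zero and they would add nothing to training.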

1. Dataset: VoxCeleb [12]. For face data, the authors extract video frames at a sampling rate of 1 fps and use MTCNN to detect facial landmarks in the extracted frames. For voice data, they use a voice activity detector (VAD) to remove long silent periods from the voice segments, then generate 64-dimensional log mel-spectrograms (window size: 25 ms, hop size: 10 ms).
1. Evaluation metrics: AUC, ACC, mAP.
1. Implementation details: omitted; they concern parameter settings.
1. Tasks: the authors first run experiments stratified by gender, age, …. They then evaluate 1:2 matching, 1:N matching, and cross-modal retrieval.
1. The results show that the proposed method brings a large improvement.
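The three metrics in the list above can be sketched on toy data as follows. These are the standard textbook definitions, and they assume the paper uses them in the usual way. The function names and all numbers are hypothetical.

```python
def matching_accuracy(trials):
    """ACC for 1:2 matching.

    Each trial is (distance to the true match, distance to the imposter);
    a trial is correct when the true match is strictly closer.
    """
    correct = sum(1 for d_true, d_imposter in trials if d_true < d_imposter)
    return correct / len(trials)


def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance holds 0/1 flags in ranked order."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(queries):
    """mAP: mean of the per-query average precision."""
    return sum(average_precision(q) for q in queries) / len(queries)


acc = matching_accuracy([(0.2, 0.9), (0.5, 0.4), (0.1, 0.3), (0.7, 0.8)])
area = auc([0.9, 0.8, 0.4], [0.5, 0.3])
m_ap = mean_average_precision([[1, 0, 1, 0], [0, 1]])
```

Here `acc` is 0.75 (three of four trials pick the true match), `area` is 5/6, and `m_ap` is the mean of 5/6 and 1/2, that is 2/3.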

The derived cross-modal embeddings are beneficial for various face-voice association and cross-matching tasks, and extensive experiments have shown their outstanding performance.

The authors use a ranking constraint to make the data more distinguishable after training. Here is a picture showing the training effect.
The authors run reasonable experiments on their own method, comparing many different parameter settings. This is worth learning from, but more importantly, such comparisons should come after the comparison with previous methods.
I found that Chinese papers are easy to read. Maybe the grammar is simpler. So maybe I can pay more attention to Chinese papers.
This regular conference paper is short. Even so, the reading experience is quite good: it includes plenty of detail and leaves out the unimportant parts.
This paper is very recent. It summarizes most of the relevant work. We can see these papers in this picture. I have read only 2 of them.