Notes on "Utterance-level Aggregation For Speaker Recognition In The Wild"

Input: a 257-dimensional vector per frame, made up of 256 frequency components plus 1 DC component.
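As a rough illustration of where the 257 dimensions come from: a 512-point FFT of a real-valued frame yields 512 / 2 + 1 = 257 bins (the DC bin plus 256 frequency bins). The sketch below assumes 16 kHz audio, a 25 ms Hamming window and a 10 ms hop; those values, and the function name `spectrogram`, are assumptions for illustration and are not stated in these notes.

```python
import numpy as np

def spectrogram(signal, sample_rate=16000, win_ms=25, hop_ms=10, n_fft=512):
    """Magnitude spectrogram with one row per frame (hypothetical helper)."""
    win_len = int(sample_rate * win_ms / 1000)   # samples per window
    hop_len = int(sample_rate * hop_ms / 1000)   # samples between frame starts
    window = np.hamming(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop_len
    frames = np.stack([
        signal[i * hop_len: i * hop_len + win_len] * window
        for i in range(n_frames)
    ])
    # rfft zero-pads each frame to n_fft and returns n_fft // 2 + 1 bins:
    # bin 0 is the DC component, the remaining 256 are frequency components.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

# One second of random noise standing in for real speech.
rng = np.random.default_rng(0)
spec = spectrogram(rng.standard_normal(16000))
print(spec.shape)  # (98, 257)
```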
Backbone network: Thin-ResNet, which extracts frame-level features.
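NetVLAD or GhostVLAD layer: aggregates the frame-level features into a single utterance-level feature. Most methods use an average pooling layer that simply averages over the frame dimension. The drawback is that every frame receives the same weight, whereas in practice frames contribute unequally to the result; for example, frames that contain speech should contribute more than frames that do not. The method in this paper in effect learns automatically to give each frame a different weight.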
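The difference between average pooling and VLAD-style aggregation can be sketched in NumPy. This is a minimal sketch, not the paper's implementation: the function name `ghostvlad` and the parameters `centers`, `weights`, `biases` and `n_ghost` are hypothetical, and the parameters are random here, whereas in a real model they are learned together with the backbone. Each frame is softly assigned to the clusters, the residuals to each cluster center are summed with those assignment weights, and any ghost clusters are dropped from the output so that frames assigned to them contribute less.

```python
import numpy as np

def ghostvlad(features, centers, weights, biases, n_ghost=0):
    """Aggregate (T, D) frame features into one fixed-length vector.

    centers: (K + G, D), weights: (D, K + G), biases: (K + G,),
    where K is the number of real clusters and G = n_ghost.
    With n_ghost=0 this reduces to plain NetVLAD-style pooling.
    """
    # Soft assignment of every frame to every cluster (softmax over clusters).
    logits = features @ weights + biases                 # (T, K + G)
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)

    # Assignment-weighted sum of residuals to each cluster center.
    residuals = features[:, None, :] - centers[None, :, :]   # (T, K + G, D)
    vlad = (assign[:, :, None] * residuals).sum(axis=0)      # (K + G, D)

    # Ghost clusters take part in the assignment but are not output.
    if n_ghost > 0:
        vlad = vlad[:-n_ghost]

    # L2-normalize each cluster's residual, then flatten.
    norms = np.linalg.norm(vlad, axis=1, keepdims=True)
    return (vlad / np.maximum(norms, 1e-12)).reshape(-1)

rng = np.random.default_rng(0)
T, D, K, G = 50, 8, 4, 2
frames = rng.standard_normal((T, D))
params = (rng.standard_normal((K + G, D)),
          rng.standard_normal((D, K + G)),
          rng.standard_normal(K + G))

avg_pooled = frames.mean(axis=0)                 # every frame weighted 1/T
vlad_pooled = ghostvlad(frames, *params, n_ghost=G)
print(avg_pooled.shape, vlad_pooled.shape)       # (8,) (32,)
```

The output length is K * D regardless of the number of frames T, so utterances of different durations map to vectors of the same size.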
Training loss: standard softmax loss and additive margin softmax (AM-Softmax).
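To make the difference between the two losses concrete, here is a minimal NumPy sketch of AM-Softmax. It normalizes both embeddings and class weights so the logits become cosines, subtracts a margin from the target-class cosine only, and then applies ordinary softmax cross-entropy. The function name `am_softmax_loss` and the default values `scale=30.0` and `margin=0.35` are assumptions for illustration; these notes do not state the values used in the paper.

```python
import numpy as np

def am_softmax_loss(embeddings, class_weights, labels, scale=30.0, margin=0.35):
    """Mean AM-Softmax loss.

    embeddings: (N, D), class_weights: (D, C), labels: (N,) integer class ids.
    With margin=0 this is softmax cross-entropy on scaled cosine logits.
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=0, keepdims=True)
    cos = emb @ w                                        # (N, C) cosines

    rows = np.arange(len(labels))
    logits = scale * cos
    # Subtract the additive margin from the target-class cosine only.
    logits[rows, labels] = scale * (cos[rows, labels] - margin)

    # Numerically stable log-softmax followed by cross-entropy.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()

rng = np.random.default_rng(0)
emb = rng.standard_normal((16, 32))        # 16 utterance embeddings
w = rng.standard_normal((32, 10))          # 10 speaker classes
y = rng.integers(0, 10, size=16)

print(am_softmax_loss(emb, w, y, margin=0.0))   # no margin
print(am_softmax_loss(emb, w, y, margin=0.35))  # larger: the margin makes the target harder to hit
```

Because the margin lowers only the target-class logit, the loss with a positive margin is always larger than without it for the same inputs, which pushes the model toward a larger angular gap between speakers.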