Voice Conversion 项目笔记（含从VCC 2016匿名比赛深挖的各前沿方法性能对比）

voice conversion 基本架构：

voice conversion 任务主要由两个步骤构成，特征提取与特征参数转换，对于这两个步骤，都有相应的常用的技术，这两个步骤中常用的技术各种排列组合，就产生了众多VC系统，以下做小汇总。

STEP1：Feature extraction			STEP2：Feature conversion
Feature	Extraction toolkits	description
LSF（Line Spectral Frequency）	STRAIGHT	STRAIGHT is an MATLAB Lib design for VC.	GMM/JDGMM
MGC（mel-generalized coeffi- cient ）	STRAIGHT	STRAIGHT is an MATLAB Lib design for VC.	DNN
LF0(log f0)	AHOCODER	Ahocoder parameterizes speech waveforms into three different streams	RNN(BLSTM)
MCP(mel-spectgram)	AHOCODER		seq2seq/(with Attention)
MVF（maximum voiced frequency）	AHOCODER		GRU
	DBN		VQ
	Mixture of Factor Analyzer
parameter generation algorithm with global variance	HTS	HMM-based Speech Synthesis System

其中,LF0为语音基频log变换，为主要的语音转换参数，是表征不同人，不同性别的最重要参数之一。

其他参数为语音的高阶分量，控制合成语音的细节。

衡量标准：

常用的两种：基于主观评分的MOS和基于Mel谱失真的MCD

MOS（Mean Opinion Score）主观评分分值1-5 由受试者主观打分一般有GMM方法的MOS分作为beseline，对于不同的测试可以考虑把GMM的分数对齐从而在不同的MOS测试中统一基准

打分基准：

4－5分优秀(excelent) 很好，听的清楚，延迟很小，交流流畅

3－4分良好(good) 稍差，听的清楚，延迟小，交流欠缺顺畅，有点杂音

2－3分一般(fair) 还可以，听不太清，有一定延迟，可以交流

1－2分差(poor) 勉强，听不太清，延迟较大，交流重复多次1分以下很差(bad) 极差，听不懂，延迟大，交流不通畅

MCD（Mel-cepstral distortion）[dB] 越低越好

例：

seq2seq with LSTM:

Voice Conversion 项目笔记（含从VCC 2016匿名比赛深挖的各前沿方法性能对比）

Variational Auto-encoder:

Deep sequence-to-sequence Attention Model :

VCC2016:

各组综合分数对比：

各组方案对比（红圈为MOS>2.5 similarity>60%的方案，紫色横线表示不会公开方法的组别）

其中G，L,O 组

很多都采用了STRAIGHT工具包进行MGC特征的提取

转换方法上，常采用GMM或者LSTM方法

开源的组转换结果样例：

效果较好的：

略逊的：

各组的报告，包括使用的参数，工具包，及其方法：

各组来源：

现在可用的转换方法：

用不同的神经网络对不同参数进行转换

用AHOcoder进行特征提取：

LF0：LSTM

MVF：DNN

MCP：GRU

转换结果样例：（基于VCC2016 数据）

基于LSF LF0 UV的方法：

转换结果样例：

还在下载

基于GMM的转换：

利用SPTK提取MCEP参数，对MCEP参数用24阶GMM进行转换：

转换结果样例：

各机构的Demo：

Voice Conversion 方法，链接如下：

LyreBird : https://lyrebird.ai/demo

微软：DNN VC
香港科技大学：BLSTM VC
印度OHSU：Joint AE VC
日本东京大学：GMM VC
法国tut：基于DKPL回归
Voice Morphing：Voice Morphing
Adobe VOCO：http://ieeexplore.ieee.org/abstract/document/7472761/

TTS方法，链接如下：

ModelTalker ：(采集语音以进行语音合成):https://www.modeltalker.org/
日本Kobayashi 实验室：Speaker-Independent HMM-Based Voice Conversion
爱丁堡大学：Listening test materials for “A study of speaker adaptation for DNN-based speech synthesis”
TOKUDA and NANKAKU LABORATORY
Princeton VOCO：http://gfx.cs.princeton.edu/pubs/Jin_2017_VTI/【5月11日发表】【10分钟音频内寻找最接近音素】

建议：

在VCC2016比赛中，效果较好的组在语音特征提取的环节都采用了STRAIGHT工具包，应对此进行进一步的探索

未开源及展示论文的方案可能由于经过cherry-picked的调参，复现可能有一定困难

开源的方案主要是都以LSTM作为主要模型，用于转换LF0，MCEP等特征

现有的转换方法中，以基于GMM的MCEP特征转换作为基准，利用LSTM转换LF0的这种方法性能较好，其MOS评分应在2-3分左右，普分辨率较低

可以用STRAIGHT替换AHOcoder进行特征提取，并使用LSTM进行LF0和MCEP的转换（尝试研究G组的方案）。

参考论文及其性能展示：

常用传统转换技术对比（不包括各类神经网络方法）：

https://arxiv.org/pdf/1612.07523.pdf

IEEE（只看了一下，包括其中G组的方案）:

Voice conversion using deep neural networks with speaker-independent pre-training

http://ieeexplore.ieee.org/document/7078543/

Dictionary update for NMF-based voice conversion using an encoder-decoder network

http://ieeexplore.ieee.org/document/7918382/

Text-independent voice conversion using deep neural network based phonetic level features

http://ieeexplore.ieee.org/document/7918382/

F0 transformation techniques for statistical voice conversion with direct waveform modification with spectral differential

http://ieeexplore.ieee.org/document/7846338/

Enhancing a glossectomy patient's speech via GMM-based voice conversion

http://ieeexplore.ieee.org/document/7820909/

Deep neural network based voice conversion with a large synthesized parallel corpus

http://ieeexplore.ieee.org/document/7820716/

SEQ2SEQ类:

https://arxiv.org/pdf/1704.02360.pdf

Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities

Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks

https://arxiv.org/pdf/1704.00849.pdf

VAE：

Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder

https://arxiv.org/pdf/1610.04019.pdf

DNN：

4 Proposed Framework: Deep sequence-to-sequence Attention Model

寻找对应组：

I组：NII National Institute of Informatics：

http://cs.joensuu.fi/pages/tkinnu/webpage/pdf/i-vector-VC_icassp2017.pdf

NON-PARALLEL VOICE CONVERSION USING I-VECTOR PLDA: TOWARDS UNIFYING SPEAKER VERIFICATION AND TRANSFORMATION

L组：西北工业大学，谢磊：

http://www.nwpu-aslp.org/lxie/papers/ssw9_PS1-4_Huang.pdf

An Automatic Voice Conversion Evaluation Strategy Based on Perceptual Background Noise Distortion and Speaker Similarity

J组：NU-NAIST：

F0 transformation techniques for statistical voice conversion with direct waveform modification with spectral differential

http://ieeexplore.ieee.org/abstract/document/7846338/

C组：UTSC-NELLSILP：（jun du）

https://www.researchgate.net/publication/307889222_The_USTC_System_for_Voice_Conversion_Challenge_2016_Neural_Network_Based_Approaches_for_Spectrum_Aperiodicity_and_F0_Conversion

未知：University College London voice conversion challenge：

Hulk组【G组】：Phone-aware LSTM-RNN for voice conversion

http://ieeexplore.ieee.org/abstract/document/7877819/

B组：Zhizheng WU：

CSTR The Centre for Speech Technology Research, The University of Edinburgh, UK

http://ieeexplore.ieee.org/abstract/document/7472761/

https://link.springer.com/article/10.1007/s11042-014-2180-2

F组：HCCL-CUHK Human-Computer Communications Laboratory

The Chinese University of Hong Kong, Hong Kong

http://www1.se.cuhk.edu.hk/~lfsun/ICME2016_Lifa_Sun.pdf

日本人的复现：

https://github.com/sesenosannko/ppg_vc

【一个用DNN的组（O（也有可能是C））】VoiceKontrol Center for Spoken Language Understanding (CSLU), Oregon Health & Science University, Portland, OR, USA

S.H. Mohammadi, A. Kain, Semi-supervised Training of a Voice Conversion Mapping Function using Joint-Autoencoder, Interspeech (To Appear), 2015.

S.H. Mohammadi, A. Kain, Voice Conversion Using Deep Neural Networks With Speaker-Independent Pre-Training, 2014 IEEE Spoken Language Technology Workshop (SLT), 2014.

复现：

https://github.com/shamidreza/dnnmapper

还没找到的：

CASIA-NLPR-Taogroup National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China

Team Initiator Tsinghua University, Beijing, China

AST Academia Sinica, Taipei, Taiwan

https://arxiv.org/pdf/1610.04019.pdf

https://github.com/JeremyCCHsu/vae4vc

github:

RNN:

https://github.com/Pooja-Donekal/Voice-Conversion

LSTM:(还不错)（F组）