Abstract:
Audio deepfakes are synthetic speech signals generated by deep neural networks. Deepfake audio techniques underpin speech synthesis, speech restoration, and voice conversion, yet generating convincing deepfake speech remains a complex task because localized features must be extracted from speech data. In this paper, we synthesize and analyze GAN-based deepfake audio speech. The GAN framework is a powerful tool for producing best-in-class synthesis across many data types; we build our model on the Tacotron 2 synthesizer. In addition, we perform a comparative experiment between genuine and deepfake speech. The analysis is carried out in both subjective and objective modules: the subjective module uses the Mean Opinion Score (MOS) and preference tests such as Speech Audiometry and Degradation MOS, while the objective module covers Mel-Cepstral Distortion (MCD), Perceptual Evaluation of Speech Quality (PESQ), and prosody-based features. Experimental evaluation with both synthesized and real data shows that our model achieves quality close to that of genuine speech. These techniques can also be used to build audio sample corpora.