VITS#
VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) is an end-to-end TTS model (encoder and vocoder trained together) that takes advantage of SOTA deep learning techniques such as GANs, VAEs, and Normalizing Flows. It does not require external alignment annotations; instead it learns the text-to-audio alignment using MAS (Monotonic Alignment Search), as explained in the paper. The model architecture combines the GlowTTS encoder with the HiFiGAN vocoder. It is a feed-forward model with a 67.12x real-time factor on a GPU.
🐸 YourTTS is a multi-speaker and multi-lingual TTS model that can perform voice conversion and zero-shot speaker adaptation. It can also learn a new language or voice from an audio clip of roughly one minute, which opens the door to training TTS models for low-resource languages. 🐸 YourTTS uses VITS as the backbone architecture, coupled with a speaker encoder model.
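If you only want to synthesize speech with an already released VITS checkpoint, the high-level 🐸TTS API is usually enough. A minimal sketch is shown below; the model identifier is an example of a published 🐸TTS model name and may differ between releases, so check `tts --list_models` on the command line if it is not available in your installation.
>>> from TTS.api import TTS
>>> # Load a released single-speaker English VITS model (identifier is an example).
>>> tts = TTS(model_name="tts_models/en/ljspeech/vits")
>>> # Synthesize a sentence straight to a wav file.
>>> tts.tts_to_file(text="VITS is an end-to-end TTS model.", file_path="vits_example.wav")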
Important resources & papers#
🐸 YourTTS: https://arxiv.org/abs/2112.02418
Neural Spline Flows: https://arxiv.org/abs/1906.04032
Variational Autoencoder: https://arxiv.org/pdf/1312.6114.pdf
Generative Adversarial Networks: https://arxiv.org/abs/1406.2661
HiFiGAN: https://arxiv.org/abs/2010.05646
Normalizing Flows: https://blog.evjang.com/2018/01/nf1.html
VitsConfig#
- class TTS.tts.configs.vits_config.VitsConfig(output_path='output', logger_uri=None, run_name='run', project_name=None, run_description='🐸Coqui trainer run.', print_step=25, plot_step=100, model_param_stats=False, wandb_entity=None, dashboard_logger='tensorboard', log_model_step=None, save_step=10000, save_n_checkpoints=5, save_checkpoints=True, save_all_best=False, save_best_after=10000, target_loss=None, print_eval=False, test_delay_epochs=0, run_eval=True, run_eval_steps=None, distributed_backend='nccl', distributed_url='tcp://localhost:54321', mixed_precision=False, epochs=1000, batch_size=32, eval_batch_size=16, grad_clip=<factory>, scheduler_after_epoch=True, lr=0.001, optimizer='AdamW', optimizer_params=<factory>, lr_scheduler=None, lr_scheduler_params=<factory>, use_grad_scaler=False, cudnn_enable=True, cudnn_deterministic=False, cudnn_benchmark=False, training_seed=54321, model='vits', num_loader_workers=0, num_eval_loader_workers=0, use_noise_augment=False, audio=VitsAudioConfig(fft_size=1024, sample_rate=22050, win_length=1024, hop_length=256, num_mels=80, mel_fmin=0, mel_fmax=None), use_phonemes=False, phonemizer=None, phoneme_language=None, compute_input_seq_cache=False, text_cleaner=None, enable_eos_bos_chars=False, test_sentences_file='', phoneme_cache_path=None, characters=None, add_blank=True, batch_group_size=0, loss_masking=None, min_audio_len=1, max_audio_len=inf, min_text_len=1, max_text_len=inf, compute_f0=False, compute_energy=False, compute_linear_spec=True, precompute_num_workers=0, start_by_longest=False, shuffle=False, drop_last=False, datasets=<factory>, test_sentences=<factory>, eval_split_max_size=None, eval_split_size=0.01, use_speaker_weighted_sampler=False, speaker_weighted_sampler_alpha=1.0, use_language_weighted_sampler=False, language_weighted_sampler_alpha=1.0, use_length_weighted_sampler=False, length_weighted_sampler_alpha=1.0, model_args=<factory>, lr_gen=0.0002, lr_disc=0.0002, lr_scheduler_gen='ExponentialLR', lr_scheduler_gen_params=<factory>, lr_scheduler_disc='ExponentialLR', lr_scheduler_disc_params=<factory>, kl_loss_alpha=1.0, disc_loss_alpha=1.0, gen_loss_alpha=1.0, feat_loss_alpha=1.0, mel_loss_alpha=45.0, dur_loss_alpha=1.0, speaker_encoder_loss_alpha=1.0, return_wav=True, use_weighted_sampler=False, weighted_sampler_attrs=<factory>, weighted_sampler_multipliers=<factory>, r=1, num_speakers=0, use_speaker_embedding=False, speakers_file=None, speaker_embedding_channels=256, language_ids_file=None, use_language_embedding=False, use_d_vector_file=False, d_vector_file=None, d_vector_dim=None)[source]#
Defines parameters for VITS End2End TTS model.
- Parameters
model (str) – Model name. Do not change unless you know what you are doing.
model_args (VitsArgs) – Model architecture arguments. Defaults to VitsArgs().
audio (VitsAudioConfig) – Audio processing configuration. Defaults to VitsAudioConfig().
grad_clip (List) – Gradient clipping thresholds for each optimizer. Defaults to [1000.0, 1000.0].
lr_gen (float) – Initial learning rate for the generator. Defaults to 0.0002.
lr_disc (float) – Initial learning rate for the discriminator. Defaults to 0.0002.
lr_scheduler_gen (str) – Name of the learning rate scheduler for the generator. One of the torch.optim.lr_scheduler.*. Defaults to ExponentialLR.
lr_scheduler_gen_params (dict) – Parameters for the learning rate scheduler of the generator. Defaults to {"gamma": 0.999875, "last_epoch": -1}.
lr_scheduler_disc (str) – Name of the learning rate scheduler for the discriminator. One of the torch.optim.lr_scheduler.*. Defaults to ExponentialLR.
lr_scheduler_disc_params (dict) – Parameters for the learning rate scheduler of the discriminator. Defaults to {"gamma": 0.999875, "last_epoch": -1}.
scheduler_after_epoch (bool) – If true, step the schedulers after each epoch, otherwise after each step. Defaults to True.
optimizer (str) – Name of the optimizer to use with both the generator and the discriminator networks. One of the torch.optim.*. Defaults to AdamW.
kl_loss_alpha (float) – Loss weight for the KL loss. Defaults to 1.0.
disc_loss_alpha (float) – Loss weight for the discriminator loss. Defaults to 1.0.
gen_loss_alpha (float) – Loss weight for the generator loss. Defaults to 1.0.
feat_loss_alpha (float) – Loss weight for the feature matching loss. Defaults to 1.0.
mel_loss_alpha (float) – Loss weight for the mel loss. Defaults to 45.0.
return_wav (bool) – If true, the data loader returns the waveform as well as the other outputs. Do not change. Defaults to True.
compute_linear_spec (bool) – If true, the linear spectrogram is computed and returned alongside the mel output. Do not change. Defaults to True.
use_weighted_sampler (bool) – If true, use a weighted sampler with bucketing for balancing samples between datasets used in training. Defaults to False.
weighted_sampler_attrs (dict) – Key returned by the formatter to be used for the weighted sampler. For example {"root_path": 2.0, "speaker_name": 1.0} sets sample probabilities by overweighting root_path by 2.0. Defaults to {}.
weighted_sampler_multipliers (dict) – Weight for each unique value of a key returned by the formatter for weighted sampling. For example {"root_path": {"/raid/datasets/libritts-clean-16khz-bwe-coqui_44khz/LibriTTS/train-clean-100/": 1.0, "/raid/datasets/libritts-clean-16khz-bwe-coqui_44khz/LibriTTS/train-clean-360/": 0.5}} samples instances from train-clean-100 twice as often as from train-clean-360. Defaults to {}.
r (int) – Number of spectrogram frames to be generated at a time. Do not change. Defaults to 1.
add_blank (bool) – If true, a blank token is added in between every character. Defaults to True.
test_sentences (List[List]) – List of sentences with speaker and language information to be used for testing.
language_ids_file (str) – Path to the language ids file.
use_language_embedding (bool) – If true, language embedding is used. Defaults to False.
Note
Check TTS.tts.configs.shared_configs.BaseTTSConfig for the inherited parameters.
Example
>>> from TTS.tts.configs.vits_config import VitsConfig
>>> config = VitsConfig()
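Any of the fields documented above can be overridden when the config is constructed; the remaining fields keep their defaults. A short sketch with illustrative (not recommended) values:
>>> from TTS.tts.configs.vits_config import VitsConfig
>>> # Override a handful of documented fields; everything else keeps its default value.
>>> config = VitsConfig(
...     batch_size=16,
...     lr_gen=2e-4,
...     lr_disc=2e-4,
...     add_blank=True,
...     use_phonemes=True,
...     phoneme_language="en-us",
... )
>>> print(config.lr_gen, config.use_phonemes)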
VitsArgs#
- class TTS.tts.models.vits.VitsArgs(num_chars=100, out_channels=513, spec_segment_size=32, hidden_channels=192, hidden_channels_ffn_text_encoder=768, num_heads_text_encoder=2, num_layers_text_encoder=6, kernel_size_text_encoder=3, dropout_p_text_encoder=0.1, dropout_p_duration_predictor=0.5, kernel_size_posterior_encoder=5, dilation_rate_posterior_encoder=1, num_layers_posterior_encoder=16, kernel_size_flow=5, dilation_rate_flow=1, num_layers_flow=4, resblock_type_decoder='1', resblock_kernel_sizes_decoder=<factory>, resblock_dilation_sizes_decoder=<factory>, upsample_rates_decoder=<factory>, upsample_initial_channel_decoder=512, upsample_kernel_sizes_decoder=<factory>, periods_multi_period_discriminator=<factory>, use_sdp=True, noise_scale=1.0, inference_noise_scale=0.667, length_scale=1, noise_scale_dp=1.0, inference_noise_scale_dp=1.0, max_inference_len=None, init_discriminator=True, use_spectral_norm_disriminator=False, use_speaker_embedding=False, num_speakers=0, speakers_file=None, d_vector_file=None, speaker_embedding_channels=256, use_d_vector_file=False, d_vector_dim=0, detach_dp_input=True, use_language_embedding=False, embedded_language_dim=4, num_languages=0, language_ids_file=None, use_speaker_encoder_as_loss=False, speaker_encoder_config_path='', speaker_encoder_model_path='', condition_dp_on_speaker=True, freeze_encoder=False, freeze_DP=False, freeze_PE=False, freeze_flow_decoder=False, freeze_waveform_decoder=False, encoder_sample_rate=None, interpolate_z=True, reinit_DP=False, reinit_text_encoder=False)[source]#
VITS model arguments.
- Parameters
num_chars (int) – Number of characters in the vocabulary. Defaults to 100.
out_channels (int) – Number of output channels of the decoder. Defaults to 513.
spec_segment_size (int) – Decoder input segment size. Defaults to 32 (32 * hop_length = waveform segment length).
hidden_channels (int) – Number of hidden channels of the model. Defaults to 192.
hidden_channels_ffn_text_encoder (int) – Number of hidden channels of the feed-forward layers of the text encoder transformer. Defaults to 768.
num_heads_text_encoder (int) – Number of attention heads of the text encoder transformer. Defaults to 2.
num_layers_text_encoder (int) – Number of transformer layers in the text encoder. Defaults to 6.
kernel_size_text_encoder (int) – Kernel size of the text encoder transformer FFN layers. Defaults to 3.
dropout_p_text_encoder (float) – Dropout rate of the text encoder. Defaults to 0.1.
dropout_p_duration_predictor (float) – Dropout rate of the duration predictor. Defaults to 0.5.
kernel_size_posterior_encoder (int) – Kernel size of the posterior encoder's WaveNet layers. Defaults to 5.
dilation_rate_posterior_encoder (int) – Dilation rate of the posterior encoder's WaveNet layers. Defaults to 1.
num_layers_posterior_encoder (int) – Number of the posterior encoder's WaveNet layers. Defaults to 16.
kernel_size_flow (int) – Kernel size of the Residual Coupling layers of the flow network. Defaults to 5.
dilation_rate_flow (int) – Dilation rate of the Residual Coupling WaveNet layers of the flow network. Defaults to 1.
num_layers_flow (int) – Number of Residual Coupling WaveNet layers of the flow network. Defaults to 4.
resblock_type_decoder (str) – Type of the residual block in the decoder network. Defaults to "1".
resblock_kernel_sizes_decoder (List[int]) – Kernel sizes of the residual blocks in the decoder network. Defaults to [3, 7, 11].
resblock_dilation_sizes_decoder (List[List[int]]) – Dilation sizes of the residual blocks in the decoder network. Defaults to [[1, 3, 5], [1, 3, 5], [1, 3, 5]].
upsample_rates_decoder (List[int]) – Upsampling rates for each consecutive upsampling layer in the decoder network. The product of these values must equal the hop length used for computing spectrograms. Defaults to [8, 8, 2, 2].
upsample_initial_channel_decoder (int) – Number of hidden channels of the first upsampling convolution layer of the decoder network. Defaults to 512.
upsample_kernel_sizes_decoder (List[int]) – Kernel sizes for each upsampling layer of the decoder network. Defaults to [16, 16, 4, 4].
periods_multi_period_discriminator (List[int]) – Period values for the VITS Multi-Period Discriminator. Defaults to [2, 3, 5, 7, 11].
use_sdp (bool) – Use the Stochastic Duration Predictor. Defaults to True.
noise_scale (float) – Noise scale used for the sample noise tensor in training. Defaults to 1.0.
inference_noise_scale (float) – Noise scale used for the sample noise tensor in inference. Defaults to 0.667.
length_scale (float) – Scale factor for the predicted duration values. Smaller values result in faster speech. Defaults to 1.
noise_scale_dp (float) – Noise scale used by the Stochastic Duration Predictor sample noise in training. Defaults to 1.0.
inference_noise_scale_dp (float) – Noise scale for the Stochastic Duration Predictor in inference. Defaults to 1.0.
max_inference_len (int) – Maximum inference length to limit the memory use. Defaults to None.
init_discriminator (bool) – Initialize the discriminator network if set True. Set False for inference. Defaults to True.
use_spectral_norm_disriminator (bool) – Use spectral normalization instead of weight normalization in the discriminator. Defaults to False.
use_speaker_embedding (bool) – Enable/Disable speaker embedding for multi-speaker models. Defaults to False.
num_speakers (int) – Number of speakers for the speaker embedding layer. Defaults to 0.
speakers_file (str) – Path to the speaker mapping file for the Speaker Manager. Defaults to None.
speaker_embedding_channels (int) – Number of speaker embedding channels. Defaults to 256.
use_d_vector_file (bool) – Enable/Disable the use of d-vectors for multi-speaker training. Defaults to False.
d_vector_file (List[str]) – List of paths to the files including pre-computed speaker embeddings. Defaults to None.
d_vector_dim (int) – Number of d-vector channels. Defaults to 0.
detach_dp_input (bool) – Detach duration predictor's input from the network for stopping the gradients. Defaults to True.
use_language_embedding (bool) – Enable/Disable language embedding for multilingual models. Defaults to False.
embedded_language_dim (int) – Number of language embedding channels. Defaults to 4.
num_languages (int) – Number of languages for the language embedding layer. Defaults to 0.
language_ids_file (str) – Path to the language mapping file for the Language Manager. Defaults to None.
use_speaker_encoder_as_loss (bool) – Enable/Disable Speaker Consistency Loss (SCL). Defaults to False.
speaker_encoder_config_path (str) – Path to the speaker encoder config file, used for SCL. Defaults to "".
speaker_encoder_model_path (str) – Path to the speaker encoder checkpoint file, used for SCL. Defaults to "".
condition_dp_on_speaker (bool) – Condition the duration predictor on the speaker embedding. Defaults to True.
freeze_encoder (bool) – Freeze the encoder weights during training. Defaults to False.
freeze_DP (bool) – Freeze the duration predictor weights during training. Defaults to False.
freeze_PE (bool) – Freeze the posterior encoder weights during training. Defaults to False.
freeze_flow_decoder (bool) – Freeze the flow decoder weights during training. Defaults to False.
freeze_waveform_decoder (bool) – Freeze the waveform decoder weights during training. Defaults to False.
encoder_sample_rate (int) – If not None, this sample rate is used for training the posterior encoder, flow, text encoder and duration predictor. The decoder part (vocoder) is trained with config.audio.sample_rate. Defaults to None.
interpolate_z (bool) – If encoder_sample_rate is not None and this parameter is True, nearest-neighbour interpolation is used to upsample the latent variable z from encoder_sample_rate to config.audio.sample_rate. If it is False, you will need to add extra upsample_rates_decoder entries to match the shape. Defaults to True.
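In practice these arguments are set through VitsConfig.model_args. A minimal multi-speaker sketch that uses a learned speaker embedding layer (the speaker count is illustrative):
>>> from TTS.tts.models.vits import VitsArgs
>>> from TTS.tts.configs.vits_config import VitsConfig
>>> # Enable a learned speaker embedding table with 10 speakers.
>>> model_args = VitsArgs(use_speaker_embedding=True, num_speakers=10)
>>> config = VitsConfig(model_args=model_args, use_speaker_embedding=True, num_speakers=10)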
Vits Model#
- class TTS.tts.models.vits.Vits(config, ap=None, tokenizer=None, speaker_manager=None, language_manager=None)[source]#
VITS TTS model
- Paper:: https://arxiv.org/abs/2106.06103
- Paper Abstract::
Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.
Check TTS.tts.configs.vits_config.VitsConfig for class arguments.
Examples
>>> from TTS.tts.configs.vits_config import VitsConfig
>>> from TTS.tts.models.vits import Vits
>>> config = VitsConfig()
>>> model = Vits(config)
- format_batch(batch)[source]#
Compute speaker, language IDs and d_vectors for the batch if necessary.
- forward(x, x_lengths, y, y_lengths, waveform, aux_input={'d_vectors': None, 'language_ids': None, 'speaker_ids': None})[source]#
Forward pass of the model.
- Parameters
x (torch.tensor) – Batch of input character sequence IDs.
x_lengths (torch.tensor) – Batch of input character sequence lengths.
y (torch.tensor) – Batch of input spectrograms.
y_lengths (torch.tensor) – Batch of input spectrogram lengths.
waveform (torch.tensor) – Batch of ground truth waveforms per sample.
aux_input (dict, optional) – Auxiliary inputs for multi-speaker and multi-lingual training. Defaults to {"d_vectors": None, "speaker_ids": None, "language_ids": None}.
- Returns
model outputs keyed by the output name.
- Return type
Dict
Shapes
x: \([B, T_seq]\)
x_lengths: \([B]\)
y: \([B, C, T_spec]\)
y_lengths: \([B]\)
waveform: \([B, 1, T_wav]\)
d_vectors: \([B, C, 1]\)
speaker_ids: \([B]\)
language_ids: \([B]\)
- Return Shapes:
model_outputs: \([B, 1, T_wav]\)
alignments: \([B, T_seq, T_dec]\)
z: \([B, C, T_dec]\)
z_p: \([B, C, T_dec]\)
m_p: \([B, C, T_dec]\)
logs_p: \([B, C, T_dec]\)
m_q: \([B, C, T_dec]\)
logs_q: \([B, C, T_dec]\)
waveform_seg: \([B, 1, spec_seg_size * hop_length]\)
gt_spk_emb: \([B, 1, speaker_encoder.proj_dim]\)
syn_spk_emb: \([B, 1, speaker_encoder.proj_dim]\)
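The snippet below only builds dummy tensors that follow the documented input shapes (B=2, T_seq=50, T_spec=120, the default 513 linear-spectrogram channels and hop length 256); assembling real batches is handled by the data loader.
>>> import torch
>>> B, T_seq, T_spec, C_spec, hop = 2, 50, 120, 513, 256
>>> x = torch.randint(0, 100, (B, T_seq))       # character IDs: [B, T_seq]
>>> x_lengths = torch.tensor([50, 40])          # valid token lengths per sample: [B]
>>> y = torch.randn(B, C_spec, T_spec)          # linear spectrograms: [B, C, T_spec]
>>> y_lengths = torch.tensor([120, 100])        # valid spectrogram lengths: [B]
>>> waveform = torch.randn(B, 1, T_spec * hop)  # ground-truth audio: [B, 1, T_wav]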
- get_criterion()[source]#
Get the criteria (loss modules) for each optimizer. The index in the returned list matches the optimizer index used in train_step().
- get_lr()[source]#
Return the initial learning rates for each optimizer.
- Returns
learning rates for each optimizer.
- Return type
List
- get_optimizer()[source]#
Initiate and return the GAN optimizers based on the config parameters. It returns two optimizers in a list: the first is for the generator and the second is for the discriminator.
- Returns
optimizers.
- Return type
List
- get_scheduler(optimizer)[source]#
Set the schedulers for each optimizer.
- Parameters
optimizer (List[torch.optim.Optimizer]) – List of optimizers.
- Returns
Schedulers, one for each optimizer.
- Return type
List
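These helpers are normally called by the 🐸 Trainer, but the optimizer/criterion/scheduler pairing can be inspected by hand. A small sketch, assuming the default single-speaker config:
>>> from TTS.tts.configs.vits_config import VitsConfig
>>> from TTS.tts.models.vits import Vits
>>> model = Vits(VitsConfig())
>>> optimizers = model.get_optimizer()            # one optimizer per network, returned as a list
>>> criteria = model.get_criterion()              # loss module matching each optimizer index
>>> schedulers = model.get_scheduler(optimizers)  # one scheduler per optimizer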
- inference(x, aux_input={'d_vectors': None, 'durations': None, 'language_ids': None, 'speaker_ids': None, 'x_lengths': None})[source]#
Note
To run in batch mode, provide x_lengths; otherwise the model assumes a batch size of 1.
Shapes
x: \([B, T_seq]\)
x_lengths: \([B]\)
d_vectors: \([B, C]\)
speaker_ids: \([B]\)
- Return Shapes:
model_outputs: \([B, 1, T_wav]\)
alignments: \([B, T_seq, T_dec]\)
z: \([B, C, T_dec]\)
z_p: \([B, C, T_dec]\)
m_p: \([B, C, T_dec]\)
logs_p: \([B, C, T_dec]\)
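A minimal call sketch with a freshly initialized (hence untrained) single-speaker model; the token IDs are random, so the produced waveform is noise, but it shows the expected inputs and outputs:
>>> import torch
>>> from TTS.tts.configs.vits_config import VitsConfig
>>> from TTS.tts.models.vits import Vits
>>> config = VitsConfig()
>>> model = Vits(config).eval()
>>> x = torch.randint(0, config.model_args.num_chars, (1, 30))  # [B, T_seq] token IDs
>>> with torch.no_grad():
...     outputs = model.inference(x)
>>> outputs["model_outputs"].shape  # [B, 1, T_wav]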
- inference_voice_conversion(reference_wav, speaker_id=None, d_vector=None, reference_speaker_id=None, reference_d_vector=None)[source]#
Inference for voice conversion
- Parameters
reference_wav (Tensor) – Reference waveform. Tensor of shape [B, T]
speaker_id (Tensor) – speaker_id of the target speaker. Tensor of shape [B]
d_vector (Tensor) – d_vector embedding of the target speaker. Tensor of shape [B, C]
reference_speaker_id (Tensor) – speaker_id of the reference_wav speaker. Tensor of shape [B]
reference_d_vector (Tensor) – d_vector embedding of the reference_wav speaker. Tensor of shape [B, C]
- static init_from_config(config, samples=None, verbose=True)[source]#
Initialize the model from config.
- Parameters
config (VitsConfig) – Model config.
samples (Union[List[List], List[Dict]]) – Training samples to parse speaker ids for training. Defaults to None.
- init_multilingual(config)[source]#
Initialize multilingual modules of a model.
- Parameters
config (Coqpit) – Model configuration.
- init_multispeaker(config)[source]#
Initialize multi-speaker modules of a model. A model can be trained either with a speaker embedding layer or with external d_vectors computed from a speaker encoder model.
You must provide a speaker_manager at initialization to set up the multi-speaker modules.
- Parameters
config (Coqpit) – Model configuration.
data (List, optional) – Dataset items to infer number of speakers. Defaults to None.
- load_checkpoint(config, checkpoint_path, eval=False, strict=True, cache=False)[source]#
Load the model checkpoint and set it up for training or inference.
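To reuse a trained model outside the Trainer, restore it from the run's config and checkpoint files. The paths below are placeholders for your own run outputs:
>>> from TTS.tts.configs.vits_config import VitsConfig
>>> from TTS.tts.models.vits import Vits
>>> config = VitsConfig()
>>> config.load_json("/path/to/run/config.json")  # placeholder path
>>> model = Vits.init_from_config(config)
>>> model.load_checkpoint(config, "/path/to/run/best_model.pth", eval=True)  # placeholder path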
- test_run(assets)[source]#
Generic test run for tts models used by Trainer.
You can override this for a different behaviour.
- Returns
Test figures and audios to be projected to Tensorboard.
- Return type
Tuple[Dict, Dict]
- train_log(batch, outputs, logger, assets, steps)[source]#
Create visualizations and waveform examples.
For example, here you can plot spectrograms and generate sample waveforms from these spectrograms to be projected onto Tensorboard.
- Parameters
ap (AudioProcessor) – audio processor used at training.
batch (Dict) – Model inputs used at the previous training step.
outputs (Dict) – Model outputs generated at the previous training step.
- Returns
training plots and output waveform.
- Return type
Tuple[Dict, np.ndarray]
- train_step(batch, criterion, optimizer_idx)[source]#
Perform a single training step. Run the model forward pass and compute losses.
- Parameters
batch (Dict) – Input tensors.
criterion (nn.Module) – Loss layer designed for the model.
optimizer_idx (int) – Index of optimizer to use. 0 for the generator and 1 for the discriminator networks.
- Returns
Model outputs and computed losses.
- Return type
Tuple[Dict, Dict]
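train_step() is driven by the 🐸 Trainer, which alternates between the two optimizers and passes the matching optimizer_idx. A condensed training sketch in the style of the 🐸TTS recipes; the dataset and output paths are placeholders, and the exact field names may vary slightly between releases:
>>> from trainer import Trainer, TrainerArgs
>>> from TTS.tts.configs.shared_configs import BaseDatasetConfig
>>> from TTS.tts.configs.vits_config import VitsConfig
>>> from TTS.tts.datasets import load_tts_samples
>>> from TTS.tts.models.vits import Vits
>>> dataset_config = BaseDatasetConfig(formatter="ljspeech", path="/path/to/LJSpeech-1.1/")  # placeholder
>>> config = VitsConfig(datasets=[dataset_config], output_path="/path/to/output/", text_cleaner="english_cleaners")
>>> train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)
>>> model = Vits.init_from_config(config, samples=train_samples + eval_samples)
>>> trainer = Trainer(TrainerArgs(), config, config.output_path, model=model,
...                   train_samples=train_samples, eval_samples=eval_samples)
>>> trainer.fit()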
- voice_conversion(y, y_lengths, speaker_cond_src, speaker_cond_tgt)[source]#
Forward pass for voice conversion
TODO: create an end-point for voice conversion
- Parameters
y (Tensor) – Reference spectrograms. Tensor of shape [B, T, C]
y_lengths (Tensor) – Length of each reference spectrogram. Tensor of shape [B]
speaker_cond_src (Tensor) – Reference speaker ID. Tensor of shape [B,]
speaker_cond_tgt (Tensor) – Target speaker ID. Tensor of shape [B,]