VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) is an end-to-end TTS model (the encoder and vocoder are trained together) that combines recent deep learning techniques: GANs, VAEs, and normalizing flows. It does not require external alignment annotations; it learns the text-to-audio alignment with Monotonic Alignment Search (MAS), as explained in the paper. The architecture combines the GlowTTS encoder with the HiFiGAN vocoder. It is a feed-forward model with a real-time factor of x67.12 on a GPU.
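The alignment search mentioned above can be illustrated with a small dynamic program. The following is a minimal pure-Python sketch of Monotonic Alignment Search for a single utterance, not the library's implementation; the function name `monotonic_alignment_search` and the toy `log_lik` matrix are assumptions made for illustration.

```python
from typing import List


def monotonic_alignment_search(log_lik: List[List[float]]) -> List[int]:
    """Return, for every frame, the index of the text token it is aligned to.

    log_lik[i][j] is the log-likelihood of frame j under token i.
    The path starts at token 0, ends at the last token, never goes
    backwards, and never skips a token.
    """
    n_tokens, n_frames = len(log_lik), len(log_lik[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")
    neg_inf = float("-inf")

    # q[i][j]: best cumulative score of a valid path ending at (token i, frame j)
    q = [[neg_inf] * n_frames for _ in range(n_tokens)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]
            advance = q[i - 1][j - 1] if i > 0 else neg_inf
            q[i][j] = max(stay, advance) + log_lik[i][j]

    # Backtrack from the last token at the last frame.
    path = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and q[i - 1][j - 1] > q[i][j - 1]:
            i -= 1
    return path


# Toy example: token 0 fits frames 0-1, token 1 fits frames 2-3.
log_lik = [[0.0, 0.0, -5.0, -5.0],
           [-5.0, -5.0, 0.0, 0.0]]
path = monotonic_alignment_search(log_lik)
durations = [path.count(i) for i in range(len(log_lik))]
print(path)       # -> [0, 0, 1, 1]
print(durations)  # -> [2, 2]
```

The per-token frame counts derived from the path are the durations that a duration predictor can be trained on.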

๐Ÿธ YourTTS is a multi-speaker and multi-lingual TTS model that can perform voice conversion and zero-shot speaker adaptation. It can also learn a new language or voice with a ~ 1 minute long audio clip. This is a big open gate for training TTS models in low-resources languages. ๐Ÿธ YourTTS uses VITS as the backbone architecture coupled with a speaker encoder model.

class TTS.tts.configs.vits_config.VitsConfig(output_path='output', logger_uri=None, run_name='run', project_name=None, run_description='🐸Coqui trainer run.', print_step=25, plot_step=100, model_param_stats=False, wandb_entity=None, dashboard_logger='tensorboard', save_on_interrupt=True, log_model_step=None, save_step=10000, save_n_checkpoints=5, save_checkpoints=True, save_all_best=False, save_best_after=10000, target_loss=None, print_eval=False, test_delay_epochs=0, run_eval=True, run_eval_steps=None, distributed_backend='nccl', distributed_url='tcp://localhost:54321', mixed_precision=False, precision='fp16', epochs=1000, batch_size=32, eval_batch_size=16, grad_clip=<factory>, scheduler_after_epoch=True, lr=0.001, optimizer='AdamW', optimizer_params=<factory>, lr_scheduler=None, lr_scheduler_params=<factory>, use_grad_scaler=False, allow_tf32=False, cudnn_enable=True, cudnn_deterministic=False, cudnn_benchmark=False, training_seed=54321, model='vits', num_loader_workers=0, num_eval_loader_workers=0, use_noise_augment=False, audio=<factory>, use_phonemes=False, phonemizer=None, phoneme_language=None, compute_input_seq_cache=False, text_cleaner=None, enable_eos_bos_chars=False, test_sentences_file='', phoneme_cache_path=None, characters=None, add_blank=True, batch_group_size=0, loss_masking=None, min_audio_len=1, max_audio_len=inf, min_text_len=1, max_text_len=inf, compute_f0=False, compute_energy=False, compute_linear_spec=True, precompute_num_workers=0, start_by_longest=False, shuffle=False, drop_last=False, datasets=<factory>, test_sentences=<factory>, eval_split_max_size=None, eval_split_size=0.01, use_speaker_weighted_sampler=False, speaker_weighted_sampler_alpha=1.0, use_language_weighted_sampler=False, language_weighted_sampler_alpha=1.0, use_length_weighted_sampler=False, length_weighted_sampler_alpha=1.0, model_args=<factory>, lr_gen=0.0002, lr_disc=0.0002, lr_scheduler_gen='ExponentialLR', lr_scheduler_gen_params=<factory>, lr_scheduler_disc='ExponentialLR', lr_scheduler_disc_params=<factory>, kl_loss_alpha=1.0, disc_loss_alpha=1.0, gen_loss_alpha=1.0, feat_loss_alpha=1.0, mel_loss_alpha=45.0, dur_loss_alpha=1.0, speaker_encoder_loss_alpha=1.0, return_wav=True, use_weighted_sampler=False, weighted_sampler_attrs=<factory>, weighted_sampler_multipliers=<factory>, r=1, num_speakers=0, use_speaker_embedding=False, speakers_file=None, speaker_embedding_channels=256, language_ids_file=None, use_language_embedding=False, use_d_vector_file=False, d_vector_file=None, d_vector_dim=None)

Defines parameters for the VITS end-to-end TTS model.

  • model (str) – Model name. Do not change unless you know what you are doing.

  • model_args (VitsArgs) – Model architecture arguments. Defaults to VitsArgs().

  • audio (VitsAudioConfig) – Audio processing configuration. Defaults to VitsAudioConfig().

  • grad_clip (List) – Gradient clipping thresholds for each optimizer. Defaults to [1000.0, 1000.0].

  • lr_gen (float) – Initial learning rate for the generator. Defaults to 0.0002.

  • lr_disc (float) – Initial learning rate for the discriminator. Defaults to 0.0002.

  • lr_scheduler_gen (str) – Name of the learning rate scheduler for the generator. One of the torch.optim.lr_scheduler.*. Defaults to ExponentialLR.

  • lr_scheduler_gen_params (dict) – Parameters for the learning rate scheduler of the generator. Defaults to {"gamma": 0.999875, "last_epoch": -1}.

  • lr_scheduler_disc (str) – Name of the learning rate scheduler for the discriminator. One of the torch.optim.lr_scheduler.*. Defaults to ExponentialLR.

  • lr_scheduler_disc_params (dict) – Parameters for the learning rate scheduler of the discriminator. Defaults to {"gamma": 0.999875, "last_epoch": -1}.

  • scheduler_after_epoch (bool) – If true, step the schedulers after each epoch; otherwise, step them after each training step. Defaults to True.

  • optimizer (str) – Name of the optimizer to use with both the generator and the discriminator networks. One of the torch.optim.*. Defaults to AdamW.

  • kl_loss_alpha (float) – Loss weight for KL loss. Defaults to 1.0.

  • disc_loss_alpha (float) – Loss weight for the discriminator loss. Defaults to 1.0.

  • gen_loss_alpha (float) – Loss weight for the generator loss. Defaults to 1.0.

  • feat_loss_alpha (float) – Loss weight for the feature matching loss. Defaults to 1.0.

  • mel_loss_alpha (float) – Loss weight for the mel loss. Defaults to 45.0.

  • return_wav (bool) – If true, the data loader returns the waveform as well as the other outputs. Do not change. Defaults to True.

  • compute_linear_spec (bool) – If true, the linear spectrogram is computed and returned alongside the mel output. Do not change. Defaults to True.

  • use_weighted_sampler (bool) – If true, use a weighted sampler with bucketing to balance samples between the datasets used in training. Defaults to False.

  • weighted_sampler_attrs (dict) – Keys returned by the formatter to be used for the weighted sampler. For example, {"root_path": 2.0, "speaker_name": 1.0} sets sample probabilities by overweighting root_path by 2.0. Defaults to {}.

  • weighted_sampler_multipliers (dict) – Weight each unique value of a key returned by the formatter for weighted sampling. For example, {"root_path": {"/raid/datasets/libritts-clean-16khz-bwe-coqui_44khz/LibriTTS/train-clean-100/": 1.0, "/raid/datasets/libritts-clean-16khz-bwe-coqui_44khz/LibriTTS/train-clean-360/": 0.5}} samples instances from train-clean-100 twice as often as from train-clean-360. Defaults to {}.

  • r (int) – Number of spectrogram frames to be generated at a time. Do not change. Defaults to 1.

  • add_blank (bool) – If true, a blank token is added in between every character. Defaults to True.

  • test_sentences (List[List]) – List of sentences with speaker and language information to be used for testing.

  • language_ids_file (str) – Path to the language ids file.

  • use_language_embedding (bool) – If true, language embedding is used. Defaults to False.
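To get a feel for the default scheduler settings, the sketch below computes the learning rate that an exponential decay with gamma = 0.999875 yields after a given number of scheduler steps. It reimplements the closed-form rule lr_n = lr_0 * gamma^n in plain Python instead of calling torch.optim.lr_scheduler.ExponentialLR; the helper name `exponential_lr` is an assumption.

```python
def exponential_lr(initial_lr: float, gamma: float, num_steps: int) -> float:
    """Learning rate after `num_steps` scheduler steps of exponential decay."""
    return initial_lr * gamma ** num_steps


lr_gen = 0.0002    # default lr_gen listed above
gamma = 0.999875   # default gamma listed above

for steps in (0, 100, 1000, 10000):
    lr = exponential_lr(lr_gen, gamma, steps)
    print(f"after {steps:>5} scheduler steps: lr = {lr:.3e}")
```

Whether one scheduler step corresponds to an epoch or to a single training step depends on scheduler_after_epoch.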


Check TTS.tts.configs.shared_configs.BaseTTSConfig for the inherited parameters.


>>> from TTS.tts.configs.vits_config import VitsConfig
>>> config = VitsConfig()
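
The add_blank option listed above inserts a blank token around the input symbols. Here is a minimal sketch of that interspersing step, assuming (as in the reference VITS implementation) that a blank is also placed before the first and after the last token. The function name `intersperse` and the blank ID 0 are illustrative assumptions, not the 🐸TTS tokenizer API.

```python
from typing import List


def intersperse(token_ids: List[int], blank_id: int = 0) -> List[int]:
    """Place `blank_id` before, between, and after the given token IDs."""
    result = [blank_id] * (len(token_ids) * 2 + 1)
    result[1::2] = token_ids  # original tokens occupy the odd positions
    return result


print(intersperse([5, 9, 7]))  # -> [0, 5, 0, 9, 0, 7, 0]
```

Under this assumption a sequence of N tokens grows to 2N + 1 tokens, which matters when choosing text length limits.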


class TTS.tts.models.vits.VitsArgs(num_chars=100, out_channels=513, spec_segment_size=32, hidden_channels=192, hidden_channels_ffn_text_encoder=768, num_heads_text_encoder=2, num_layers_text_encoder=6, kernel_size_text_encoder=3, dropout_p_text_encoder=0.1, dropout_p_duration_predictor=0.5, kernel_size_posterior_encoder=5, dilation_rate_posterior_encoder=1, num_layers_posterior_encoder=16, kernel_size_flow=5, dilation_rate_flow=1, num_layers_flow=4, resblock_type_decoder='1', resblock_kernel_sizes_decoder=<factory>, resblock_dilation_sizes_decoder=<factory>, upsample_rates_decoder=<factory>, upsample_initial_channel_decoder=512, upsample_kernel_sizes_decoder=<factory>, periods_multi_period_discriminator=<factory>, use_sdp=True, noise_scale=1.0, inference_noise_scale=0.667, length_scale=1, noise_scale_dp=1.0, inference_noise_scale_dp=1.0, max_inference_len=None, init_discriminator=True, use_spectral_norm_disriminator=False, use_speaker_embedding=False, num_speakers=0, speakers_file=None, d_vector_file=None, speaker_embedding_channels=256, use_d_vector_file=False, d_vector_dim=0, detach_dp_input=True, use_language_embedding=False, embedded_language_dim=4, num_languages=0, language_ids_file=None, use_speaker_encoder_as_loss=False, speaker_encoder_config_path='', speaker_encoder_model_path='', condition_dp_on_speaker=True, freeze_encoder=False, freeze_DP=False, freeze_PE=False, freeze_flow_decoder=False, freeze_waveform_decoder=False, encoder_sample_rate=None, interpolate_z=True, reinit_DP=False, reinit_text_encoder=False)

VITS model arguments.

  • num_chars (int) – Number of characters in the vocabulary. Defaults to 100.

  • out_channels (int) – Number of output channels of the decoder. Defaults to 513.

  • spec_segment_size (int) – Decoder input segment size. Defaults to 32 (32 * hop length = waveform length).

  • hidden_channels (int) – Number of hidden channels of the model. Defaults to 192.

  • hidden_channels_ffn_text_encoder (int) – Number of hidden channels of the feed-forward layers of the text encoder transformer. Defaults to 768.

  • num_heads_text_encoder (int) – Number of attention heads of the text encoder transformer. Defaults to 2.

  • num_layers_text_encoder (int) – Number of transformer layers in the text encoder. Defaults to 6.

  • kernel_size_text_encoder (int) – Kernel size of the text encoder transformer FFN layers. Defaults to 3.

  • dropout_p_text_encoder (float) – Dropout rate of the text encoder. Defaults to 0.1.

  • dropout_p_duration_predictor (float) – Dropout rate of the duration predictor. Defaults to 0.5.

  • kernel_size_posterior_encoder (int) – Kernel size of the posterior encoder's WaveNet layers. Defaults to 5.

  • dilation_rate_posterior_encoder (int) – Dilation rate of the posterior encoder's WaveNet layers. Defaults to 1.

  • num_layers_posterior_encoder (int) – Number of the posterior encoder's WaveNet layers. Defaults to 16.

  • kernel_size_flow (int) – Kernel size of the Residual Coupling layers of the flow network. Defaults to 5.

  • dilation_rate_flow (int) – Dilation rate of the Residual Coupling WaveNet layers of the flow network. Defaults to 1.

  • num_layers_flow (int) – Number of Residual Coupling WaveNet layers of the flow network. Defaults to 4.

  • resblock_type_decoder (str) – Type of the residual block in the decoder network. Defaults to "1".

  • resblock_kernel_sizes_decoder (List[int]) – Kernel sizes of the residual blocks in the decoder network. Defaults to [3, 7, 11].

  • resblock_dilation_sizes_decoder (List[List[int]]) – Dilation sizes of the residual blocks in the decoder network. Defaults to [[1, 3, 5], [1, 3, 5], [1, 3, 5]].

  • upsample_rates_decoder (List[int]) – Upsampling rates for each consecutive upsampling layer in the decoder network. The product of these values must equal the hop length used for computing spectrograms. Defaults to [8, 8, 2, 2].

  • upsample_initial_channel_decoder (int) – Number of hidden channels of the first upsampling convolution layer of the decoder network. Defaults to 512.

  • upsample_kernel_sizes_decoder (List[int]) – Kernel sizes for each upsampling layer of the decoder network. Defaults to [16, 16, 4, 4].

  • periods_multi_period_discriminator (List[int]) – Period values for the VITS Multi-Period Discriminator. Defaults to [2, 3, 5, 7, 11].

  • use_sdp (bool) – Use the Stochastic Duration Predictor. Defaults to True.

  • noise_scale (float) – Noise scale used for the sample noise tensor in training. Defaults to 1.0.

  • inference_noise_scale (float) – Noise scale used for the sample noise tensor in inference. Defaults to 0.667.

  • length_scale (float) – Scale factor for the predicted duration values. Smaller values result in faster speech. Defaults to 1.

  • noise_scale_dp (float) – Noise scale used by the Stochastic Duration Predictor sample noise in training. Defaults to 1.0.

  • inference_noise_scale_dp (float) – Noise scale for the Stochastic Duration Predictor in inference. Defaults to 1.0.

  • max_inference_len (int) – Maximum inference length to limit the memory use. Defaults to None.

  • init_discriminator (bool) – Initialize the discriminator network if set to True. Set to False for inference. Defaults to True.

  • use_spectral_norm_disriminator (bool) – Use spectral normalization instead of weight norm in the discriminator. Defaults to False.

  • use_speaker_embedding (bool) – Enable/Disable speaker embedding for multi-speaker models. Defaults to False.

  • num_speakers (int) – Number of speakers for the speaker embedding layer. Defaults to 0.

  • speakers_file (str) – Path to the speaker mapping file for the Speaker Manager. Defaults to None.

  • speaker_embedding_channels (int) – Number of speaker embedding channels. Defaults to 256.

  • use_d_vector_file (bool) – Enable/Disable the use of d-vectors for multi-speaker training. Defaults to False.

  • d_vector_file (List[str]) – List of paths to the files including pre-computed speaker embeddings. Defaults to None.

  • d_vector_dim (int) – Number of d-vector channels. Defaults to 0.

  • detach_dp_input (bool) – Detach the duration predictor's input from the network to stop the gradients. Defaults to True.

  • use_language_embedding (bool) – Enable/Disable language embedding for multilingual models. Defaults to False.

  • embedded_language_dim (int) – Number of language embedding channels. Defaults to 4.

  • num_languages (int) – Number of languages for the language embedding layer. Defaults to 0.

  • language_ids_file (str) – Path to the language mapping file for the Language Manager. Defaults to None.

  • use_speaker_encoder_as_loss (bool) – Enable/Disable Speaker Consistency Loss (SCL). Defaults to False.

  • speaker_encoder_config_path (str) – Path to the speaker encoder config file, used for SCL. Defaults to "".

  • speaker_encoder_model_path (str) – Path to the speaker encoder checkpoint file, used for SCL. Defaults to "".

  • condition_dp_on_speaker (bool) – Condition the duration predictor on the speaker embedding. Defaults to True.

  • freeze_encoder (bool) – Freeze the encoder weights during training. Defaults to False.

  • freeze_DP (bool) – Freeze the duration predictor weights during training. Defaults to False.

  • freeze_PE (bool) – Freeze the posterior encoder weights during training. Defaults to False.

  • freeze_flow_decoder (bool) – Freeze the flow decoder weights during training. Defaults to False.

  • freeze_waveform_decoder (bool) – Freeze the waveform decoder weights during training. Defaults to False.

  • encoder_sample_rate (int) – If not None, this sample rate is used for training the posterior encoder, flow, text encoder, and duration predictor. The decoder part (vocoder) is trained with the sample rate set in the audio configuration. Defaults to None.

  • interpolate_z (bool) – If encoder_sample_rate is not None and this parameter is True, nearest-neighbor interpolation is used to upsample the latent variable z from the encoder_sample_rate sampling rate to the sample rate set in the audio configuration. If it is False, you will need to add extra upsample_rates_decoder to match the shape. Defaults to True.
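
The constraint on upsample_rates_decoder can be checked with a few lines of arithmetic before training. The sketch below assumes a hop length of 256, a value chosen for illustration (the actual value comes from the audio configuration); `check_decoder_upsampling` is a hypothetical helper, not part of the library.

```python
import math
from typing import List


def check_decoder_upsampling(upsample_rates: List[int], hop_length: int) -> int:
    """Return the total upsampling factor, or raise if it differs from hop_length."""
    total = math.prod(upsample_rates)
    if total != hop_length:
        raise ValueError(
            f"product of upsample rates is {total}, expected hop length {hop_length}"
        )
    return total


upsample_rates_decoder = [8, 8, 2, 2]  # default listed above
hop_length = 256                       # assumption for this example

print(check_decoder_upsampling(upsample_rates_decoder, hop_length))  # -> 256
```

If the hop length changes, the upsampling rates have to change with it so that each latent frame still expands to exactly one hop of waveform samples.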

Vits Model

class TTS.tts.models.vits.Vits(config, ap=None, tokenizer=None, speaker_manager=None, language_manager=None)

VITS TTS model


Paper Abstract:

Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.

Check TTS.tts.configs.vits_config.VitsConfig for class arguments.


>>> from TTS.tts.configs.vits_config import VitsConfig
>>> from TTS.tts.models.vits import Vits
>>> config = VitsConfig()
>>> model = Vits(config)

export_onnx(output_path='coqui_vits.onnx', verbose=True)

Export model to ONNX format for inference

  • output_path (str) – Path to save the exported model.

  • verbose (bool) – Print verbose information. Defaults to True.


Compute speaker IDs, language IDs, and d_vectors for the batch if necessary.


Compute spectrograms on the device.

forward(x, x_lengths, y, y_lengths, waveform, aux_input={'d_vectors': None, 'language_ids': None, 'speaker_ids': None})

Forward pass of the model.

  • x (torch.tensor) – Batch of input character sequence IDs.

  • x_lengths (torch.tensor) – Batch of input character sequence lengths.

  • y (torch.tensor) – Batch of input spectrograms.

  • y_lengths (torch.tensor) – Batch of input spectrogram lengths.

  • waveform (torch.tensor) – Batch of ground truth waveforms per sample.

  • aux_input (dict, optional) – Auxiliary inputs for multi-speaker and multi-lingual training. Defaults to {"d_vectors": None, "speaker_ids": None, "language_ids": None}.


Returns: Model outputs keyed by the output name.

Return type:

Dict

Shapes:
  • x: [B, T_seq]

  • x_lengths: [B]

  • y: [B, C, T_spec]

  • y_lengths: [B]

  • waveform: [B, 1, T_wav]

  • d_vectors: [B, C, 1]

  • speaker_ids: [B]

  • language_ids: [B]

Return Shapes:
  • model_outputs: [B, 1, T_wav]

  • alignments: [B, T_seq, T_dec]

  • z: [B, C, T_dec]

  • z_p: [B, C, T_dec]

  • m_p: [B, C, T_dec]

  • logs_p: [B, C, T_dec]

  • m_q: [B, C, T_dec]

  • logs_q: [B, C, T_dec]

  • waveform_seg: [B, 1, spec_seg_size * hop_length]

  • gt_spk_emb: [B, 1, speaker_encoder.proj_dim]

  • syn_spk_emb: [B, 1, speaker_encoder.proj_dim]
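
The waveform_seg shape follows from how a spectrogram segment maps onto waveform samples. Here is a minimal sketch of that index arithmetic, assuming each spectrogram frame corresponds to hop_length samples. The helper name `segment_to_sample_range` and the hop length of 256 are assumptions for illustration.

```python
from typing import Tuple


def segment_to_sample_range(
    start_frame: int, spec_segment_size: int, hop_length: int
) -> Tuple[int, int]:
    """Map a window of spectrogram frames to the matching waveform sample range."""
    start = start_frame * hop_length
    end = (start_frame + spec_segment_size) * hop_length
    return start, end


start, end = segment_to_sample_range(start_frame=10, spec_segment_size=32, hop_length=256)
print(start, end, end - start)  # -> 2560 10752 8192
```

The slice length is always spec_segment_size * hop_length, independent of where the segment starts.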


Get the criterion for each optimizer. The index in the output list matches the optimizer index used in train_step().


Set the initial learning rates for each optimizer.


Returns: Learning rates for each optimizer.

Return type:

List



Initiate and return the GAN optimizers based on the config parameters. It returns 2 optimizers in a list. The first one is for the generator and the second one is for the discriminator.

Returns: optimizers.

Return type:

List


Set the schedulers for each optimizer.


optimizer (List[torch.optim.Optimizer]) – List of optimizers.


Returns: Schedulers, one for each optimizer.

Return type:

List


inference(x, aux_input={'d_vectors': None, 'durations': None, 'language_ids': None, 'speaker_ids': None, 'x_lengths': None})


To run in batch mode, provide x_lengths; otherwise, the model assumes that the batch size is 1.
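
A minimal sketch of preparing a padded batch together with x_lengths, using plain Python lists in place of tensors. The padding ID 0 and the helper name `pad_batch` are assumptions. In practice the results would be converted to tensors of shape [B, T_seq] and [B], with x_lengths passed through aux_input as the signature above suggests.

```python
from typing import List, Tuple


def pad_batch(
    sequences: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Right-pad token ID sequences to a common length and record true lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


x, x_lengths = pad_batch([[4, 8, 15], [16, 23], [42]])
print(x)          # -> [[4, 8, 15], [16, 23, 0], [42, 0, 0]]
print(x_lengths)  # -> [3, 2, 1]
```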


  • x: [B, T_seq]

  • x_lengths: [B]

  • d_vectors: [B, C]

  • speaker_ids: [B]

Return Shapes:
  • model_outputs: [B, 1, T_wav]

  • alignments: [B, T_seq, T_dec]

  • z: [B, C, T_dec]

  • z_p: [B, C, T_dec]

  • m_p: [B, C, T_dec]

  • logs_p: [B, C, T_dec]
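
T_dec in the shapes above depends on the predicted durations. The sketch below shows how a length scale can stretch or compress them, following the rule used in the reference VITS implementation, duration = ceil(exp(log_duration) * length_scale). The function name `scale_durations` and the toy log-durations are assumptions, and the 🐸TTS code may differ in details.

```python
import math
from typing import List


def scale_durations(log_durations: List[float], length_scale: float = 1.0) -> List[int]:
    """Turn predicted log-durations into integer frame counts per token."""
    return [math.ceil(math.exp(logw) * length_scale) for logw in log_durations]


log_durations = [0.5, 1.2, 0.0]  # toy predictions for three tokens

normal = scale_durations(log_durations, length_scale=1.0)
fast = scale_durations(log_durations, length_scale=0.5)
print(normal, sum(normal))  # -> [2, 4, 1] 7
print(fast, sum(fast))      # -> [1, 2, 1] 4
```

The sum of the durations is the number of decoder frames, so a smaller length scale shortens the output and speeds up the speech.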

inference_onnx(x, x_lengths=None, speaker_id=None, language_id=None)

ONNX inference

inference_voice_conversion(reference_wav, speaker_id=None, d_vector=None, reference_speaker_id=None, reference_d_vector=None)

Inference for voice conversion

  • reference_wav (Tensor) – Reference waveform. Tensor of shape [B, T]

  • speaker_id (Tensor) – speaker_id of the target speaker. Tensor of shape [B]

  • d_vector (Tensor) – d_vector embedding of the target speaker. Tensor of shape [B, C]

  • reference_speaker_id (Tensor) – speaker_id of the reference_wav speaker. Tensor of shape [B]

  • reference_d_vector (Tensor) – d_vector embedding of the reference_wav speaker. Tensor of shape [B, C]

static init_from_config(config, samples=None, verbose=True)

Initiate model from config

  • config (VitsConfig) – Model config.

  • samples (Union[List[List], List[Dict]]) – Training samples to parse speaker ids for training. Defaults to None.


Initialize multilingual modules of a model.


config (Coqpit) – Model configuration.


Initialize multi-speaker modules of a model. A model can be trained either with a speaker embedding layer or with external d_vectors computed from a speaker encoder model.

You must provide a speaker_manager at initialization to set up the multi-speaker modules.

  • config (Coqpit) – Model configuration.

  • data (List, optional) – Dataset items to infer number of speakers. Defaults to None.


Initialize upsampling modules of a model.

load_checkpoint(config, checkpoint_path, eval=False, strict=True, cache=False)

Load the model checkpoint and set it up for training or inference.

load_fairseq_checkpoint(config, checkpoint_dir, eval=False, strict=True)

Load VITS checkpoints released by fairseq. Performs some changes for compatibility.

  • config (Coqpit) – 🐸TTS model config.

  • checkpoint_dir (str) – Path to the checkpoint directory.

  • eval (bool, optional) – Set to True for evaluation. Defaults to False.


Freeze layers at the beginning of an epoch.


Reinitialize layers if needed.


Generic test run for TTS models used by Trainer.

You can override this for a different behaviour.


Returns: Test figures and audios to be projected to Tensorboard.

Return type:

Tuple[Dict, Dict]

train_log(batch, outputs, logger, assets, steps)

Create visualizations and waveform examples.

For example, here you can plot spectrograms and generate sample waveforms from these spectrograms to be projected onto Tensorboard.

  • ap (AudioProcessor) – Audio processor used at training.

  • batch (Dict) – Model inputs used at the previous training step.

  • outputs (Dict) – Model outputs generated at the previous training step.


Returns: Training plots and output waveform.

Return type:

Tuple[Dict, np.ndarray]

train_step(batch, criterion, optimizer_idx)

Perform a single training step. Run the model forward pass and compute losses.

  • batch (Dict) – Input tensors.

  • criterion (nn.Module) – Loss layer designed for the model.

  • optimizer_idx (int) – Index of optimizer to use. 0 for the generator and 1 for the discriminator networks.


Returns: Model outputs and computed losses.

Return type:

Tuple[Dict, Dict]

voice_conversion(y, y_lengths, speaker_cond_src, speaker_cond_tgt)

Forward pass for voice conversion

TODO: create an end-point for voice conversion

  • y (Tensor) – Reference spectrograms. Tensor of shape [B, T, C]

  • y_lengths (Tensor) – Length of each reference spectrogram. Tensor of shape [B]

  • speaker_cond_src (Tensor) – Reference speaker ID. Tensor of shape [B,]

  • speaker_cond_tgt (Tensor) – Target speaker ID. Tensor of shape [B,]