VITS

VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) is an end-to-end (encoder -> vocoder trained together) TTS model that takes advantage of SOTA deep learning techniques such as GANs, VAEs, and normalizing flows. It does not require external alignment annotations; it learns the text-to-audio alignment with Monotonic Alignment Search (MAS), as explained in the paper. The model architecture combines the GlowTTS encoder and the HiFiGAN vocoder. It is a feed-forward model with a real-time factor of x67.12 on a GPU.
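
A minimal sketch of this single-network design, assuming only the VitsConfig and Vits classes documented below, with random character IDs standing in for a real tokenized input:

>>> import torch
>>> from TTS.tts.configs import VitsConfig
>>> from TTS.tts.models.vits import Vits
>>> config = VitsConfig()
>>> model = Vits(config).eval()
>>> x = torch.randint(0, config.model_args.num_chars, (1, 50))  # [B, T_seq] character IDs
>>> outputs = model.inference(x)  # dict of outputs; the waveform is generated directly, with no separate vocoder call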

Important resources & papers

  • VITS paper: https://arxiv.org/pdf/2106.06103.pdf

VitsConfig

class TTS.tts.configs.vits_config.VitsConfig(model='vits', run_name='coqui_tts', run_description='', epochs=10000, batch_size=None, eval_batch_size=None, mixed_precision=False, scheduler_after_epoch=True, run_eval=True, test_delay_epochs=0, print_eval=False, dashboard_logger='tensorboard', print_step=25, plot_step=100, model_param_stats=False, project_name=None, log_model_step=None, wandb_entity=None, save_step=10000, checkpoint=True, keep_all_best=False, keep_after=10000, num_loader_workers=0, num_eval_loader_workers=0, use_noise_augment=False, output_path=None, distributed_backend='nccl', distributed_url='tcp://localhost:54321', audio=<factory>, use_phonemes=False, use_espeak_phonemes=True, phoneme_language=None, compute_input_seq_cache=False, text_cleaner=None, enable_eos_bos_chars=False, test_sentences_file='', phoneme_cache_path=None, characters=None, batch_group_size=0, loss_masking=None, sort_by_audio_len=True, min_seq_len=0, max_seq_len=500000, compute_f0=False, compute_linear_spec=True, add_blank=True, datasets=<factory>, optimizer='AdamW', optimizer_params=<factory>, lr_scheduler='', lr_scheduler_params=<factory>, test_sentences=<factory>, use_speaker_embedding=False, use_d_vector_file=False, d_vector_dim=0, model_args=<factory>, grad_clip=<factory>, lr_gen=0.0002, lr_disc=0.0002, lr_scheduler_gen='ExponentialLR', lr_scheduler_gen_params=<factory>, lr_scheduler_disc='ExponentialLR', lr_scheduler_disc_params=<factory>, kl_loss_alpha=1.0, disc_loss_alpha=1.0, gen_loss_alpha=1.0, feat_loss_alpha=1.0, mel_loss_alpha=45.0, dur_loss_alpha=1.0, return_wav=True, r=1)[source]

Defines parameters for VITS End2End TTS model.

Parameters
  • model (str) – Model name. Do not change unless you know what you are doing.

  • model_args (VitsArgs) – Model architecture arguments. Defaults to VitsArgs().

  • grad_clip (List) – Gradient clipping thresholds for each optimizer. Defaults to [5.0, 5.0].

  • lr_gen (float) – Initial learning rate for the generator. Defaults to 0.0002.

  • lr_disc (float) – Initial learning rate for the discriminator. Defaults to 0.0002.

  • lr_scheduler_gen (str) – Name of the learning rate scheduler for the generator. One of the torch.optim.lr_scheduler.*. Defaults to ExponentialLR.

  • lr_scheduler_gen_params (dict) – Parameters for the learning rate scheduler of the generator. Defaults to {"gamma": 0.999875, "last_epoch": -1}.

  • lr_scheduler_disc (str) – Name of the learning rate scheduler for the discriminator. One of the torch.optim.lr_scheduler.*. Defaults to ExponentialLR.

  • lr_scheduler_disc_params (dict) – Parameters for the learning rate scheduler of the discriminator. Defaults to {"gamma": 0.999875, "last_epoch": -1}.

  • scheduler_after_epoch (bool) – If true, step the schedulers after each epoch, otherwise after each training step. Defaults to True.

  • optimizer (str) – Name of the optimizer to use with both the generator and the discriminator networks. One of the torch.optim.*. Defaults to AdamW.

  • kl_loss_alpha (float) – Loss weight for KL loss. Defaults to 1.0.

  • disc_loss_alpha (float) – Loss weight for the discriminator loss. Defaults to 1.0.

  • gen_loss_alpha (float) – Loss weight for the generator loss. Defaults to 1.0.

  • feat_loss_alpha (float) – Loss weight for the feature matching loss. Defaults to 1.0.

  • mel_loss_alpha (float) – Loss weight for the mel loss. Defaults to 45.0.

  • return_wav (bool) – If true, data loader returns the waveform as well as the other outputs. Do not change. Defaults to True.

  • compute_linear_spec (bool) – If true, the linear spectrogram is computed and returned alongside the mel output. Do not change. Defaults to True.

  • sort_by_audio_len (bool) – If true, the data loader sorts the data by audio length, otherwise by input text length. Defaults to True.

  • min_seq_len (int) – Minimum sequence length to be considered for training. Defaults to 0.

  • max_seq_len (int) – Maximum sequence length to be considered for training. Defaults to 500000.

  • r (int) – Number of spectrogram frames to be generated at a time. Do not change. Defaults to 1.

  • add_blank (bool) – If true, a blank token is added in between every character. Defaults to True.

  • test_sentences (List[str]) – List of sentences to be used for testing.

Note

Check TTS.tts.configs.shared_configs.BaseTTSConfig for the inherited parameters.

Example

>>> from TTS.tts.configs import VitsConfig
>>> config = VitsConfig()
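
Individual fields documented above can also be overridden at construction time; a sketch with a few commonly tuned values (the numbers are illustrative, not recommendations):

>>> config = VitsConfig(
...     batch_size=32,
...     mixed_precision=True,
...     lr_gen=0.0002,
...     lr_disc=0.0002,
...     mel_loss_alpha=45.0,
... )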

VitsArgs

class TTS.tts.models.vits.VitsArgs(num_chars=100, out_channels=513, spec_segment_size=32, hidden_channels=192, hidden_channels_ffn_text_encoder=768, num_heads_text_encoder=2, num_layers_text_encoder=6, kernel_size_text_encoder=3, dropout_p_text_encoder=0.1, dropout_p_duration_predictor=0.5, kernel_size_posterior_encoder=5, dilation_rate_posterior_encoder=1, num_layers_posterior_encoder=16, kernel_size_flow=5, dilation_rate_flow=1, num_layers_flow=4, resblock_type_decoder='1', resblock_kernel_sizes_decoder=<factory>, resblock_dilation_sizes_decoder=<factory>, upsample_rates_decoder=<factory>, upsample_initial_channel_decoder=512, upsample_kernel_sizes_decoder=<factory>, use_sdp=True, noise_scale=1.0, inference_noise_scale=0.667, length_scale=1, noise_scale_dp=1.0, inference_noise_scale_dp=1.0, max_inference_len=None, init_discriminator=True, use_spectral_norm_disriminator=False, use_speaker_embedding=False, num_speakers=0, speakers_file=None, speaker_embedding_channels=256, use_d_vector_file=False, d_vector_dim=0, detach_dp_input=True)[source]

VITS model arguments.

Parameters
  • num_chars (int) – Number of characters in the vocabulary. Defaults to 100.

  • out_channels (int) – Number of output channels. Defaults to 513.

  • spec_segment_size (int) – Decoder input segment size. Defaults to 32 (segment waveform length = 32 * hop_length).

  • hidden_channels (int) – Number of hidden channels of the model. Defaults to 192.

  • hidden_channels_ffn_text_encoder (int) – Number of hidden channels of the feed-forward layers of the text encoder transformer. Defaults to 768.

  • num_heads_text_encoder (int) – Number of attention heads of the text encoder transformer. Defaults to 2.

  • num_layers_text_encoder (int) – Number of transformer layers in the text encoder. Defaults to 6.

  • kernel_size_text_encoder (int) – Kernel size of the text encoder transformer FFN layers. Defaults to 3.

  • dropout_p_text_encoder (float) – Dropout rate of the text encoder. Defaults to 0.1.

  • dropout_p_duration_predictor (float) – Dropout rate of the duration predictor. Defaults to 0.5.

  • kernel_size_posterior_encoder (int) – Kernel size of the posterior encoder’s WaveNet layers. Defaults to 5.

  • dilation_rate_posterior_encoder (int) – Dilation rate of the posterior encoder’s WaveNet layers. Defaults to 1.

  • num_layers_posterior_encoder (int) – Number of posterior encoder’s WaveNet layers. Defaults to 16.

  • kernel_size_flow (int) – Kernel size of the Residual Coupling layers of the flow network. Defaults to 5.

  • dilation_rate_flow (int) – Dilation rate of the Residual Coupling WaveNet layers of the flow network. Defaults to 1.

  • num_layers_flow (int) – Number of Residual Coupling WaveNet layers of the flow network. Defaults to 4.

  • resblock_type_decoder (str) – Type of the residual block in the decoder network. Defaults to “1”.

  • resblock_kernel_sizes_decoder (List[int]) – Kernel sizes of the residual blocks in the decoder network. Defaults to [3, 7, 11].

  • resblock_dilation_sizes_decoder (List[List[int]]) – Dilation sizes of the residual blocks in the decoder network. Defaults to [[1, 3, 5], [1, 3, 5], [1, 3, 5]].

  • upsample_rates_decoder (List[int]) – Upsampling rates for each consecutive upsampling layer in the decoder network. The product of these values must equal the hop length used for computing spectrograms. Defaults to [8, 8, 2, 2].

  • upsample_initial_channel_decoder (int) – Number of hidden channels of the first upsampling convolution layer of the decoder network. Defaults to 512.

  • upsample_kernel_sizes_decoder (List[int]) – Kernel sizes for each upsampling layer of the decoder network. Defaults to [16, 16, 4, 4].

  • use_sdp (bool) – Use Stochastic Duration Predictor. Defaults to True.

  • noise_scale (float) – Noise scale used for the sample noise tensor in training. Defaults to 1.0.

  • inference_noise_scale (float) – Noise scale used for the sample noise tensor in inference. Defaults to 0.667.

  • length_scale (float) – Scale factor for the predicted duration values. Smaller values result in faster speech. Defaults to 1.

  • noise_scale_dp (float) – Noise scale used by the Stochastic Duration Predictor sample noise in training. Defaults to 1.0.

  • inference_noise_scale_dp (float) – Noise scale for the Stochastic Duration Predictor in inference. Defaults to 1.0.

  • max_inference_len (int) – Maximum inference length to limit the memory use. Defaults to None.

  • init_discriminator (bool) – Initialize the discriminator network if set to True. Set to False for inference. Defaults to True.

  • use_spectral_norm_disriminator (bool) – Use spectral normalization instead of weight normalization in the discriminator. Defaults to False.

  • use_speaker_embedding (bool) – Enable/Disable speaker embedding for multi-speaker models. Defaults to False.

  • num_speakers (int) – Number of speakers for the speaker embedding layer. Defaults to 0.

  • speakers_file (str) – Path to the speaker mapping file for the Speaker Manager. Defaults to None.

  • speaker_embedding_channels (int) – Number of speaker embedding channels. Defaults to 256.

  • use_d_vector_file (bool) – Enable/Disable the use of d-vectors for multi-speaker training. Defaults to False.

  • d_vector_dim (int) – Number of d-vector channels. Defaults to 0.

  • detach_dp_input (bool) – Detach the duration predictor’s input from the rest of the network to stop gradient flow through it. Defaults to True.
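
As a sketch of how these arguments are used, a multi-speaker setup with a learned speaker embedding layer can be described with VitsArgs and passed to VitsConfig through model_args (the speaker count is an arbitrary example value):

>>> from TTS.tts.configs import VitsConfig
>>> from TTS.tts.models.vits import VitsArgs
>>> model_args = VitsArgs(use_speaker_embedding=True, num_speakers=4, speaker_embedding_channels=256)
>>> config = VitsConfig(model_args=model_args, use_speaker_embedding=True)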

Vits Model

class TTS.tts.models.vits.Vits(config)[source]

VITS TTS model

Paper::

https://arxiv.org/pdf/2106.06103.pdf

Paper Abstract::

Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.

Check TTS.tts.configs.vits_config.VitsConfig for class arguments.

Examples

>>> from TTS.tts.configs import VitsConfig
>>> from TTS.tts.models.vits import Vits
>>> config = VitsConfig()
>>> model = Vits(config)
forward(x, x_lengths, y, y_lengths, aux_input={'d_vectors': None, 'speaker_ids': None})[source]

Forward pass of the model.

Parameters
  • x (torch.tensor) – Batch of input character sequence IDs.

  • x_lengths (torch.tensor) – Batch of input character sequence lengths.

  • y (torch.tensor) – Batch of input spectrograms.

  • y_lengths (torch.tensor) – Batch of input spectrogram lengths.

  • aux_input (dict, optional) – Auxiliary inputs for multi-speaker training. Defaults to {“d_vectors”: None, “speaker_ids”: None}.

Returns

model outputs keyed by the output name.

Return type

Dict

Shapes

  • x: [B, T_seq]

  • x_lengths: [B]

  • y: [B, C, T_spec]

  • y_lengths: [B]

  • d_vectors: [B, C, 1]

  • speaker_ids: [B]
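
A shape-level sketch of a forward call with dummy tensors following the shapes above (using the model from the Examples block; the spectrogram lengths are kept above spec_segment_size so a training segment can be sampled):

>>> import torch
>>> x = torch.randint(0, 100, (2, 20))   # [B, T_seq] character IDs
>>> x_lengths = torch.tensor([20, 15])   # [B]
>>> y = torch.randn(2, 513, 60)          # [B, C, T_spec] linear spectrograms, C = out_channels
>>> y_lengths = torch.tensor([60, 48])   # [B]
>>> outputs = model.forward(x, x_lengths, y, y_lengths)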

get_criterion()[source]

Return the loss criteria for each optimizer. The index in the returned list matches the optimizer index used in train_step().

get_lr()[source]

Return the initial learning rates for each optimizer.

Returns

learning rates for each optimizer.

Return type

List

get_optimizer()[source]

Initialize and return the GAN optimizers based on the config parameters.

It returns 2 optimizers in a list. The first one is for the generator and the second one is for the discriminator.

Returns

optimizers.

Return type

List

get_scheduler(optimizer)[source]

Set the schedulers for each optimizer.

Parameters

optimizer (List[torch.optim.Optimizer]) – List of optimizers.

Returns

Schedulers, one for each optimizer.

Return type

List
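
These hooks are normally driven by the Trainer; a simplified sketch of how they line up for a single batch, assuming a batch dict already prepared by the data loader and assuming the computed loss dict exposes its total under the "loss" key:

>>> criteria = model.get_criterion()        # one loss module per optimizer
>>> optimizers = model.get_optimizer()      # [generator optimizer, discriminator optimizer]
>>> schedulers = model.get_scheduler(optimizers)
>>> for optimizer_idx in (0, 1):            # 0: generator, 1: discriminator
...     outputs, losses = model.train_step(batch, criteria[optimizer_idx], optimizer_idx)
...     optimizers[optimizer_idx].zero_grad()
...     losses["loss"].backward()
...     optimizers[optimizer_idx].step()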

inference(x, aux_input={'d_vectors': None, 'speaker_ids': None})[source]

Shapes

  • x: [B, T_seq]

  • d_vectors: [B, C, 1]

  • speaker_ids: [B]
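
For a multi-speaker model (see init_multispeaker() below), the speaker identity is passed through aux_input; a sketch assuming a model built with use_speaker_embedding=True and num_speakers > 0:

>>> x = torch.randint(0, 100, (1, 30))  # [B, T_seq]
>>> outputs = model.inference(x, aux_input={"d_vectors": None, "speaker_ids": torch.tensor([0])})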

init_multispeaker(config, data=None)[source]

Initialize multi-speaker modules of a model. A model can be trained either with a speaker embedding layer or with external d_vectors computed from a speaker encoder model.

If you need a different behaviour, override this function for your model.

Parameters
  • config (Coqpit) – Model configuration.

  • data (List, optional) – Dataset items to infer number of speakers. Defaults to None.

load_checkpoint(config, checkpoint_path, eval=False)[source]

Load the model checkpoint and set it up for training or inference.

static make_symbols(config)[source]

Create a custom arrangement of the symbols used by the model. The output list of symbols propagates through the whole training and inference pipeline.

test_run(ap)[source]

Generic test run for TTS models, used by the Trainer.

You can override this for a different behaviour.

Returns

Test figures and audio samples to be logged to Tensorboard.

Return type

Tuple[Dict, Dict]

train_log(ap, batch, outputs, name_prefix='train')[source]

Create visualizations and waveform examples.

For example, here you can plot spectrograms and generate sample waveforms from these spectrograms to be logged to Tensorboard.

Parameters
  • ap (AudioProcessor) – audio processor used at training.

  • batch (Dict) – Model inputs used at the previous training step.

  • outputs (Dict) – Model outputs generated at the previous training step.

Returns

training plots and output waveform.

Return type

Tuple[Dict, np.ndarray]

train_step(batch, criterion, optimizer_idx)[source]

Perform a single training step. Run the model forward pass and compute losses.

Parameters
  • batch (Dict) – Input tensors.

  • criterion (nn.Module) – Loss layer designed for the model.

  • optimizer_idx (int) – Index of optimizer to use. 0 for the generator and 1 for the discriminator networks.

Returns

Model outputs and computed losses.

Return type

Tuple[Dict, Dict]

voice_conversion(y, y_lengths, sid_src, sid_tgt)[source]

TODO: create an end-point for voice conversion