🌮 Tacotron 1 and 2

Tacotron is one of the first successful DL-based text-to-mel models and opened up the whole TTS field for more DL research.

Tacotron is mainly an encoder-decoder model with attention.

The encoder takes input tokens (characters or phonemes) and the decoder outputs mel-spectrogram frames. The attention module in between learns to align the input tokens with the output mel-spectrogram frames.

Tacotron1 and 2 are both built on the same encoder-decoder architecture but they use different layers. Additionally, Tacotron1 uses a Postnet module to convert mel-spectrograms to linear spectrograms with a higher resolution before the vocoder.
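The difference in output resolution shows up in the config signatures further down this page, where `out_channels` is 513 for TacotronConfig and 80 for Tacotron2Config. As a rough sketch of where a number like 513 can come from, a one-sided STFT magnitude spectrogram has `fft_size // 2 + 1` frequency bins. The FFT size of 1024 below is an assumption made for illustration; it is not stated on this page.

```python
def linear_spec_bins(fft_size: int) -> int:
    """Number of frequency bins in a one-sided magnitude spectrogram."""
    return fft_size // 2 + 1


# Assumed FFT size (illustrative, not read from any config on this page).
print(linear_spec_bins(1024))  # 513
```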

Vanilla Tacotron models are slow at inference because their auto-regressive nature prevents the model from processing all the inputs in parallel. One trick is to use a higher “reduction rate”, which lets the model predict multiple frames per decoder step. For example, a reduction rate of 2 halves the number of decoder iterations.
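The effect of the reduction rate on the number of decoder iterations is simple arithmetic. The helper below is a hypothetical illustration (the name `decoder_steps` is not part of 🐸TTS): with `r` frames predicted per step, a spectrogram of `num_frames` frames needs about `num_frames / r` steps, rounded up.

```python
import math


def decoder_steps(num_frames: int, r: int) -> int:
    """Decoder iterations needed to emit num_frames frames, r frames per step."""
    if r < 1:
        raise ValueError("reduction rate must be >= 1")
    return math.ceil(num_frames / r)


for r in (1, 2, 6):
    print(r, decoder_steps(600, r))
# 1 600
# 2 300
# 6 100
```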

Tacotron also uses a Prenet module with Dropout that projects the model’s previous output before feeding it to the decoder again. The paper and most implementations keep the Dropout layer active even at inference, and they report that the attention fails or the voice quality degrades otherwise. The drawback is that you get slightly different output speech every time you run the model.
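The run-to-run variation follows directly from how dropout works. The toy function below is a plain-Python sketch of inverted dropout, not the Prenet used by 🐸TTS; the `frame` values and the dropout probability are made up for illustration. Two unseeded runs on the same input will usually differ, while fixing the random seed makes the result repeatable.

```python
import random


def dropout(values, p, rng):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


frame = [0.2, -0.5, 1.0, 0.7, -0.1, 0.4]

# Unseeded generators stand in for dropout left active at inference:
# two runs on the same input usually give different activations.
run_a = dropout(frame, 0.5, random.Random())
run_b = dropout(frame, 0.5, random.Random())

# Fixing the seed makes the run repeatable.
seeded_a = dropout(frame, 0.5, random.Random(0))
seeded_b = dropout(frame, 0.5, random.Random(0))
print(seeded_a == seeded_b)  # True
```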

Training the attention is notoriously problematic in Tacotron models. In particular, at inference the alignment fails for some input sequences and causes the model to produce unexpected results. Many different methods have been proposed to improve the attention.

After hundreds of experiments, we at 🐸TTS suggest Double Decoder Consistency, which gave the most robust model performance in our tests.
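A minimal config sketch for turning on Double Decoder Consistency, using only fields that appear in the Tacotron2Config signature on this page. The values are illustrative (they match the listed defaults for `r` and `ddc_r`), and the snippet assumes the 🐸TTS package is installed.

```python
from TTS.tts.configs.tacotron2_config import Tacotron2Config

config = Tacotron2Config(
    r=2,                              # reduction rate of the main decoder
    double_decoder_consistency=True,  # enable the second, coarse decoder
    ddc_r=6,                          # coarse decoder rate, a multiple of r
)
```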

If you have limited VRAM, you can try the Guided Attention Loss or the Dynamic Convolutional Attention instead. You can also combine the two.
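For intuition, the Guided Attention Loss penalizes attention mass that falls far from the diagonal of the alignment matrix. The sketch below builds the penalty weights with the formulation from the guided attention paper (Tachibana et al.), W[n][t] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)). The function names, the plain-list representation, and the width g = 0.2 are assumptions made for illustration and are not taken from the 🐸TTS implementation.

```python
import math


def guided_attention_weights(n_in, n_out, g=0.2):
    """Penalty matrix [n_in][n_out]: 0 on the diagonal, approaching 1 far from it."""
    return [
        [1.0 - math.exp(-((n / n_in - t / n_out) ** 2) / (2.0 * g * g))
         for t in range(n_out)]
        for n in range(n_in)
    ]


def guided_attention_loss(alignment, g=0.2):
    """Mean of alignment * weights; alignment is a nested list [n_in][n_out]."""
    n_in, n_out = len(alignment), len(alignment[0])
    weights = guided_attention_weights(n_in, n_out, g)
    total = sum(
        alignment[n][t] * weights[n][t]
        for n in range(n_in)
        for t in range(n_out)
    )
    return total / (n_in * n_out)


# A perfectly diagonal alignment is not penalized; a flipped one is.
diagonal = [[1.0 if n == t else 0.0 for t in range(4)] for n in range(4)]
flipped = [row[::-1] for row in diagonal]
print(guided_attention_loss(diagonal))       # 0.0
print(guided_attention_loss(flipped) > 0.0)  # True
```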

Important resources & papers

BaseTacotron

class TTS.tts.models.base_tacotron.BaseTacotron(config, ap, tokenizer, speaker_manager=None)

Base class shared by Tacotron and Tacotron2

compute_capacitron_VAE_embedding(inputs, reference_mel_info, text_info=None, speaker_embedding=None)

Capacitron Variational Autoencoder

compute_gst(inputs, style_input, speaker_embedding=None)

Compute global style token

compute_masks(text_lengths, mel_lengths)

Compute masks against sequence paddings.
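As an illustration of what a padding mask looks like, the sketch below builds one from a list of sequence lengths. It uses plain Python lists and a hypothetical helper name (`sequence_mask`); the library method works on batched tensors, so this is a conceptual sketch rather than its implementation.

```python
def sequence_mask(lengths, max_len=None):
    """Boolean mask [B][T]: True for real positions, False for padding."""
    if max_len is None:
        max_len = max(lengths)
    return [[i < length for i in range(max_len)] for length in lengths]


text_lengths = [3, 5]
for row in sequence_mask(text_lengths):
    print(row)
# [True, True, True, False, False]
# [True, True, True, True, True]
```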

get_criterion()

Get the model criterion used in training.

static init_from_config(config)

Initialize model from config.

load_checkpoint(config, checkpoint_path, eval=False, cache=False)

Load model checkpoint and set up internals.

Parameters:
  • config (Coqpit) – model configuration.

  • checkpoint_path (str) – path to checkpoint file.

  • eval (bool, optional) – whether to load model for evaluation.

  • cache (bool, optional) – If True, cache the file locally for subsequent calls. It is cached under get_user_data_dir()/tts_cache. Defaults to False.

on_epoch_start(trainer)

Callback for setting values with respect to the gradual training schedule.

Parameters:

trainer (TrainerTTS) – TTS trainer object that is used to train this model.

test_run(assets)

Generic test run for tts models used by Trainer.

You can override this for a different behaviour.

Parameters:

assets (dict) – A dict of training assets. For tts models, it must include {‘audio_processor’: ap}.

Returns:

Test figures and audios to be projected to Tensorboard.

Return type:

Tuple[Dict, Dict]

Tacotron

class TTS.tts.models.tacotron.Tacotron(config, ap=None, tokenizer=None, speaker_manager=None)

Tacotron as in https://arxiv.org/abs/1703.10135. It’s an autoregressive encoder-attention-decoder-postnet architecture. Check TacotronConfig for the arguments.

Parameters:
  • config (TacotronConfig) – Configuration for the Tacotron model.

  • speaker_manager (SpeakerManager) – Speaker manager to handle multi-speaker settings. Only use if the model is a multi-speaker model. Defaults to None.

forward(text, text_lengths, mel_specs=None, mel_lengths=None, aux_input={'d_vectors': None, 'speaker_ids': None})

Shapes

  • text: [B, T_in]

  • text_lengths: [B]

  • mel_specs: [B, T_out, C]

  • mel_lengths: [B]

  • aux_input: ‘speaker_ids’: [B, 1] and ‘d_vectors’: [B, C]

static init_from_config(config, samples=None)

Initiate model from config

Parameters:
  • config (TacotronConfig) – Model config.

  • samples (Union[List[List], List[Dict]]) – Training samples to parse speaker ids for training. Defaults to None.

train_step(batch, criterion)

Perform a single training step by fetching the right set of samples from the batch.

Parameters:
  • batch ([Dict]) – A dictionary of input tensors.

  • criterion ([torch.nn.Module]) – Callable criterion to compute model loss.

Tacotron2

class TTS.tts.models.tacotron2.Tacotron2(config, ap=None, tokenizer=None, speaker_manager=None)

Tacotron2 model implementation inherited from TTS.tts.models.base_tacotron.BaseTacotron.

Paper:

https://arxiv.org/abs/1712.05884

Paper abstract:

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.

Check TTS.tts.configs.tacotron2_config.Tacotron2Config for model arguments.

Parameters:
  • config (TacotronConfig) – Configuration for the Tacotron2 model.

  • speaker_manager (SpeakerManager) – Speaker manager for multi-speaker training. Use only for multi-speaker training. Defaults to None.

forward(text, text_lengths, mel_specs=None, mel_lengths=None, aux_input={'d_vectors': None, 'speaker_ids': None})

Forward pass for training with Teacher Forcing.

Shapes

  • text: [B, T_in]

  • text_lengths: [B]

  • mel_specs: [B, T_out, C]

  • mel_lengths: [B]

  • aux_input: ‘speaker_ids’: [B, 1] and ‘d_vectors’: [B, C]

inference(text, aux_input=None)

Forward pass for inference with no Teacher-Forcing.

Shapes

  • text: [B, T_in]

  • text_lengths: [B]

static init_from_config(config, samples=None)

Initiate model from config

Parameters:
  • config (Tacotron2Config) – Model config.

  • samples (Union[List[List], List[Dict]]) – Training samples to parse speaker ids for training. Defaults to None.

static shape_outputs(mel_outputs, mel_outputs_postnet, alignments)

Final reshape of the model output tensors.

train_log(batch, outputs, logger, assets, steps)

Log training progress.

train_step(batch, criterion)

A single training step. Forward pass and loss computation.

Parameters:
  • batch ([Dict]) – A dictionary of input tensors.

  • criterion ([type]) – Callable criterion to compute model loss.

TacotronConfig

class TTS.tts.configs.tacotron_config.TacotronConfig(output_path='output', logger_uri=None, run_name='run', project_name=None, run_description='🐸Coqui trainer run.', print_step=25, plot_step=100, model_param_stats=False, wandb_entity=None, dashboard_logger='tensorboard', save_on_interrupt=True, log_model_step=None, save_step=10000, save_n_checkpoints=5, save_checkpoints=True, save_all_best=False, save_best_after=10000, target_loss=None, print_eval=False, test_delay_epochs=0, run_eval=True, run_eval_steps=None, distributed_backend='nccl', distributed_url='tcp://localhost:54321', mixed_precision=False, precision='fp16', epochs=1000, batch_size=32, eval_batch_size=16, grad_clip=5.0, scheduler_after_epoch=True, lr=0.0001, optimizer='RAdam', optimizer_params=<factory>, lr_scheduler='NoamLR', lr_scheduler_params=<factory>, use_grad_scaler=False, allow_tf32=False, cudnn_enable=True, cudnn_deterministic=False, cudnn_benchmark=False, training_seed=54321, model='tacotron', num_loader_workers=0, num_eval_loader_workers=0, use_noise_augment=False, audio=<factory>, use_phonemes=False, phonemizer=None, phoneme_language=None, compute_input_seq_cache=False, text_cleaner=None, enable_eos_bos_chars=False, test_sentences_file='', phoneme_cache_path=None, characters=None, add_blank=False, batch_group_size=0, loss_masking=True, min_audio_len=1, max_audio_len=inf, min_text_len=1, max_text_len=inf, compute_f0=False, compute_energy=False, compute_linear_spec=False, precompute_num_workers=0, start_by_longest=False, shuffle=False, drop_last=False, datasets=<factory>, test_sentences=<factory>, eval_split_max_size=None, eval_split_size=0.01, use_speaker_weighted_sampler=False, speaker_weighted_sampler_alpha=1.0, use_language_weighted_sampler=False, language_weighted_sampler_alpha=1.0, use_length_weighted_sampler=False, length_weighted_sampler_alpha=1.0, use_gst=False, gst=None, gst_style_input=None, use_capacitron_vae=False, capacitron_vae=None, num_speakers=1, num_chars=0, r=2, 
gradual_training=None, memory_size=-1, prenet_type='original', prenet_dropout=True, prenet_dropout_at_inference=False, stopnet=True, separate_stopnet=True, stopnet_pos_weight=0.2, max_decoder_steps=10000, encoder_in_features=256, decoder_in_features=256, decoder_output_dim=80, out_channels=513, attention_type='original', attention_heads=None, attention_norm='sigmoid', attention_win=False, windowing=False, use_forward_attn=False, forward_attn_mask=False, transition_agent=False, location_attn=True, bidirectional_decoder=False, double_decoder_consistency=False, ddc_r=6, speakers_file=None, use_speaker_embedding=False, speaker_embedding_dim=512, use_d_vector_file=False, d_vector_file=False, d_vector_dim=None, seq_len_norm=False, decoder_loss_alpha=0.25, postnet_loss_alpha=0.25, postnet_diff_spec_alpha=0.25, decoder_diff_spec_alpha=0.25, decoder_ssim_alpha=0.25, postnet_ssim_alpha=0.25, ga_alpha=5.0)

Defines parameters for Tacotron based models.

Example

>>> from TTS.tts.configs.tacotron_config import TacotronConfig
>>> config = TacotronConfig()

Parameters:
  • model (str) – Model name used to select the right model class to initialize. Defaults to tacotron.

  • use_gst (bool) – enable / disable the use of Global Style Token modules. Defaults to False.

  • gst (GSTConfig) – Instance of GSTConfig class.

  • gst_style_input (str) – Path to the wav file used at inference to set the speech style through GST. If GST is enabled and this is not defined, the model uses a zero vector as an input. Defaults to None.

  • use_capacitron_vae (bool) – enable / disable the use of Capacitron modules. Defaults to False.

  • capacitron_vae (CapacitronConfig) – Instance of CapacitronConfig class.

  • num_chars (int) – Number of characters used by the model. It must be defined before initializing the model. Defaults to 0.

  • num_speakers (int) – Number of speakers for multi-speaker models. Defaults to 1.

  • r (int) – Initial number of output frames that the decoder computes per iteration. Larger values make training and inference faster but reduce the quality of the output frames. This must be equal to the largest r value used in the gradual_training schedule. Defaults to 2.

  • gradual_training (List[List]) – Parameters for the gradual training schedule. It is in the form [[a, b, c], [d, e, f], ...] where a is the step number to start using the rest of the values, b is the r value and c is the batch size. If set to None, no gradual training is used. Defaults to None.

  • memory_size (int) – Defines the number of previous frames used by the Prenet. If set to < 0, then it uses only the last frame. Defaults to -1.

  • prenet_type (str) – original or bn. original sets the default Prenet and bn uses Batch Normalization version of the Prenet. Defaults to original.

  • prenet_dropout (bool) – enables / disables the use of dropout in the Prenet. Defaults to True.

  • prenet_dropout_at_inference (bool) – enable / disable the use of dropout in the Prenet at the inference time. Defaults to False.

  • stopnet (bool) – enable / disable the Stopnet that predicts the end of the decoder sequence. Defaults to True.

  • stopnet_pos_weight (float) – Weight that is applied to over-weight positive instances in the Stopnet loss. Use larger values with datasets with longer sentences. Defaults to 0.2.

  • max_decoder_steps (int) – Max number of steps allowed for the decoder. Defaults to 10000.

  • encoder_in_features (int) – Channels of encoder input and character embedding tensors. Defaults to 256.

  • decoder_in_features (int) – Channels of decoder input and encoder output tensors. Defaults to 256.

  • out_channels (int) – Channels of the final model output. It must match the spectrogram size. Defaults to 513.

  • separate_stopnet (bool) – Use a distinct Stopnet which is trained separately from the rest of the model. Defaults to True.

  • attention_type (str) – attention type. Check `TTS.tts.layers.attentions.init_attn`. Defaults to ‘original’.

  • attention_heads (int) – Number of attention heads for GMM attention. Defaults to None.

  • windowing (bool) – It is especially useful at inference to keep attention alignment diagonal. Defaults to False.

  • use_forward_attn (bool) – It is only valid if `attention_type` is `original`. Defaults to False.

  • forward_attn_mask (bool) – enable/disable extra masking over forward attention. It is useful at inference to prevent possible attention failures. Defaults to False.

  • transition_agent (bool) – enable/disable transition agent in forward attention. Defaults to False.

  • location_attn (bool) – enable/disable location sensitive attention as in the original Tacotron2 paper. It is only valid if `attention_type` is `original`. Defaults to True.

  • bidirectional_decoder (bool) – enable/disable bidirectional decoding. Defaults to False.

  • double_decoder_consistency (bool) – enable/disable double decoder consistency. Defaults to False.

  • ddc_r (int) – reduction rate used by the coarse decoder when double_decoder_consistency is in use. Set this as a multiple of the r value. Defaults to 6.

  • speakers_file (str) – Path to the speaker mapping file for the Speaker Manager. Defaults to None.

  • use_speaker_embedding (bool) – enable / disable using speaker embeddings for multi-speaker models. If set True, the model is in the multi-speaker mode. Defaults to False.

  • use_d_vector_file (bool) – enable / disable using external speaker embeddings in place of the learned embeddings. Defaults to False.

  • d_vector_file (str) – Path to the file including pre-computed speaker embeddings. Defaults to False.

  • optimizer (str) – Optimizer used for the training. Set one from torch.optim.Optimizer or TTS.utils.training. Defaults to RAdam.

  • optimizer_params (dict) – Optimizer kwargs. Defaults to {“betas”: [0.8, 0.99], “weight_decay”: 0.0}

  • lr_scheduler (str) – Learning rate scheduler for the training. Use one from torch.optim.Scheduler schedulers or TTS.utils.training. Defaults to NoamLR.

  • lr_scheduler_params (dict) – Parameters for the generator learning rate scheduler. Defaults to {“warmup”: 4000}.

  • lr (float) – Initial learning rate. Defaults to 1e-4.

  • wd (float) – Weight decay coefficient. Defaults to 1e-6.

  • grad_clip (float) – Gradient clipping threshold. Defaults to 5.

  • seq_len_norm (bool) – enable / disable the sequence length normalization in the loss functions. If set True, loss of a sample is divided by the sequence length. Defaults to False.

  • loss_masking (bool) – enable / disable masking the paddings of the samples in loss computation. Defaults to True.

  • decoder_loss_alpha (float) – Weight for the decoder loss of the Tacotron model. If set less than or equal to zero, it disables the corresponding loss function. Defaults to 0.25.

  • postnet_loss_alpha (float) – Weight for the postnet loss of the Tacotron model. If set less than or equal to zero, it disables the corresponding loss function. Defaults to 0.25.

  • postnet_diff_spec_alpha (float) – Weight for the postnet differential loss of the Tacotron model. If set less than or equal to zero, it disables the corresponding loss function. Defaults to 0.25.

  • decoder_diff_spec_alpha (float) – Weight for the decoder differential loss of the Tacotron model. If set less than or equal to zero, it disables the corresponding loss function. Defaults to 0.25.

  • decoder_ssim_alpha (float) – Weight for the decoder SSIM loss of the Tacotron model. If set less than or equal to zero, it disables the corresponding loss function. Defaults to 0.25.

  • postnet_ssim_alpha (float) – Weight for the postnet SSIM loss of the Tacotron model. If set less than or equal to zero, it disables the corresponding loss function. Defaults to 0.25.

  • ga_alpha (float) – Weight for the guided attention loss. If set less than or equal to zero, it disables the corresponding loss function. Defaults to 5.
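The `gradual_training` format described in the parameter list can be read as a step-indexed lookup table. The helper below is a hypothetical sketch (the name `pick_schedule` and the schedule values are made up for illustration, and it is not the scheduler used by 🐸TTS); it assumes the entries are sorted by start step and returns the `r` and batch size that apply at a given training step.

```python
def pick_schedule(gradual_training, global_step):
    """Return (r, batch_size) of the last entry whose start step has been reached."""
    current = None
    for start_step, r, batch_size in gradual_training:
        if global_step >= start_step:
            current = (r, batch_size)
    if current is None:
        raise ValueError("no schedule entry covers this step")
    return current


# Illustrative schedule, each entry is [start_step, r, batch_size].
schedule = [[0, 6, 64], [10000, 4, 32], [50000, 2, 32]]

print(pick_schedule(schedule, 0))      # (6, 64)
print(pick_schedule(schedule, 12000))  # (4, 32)
print(pick_schedule(schedule, 80000))  # (2, 32)

# Per the description of r, the config's r should equal the largest r here.
print(max(entry[1] for entry in schedule))  # 6
```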

Tacotron2Config

class TTS.tts.configs.tacotron2_config.Tacotron2Config(output_path='output', logger_uri=None, run_name='run', project_name=None, run_description='🐸Coqui trainer run.', print_step=25, plot_step=100, model_param_stats=False, wandb_entity=None, dashboard_logger='tensorboard', save_on_interrupt=True, log_model_step=None, save_step=10000, save_n_checkpoints=5, save_checkpoints=True, save_all_best=False, save_best_after=10000, target_loss=None, print_eval=False, test_delay_epochs=0, run_eval=True, run_eval_steps=None, distributed_backend='nccl', distributed_url='tcp://localhost:54321', mixed_precision=False, precision='fp16', epochs=1000, batch_size=32, eval_batch_size=16, grad_clip=5.0, scheduler_after_epoch=True, lr=0.0001, optimizer='RAdam', optimizer_params=<factory>, lr_scheduler='NoamLR', lr_scheduler_params=<factory>, use_grad_scaler=False, allow_tf32=False, cudnn_enable=True, cudnn_deterministic=False, cudnn_benchmark=False, training_seed=54321, model='tacotron2', num_loader_workers=0, num_eval_loader_workers=0, use_noise_augment=False, audio=<factory>, use_phonemes=False, phonemizer=None, phoneme_language=None, compute_input_seq_cache=False, text_cleaner=None, enable_eos_bos_chars=False, test_sentences_file='', phoneme_cache_path=None, characters=None, add_blank=False, batch_group_size=0, loss_masking=True, min_audio_len=1, max_audio_len=inf, min_text_len=1, max_text_len=inf, compute_f0=False, compute_energy=False, compute_linear_spec=False, precompute_num_workers=0, start_by_longest=False, shuffle=False, drop_last=False, datasets=<factory>, test_sentences=<factory>, eval_split_max_size=None, eval_split_size=0.01, use_speaker_weighted_sampler=False, speaker_weighted_sampler_alpha=1.0, use_language_weighted_sampler=False, language_weighted_sampler_alpha=1.0, use_length_weighted_sampler=False, length_weighted_sampler_alpha=1.0, use_gst=False, gst=None, gst_style_input=None, use_capacitron_vae=False, capacitron_vae=None, num_speakers=1, num_chars=0, r=2, 
gradual_training=None, memory_size=-1, prenet_type='original', prenet_dropout=True, prenet_dropout_at_inference=False, stopnet=True, separate_stopnet=True, stopnet_pos_weight=0.2, max_decoder_steps=10000, encoder_in_features=512, decoder_in_features=512, decoder_output_dim=80, out_channels=80, attention_type='original', attention_heads=None, attention_norm='sigmoid', attention_win=False, windowing=False, use_forward_attn=False, forward_attn_mask=False, transition_agent=False, location_attn=True, bidirectional_decoder=False, double_decoder_consistency=False, ddc_r=6, speakers_file=None, use_speaker_embedding=False, speaker_embedding_dim=512, use_d_vector_file=False, d_vector_file=False, d_vector_dim=None, seq_len_norm=False, decoder_loss_alpha=0.25, postnet_loss_alpha=0.25, postnet_diff_spec_alpha=0.25, decoder_diff_spec_alpha=0.25, decoder_ssim_alpha=0.25, postnet_ssim_alpha=0.25, ga_alpha=5.0)

Defines parameters for Tacotron2 based models.

Example

>>> from TTS.tts.configs.tacotron2_config import Tacotron2Config
>>> config = Tacotron2Config()

Check TacotronConfig for argument descriptions.