Overflow TTS#

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Compared to dominant flow-based acoustic models, our approach integrates autoregression for improved modelling of long-range dependences such as utterance-level prosody. Experiments show that a system based on our proposal gives more accurate pronunciations and better subjective speech quality than comparable methods, whilst retaining the original advantages of neural HMMs. Audio examples and code are available at https://shivammehta25.github.io/OverFlow/.

Important resources & papers#

OverflowConfig#

class TTS.tts.configs.overflow_config.OverflowConfig(output_path='output', logger_uri=None, run_name='run', project_name=None, run_description='🐸Coqui trainer run.', print_step=25, plot_step=1, model_param_stats=False, wandb_entity=None, dashboard_logger='tensorboard', save_on_interrupt=True, log_model_step=None, save_step=500, save_n_checkpoints=5, save_checkpoints=True, save_all_best=False, save_best_after=10000, target_loss=None, print_eval=False, test_delay_epochs=0, run_eval=True, run_eval_steps=100, distributed_backend='nccl', distributed_url='tcp://localhost:54321', mixed_precision=False, precision='fp16', epochs=1000, batch_size=32, eval_batch_size=16, grad_clip=40000.0, scheduler_after_epoch=True, lr=0.001, optimizer='Adam', optimizer_params=<factory>, lr_scheduler=None, lr_scheduler_params=<factory>, use_grad_scaler=False, allow_tf32=False, cudnn_enable=True, cudnn_deterministic=False, cudnn_benchmark=False, training_seed=54321, model='Overflow', num_loader_workers=0, num_eval_loader_workers=0, use_noise_augment=False, audio=<factory>, use_phonemes=False, phonemizer=None, phoneme_language=None, compute_input_seq_cache=False, text_cleaner=None, enable_eos_bos_chars=False, test_sentences_file='', phoneme_cache_path=None, characters=None, add_blank=False, batch_group_size=0, loss_masking=None, min_audio_len=512, max_audio_len=inf, min_text_len=10, max_text_len=500, compute_f0=False, compute_energy=False, compute_linear_spec=False, precompute_num_workers=0, start_by_longest=False, shuffle=False, drop_last=False, datasets=<factory>, test_sentences=<factory>, eval_split_max_size=None, eval_split_size=0.01, use_speaker_weighted_sampler=False, speaker_weighted_sampler_alpha=1.0, use_language_weighted_sampler=False, language_weighted_sampler_alpha=1.0, use_length_weighted_sampler=False, length_weighted_sampler_alpha=1.0, force_generate_statistics=False, mel_statistics_parameter_path=None, num_chars=None, state_per_phone=2, encoder_in_out_features=512, 
encoder_n_convolutions=3, out_channels=80, ar_order=1, sampling_temp=0.334, deterministic_transition=True, duration_threshold=0.55, use_grad_checkpointing=True, max_sampling_time=1000, prenet_type='original', prenet_dim=256, prenet_n_layers=2, prenet_dropout=0.5, prenet_dropout_at_inference=False, memory_rnn_dim=1024, outputnet_size=<factory>, flat_start_params=<factory>, std_floor=0.01, hidden_channels_dec=150, kernel_size_dec=5, dilation_rate=1, num_flow_blocks_dec=12, num_block_layers=4, dropout_p_dec=0.05, num_splits=4, num_squeeze=2, sigmoid_scale=False, c_in_channels=0, r=1, use_d_vector_file=False, use_speaker_embedding=False)[source]#

Define parameters for OverFlow model.

Example

>>> from TTS.tts.configs.overflow_config import OverflowConfig
>>> config = OverflowConfig()
Parameters:
  • model (str) – Model name used to select the right model class to initialize. Defaults to Overflow.

  • run_eval_steps (int) – Run an evaluation epoch every N steps. If None, waits until the training epoch is completed. Defaults to 100.

  • save_step (int) – Save local checkpoint every save_step steps. Defaults to 500.

  • plot_step (int) – Plot training stats on the logger every plot_step steps. Defaults to 1.

  • model_param_stats (bool) – Log model parameters stats on the logger dashboard. Defaults to False.

  • force_generate_statistics (bool) – Force generate mel normalization statistics. Defaults to False.

  • mel_statistics_parameter_path (str) – Path to the mel normalization statistics. If the model does not find a file there, it generates the statistics. Defaults to None.

  • num_chars (int) – Number of characters used by the model. It must be defined before initializing the model. Defaults to None.

  • state_per_phone (int) – Number of states generated per phone. Similar to the add_blank parameter in GlowTTS, but in Overflow the upsampling is done by the model’s encoder. Defaults to 2.

  • encoder_in_out_features (int) – Channels of encoder input and character embedding tensors. Defaults to 512.

  • encoder_n_convolutions (int) – Number of convolution layers in the encoder. Defaults to 3.

  • out_channels (int) – Channels of the final model output. It must match the spectrogram size. Defaults to 80.

  • ar_order (int) – Autoregressive order of the model. Defaults to 1. Ablations of Neural HMM found that more autoregression, while giving more variation, hurts the naturalness of the synthesised audio.

  • sampling_temp (float) – Variation added to the sample from the latent space of neural HMM. Defaults to 0.334.

  • deterministic_transition (bool) – Use deterministic duration generation based on duration quantiles, as defined in S. Ronanki, O. Watts, S. King, and G. E. Henter, “Median-based generation of synthetic speech durations using a nonparametric approach,” in Proc. SLT, 2016. Defaults to True.

  • duration_threshold (float) – Threshold for duration quantiles. Defaults to 0.55. Tune this to change the speaking rate of the synthesis: lower values give a slower speaking rate and higher values give a faster one.

  • use_grad_checkpointing (bool) – Use gradient checkpointing to save memory. PyTorch currently does not support gradient checkpointing inside a loop in a multi-GPU setting, so it must be turned off there. Choose whichever setup allows the larger batch size: a single GPU with checkpointing or multiple GPUs without it. Defaults to True.

  • max_sampling_time (int) – Maximum sampling time while synthesising latents from neural HMM. Defaults to 1000.

  • prenet_type (str) – original or bn. original sets the default Prenet and bn uses Batch Normalization version of the Prenet. Defaults to original.

  • prenet_dim (int) – Dimension of the Prenet. Defaults to 256.

  • prenet_n_layers (int) – Number of layers in the Prenet. Defaults to 2.

  • prenet_dropout (float) – Dropout rate of the Prenet. Defaults to 0.5.

  • prenet_dropout_at_inference (bool) – Use dropout at inference time. Defaults to False.

  • memory_rnn_dim (int) – Dimension of the memory LSTM to process the prenet output. Defaults to 1024.

  • outputnet_size (list[int]) – Size of the output network inside the neural HMM. Defaults to [1024].

  • flat_start_params (dict) – Parameters for the flat start initialization of the neural HMM. Defaults to {“mean”: 0.0, “std”: 1.0, “transition_p”: 0.14}. It will be recomputed when you pass the dataset.

  • std_floor (float) – Floor value for the standard deviation of the neural HMM. Prevents the model from cheating by placing a point mass on a datapoint and obtaining infinite likelihood. Defaults to 0.01. This is called variance flooring in standard HMM literature.

  • hidden_channels_dec (int) – Number of base hidden channels used by the decoder WaveNet network. Defaults to 150.

  • kernel_size_dec (int) – Decoder kernel size. Defaults to 5.

  • dilation_rate (int) – Rate to increase dilation by each layer in a decoder block. Defaults to 1.

  • num_flow_blocks_dec (int) – Number of flow blocks in the decoder. Defaults to 12.

  • num_block_layers (int) – Number of decoder layers in each decoder block. Defaults to 4.

  • dropout_p_dec (float) – Dropout rate of the decoder. Defaults to 0.05.

  • num_splits (int) – Number of split levels in the invertible conv1x1 operation. Defaults to 4.

  • num_squeeze (int) – Number of squeeze levels. Squeezing increases the number of channels and reduces the number of time steps by the factor num_squeeze. Defaults to 2.

  • sigmoid_scale (bool) – Enable or disable sigmoid scaling in the decoder. Defaults to False.

  • c_in_channels (int) – Unused parameter from GlowTTS’s decoder. Defaults to 0.

  • optimizer (str) – Optimizer to use for training. Defaults to Adam.

  • optimizer_params (dict) – Parameters for the optimizer. Defaults to {“weight_decay”: 1e-6}.

  • grad_clip (float) – Gradient clipping threshold. Defaults to 40_000.

  • lr (float) – Learning rate. Defaults to 1e-3.

  • lr_scheduler (str) – Learning rate scheduler for the training. Use one of the schedulers from torch.optim.lr_scheduler or TTS.utils.training. Defaults to None.

  • min_seq_len (int) – Minimum input sequence length to be used at training.

  • max_seq_len (int) – Maximum input sequence length to be used at training. Larger values result in more VRAM usage.
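The interaction between deterministic_transition and duration_threshold can be illustrated with a small sketch. It assumes that the model leaves a state as soon as the probability of still occupying it falls below the threshold. The helper name and the constant per-frame staying probabilities are hypothetical, and this is not the library's implementation.

```python
def quantile_duration(stay_probs, duration_threshold=0.55):
    """Return the number of frames spent in a state (hypothetical helper).

    stay_probs[t] is the assumed probability of remaining in the state
    at frame t, given that the state has not been left yet.
    """
    survival = 1.0  # probability of still occupying the state
    for frame, p_stay in enumerate(stay_probs, start=1):
        survival *= p_stay
        if survival < duration_threshold:
            return frame  # leave the state after this frame
    return len(stay_probs)  # threshold never crossed: use every frame


stay = [0.8] * 20  # made-up staying probabilities for illustration
default_rate = quantile_duration(stay, duration_threshold=0.55)
slower_rate = quantile_duration(stay, duration_threshold=0.30)
print(default_rate, slower_rate)
```

Under this assumption, a lower threshold keeps the model in each state for more frames, which is consistent with the slower speaking rate described for duration_threshold.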

check_values()[source]#

Validate the hyperparameters.

Raises:
  • AssertionError – when the parameter network is not defined.

  • AssertionError – when the transition probability is not between 0 and 1.
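As a rough illustration of the two checks listed above, the sketch below validates a pair of values outside the config class. The function name is hypothetical, and reading "not defined" as an empty outputnet_size list and "between 0 and 1" as a strict inequality are assumptions rather than the library's actual code.

```python
def check_overflow_values(outputnet_size, flat_start_params):
    """Hypothetical stand-in for the checks that check_values() describes."""
    # Assumption: an undefined parameter network means an empty layer list.
    assert len(outputnet_size) >= 1, "outputnet_size must define at least one layer"
    transition_p = flat_start_params["transition_p"]
    # Assumption: the bounds are exclusive.
    assert 0.0 < transition_p < 1.0, "transition_p must lie between 0 and 1"


# The documented defaults pass both checks.
check_overflow_values([1024], {"mean": 0.0, "std": 1.0, "transition_p": 0.14})

# An out-of-range transition probability is rejected.
try:
    check_overflow_values([1024], {"mean": 0.0, "std": 1.0, "transition_p": 1.5})
    rejected = False
except AssertionError:
    rejected = True
print(rejected)
```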

Overflow Model#

class TTS.tts.models.overflow.Overflow(config, ap=None, tokenizer=None, speaker_manager=None)[source]#

OverFlow TTS model.

Paper: https://arxiv.org/abs/2211.06892

Note

  • Neural HMMs use flat-start initialization, i.e. the mean, standard deviation and transition probabilities of the dataset are computed and used to initialize the model. This helps the model learn faster. If you change the dataset or want to regenerate the parameters, change force_generate_statistics and mel_statistics_parameter_path accordingly.

  • To enable multi-GPU training, set use_grad_checkpointing=False in the config. This significantly increases memory usage, because computing the exact data likelihood (rather than an approximation using MAS/Viterbi) requires all the states at the previous time step during the forward pass to decide the probability distribution at the current step. That is the difference between the forward algorithm and the Viterbi approximation.
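The difference between the forward algorithm and the Viterbi approximation can be seen in a toy sketch. It assumes a left-to-right HMM without skips, constant transition probabilities, and made-up emission probabilities; the function names are hypothetical, and the real model predicts its parameters with a neural network at every step.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def sequence_log_score(log_emit, log_stay, log_move, combine):
    """Score a left-to-right HMM without skips.

    log_emit[t][s] is the log-probability of frame t under state s.
    combine=log_sum_exp gives the forward algorithm (sum over all paths);
    combine=max gives the Viterbi approximation (best path only).
    """
    n_states = len(log_emit[0])
    # Every path starts in state 0.
    alpha = [log_emit[0][0]] + [-math.inf] * (n_states - 1)
    for frame in log_emit[1:]:
        new_alpha = []
        for s in range(n_states):
            incoming = [alpha[s] + log_stay]  # stayed in the same state
            if s > 0:
                incoming.append(alpha[s - 1] + log_move)  # came from the left
            new_alpha.append(combine(incoming) + frame[s])
        alpha = new_alpha
    # Assumption: every path ends by leaving the last state.
    return alpha[-1] + log_move


# Three frames, two states, made-up emission probabilities.
log_emit = [[math.log(p) for p in row] for row in [
    [0.9, 0.1],
    [0.6, 0.4],
    [0.2, 0.8],
]]
log_stay, log_move = math.log(0.86), math.log(0.14)

forward = sequence_log_score(log_emit, log_stay, log_move, log_sum_exp)
viterbi = sequence_log_score(log_emit, log_stay, log_move, max)
print(forward >= viterbi)
```

The forward recursion needs every state's value from the previous time step, whereas the Viterbi variant keeps only the best incoming path, which is why it can only underestimate the likelihood.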

Check TTS.tts.configs.overflow_config.OverflowConfig for class arguments.

eval_log(batch, outputs, logger, assets, steps)[source]#

Compute and log evaluation metrics.

forward(text, text_len, mels, mel_len)[source]#

Forward pass for training and computing the log likelihood of a given batch.

Shapes:
  • text: \([B, T_{in}]\)

  • text_len: \([B]\)

  • mels: \([B, T_{out}, C]\)

  • mel_len: \([B]\)
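To make the relationship between the padded inputs and their length vectors concrete, here is a minimal sketch using nested lists in place of tensors. The pad_batch helper, the padding value of 0, and the toy data are assumptions for illustration only.

```python
def pad_batch(sequences, pad_value):
    """Pad variable-length sequences to a common length (hypothetical helper).

    Returns the padded batch and the original length of each sequence.
    """
    lengths = [len(seq) for seq in sequences]
    longest = max(lengths)
    padded = [list(seq) + [pad_value] * (longest - len(seq)) for seq in sequences]
    return padded, lengths


# Two utterances: token ids for the text and 3-channel frames for the mels
# (the config default is out_channels=80; 3 keeps the example small).
token_ids = [[5, 9, 2, 7], [4, 1]]
frames = [[[0.1, 0.2, 0.3]] * 6, [[0.4, 0.5, 0.6]] * 3]

text, text_len = pad_batch(token_ids, pad_value=0)
mels, mel_len = pad_batch(frames, pad_value=[0.0, 0.0, 0.0])

print(len(text), len(text[0]), text_len)                  # B, T_in, lengths
print(len(mels), len(mels[0]), len(mels[0][0]), mel_len)  # B, T_out, C, lengths
```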

inference(text, aux_input={'duration_threshold': None, 'max_sampling_time': None, 'sampling_temp': None, 'x_lengths': None})[source]#

Sampling from the model

Parameters:
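  • text (torch.Tensor) – \([B, T_{in}]\)

  • aux_input (dict, optional) – Optional inference settings. The signature lists the keys x_lengths, sampling_temp, max_sampling_time and duration_threshold, each defaulting to None.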

Returns:

Dictionary containing the following
  • mel (torch.Tensor): \([B, T_{out}, C]\)

  • hmm_outputs_len (torch.Tensor): \([B]\)

  • state_travelled (List[List[int]]): List of lists containing the states travelled for each sample in the batch.

  • input_parameters (list[torch.FloatTensor]): Input parameters to the neural HMM.

  • output_parameters (list[torch.FloatTensor]): Output parameters to the neural HMM.

Return type:

dict
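One way to consume the returned dictionary is sketched below. It assumes that hmm_outputs_len holds the number of valid frames per sample and that mel is padded beyond that length. Nested lists stand in for tensors, and the trim_mels helper and the sample values are hypothetical.

```python
def trim_mels(outputs):
    """Drop padded frames from each mel in an inference-style output dict."""
    return [
        mel[:length]
        for mel, length in zip(outputs["mel"], outputs["hmm_outputs_len"])
    ]


# Hypothetical stand-in for the returned dictionary: B=2, T_out=4, C=2.
outputs = {
    "mel": [
        [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
        [[0.9, 1.0], [1.1, 1.2], [0.0, 0.0], [0.0, 0.0]],
    ],
    "hmm_outputs_len": [4, 2],
}
trimmed = trim_mels(outputs)
print([len(mel) for mel in trimmed])
```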

static init_from_config(config, samples=None, verbose=True)[source]#

Initiate model from config

Parameters:
  • config (OverflowConfig) – Model config.

  • samples (Union[List[List], List[Dict]]) – Training samples to parse speaker ids for training. Defaults to None.

  • verbose (bool) – If True, print init messages. Defaults to True.

on_init_start(trainer)[source]#

Compute the normalisation statistics and the initial transition probability if the current dataset does not have them; otherwise load them.
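The compute-or-load behaviour can be sketched as a small cache. This is an illustration only: the helper name, the JSON file format, and the toy values are assumptions, the transition probability is omitted for brevity, and the library's own storage format may differ.

```python
import json
import os
import statistics
import tempfile


def load_or_compute_stats(path, mel_values, force_generate=False):
    """Return normalisation statistics, computing them only when needed."""
    if not force_generate and os.path.isfile(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    stats = {
        "mean": statistics.fmean(mel_values),
        "std": statistics.pstdev(mel_values),
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(stats, handle)
    return stats


with tempfile.TemporaryDirectory() as folder:
    stats_path = os.path.join(folder, "mel_stats.json")
    first = load_or_compute_stats(stats_path, [1.0, 2.0, 3.0])
    # The cached file is reused even though the data has changed.
    cached = load_or_compute_stats(stats_path, [10.0, 20.0, 30.0])
    # Forcing regeneration recomputes the statistics from the new data.
    forced = load_or_compute_stats(stats_path, [10.0, 20.0, 30.0], force_generate=True)

print(first["mean"], cached["mean"], forced["mean"])
```

The force_generate argument plays the role that force_generate_statistics plays in the config, and path corresponds to mel_statistics_parameter_path.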

train_log(batch, outputs, logger, assets, steps)[source]#

Log training progress.