Forward TTS model(s)
A general feed-forward TTS model implementation that can be configured to different architectures by setting different encoder and decoder networks. It can be trained with either pre-computed durations (from a pre-trained Tacotron model) or an alignment network that learns the text-to-audio alignment from the input data.
Currently we provide the following pre-configured architectures:
FastSpeech:
It is a feed-forward TTS model that uses Feed Forward Transformer (FFT) modules as the encoder and decoder.
FastPitch:
It uses the FastSpeech architecture conditioned on fundamental frequency (f0) contours, with the promise of more expressive speech.
SpeedySpeech:
It uses residual convolution layers instead of Transformers, which leads to a more compute-friendly model.
FastSpeech2 (TODO):
Similar to FastPitch, but it additionally uses spectral energy values.
Important resources & papers
FastPitch: https://arxiv.org/abs/2006.06873
SpeedySpeech: https://arxiv.org/abs/2008.03802
FastSpeech: https://arxiv.org/pdf/1905.09263
FastSpeech2: https://arxiv.org/abs/2006.04558
Aligner Network: https://arxiv.org/abs/2108.10447
What is Pitch: https://www.britannica.com/topic/pitch-speech
ForwardTTSArgs
- class TTS.tts.models.forward_tts.ForwardTTSArgs(num_chars=None, out_channels=80, hidden_channels=384, use_aligner=True, use_pitch=True, pitch_predictor_hidden_channels=256, pitch_predictor_kernel_size=3, pitch_predictor_dropout_p=0.1, pitch_embedding_kernel_size=3, use_energy=False, energy_predictor_hidden_channels=256, energy_predictor_kernel_size=3, energy_predictor_dropout_p=0.1, energy_embedding_kernel_size=3, duration_predictor_hidden_channels=256, duration_predictor_kernel_size=3, duration_predictor_dropout_p=0.1, positional_encoding=True, poisitonal_encoding_use_scale=True, length_scale=1, encoder_type='fftransformer', encoder_params=<factory>, decoder_type='fftransformer', decoder_params=<factory>, detach_duration_predictor=False, max_duration=75, num_speakers=1, use_speaker_embedding=False, speakers_file=None, use_d_vector_file=False, d_vector_dim=None, d_vector_file=None)
ForwardTTS Model arguments.
- Parameters:
num_chars (int) – Number of characters in the vocabulary. Defaults to None.
out_channels (int) – Number of output channels. Defaults to 80.
hidden_channels (int) – Number of base hidden channels of the model. Defaults to 384.
use_aligner (bool) – Whether to use the aligner network to learn the text-to-speech alignment or to use pre-computed durations. If set to False, durations should be computed by TTS/bin/compute_attention_masks.py and the path to the pre-computed durations must be provided in config.datasets[0].meta_file_attn_mask. Defaults to True.
use_pitch (bool) – Use pitch predictor to learn the pitch. Defaults to True.
use_energy (bool) – Use energy predictor to learn the energy. Defaults to False.
duration_predictor_hidden_channels (int) – Number of hidden channels in the duration predictor. Defaults to 256.
duration_predictor_dropout_p (float) – Dropout rate for the duration predictor. Defaults to 0.1.
duration_predictor_kernel_size (int) – Kernel size of conv layers in the duration predictor. Defaults to 3.
pitch_predictor_hidden_channels (int) – Number of hidden channels in the pitch predictor. Defaults to 256.
pitch_predictor_dropout_p (float) – Dropout rate for the pitch predictor. Defaults to 0.1.
pitch_predictor_kernel_size (int) – Kernel size of conv layers in the pitch predictor. Defaults to 3.
pitch_embedding_kernel_size (int) – Kernel size of the projection layer in the pitch predictor. Defaults to 3.
energy_predictor_hidden_channels (int) – Number of hidden channels in the energy predictor. Defaults to 256.
energy_predictor_dropout_p (float) – Dropout rate for the energy predictor. Defaults to 0.1.
energy_predictor_kernel_size (int) – Kernel size of conv layers in the energy predictor. Defaults to 3.
energy_embedding_kernel_size (int) – Kernel size of the projection layer in the energy predictor. Defaults to 3.
positional_encoding (bool) – Whether to use positional encoding. Defaults to True.
positional_encoding_use_scale (bool) – Whether to use a learnable scale coefficient in the positional encoding. Defaults to True.
length_scale (int) – Length scale that multiplies the predicted durations. Larger values result in slower speech. Defaults to 1.
encoder_type (str) – Type of the encoder module. One of the encoders available in TTS.tts.layers.feed_forward.encoder. Defaults to fftransformer as in the paper.
encoder_params (dict) – Parameters of the encoder module. Defaults to `{"hidden_channels_ffn": 1024, "num_heads": 1, "num_layers": 6, "dropout_p": 0.1}`.
decoder_type (str) – Type of the decoder module. One of the decoders available in TTS.tts.layers.feed_forward.decoder. Defaults to fftransformer as in the paper.
decoder_params (dict) – Parameters of the decoder module. Defaults to `{"hidden_channels_ffn": 1024, "num_heads": 1, "num_layers": 6, "dropout_p": 0.1}`.
detach_duration_predictor (bool) – Detach the input to the duration predictor from the earlier computation graph so that the duration loss does not propagate to the earlier layers. Defaults to False.
max_duration (int) – Maximum duration accepted by the model. Defaults to 75.
num_speakers (int) – Number of speakers for the speaker embedding layer. Defaults to 1.
speakers_file (str) – Path to the speaker mapping file for the Speaker Manager. Defaults to None.
speaker_embedding_channels (int) – Number of speaker embedding channels. Defaults to 256.
use_d_vector_file (bool) – Enable/Disable the use of d-vectors for multi-speaker training. Defaults to False.
d_vector_dim (int) – Number of d-vector channels. Defaults to None.
ForwardTTS Model
- class TTS.tts.models.forward_tts.ForwardTTS(config, ap=None, tokenizer=None, speaker_manager=None)
General forward TTS model implementation that uses an encoder-decoder architecture with an optional alignment network and a pitch predictor.
If the alignment network is used, the model learns the text-to-speech alignment from the data instead of using pre-computed durations.
If the pitch predictor is used, the model trains a pitch predictor that predicts average pitch value for each input character as in the FastPitch model.
ForwardTTS can be configured to one of these architectures:
FastPitch
SpeedySpeech
FastSpeech
FastSpeech2 (requires average speech energy predictor)
- Parameters:
config (Coqpit) – Model coqpit class.
speaker_manager (SpeakerManager) – Speaker manager for multi-speaker training. Only used for multi-speaker models. Defaults to None.
Examples
>>> from TTS.tts.models.forward_tts import ForwardTTS, ForwardTTSArgs
>>> config = ForwardTTSArgs()
>>> model = ForwardTTS(config)
- expand_encoder_outputs(en, dr, x_mask, y_mask)
Generate an attention alignment map from durations and expand the encoder outputs.
Shapes
en: \((B, D_{en}, T_{en})\)
dr: \((B, T_{en})\)
x_mask: \((B, T_{en})\)
y_mask: \((B, T_{de})\)
Examples:
encoder output: [a, b, c, d]
durations: [1, 3, 2, 1]
expanded: [a, b, b, b, c, c, d]
attention map: [[0, 0, 0, 0, 0, 0, 1],
                [0, 0, 0, 0, 1, 1, 0],
                [0, 1, 1, 1, 0, 0, 0],
                [1, 0, 0, 0, 0, 0, 0]]
- format_durations(o_dr_log, x_mask)
Format predicted durations:
1. Convert to linear scale from log scale.
2. Apply the length scale for speed adjustment.
3. Apply masking.
4. Cast 0 durations to 1.
5. Round the duration values.
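The expansion in the example above can be sketched in plain Python, without tensors or batching. The helper names `expand_by_durations` and `attention_map` are hypothetical and not part of the library. The map built here is listed in token order, whereas the example above appears to list its rows from the last token to the first.

```python
def expand_by_durations(encoder_outputs, durations):
    """Repeat each encoder output durations[i] times (illustrative helper)."""
    expanded = []
    for output, duration in zip(encoder_outputs, durations):
        expanded.extend([output] * duration)
    return expanded


def attention_map(durations):
    """Build a binary [T_en x T_de] map; row i marks the frames of token i."""
    total_frames = sum(durations)
    rows, start = [], 0
    for duration in durations:
        row = [0] * total_frames
        for frame in range(start, start + duration):
            row[frame] = 1
        rows.append(row)
        start += duration
    return rows


expanded = expand_by_durations(["a", "b", "c", "d"], [1, 3, 2, 1])
attn = attention_map([1, 3, 2, 1])
print(expanded)  # ['a', 'b', 'b', 'b', 'c', 'c', 'd']
for row in attn:
    print(row)
```

Each output frame is assigned to exactly one input token, so every column of the map sums to 1.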
- Parameters:
o_dr_log – Log scale durations.
x_mask – Input text mask.
Shapes
o_dr_log: \((B, T_{de})\)
x_mask: \((B, T_{en})\)
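The formatting steps listed for `format_durations` can be sketched on plain Python lists as follows. This is an illustration only: the exact log convention is an assumption (here the predictor is assumed to output log(1 + d)), and the standalone function below is not the library method.

```python
import math


def format_durations(log_durations, mask, length_scale=1.0):
    """Turn log-scale duration predictions into integer frame counts."""
    formatted = []
    for log_d, keep in zip(log_durations, mask):
        linear = math.exp(log_d) - 1.0     # 1. log -> linear (assumes log(1 + d))
        scaled = linear * length_scale     # 2. speed adjustment
        masked = scaled * keep             # 3. zero out padded positions
        clamped = max(masked, 1.0)         # 4. durations below 1 become 1
        formatted.append(round(clamped))   # 5. round to whole frames
    return formatted


log_durations = [math.log(4.0), math.log(1.2), 0.0]
mask = [1, 1, 0]  # last position is padding
durations = format_durations(log_durations, mask)
slower = format_durations(log_durations, mask, length_scale=2.0)
print(durations, slower)
```

In this sketch padded positions also end up with a duration of 1 after step 4, so a caller would still need the mask to ignore them.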
- forward(x, x_lengths, y_lengths, y=None, dr=None, pitch=None, energy=None, aux_input={'d_vectors': None, 'speaker_ids': None})
Model’s forward pass.
- Parameters:
x (torch.LongTensor) – Input character sequences.
x_lengths (torch.LongTensor) – Input sequence lengths.
y_lengths (torch.LongTensor) – Output sequence lengths.
y (torch.FloatTensor) – Spectrogram frames. Only used when the alignment network is on. Defaults to None.
dr (torch.IntTensor) – Character durations over the spectrogram frames. Only used when the alignment network is off. Defaults to None.
pitch (torch.FloatTensor) – Pitch values for each spectrogram frame. Only used when the pitch predictor is on. Defaults to None.
energy (torch.FloatTensor) – Energy values for each spectrogram frame. Only used when the energy predictor is on. Defaults to None.
aux_input (Dict) – Auxiliary model inputs for multi-speaker training. Defaults to {“d_vectors”: None, “speaker_ids”: None}.
Shapes
x: \([B, T_max]\)
x_lengths: \([B]\)
y_lengths: \([B]\)
y: \([B, T_max2]\)
dr: \([B, T_max]\)
g: \([B, C]\)
pitch: \([B, 1, T]\)
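The `pitch` input holds one value per spectrogram frame, while the pitch predictor is described as predicting an average pitch value per input character. A minimal sketch of that reduction, given per-character durations, could look like the following. The function name is hypothetical, and skipping unvoiced frames marked as 0.0 is an assumption, not something this page states.

```python
def average_pitch_per_token(frame_pitch, durations):
    """Average frame-level pitch over the frames assigned to each token.

    Assumption: unvoiced frames are marked with 0.0 and are skipped.
    """
    averages, start = [], 0
    for duration in durations:
        voiced = [p for p in frame_pitch[start:start + duration] if p != 0.0]
        averages.append(sum(voiced) / len(voiced) if voiced else 0.0)
        start += duration
    return averages


frame_pitch = [100.0, 110.0, 0.0, 130.0, 200.0, 220.0, 0.0]
token_pitch = average_pitch_per_token(frame_pitch, [1, 3, 2, 1])
print(token_pitch)  # [100.0, 120.0, 210.0, 0.0]
```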
- static generate_attn(dr, x_mask, y_mask=None)
Generate an attention mask from the durations.
- Shapes
dr: \((B, T_{en})\)
x_mask: \((B, T_{en})\)
y_mask: \((B, T_{de})\)
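The `x_mask` and `y_mask` arguments mark which encoder and decoder positions are real rather than padding. A minimal sketch of building such masks from sequence lengths like `x_lengths` and `y_lengths`, assuming 1 means a valid position; `sequence_mask` here is a hypothetical plain-Python helper, not the library function.

```python
def sequence_mask(lengths, max_len=None):
    """Return a [B x T] list of 0/1 flags; 1 marks a non-padded position."""
    max_len = max(lengths) if max_len is None else max_len
    return [[1 if t < length else 0 for t in range(max_len)] for length in lengths]


x_mask = sequence_mask([3, 1])             # two sequences padded to length 3
y_mask = sequence_mask([5, 2], max_len=6)  # explicit padded length
for row in x_mask:
    print(row)
```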
- inference(x, aux_input={'d_vectors': None, 'speaker_ids': None})
Model’s inference pass.
- Parameters:
x (torch.LongTensor) – Input character sequence.
aux_input (Dict) – Auxiliary model inputs. Defaults to {“d_vectors”: None, “speaker_ids”: None}.
Shapes
x: [B, T_max]
x_lengths: [B]
g: [B, C]
- static init_from_config(config, samples=None)
Initialize the model from a config.
- Parameters:
config (ForwardTTSConfig) – Model config.
samples (Union[List[List], List[Dict]]) – Training samples to parse speaker ids for training. Defaults to None.
FastPitchConfig
- class TTS.tts.configs.fast_pitch_config.FastPitchConfig(output_path='output', logger_uri=None, run_name='run', project_name=None, run_description='🐸Coqui trainer run.', print_step=25, plot_step=100, model_param_stats=False, wandb_entity=None, dashboard_logger='tensorboard', save_on_interrupt=True, log_model_step=None, save_step=10000, save_n_checkpoints=5, save_checkpoints=True, save_all_best=False, save_best_after=10000, target_loss=None, print_eval=False, test_delay_epochs=0, run_eval=True, run_eval_steps=None, distributed_backend='nccl', distributed_url='tcp://localhost:54321', mixed_precision=False, precision='fp16', epochs=1000, batch_size=32, eval_batch_size=16, grad_clip=5.0, scheduler_after_epoch=True, lr=0.0001, optimizer='Adam', optimizer_params=<factory>, lr_scheduler='NoamLR', lr_scheduler_params=<factory>, use_grad_scaler=False, allow_tf32=False, cudnn_enable=True, cudnn_deterministic=False, cudnn_benchmark=False, training_seed=54321, model='fast_pitch', num_loader_workers=0, num_eval_loader_workers=0, use_noise_augment=False, audio=<factory>, use_phonemes=False, phonemizer=None, phoneme_language=None, compute_input_seq_cache=False, text_cleaner=None, enable_eos_bos_chars=False, test_sentences_file='', phoneme_cache_path=None, characters=None, add_blank=False, batch_group_size=0, loss_masking=None, min_audio_len=1, max_audio_len=inf, min_text_len=1, max_text_len=inf, compute_f0=True, compute_energy=False, compute_linear_spec=False, precompute_num_workers=0, start_by_longest=False, shuffle=False, drop_last=False, datasets=<factory>, test_sentences=<factory>, eval_split_max_size=None, eval_split_size=0.01, use_speaker_weighted_sampler=False, speaker_weighted_sampler_alpha=1.0, use_language_weighted_sampler=False, language_weighted_sampler_alpha=1.0, use_length_weighted_sampler=False, length_weighted_sampler_alpha=1.0, base_model='forward_tts', model_args=<factory>, num_speakers=0, speakers_file=None, use_speaker_embedding=False, 
use_d_vector_file=False, d_vector_file=False, d_vector_dim=0, spec_loss_type='mse', duration_loss_type='mse', use_ssim_loss=True, ssim_loss_alpha=1.0, spec_loss_alpha=1.0, aligner_loss_alpha=1.0, pitch_loss_alpha=0.1, dur_loss_alpha=0.1, binary_align_loss_alpha=0.1, binary_loss_warmup_epochs=150, min_seq_len=13, max_seq_len=200, r=1, f0_cache_path=None)
Configure ForwardTTS as FastPitch model.
Example
>>> from TTS.tts.configs.fast_pitch_config import FastPitchConfig
>>> config = FastPitchConfig()
- Parameters:
model (str) – Model name used for selecting the right model at initialization. Defaults to fast_pitch.
base_model (str) – Name of the base model being configured as this model so that 🐸 TTS knows it needs to initiate the base model rather than searching for the model implementation. Defaults to forward_tts.
model_args (Coqpit) – Model class arguments. Check ForwardTTSArgs for more details. Defaults to ForwardTTSArgs().
data_dep_init_steps (int) – Number of steps used for computing normalization parameters at the beginning of the training. GlowTTS uses Activation Normalization that pre-computes normalization stats at the beginning and use the same values for the rest. Defaults to 10.
speakers_file (str) – Path to the file containing the list of speakers. Needed at inference for loading matching speaker ids to speaker names. Defaults to None.
use_speaker_embedding (bool) – enable / disable using speaker embeddings for multi-speaker models. If set True, the model is in the multi-speaker mode. Defaults to False.
use_d_vector_file (bool) – enable /disable using external speaker embeddings in place of the learned embeddings. Defaults to False.
d_vector_file (str) – Path to the file including pre-computed speaker embeddings. Defaults to None.
d_vector_dim (int) – Dimension of the external speaker embeddings. Defaults to 0.
optimizer (str) – Name of the model optimizer. Defaults to Adam.
optimizer_params (dict) – Arguments of the model optimizer. Defaults to {“betas”: [0.9, 0.998], “weight_decay”: 1e-6}.
lr_scheduler (str) – Name of the learning rate scheduler. Defaults to NoamLR.
lr_scheduler_params (dict) – Arguments of the learning rate scheduler. Defaults to {“warmup_steps”: 4000}.
lr (float) – Initial learning rate. Defaults to 1e-4.
grad_clip (float) – Gradient norm clipping value. Defaults to 5.0.
spec_loss_type (str) – Type of the spectrogram loss. Check ForwardTTSLoss for possible values. Defaults to mse.
duration_loss_type (str) – Type of the duration loss. Check ForwardTTSLoss for possible values. Defaults to mse.
use_ssim_loss (bool) – Enable/disable the use of SSIM (Structural Similarity) loss. Defaults to True.
wd (float) – Weight decay coefficient. Defaults to 1e-7.
ssim_loss_alpha (float) – Weight for the SSIM loss. If set 0, disables the SSIM loss. Defaults to 1.0.
dur_loss_alpha (float) – Weight for the duration predictor’s loss. If set 0, disables the duration loss. Defaults to 0.1.
spec_loss_alpha (float) – Weight for the spectrogram loss. If set 0, disables the spectrogram loss. Defaults to 1.0.
pitch_loss_alpha (float) – Weight for the pitch predictor’s loss. If set 0, disables the pitch predictor. Defaults to 0.1.
binary_align_loss_alpha (float) – Weight for the binary loss. If set 0, disables the binary loss. Defaults to 0.1.
binary_loss_warmup_epochs (float) – Number of epochs to gradually increase the binary loss impact. Defaults to 150.
min_seq_len (int) – Minimum input sequence length to be used at training.
max_seq_len (int) – Maximum input sequence length to be used at training. Larger values result in more VRAM usage.
compute_f0 (bool) – Compute pitch. Defaults to True.
f0_cache_path (str) – Pitch cache path. Defaults to None.
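The `*_loss_alpha` parameters act as weights on the individual loss terms, and `binary_loss_warmup_epochs` controls how gradually the binary loss gains influence. A rough sketch of how such weights could combine, assuming a plain weighted sum and a linear warmup ramp (both are assumptions, and the function names are hypothetical); the alpha values are taken from the FastPitchConfig signature, while the loss values are made up for illustration.

```python
import math


def binary_loss_weight(epoch, warmup_epochs, alpha):
    """Linearly ramp the binary loss weight up to alpha over the warmup epochs."""
    if warmup_epochs <= 0:
        return alpha
    return min(epoch / warmup_epochs, 1.0) * alpha


def total_loss(losses, alphas):
    """Weighted sum of loss terms; a term with alpha 0 is dropped."""
    return sum(alphas[name] * value for name, value in losses.items() if alphas[name] > 0)


# Weights from the FastPitchConfig signature.
alphas = {"spec": 1.0, "ssim": 1.0, "dur": 0.1, "pitch": 0.1, "aligner": 1.0}
# Made-up loss values, for illustration only.
losses = {"spec": 0.5, "ssim": 0.25, "dur": 2.0, "pitch": 1.0, "aligner": 0.25}

combined = total_loss(losses, alphas)
halfway = binary_loss_weight(epoch=75, warmup_epochs=150, alpha=0.1)
after_warmup = binary_loss_weight(epoch=300, warmup_epochs=150, alpha=0.1)
print(combined, halfway, after_warmup)
```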
SpeedySpeechConfig
- class TTS.tts.configs.speedy_speech_config.SpeedySpeechConfig(output_path='output', logger_uri=None, run_name='run', project_name=None, run_description='🐸Coqui trainer run.', print_step=25, plot_step=100, model_param_stats=False, wandb_entity=None, dashboard_logger='tensorboard', save_on_interrupt=True, log_model_step=None, save_step=10000, save_n_checkpoints=5, save_checkpoints=True, save_all_best=False, save_best_after=10000, target_loss=None, print_eval=False, test_delay_epochs=0, run_eval=True, run_eval_steps=None, distributed_backend='nccl', distributed_url='tcp://localhost:54321', mixed_precision=False, precision='fp16', epochs=1000, batch_size=32, eval_batch_size=16, grad_clip=5.0, scheduler_after_epoch=True, lr=0.0001, optimizer='Adam', optimizer_params=<factory>, lr_scheduler='NoamLR', lr_scheduler_params=<factory>, use_grad_scaler=False, allow_tf32=False, cudnn_enable=True, cudnn_deterministic=False, cudnn_benchmark=False, training_seed=54321, model='speedy_speech', num_loader_workers=0, num_eval_loader_workers=0, use_noise_augment=False, audio=<factory>, use_phonemes=False, phonemizer=None, phoneme_language=None, compute_input_seq_cache=False, text_cleaner=None, enable_eos_bos_chars=False, test_sentences_file='', phoneme_cache_path=None, characters=None, add_blank=False, batch_group_size=0, loss_masking=None, min_audio_len=1, max_audio_len=inf, min_text_len=1, max_text_len=inf, compute_f0=False, compute_energy=False, compute_linear_spec=False, precompute_num_workers=0, start_by_longest=False, shuffle=False, drop_last=False, datasets=<factory>, test_sentences=<factory>, eval_split_max_size=None, eval_split_size=0.01, use_speaker_weighted_sampler=False, speaker_weighted_sampler_alpha=1.0, use_language_weighted_sampler=False, language_weighted_sampler_alpha=1.0, use_length_weighted_sampler=False, length_weighted_sampler_alpha=1.0, base_model='forward_tts', model_args=<factory>, num_speakers=0, speakers_file=None, use_speaker_embedding=False, 
use_d_vector_file=False, d_vector_file=False, d_vector_dim=0, spec_loss_type='l1', duration_loss_type='huber', use_ssim_loss=False, ssim_loss_alpha=1.0, dur_loss_alpha=1.0, spec_loss_alpha=1.0, aligner_loss_alpha=1.0, binary_align_loss_alpha=0.3, binary_loss_warmup_epochs=150, min_seq_len=13, max_seq_len=200, r=1, f0_cache_path=None)
Configure ForwardTTS as SpeedySpeech model.
Example
>>> from TTS.tts.configs.speedy_speech_config import SpeedySpeechConfig
>>> config = SpeedySpeechConfig()
- Args:
- model (str):
Model name used for selecting the right model at initialization. Defaults to speedy_speech.
- base_model (str):
Name of the base model being configured as this model so that 🐸 TTS knows it needs to initiate the base model rather than searching for the model implementation. Defaults to forward_tts.
- model_args (Coqpit):
Model class arguments. Check ForwardTTSArgs for more details. Defaults to ForwardTTSArgs().
- data_dep_init_steps (int):
Number of steps used for computing normalization parameters at the beginning of the training. GlowTTS uses Activation Normalization that pre-computes normalization stats at the beginning and use the same values for the rest. Defaults to 10.
- speakers_file (str):
Path to the file containing the list of speakers. Needed at inference for loading matching speaker ids to speaker names. Defaults to None.
- use_speaker_embedding (bool):
enable / disable using speaker embeddings for multi-speaker models. If set True, the model is in the multi-speaker mode. Defaults to False.
- use_d_vector_file (bool):
enable /disable using external speaker embeddings in place of the learned embeddings. Defaults to False.
- d_vector_file (str):
Path to the file including pre-computed speaker embeddings. Defaults to None.
- d_vector_dim (int):
Dimension of the external speaker embeddings. Defaults to 0.
- optimizer (str):
Name of the model optimizer. Defaults to Adam.
- optimizer_params (dict):
Arguments of the model optimizer. Defaults to {“betas”: [0.9, 0.998], “weight_decay”: 1e-6}.
- lr_scheduler (str):
Name of the learning rate scheduler. Defaults to NoamLR.
- lr_scheduler_params (dict):
Arguments of the learning rate scheduler. Defaults to {“warmup_steps”: 4000}.
- lr (float):
Initial learning rate. Defaults to 1e-4.
- grad_clip (float):
Gradient norm clipping value. Defaults to 5.0.
- spec_loss_type (str):
Type of the spectrogram loss. Check ForwardTTSLoss for possible values. Defaults to l1.
- duration_loss_type (str):
Type of the duration loss. Check ForwardTTSLoss for possible values. Defaults to huber.
- use_ssim_loss (bool):
Enable/disable the use of SSIM (Structural Similarity) loss. Defaults to False.
- wd (float):
Weight decay coefficient. Defaults to 1e-7.
- ssim_loss_alpha (float):
Weight for the SSIM loss. If set 0, disables the SSIM loss. Defaults to 1.0.
- dur_loss_alpha (float):
Weight for the duration predictor’s loss. If set 0, disables the huber loss. Defaults to 1.0.
- spec_loss_alpha (float):
Weight for the L1 spectrogram loss. If set 0, disables the L1 loss. Defaults to 1.0.
- binary_align_loss_alpha (float):
Weight for the binary loss. If set 0, disables the binary loss. Defaults to 0.3.
- binary_loss_warmup_epochs (float):
Number of epochs to gradually increase the binary loss impact. Defaults to 150.
- min_seq_len (int):
Minimum input sequence length to be used at training.
- max_seq_len (int):
Maximum input sequence length to be used at training. Larger values result in more VRAM usage.
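The `spec_loss_type` and `duration_loss_type` options name standard regression losses (`l1`, `mse`, `huber`). For orientation, here is a plain-Python sketch of the three with mean reduction over a flat list. The Huber threshold `delta=1.0` is an assumption, and these standalone functions are illustrative, not the ForwardTTSLoss implementation.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse_loss(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def huber_loss(pred, target, delta=1.0):
    """Quadratic for small errors, linear for large ones (delta is assumed)."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        total += 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)
    return total / len(pred)


pred, target = [2.0, 5.0, 1.0], [2.5, 2.0, 1.0]
print(l1_loss(pred, target), mse_loss(pred, target), huber_loss(pred, target))
```

The Huber loss grows only linearly for the outlier in this example (error of 3 frames), which makes it less sensitive than MSE to occasional large duration errors.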
FastSpeechConfig
- class TTS.tts.configs.fast_speech_config.FastSpeechConfig(output_path='output', logger_uri=None, run_name='run', project_name=None, run_description='🐸Coqui trainer run.', print_step=25, plot_step=100, model_param_stats=False, wandb_entity=None, dashboard_logger='tensorboard', save_on_interrupt=True, log_model_step=None, save_step=10000, save_n_checkpoints=5, save_checkpoints=True, save_all_best=False, save_best_after=10000, target_loss=None, print_eval=False, test_delay_epochs=0, run_eval=True, run_eval_steps=None, distributed_backend='nccl', distributed_url='tcp://localhost:54321', mixed_precision=False, precision='fp16', epochs=1000, batch_size=32, eval_batch_size=16, grad_clip=5.0, scheduler_after_epoch=True, lr=0.0001, optimizer='Adam', optimizer_params=<factory>, lr_scheduler='NoamLR', lr_scheduler_params=<factory>, use_grad_scaler=False, allow_tf32=False, cudnn_enable=True, cudnn_deterministic=False, cudnn_benchmark=False, training_seed=54321, model='fast_speech', num_loader_workers=0, num_eval_loader_workers=0, use_noise_augment=False, audio=<factory>, use_phonemes=False, phonemizer=None, phoneme_language=None, compute_input_seq_cache=False, text_cleaner=None, enable_eos_bos_chars=False, test_sentences_file='', phoneme_cache_path=None, characters=None, add_blank=False, batch_group_size=0, loss_masking=None, min_audio_len=1, max_audio_len=inf, min_text_len=1, max_text_len=inf, compute_f0=False, compute_energy=False, compute_linear_spec=False, precompute_num_workers=0, start_by_longest=False, shuffle=False, drop_last=False, datasets=<factory>, test_sentences=<factory>, eval_split_max_size=None, eval_split_size=0.01, use_speaker_weighted_sampler=False, speaker_weighted_sampler_alpha=1.0, use_language_weighted_sampler=False, language_weighted_sampler_alpha=1.0, use_length_weighted_sampler=False, length_weighted_sampler_alpha=1.0, base_model='forward_tts', model_args=<factory>, num_speakers=0, speakers_file=None, use_speaker_embedding=False, 
use_d_vector_file=False, d_vector_file=False, d_vector_dim=0, spec_loss_type='mse', duration_loss_type='mse', use_ssim_loss=True, ssim_loss_alpha=1.0, dur_loss_alpha=1.0, spec_loss_alpha=1.0, pitch_loss_alpha=0.0, aligner_loss_alpha=1.0, binary_align_loss_alpha=1.0, binary_loss_warmup_epochs=150, min_seq_len=13, max_seq_len=200, r=1, f0_cache_path=None)
Configure ForwardTTS as FastSpeech model.
Example
>>> from TTS.tts.configs.fast_speech_config import FastSpeechConfig
>>> config = FastSpeechConfig()
- Parameters:
model (str) – Model name used for selecting the right model at initialization. Defaults to fast_speech.
base_model (str) – Name of the base model being configured as this model so that 🐸 TTS knows it needs to initiate the base model rather than searching for the model implementation. Defaults to forward_tts.
model_args (Coqpit) – Model class arguments. Check ForwardTTSArgs for more details. Defaults to ForwardTTSArgs().
data_dep_init_steps (int) – Number of steps used for computing normalization parameters at the beginning of the training. GlowTTS uses Activation Normalization that pre-computes normalization stats at the beginning and use the same values for the rest. Defaults to 10.
speakers_file (str) – Path to the file containing the list of speakers. Needed at inference for loading matching speaker ids to speaker names. Defaults to None.
use_speaker_embedding (bool) – enable / disable using speaker embeddings for multi-speaker models. If set True, the model is in the multi-speaker mode. Defaults to False.
use_d_vector_file (bool) – enable /disable using external speaker embeddings in place of the learned embeddings. Defaults to False.
d_vector_file (str) – Path to the file including pre-computed speaker embeddings. Defaults to None.
d_vector_dim (int) – Dimension of the external speaker embeddings. Defaults to 0.
optimizer (str) – Name of the model optimizer. Defaults to Adam.
optimizer_params (dict) – Arguments of the model optimizer. Defaults to {“betas”: [0.9, 0.998], “weight_decay”: 1e-6}.
lr_scheduler (str) – Name of the learning rate scheduler. Defaults to NoamLR.
lr_scheduler_params (dict) – Arguments of the learning rate scheduler. Defaults to {“warmup_steps”: 4000}.
lr (float) – Initial learning rate. Defaults to 1e-4.
grad_clip (float) – Gradient norm clipping value. Defaults to 5.0.
spec_loss_type (str) – Type of the spectrogram loss. Check ForwardTTSLoss for possible values. Defaults to mse.
duration_loss_type (str) – Type of the duration loss. Check ForwardTTSLoss for possible values. Defaults to mse.
use_ssim_loss (bool) – Enable/disable the use of SSIM (Structural Similarity) loss. Defaults to True.
wd (float) – Weight decay coefficient. Defaults to 1e-7.
ssim_loss_alpha (float) – Weight for the SSIM loss. If set 0, disables the SSIM loss. Defaults to 1.0.
dur_loss_alpha (float) – Weight for the duration predictor’s loss. If set 0, disables the duration loss. Defaults to 1.0.
spec_loss_alpha (float) – Weight for the spectrogram loss. If set 0, disables the spectrogram loss. Defaults to 1.0.
pitch_loss_alpha (float) – Weight for the pitch predictor’s loss. If set 0, disables the pitch predictor. Defaults to 0.0.
binary_align_loss_alpha (float) – Weight for the binary loss. If set 0, disables the binary loss. Defaults to 1.0.
binary_loss_warmup_epochs (float) – Number of epochs to gradually increase the binary loss impact. Defaults to 150.
min_seq_len (int) – Minimum input sequence length to be used at training.
max_seq_len (int) – Maximum input sequence length to be used at training. Larger values result in more VRAM usage.
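To give a feel for how `lr` and the `warmup_steps` scheduler argument interact, here is a sketch of a Noam-style schedule. It assumes the common form from the Transformer literature, normalized so the rate peaks at `base_lr` when `step == warmup_steps`. Whether the library's `NoamLR` matches this form exactly is an assumption, and `noam_lr` is a hypothetical helper.

```python
import math


def noam_lr(step, base_lr=1e-4, warmup_steps=4000):
    """Linear warmup up to base_lr, then inverse-square-root decay."""
    step = max(step, 1)  # avoid a zero step
    scale = warmup_steps ** 0.5 * min(step * warmup_steps ** -1.5, step ** -0.5)
    return base_lr * scale


for step in (1000, 4000, 16000):
    print(step, noam_lr(step))
```

Under these assumptions the rate is a quarter of `lr` at step 1000, reaches `lr` at step 4000, and falls to half of `lr` by step 16000.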