Forward TTS model(s)

A general feed-forward TTS model implementation that can be configured as different architectures by setting different encoder and decoder networks. It can be trained with either pre-computed durations (from a pre-trained Tacotron) or an alignment network that learns the text-to-audio alignment from the input data.

Currently we provide the following pre-configured architectures:

  • FastSpeech:

    It is a feed-forward TTS model that uses Feed Forward Transformer (FFT) modules as the encoder and decoder.

  • FastPitch:

    It uses the same architecture as FastSpeech, additionally conditioned on fundamental frequency (f0) contours, with the promise of more expressive speech.

  • SpeedySpeech:

    It uses Residual Convolution layers instead of Transformers, which leads to a more compute-friendly model.

  • FastSpeech2 (TODO):

    Similar to FastPitch, but it additionally uses spectral energy values.

Important resources & papers

ForwardTTSArgs

class TTS.tts.models.forward_tts.ForwardTTSArgs(num_chars=None, out_channels=80, hidden_channels=384, num_speakers=0, use_aligner=True, use_pitch=True, pitch_predictor_hidden_channels=256, pitch_predictor_kernel_size=3, pitch_predictor_dropout_p=0.1, pitch_embedding_kernel_size=3, duration_predictor_hidden_channels=256, duration_predictor_kernel_size=3, duration_predictor_dropout_p=0.1, positional_encoding=True, poisitonal_encoding_use_scale=True, length_scale=1, encoder_type='fftransformer', encoder_params=<factory>, decoder_type='fftransformer', decoder_params=<factory>, use_d_vector=False, d_vector_dim=0, detach_duration_predictor=False, max_duration=75)[source]

ForwardTTS Model arguments.

Parameters
  • num_chars (int) – Number of characters in the vocabulary. Defaults to None.

  • out_channels (int) – Number of output channels. Defaults to 80.

  • hidden_channels (int) – Number of base hidden channels of the model. Defaults to 384.

  • num_speakers (int) – Number of speakers for the speaker embedding layer. Defaults to 0.

  • use_aligner (bool) – Whether to use the aligner network to learn the text-to-speech alignment or to use pre-computed durations. If set to False, durations should be computed by TTS/bin/compute_attention_masks.py and the path to the pre-computed durations must be provided in config.datasets[0].meta_file_attn_mask. Defaults to True.

  • use_pitch (bool) – Use pitch predictor to learn the pitch. Defaults to True.

  • duration_predictor_hidden_channels (int) – Number of hidden channels in the duration predictor. Defaults to 256.

  • duration_predictor_dropout_p (float) – Dropout rate for the duration predictor. Defaults to 0.1.

  • duration_predictor_kernel_size (int) – Kernel size of conv layers in the duration predictor. Defaults to 3.

  • pitch_predictor_hidden_channels (int) – Number of hidden channels in the pitch predictor. Defaults to 256.

  • pitch_predictor_dropout_p (float) – Dropout rate for the pitch predictor. Defaults to 0.1.

  • pitch_predictor_kernel_size (int) – Kernel size of conv layers in the pitch predictor. Defaults to 3.

  • pitch_embedding_kernel_size (int) – Kernel size of the projection layer in the pitch predictor. Defaults to 3.

  • positional_encoding (bool) – Whether to use positional encoding. Defaults to True.

  • positional_encoding_use_scale (bool) – Whether to use a learnable scale coefficient in the positional encoding. Note that the class signature spells this field poisitonal_encoding_use_scale. Defaults to True.

  • length_scale (int) – Length scale that multiplies the predicted durations. Larger values result in slower speech. Defaults to 1.

  • encoder_type (str) – Type of the encoder module. One of the encoders available in TTS.tts.layers.feed_forward.encoder. Defaults to fftransformer as in the paper.

  • encoder_params (dict) – Parameters of the encoder module. Defaults to `{"hidden_channels_ffn": 1024, "num_heads": 1, "num_layers": 6, "dropout_p": 0.1}`

  • decoder_type (str) – Type of the decoder module. One of the decoders available in TTS.tts.layers.feed_forward.decoder. Defaults to fftransformer as in the paper.

  • decoder_params (dict) – Parameters of the decoder module. Defaults to `{"hidden_channels_ffn": 1024, "num_heads": 1, "num_layers": 6, "dropout_p": 0.1}`

  • use_d_vector (bool) – Whether to use precomputed d-vectors for multi-speaker training. Defaults to False.

  • d_vector_dim (int) – Number of channels of the d-vectors. Defaults to 0.

  • detach_duration_predictor (bool) – Detach the input to the duration predictor from the earlier computation graph so that the duration loss does not propagate to the earlier layers. Defaults to False.

  • max_duration (int) – Maximum duration accepted by the model. Defaults to 75.

ForwardTTS Model

class TTS.tts.models.forward_tts.ForwardTTS(config)[source]

General forward TTS model implementation that uses an encoder-decoder architecture with an optional alignment network and a pitch predictor.

If the alignment network is used, the model learns the text-to-speech alignment from the data instead of using pre-computed durations.

If the pitch predictor is used, the model trains a pitch predictor that predicts the average pitch value for each input character, as in the FastPitch model.
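As a rough illustration of per-character pitch averaging, the following pure-Python sketch averages frame-level pitch over the frames assigned to each character. The function name, the list-based inputs, and the treatment of zero values as unvoiced frames are assumptions made for illustration; this is not the library's implementation.

```python
def average_pitch_over_durations(frame_pitch, durations):
    """Average frame-level pitch over each character's frames (single sequence).

    frame_pitch: list of floats, one value per spectrogram frame.
    durations:   list of ints, number of frames assigned to each character.
    """
    averaged = []
    start = 0
    for d in durations:
        frames = frame_pitch[start:start + d]
        # Assumption: a value of 0 marks an unvoiced frame and is skipped.
        voiced = [p for p in frames if p > 0.0]
        averaged.append(sum(voiced) / len(voiced) if voiced else 0.0)
        start += d
    return averaged


frame_pitch = [100.0, 110.0, 0.0, 130.0, 0.0, 0.0, 200.0]
durations = [1, 3, 2, 1]
char_pitch = average_pitch_over_durations(frame_pitch, durations)
print(char_pitch)
```

The result has one value per input character, which is the granularity at which the pitch predictor described above operates.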

ForwardTTS can be configured as one of these architectures:

  • FastPitch

  • SpeedySpeech

  • FastSpeech

  • TODO: FastSpeech2 (requires average speech energy predictor)

Parameters

config (Coqpit) – Model coqpit class.

Examples

>>> from TTS.tts.models.forward_tts import ForwardTTS, ForwardTTSArgs
>>> config = ForwardTTSArgs()
>>> model = ForwardTTS(config)

expand_encoder_outputs(en, dr, x_mask, y_mask)[source]

Generate the attention alignment map from the durations and expand the encoder outputs accordingly.

Shapes

  • en: \((B, D_{en}, T_{en})\)

  • dr: \((B, T_{en})\)

  • x_mask: \((B, T_{en})\)

  • y_mask: \((B, T_{de})\)

Examples:

encoder output: [a,b,c,d]
durations: [1, 3, 2, 1]

expanded: [a, b, b, b, c, c, d]
attention map: [[0, 0, 0, 0, 0, 0, 1],
                [0, 0, 0, 0, 1, 1, 0],
                [0, 1, 1, 1, 0, 0, 0],
                [1, 0, 0, 0, 0, 0, 0]]

format_durations(o_dr_log, x_mask)[source]

Format predicted durations:

  1. Convert from log scale to linear scale.
  2. Apply the length scale for speed adjustment.
  3. Apply masking.
  4. Cast 0 durations to 1.
  5. Round the duration values.

Parameters
  • o_dr_log – Log scale durations.

  • x_mask – Input text mask.

Shapes

  • o_dr_log: \((B, T_{en})\)

  • x_mask: \((B, T_{en})\)
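The steps listed for format_durations can be sketched in plain Python as follows. This is a minimal illustration on nested lists rather than tensors. It assumes the log-scale durations were encoded as log(1 + d), so that exp(x) - 1 recovers the linear value; that encoding and the `length_scale` keyword argument are assumptions of this sketch.

```python
import math


def format_durations(o_dr_log, x_mask, length_scale=1.0):
    """Sketch of duration formatting for a batch given as nested lists.

    o_dr_log: list of rows with log-scale durations, one value per input token.
    x_mask:   list of rows with 1 for real tokens and 0 for padding.
    """
    formatted = []
    for log_row, mask_row in zip(o_dr_log, x_mask):
        row = []
        for log_d, m in zip(log_row, mask_row):
            d = math.exp(log_d) - 1.0   # 1. log scale -> linear (assumed log(1 + d))
            d = d * length_scale        # 2. speed adjustment
            d = d * m                   # 3. masking
            d = max(d, 1.0)             # 4. durations below 1 become 1
            row.append(int(round(d)))   # 5. round to whole frames
        formatted.append(row)
    return formatted


log_durations = [[math.log(1 + 3.0), math.log(1 + 0.2), math.log(1 + 2.4)]]
mask = [[1, 1, 1]]
normal = format_durations(log_durations, mask)
slower = format_durations(log_durations, mask, length_scale=2.0)
print(normal, slower)
```

Doubling `length_scale` roughly doubles each duration, which is why larger values produce slower speech.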

forward(x, x_lengths, y_lengths, y=None, dr=None, pitch=None, aux_input={'d_vectors': None, 'speaker_ids': None})[source]

Model’s forward pass.

Parameters
  • x (torch.LongTensor) – Input character sequences.

  • x_lengths (torch.LongTensor) – Input sequence lengths.

  • y_lengths (torch.LongTensor) – Output sequence lengths.

  • y (torch.FloatTensor) – Spectrogram frames. Only used when the alignment network is on. Defaults to None.

  • dr (torch.IntTensor) – Character durations over the spectrogram frames. Only used when the alignment network is off. Defaults to None.

  • pitch (torch.FloatTensor) – Pitch values for each spectrogram frame. Only used when the pitch predictor is on. Defaults to None.

  • aux_input (Dict) – Auxiliary model inputs for multi-speaker training. Defaults to {“d_vectors”: None, “speaker_ids”: None}.

Shapes

  • x: \([B, T_max]\)

  • x_lengths: \([B]\)

  • y_lengths: \([B]\)

  • y: \([B, T_max2]\)

  • dr: \([B, T_max]\)

  • g: \([B, C]\)

  • pitch: \([B, 1, T]\)

static generate_attn(dr, x_mask, y_mask=None)[source]

Generate an attention mask from the durations.

Shapes
  • dr: \((B, T_{en})\)

  • x_mask: \((B, T_{en})\)

  • y_mask: \((B, T_{de})\)
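To make the relation between durations, the attention map, and the expanded encoder outputs concrete, here is a small pure-Python sketch for a single sequence. It works on lists instead of tensors and ignores masking; the helper names mirror the methods above, but the bodies are illustrative assumptions, not the library code. Rows are listed first token first, whereas the example under expand_encoder_outputs prints the rows in reverse order.

```python
def generate_attn(durations):
    """Build a (T_en, T_de) 0/1 alignment map from integer durations."""
    total_frames = sum(durations)
    attn = []
    start = 0
    for d in durations:
        # Token i is aligned to frames [start, start + d).
        attn.append([1 if start <= t < start + d else 0 for t in range(total_frames)])
        start += d
    return attn


def expand_encoder_outputs(encoder_outputs, durations):
    """Repeat each encoder output according to the alignment map."""
    attn = generate_attn(durations)
    num_frames = len(attn[0]) if attn else 0
    expanded = []
    for t in range(num_frames):
        for i, row in enumerate(attn):
            if row[t]:
                expanded.append(encoder_outputs[i])
    return expanded, attn


expanded, attn = expand_encoder_outputs(["a", "b", "c", "d"], [1, 3, 2, 1])
print(expanded)
for row in attn:
    print(row)
```

Each output frame is aligned to exactly one input token, so every column of the map contains a single 1.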

inference(x, aux_input={'d_vectors': None, 'speaker_ids': None})[source]

Model’s inference pass.

Parameters
  • x (torch.LongTensor) – Input character sequence.

  • aux_input (Dict) – Auxiliary model inputs. Defaults to {“d_vectors”: None, “speaker_ids”: None}.

Shapes

  • x: [B, T_max]

  • x_lengths: [B]

  • g: [B, C]

on_train_step_start(trainer)[source]

Enable binary alignment loss when needed

FastPitchConfig

class TTS.tts.configs.fast_pitch_config.FastPitchConfig(model='fast_pitch', run_name='coqui_tts', run_description='', epochs=10000, batch_size=None, eval_batch_size=None, mixed_precision=False, scheduler_after_epoch=False, run_eval=True, test_delay_epochs=0, print_eval=False, dashboard_logger='tensorboard', print_step=25, plot_step=100, model_param_stats=False, project_name=None, log_model_step=None, wandb_entity=None, save_step=10000, checkpoint=True, keep_all_best=False, keep_after=10000, num_loader_workers=0, num_eval_loader_workers=0, use_noise_augment=False, output_path=None, distributed_backend='nccl', distributed_url='tcp://localhost:54321', audio=<factory>, use_phonemes=False, use_espeak_phonemes=True, phoneme_language=None, compute_input_seq_cache=False, text_cleaner=None, enable_eos_bos_chars=False, test_sentences_file='', phoneme_cache_path=None, characters=None, batch_group_size=0, loss_masking=None, sort_by_audio_len=False, min_seq_len=13, max_seq_len=200, compute_f0=True, compute_linear_spec=False, add_blank=False, datasets=<factory>, optimizer='Adam', optimizer_params=<factory>, lr_scheduler='NoamLR', lr_scheduler_params=<factory>, test_sentences=<factory>, use_speaker_embedding=False, use_d_vector_file=False, d_vector_dim=0, base_model='forward_tts', model_args=ForwardTTSArgs(num_chars=None, out_channels=80, hidden_channels=384, num_speakers=0, use_aligner=True, use_pitch=True, pitch_predictor_hidden_channels=256, pitch_predictor_kernel_size=3, pitch_predictor_dropout_p=0.1, pitch_embedding_kernel_size=3, duration_predictor_hidden_channels=256, duration_predictor_kernel_size=3, duration_predictor_dropout_p=0.1, positional_encoding=True, poisitonal_encoding_use_scale=True, length_scale=1, encoder_type='fftransformer', encoder_params={'hidden_channels_ffn': 1024, 'num_heads': 1, 'num_layers': 6, 'dropout_p': 0.1}, decoder_type='fftransformer', decoder_params={'hidden_channels_ffn': 1024, 'num_heads': 1, 'num_layers': 6, 'dropout_p': 0.1}, 
use_d_vector=False, d_vector_dim=0, detach_duration_predictor=False, max_duration=75), d_vector_file=False, lr=0.0001, grad_clip=5.0, spec_loss_type='mse', duration_loss_type='mse', use_ssim_loss=True, ssim_loss_alpha=1.0, dur_loss_alpha=1.0, spec_loss_alpha=1.0, pitch_loss_alpha=1.0, aligner_loss_alpha=1.0, binary_align_loss_alpha=1.0, binary_align_loss_start_step=20000, r=1, f0_cache_path=None)[source]

Configure ForwardTTS as FastPitch model.

Example

>>> from TTS.tts.configs import FastPitchConfig
>>> config = FastPitchConfig()

Parameters
  • model (str) – Model name used for selecting the right model at initialization. Defaults to fast_pitch.

  • base_model (str) – Name of the base model being configured as this model so that 🐸 TTS knows it needs to initiate the base model rather than searching for the model implementation. Defaults to forward_tts.

  • model_args (Coqpit) – Model class arguments. Check ForwardTTSArgs for more details. Defaults to ForwardTTSArgs().

  • data_dep_init_steps (int) – Number of steps used for computing normalization parameters at the beginning of training. GlowTTS uses Activation Normalization, which pre-computes normalization stats at the beginning and uses the same values for the rest of training. Defaults to 10.

  • use_speaker_embedding (bool) – Enable or disable speaker embeddings for multi-speaker models. If set to True, the model is in multi-speaker mode. Defaults to False.

  • use_d_vector_file (bool) – Enable or disable external speaker embeddings in place of the learned embeddings. Defaults to False.

  • d_vector_file (str) – Path to the file containing pre-computed speaker embeddings. Defaults to False.

  • d_vector_dim (int) – Dimension of the external speaker embeddings. Defaults to 0.

  • optimizer (str) – Name of the model optimizer. Defaults to Adam.

  • optimizer_params (dict) – Arguments of the model optimizer. Defaults to {“betas”: [0.9, 0.998], “weight_decay”: 1e-6}.

  • lr_scheduler (str) – Name of the learning rate scheduler. Defaults to NoamLR.

  • lr_scheduler_params (dict) – Arguments of the learning rate scheduler. Defaults to {“warmup_steps”: 4000}.

  • lr (float) – Initial learning rate. Defaults to 1e-4.

  • grad_clip (float) – Gradient norm clipping value. Defaults to 5.0.

  • spec_loss_type (str) – Type of the spectrogram loss. Check ForwardTTSLoss for possible values. Defaults to mse.

  • duration_loss_type (str) – Type of the duration loss. Check ForwardTTSLoss for possible values. Defaults to mse.

  • use_ssim_loss (bool) – Enable/disable the use of SSIM (Structural Similarity) loss. Defaults to True.

  • wd (float) – Weight decay coefficient. Defaults to 1e-7.

  • ssim_loss_alpha (float) – Weight for the SSIM loss. If set 0, disables the SSIM loss. Defaults to 1.0.

  • dur_loss_alpha (float) – Weight for the duration predictor’s loss. If set to 0, disables the duration loss. Defaults to 1.0.

  • spec_loss_alpha (float) – Weight for the spectrogram loss. If set to 0, disables the spectrogram loss. Defaults to 1.0.

  • pitch_loss_alpha (float) – Weight for the pitch predictor’s loss. If set 0, disables the pitch predictor. Defaults to 1.0.

  • binary_align_loss_alpha (float) – Weight for the binary alignment loss. If set to 0, disables the binary alignment loss. Defaults to 1.0.

  • binary_align_loss_start_step (int) – Start binary alignment loss after this many steps. Defaults to 20000.

  • min_seq_len (int) – Minimum input sequence length to be used at training.

  • max_seq_len (int) – Maximum input sequence length to be used at training. Larger values result in more VRAM usage.
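The lr, lr_scheduler and lr_scheduler_params entries above together define a warm-up schedule. The sketch below uses the commonly cited Noam formulation (linear warm-up followed by inverse square-root decay) with the default values shown in this section. Whether the library's NoamLR matches this formula exactly is an assumption, and `noam_lr` is a hypothetical helper name.

```python
def noam_lr(step, base_lr=1e-4, warmup_steps=4000):
    """Noam-style schedule: linear warm-up, then inverse square-root decay."""
    step = max(step, 1)  # guard against step 0
    scale = warmup_steps ** 0.5 * min(step * warmup_steps ** -1.5, step ** -0.5)
    return base_lr * scale


for step in (1000, 4000, 16000):
    print(step, noam_lr(step))
```

Under this formulation the learning rate peaks at `base_lr` when `step == warmup_steps`, is a quarter of it at 1000 steps, and falls to half of it at 16000 steps.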

SpeedySpeechConfig

class TTS.tts.configs.speedy_speech_config.SpeedySpeechConfig(model='speedy_speech', run_name='coqui_tts', run_description='', epochs=10000, batch_size=None, eval_batch_size=None, mixed_precision=False, scheduler_after_epoch=False, run_eval=True, test_delay_epochs=0, print_eval=False, dashboard_logger='tensorboard', print_step=25, plot_step=100, model_param_stats=False, project_name=None, log_model_step=None, wandb_entity=None, save_step=10000, checkpoint=True, keep_all_best=False, keep_after=10000, num_loader_workers=0, num_eval_loader_workers=0, use_noise_augment=False, output_path=None, distributed_backend='nccl', distributed_url='tcp://localhost:54321', audio=<factory>, use_phonemes=False, use_espeak_phonemes=True, phoneme_language=None, compute_input_seq_cache=False, text_cleaner=None, enable_eos_bos_chars=False, test_sentences_file='', phoneme_cache_path=None, characters=None, batch_group_size=0, loss_masking=None, sort_by_audio_len=False, min_seq_len=13, max_seq_len=200, compute_f0=False, compute_linear_spec=False, add_blank=False, datasets=<factory>, optimizer='Adam', optimizer_params=<factory>, lr_scheduler='NoamLR', lr_scheduler_params=<factory>, test_sentences=<factory>, use_speaker_embedding=False, use_d_vector_file=False, d_vector_dim=0, base_model='forward_tts', model_args=ForwardTTSArgs(num_chars=None, out_channels=80, hidden_channels=128, num_speakers=0, use_aligner=True, use_pitch=False, pitch_predictor_hidden_channels=256, pitch_predictor_kernel_size=3, pitch_predictor_dropout_p=0.1, pitch_embedding_kernel_size=3, duration_predictor_hidden_channels=256, duration_predictor_kernel_size=3, duration_predictor_dropout_p=0.1, positional_encoding=True, poisitonal_encoding_use_scale=True, length_scale=1, encoder_type='residual_conv_bn', encoder_params={'kernel_size': 4, 'dilations': [1, 2, 4, 1, 2, 4, 1, 2, 4, 1, 2, 4, 1], 'num_conv_blocks': 2, 'num_res_blocks': 13}, decoder_type='residual_conv_bn', decoder_params={'kernel_size': 4, 'dilations': [1, 2, 4, 
8, 1, 2, 4, 8, 1, 2, 4, 8, 1, 2, 4, 8, 1], 'num_conv_blocks': 2, 'num_res_blocks': 17}, use_d_vector=False, d_vector_dim=0, detach_duration_predictor=True, max_duration=75), d_vector_file=False, lr=0.0001, grad_clip=5.0, spec_loss_type='l1', duration_loss_type='huber', use_ssim_loss=False, ssim_loss_alpha=1.0, dur_loss_alpha=1.0, spec_loss_alpha=1.0, aligner_loss_alpha=1.0, binary_align_loss_alpha=0.3, binary_align_loss_start_step=50000, r=1, f0_cache_path=None)[source]

Configure ForwardTTS as SpeedySpeech model.

Example

>>> from TTS.tts.configs import SpeedySpeechConfig
>>> config = SpeedySpeechConfig()

Parameters
  • model (str) – Model name used for selecting the right model at initialization. Defaults to speedy_speech.

  • base_model (str) – Name of the base model being configured as this model so that 🐸 TTS knows it needs to initiate the base model rather than searching for the model implementation. Defaults to forward_tts.

  • model_args (Coqpit) – Model class arguments. Check ForwardTTSArgs for more details. Defaults to ForwardTTSArgs().

  • data_dep_init_steps (int) – Number of steps used for computing normalization parameters at the beginning of training. GlowTTS uses Activation Normalization, which pre-computes normalization stats at the beginning and uses the same values for the rest of training. Defaults to 10.

  • use_speaker_embedding (bool) – Enable or disable speaker embeddings for multi-speaker models. If set to True, the model is in multi-speaker mode. Defaults to False.

  • use_d_vector_file (bool) – Enable or disable external speaker embeddings in place of the learned embeddings. Defaults to False.

  • d_vector_file (str) – Path to the file containing pre-computed speaker embeddings. Defaults to False.

  • d_vector_dim (int) – Dimension of the external speaker embeddings. Defaults to 0.

  • optimizer (str) – Name of the model optimizer. Defaults to Adam.

  • optimizer_params (dict) – Arguments of the model optimizer. Defaults to {“betas”: [0.9, 0.998], “weight_decay”: 1e-6}.

  • lr_scheduler (str) – Name of the learning rate scheduler. Defaults to NoamLR.

  • lr_scheduler_params (dict) – Arguments of the learning rate scheduler. Defaults to {“warmup_steps”: 4000}.

  • lr (float) – Initial learning rate. Defaults to 1e-4.

  • grad_clip (float) – Gradient norm clipping value. Defaults to 5.0.

  • spec_loss_type (str) – Type of the spectrogram loss. Check ForwardTTSLoss for possible values. Defaults to l1.

  • duration_loss_type (str) – Type of the duration loss. Check ForwardTTSLoss for possible values. Defaults to huber.

  • use_ssim_loss (bool) – Enable/disable the use of SSIM (Structural Similarity) loss. Defaults to False.

  • wd (float) – Weight decay coefficient. Defaults to 1e-7.

  • ssim_loss_alpha (float) – Weight for the SSIM loss. If set to 0, disables the SSIM loss. Defaults to 1.0.

  • dur_loss_alpha (float) – Weight for the duration predictor’s loss. If set to 0, disables the huber loss. Defaults to 1.0.

  • spec_loss_alpha (float) – Weight for the L1 spectrogram loss. If set to 0, disables the L1 loss. Defaults to 1.0.

  • binary_align_loss_alpha (float) – Weight for the binary alignment loss. If set to 0, disables the binary alignment loss. Defaults to 0.3.

  • binary_align_loss_start_step (int) – Start binary alignment loss after this many steps. Defaults to 50000.

  • min_seq_len (int) – Minimum input sequence length to be used at training.

  • max_seq_len (int) – Maximum input sequence length to be used at training. Larger values result in more VRAM usage.

FastSpeechConfig

class TTS.tts.configs.fast_speech_config.FastSpeechConfig(model='fast_speech', run_name='coqui_tts', run_description='', epochs=10000, batch_size=None, eval_batch_size=None, mixed_precision=False, scheduler_after_epoch=False, run_eval=True, test_delay_epochs=0, print_eval=False, dashboard_logger='tensorboard', print_step=25, plot_step=100, model_param_stats=False, project_name=None, log_model_step=None, wandb_entity=None, save_step=10000, checkpoint=True, keep_all_best=False, keep_after=10000, num_loader_workers=0, num_eval_loader_workers=0, use_noise_augment=False, output_path=None, distributed_backend='nccl', distributed_url='tcp://localhost:54321', audio=<factory>, use_phonemes=False, use_espeak_phonemes=True, phoneme_language=None, compute_input_seq_cache=False, text_cleaner=None, enable_eos_bos_chars=False, test_sentences_file='', phoneme_cache_path=None, characters=None, batch_group_size=0, loss_masking=None, sort_by_audio_len=False, min_seq_len=13, max_seq_len=200, compute_f0=True, compute_linear_spec=False, add_blank=False, datasets=<factory>, optimizer='Adam', optimizer_params=<factory>, lr_scheduler='NoamLR', lr_scheduler_params=<factory>, test_sentences=<factory>, use_speaker_embedding=False, use_d_vector_file=False, d_vector_dim=0, base_model='forward_tts', model_args=ForwardTTSArgs(num_chars=None, out_channels=80, hidden_channels=384, num_speakers=0, use_aligner=True, use_pitch=False, pitch_predictor_hidden_channels=256, pitch_predictor_kernel_size=3, pitch_predictor_dropout_p=0.1, pitch_embedding_kernel_size=3, duration_predictor_hidden_channels=256, duration_predictor_kernel_size=3, duration_predictor_dropout_p=0.1, positional_encoding=True, poisitonal_encoding_use_scale=True, length_scale=1, encoder_type='fftransformer', encoder_params={'hidden_channels_ffn': 1024, 'num_heads': 1, 'num_layers': 6, 'dropout_p': 0.1}, decoder_type='fftransformer', decoder_params={'hidden_channels_ffn': 1024, 'num_heads': 1, 'num_layers': 6, 'dropout_p': 0.1}, 
use_d_vector=False, d_vector_dim=0, detach_duration_predictor=False, max_duration=75), d_vector_file=False, lr=0.0001, grad_clip=5.0, spec_loss_type='mse', duration_loss_type='mse', use_ssim_loss=True, ssim_loss_alpha=1.0, dur_loss_alpha=1.0, spec_loss_alpha=1.0, pitch_loss_alpha=0.0, aligner_loss_alpha=1.0, binary_align_loss_alpha=1.0, binary_align_loss_start_step=20000, r=1, f0_cache_path=None)[source]

Configure ForwardTTS as FastSpeech model.

Example

>>> from TTS.tts.configs import FastSpeechConfig
>>> config = FastSpeechConfig()

Parameters
  • model (str) – Model name used for selecting the right model at initialization. Defaults to fast_speech.

  • base_model (str) – Name of the base model being configured as this model so that 🐸 TTS knows it needs to initiate the base model rather than searching for the model implementation. Defaults to forward_tts.

  • model_args (Coqpit) – Model class arguments. Check ForwardTTSArgs for more details. Defaults to ForwardTTSArgs().

  • data_dep_init_steps (int) – Number of steps used for computing normalization parameters at the beginning of training. GlowTTS uses Activation Normalization, which pre-computes normalization stats at the beginning and uses the same values for the rest of training. Defaults to 10.

  • use_speaker_embedding (bool) – Enable or disable speaker embeddings for multi-speaker models. If set to True, the model is in multi-speaker mode. Defaults to False.

  • use_d_vector_file (bool) – Enable or disable external speaker embeddings in place of the learned embeddings. Defaults to False.

  • d_vector_file (str) – Path to the file containing pre-computed speaker embeddings. Defaults to False.

  • d_vector_dim (int) – Dimension of the external speaker embeddings. Defaults to 0.

  • optimizer (str) – Name of the model optimizer. Defaults to Adam.

  • optimizer_params (dict) – Arguments of the model optimizer. Defaults to {“betas”: [0.9, 0.998], “weight_decay”: 1e-6}.

  • lr_scheduler (str) – Name of the learning rate scheduler. Defaults to NoamLR.

  • lr_scheduler_params (dict) – Arguments of the learning rate scheduler. Defaults to {“warmup_steps”: 4000}.

  • lr (float) – Initial learning rate. Defaults to 1e-4.

  • grad_clip (float) – Gradient norm clipping value. Defaults to 5.0.

  • spec_loss_type (str) – Type of the spectrogram loss. Check ForwardTTSLoss for possible values. Defaults to mse.

  • duration_loss_type (str) – Type of the duration loss. Check ForwardTTSLoss for possible values. Defaults to mse.

  • use_ssim_loss (bool) – Enable/disable the use of SSIM (Structural Similarity) loss. Defaults to True.

  • wd (float) – Weight decay coefficient. Defaults to 1e-7.

  • ssim_loss_alpha (float) – Weight for the SSIM loss. If set 0, disables the SSIM loss. Defaults to 1.0.

  • dur_loss_alpha (float) – Weight for the duration predictor’s loss. If set to 0, disables the duration loss. Defaults to 1.0.

  • spec_loss_alpha (float) – Weight for the spectrogram loss. If set to 0, disables the spectrogram loss. Defaults to 1.0.

  • pitch_loss_alpha (float) – Weight for the pitch predictor’s loss. If set to 0, disables the pitch predictor. Defaults to 0.0.

  • binary_align_loss_alpha (float) – Weight for the binary alignment loss. If set to 0, disables the binary alignment loss. Defaults to 1.0.

  • binary_align_loss_start_step (int) – Start binary alignment loss after this many steps. Defaults to 20000.

  • min_seq_len (int) – Minimum input sequence length to be used at training.

  • max_seq_len (int) – Maximum input sequence length to be used at training. Larger values result in more VRAM usage.
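The loss weights above can be read as coefficients of a weighted sum, where a weight of 0 drops a term and the binary alignment term only joins after binary_align_loss_start_step. The sketch below illustrates that reading. The function `combine_losses`, the dictionary keys, and the exact step comparison are hypothetical and are not taken from the library's loss implementation.

```python
def combine_losses(losses, alphas, step, binary_align_loss_start_step=20000):
    """Weighted sum of loss terms with optional delayed binary alignment loss.

    losses: dict mapping a (hypothetical) term name to its scalar value.
    alphas: dict mapping the same names to their weights.
    """
    total = 0.0
    for name, value in losses.items():
        alpha = alphas.get(name, 0.0)
        if alpha == 0.0:
            continue  # a zero weight disables the term
        # Assumption: the binary alignment term is skipped until the start step.
        if name == "binary_align" and step < binary_align_loss_start_step:
            continue
        total += alpha * value
    return total


losses = {"spec": 0.5, "ssim": 0.2, "dur": 0.1, "pitch": 0.4, "binary_align": 0.3}
# FastSpeech-like weights: the pitch term is disabled with a weight of 0.
alphas = {"spec": 1.0, "ssim": 1.0, "dur": 1.0, "pitch": 0.0, "binary_align": 1.0}
early = combine_losses(losses, alphas, step=100)
late = combine_losses(losses, alphas, step=25000)
print(early, late)
```

With these example values the pitch term never contributes, and the total grows by the binary alignment term once training passes the start step.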