Forward TTS model(s)#

A general feed-forward TTS model implementation that can be configured to different architectures by setting different encoder and decoder networks. It can be trained with either pre-computed durations (from pre-trained Tacotron) or an alignment network that learns the text to audio alignment from the input data.

Currently we provide the following pre-configured architectures:

  • FastSpeech:

    A feed-forward TTS model that uses Feed Forward Transformer (FFT) modules as the encoder and decoder.

  • FastPitch:

    It uses the same architecture as FastSpeech but is conditioned on fundamental frequency (f0) contours, with the promise of more expressive speech.

  • SpeedySpeech:

    It uses residual convolution layers instead of Transformers, which leads to a more compute-friendly model.

  • FastSpeech2 (TODO):

    Similar to FastPitch, but it additionally uses spectral energy values.
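The three implemented presets differ mainly in the encoder/decoder type, the hidden size, and whether the pitch predictor is enabled. The snippet below restates those differences as a plain dictionary, using the `model_args` defaults shown in the config signatures on this page. `PRESETS` and `describe` are illustrative names and are not part of the library.

```python
# Key ForwardTTSArgs defaults per preset, copied from the config signatures
# documented on this page. PRESETS and describe() are illustrative helpers only.
PRESETS = {
    "fast_speech": {
        "encoder_type": "fftransformer",
        "decoder_type": "fftransformer",
        "hidden_channels": 384,
        "use_pitch": False,
    },
    "fast_pitch": {
        "encoder_type": "fftransformer",
        "decoder_type": "fftransformer",
        "hidden_channels": 384,
        "use_pitch": True,
    },
    "speedy_speech": {
        "encoder_type": "residual_conv_bn",
        "decoder_type": "residual_conv_bn",
        "hidden_channels": 128,
        "use_pitch": False,
    },
}


def describe(name):
    """Return a one-line summary of a preset."""
    args = PRESETS[name]
    pitch = "with" if args["use_pitch"] else "without"
    return (
        f"{name}: {args['encoder_type']} encoder/decoder, "
        f"{args['hidden_channels']} hidden channels, {pitch} pitch predictor"
    )


for preset_name in PRESETS:
    print(describe(preset_name))
```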

Important resources & papers#

ForwardTTSArgs#

class TTS.tts.models.forward_tts.ForwardTTSArgs(num_chars=None, out_channels=80, hidden_channels=384, use_aligner=True, use_pitch=True, pitch_predictor_hidden_channels=256, pitch_predictor_kernel_size=3, pitch_predictor_dropout_p=0.1, pitch_embedding_kernel_size=3, duration_predictor_hidden_channels=256, duration_predictor_kernel_size=3, duration_predictor_dropout_p=0.1, positional_encoding=True, poisitonal_encoding_use_scale=True, length_scale=1, encoder_type='fftransformer', encoder_params=<factory>, decoder_type='fftransformer', decoder_params=<factory>, detach_duration_predictor=False, max_duration=75, num_speakers=1, use_speaker_embedding=False, speakers_file=None, use_d_vector_file=False, d_vector_dim=None, d_vector_file=None)[source]#

ForwardTTS Model arguments.

Parameters
  • num_chars (int) – Number of characters in the vocabulary. Defaults to None.

  • out_channels (int) – Number of output channels. Defaults to 80.

  • hidden_channels (int) – Number of base hidden channels of the model. Defaults to 384.

  • use_aligner (bool) – Whether to use aligner network to learn the text to speech alignment or use pre-computed durations. If set False, durations should be computed by TTS/bin/compute_attention_masks.py and path to the pre-computed durations must be provided to config.datasets[0].meta_file_attn_mask. Defaults to True.

  • use_pitch (bool) – Use pitch predictor to learn the pitch. Defaults to True.

  • duration_predictor_hidden_channels (int) – Number of hidden channels in the duration predictor. Defaults to 256.

  • duration_predictor_dropout_p (float) – Dropout rate for the duration predictor. Defaults to 0.1.

  • duration_predictor_kernel_size (int) – Kernel size of conv layers in the duration predictor. Defaults to 3.

  • pitch_predictor_hidden_channels (int) – Number of hidden channels in the pitch predictor. Defaults to 256.

  • pitch_predictor_dropout_p (float) – Dropout rate for the pitch predictor. Defaults to 0.1.

  • pitch_predictor_kernel_size (int) – Kernel size of conv layers in the pitch predictor. Defaults to 3.

  • pitch_embedding_kernel_size (int) – Kernel size of the projection layer in the pitch predictor. Defaults to 3.

  • positional_encoding (bool) – Whether to use positional encoding. Defaults to True.

  • positional_encoding_use_scale (bool) – Whether to use a learnable scale coefficient in the positional encoding. Note that the class signature spells this argument poisitonal_encoding_use_scale. Defaults to True.

  • length_scale (int) – Length scale that multiplies the predicted durations. Larger values result in slower speech. Defaults to 1.

  • encoder_type (str) – Type of the encoder module. One of the encoders available in TTS.tts.layers.feed_forward.encoder. Defaults to fftransformer as in the paper.

  • encoder_params (dict) – Parameters of the encoder module. Defaults to `{"hidden_channels_ffn": 1024, "num_heads": 1, "num_layers": 6, "dropout_p": 0.1}`

  • decoder_type (str) – Type of the decoder module. One of the decoders available in TTS.tts.layers.feed_forward.decoder. Defaults to fftransformer as in the paper.

  • decoder_params (dict) – Parameters of the decoder module. Defaults to `{"hidden_channels_ffn": 1024, "num_heads": 1, "num_layers": 6, "dropout_p": 0.1}`

  • detach_duration_predictor (bool) – Detach the input to the duration predictor from the earlier computation graph so that the duration loss does not propagate to the earlier layers. Defaults to False.

  • max_duration (int) – Maximum duration accepted by the model. Defaults to 75.

  • num_speakers (int) – Number of speakers for the speaker embedding layer. Defaults to 1.

  • speakers_file (str) – Path to the speaker mapping file for the Speaker Manager. Defaults to None.

  • speaker_embedding_channels (int) – Number of speaker embedding channels. Defaults to 256.

  • use_d_vector_file (bool) – Enable/Disable the use of d-vectors for multi-speaker training. Defaults to False.

  • d_vector_dim (int) – Number of d-vector channels. Defaults to None.

ForwardTTS Model#

class TTS.tts.models.forward_tts.ForwardTTS(config, ap=None, tokenizer=None, speaker_manager=None)[source]#

General forward TTS model implementation that uses an encoder-decoder architecture with an optional alignment network and a pitch predictor.

If the alignment network is used, the model learns the text-to-speech alignment from the data instead of using pre-computed durations.

If the pitch predictor is used, the model trains a pitch predictor that predicts average pitch value for each input character as in the FastPitch model.

ForwardTTS can be configured as one of these architectures:

  • FastPitch

  • SpeedySpeech

  • FastSpeech

  • TODO: FastSpeech2 (requires average speech energy predictor)

Parameters
  • config (Coqpit) – Model coqpit class.

  • speaker_manager (SpeakerManager) – Speaker manager for multi-speaker training. Only used for multi-speaker models. Defaults to None.

Examples

>>> from TTS.tts.models.forward_tts import ForwardTTS, ForwardTTSArgs
>>> config = ForwardTTSArgs()
>>> model = ForwardTTS(config)
expand_encoder_outputs(en, dr, x_mask, y_mask)[source]#

Generate attention alignment map from durations and expand encoder outputs

Shapes

  • en: \((B, D_{en}, T_{en})\)

  • dr: \((B, T_{en})\)

  • x_mask: \((B, T_{en})\)

  • y_mask: \((B, T_{de})\)

Examples:

encoder output: [a, b, c, d]
durations: [1, 3, 2, 1]

expanded: [a, b, b, b, c, c, d]
attention map (rows: encoder steps a to d, columns: output frames):
               [[1, 0, 0, 0, 0, 0, 0],
                [0, 1, 1, 1, 0, 0, 0],
                [0, 0, 0, 0, 1, 1, 0],
                [0, 0, 0, 0, 0, 0, 1]]
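The expansion in this example can be reproduced with a few lines of plain Python. This is only a sketch of the idea on lists, not the tensor-based library code; `attention_from_durations` and `expand_by_durations` are hypothetical names, and rows of the map are indexed by encoder step, first step first.

```python
def attention_from_durations(durations):
    """Build a binary map with one row per encoder step and one column per frame.

    Row i is 1 for the consecutive output frames assigned to encoder step i.
    """
    total_frames = sum(durations)
    attn, start = [], 0
    for d in durations:
        attn.append([1 if start <= t < start + d else 0 for t in range(total_frames)])
        start += d
    return attn


def expand_by_durations(encoder_outputs, durations):
    """Repeat each encoder output as many times as its duration says."""
    attn = attention_from_durations(durations)
    expanded = []
    for t in range(sum(durations)):
        # Pick the encoder step whose row is active at output frame t.
        step = next(i for i, row in enumerate(attn) if row[t])
        expanded.append(encoder_outputs[step])
    return expanded


attn_map = attention_from_durations([1, 3, 2, 1])
expanded = expand_by_durations(["a", "b", "c", "d"], [1, 3, 2, 1])
print(expanded)
for row in attn_map:
    print(row)
```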
format_durations(o_dr_log, x_mask)[source]#

Format predicted durations:

  1. Convert from log scale to linear scale.

  2. Apply the length scale for speed adjustment.

  3. Apply masking.

  4. Cast 0 durations to 1.

  5. Round the duration values.

Parameters
  • o_dr_log – Log scale durations.

  • x_mask – Input text mask.

Shapes

  • o_dr_log: \((B, T_{en})\)

  • x_mask: \((B, T_{en})\)
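The five formatting steps can be sketched on a single sequence with plain Python lists. This is an illustration under assumptions: the log-to-linear conversion is taken to be a plain `exp` (the library may apply an offset), any value below one frame is raised to 1, and `format_durations_sketch` is a hypothetical name.

```python
import math


def format_durations_sketch(log_durations, mask, length_scale=1.0):
    """Apply the documented steps to one sequence of log-scale durations."""
    formatted = []
    for log_d, m in zip(log_durations, mask):
        d = math.exp(log_d)         # 1. log scale -> linear scale (assumed plain exp)
        d = d * length_scale        # 2. length scale for speed adjustment
        d = d * m                   # 3. masking: padded positions become 0
        d = max(d, 1.0)             # 4. values below one frame are raised to 1
        formatted.append(round(d))  # 5. round to an integer number of frames
    return formatted


log_dr = [0.0, math.log(3.2), math.log(0.2)]
print(format_durations_sketch(log_dr, [1, 1, 1]))
print(format_durations_sketch(log_dr, [1, 1, 1], length_scale=2.0))
```

Because step 4 runs after masking in this ordering, masked positions also end up with a duration of 1 in the sketch.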

forward(x, x_lengths, y_lengths, y=None, dr=None, pitch=None, aux_input={'d_vectors': None, 'speaker_ids': None})[source]#

Model’s forward pass.

Parameters
  • x (torch.LongTensor) – Input character sequences.

  • x_lengths (torch.LongTensor) – Input sequence lengths.

  • y_lengths (torch.LongTensor) – Output sequence lengths.

  • y (torch.FloatTensor) – Spectrogram frames. Only used when the alignment network is on. Defaults to None.

  • dr (torch.IntTensor) – Character durations over the spectrogram frames. Only used when the alignment network is off. Defaults to None.

  • pitch (torch.FloatTensor) – Pitch values for each spectrogram frame. Only used when the pitch predictor is on. Defaults to None.

  • aux_input (Dict) – Auxiliary model inputs for multi-speaker training. Defaults to {“d_vectors”: None, “speaker_ids”: None}.

Shapes

  • x: \([B, T_max]\)

  • x_lengths: \([B]\)

  • y_lengths: \([B]\)

  • y: \([B, T_max2]\)

  • dr: \([B, T_max]\)

  • g: \([B, C]\)

  • pitch: \([B, 1, T]\)

static generate_attn(dr, x_mask, y_mask=None)[source]#

Generate an attention mask from the durations.

Shapes
  • dr: \((B, T_{en})\)

  • x_mask: \((B, T_{en})\)

  • y_mask: \((B, T_{de})\)

inference(x, aux_input={'d_vectors': None, 'speaker_ids': None})[source]#

Model’s inference pass.

Parameters
  • x (torch.LongTensor) – Input character sequence.

  • aux_input (Dict) – Auxiliary model inputs. Defaults to {“d_vectors”: None, “speaker_ids”: None}.

Shapes

  • x: [B, T_max]

  • x_lengths: [B]

  • g: [B, C]

static init_from_config(config, samples=None)[source]#

Initialize the model from a config.

Parameters
  • config (ForwardTTSConfig) – Model config.

  • samples (Union[List[List], List[Dict]]) – Training samples to parse speaker ids for training. Defaults to None.

init_multispeaker(config)[source]#

Init for multi-speaker training.

Parameters
  • config (Coqpit) – Model configuration.

on_train_step_start(trainer)[source]#

Schedule binary loss weight.
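A warm-up of this kind can be sketched as a ramp over `binary_loss_warmup_epochs`. The linear shape and the function name `binary_loss_weight` are assumptions for illustration; the text here only states that the binary loss impact is increased gradually.

```python
def binary_loss_weight(epochs_done, warmup_epochs=150):
    """Ramp the weight linearly from 0 to 1 over the warm-up (assumed linear)."""
    if warmup_epochs <= 0:
        # No warm-up requested: use the full weight immediately.
        return 1.0
    return min(epochs_done / warmup_epochs, 1.0)


for epoch in (0, 75, 150, 300):
    print(epoch, binary_loss_weight(epoch))
```

The resulting factor would then scale the binary alignment loss term, on top of `binary_align_loss_alpha`.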

FastPitchConfig#

class TTS.tts.configs.fast_pitch_config.FastPitchConfig(output_path='output', logger_uri=None, run_name='run', project_name=None, run_description='🐸Coqui trainer run.', print_step=25, plot_step=100, model_param_stats=False, wandb_entity=None, dashboard_logger='tensorboard', log_model_step=None, save_step=10000, save_n_checkpoints=5, save_checkpoints=True, save_all_best=False, save_best_after=10000, target_loss=None, print_eval=False, test_delay_epochs=0, run_eval=True, distributed_backend='nccl', distributed_url='tcp://localhost:54321', mixed_precision=False, epochs=1000, batch_size=32, eval_batch_size=16, grad_clip=5.0, scheduler_after_epoch=True, lr=0.0001, optimizer='Adam', optimizer_params=<factory>, lr_scheduler='NoamLR', lr_scheduler_params=<factory>, use_grad_scaler=False, cudnn_enable=True, cudnn_deterministic=False, cudnn_benchmark=True, training_seed=54321, model='fast_pitch', num_loader_workers=0, num_eval_loader_workers=0, use_noise_augment=False, audio=<factory>, use_phonemes=False, phonemizer=None, phoneme_language=None, compute_input_seq_cache=False, text_cleaner=None, enable_eos_bos_chars=False, test_sentences_file='', phoneme_cache_path=None, characters=None, add_blank=False, batch_group_size=0, loss_masking=None, sort_by_audio_len=False, min_audio_len=1, max_audio_len=inf, min_text_len=1, max_text_len=inf, compute_f0=True, compute_linear_spec=False, precompute_num_workers=0, start_by_longest=False, datasets=<factory>, test_sentences=<factory>, eval_split_max_size=None, eval_split_size=0.01, use_speaker_weighted_sampler=False, speaker_weighted_sampler_alpha=1.0, use_language_weighted_sampler=False, language_weighted_sampler_alpha=1.0, base_model='forward_tts', model_args=ForwardTTSArgs(num_chars=None, out_channels=80, hidden_channels=384, use_aligner=True, use_pitch=True, pitch_predictor_hidden_channels=256, pitch_predictor_kernel_size=3, pitch_predictor_dropout_p=0.1, pitch_embedding_kernel_size=3, duration_predictor_hidden_channels=256, 
duration_predictor_kernel_size=3, duration_predictor_dropout_p=0.1, positional_encoding=True, poisitonal_encoding_use_scale=True, length_scale=1, encoder_type='fftransformer', encoder_params={'hidden_channels_ffn': 1024, 'num_heads': 1, 'num_layers': 6, 'dropout_p': 0.1}, decoder_type='fftransformer', decoder_params={'hidden_channels_ffn': 1024, 'num_heads': 1, 'num_layers': 6, 'dropout_p': 0.1}, detach_duration_predictor=False, max_duration=75, num_speakers=1, use_speaker_embedding=False, speakers_file=None, use_d_vector_file=False, d_vector_dim=None, d_vector_file=None), num_speakers=0, speakers_file=None, use_speaker_embedding=False, use_d_vector_file=False, d_vector_file=False, d_vector_dim=0, spec_loss_type='mse', duration_loss_type='mse', use_ssim_loss=True, ssim_loss_alpha=1.0, spec_loss_alpha=1.0, aligner_loss_alpha=1.0, pitch_loss_alpha=0.1, dur_loss_alpha=0.1, binary_align_loss_alpha=0.1, binary_loss_warmup_epochs=150, min_seq_len=13, max_seq_len=200, r=1, f0_cache_path=None)[source]#

Configure ForwardTTS as FastPitch model.

Example

>>> from TTS.tts.configs.fast_pitch_config import FastPitchConfig
>>> config = FastPitchConfig()
Parameters
  • model (str) – Model name used for selecting the right model at initialization. Defaults to fast_pitch.

  • base_model (str) – Name of the base model being configured as this model so that 🐸 TTS knows it needs to initiate the base model rather than searching for the model implementation. Defaults to forward_tts.

  • model_args (Coqpit) – Model class arguments. Check ForwardTTSArgs for more details. Defaults to the ForwardTTSArgs instance shown in the signature above.

  • data_dep_init_steps (int) – Number of steps used for computing normalization parameters at the beginning of the training. GlowTTS uses Activation Normalization that pre-computes normalization stats at the beginning and use the same values for the rest. Defaults to 10.

  • speakers_file (str) – Path to the file containing the list of speakers. Needed at inference for loading matching speaker ids to speaker names. Defaults to None.

  • use_speaker_embedding (bool) – enable / disable using speaker embeddings for multi-speaker models. If set True, the model is in the multi-speaker mode. Defaults to False.

  • use_d_vector_file (bool) – enable /disable using external speaker embeddings in place of the learned embeddings. Defaults to False.

  • d_vector_file (str) – Path to the file including pre-computed speaker embeddings. Defaults to False.

  • d_vector_dim (int) – Dimension of the external speaker embeddings. Defaults to 0.

  • optimizer (str) – Name of the model optimizer. Defaults to Adam.

  • optimizer_params (dict) – Arguments of the model optimizer. Defaults to {“betas”: [0.9, 0.998], “weight_decay”: 1e-6}.

  • lr_scheduler (str) – Name of the learning rate scheduler. Defaults to NoamLR.

  • lr_scheduler_params (dict) – Arguments of the learning rate scheduler. Defaults to {“warmup_steps”: 4000}.

  • lr (float) – Initial learning rate. Defaults to 1e-4.

  • grad_clip (float) – Gradient norm clipping value. Defaults to 5.0.

  • spec_loss_type (str) – Type of the spectrogram loss. Check ForwardTTSLoss for possible values. Defaults to mse.

  • duration_loss_type (str) – Type of the duration loss. Check ForwardTTSLoss for possible values. Defaults to mse.

  • use_ssim_loss (bool) – Enable/disable the use of SSIM (Structural Similarity) loss. Defaults to True.

  • wd (float) – Weight decay coefficient. Defaults to 1e-7.

  • ssim_loss_alpha (float) – Weight for the SSIM loss. If set 0, disables the SSIM loss. Defaults to 1.0.

  • dur_loss_alpha (float) – Weight for the duration predictor’s loss. If set 0, disables the duration loss. Defaults to 0.1.

  • spec_loss_alpha (float) – Weight for the spectrogram loss selected by spec_loss_type. If set 0, disables the spectrogram loss. Defaults to 1.0.

  • pitch_loss_alpha (float) – Weight for the pitch predictor’s loss. If set 0, disables the pitch predictor. Defaults to 0.1.

  • binary_align_loss_alpha (float) – Weight for the binary loss. If set 0, disables the binary loss. Defaults to 0.1.

  • binary_loss_warmup_epochs (float) – Number of epochs to gradually increase the binary loss impact. Defaults to 150.

  • min_seq_len (int) – Minimum input sequence length to be used at training. Defaults to 13.

  • max_seq_len (int) – Maximum input sequence length to be used at training. Larger values result in more VRAM usage. Defaults to 200.
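The `*_alpha` parameters act as weights on the individual loss terms, and a weight of 0 switches a term off. A minimal sketch of such a weighted sum follows, using the FastPitchConfig defaults from the signature above. The helper `weighted_total` and the dictionary keys are illustrative; the actual combination is done in ForwardTTSLoss and may differ in detail.

```python
# Default loss weights taken from the FastPitchConfig signature on this page.
FAST_PITCH_ALPHAS = {
    "spec": 1.0,          # spec_loss_alpha
    "ssim": 1.0,          # ssim_loss_alpha
    "aligner": 1.0,       # aligner_loss_alpha
    "pitch": 0.1,         # pitch_loss_alpha
    "dur": 0.1,           # dur_loss_alpha
    "binary_align": 0.1,  # binary_align_loss_alpha
}


def weighted_total(losses, alphas):
    """Sum the loss terms, skipping every term whose weight is 0."""
    total = 0.0
    for name, alpha in alphas.items():
        if alpha == 0:
            continue  # a zero weight disables the term entirely
        total += alpha * losses[name]
    return total


# Dummy per-term loss values, for illustration only.
dummy_losses = {name: 1.0 for name in FAST_PITCH_ALPHAS}
print(weighted_total(dummy_losses, FAST_PITCH_ALPHAS))
```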

SpeedySpeechConfig#

class TTS.tts.configs.speedy_speech_config.SpeedySpeechConfig(output_path='output', logger_uri=None, run_name='run', project_name=None, run_description='🐸Coqui trainer run.', print_step=25, plot_step=100, model_param_stats=False, wandb_entity=None, dashboard_logger='tensorboard', log_model_step=None, save_step=10000, save_n_checkpoints=5, save_checkpoints=True, save_all_best=False, save_best_after=10000, target_loss=None, print_eval=False, test_delay_epochs=0, run_eval=True, distributed_backend='nccl', distributed_url='tcp://localhost:54321', mixed_precision=False, epochs=1000, batch_size=32, eval_batch_size=16, grad_clip=5.0, scheduler_after_epoch=True, lr=0.0001, optimizer='Adam', optimizer_params=<factory>, lr_scheduler='NoamLR', lr_scheduler_params=<factory>, use_grad_scaler=False, cudnn_enable=True, cudnn_deterministic=False, cudnn_benchmark=True, training_seed=54321, model='speedy_speech', num_loader_workers=0, num_eval_loader_workers=0, use_noise_augment=False, audio=<factory>, use_phonemes=False, phonemizer=None, phoneme_language=None, compute_input_seq_cache=False, text_cleaner=None, enable_eos_bos_chars=False, test_sentences_file='', phoneme_cache_path=None, characters=None, add_blank=False, batch_group_size=0, loss_masking=None, sort_by_audio_len=False, min_audio_len=1, max_audio_len=inf, min_text_len=1, max_text_len=inf, compute_f0=False, compute_linear_spec=False, precompute_num_workers=0, start_by_longest=False, datasets=<factory>, test_sentences=<factory>, eval_split_max_size=None, eval_split_size=0.01, use_speaker_weighted_sampler=False, speaker_weighted_sampler_alpha=1.0, use_language_weighted_sampler=False, language_weighted_sampler_alpha=1.0, base_model='forward_tts', model_args=ForwardTTSArgs(num_chars=None, out_channels=80, hidden_channels=128, use_aligner=True, use_pitch=False, pitch_predictor_hidden_channels=256, pitch_predictor_kernel_size=3, pitch_predictor_dropout_p=0.1, pitch_embedding_kernel_size=3, 
duration_predictor_hidden_channels=256, duration_predictor_kernel_size=3, duration_predictor_dropout_p=0.1, positional_encoding=True, poisitonal_encoding_use_scale=True, length_scale=1, encoder_type='residual_conv_bn', encoder_params={'kernel_size': 4, 'dilations': [1, 2, 4, 1, 2, 4, 1, 2, 4, 1, 2, 4, 1], 'num_conv_blocks': 2, 'num_res_blocks': 13}, decoder_type='residual_conv_bn', decoder_params={'kernel_size': 4, 'dilations': [1, 2, 4, 8, 1, 2, 4, 8, 1, 2, 4, 8, 1, 2, 4, 8, 1], 'num_conv_blocks': 2, 'num_res_blocks': 17}, detach_duration_predictor=True, max_duration=75, num_speakers=1, use_speaker_embedding=False, speakers_file=None, use_d_vector_file=False, d_vector_dim=None, d_vector_file=None), num_speakers=0, speakers_file=None, use_speaker_embedding=False, use_d_vector_file=False, d_vector_file=False, d_vector_dim=0, spec_loss_type='l1', duration_loss_type='huber', use_ssim_loss=False, ssim_loss_alpha=1.0, dur_loss_alpha=1.0, spec_loss_alpha=1.0, aligner_loss_alpha=1.0, binary_align_loss_alpha=0.3, binary_loss_warmup_epochs=150, min_seq_len=13, max_seq_len=200, r=1, f0_cache_path=None)[source]#

Configure ForwardTTS as SpeedySpeech model.

Example

>>> from TTS.tts.configs.speedy_speech_config import SpeedySpeechConfig
>>> config = SpeedySpeechConfig()
Parameters
  • model (str) – Model name used for selecting the right model at initialization. Defaults to speedy_speech.

  • base_model (str) – Name of the base model being configured as this model so that 🐸 TTS knows it needs to initiate the base model rather than searching for the model implementation. Defaults to forward_tts.

  • model_args (Coqpit) – Model class arguments. Check ForwardTTSArgs for more details. Defaults to the ForwardTTSArgs instance shown in the signature above.

  • data_dep_init_steps (int) – Number of steps used for computing normalization parameters at the beginning of the training. GlowTTS uses Activation Normalization that pre-computes normalization stats at the beginning and use the same values for the rest. Defaults to 10.

  • speakers_file (str) – Path to the file containing the list of speakers. Needed at inference for loading matching speaker ids to speaker names. Defaults to None.

  • use_speaker_embedding (bool) – Enable/disable using speaker embeddings for multi-speaker models. If set True, the model is in the multi-speaker mode. Defaults to False.

  • use_d_vector_file (bool) – Enable/disable using external speaker embeddings in place of the learned embeddings. Defaults to False.

  • d_vector_file (str) – Path to the file including pre-computed speaker embeddings. Defaults to False.

  • d_vector_dim (int) – Dimension of the external speaker embeddings. Defaults to 0.

  • optimizer (str) – Name of the model optimizer. Defaults to Adam.

  • optimizer_params (dict) – Arguments of the model optimizer. Defaults to {“betas”: [0.9, 0.998], “weight_decay”: 1e-6}.

  • lr_scheduler (str) – Name of the learning rate scheduler. Defaults to NoamLR.

  • lr_scheduler_params (dict) – Arguments of the learning rate scheduler. Defaults to {“warmup_steps”: 4000}.

  • lr (float) – Initial learning rate. Defaults to 1e-4.

  • grad_clip (float) – Gradient norm clipping value. Defaults to 5.0.

  • spec_loss_type (str) – Type of the spectrogram loss. Check ForwardTTSLoss for possible values. Defaults to l1.

  • duration_loss_type (str) – Type of the duration loss. Check ForwardTTSLoss for possible values. Defaults to huber.

  • use_ssim_loss (bool) – Enable/disable the use of SSIM (Structural Similarity) loss. Defaults to False.

  • wd (float) – Weight decay coefficient. Defaults to 1e-7.

  • ssim_loss_alpha (float) – Weight for the SSIM loss. If set 0, disables the SSIM loss. Defaults to 1.0.

  • dur_loss_alpha (float) – Weight for the duration predictor’s loss. If set 0, disables the huber loss. Defaults to 1.0.

  • spec_loss_alpha (float) – Weight for the L1 spectrogram loss. If set 0, disables the L1 loss. Defaults to 1.0.

  • binary_align_loss_alpha (float) – Weight for the binary loss. If set 0, disables the binary loss. Defaults to 0.3.

  • binary_loss_warmup_epochs (float) – Number of epochs to gradually increase the binary loss impact. Defaults to 150.

  • min_seq_len (int) – Minimum input sequence length to be used at training. Defaults to 13.

  • max_seq_len (int) – Maximum input sequence length to be used at training. Larger values result in more VRAM usage. Defaults to 200.
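The `dilations` lists in the SpeedySpeech `encoder_params` and `decoder_params` shown in the signature follow a repeating pattern with a trailing 1, and their lengths match `num_res_blocks` (13 and 17), presumably one dilation per residual block. The sketch below rebuilds both lists; `cycle_dilations` is a hypothetical helper, not a library function.

```python
def cycle_dilations(pattern, repeats, tail=(1,)):
    """Repeat a dilation pattern and append a closing tail."""
    return list(pattern) * repeats + list(tail)


encoder_dilations = cycle_dilations([1, 2, 4], repeats=4)
decoder_dilations = cycle_dilations([1, 2, 4, 8], repeats=4)

# Parameter dictionaries in the same shape as the defaults in the signature.
encoder_params = {
    "kernel_size": 4,
    "dilations": encoder_dilations,
    "num_conv_blocks": 2,
    "num_res_blocks": len(encoder_dilations),
}
decoder_params = {
    "kernel_size": 4,
    "dilations": decoder_dilations,
    "num_conv_blocks": 2,
    "num_res_blocks": len(decoder_dilations),
}

print(encoder_params)
print(decoder_params)
```

Deriving `num_res_blocks` from the list length keeps the two values consistent when experimenting with deeper or shallower stacks.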

FastSpeechConfig#

class TTS.tts.configs.fast_speech_config.FastSpeechConfig(output_path='output', logger_uri=None, run_name='run', project_name=None, run_description='🐸Coqui trainer run.', print_step=25, plot_step=100, model_param_stats=False, wandb_entity=None, dashboard_logger='tensorboard', log_model_step=None, save_step=10000, save_n_checkpoints=5, save_checkpoints=True, save_all_best=False, save_best_after=10000, target_loss=None, print_eval=False, test_delay_epochs=0, run_eval=True, distributed_backend='nccl', distributed_url='tcp://localhost:54321', mixed_precision=False, epochs=1000, batch_size=32, eval_batch_size=16, grad_clip=5.0, scheduler_after_epoch=True, lr=0.0001, optimizer='Adam', optimizer_params=<factory>, lr_scheduler='NoamLR', lr_scheduler_params=<factory>, use_grad_scaler=False, cudnn_enable=True, cudnn_deterministic=False, cudnn_benchmark=True, training_seed=54321, model='fast_speech', num_loader_workers=0, num_eval_loader_workers=0, use_noise_augment=False, audio=<factory>, use_phonemes=False, phonemizer=None, phoneme_language=None, compute_input_seq_cache=False, text_cleaner=None, enable_eos_bos_chars=False, test_sentences_file='', phoneme_cache_path=None, characters=None, add_blank=False, batch_group_size=0, loss_masking=None, sort_by_audio_len=False, min_audio_len=1, max_audio_len=inf, min_text_len=1, max_text_len=inf, compute_f0=False, compute_linear_spec=False, precompute_num_workers=0, start_by_longest=False, datasets=<factory>, test_sentences=<factory>, eval_split_max_size=None, eval_split_size=0.01, use_speaker_weighted_sampler=False, speaker_weighted_sampler_alpha=1.0, use_language_weighted_sampler=False, language_weighted_sampler_alpha=1.0, base_model='forward_tts', model_args=ForwardTTSArgs(num_chars=None, out_channels=80, hidden_channels=384, use_aligner=True, use_pitch=False, pitch_predictor_hidden_channels=256, pitch_predictor_kernel_size=3, pitch_predictor_dropout_p=0.1, pitch_embedding_kernel_size=3, duration_predictor_hidden_channels=256, 
duration_predictor_kernel_size=3, duration_predictor_dropout_p=0.1, positional_encoding=True, poisitonal_encoding_use_scale=True, length_scale=1, encoder_type='fftransformer', encoder_params={'hidden_channels_ffn': 1024, 'num_heads': 1, 'num_layers': 6, 'dropout_p': 0.1}, decoder_type='fftransformer', decoder_params={'hidden_channels_ffn': 1024, 'num_heads': 1, 'num_layers': 6, 'dropout_p': 0.1}, detach_duration_predictor=False, max_duration=75, num_speakers=1, use_speaker_embedding=False, speakers_file=None, use_d_vector_file=False, d_vector_dim=None, d_vector_file=None), num_speakers=0, speakers_file=None, use_speaker_embedding=False, use_d_vector_file=False, d_vector_file=False, d_vector_dim=0, spec_loss_type='mse', duration_loss_type='mse', use_ssim_loss=True, ssim_loss_alpha=1.0, dur_loss_alpha=1.0, spec_loss_alpha=1.0, pitch_loss_alpha=0.0, aligner_loss_alpha=1.0, binary_align_loss_alpha=1.0, binary_loss_warmup_epochs=150, min_seq_len=13, max_seq_len=200, r=1, f0_cache_path=None)[source]#

Configure ForwardTTS as FastSpeech model.

Example

>>> from TTS.tts.configs.fast_speech_config import FastSpeechConfig
>>> config = FastSpeechConfig()
Parameters
  • model (str) – Model name used for selecting the right model at initialization. Defaults to fast_speech.

  • base_model (str) – Name of the base model being configured as this model so that 🐸 TTS knows it needs to initiate the base model rather than searching for the model implementation. Defaults to forward_tts.

  • model_args (Coqpit) – Model class arguments. Check ForwardTTSArgs for more details. Defaults to the ForwardTTSArgs instance shown in the signature above.

  • data_dep_init_steps (int) – Number of steps used for computing normalization parameters at the beginning of the training. GlowTTS uses Activation Normalization that pre-computes normalization stats at the beginning and use the same values for the rest. Defaults to 10.

  • speakers_file (str) – Path to the file containing the list of speakers. Needed at inference for loading matching speaker ids to speaker names. Defaults to None.

  • use_speaker_embedding (bool) – enable / disable using speaker embeddings for multi-speaker models. If set True, the model is in the multi-speaker mode. Defaults to False.

  • use_d_vector_file (bool) – enable /disable using external speaker embeddings in place of the learned embeddings. Defaults to False.

  • d_vector_file (str) – Path to the file including pre-computed speaker embeddings. Defaults to False.

  • d_vector_dim (int) – Dimension of the external speaker embeddings. Defaults to 0.

  • optimizer (str) – Name of the model optimizer. Defaults to Adam.

  • optimizer_params (dict) – Arguments of the model optimizer. Defaults to {“betas”: [0.9, 0.998], “weight_decay”: 1e-6}.

  • lr_scheduler (str) – Name of the learning rate scheduler. Defaults to NoamLR.

  • lr_scheduler_params (dict) – Arguments of the learning rate scheduler. Defaults to {“warmup_steps”: 4000}.

  • lr (float) – Initial learning rate. Defaults to 1e-4.

  • grad_clip (float) – Gradient norm clipping value. Defaults to 5.0.

  • spec_loss_type (str) – Type of the spectrogram loss. Check ForwardTTSLoss for possible values. Defaults to mse.

  • duration_loss_type (str) – Type of the duration loss. Check ForwardTTSLoss for possible values. Defaults to mse.

  • use_ssim_loss (bool) – Enable/disable the use of SSIM (Structural Similarity) loss. Defaults to True.

  • wd (float) – Weight decay coefficient. Defaults to 1e-7.

  • ssim_loss_alpha (float) – Weight for the SSIM loss. If set 0, disables the SSIM loss. Defaults to 1.0.

  • dur_loss_alpha (float) – Weight for the duration predictor’s loss. If set 0, disables the duration loss. Defaults to 1.0.

  • spec_loss_alpha (float) – Weight for the spectrogram loss selected by spec_loss_type. If set 0, disables the spectrogram loss. Defaults to 1.0.

  • pitch_loss_alpha (float) – Weight for the pitch predictor’s loss. If set 0, disables the pitch predictor. Defaults to 0.0.

  • binary_align_loss_alpha (float) – Weight for the binary loss. If set 0, disables the binary loss. Defaults to 1.0.

  • binary_loss_warmup_epochs (float) – Number of epochs to gradually increase the binary loss impact. Defaults to 150.

  • min_seq_len (int) – Minimum input sequence length to be used at training. Defaults to 13.

  • max_seq_len (int) – Maximum input sequence length to be used at training. Larger values result in more VRAM usage. Defaults to 200.