Model API

The Model API provides a set of functions that make your model compatible with the Trainer, Synthesizer, and ModelZoo.

Base TTS Model

class TTS.model.BaseModel[source]

Abstract 🐸TTS class. Every new 🐸TTS model must inherit this.

Notes on input/output tensor shapes:

Any input or output tensor of the model must be shaped as

  • 3D tensors batch x time x channels

  • 2D tensors batch x channels

  • 1D tensors batch x 1
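Under these conventions, typical model tensors look like the following (a sketch with plain torch tensors; the concrete dimensions are illustrative):

```python
import torch

batch, time, channels = 8, 100, 80

# 3D: batch x time x channels (e.g. a mel spectrogram)
spectrogram = torch.randn(batch, time, channels)

# 2D: batch x channels (e.g. a speaker embedding)
speaker_embedding = torch.randn(batch, channels)

# 1D per-sample values kept as batch x 1 (e.g. sequence lengths)
lengths = torch.randint(1, time + 1, (batch, 1))
```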

eval_log(ap, batch, outputs)[source]

The same as train_log().

abstract eval_step(batch, criterion)[source]

Perform a single evaluation step. Run the model forward pass and compute losses. In most cases, you can call train_step() with no changes.

Parameters
  • batch (Dict) – Input tensors.

  • criterion (nn.Module) – Loss layer designed for the model.

Returns

Model outputs and computed losses.

Return type

Tuple[Dict, Dict]

abstract forward(input, *args, aux_input={}, **kwargs)[source]

Forward pass for the model mainly used in training.

You can be flexible here and use a different number of arguments and argument names, since forward() is intended to be called by train_step() without being exposed outside the model.

Parameters
  • input (torch.Tensor) – Input tensor.

  • aux_input (Dict) – Auxiliary model inputs like embeddings, durations or any other sorts of inputs.

Returns

Model outputs. The main model output must be stored under the key “model_outputs”.

Return type

Dict
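A minimal sketch of the forward() contract. The model class and layer below are hypothetical; only the output-dictionary convention ("model_outputs") follows the API described above:

```python
import torch
from torch import nn

class MyModel(nn.Module):
    """Hypothetical model; only the output contract follows BaseModel."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(80, 80)

    def forward(self, x, aux_input={}):
        # The main output must be returned under the key "model_outputs".
        return {"model_outputs": self.layer(x)}

model = MyModel()
out = model(torch.randn(4, 100, 80))
```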

get_lr()[source]

Return learning rate(s).

Returns

Model’s initial learning rates.

Return type

Union[float, List[float]]

get_optimizer()[source]

Set up and return the optimizer or optimizers.
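A sketch of how get_lr() and get_optimizer() could be implemented together; the model, layer, and learning rate are illustrative assumptions:

```python
import torch
from torch import nn

class MyModel(nn.Module):
    """Hypothetical model showing the get_lr()/get_optimizer() pair."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(80, 80)

    def get_lr(self):
        # Initial learning rate; a single float here, a list for multi-optimizer setups.
        return 1e-3

    def get_optimizer(self):
        # Return one optimizer (a list also works for models trained with several).
        return torch.optim.Adam(self.parameters(), lr=self.get_lr())

optimizer = MyModel().get_optimizer()
```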

abstract inference(input, aux_input={})[source]

Forward pass for inference.

We don’t use **kwargs since it is problematic with the TorchScript API.

Parameters
  • input (torch.Tensor) – Input tensor.

  • aux_input (Dict) – Auxiliary inputs like speaker embeddings, durations etc.

Returns

Model outputs.

Return type

Dict
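A sketch of an inference() implementation. The class is hypothetical; the point is the positional-plus-aux_input signature (no **kwargs, for TorchScript compatibility) and the gradient-free forward pass:

```python
import torch
from torch import nn

class MyModel(nn.Module):
    """Hypothetical model; the signatures mirror the abstract methods above."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(80, 80)

    def forward(self, x, aux_input={}):
        return {"model_outputs": self.layer(x)}

    def inference(self, x, aux_input={}):
        # No **kwargs on purpose; anything extra travels inside aux_input.
        with torch.no_grad():
            return self.forward(x, aux_input)

out = MyModel().inference(torch.randn(1, 50, 80))
```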

abstract load_checkpoint(config, checkpoint_path, eval=False)[source]

Load a checkpoint and get ready for training or inference.

Parameters
  • config (Coqpit) – Model configuration.

  • checkpoint_path (str) – Path to the model checkpoint file.

  • eval (bool, optional) – If true, init model for inference else for training. Defaults to False.
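A sketch of a load_checkpoint() override, including a save/load round trip. The checkpoint layout (a dict with a "model" key) is an assumption, not the trainer's actual format:

```python
import os
import tempfile

import torch
from torch import nn

class MyModel(nn.Module):
    """Hypothetical model; the checkpoint key "model" is an assumption."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(80, 80)

    def load_checkpoint(self, config, checkpoint_path, eval=False):
        state = torch.load(checkpoint_path, map_location="cpu")
        self.load_state_dict(state["model"])
        if eval:
            # Inference mode: disable dropout, freeze batch-norm statistics.
            self.eval()

# Round trip: save a checkpoint, then restore it into a fresh instance.
model = MyModel()
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
torch.save({"model": model.state_dict()}, path)

restored = MyModel()
restored.load_checkpoint(None, path, eval=True)
```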

train_log(ap, batch, outputs)[source]

Create visualizations and waveform examples for training.

For example, here you can plot spectrograms and generate sample waveforms from these spectrograms to be projected onto Tensorboard.

Parameters
  • ap (AudioProcessor) – audio processor used at training.

  • batch (Dict) – Model inputs used at the previous training step.

  • outputs (Dict) – Model outputs generated at the previous training step.

Returns

Training plots and output waveform.

Return type

Tuple[Dict, np.ndarray]
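A sketch of the train_log() return contract. In a real model, something like the AudioProcessor would invert the predicted spectrogram into audio; here a placeholder sine wave stands in for that step, and all names are illustrative:

```python
import numpy as np

def train_log(ap, batch, outputs):
    """Sketch: turn the latest outputs into Tensorboard material.

    `ap` would be the AudioProcessor used at training; the real code
    would synthesize audio from outputs["model_outputs"] with it.
    """
    figures = {"prediction": outputs["model_outputs"]}
    # Placeholder waveform: one second of a 440 Hz tone at 22050 Hz.
    t = np.arange(22050) / 22050
    audio = np.sin(2 * np.pi * 440 * t).astype(np.float32)
    return figures, audio

figures, audio = train_log(None, {}, {"model_outputs": np.zeros((1, 10, 80))})
```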

abstract train_step(batch, criterion)[source]

Perform a single training step. Run the model forward pass and compute losses.

Parameters
  • batch (Dict) – Input tensors.

  • criterion (nn.Module) – Loss layer designed for the model.

Returns

Model outputs and computed losses.

Return type

Tuple[Dict, Dict]
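A sketch of the (outputs, losses) contract for train_step(), with eval_step() delegating to it as the docstring above suggests. The model, layer, and batch keys are illustrative assumptions:

```python
import torch
from torch import nn

class MyModel(nn.Module):
    """Hypothetical model; only the (outputs, losses) contract follows BaseModel."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(80, 80)

    def forward(self, x, aux_input={}):
        return {"model_outputs": self.layer(x)}

    def train_step(self, batch, criterion):
        outputs = self.forward(batch["input"])
        loss = criterion(outputs["model_outputs"], batch["target"])
        return outputs, {"loss": loss}

    def eval_step(self, batch, criterion):
        # In most cases evaluation mirrors training, so just reuse train_step().
        return self.train_step(batch, criterion)

batch = {"input": torch.randn(4, 10, 80), "target": torch.randn(4, 10, 80)}
outputs, losses = MyModel().train_step(batch, nn.MSELoss())
```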

Base tts Model

class TTS.tts.models.base_tts.BaseTTS[source]

Abstract tts class. Every new tts model must inherit this.

It defines tts specific functions on top of Model.

Notes on input/output tensor shapes:

Any input or output tensor of the model must be shaped as

  • 3D tensors batch x time x channels

  • 2D tensors batch x channels

  • 1D tensors batch x 1

format_batch(batch)[source]

Generic batch formatting for TTSDataset.

You must override this if you use a custom dataset.

Parameters

batch (Dict) – Raw batch returned by the dataloader.

Returns

Formatted batch ready to be consumed by the model.

Return type

Dict
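A sketch of a format_batch() override that maps raw dataloader fields onto the names the model consumes. Every key name below is a hypothetical example, not the real TTSDataset schema:

```python
import torch

def format_batch(batch):
    """Hypothetical override: rename raw dataloader fields to the
    names the model expects. The key names here are assumptions."""
    return {
        "text_input": batch["token_ids"],
        "mel_input": batch["mel"],
        "mel_lengths": batch["mel_lengths"],
    }

raw = {
    "token_ids": torch.randint(0, 100, (4, 20)),
    "mel": torch.randn(4, 100, 80),
    "mel_lengths": torch.full((4, 1), 100),
}
formatted = format_batch(raw)
```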

get_aux_input(**kwargs)[source]

Prepare and return aux_input used by forward()

init_multispeaker(config, data=None)[source]

Initialize a speaker embedding layer if needed, and set the expected embedding channel size used to define the in_channels of the connected layers.

This implementation yields 3 possible outcomes:

  1. If config.use_speaker_embedding and config.use_d_vector_file are both False, do nothing.

  2. If config.use_d_vector_file is True, set expected embedding channel size to config.d_vector_dim or 512.

  3. If config.use_speaker_embedding is True, initialize a speaker embedding layer with a channel size of config.d_vector_dim or 512.

You can override this function for new models.

Parameters
  • config (Coqpit) – Model configuration.

  • data (List, optional) – Dataset items to infer number of speakers. Defaults to None.
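The three outcomes above can be sketched as follows. The Config dataclass stands in for the Coqpit model configuration, and the function is an illustrative rewrite, not the real implementation:

```python
from dataclasses import dataclass

from torch import nn

@dataclass
class Config:
    """Stand-in for the Coqpit config; field names follow the docstring."""
    use_speaker_embedding: bool = False
    use_d_vector_file: bool = False
    d_vector_dim: int = 0

def init_multispeaker(config, num_speakers=0):
    """Sketch of the three documented outcomes (not the real code)."""
    if not config.use_speaker_embedding and not config.use_d_vector_file:
        return 0, None  # outcome 1: single-speaker, nothing to do
    embedding_dim = config.d_vector_dim or 512
    if config.use_d_vector_file:
        return embedding_dim, None  # outcome 2: d-vectors loaded from a file
    # outcome 3: learn the speaker embedding jointly with the model
    return embedding_dim, nn.Embedding(num_speakers, embedding_dim)

dim, layer = init_multispeaker(Config(use_speaker_embedding=True), num_speakers=4)
```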

test_run(ap)[source]

Generic test run for tts models used by Trainer.

You can override this for a different behaviour.

Returns

Test figures and audios to be projected to Tensorboard.

Return type

Tuple[Dict, Dict]

Base vocoder Model

class TTS.vocoder.models.base_vocoder.BaseVocoder[source]

Base vocoder class. Every new vocoder model must inherit this.

It defines vocoder specific functions on top of Model.

Notes on input/output tensor shapes:

Any input or output tensor of the model must be shaped as

  • 3D tensors batch x time x channels

  • 2D tensors batch x channels

  • 1D tensors batch x 1