AudioProcessor API

TTS.utils.audio.AudioProcessor is the core class for all the audio processing routines. It provides an API for

  • Feature extraction.

  • Sound normalization.

  • Reading and writing audio files.

  • Sampling audio signals.

  • Normalizing and denormalizing audio signals.

  • Griffin-Lim vocoder.

The AudioProcessor needs to be initialized with TTS.config.shared_configs.BaseAudioConfig. Any model config must also inherit from or instantiate BaseAudioConfig.

AudioProcessor

class TTS.utils.audio.AudioProcessor(sample_rate=None, resample=False, num_mels=None, log_func='np.log10', min_level_db=None, frame_shift_ms=None, frame_length_ms=None, hop_length=None, win_length=None, ref_level_db=None, fft_size=1024, power=None, preemphasis=0.0, signal_norm=None, symmetric_norm=None, max_norm=None, mel_fmin=None, mel_fmax=None, spec_gain=20, stft_pad_mode='reflect', clip_norm=True, griffin_lim_iters=None, do_trim_silence=False, trim_db=60, do_sound_norm=False, do_amp_to_db_linear=True, do_amp_to_db_mel=True, stats_path=None, verbose=True, **_)

Audio Processor for TTS used by all the data pipelines.

Note

All the class arguments have default values so that the class can be initialized flexibly from the model config. These defaults are not meaningful for every argument.

Parameters
  • sample_rate (int, optional) – target audio sampling rate. Defaults to None.

  • resample (bool, optional) – enable/disable resampling of the audio clips when the target sampling rate does not match the original sampling rate. Defaults to False.

  • num_mels (int, optional) – number of melspectrogram dimensions. Defaults to None.

  • log_func (str, optional) – name of the log function used for converting spectrogram amplitude to dB. Defaults to 'np.log10'.

  • min_level_db (int, optional) – minimum dB threshold for the computed melspectrograms. Defaults to None.

  • frame_shift_ms (int, optional) – shift between adjacent STFT columns in milliseconds. Defaults to None.

  • frame_length_ms (int, optional) – STFT window length in milliseconds. Defaults to None.

  • hop_length (int, optional) – number of audio samples between adjacent STFT columns. Used if `frame_shift_ms` is None. Defaults to None.

  • win_length (int, optional) – STFT window length in samples. Used if `frame_length_ms` is None. Defaults to None.

  • ref_level_db (int, optional) – reference dB level used to suppress background noise. In general, levels below 20 dB correspond to air noise. Defaults to None.

  • fft_size (int, optional) – FFT window size for STFT. Defaults to 1024.

  • power (float, optional) – exponent applied to the spectrogram before Griffin-Lim. Defaults to None.

  • preemphasis (float, optional) – pre-emphasis coefficient. Pre-emphasis is disabled if it is 0.0. Defaults to 0.0.

  • signal_norm (bool, optional) – enable/disable signal normalization. Defaults to None.

  • symmetric_norm (bool, optional) – enable/disable symmetric normalization. If True, normalization is performed in the range [-k, k], otherwise in [0, k]. Defaults to None.

  • max_norm (float, optional) – `k` defining the normalization range. Defaults to None.

  • mel_fmin (int, optional) – minimum filter frequency for computing melspectrograms. Defaults to None.

  • mel_fmax (int, optional) – maximum filter frequency for computing melspectrograms. Defaults to None.

  • spec_gain (int, optional) – gain applied when converting amplitude to dB. Defaults to 20.

  • stft_pad_mode (str, optional) – padding mode for STFT. Defaults to 'reflect'.

  • clip_norm (bool, optional) – enable/disable clipping of out-of-range values in the normalized audio signal. Defaults to True.

  • griffin_lim_iters (int, optional) – number of Griffin-Lim iterations. Defaults to None.

  • do_trim_silence (bool, optional) – enable/disable silence trimming when loading the audio signal. Defaults to False.

  • trim_db (int, optional) – dB threshold used for silence trimming. Defaults to 60.

  • do_sound_norm (bool, optional) – enable/disable volume normalization of the audio signal. Defaults to False.

  • do_amp_to_db_linear (bool, optional) – enable/disable amplitude to dB conversion of linear spectrograms. Defaults to True.

  • do_amp_to_db_mel (bool, optional) – enable/disable amplitude to dB conversion of mel spectrograms. Defaults to True.

  • stats_path (str, optional) – path to the computed stats file. Defaults to None.

  • verbose (bool, optional) – enable/disable logging. Defaults to True.
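
The relationship between the millisecond-based arguments (`frame_shift_ms`, `frame_length_ms`) and the sample-based ones (`hop_length`, `win_length`) can be sketched as below. This is a minimal illustration and not the library's code: `ms_to_samples` is a hypothetical helper, and truncating to a whole number of samples is an assumed rounding rule.

```python
def ms_to_samples(duration_ms: float, sample_rate: int) -> int:
    """Convert a duration in milliseconds to a whole number of samples.

    Hypothetical helper; truncation toward zero is an assumption.
    """
    return int(sample_rate * duration_ms / 1000)


sample_rate = 22050
hop_length = ms_to_samples(12.5, sample_rate)  # plays the role of frame_shift_ms
win_length = ms_to_samples(50.0, sample_rate)  # plays the role of frame_length_ms
```

At 22050 Hz, a 12.5 ms shift and a 50 ms window give a hop of 275 samples and a window of 1102 samples.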

apply_inv_preemphasis(x)

Reverse pre-emphasis.

apply_preemphasis(x)

Apply pre-emphasis to the audio signal. Useful to reduce the correlation between neighbouring signal values.

Parameters

x (np.ndarray) – Audio signal.

Raises

RuntimeError – Pre-emphasis coefficient is set to 0.

Returns

Decorrelated audio signal.

Return type

np.ndarray
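
Pre-emphasis and its inverse are commonly written as a first-order filter pair. The sketch below assumes that textbook form and uses plain Python lists in place of `np.ndarray` to stay dependency-free; `preemphasize` and `deemphasize` are hypothetical names, not the methods documented here.

```python
from typing import List


def preemphasize(x: List[float], coef: float) -> List[float]:
    """First-order high-pass filter: y[n] = x[n] - coef * x[n - 1]."""
    if coef == 0:
        # Mirrors the documented behaviour of raising when the coefficient is 0.
        raise RuntimeError("Pre-emphasis coefficient is set to 0.")
    return [x[n] - coef * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]


def deemphasize(y: List[float], coef: float) -> List[float]:
    """Inverse recursion: x[n] = y[n] + coef * x[n - 1]."""
    x: List[float] = []
    prev = 0.0
    for sample in y:
        prev = sample + coef * prev
        x.append(prev)
    return x


signal = [1.0, 2.0, 3.0]
emphasized = preemphasize(signal, coef=0.5)
restored = deemphasize(emphasized, coef=0.5)
```

Applying the inverse after the forward filter with the same coefficient recovers the original samples.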

compute_f0(x)

Compute pitch (f0) of a waveform using the same parameters used for computing melspectrogram.

Parameters

x (np.ndarray) – Waveform.

Returns

Pitch.

Return type

np.ndarray

Examples

>>> import librosa
>>> from TTS.config import BaseAudioConfig
>>> from TTS.utils.audio import AudioProcessor
>>> WAV_FILE = librosa.util.example_audio_file()
>>> conf = BaseAudioConfig(mel_fmax=8000)
>>> ap = AudioProcessor(**conf)
>>> # Keep only the first 5 seconds at 22050 Hz.
>>> wav = ap.load_wav(WAV_FILE, sr=22050)[:5 * 22050]
>>> pitch = ap.compute_f0(wav)

compute_stft_paddings(x, pad_sides=1)

Compute the paddings used by Librosa's STFT: either the right padding only (final frame) or the padding on both sides (first and final frames).
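
One plausible way to compute such paddings is to pad the signal up to the next multiple of the hop length. The sketch below assumes that rule; `stft_paddings` is a hypothetical standalone function that takes the signal length and hop length explicitly, and it is not presented as the library's implementation.

```python
from typing import Tuple


def stft_paddings(num_samples: int, hop_length: int, pad_sides: int = 1) -> Tuple[int, int]:
    """Return (left, right) padding so the padded length is a multiple of hop_length."""
    if pad_sides not in (1, 2):
        raise ValueError("pad_sides must be 1 or 2")
    # Samples needed to reach the next multiple of hop_length that is
    # strictly greater than num_samples.
    pad = (num_samples // hop_length + 1) * hop_length - num_samples
    if pad_sides == 1:
        return 0, pad  # pad the final frame only
    # Split across both sides; the right side takes the odd sample.
    return pad // 2, pad // 2 + pad % 2
```

For a 1000-sample signal and a hop of 256, the total padding is 24 samples, placed either entirely on the right or as 12 on each side.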

denormalize(S)

Denormalize spectrogram values.

Parameters

S (np.ndarray) – Spectrogram to denormalize.

Raises

RuntimeError – Mean and variance are incompatible.

Returns

Denormalized spectrogram.

Return type

np.ndarray

static dequantize(x, bits)

Dequantize a waveform from the given number of bits.

find_endpoint(wav, threshold_db=-40, min_silence_sec=0.8)

Find the last point without silence at the end of an audio signal.

Parameters
  • wav (np.ndarray) – Audio signal.

  • threshold_db (int, optional) – Silence threshold in decibels. Defaults to -40.

  • min_silence_sec (float, optional) – Ignore silences that are shorter than this, in seconds. Defaults to 0.8.

Returns

Last point without silence.

Return type

int
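
A simple way to locate such an endpoint is to slide a window of `min_silence_sec` over the signal and stop at the first window whose samples all stay below the threshold. The sketch below assumes that approach; the explicit `sample_rate` argument, the quarter-window hop, the use of absolute sample values, and the 20 dB per decade conversion are all assumptions made for illustration.

```python
from typing import Sequence


def find_endpoint(wav: Sequence[float], sample_rate: int,
                  threshold_db: float = -40.0, min_silence_sec: float = 0.8) -> int:
    """Return an index one hop into the first fully silent window, or len(wav)."""
    window = int(sample_rate * min_silence_sec)
    hop = max(1, window // 4)
    threshold = 10.0 ** (threshold_db / 20.0)  # dB -> linear amplitude
    for start in range(hop, len(wav) - window, hop):
        # The window is silent when every sample is below the threshold.
        if max(abs(s) for s in wav[start:start + window]) < threshold:
            return start + hop
    return len(wav)


# One second of signal followed by two seconds of silence at 100 Hz.
wav = [0.5] * 100 + [0.0] * 200
endpoint = find_endpoint(wav, sample_rate=100)
```

Silences shorter than the window never fill it completely, so they are skipped.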

inv_melspectrogram(mel_spectrogram)

Convert a melspectrogram to a waveform using the Griffin-Lim vocoder.

inv_spectrogram(spectrogram)

Convert a spectrogram to a waveform using the Griffin-Lim vocoder.

load_stats(stats_path)

Load mean and variance statistics from an npy file.

Parameters

stats_path (str) – Path to the npy file containing the statistics.

Returns

Loaded statistics and the config used to compute them.

Return type

Tuple[np.array, np.array, np.array, np.array, Dict]

load_wav(filename, sr=None)

Read a wav file using Librosa and optionally resample it, trim silence, and normalize its volume.

Parameters
  • filename (str) – Path to the wav file.

  • sr (int, optional) – Sampling rate for resampling. Defaults to None.

Returns

Loaded waveform.

Return type

np.ndarray

melspectrogram(y)

Compute a melspectrogram from a waveform.

static mulaw_decode(wav, qc)

Recovers waveform from quantized values.
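
For reference, the standard mu-law companding pair looks like the sketch below. It works on single samples in [-1, 1] and assumes mu = 2**qc - 1; `mulaw_compress` and `mulaw_expand` are hypothetical names, and how the library maps its `qc` argument and quantized levels is not stated here.

```python
import math


def mulaw_compress(x: float, qc: int) -> float:
    """Map a sample in [-1, 1] to a companded value in [-1, 1]."""
    mu = 2 ** qc - 1  # assumed relation between qc and mu
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mulaw_expand(y: float, qc: int) -> float:
    """Invert mulaw_compress and recover the original sample."""
    mu = 2 ** qc - 1
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)


companded = mulaw_compress(0.1, qc=8)
recovered = mulaw_expand(companded, qc=8)
```

Companding boosts small amplitudes before quantization, so quiet samples keep more resolution than they would under a linear mapping.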

normalize(S)

Normalize values into [0, self.max_norm] or [-self.max_norm, self.max_norm].

Parameters

S (np.ndarray) – Spectrogram to normalize.

Raises

RuntimeError – Mean and variance are computed from incompatible parameters.

Returns

Normalized spectrogram.

Return type

np.ndarray
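
The range-based normalization described by `min_level_db`, `max_norm`, `symmetric_norm` and `clip_norm` can be sketched on a single dB value as follows. This is an assumed formulation for illustration: `normalize_db` and `denormalize_db` are hypothetical names, and any reference-level offset or mean-variance scaling the library may apply is left out.

```python
def normalize_db(s_db: float, min_level_db: float = -100.0, max_norm: float = 4.0,
                 symmetric_norm: bool = True, clip_norm: bool = True) -> float:
    """Map a dB value from [min_level_db, 0] to [-max_norm, max_norm] or [0, max_norm]."""
    unit = (s_db - min_level_db) / -min_level_db  # 0 at min_level_db, 1 at 0 dB
    if symmetric_norm:
        value, low = 2 * max_norm * unit - max_norm, -max_norm
    else:
        value, low = max_norm * unit, 0.0
    if clip_norm:
        value = min(max(value, low), max_norm)  # clip out-of-range values
    return value


def denormalize_db(value: float, min_level_db: float = -100.0, max_norm: float = 4.0,
                   symmetric_norm: bool = True) -> float:
    """Invert normalize_db for values that were not clipped."""
    if symmetric_norm:
        unit = (value + max_norm) / (2 * max_norm)
    else:
        unit = value / max_norm
    return unit * -min_level_db + min_level_db
```

With the defaults above, -100 dB maps to -4, -50 dB to 0, and 0 dB to 4; values above 0 dB are clipped to 4.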

out_linear_to_mel(linear_spec)

Convert a full scale linear spectrogram output of a network to a melspectrogram.

Parameters

linear_spec (np.ndarray) – Normalized full scale linear spectrogram.

Returns

Normalized melspectrogram.

Return type

np.ndarray

static quantize(x, bits)

Quantize a waveform to a given number of bits.

Parameters
  • x (np.ndarray) – Waveform to quantize. Must be normalized into the range [-1, 1].

  • bits (int) – Number of quantization bits.

Returns

Quantized waveform.

Return type

np.ndarray
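
A linear mapping between [-1, 1] and the 2**bits quantization levels can be sketched as below. The functions are hypothetical and operate on single samples; whether the library rounds the result to integers is not stated here, so the sketch keeps the values continuous.

```python
def linear_quantize(x: float, bits: int) -> float:
    """Map a sample in [-1, 1] linearly onto [0, 2**bits - 1]."""
    return (x + 1.0) * (2 ** bits - 1) / 2


def linear_dequantize(level: float, bits: int) -> float:
    """Map a level in [0, 2**bits - 1] back to [-1, 1]."""
    return 2 * level / (2 ** bits - 1) - 1


lowest = linear_quantize(-1.0, bits=8)
highest = linear_quantize(1.0, bits=8)
```

With 8 bits, the extremes -1 and 1 land on levels 0 and 255.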

save_wav(wav, path, sr=None)

Save a waveform to a file using Scipy.

Parameters
  • wav (np.ndarray) – Waveform to save.

  • path (str) – Path to an output file.

  • sr (int, optional) – Sampling rate used for saving to the file. Defaults to None.

setup_scaler(mel_mean, mel_std, linear_mean, linear_std)

Initialize scaler objects used in mean-std normalization.

Parameters
  • mel_mean (np.ndarray) – Mean for melspectrograms.

  • mel_std (np.ndarray) – STD for melspectrograms.

  • linear_mean (np.ndarray) – Mean for full scale spectrograms.

  • linear_std (np.ndarray) – STD for full scale spectrograms.
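
Mean-std normalization subtracts a per-bin mean and divides by a per-bin standard deviation. The sketch below shows that idea with a hypothetical `MeanStdScaler` class working on plain lists; it is an illustration of the technique, not the scaler object the library creates.

```python
from typing import List


class MeanStdScaler:
    """Hypothetical per-bin scaler: (x - mean) / std and its inverse."""

    def __init__(self, mean: List[float], std: List[float]) -> None:
        if len(mean) != len(std):
            raise ValueError("mean and std must have the same length")
        self.mean = mean
        self.std = std

    def transform(self, frame: List[float]) -> List[float]:
        """Normalize one spectrogram frame bin by bin."""
        return [(x - m) / s for x, m, s in zip(frame, self.mean, self.std)]

    def inverse_transform(self, frame: List[float]) -> List[float]:
        """Undo transform and recover the original scale."""
        return [x * s + m for x, m, s in zip(frame, self.mean, self.std)]


mel_scaler = MeanStdScaler(mean=[-50.0, -40.0], std=[10.0, 20.0])
scaled = mel_scaler.transform([-40.0, -80.0])
original = mel_scaler.inverse_transform(scaled)
```

Separate scalers for mel and linear spectrograms would each hold their own mean and standard deviation.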

static sound_norm(x)

Normalize the volume of an audio signal.

Parameters

x (np.ndarray) – Raw waveform.

Returns

Volume normalized waveform.

Return type

np.ndarray
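
A common form of volume normalization is peak normalization, which rescales the waveform so that its largest absolute sample hits a target value. The sketch below assumes that technique; `peak_normalize` and the 0.95 target are illustrative choices, not documented behaviour.

```python
from typing import List


def peak_normalize(x: List[float], target_peak: float = 0.95) -> List[float]:
    """Scale a waveform so its largest absolute sample equals target_peak."""
    peak = max(abs(s) for s in x)
    if peak == 0:
        return list(x)  # silent input: nothing to scale
    return [s / peak * target_peak for s in x]


normalized = peak_normalize([0.1, -0.5, 0.25])
```

Keeping the target slightly below 1.0 leaves headroom against clipping when the waveform is later written to a file.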

spectrogram(y)

Compute a spectrogram from a waveform.

Parameters

y (np.ndarray) – Waveform.

Returns

Spectrogram.

Return type

np.ndarray

trim_silence(wav)

Trim silent parts using a threshold and a 0.01 sec margin.

BaseAudioConfig

class TTS.config.shared_configs.BaseAudioConfig(fft_size=1024, win_length=1024, hop_length=256, frame_shift_ms=None, frame_length_ms=None, stft_pad_mode='reflect', sample_rate=22050, resample=False, preemphasis=0.0, ref_level_db=20, do_sound_norm=False, log_func='np.log10', do_trim_silence=True, trim_db=45, power=1.5, griffin_lim_iters=60, num_mels=80, mel_fmin=0.0, mel_fmax=None, spec_gain=20, do_amp_to_db_linear=True, do_amp_to_db_mel=True, signal_norm=True, min_level_db=-100, symmetric_norm=True, max_norm=4.0, clip_norm=True, stats_path=None)

Base config to define audio processing parameters. It is used to initialize `TTS.utils.audio.AudioProcessor`.

Parameters
  • fft_size (int) – Number of STFT frequency levels, i.e. the size of the linear spectrogram frame. Defaults to 1024.

  • win_length (int) – Each frame of audio is windowed by a window of length `win_length` and then padded with zeros to match `fft_size`. Defaults to 1024.

  • hop_length (int) – Number of audio samples between adjacent STFT columns. Defaults to 256.

  • frame_shift_ms (int) – Set `hop_length` based on milliseconds and sampling rate. Defaults to None.

  • frame_length_ms (int) – Set `win_length` based on milliseconds and sampling rate. Defaults to None.

  • stft_pad_mode (str) – Padding method used in STFT. 'reflect' or 'center'. Defaults to 'reflect'.

  • sample_rate (int) – Audio sampling rate. Defaults to 22050.

  • resample (bool) – Enable / Disable resampling audio to `sample_rate`. Defaults to `False`.

  • preemphasis (float) – Pre-emphasis coefficient. Defaults to 0.0.

  • ref_level_db (int) – Reference dB level used to rebase the audio signal and ignore levels below it. 20 dB is assumed to be the sound of air. Defaults to 20.

  • do_sound_norm (bool) – Enable / Disable sound normalization to reconcile the volume differences among samples. Defaults to False.

  • log_func (str) – Numpy log function used for amplitude to dB conversion. Defaults to 'np.log10'.

  • do_trim_silence (bool) – Enable / Disable trimming silences at the beginning and the end of the audio clip. Defaults to `True`.

  • do_amp_to_db_linear (bool, optional) – enable/disable amplitude to dB conversion of linear spectrograms. Defaults to True.

  • do_amp_to_db_mel (bool, optional) – enable/disable amplitude to dB conversion of mel spectrograms. Defaults to True.

  • trim_db (int) – Silence threshold used for silence trimming. Defaults to 45.

  • power (float) – Exponent used for expanding spectrogram levels before running Griffin-Lim. It helps to reduce the artifacts in the synthesized voice. Defaults to 1.5.

  • griffin_lim_iters (int) – Number of Griffin-Lim iterations. Defaults to 60.

  • num_mels (int) – Number of mel-basis filters, which defines the dimension of each mel-spectrogram frame. Defaults to 80.

  • mel_fmin (float) – Min frequency level used for the mel-basis filters. ~50 for male and ~95 for female voices. It needs to be adjusted for a dataset. Defaults to 0.

  • mel_fmax (float) – Max frequency level used for the mel-basis filters. It needs to be adjusted for a dataset. Defaults to None.

  • spec_gain (int) – Gain applied when converting amplitude to dB. Defaults to 20.

  • signal_norm (bool) – enable/disable signal normalization. Defaults to True.

  • min_level_db (int) – minimum dB threshold for the computed melspectrograms. Defaults to -100.

  • symmetric_norm (bool) – enable/disable symmetric normalization. If True, normalization is performed in the range [-k, k], otherwise in [0, k]. Defaults to True.

  • max_norm (float) – `k` defining the normalization range. Defaults to 4.0.

  • clip_norm (bool) – enable/disable clipping of out-of-range values in the normalized audio signal. Defaults to True.

  • stats_path (str) – Path to the computed stats file. Defaults to None.
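
The role of `spec_gain` and a base-10 `log_func` in amplitude to dB conversion can be sketched as follows. This is an assumed formulation on single values: `amp_to_db` and `db_to_amp` are hypothetical names, and the small `floor` that guards against log(0) is an illustrative choice.

```python
import math


def amp_to_db(amplitude: float, spec_gain: float = 20.0, floor: float = 1e-5) -> float:
    """Convert a linear amplitude to dB as spec_gain * log10(amplitude)."""
    # Clamp to a small floor so that zero amplitude does not hit log10(0).
    return spec_gain * math.log10(max(floor, amplitude))


def db_to_amp(level_db: float, spec_gain: float = 20.0) -> float:
    """Invert amp_to_db for amplitudes above the floor."""
    return 10.0 ** (level_db / spec_gain)


unity_db = amp_to_db(1.0)
tenfold_db = amp_to_db(10.0)
```

With a gain of 20, a tenfold increase in amplitude corresponds to 20 dB, and the 1e-5 floor corresponds to -100 dB.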

check_values()

Check config fields.