References

Acoustic Model

The DelightfulTTS AcousticModel class is a PyTorch module implementing the acoustic model of a text-to-speech (TTS) system. The acoustic model is responsible for predicting acoustic features (such as mel-spectrogram frames) from phoneme sequences.

The model comprises multiple sub-modules, including an encoder, a decoder, and several prosody encoders and predictors. In addition, a pitch adaptor and a length adaptor are instantiated.
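
The snippet below is a minimal, illustrative sketch of how such sub-modules can be composed into a single forward pass. The class and argument names (ToyAcousticModel, d_model, n_mels, the pitch projection) are assumptions made for illustration, not the actual DelightfulTTS API, and the duration-based length expansion is omitted.

```python
import torch
import torch.nn as nn


class ToyAcousticModel(nn.Module):
    """Illustrative composition of an encoder, a pitch adaptor stand-in, a decoder,
    and a mel projection. Not the actual DelightfulTTS AcousticModel."""

    def __init__(self, n_phonemes: int, d_model: int = 256, n_mels: int = 80):
        super().__init__()
        self.embedding = nn.Embedding(n_phonemes, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Stand-in for the pitch adaptor; the length adaptor (duration-based
        # expansion) is omitted, so input and output lengths match here.
        self.pitch_proj = nn.Linear(1, d_model)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.mel_head = nn.Linear(d_model, n_mels)

    def forward(self, phonemes: torch.Tensor, pitch: torch.Tensor) -> torch.Tensor:
        # phonemes: (B, T) token ids; pitch: (B, T) per-phoneme pitch values
        x = self.encoder(self.embedding(phonemes))        # (B, T, d_model)
        x = x + self.pitch_proj(pitch.unsqueeze(-1))      # add pitch conditioning
        return self.mel_head(self.decoder(x))             # (B, T, n_mels)
```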

Embedding

This class implements a simple embedding lookup layer whose embedding table is fixed, i.e. the embeddings are not learned during training.
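
A minimal sketch of that idea, assuming the point is an embedding table that is looked up but never updated by the optimizer; the class name and initialization below are illustrative, not the library's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FixedEmbedding(nn.Module):
    """Embedding lookup with a frozen table (no gradient updates)."""

    def __init__(self, num_embeddings: int, embedding_dim: int):
        super().__init__()
        # register_buffer keeps the table out of the optimizer's parameter list.
        self.register_buffer("weight", torch.randn(num_embeddings, embedding_dim))

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        return F.embedding(idx, self.weight)
```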

Helpers

Helper methods for the acoustic model.

Variance Predictor

A duration and pitch predictor neural-network module implemented in PyTorch.
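
A hedged sketch of a typical FastSpeech 2-style variance predictor (two Conv1d + ReLU + LayerNorm + Dropout blocks followed by a scalar projection per time step); the exact layer sizes and defaults used by the library may differ.

```python
import torch
import torch.nn as nn


class ToyVariancePredictor(nn.Module):
    """Predicts one scalar (e.g. duration or pitch) per input position."""

    def __init__(self, d_model: int = 256, d_hidden: int = 256,
                 kernel: int = 3, p: float = 0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_hidden, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(d_hidden, d_hidden, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(d_hidden)
        self.norm2 = nn.LayerNorm(d_hidden)
        self.dropout = nn.Dropout(p)
        self.proj = nn.Linear(d_hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model); Conv1d expects (B, C, T), hence the transposes.
        h = self.dropout(self.norm1(torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2)))
        h = self.dropout(self.norm2(torch.relu(self.conv2(h.transpose(1, 2))).transpose(1, 2)))
        return self.proj(h).squeeze(-1)                   # (B, T): one value per phoneme
```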

Pitch Adaptor Conv

A variance adaptor with an added 1D convolution layer, used to obtain pitch embeddings.
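
A sketch of what such an adaptor might look like, assuming the usual pattern: predict a per-phoneme pitch value, map it to an embedding with a 1D convolution, and add that embedding to the hidden states. The linear predictor stub and layer sizes here are placeholders, not the library's implementation.

```python
import torch
import torch.nn as nn


class ToyPitchAdaptor(nn.Module):
    """Adds pitch information to the hidden sequence via a Conv1d embedding."""

    def __init__(self, d_model: int = 256, kernel: int = 3):
        super().__init__()
        self.predictor = nn.Linear(d_model, 1)              # stand-in for a variance predictor
        self.pitch_conv = nn.Conv1d(1, d_model, kernel, padding=kernel // 2)

    def forward(self, x: torch.Tensor, pitch_target=None):
        # x: (B, T, d_model); pitch_target: (B, T) ground-truth pitch (training only)
        pitch_pred = self.predictor(x).squeeze(-1)          # (B, T)
        pitch = pitch_target if pitch_target is not None else pitch_pred
        pitch_emb = self.pitch_conv(pitch.unsqueeze(1)).transpose(1, 2)  # (B, T, d_model)
        return x + pitch_emb, pitch_pred
```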

Energy Adaptor

A variance adaptor with an added 1D convolution layer, used to obtain energy embeddings, following the same pattern as the pitch adaptor above.

Length Adaptor

The LengthAdaptor module adjusts the duration of phonemes: it expands the phoneme-level hidden sequence to frame level according to the predicted durations (length regulation, as used in FastSpeech-style non-autoregressive models).
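
A minimal sketch of length regulation, assuming integer per-phoneme durations are available; the real LengthAdaptor also handles batching, padding, and masks, which are omitted here.

```python
import torch


def length_regulate(x: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand each phoneme hidden state by its integer duration.

    x: (T, d_model) phoneme-level hidden states.
    durations: (T,) integer frame counts per phoneme.
    Returns a frame-level sequence of shape (sum(durations), d_model).
    """
    return torch.repeat_interleave(x, durations, dim=0)


# Usage example: 3 phonemes with durations 2, 1 and 3 frames.
x = torch.randn(3, 4)
durations = torch.tensor([2, 1, 3])
print(length_regulate(x, durations).shape)  # torch.Size([6, 4])
```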

Phoneme Prosody Predictor

A class defining the Phoneme Prosody Predictor. This prosody predictor is non-parallel and is inspired by the work of Du et al., 2021.
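
A hedged sketch of a phoneme-level prosody predictor: convolutions over the encoder states followed by a bottleneck projection to a small prosody embedding per phoneme. The layer counts, sizes, and activations are assumptions, not the paper's or the library's exact architecture.

```python
import torch
import torch.nn as nn


class ToyProsodyPredictor(nn.Module):
    """Predicts a small prosody embedding for each phoneme position."""

    def __init__(self, d_model: int = 256, d_prosody: int = 32, kernel: int = 5):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2),
            nn.ReLU(),
        )
        self.bottleneck = nn.Linear(d_model, d_prosody)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model) -> (B, T, d_prosody)
        h = self.convs(x.transpose(1, 2)).transpose(1, 2)
        return self.bottleneck(h)
```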

In linguistics, prosody (/ˈprɒsədi, ˈprɒzədi/) is the study of elements of speech that are not individual phonetic segments (vowels and consonants) but which are properties of syllables and larger units of speech, including linguistic functions such as intonation, stress, and rhythm. Such elements are known as suprasegmentals.

Wikipedia Prosody (linguistics)

Aligner

Aligner class represents a PyTorch module responsible for alignment tasks in a sequence-to-sequence model. It uses convolutional layers combined with LeakyReLU activation functions to project inputs to a hidden representation.

For training purposes, it also binarizes the attention with MAS.
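
A sketch of the projection-plus-soft-alignment idea under assumed shapes: text and mel features are each passed through Conv1d + LeakyReLU, and a soft alignment is formed from pairwise distances between the projected sequences. The distance-based scoring and layer sizes are illustrative, not the library's exact design.

```python
import torch
import torch.nn as nn


class ToyAligner(nn.Module):
    """Projects text and mel features and computes a soft alignment map."""

    def __init__(self, d_text: int = 256, d_mel: int = 80, d_attn: int = 128):
        super().__init__()
        self.key_proj = nn.Sequential(
            nn.Conv1d(d_text, d_attn, 3, padding=1), nn.LeakyReLU(0.2)
        )
        self.query_proj = nn.Sequential(
            nn.Conv1d(d_mel, d_attn, 3, padding=1), nn.LeakyReLU(0.2)
        )

    def forward(self, text: torch.Tensor, mel: torch.Tensor) -> torch.Tensor:
        # text: (B, T_text, d_text), mel: (B, T_mel, d_mel)
        keys = self.key_proj(text.transpose(1, 2)).transpose(1, 2)      # (B, T_text, d_attn)
        queries = self.query_proj(mel.transpose(1, 2)).transpose(1, 2)  # (B, T_mel, d_attn)
        dist = torch.cdist(queries, keys)                               # (B, T_mel, T_text)
        return torch.softmax(-dist, dim=-1)    # soft alignment over text for each mel frame
```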

Monotonic Alignment Search

mas_width1 Applies a Monotonic Alignment Search (MAS) operation with a hard-coded width of 1 to an attention map.
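
A plain-NumPy sketch of the width-1 dynamic program: each mel frame either stays on the current text token or advances by exactly one, and the best monotonic path is recovered by backtracking. This illustrates the algorithm only; it is not the library's (typically JIT-compiled) implementation.

```python
import numpy as np


def mas_width1_sketch(attn_map: np.ndarray) -> np.ndarray:
    """Width-1 monotonic alignment search over a score map.

    attn_map: (n_mel_frames, n_text_tokens) log-probability-like scores.
    Returns a 0/1 map of the same shape with exactly one 1 per mel frame,
    where the chosen text index never decreases and advances by at most 1.
    """
    T, N = attn_map.shape
    assert T >= N, "need at least as many mel frames as text tokens"

    log_p = np.full((T, N), -np.inf)
    prev = np.zeros((T, N), dtype=np.int64)
    log_p[0, 0] = attn_map[0, 0]            # path must start at the first token
    for t in range(1, T):
        for j in range(N):
            stay = log_p[t - 1, j]
            move = log_p[t - 1, j - 1] if j > 0 else -np.inf
            if move > stay:
                log_p[t, j] = move + attn_map[t, j]
                prev[t, j] = j - 1
            else:
                log_p[t, j] = stay + attn_map[t, j]
                prev[t, j] = j

    # Backtrack from the last text token at the last frame.
    out = np.zeros_like(attn_map)
    j = N - 1
    for t in range(T - 1, -1, -1):
        out[t, j] = 1.0
        j = prev[t, j]
    return out
```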

b_mas Applies the Monotonic Alignment Search (MAS) operation in parallel over the batches of an attention map. It uses the mas_width1 function internally to perform the MAS operation on each item.
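
A sketch of the batch wrapper, assuming a padded attention map of shape (B, 1, max_mel_len, max_text_len) and per-item text/mel lengths; it simply applies the width-1 sketch above to each item, whereas the real b_mas would typically parallelize this loop (e.g. with numba).

```python
import numpy as np


def b_mas_sketch(b_attn_map: np.ndarray,
                 in_lens: np.ndarray,
                 out_lens: np.ndarray) -> np.ndarray:
    """Apply mas_width1_sketch (defined above) to each item of a padded batch.

    b_attn_map: (B, 1, max_mel_len, max_text_len) score maps.
    in_lens:    (B,) true text lengths per item.
    out_lens:   (B,) true mel lengths per item.
    """
    out = np.zeros_like(b_attn_map)
    for b in range(b_attn_map.shape[0]):
        T, N = int(out_lens[b]), int(in_lens[b])
        out[b, 0, :T, :N] = mas_width1_sketch(b_attn_map[b, 0, :T, :N])
    return out
```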