References

Style Token Layer (STL)

This layer encapsulates different speaking styles in a set of token embeddings.
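As a rough illustration of how a style embedding can be formed from tokens, the sketch below blends a small bank of tokens using softmax attention weights. It is a simplification written in plain Python to expose the arithmetic: the single attention head, the scaled dot-product scoring, and the names `style_embedding`, `query`, and `tokens` are assumptions for this example, not the layer's actual interface.

```python
import math


def style_embedding(query, tokens):
    """Blend style tokens using softmax attention weights.

    query:  list of floats standing in for a reference embedding (assumed)
    tokens: list of token embeddings, each the same length as query
    """
    scale = math.sqrt(len(query))
    # Scaled dot-product score between the query and every token.
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in tokens]
    # Numerically stable softmax over the token scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The style embedding is the weighted sum of the tokens.
    dim = len(tokens[0])
    embedding = [
        sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)
    ]
    return embedding, weights


# Hypothetical bank of three 2-dimensional tokens and one query.
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [2.0, 0.0]
embedding, weights = style_embedding(query, tokens)
print(weights)    # one weight per token, summing to 1
print(embedding)  # style embedding with the same size as a token
```

The token most aligned with the query receives the largest weight, so the resulting embedding leans toward the style that token represents.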

Reference Encoder

Similar to the Tacotron model, the reference encoder extracts high-level features from the reference

It consists of several convolutional blocks (CoordConv1d for the first and nn.Conv1d for the rest), followed by instance normalization and GRU layers. CoordConv1d is used in the first layer to better preserve positional information; see the paper "Robust and fine-grained prosody control of end-to-end speech synthesis".
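The idea behind a coordinate convolution is to append position information as an extra input channel before convolving. The sketch below shows only that step, in plain Python on nested lists. The function name `add_coord_channel`, the `[channels][time]` layout, and the scaling of positions to the range [-1, 1] are assumptions for illustration; the actual CoordConv1d layer may differ in these details.

```python
def add_coord_channel(features):
    """Append one coordinate channel to a [channels][time] feature map.

    The extra channel holds each frame's position, scaled linearly
    to [-1, 1] (an assumed convention for this sketch).
    """
    num_frames = len(features[0])
    if num_frames == 1:
        # A single frame has no extent, so place it at the centre.
        coords = [0.0]
    else:
        coords = [2.0 * t / (num_frames - 1) - 1.0 for t in range(num_frames)]
    # Copy the original rows so the input is left unmodified.
    return [list(row) for row in features] + [coords]


# Hypothetical input: 2 feature channels over 5 time frames.
mel = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [1.0, 0.9, 0.8, 0.7, 0.6],
]
with_coords = add_coord_channel(mel)
print(len(with_coords))  # 3 channels: 2 original + 1 coordinate
print(with_coords[-1])   # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

A convolution applied to this augmented input sees where each frame sits in the sequence, which an ordinary convolution cannot tell from the features alone.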

Utterance Level Prosody Encoder

A class that defines the utterance-level prosody encoder.

The encoder uses the Reference Encoder class to convert input sequences into high-level features, followed by a prosody embedding, self-attention over the embeddings, and a feedforward transformation that generates the final output.

Initializes the encoder with the given specifications and creates the necessary layers.

Phoneme Level Prosody Encoder

This class encodes phoneme-level prosody in the speech synthesis pipeline.