References
Style Token Layer (STL)
This layer helps to encapsulate different speaking styles in token embeddings.
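As an illustration of how speaking styles can be held in token embeddings, here is a minimal sketch in the spirit of global style tokens: a reference embedding attends over a small bank of learned tokens, and the attention-weighted mix is the style embedding. The class name `StyleTokenSketch`, the sizes, and the choice of `nn.MultiheadAttention` are assumptions for illustration, not the layer's actual code.

```python
import torch
import torch.nn as nn


class StyleTokenSketch(nn.Module):
    """Hypothetical style token layer: attention over a learned token bank."""

    def __init__(self, ref_size=32, n_tokens=10, token_size=64, n_heads=4):
        super().__init__()
        # Learned bank of style tokens, shared across all utterances.
        self.tokens = nn.Parameter(torch.randn(n_tokens, token_size) * 0.5)
        # Project the reference embedding to the token dimension (the query).
        self.query_proj = nn.Linear(ref_size, token_size)
        self.attention = nn.MultiheadAttention(token_size, n_heads, batch_first=True)

    def forward(self, ref_embedding):
        # ref_embedding: (batch, ref_size)
        query = self.query_proj(ref_embedding).unsqueeze(1)  # (batch, 1, token_size)
        keys = torch.tanh(self.tokens).unsqueeze(0).expand(query.size(0), -1, -1)
        style, weights = self.attention(query, keys, keys)
        # style: (batch, token_size); weights: (batch, n_tokens), averaged over heads
        return style.squeeze(1), weights.squeeze(1)


layer = StyleTokenSketch()
style, weights = layer(torch.randn(2, 32))
print(style.shape, weights.shape)
```

The attention weights show how much each token contributes to a given reference, which is what makes the tokens usable as style controls at inference time.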
Reference Encoder
Similar to the Tacotron model, the reference encoder is used to extract high-level features from the reference.
It consists of a number of convolutional blocks (CoordConv1d for the first one and nn.Conv1d for the rest), followed by instance normalization and GRU layers.
CoordConv1d is used at the first layer to better preserve positional information. See the paper: Robust and fine-grained prosody control of end-to-end speech synthesis
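The structure described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the `CoordConv1d` here is a simplified stand-in that appends one normalized position channel, and the class name `ReferenceEncoderSketch`, filter sizes, kernel size, stride, and activation are all assumed values.

```python
import torch
import torch.nn as nn


class CoordConv1d(nn.Module):
    """Simplified stand-in: Conv1d over the input plus a position channel."""

    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        # One extra input channel carries the coordinate in [-1, 1].
        self.conv = nn.Conv1d(in_channels + 1, out_channels, kernel_size, stride, padding)

    def forward(self, x):
        # x: (batch, channels, time)
        batch, _, length = x.shape
        coords = torch.linspace(-1.0, 1.0, length, device=x.device, dtype=x.dtype)
        coords = coords.view(1, 1, length).expand(batch, 1, length)
        return self.conv(torch.cat([x, coords], dim=1))


class ReferenceEncoderSketch(nn.Module):
    """Hypothetical encoder: conv blocks with instance norm, then a GRU."""

    def __init__(self, n_mels=80, filters=(32, 64, 128), gru_size=32):
        super().__init__()
        in_channels = (n_mels,) + tuple(filters[:-1])
        blocks = []
        for i, (c_in, c_out) in enumerate(zip(in_channels, filters)):
            # CoordConv1d for the first block, plain nn.Conv1d for the rest.
            conv_cls = CoordConv1d if i == 0 else nn.Conv1d
            blocks += [
                conv_cls(c_in, c_out, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.InstanceNorm1d(c_out, affine=True),
            ]
        self.convs = nn.Sequential(*blocks)
        self.gru = nn.GRU(filters[-1], gru_size, batch_first=True)

    def forward(self, mels):
        # mels: (batch, n_mels, time)
        x = self.convs(mels)       # (batch, filters[-1], reduced_time)
        x = x.transpose(1, 2)      # (batch, reduced_time, filters[-1])
        outputs, last_hidden = self.gru(x)
        # outputs: per-frame features; last_hidden: one summary vector per item
        return outputs, last_hidden.squeeze(0)


encoder = ReferenceEncoderSketch()
outputs, summary = encoder(torch.randn(2, 80, 50))
print(outputs.shape, summary.shape)
```

Each stride-2 convolution roughly halves the time axis, so the GRU runs over a much shorter sequence than the input frames.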
Utterance Level Prosody Encoder
A class to define the utterance level prosody encoder.
The encoder uses a reference encoder class to convert input sequences into high-level features, followed by a prosody embedding, self-attention on the embeddings, and a feedforward transformation to generate the final output.
The constructor initializes the encoder with the given specifications and creates the necessary layers.
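The data flow described above might look like the following sketch. To keep the block self-contained, a single `nn.GRU` stands in for the reference encoder. The class name `UtteranceProsodySketch`, the layer sizes, and the exact layer types are assumptions rather than the actual implementation.

```python
import torch
import torch.nn as nn


class UtteranceProsodySketch(nn.Module):
    """Hypothetical pipeline: reference features, embedding, attention, feedforward."""

    def __init__(self, n_mels=80, ref_size=32, bottleneck=64, n_heads=4):
        super().__init__()
        # Stand-in for the reference encoder (assumption): a GRU over mel frames.
        self.reference_encoder = nn.GRU(n_mels, ref_size, batch_first=True)
        self.prosody_embedding = nn.Linear(ref_size, bottleneck)
        self.attention = nn.MultiheadAttention(bottleneck, n_heads, batch_first=True)
        self.feedforward = nn.Sequential(
            nn.Linear(bottleneck, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, bottleneck),
        )

    def forward(self, mels):
        # mels: (batch, time, n_mels)
        _, last_hidden = self.reference_encoder(mels)   # (1, batch, ref_size)
        emb = self.prosody_embedding(last_hidden.transpose(0, 1))  # (batch, 1, bottleneck)
        attended, _ = self.attention(emb, emb, emb)     # self-attention on the embedding
        return self.feedforward(attended)               # (batch, 1, bottleneck)


model = UtteranceProsodySketch()
prosody = model(torch.randn(3, 40, 80))
print(prosody.shape)
```

The output is a single vector per utterance, which can then be broadcast over the phoneme sequence by whatever module consumes it.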
Phoneme Level Prosody Encoder
This class encodes phoneme-level prosody in the speech synthesis pipeline.