Utterance Level Prosody Encoder

UtteranceLevelProsodyEncoder

Bases: Module

A class to define the utterance level prosody encoder.

The encoder uses a ReferenceEncoder to convert input sequences into high-level features, followed by prosody embedding, self-attention over the embeddings, and a feed-forward transformation to generate the final output. The constructor initializes the encoder with the given specifications and creates the necessary layers.

Parameters:

Name               Type                     Description                                           Default
preprocess_config  PreprocessingConfig      Configuration object with preprocessing parameters.   required
model_config       AcousticModelConfigType  Configuration object with acoustic model parameters.  required

Returns:

Type          Description
torch.Tensor  A 3-dimensional tensor sized [N, seq_len, E].
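Example:

A minimal usage sketch, assuming the config classes can be instantiated as shown and that the import paths match the repository layout (both are assumptions):

    import torch

    from models.config import PreprocessingConfig, AcousticModelConfigType  # assumed path
    from models.tts.delightful_tts.reference_encoder.utterance_level_prosody_encoder import (
        UtteranceLevelProsodyEncoder,
    )

    preprocess_config = PreprocessingConfig()      # project-specific settings assumed
    model_config = AcousticModelConfigType()       # project-specific settings assumed

    encoder = UtteranceLevelProsodyEncoder(preprocess_config, model_config)

    # Hypothetical batch: 4 reference mels, padded to 200 frames, 80 mel channels.
    mels = torch.randn(4, 200, 80)                 # [N, Ty/r, n_mels*r]
    mel_lens = torch.tensor([200, 180, 150, 120])  # true length of each sequence

    prosody = encoder(mels, mel_lens)              # 3-dimensional prosody embedding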

Source code in models/tts/delightful_tts/reference_encoder/utterance_level_prosody_encoder.py
class UtteranceLevelProsodyEncoder(Module):
    r"""A class to define the utterance level prosody encoder.

    The encoder uses a Reference encoder class to convert input sequences into high-level features,
    followed by prosody embedding, self attention on the embeddings, and a feedforward
    transformation to generate the final output. Initializes the encoder with the given
    specifications and creates the necessary layers.

    Args:
        preprocess_config (PreprocessingConfig): Configuration object with preprocessing parameters.
        model_config (AcousticModelConfigType): Configuration object with acoustic model parameters.

    Returns:
        torch.Tensor: A 3-dimensional tensor sized `[N, seq_len, E]`.
    """

    def __init__(
        self,
        preprocess_config: PreprocessingConfig,
        model_config: AcousticModelConfigType,
    ):
        super().__init__()

        self.E = model_config.encoder.n_hidden
        ref_enc_gru_size = model_config.reference_encoder.ref_enc_gru_size
        ref_attention_dropout = model_config.reference_encoder.ref_attention_dropout
        bottleneck_size = model_config.reference_encoder.bottleneck_size_u

        # Define important layers/modules for the encoder
        self.encoder = ReferenceEncoder(preprocess_config, model_config)
        self.encoder_prj = nn.Linear(ref_enc_gru_size, self.E // 2)
        self.stl = STL(model_config)
        self.encoder_bottleneck = nn.Linear(self.E, bottleneck_size)
        self.dropout = nn.Dropout(ref_attention_dropout)

    def forward(self, mels: torch.Tensor, mel_lens: torch.Tensor) -> torch.Tensor:
        r"""Defines the forward pass of the utterance level prosody encoder.

        Args:
            mels (torch.Tensor): A 3-dimensional tensor containing input sequences. Size is `[N, Ty/r, n_mels*r]`.
            mel_lens (torch.Tensor): A 1-dimensional tensor containing the lengths of each sequence in mels. Length is N.

        Returns:
            torch.Tensor: A 3-dimensional tensor sized `[N, seq_len, E]`.
        """
        # Use the reference encoder to get prosody embeddings
        _, embedded_prosody, _ = self.encoder(mels, mel_lens)

        # Bottleneck
        # Use the linear projection layer on the prosody embeddings
        embedded_prosody = self.encoder_prj(embedded_prosody)

        # Apply the style token layer followed by the bottleneck layer
        out = self.encoder_bottleneck(self.stl(embedded_prosody))

        # Apply dropout for regularization
        out = self.dropout(out)

        # Reshape the output tensor before returning
        return out.view((-1, 1, out.shape[3]))
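The final view call flattens the attention output so each utterance contributes a single prosody vector. A toy illustration of the same reshape, with a hypothetical 4-dimensional input (the dimension names are assumptions; the code only requires that a fourth dimension exists):

    import torch

    # Stand-in for the bottleneck output: [batch, heads, tokens, features] (names assumed).
    out = torch.randn(4, 1, 1, 256)

    reshaped = out.view((-1, 1, out.shape[3]))
    print(reshaped.shape)  # torch.Size([4, 1, 256])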

forward(mels, mel_lens)

Defines the forward pass of the utterance level prosody encoder.

Parameters:

Name      Type    Description                                                                            Default
mels      Tensor  A 3-dimensional tensor containing input sequences. Size is [N, Ty/r, n_mels*r].        required
mel_lens  Tensor  A 1-dimensional tensor containing the lengths of each sequence in mels. Length is N.   required

Returns:

Type          Description
torch.Tensor  A 3-dimensional tensor sized [N, seq_len, E].
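Example:

A hedged sketch of a forward call, assuming an encoder constructed as in the example above and 80 mel channels (the channel count is an assumption; in practice it comes from the preprocessing configuration):

    import torch

    # Hypothetical inputs: batch of 2 reference mels, padded to 160 frames.
    mels = torch.randn(2, 160, 80)       # [N, Ty/r, n_mels*r]
    mel_lens = torch.tensor([160, 140])  # true frame count of each sequence

    out = encoder(mels, mel_lens)        # encoder: UtteranceLevelProsodyEncoder
    print(out.shape)                     # 3-dimensional prosody embedding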

Source code in models/tts/delightful_tts/reference_encoder/utterance_level_prosody_encoder.py
def forward(self, mels: torch.Tensor, mel_lens: torch.Tensor) -> torch.Tensor:
    r"""Defines the forward pass of the utterance level prosody encoder.

    Args:
        mels (torch.Tensor): A 3-dimensional tensor containing input sequences. Size is `[N, Ty/r, n_mels*r]`.
        mel_lens (torch.Tensor): A 1-dimensional tensor containing the lengths of each sequence in mels. Length is N.

    Returns:
        torch.Tensor: A 3-dimensional tensor sized `[N, seq_len, E]`.
    """
    # Use the reference encoder to get prosody embeddings
    _, embedded_prosody, _ = self.encoder(mels, mel_lens)

    # Bottleneck
    # Use the linear projection layer on the prosody embeddings
    embedded_prosody = self.encoder_prj(embedded_prosody)

    # Apply the style token layer followed by the bottleneck layer
    out = self.encoder_bottleneck(self.stl(embedded_prosody))

    # Apply dropout for regularization
    out = self.dropout(out)

    # Reshape the output tensor before returning
    return out.view((-1, 1, out.shape[3]))