Bases: Module
A class to define the utterance-level prosody encoder.
The encoder uses a ReferenceEncoder to convert input sequences into high-level features, followed by prosody embedding, self-attention on the embeddings, and a feedforward transformation to generate the final output. The constructor initializes the encoder from the given configurations and creates the necessary layers.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `preprocess_config` | `PreprocessingConfig` | Configuration object with preprocessing parameters. | required |
| `model_config` | `AcousticModelConfigType` | Configuration object with acoustic model parameters. | required |
Returns:

| Type | Description |
| --- | --- |
| `torch.Tensor` | A 3-dimensional tensor sized `[N, seq_len, E]`. |
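A minimal usage sketch. The two config objects are assumed to be constructed elsewhere from the project's `PreprocessingConfig` and `AcousticModelConfigType` definitions; the tensor sizes are illustrative, not prescribed by this module:

import torch

# Assumption: preprocess_config and model_config are built from the project's
# configuration classes; their fields drive the layer sizes in the source below.
encoder = UtteranceLevelProsodyEncoder(preprocess_config, model_config)

N, frames, n_mels = 2, 160, 80           # illustrative batch of mel spectrograms
mels = torch.randn(N, frames, n_mels)    # [N, Ty/r, n_mels*r]
mel_lens = torch.tensor([160, 120])      # true length of each sequence, length N

with torch.no_grad():
    u_prosody = encoder(mels, mel_lens)  # one prosody vector per utterance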
Source code in models/tts/delightful_tts/reference_encoder/utterance_level_prosody_encoder.py
class UtteranceLevelProsodyEncoder(Module):
    r"""A class to define the utterance level prosody encoder.

    The encoder uses a ReferenceEncoder to convert input sequences into high-level features,
    followed by prosody embedding, self attention on the embeddings, and a feedforward
    transformation to generate the final output. Initializes the encoder with the given
    specifications and creates the necessary layers.

    Args:
        preprocess_config (PreprocessingConfig): Configuration object with preprocessing parameters.
        model_config (AcousticModelConfigType): Configuration object with acoustic model parameters.

    Returns:
        torch.Tensor: A 3-dimensional tensor sized `[N, seq_len, E]`.
    """

    def __init__(
        self,
        preprocess_config: PreprocessingConfig,
        model_config: AcousticModelConfigType,
    ):
        super().__init__()

        self.E = model_config.encoder.n_hidden
        ref_enc_gru_size = model_config.reference_encoder.ref_enc_gru_size
        ref_attention_dropout = model_config.reference_encoder.ref_attention_dropout
        bottleneck_size = model_config.reference_encoder.bottleneck_size_u

        # Define important layers/modules for the encoder
        self.encoder = ReferenceEncoder(preprocess_config, model_config)
        self.encoder_prj = nn.Linear(ref_enc_gru_size, self.E // 2)
        self.stl = STL(model_config)
        self.encoder_bottleneck = nn.Linear(self.E, bottleneck_size)
        self.dropout = nn.Dropout(ref_attention_dropout)

    def forward(self, mels: torch.Tensor, mel_lens: torch.Tensor) -> torch.Tensor:
        r"""Defines the forward pass of the utterance level prosody encoder.

        Args:
            mels (torch.Tensor): A 3-dimensional tensor containing input sequences. Size is `[N, Ty/r, n_mels*r]`.
            mel_lens (torch.Tensor): A 1-dimensional tensor containing the lengths of each sequence in mels. Length is N.

        Returns:
            torch.Tensor: A 3-dimensional tensor sized `[N, seq_len, E]`.
        """
        # Use the reference encoder to get prosody embeddings
        _, embedded_prosody, _ = self.encoder(mels, mel_lens)

        # Bottleneck: use the linear projection layer on the prosody embeddings
        embedded_prosody = self.encoder_prj(embedded_prosody)

        # Apply the style token layer followed by the bottleneck layer
        out = self.encoder_bottleneck(self.stl(embedded_prosody))

        # Apply dropout for regularization
        out = self.dropout(out)

        # Reshape the output tensor before returning
        return out.view((-1, 1, out.shape[3]))
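Tracing shapes through `forward` makes the output concrete. The intermediate sizes below are inferred from the projections visible in the code above rather than stated in the docstring, so treat them as an assumption:

# mels:             [N, Ty/r, n_mels*r]
# encoder_prj out:  trailing dimension E // 2
# stl out:          four-dimensional (the final .view reads out.shape[3])
# bottleneck out:   trailing dimension bottleneck_size_u
# returned tensor:  [N, 1, bottleneck_size_u] -- a single prosody vector per utterance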
forward(mels, mel_lens)
Defines the forward pass of the utterance-level prosody encoder.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `mels` | `Tensor` | A 3-dimensional tensor containing input sequences. Size is `[N, Ty/r, n_mels*r]`. | required |
| `mel_lens` | `Tensor` | A 1-dimensional tensor containing the lengths of each sequence in `mels`. Length is `N`. | required |
Returns:

| Type | Description |
| --- | --- |
| `Tensor` | A 3-dimensional tensor sized `[N, seq_len, E]`. |
Source code in models/tts/delightful_tts/reference_encoder/utterance_level_prosody_encoder.py
def forward(self, mels: torch.Tensor, mel_lens: torch.Tensor) -> torch.Tensor:
    r"""Defines the forward pass of the utterance level prosody encoder.

    Args:
        mels (torch.Tensor): A 3-dimensional tensor containing input sequences. Size is `[N, Ty/r, n_mels*r]`.
        mel_lens (torch.Tensor): A 1-dimensional tensor containing the lengths of each sequence in mels. Length is N.

    Returns:
        torch.Tensor: A 3-dimensional tensor sized `[N, seq_len, E]`.
    """
    # Use the reference encoder to get prosody embeddings
    _, embedded_prosody, _ = self.encoder(mels, mel_lens)

    # Bottleneck: use the linear projection layer on the prosody embeddings
    embedded_prosody = self.encoder_prj(embedded_prosody)

    # Apply the style token layer followed by the bottleneck layer
    out = self.encoder_bottleneck(self.stl(embedded_prosody))

    # Apply dropout for regularization
    out = self.dropout(out)

    # Reshape the output tensor before returning
    return out.view((-1, 1, out.shape[3]))
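Downstream, a single utterance-level prosody vector of this kind is typically broadcast across the phoneme sequence and fused with the text-encoder states. A hedged sketch of that pattern (the names and sizes here are illustrative, not taken from this file):

import torch

N, src_len, bottleneck = 2, 50, 256                   # illustrative sizes
encoder_states = torch.randn(N, src_len, bottleneck)  # hypothetical text-encoder output
u_prosody = torch.randn(N, 1, bottleneck)             # output of this module (seq_len = 1)

# Broadcast the per-utterance vector over every source position and add it.
fused = encoder_states + u_prosody.expand(-1, src_len, -1)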