Energy Adaptor
EnergyAdaptor
Bases: Module
Variance Adaptor with an added 1D conv layer. Used to get energy embeddings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
channels_in | int | Number of input channels for the conv layers. | required |
channels_out | int | Number of output channels. | required |
kernel_size | int | Kernel size for the conv layers. | required |
dropout | float | Dropout probability. | required |
leaky_relu_slope | float | Slope for the leaky ReLU. | required |
emb_kernel_size | int | Kernel size for the energy embedding. | required |
Inputs: inputs, target, dr, mask
- inputs (batch, time1, dim): Tensor containing the input vectors
- target (batch, 1, time2): Tensor containing the energy target
- dr (batch, time1): Tensor containing the aligner durations
- mask (batch, time1): Tensor containing the indices to be masked
Returns:
- energy prediction (batch, 1, time1): Tensor produced by the energy predictor
- energy embedding (batch, channels, time1): Tensor produced by the energy adaptor
- average energy target (train only) (batch, 1, time1): Tensor produced by averaging the target over durations
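The embedding step can be pictured as a 1D convolution that lifts a single-channel energy contour of shape (1, time1) to (channels, time1) without changing the time length. The sketch below uses plain Python lists in place of tensors so that it runs without dependencies. The helper name `conv1d_same`, the zero padding of `(kernel_size - 1) // 2`, and the example weights are assumptions made for illustration, not the module's actual code.

```python
from typing import List


def conv1d_same(signal: List[float], kernels: List[List[float]]) -> List[List[float]]:
    """Apply one odd-sized kernel per output channel with zero 'same' padding.

    signal:  energy contour of length T (a single input channel).
    kernels: one kernel per output channel, each of odd length K.
    Returns a nested list of shape [C][T].
    """
    out = []
    for kernel in kernels:
        k = len(kernel)
        if k % 2 == 0:
            raise ValueError("this sketch only supports odd kernel sizes")
        pad = (k - 1) // 2  # assumed padding; keeps the output length equal to T
        padded = [0.0] * pad + list(signal) + [0.0] * pad
        # Cross-correlation, which is what deep learning frameworks call convolution
        channel = [
            sum(w * padded[t + i] for i, w in enumerate(kernel))
            for t in range(len(signal))
        ]
        out.append(channel)
    return out


energy = [1.0, 2.0, 3.0, 4.0]                   # T = 4
kernels = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]    # C = 2, K = 3 (hypothetical weights)
embedding = conv1d_same(energy, kernels)        # shape [2][4]
```

With an odd kernel size, every output channel has the same length as the input contour, which matches the (batch, channels, time1) shape listed for the energy embedding.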
Source code in models/tts/delightful_tts/acoustic_model/energy_adaptor.py
add_energy_embedding(x, mask)
Add energy embedding during inference.
This method calculates the energy embedding and adds it to the input tensor 'x'. It also returns the predicted energy.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Tensor | The input tensor to which the energy embedding will be added. | required |
mask | Tensor | The mask tensor used in the energy embedding calculation. | required |
energy_transform | Callable | A function to transform the energy prediction. | required |
Returns:
Name | Type | Description |
---|---|---|
x | Tensor | The input tensor with added energy embedding. |
energy_pred | Tensor | The predicted energy tensor. |
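The shapes documented for `get_energy_embedding` suggest that the embedding is channels-first ([B, C, T_src]) while `x` is time-first ([B, T_src, C]), so adding one to the other presumably involves a transpose. The following is a minimal sketch of that addition for a single sequence (no batch dimension), using nested lists instead of tensors. The function name `add_embedding` and the sample values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def add_embedding(x: Matrix, emb: Matrix) -> Matrix:
    """Add a channels-first embedding [C][T] to time-first features [T][C]."""
    emb_t = [list(row) for row in zip(*emb)]  # transpose [C][T] -> [T][C]
    if len(emb_t) != len(x) or any(len(a) != len(b) for a, b in zip(x, emb_t)):
        raise ValueError("x and the transposed embedding must share a shape")
    return [
        [xv + ev for xv, ev in zip(x_row, e_row)]
        for x_row, e_row in zip(x, emb_t)
    ]


x = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]       # T = 3, C = 2
emb = [[10.0, 20.0, 30.0], [0.5, 0.5, 0.5]]    # C = 2, T = 3
x_with_energy = add_embedding(x, emb)
```

The result keeps the time-first layout of `x`, so later layers that expect [T_src, C] features see the same shape as before.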
Source code in models/tts/delightful_tts/acoustic_model/energy_adaptor.py
add_energy_embedding_train(x, target, dr, mask)
Add energy embedding during training.
This method calculates the energy embedding and adds it to the input tensor 'x'. It also returns the predicted energy and the average target energy.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Tensor | The input tensor to which the energy embedding will be added. | required |
target | Tensor | The target tensor used in the energy embedding calculation. | required |
dr | Tensor | The duration tensor used in the energy embedding calculation. | required |
mask | Tensor | The mask tensor used in the energy embedding calculation. | required |
Returns:
Name | Type | Description |
---|---|---|
x | Tensor | The input tensor with added energy embedding. |
energy_pred | Tensor | The predicted energy tensor. |
avg_energy_target | Tensor | The average target energy tensor. |
Source code in models/tts/delightful_tts/acoustic_model/energy_adaptor.py
get_energy_embedding(x, mask)
Used during inference to get the energy embedding and the energy prediction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Tensor | A 3D tensor of shape [B, T_src, C] where B is the batch size, T_src is the source sequence length, and C is the number of channels. | required |
mask | Tensor | A 2D tensor of shape [B, T_src] where B is the batch size, T_src is the source sequence length. The values represent the mask. | required |
Returns:
Name | Type | Description |
---|---|---|
energy_emb_pred | Tensor | A 3D tensor of shape [B, C, T_src] where B is the batch size, C is the number of channels, T_src is the source sequence length. The values represent the energy embedding. |
energy_pred | Tensor | A 3D tensor of shape [B, 1, T_src] where B is the batch size, T_src is the source sequence length. The values represent the energy prediction. |
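The `mask` argument is described only as a [B, T_src] tensor of positions to be masked. As a rough illustration of how such a mask is commonly built and applied, the sketch below derives it from per-sequence lengths and zeroes the masked predictions. The convention that `True` marks padding, the zero fill value, and the helper names `padding_mask` and `apply_mask` are all assumptions, not taken from the module's source.

```python
from typing import List


def padding_mask(lengths: List[int]) -> List[List[bool]]:
    """Build a [B][T_src] mask where True marks padded positions (assumed convention)."""
    t_src = max(lengths)
    return [[t >= n for t in range(t_src)] for n in lengths]


def apply_mask(values: List[List[float]], mask: List[List[bool]]) -> List[List[float]]:
    """Zero out every masked position of a [B][T_src] prediction."""
    return [
        [0.0 if m else v for v, m in zip(v_row, m_row)]
        for v_row, m_row in zip(values, mask)
    ]


mask = padding_mask([3, 1])                     # B = 2, T_src = 3
raw_pred = [[0.2, 0.4, 0.6], [0.9, 0.7, 0.5]]   # hypothetical predictor output
energy_pred = apply_mask(raw_pred, mask)
```

Here the second sequence has only one real token, so its last two predictions are zeroed and do not contribute to anything computed downstream.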
Source code in models/tts/delightful_tts/acoustic_model/energy_adaptor.py
get_energy_embedding_train(x, target, dr, mask)
Used during training to get the energy prediction, the average energy target, and the energy embedding.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Tensor | A 3D tensor of shape [B, T_src, C] where B is the batch size, T_src is the source sequence length, and C is the number of channels. | required |
target | Tensor | A 3D tensor of shape [B, 1, T_max2] where B is the batch size, T_max2 is the maximum target sequence length. | required |
dr | Tensor | A 2D tensor of shape [B, T_src] where B is the batch size, T_src is the source sequence length. The values represent the durations. | required |
mask | Tensor | A 2D tensor of shape [B, T_src] where B is the batch size, T_src is the source sequence length. The values represent the mask. | required |
Returns:
Name | Type | Description |
---|---|---|
energy_pred | Tensor | A 3D tensor of shape [B, 1, T_src] where B is the batch size, T_src is the source sequence length. The values represent the energy prediction. |
avg_energy_target | Tensor | A 3D tensor of shape [B, 1, T_src] where B is the batch size, T_src is the source sequence length. The values represent the average energy target. |
energy_emb | Tensor | A 3D tensor of shape [B, C, T_src] where B is the batch size, C is the number of channels, T_src is the source sequence length. The values represent the energy embedding. |
Shapes:
x: [B, T_src, C]
target: [B, 1, T_max2]
dr: [B, T_src]
mask: [B, T_src]
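The step that turns a frame-level target of length T_max2 into a token-level target of length T_src is averaging over durations: each source token owns a contiguous run of frames, and its target is the mean of that run. Below is a minimal sketch for one sequence, with plain lists standing in for tensors. The name `mean_per_token` is hypothetical, and the choice to return 0.0 for zero-duration tokens is an assumption; the helper used by the module may treat zeros or padding differently.

```python
from typing import List


def mean_per_token(target: List[float], durations: List[int]) -> List[float]:
    """Average frame-level values over each token's span of frames.

    target:    frame-level energy of length T_max2.
    durations: number of frames per source token, length T_src.
    Returns one averaged value per token (length T_src).
    """
    averaged = []
    start = 0
    for d in durations:
        frames = target[start:start + d]
        # Tokens with zero duration (for example padding) get 0.0 in this sketch
        averaged.append(sum(frames) / len(frames) if frames else 0.0)
        start += d
    return averaged


frame_energy = [1.0, 3.0, 2.0, 2.0, 2.0, 8.0]   # T_max2 = 6
durations = [2, 3, 1, 0]                        # T_src = 4, last token is padding
token_energy = mean_per_token(frame_energy, durations)
```

The output has one value per source token, which is why `avg_energy_target` is documented with shape [B, 1, T_src] even though `target` has length T_max2.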
Source code in models/tts/delightful_tts/acoustic_model/energy_adaptor.py