Pitch Adaptor Conv
PitchAdaptorConv
Bases: Module
PitchAdaptorConv is the model's pitch adaptor network. This updated version uses convolutional embeddings for the pitch.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
channels_in | int | Number of input channels for the conv layers. | required |
channels_out | int | Number of output channels. | required |
kernel_size | int | Kernel size for the conv layers. | required |
dropout | float | Dropout probability. | required |
leaky_relu_slope | float | Slope of the leaky ReLU. | required |
emb_kernel_size | int | Kernel size for the pitch embedding. | required |
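The documented outputs keep the input time dimension (time1), which suggests that the conv layers pad their inputs so the sequence length is preserved. The sketch below is not taken from the source file. It applies the standard 1D convolution output-length formula to show that, with stride 1, a padding of (kernel_size - 1) // 2 preserves the length for odd kernel sizes. The helper names conv1d_out_length and same_padding are hypothetical, and the padding scheme itself is an assumption.

```python
def conv1d_out_length(length: int, kernel_size: int, padding: int = 0,
                      stride: int = 1, dilation: int = 1) -> int:
    """Output length of a 1D convolution (standard formula)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def same_padding(kernel_size: int) -> int:
    """Padding that preserves length for odd kernel sizes at stride 1 (assumed scheme)."""
    return (kernel_size - 1) // 2


if __name__ == "__main__":
    time_steps = 50
    for k in (1, 3, 5, 9):
        out = conv1d_out_length(time_steps, k, padding=same_padding(k))
        print(k, out)  # the length stays at 50 for every odd kernel size
    # An even kernel size loses one step with this padding.
    print(4, conv1d_out_length(time_steps, 4, padding=same_padding(4)))
```

If kernel_size or emb_kernel_size were even, this padding would shorten the sequence by one step, so odd values are the natural choice under this assumption.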
Inputs: inputs, target, dr, mask
- inputs (batch, time1, dim): Tensor containing input vector
- target (batch, 1, time2): Tensor containing the pitch target
- dr (batch, time1): Tensor containing aligner durations vector
- mask (batch, time1): Tensor containing indices to be masked
Returns:
- pitch prediction (batch, 1, time1): Tensor produced by the pitch predictor
- pitch embedding (batch, channels, time1): Tensor produced by the pitch adaptor
- average pitch target (train only) (batch, 1, time1): Tensor produced by averaging the pitch target over durations
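To illustrate what averaging over durations means, here is a minimal sketch for a single, unbatched sequence. It assumes that each duration counts the consecutive target frames (time2) belonging to one input position (time1) and that the average is a plain mean. The function name average_over_durations and the zero placeholder for empty spans are assumptions for illustration; the actual implementation may differ, for example in how it treats unvoiced frames.

```python
from typing import List


def average_over_durations(values: List[float], durations: List[int]) -> List[float]:
    """Average frame-level values over consecutive spans given by durations.

    values has one entry per frame (time2); durations has one entry per
    input position (time1) and gives the number of frames that position covers.
    """
    averaged = []
    start = 0
    for d in durations:
        span = values[start:start + d]
        # A position with zero duration covers no frames; use 0.0 as a placeholder.
        averaged.append(sum(span) / len(span) if span else 0.0)
        start += d
    return averaged


if __name__ == "__main__":
    frame_pitch = [100.0, 110.0, 120.0, 200.0, 210.0]  # time2 = 5
    token_durations = [3, 0, 2]                        # time1 = 3
    print(average_over_durations(frame_pitch, token_durations))
    # -> [110.0, 0.0, 205.0]
```

The result has one value per input position, which is why the averaged target has length time1 while the raw target has length time2.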
Source code in models/tts/delightful_tts/acoustic_model/pitch_adaptor_conv.py
add_pitch_embedding(x, mask)
Add pitch embedding during inference.
This method calculates the pitch embedding and adds it to the input tensor 'x'. It also returns the predicted pitch.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Tensor | The input tensor to which the pitch embedding will be added. | required |
mask | Tensor | The mask tensor used in the pitch embedding calculation. | required |
pitch_transform | Callable | A function to transform the pitch prediction. Documented here but not part of the add_pitch_embedding(x, mask) signature above. | required |
Returns:
Name | Type | Description |
---|---|---|
x | Tensor | The input tensor with added pitch embedding. |
pitch_pred | Tensor | The predicted pitch tensor. |
add_pitch_embedding_train(x, target, dr, mask)
Add pitch embedding during training.
This method calculates the pitch embedding and adds it to the input tensor 'x'. It also returns the predicted pitch and the average target pitch.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Tensor | The input tensor to which the pitch embedding will be added. | required |
target | Tensor | The target tensor used in the pitch embedding calculation. | required |
dr | Tensor | The duration tensor used in the pitch embedding calculation. | required |
mask | Tensor | The mask tensor used in the pitch embedding calculation. | required |
Returns:
Name | Type | Description |
---|---|---|
x | Tensor | The input tensor with added pitch embedding. |
pitch_pred | Tensor | The predicted pitch tensor. |
avg_pitch_target | Tensor | The average target pitch tensor. |
get_pitch_embedding(x, mask)
Used during inference to get the pitch embedding and the pitch prediction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Tensor | A 3D tensor of shape [B, T_src, C] where B is the batch size, T_src is the source sequence length, and C is the number of channels. | required |
mask | Tensor | A 2D tensor of shape [B, T_src] where B is the batch size and T_src is the source sequence length. The values represent the mask. | required |
Returns:
Name | Type | Description |
---|---|---|
pitch_emb_pred | Tensor | A 3D tensor of shape [B, C, T_src] where B is the batch size, C is the number of channels, and T_src is the source sequence length. The values represent the pitch embedding. |
pitch_pred | Tensor | A 3D tensor of shape [B, 1, T_src] where B is the batch size and T_src is the source sequence length. The values represent the pitch prediction. |
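The pitch embedding is documented as [B, C, T_src] while the input x is [B, T_src, C], so adding the embedding to x requires aligning the two layouts first. The sketch below shows that alignment for one batch item using plain nested lists. It assumes the addition is elementwise after swapping the channel and time axes; the function name add_embedding is hypothetical and not part of the class.

```python
from typing import List

Matrix = List[List[float]]


def add_embedding(x: Matrix, emb: Matrix) -> Matrix:
    """Add emb (shape [C, T]) to x (shape [T, C]) for a single batch item."""
    time_steps, channels = len(x), len(x[0])
    # The embedding must hold one row per channel and one column per time step.
    assert len(emb) == channels and all(len(row) == time_steps for row in emb)
    return [
        [x[t][c] + emb[c][t] for c in range(channels)]
        for t in range(time_steps)
    ]


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]       # T = 3, C = 2
    emb = [[0.5, 1.5, 2.5], [10.0, 20.0, 30.0]]    # C = 2, T = 3
    print(add_embedding(x, emb))
    # -> [[1.5, 12.0], [4.5, 24.0], [7.5, 36.0]]
```

With tensors, the same alignment would normally be a single axis swap followed by an addition, and it requires the embedding channel count to match the channel count of x.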
get_pitch_embedding_train(x, target, dr, mask)
Used during training to get the pitch prediction, the average pitch target, and the pitch embedding.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Tensor | A 3D tensor of shape [B, T_src, C] where B is the batch size, T_src is the source sequence length, and C is the number of channels. | required |
target | Tensor | A 3D tensor of shape [B, 1, T_max2] where B is the batch size and T_max2 is the maximum target sequence length. | required |
dr | Tensor | A 2D tensor of shape [B, T_src] where B is the batch size and T_src is the source sequence length. The values represent the durations. | required |
mask | Tensor | A 2D tensor of shape [B, T_src] where B is the batch size and T_src is the source sequence length. The values represent the mask. | required |
Returns:
Name | Type | Description |
---|---|---|
pitch_pred | Tensor | A 3D tensor of shape [B, 1, T_src] where B is the batch size and T_src is the source sequence length. The values represent the pitch prediction. |
avg_pitch_target | Tensor | A 3D tensor of shape [B, 1, T_src] where B is the batch size and T_src is the source sequence length. The values represent the average pitch target. |
pitch_emb | Tensor | A 3D tensor of shape [B, C, T_src] where B is the batch size, C is the number of channels, and T_src is the source sequence length. The values represent the pitch embedding. |
Shapes:
- x: [B, T_src, C]
- target: [B, 1, T_max2]
- dr: [B, T_src]
- mask: [B, T_src]
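The page does not say how the [B, T_src] mask is built or which value marks a masked position. As a rough sketch, the code below assumes the common convention that True marks padded positions beyond each sequence's real length, and shows how such a mask could zero out a per-position pitch value. The names padding_mask and apply_mask are hypothetical, and the polarity of the mask should be checked against the source file.

```python
from typing import List, Optional


def padding_mask(lengths: List[int], max_len: Optional[int] = None) -> List[List[bool]]:
    """Build a [B, T_src] mask where True marks padded positions (assumed convention)."""
    max_len = max(lengths) if max_len is None else max_len
    return [[t >= n for t in range(max_len)] for n in lengths]


def apply_mask(values: List[float], mask_row: List[bool], fill: float = 0.0) -> List[float]:
    """Replace the masked positions of one sequence with a fill value."""
    return [fill if masked else v for v, masked in zip(values, mask_row)]


if __name__ == "__main__":
    mask = padding_mask([3, 1])
    print(mask)
    # -> [[False, False, False], [False, True, True]]
    print(apply_mask([120.0, 130.0, 125.0], mask[1]))
    # -> [120.0, 0.0, 0.0]
```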