Style Embed Attention
StyleEmbedAttention
Bases: Module
This attention mechanism extracts style features from audio data represented as spectrograms.
Each style token, parameterized by an embedding vector, represents a distinct style feature. The model applies the StyleEmbedAttention mechanism to combine these style tokens in a weighted manner: the output of the attention module is a sum of style tokens, with each token weighted by its relevance to the input.
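The weighted combination described above can be illustrated with a minimal pure-Python sketch. The token values, the scores, and the helper names `softmax` and `combine_style_tokens` are hypothetical and chosen only for illustration; the real module learns its tokens and computes the scores from the input.

```python
import math

def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_style_tokens(tokens, scores):
    """Return the weighted sum of the token vectors, one weight per token."""
    weights = softmax(scores)
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]

# Three hypothetical 2-dimensional style tokens and their relevance scores.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]  # equal scores give each token the same weight
style_embedding = combine_style_tokens(tokens, scores)
```

With equal scores every token receives the same weight, so the result is the plain average of the tokens; raising one score shifts the embedding toward that token.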
This technique is often used in text-to-speech (TTS) systems such as Tacotron-2, where the goal is to modulate the prosody, stress, and intonation of the synthesized speech based on reference audio or control parameters. The concept of "global style tokens" (GST) was introduced in "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis" by Yuxuan Wang et al.
The StyleEmbedAttention class is a PyTorch module implementing this attention mechanism with multiple attention heads.
Attention here operates on a query and a set of key-value pairs to produce an output.
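As a rough illustration of attention over a query and a set of key-value pairs, the sketch below uses the standard scaled dot-product form for a single head. The function name, tensor shapes, and scaling factor are assumptions made for this example and may differ in detail from the module's source.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """Assumed shapes: query [N, T_q, d], key [N, T_k, d], value [N, T_k, d_v]."""
    d = query.size(-1)
    # Similarity of every query position to every key position: [N, T_q, T_k]
    scores = torch.matmul(query, key.transpose(-2, -1)) / d ** 0.5
    # Normalize over the key axis so the weights for each query sum to 1
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values: [N, T_q, d_v]
    return torch.matmul(weights, value), weights

torch.manual_seed(0)
query = torch.randn(2, 1, 8)  # one query vector per batch item
key = torch.randn(2, 5, 8)    # five keys per batch item
value = torch.randn(2, 5, 8)  # one value per key
out, weights = scaled_dot_product_attention(query, key, value)
```

A multi-head variant runs this computation in parallel on several lower-dimensional slices of the projected query, key, and value tensors and concatenates the results.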
Builds the StyleEmbedAttention network.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query_dim | int | Dimensionality of the query vectors. | required |
key_dim | int | Dimensionality of the key vectors. | required |
num_units | int | Total dimensionality of the query, key, and value vectors. | required |
num_heads | int | Number of parallel attention layers (heads). | required |
Note: num_units should be divisible by num_heads.
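The divisibility requirement follows from how a projected tensor is usually divided among heads. The sketch below shows one common way to do this in PyTorch; the sizes and variable names are hypothetical, and the splitting strategy is an assumption rather than a statement about the module's source.

```python
import torch

num_units, num_heads = 8, 4
assert num_units % num_heads == 0, "num_units must be divisible by num_heads"
head_dim = num_units // num_heads  # features handled by each head

projected = torch.randn(3, 5, num_units)  # assumed layout [N, T, num_units]

# torch.split cuts the feature axis into chunks of head_dim features each,
# which yields exactly num_heads chunks when the division is exact.
chunks = torch.split(projected, head_dim, dim=2)
heads = torch.stack(chunks, dim=0)  # [num_heads, N, T, head_dim]

# Concatenating the chunks along the feature axis recovers the original tensor.
merged = torch.cat(chunks, dim=2)
```

If num_units were not a multiple of num_heads, the last chunk would be smaller than the others and the heads could no longer be stacked into a single tensor.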
Source code in models/tts/delightful_tts/attention/style_embed_attention.py
forward(query, key_soft)
The forward pass of the StyleEmbedAttention module computes attention scores between the query and the keys and returns the correspondingly weighted combination.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | Tensor | The input tensor for queries of shape | required |
key_soft | Tensor | The input tensor for keys of shape | required |
Returns:
Name | Type | Description |
---|---|---|
out | Tensor | The output tensor of shape |
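To make the constructor and forward signature concrete, here is a hedged sketch of a module with the same parameter names. It is not the library source: the class name `StyleAttentionSketch`, the layer names, the per-head scaling, and all tensor shapes ([N, T_q, query_dim] for the query, [N, T_k, key_dim] for the keys, [N, T_q, num_units] for the output) are assumptions made for illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F

class StyleAttentionSketch(nn.Module):
    """Hypothetical multi-head attention over style tokens (not the library source)."""

    def __init__(self, query_dim: int, key_dim: int, num_units: int, num_heads: int):
        super().__init__()
        if num_units % num_heads != 0:
            raise ValueError("num_units must be divisible by num_heads")
        self.num_heads = num_heads
        self.head_dim = num_units // num_heads
        # Linear projections into a shared num_units-dimensional space
        self.W_query = nn.Linear(query_dim, num_units, bias=False)
        self.W_key = nn.Linear(key_dim, num_units, bias=False)
        self.W_value = nn.Linear(key_dim, num_units, bias=False)

    def _split_heads(self, x):
        # [N, T, num_units] -> [N, num_heads, T, head_dim]
        n, t, _ = x.shape
        return x.reshape(n, t, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, query, key_soft):
        q = self._split_heads(self.W_query(query))     # [N, h, T_q, d]
        k = self._split_heads(self.W_key(key_soft))    # [N, h, T_k, d]
        v = self._split_heads(self.W_value(key_soft))  # [N, h, T_k, d]
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.head_dim ** 0.5
        weights = F.softmax(scores, dim=-1)            # [N, h, T_q, T_k]
        out = torch.matmul(weights, v)                 # [N, h, T_q, d]
        n, _, t_q, _ = out.shape
        # Merge the heads back into one feature axis: [N, T_q, num_units]
        return out.transpose(1, 2).reshape(n, t_q, self.num_heads * self.head_dim)

torch.manual_seed(0)
attention = StyleAttentionSketch(query_dim=16, key_dim=32, num_units=64, num_heads=4)
query = torch.randn(2, 1, 16)      # assumed shape [N, T_q, query_dim]
key_soft = torch.randn(2, 10, 32)  # assumed shape [N, T_k, key_dim]
out = attention(query, key_soft)   # assumed shape [N, T_q, num_units]
```

In this sketch the keys and values are both projected from `key_soft`, which matches the idea of a bank of style tokens serving as both the lookup keys and the content that gets mixed into the output.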
Source code in models/tts/delightful_tts/attention/style_embed_attention.py