References
Style Embed Attention
This mechanism is used to extract style features from audio data in the form of spectrograms.
The technique is common in text-to-speech (TTS) systems such as Tacotron 2, where the goal is to modulate the prosody, stress, and intonation of the synthesized speech based on a reference audio clip or some control parameters. The concept of "global style tokens" (GST) was introduced in
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis by Yuxuan Wang et al.
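As a rough illustration (not the actual StyleEmbedAttention code), here is a minimal PyTorch sketch of GST-style attention: a reference embedding acts as the query and attends over a small bank of learned style tokens that serve as keys and values. The class name, dimensions, and the use of nn.MultiheadAttention are assumptions.

```python
import torch
from torch import nn

class StyleTokenAttentionSketch(nn.Module):
    """Rough GST-style attention sketch (illustrative, not StyleEmbedAttention).

    A reference embedding (query) attends over a learnable bank of style
    tokens (keys/values); the attention output is the style embedding."""

    def __init__(self, ref_dim: int = 128, token_dim: int = 256,
                 n_tokens: int = 10, n_heads: int = 4):
        super().__init__()
        # Randomly initialised, learnable style token bank.
        self.tokens = nn.Parameter(torch.randn(n_tokens, token_dim) * 0.5)
        self.query_proj = nn.Linear(ref_dim, token_dim)
        self.attn = nn.MultiheadAttention(embed_dim=token_dim, num_heads=n_heads,
                                          batch_first=True)

    def forward(self, ref_embedding: torch.Tensor) -> torch.Tensor:
        # ref_embedding: [batch, ref_dim] summary of a reference spectrogram.
        q = self.query_proj(ref_embedding).unsqueeze(1)             # [batch, 1, token_dim]
        tokens = torch.tanh(self.tokens).unsqueeze(0).expand(q.size(0), -1, -1)
        style, _ = self.attn(q, tokens, tokens, need_weights=False)
        return style.squeeze(1)                                     # [batch, token_dim]
```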
Multi-Head Attention
I found great explanations, with a code implementation, of Multi-Headed Attention (MHA) by labml.ai Deep Learning Paper Implementations.
This is a tutorial/implementation of multi-headed attention from the paper Attention Is All You Need in PyTorch. The implementation is inspired by The Annotated Transformer.
This computes scaled multi-headed attention for given query, key, and value vectors.
$$\mathop{Attention}(Q, K, V) = \underset{seq}{\mathop{softmax}}\Bigg(\frac{Q K^\top}{\sqrt{d_k}}\Bigg)V$$
In simple terms, it finds keys that match the query and gets the values of those keys.
It uses the dot-product of query and key as an indicator of how well they match. Before taking the $softmax$, the dot-products are scaled by $\frac{1}{\sqrt{d_k}}$. This is done to avoid large dot-product values causing the softmax to give very small gradients when $d_k$ is large.
Softmax is calculated along the axis of the sequence (or time).
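A minimal sketch of the computation above, assuming the query, key, and value tensors have already been projected and split into heads (the shapes are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor,
                                 v: torch.Tensor) -> torch.Tensor:
    """Sketch of the formula above for head-split tensors
    of shape [batch, heads, seq, d_k]."""
    d_k = q.size(-1)
    # Dot-product similarity between every query and every key, scaled by 1/sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # [batch, heads, seq_q, seq_k]
    # Softmax along the key/sequence axis, then a weighted sum of the values.
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                                   # [batch, heads, seq_q, d_k]

q = k = v = torch.randn(2, 8, 50, 64)                    # [batch, heads, seq, d_k]
out = scaled_dot_product_attention(q, k, v)              # -> [2, 8, 50, 64]
```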
Relative Multi-Head Attention
Explanations with a code implementation of Relative Multi-Headed Attention by labml.ai Deep Learning Paper Implementations.
Paper: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (implemented in PyTorch).
Conformer Multi-Headed Self Attention
Conformer employs multi-headed self-attention (MHSA) while integrating an important technique from Transformer-XL: the relative sinusoidal positional encoding scheme.
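As a rough sketch of one ingredient of that scheme, the Transformer-XL "relative shift" rearranges attention scores that were computed against relative-position embeddings so that each query position reads its own relative offsets. The batch-first layout and the function name below are assumptions, not the repo's RelativeMultiHeadAttention code.

```python
import torch

def relative_shift(scores: torch.Tensor) -> torch.Tensor:
    """Transformer-XL style relative shift (batch-first layout assumed).

    scores: [batch, heads, q_len, k_len] attention scores computed against
    relative-position embeddings; the shift realigns them per query row."""
    b, h, q_len, k_len = scores.shape
    zero_pad = scores.new_zeros(b, h, q_len, 1)
    padded = torch.cat([zero_pad, scores], dim=-1)       # [b, h, q_len, k_len + 1]
    padded = padded.view(b, h, k_len + 1, q_len)         # fold so each row shifts by one
    return padded[:, :, 1:].reshape(b, h, q_len, k_len)
```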
Feed Forward
Creates a feed-forward neural network.
The network includes a layer normalization, an activation function (LeakyReLU), and dropout layers.
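A minimal sketch of such a feed-forward module; the hidden size, dropout rate, and exact layer order are illustrative guesses, not the repo's actual hyperparameters:

```python
import torch
from torch import nn

class FeedForwardSketch(nn.Module):
    """Sketch of the feed-forward module described above:
    LayerNorm -> Linear -> LeakyReLU -> Dropout -> Linear -> Dropout."""

    def __init__(self, d_model: int, d_hidden: int = 1024, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, d_hidden),
            nn.LeakyReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_hidden, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, time, d_model]
        return self.net(x)
```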
Conformer Conv Module
The Conformer Convolution Module class represents a module in the Conformer model architecture. The module includes layer normalization, pointwise and depthwise convolutional layers, Gated Linear Unit (GLU) activation, and a dropout layer.
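A hedged sketch of how such a convolution module might be wired; the kernel size, dropout rate, second pointwise convolution, and residual connection are assumptions rather than the exact module:

```python
import torch
from torch import nn

class ConformerConvSketch(nn.Module):
    """Sketch of the convolution module described above: LayerNorm,
    a pointwise conv with a GLU gate, a depthwise conv, a second pointwise
    conv, and dropout, wrapped in a residual connection."""

    def __init__(self, d_model: int, kernel_size: int = 7, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise_in = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.pointwise_out = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, time, d_model]
        residual = x
        x = self.norm(x).transpose(1, 2)                      # [batch, d_model, time]
        x = nn.functional.glu(self.pointwise_in(x), dim=1)    # GLU halves the channels
        x = self.depthwise(x)
        x = self.dropout(self.pointwise_out(x))
        return residual + x.transpose(1, 2)
```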
Conformer Block
The ConformerBlock class represents a block in the Conformer model architecture.
The block includes a pointwise convolution followed by a Gated Linear Unit (GLU) activation layer (Conv1dGLU), a Conformer self-attention layer (ConformerMultiHeadedSelfAttention), and an optional feed-forward layer (FeedForward).
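A simplified, hedged sketch of how such a block might be wired, with nn.MultiheadAttention standing in for ConformerMultiHeadedSelfAttention and illustrative stand-ins for Conv1dGLU and the feed-forward layer:

```python
import torch
from torch import nn

class Conv1dGLUSketch(nn.Module):
    """Pointwise Conv1d followed by a GLU gate with a residual connection
    (a guess at the repo's Conv1dGLU; names and sizes are illustrative)."""

    def __init__(self, d_model: int, kernel_size: int = 3, dropout: float = 0.1):
        super().__init__()
        self.conv = nn.Conv1d(d_model, 2 * d_model, kernel_size,
                              padding=kernel_size // 2)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, time, d_model]
        residual = x
        x = self.conv(x.transpose(1, 2))          # [batch, 2*d_model, time]
        x = nn.functional.glu(x, dim=1)           # GLU halves the channels
        return residual + self.dropout(x.transpose(1, 2))

class ConformerBlockSketch(nn.Module):
    """Simplified stand-in for ConformerBlock: Conv1dGLU -> self-attention ->
    optional feed-forward, each with a residual path around it."""

    def __init__(self, d_model: int, n_heads: int = 4, use_ff: bool = True):
        super().__init__()
        self.conv_glu = Conv1dGLUSketch(d_model)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = (
            nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, 4 * d_model),
                          nn.LeakyReLU(), nn.Dropout(0.1),
                          nn.Linear(4 * d_model, d_model))
            if use_ff else None
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv_glu(x)
        y = self.attn_norm(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        if self.ff is not None:
            x = x + self.ff(x)
        return x

block = ConformerBlockSketch(d_model=256, n_heads=4)
out = block(torch.randn(2, 100, 256))              # -> [2, 100, 256]
```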
Conformer
The Conformer class represents the Conformer model, a sequence-to-sequence model used in some modern automatic speech recognition systems. It is composed of several ConformerBlocks.
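A minimal sketch of that composition; block_factory below stands in for the repo's ConformerBlock constructor and is purely illustrative:

```python
import torch
from torch import nn

class ConformerSketch(nn.Module):
    """Minimal sketch of the Conformer model as a stack of blocks."""

    def __init__(self, n_blocks: int, block_factory):
        super().__init__()
        self.blocks = nn.ModuleList([block_factory() for _ in range(n_blocks)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, time, d_model]; each block keeps the shape unchanged.
        for block in self.blocks:
            x = block(x)
        return x

# e.g. ConformerSketch(n_blocks=4, block_factory=lambda: nn.Identity()),
# where the lambda would normally build a ConformerBlock.
```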