CoordConv1d
Bases: `Conv1d`, `Module`
`CoordConv1d` is an extension of the standard 1D convolution layer (`conv.Conv1d`) that adds extra coordinate channels. These extra channels encode positional coordinates and, optionally, the radial distance from the origin.
It is inspired by the paper "An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution" and is designed to help convolution layers take the absolute position of features in the input space into account.
The responsibility of this class is to intercept the input tensor and append the extra channels to it: the positional coordinates and, optionally, the radial distance from the center. The augmented tensor is then immediately passed through a standard `Conv1d` layer.
In concrete terms, taking an image-based task as an analogy, the convolution layer does not just process a color but also 'knows' where in the overall image that color is located.
In a typical Text-To-Speech (TTS) system like DelightfulTTS, the utterance is processed sequentially. Such sequential data can benefit from a `CoordConv` layer, which draws more attention to the position of the data. `CoordConv` is a drop-in replacement for standard convolution layers that enriches the spatial representation in Convolutional Neural Networks (CNNs) with additional positional information.
Hence, the resulting convolution does not only process the characteristics of a sound in the input speech signal but also 'knows' where in the overall signal that sound is located, which gives it spatial context. This can be particularly useful in TTS systems, where the sequence of phonemes and their timing can be critical.
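The channel-appending step described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `add_coords_1d` and `CoordConv1dSketch` are hypothetical names, the `[-1, 1]` normalization and the use of the absolute coordinate as the radial channel are assumptions, and the sketch wraps an `nn.Conv1d` rather than subclassing `Conv1d` as the real class does.

```python
import torch
from torch import nn


def add_coords_1d(x: torch.Tensor, with_r: bool = False) -> torch.Tensor:
    """Append a position channel (and optionally a radial channel) to x.

    x has shape (batch, channels, length). Hypothetical stand-in for the
    AddCoords module; the real normalization may differ.
    """
    batch, _, length = x.shape
    # Assumption: positions are spread evenly over [-1, 1].
    coords = torch.linspace(-1.0, 1.0, length, dtype=x.dtype, device=x.device)
    coords = coords.view(1, 1, length).expand(batch, 1, length)
    channels = [x, coords]
    if with_r:
        # Assumption: in 1D, the distance from the center (0) is |coordinate|.
        channels.append(coords.abs())
    return torch.cat(channels, dim=1)


class CoordConv1dSketch(nn.Module):
    """Hypothetical wrapper: add coordinate channels, then convolve."""

    def __init__(self, in_channels, out_channels, kernel_size,
                 with_r=False, **conv_kwargs):
        super().__init__()
        self.with_r = with_r
        # One extra input channel for position, one more if with_r is set.
        extra = 2 if with_r else 1
        self.conv = nn.Conv1d(in_channels + extra, out_channels,
                              kernel_size, **conv_kwargs)

    def forward(self, x):
        return self.conv(add_coords_1d(x, self.with_r))


x = torch.randn(4, 80, 100)  # (batch_size, in_channels, length)
layer = CoordConv1dSketch(80, 256, kernel_size=3, padding=1, with_r=True)
y = layer(x)
print(y.shape)  # torch.Size([4, 256, 100])
```

The caller still passes the original `in_channels`; under these assumptions, only the wrapped convolution sees the extra one or two channels, which is what makes such a layer a drop-in replacement.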
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`in_channels` | `int` | Number of channels in the input. | required |
`out_channels` | `int` | Number of channels produced by the convolution. | required |
`kernel_size` | `int` | Size of the convolving kernel. | required |
`stride` | `int` | Stride of the convolution. Default: 1. | `1` |
`padding` | `int` | Zero-padding added to both sides of the input. Default: 0. | `0` |
`dilation` | `int` | Spacing between kernel elements. Default: 1. | `1` |
`groups` | `int` | Number of blocked connections from input channels to output channels. Default: 1. | `1` |
`bias` | `bool` | If True, adds a learnable bias to the output. Default: True. | `True` |
`with_r` | `bool` | If True, adds a radial coordinate channel. Default: False. | `False` |
Source code in models/tts/delightful_tts/conv_blocks/coord_conv1d.py
forward(x)
The forward pass of the `CoordConv1d` module. It adds the coordinate channels to the input tensor with the `AddCoords` module, and then immediately passes the result through a 1D convolution.
As a result, the subsequent convolution layers don't merely process the sound characteristics of the speech signal but are also aware of where those characteristics sit in the sequence, which can help in challenging TTS tasks where the sequence is critical.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`x` | `Tensor` | The input tensor. | required |
Returns:
Type | Description |
---|---|
`Tensor` | The output tensor of shape (batch_size, out_channels, length). |
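The `length` in the returned shape is the output length of the convolution, which need not equal the input length. As a rough sketch, assuming `CoordConv1d` passes `kernel_size`, `stride`, `padding`, and `dilation` unchanged to the underlying 1D convolution (the coordinate channels affect only the channel count), the standard `Conv1d` output-length formula applies. The helper name `conv1d_output_length` is hypothetical.

```python
def conv1d_output_length(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution (standard Conv1d formula)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


print(conv1d_output_length(100, kernel_size=3))                       # 98
print(conv1d_output_length(100, kernel_size=3, padding=1))            # 100
print(conv1d_output_length(100, kernel_size=5, stride=2, padding=2))  # 50
```

With the default `padding=0`, any `kernel_size` greater than 1 shortens the sequence, so a padding of `(kernel_size - 1) // 2` is a common choice when the length should be preserved for odd kernel sizes at stride 1.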
Source code in models/tts/delightful_tts/conv_blocks/coord_conv1d.py