CoordConv1d

Bases: Conv1d, Module

CoordConv1d is an extension of the standard 1D convolution layer (conv.Conv1d), with the addition of extra coordinate channels. These extra channels encode positional coordinates and, optionally, the radial distance from the center. It is inspired by the paper An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution and is designed to let convolution layers take the absolute position of features in the input space into account.

This class intercepts the input tensor and appends extra channels to it. These channels encode the positional coordinates (and, optionally, the radial distance from the center). The augmented tensor is then passed directly through a standard Conv1d layer.

In concrete terms, in an image-based task the convolution layer does not just process the color at a location, but also 'knows' where in the overall image that color is located.

In a typical Text-To-Speech (TTS) system like DelightfulTTS, the utterance is processed sequentially. Such sequential data can benefit from a CoordConv layer, which gives the model explicit access to the position of each element. CoordConv is a drop-in replacement for standard convolution layers that enriches the spatial representation in Convolutional Neural Networks (CNNs) with additional positional information.

Hence, the resulting convolution not only processes the characteristics of the sound in the input speech signal, but also 'knows' where in the overall signal this particular sound is located, which provides positional context. This can be particularly useful in TTS systems, where the sequence of phonemes and their timing can be critical.
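The coordinate channels described above can be illustrated with a minimal plain-Python sketch. It is not the `AddCoords` implementation used by this class: the helper names (`coord_channel`, `radial_channel`, `add_coords`), the scaling of positions to [-1, 1], and the choice of 0.0 as the center are assumptions made for illustration only.

```python
from typing import List


def coord_channel(length: int) -> List[float]:
    """Return one coordinate per time step, scaled linearly to [-1, 1] (assumed range)."""
    if length == 1:
        return [0.0]  # a single step is placed at the center
    return [2.0 * i / (length - 1) - 1.0 for i in range(length)]


def radial_channel(coords: List[float]) -> List[float]:
    """Distance of each coordinate from the assumed center (0.0) of the axis."""
    return [abs(c) for c in coords]


def add_coords(x: List[List[float]], with_r: bool = False) -> List[List[float]]:
    """Append coordinate channels to one sample shaped [channels][length]."""
    coords = coord_channel(len(x[0]))
    out = [list(row) for row in x] + [coords]
    if with_r:
        out.append(radial_channel(coords))
    return out


sample = [[0.3, 0.1, 0.4, 0.1, 0.5]]  # 1 channel, 5 time steps
augmented = add_coords(sample, with_r=True)
print(len(augmented))  # 3 channels: signal, coordinate, radial
print(augmented[1])    # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

Every time step now carries its own position as an extra feature, so a convolution kernel sliding over the sequence can tell the start of the utterance from the end.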

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| in_channels | int | Number of channels in the input. | required |
| out_channels | int | Number of channels produced by the convolution. | required |
| kernel_size | int | Size of the convolving kernel. | required |
| stride | int | Stride of the convolution. Default: 1. | 1 |
| padding | int | Zero-padding added to both sides of the input. Default: 0. | 0 |
| dilation | int | Spacing between kernel elements. Default: 1. | 1 |
| groups | int | Number of blocked connections from input channels to output channels. Default: 1. | 1 |
| bias | bool | If True, adds a learnable bias to the output. Default: True. | True |
| with_r | bool | If True, adds a radial coordinate channel. Default: False. | False |

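How these parameters interact can be sketched with a few lines of arithmetic. The channel count follows the expression `in_channels + self.rank + int(with_r)` from the class source, and the output length uses the standard formula from the PyTorch `Conv1d` documentation. The function names, and the `groups_ok` check in particular, are hypothetical helpers written for this example; they are not part of the library.

```python
def conv_in_channels(in_channels: int, with_r: bool = False, rank: int = 1) -> int:
    """Channels seen by the inner convolution: input + coordinate (+ radial)."""
    return in_channels + rank + int(with_r)


def conv1d_out_length(length: int, kernel_size: int, stride: int = 1,
                      padding: int = 0, dilation: int = 1) -> int:
    """Output length of a 1D convolution (formula from the PyTorch Conv1d docs)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def groups_ok(in_channels: int, out_channels: int, groups: int, with_r: bool = False) -> bool:
    """Hypothetical check: `groups` must divide the plain and the augmented channel counts."""
    augmented = conv_in_channels(in_channels, with_r)
    return all(n % groups == 0 for n in (in_channels, augmented, out_channels))


print(conv_in_channels(80, with_r=True))     # 82
print(conv1d_out_length(120, 3, padding=1))  # 120
print(conv1d_out_length(120, 3, stride=2))   # 59
print(groups_ok(4, 8, groups=2))             # False: 4 + 1 = 5 is not divisible by 2
```

The last line points at a practical consequence of the constructor shown in the source: `groups` is passed both to the parent `Conv1d` and to the inner `nn.Conv1d`, so with `groups > 1` both `in_channels` and the augmented channel count need to be divisible by it.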
Source code in models/tts/delightful_tts/conv_blocks/coord_conv1d.py
class CoordConv1d(conv.Conv1d, Module):
    r"""`CoordConv1d` is an extension of the standard 1D convolution layer (`conv.Conv1d`), with the addition of extra coordinate
    channels. These extra channels encode positional coordinates and, optionally, the radial distance from the center.
    It is inspired by the paper
    [An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution](https://arxiv.org/abs/1807.03247)
    and is designed to let convolution layers take the absolute position of features in the input space into account.

    This class intercepts the input tensor and appends extra channels to it. These channels encode the positional
    coordinates (and, optionally, the radial distance from the center). The augmented tensor is then passed directly
    through a standard `Conv1d` layer.

    In concrete terms, in an image-based task the convolution layer does not just process the color at a location,
    but also 'knows' where in the overall image that color is located.

    In a typical Text-To-Speech (TTS) system like DelightfulTTS, the utterance is processed sequentially.
    Such sequential data can benefit from a `CoordConv` layer, which gives the model explicit access to the position
    of each element. `CoordConv` is a drop-in replacement for standard convolution layers that enriches the spatial
    representation in Convolutional Neural Networks (CNNs) with additional positional information.

    Hence, the resulting convolution not only processes the characteristics of the sound in the input speech signal,
    but also 'knows' where in the overall signal this particular sound is located, which provides positional context.
    This can be particularly useful in TTS systems, where the sequence of phonemes and their timing can be critical.

    Args:
        in_channels (int): Number of channels in the input.
        out_channels (int): Number of channels produced by the convolution.
        kernel_size (int): Size of the convolving kernel.
        stride (int): Stride of the convolution. Default: 1.
        padding (int): Zero-padding added to both sides of the input. Default: 0.
        dilation (int): Spacing between kernel elements. Default: 1.
        groups (int): Number of blocked connections from input channels to output channels. Default: 1.
        bias (bool): If True, adds a learnable bias to the output. Default: True.
        with_r (bool): If True, adds a radial coordinate channel. Default: False.

    """

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int,
        stride: int = 1,
        padding: int = 0,
        dilation: int = 1,
        groups: int = 1,
        bias: bool = True,
        with_r: bool = False,
    ):
        super().__init__(
            in_channels,
            out_channels,
            kernel_size,
            stride,
            padding,
            dilation,
            groups,
            bias,
        )

        self.rank = 1
        self.addcoords = AddCoords(self.rank, with_r)

        self.conv = nn.Conv1d(
            in_channels + self.rank + int(with_r),
            out_channels,
            kernel_size,
            stride,
            padding,
            dilation,
            groups,
            bias,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        r"""The forward pass of the `CoordConv1d` module. It adds the coordinate channels to the input tensor with the `AddCoords`
        module and then immediately passes the result through a 1D convolution.

        As a result, the convolution does not merely process the sound characteristics of the speech signal but is
        also aware of where they occur in the sequence, which can help in challenging TTS tasks where ordering and
        timing are critical.

        Args:
            x (torch.Tensor): The input tensor.

        Returns:
            torch.Tensor: The output tensor of shape (batch_size, out_channels, length).
        """
        # Apply AddCoords layer to add coordinate channels to the input tensor
        x = self.addcoords(x)

        # Apply convolution
        return self.conv(x)

forward(x)

The forward pass of the CoordConv1d module. It adds the coordinate channels to the input tensor with the AddCoords module and then immediately passes the result through a 1D convolution.

As a result, the convolution does not merely process the sound characteristics of the speech signal but is also aware of where they occur in the sequence, which can help in challenging TTS tasks where ordering and timing are critical.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | The input tensor. | required |

Returns:

| Type | Description |
| --- | --- |
| Tensor | The output tensor of shape (batch_size, out_channels, length). |
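The two steps of the forward pass (append coordinates, then convolve) can be sketched with a small stand-in module. `TinyCoordConv1d` is a hypothetical class written for this example, not the repository's `CoordConv1d`; it assumes coordinates scaled to [-1, 1] and omits the radial channel and most constructor options.

```python
import torch
from torch import nn


class TinyCoordConv1d(nn.Module):
    """Hypothetical stand-in: append one coordinate channel, then apply Conv1d."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int, padding: int = 0):
        super().__init__()
        # +1 for the appended coordinate channel (rank 1, no radial channel)
        self.conv = nn.Conv1d(in_channels + 1, out_channels, kernel_size, padding=padding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, _, length = x.shape
        # One coordinate per time step in [-1, 1] (assumed range), shared across the batch
        coords = torch.linspace(-1.0, 1.0, length, device=x.device, dtype=x.dtype)
        coords = coords.view(1, 1, length).expand(batch_size, 1, length)
        # Concatenate along the channel dimension, then convolve
        return self.conv(torch.cat([x, coords], dim=1))


layer = TinyCoordConv1d(in_channels=80, out_channels=256, kernel_size=3, padding=1)
x = torch.randn(4, 80, 120)  # (batch_size, in_channels, length)
y = layer(x)
print(y.shape)  # torch.Size([4, 256, 120])
```

With `kernel_size=3` and `padding=1` the length is preserved. Other combinations of `kernel_size`, `stride`, `padding`, and `dilation` change the last dimension of the output.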

Source code in models/tts/delightful_tts/conv_blocks/coord_conv1d.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    r"""The forward pass of the `CoordConv1d` module. It adds the coordinate channels to the input tensor with the `AddCoords`
    module and then immediately passes the result through a 1D convolution.

    As a result, the convolution does not merely process the sound characteristics of the speech signal but is
    also aware of where they occur in the sequence, which can help in challenging TTS tasks where ordering and
    timing are critical.

    Args:
        x (torch.Tensor): The input tensor.

    Returns:
        torch.Tensor: The output tensor of shape (batch_size, out_channels, length).
    """
    # Apply AddCoords layer to add coordinate channels to the input tensor
    x = self.addcoords(x)

    # Apply convolution
    return self.conv(x)