Problems and Fixes

Fixes

A lot of work was done during development. For example, the STFT code was updated to the current way of converting the complex output to real:

# Compute the short-time Fourier transform of the input waveform
x = torch.stft(
    x,
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=win_length,
    center=False,
    return_complex=True,
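    # Pass the window explicitly. An all-ones (rectangular) window
    # matches the torch.stft default used when window=None.
    window=torch.ones(win_length, device=x.device),
)  # complex tensor, [B, F, TT]

# Convert to a real tensor as an additional step
x = torch.view_as_real(x)  # [B, F, TT, 2]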

The fix has been tested and works reliably.
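
As a quick sanity check of the shapes involved, here is a minimal sketch. The batch size, signal length, and STFT sizes are arbitrary values chosen for illustration, not the project's settings.

```python
import torch

# Illustrative values only (assumptions, not the project's configuration)
n_fft, hop_length, win_length = 16, 4, 16
x = torch.randn(2, 64)  # [B, T]

spec = torch.stft(
    x,
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=win_length,
    center=False,
    return_complex=True,
    window=torch.ones(win_length, device=x.device),
)

# Complex output: [B, F, TT] with F = n_fft // 2 + 1
# and TT = 1 + (T - n_fft) // hop_length when center=False
spec_real = torch.view_as_real(spec)  # [B, F, TT, 2]
```

The trailing dimension of size 2 holds the real and imaginary parts, which is the layout older code obtained with return_complex=False.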

There are also minor fixes, plus TODOs that I can't address now but want to fix in the future.

The ideas behind the model and its architecture are mostly the same. The training code, however, is completely different: it is based on dunky11's code, but its architecture is completely new.

Changes

pitch_adaptor2 and pitches_stat

The old pitch_adaptor relied on stats.json, a file whose origin is unknown, which looks like a critical issue to me. It is replaced by a new version of PitchAdaptor that receives a pitch_range argument.
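
To illustrate the idea of passing the range in directly, here is a minimal sketch of an embedding-based adaptor. It is an assumption about the general approach, not the project's PitchAdaptor: the class name RangePitchEmbedding, the bin count, and the embedding size are all hypothetical.

```python
import torch
from torch import nn


class RangePitchEmbedding(nn.Module):
    """Hypothetical embedding-based pitch adaptor (illustration only)."""

    def __init__(self, pitch_range, n_bins: int = 8, emb_dim: int = 4):
        super().__init__()
        pitch_min, pitch_max = pitch_range
        # n_bins - 1 boundaries split the range into n_bins buckets
        self.register_buffer(
            "bins", torch.linspace(pitch_min, pitch_max, n_bins - 1)
        )
        self.embedding = nn.Embedding(n_bins, emb_dim)

    def forward(self, pitch: torch.Tensor) -> torch.Tensor:
        # Map each pitch value to a bucket index in [0, n_bins - 1]
        idx = torch.bucketize(pitch, self.bins)
        return self.embedding(idx)


adaptor = RangePitchEmbedding(pitch_range=(60.0, 400.0))
pitch = torch.tensor([[50.0, 120.0, 500.0]])  # [B, T]
out = adaptor(pitch)  # [B, T, emb_dim]
```

Values outside the range fall into the first or last bucket, so no external statistics file is needed at inference time.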

Inside the AcousticModule there is a pitches_stat parameter, which is required by PitchAdaptor and for audio generation. Without the new weights, which keep track of pitches_stat, it is not possible to generate the waveform or mel-spectrogram.
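
One way to keep such statistics inside the weights is a registered buffer, which is saved by state_dict() and restored on load. The sketch below is an assumption about how this could look, not the actual AcousticModule: the class name TinyAcousticModule and the (min, max) layout of pitches_stat are hypothetical.

```python
import torch
from torch import nn


class TinyAcousticModule(nn.Module):
    """Hypothetical stand-in for AcousticModule (illustration only)."""

    def __init__(self, pitch_min: float = 0.0, pitch_max: float = 1.0):
        super().__init__()
        # Buffers are not trained, but they are stored in the checkpoint
        self.register_buffer(
            "pitches_stat", torch.tensor([pitch_min, pitch_max])
        )


trained = TinyAcousticModule(pitch_min=60.0, pitch_max=400.0)
checkpoint = trained.state_dict()

# A freshly built model gets the statistics back from the checkpoint
restored = TinyAcousticModule()
restored.load_state_dict(checkpoint)
```

With this layout, loading older weights that lack the pitches_stat key fails under the default strict loading, which matches the requirement that new weights are needed.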

The latest change to PitchAdaptor: there is now a PitchAdaptorConv!

PitchAdaptorConv is much more effective. During training it gave the best performance and quality compared to the embedding-based PitchAdaptor.

Problems

FIXME: Step param!

There is a step parameter that needs further investigation. Maybe it should be stored on the model with self.register_buffer. It is required by FastSpeech2LossGen, so this is a possible training issue!
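
If the counter should survive checkpointing, a registered buffer is one option. The following is a minimal sketch under that assumption. StepTracker is a hypothetical class, and how FastSpeech2LossGen consumes the value is not shown here.

```python
import torch
from torch import nn


class StepTracker(nn.Module):
    """Hypothetical module that keeps a training step counter."""

    def __init__(self):
        super().__init__()
        # A long tensor buffer is saved and loaded with the state dict
        self.register_buffer("step", torch.zeros((), dtype=torch.long))

    def advance(self) -> int:
        # Increment the counter; it stays registered as a buffer
        self.step += 1
        return int(self.step.item())


tracker = StepTracker()
for _ in range(3):
    tracker.advance()

# Resuming from a checkpoint restores the counter as well
resumed = StepTracker()
resumed.load_state_dict(tracker.state_dict())
```

A plain Python attribute would be reset to its initial value when the model is rebuilt, which is why a buffer may be the safer choice for anything the loss depends on.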