Change Log
Major Changes
1 Text preprocessing has been changed.
Dunky11 created a not-so-good approach to text preprocessing. His solution may have been appropriate for earlier releases of NeMo, but the current release works stably, and I could not reproduce the scenarios he described. Dunky11's quote:

> NeMo's text normalizer unfortunately produces a large amount of false positives. For example it normalizes 'medic' into 'm e d i c' or 'yeah' into 'y e a h'. To reduce the amount of false positives we will do a check for unusual symbols or words inside the text and only normalize if necessary.

I checked the described cases and they work fine. I tried to find the issue but had no luck; maybe I did something wrong. I have code that wraps NeMo and adds several more preprocessing features that are absolutely required, such as char mapping. You can find the docs here: NormalizeText
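
For reference, here is a minimal sketch of what such a wrapper can look like. The `CHAR_MAP` contents and the class internals are illustrative assumptions, not the exact code behind NormalizeText; only the NeMo `Normalizer` calls are the library's real API:

```python
# Sketch: wrap NeMo's text normalizer with an extra char-mapping pass.
# CHAR_MAP and the class body are assumptions for illustration only.
from nemo_text_processing.text_normalization.normalize import Normalizer

# Hypothetical character mapping: fold typographic variants to plain ASCII.
CHAR_MAP = {
    "’": "'",
    "‘": "'",
    "“": '"',
    "”": '"',
    "–": "-",
    "—": "-",
    "…": "...",
}

class NormalizeText:
    def __init__(self, lang: str = "en"):
        # NeMo's WFST-based normalizer (numbers, dates, abbreviations, ...).
        self.normalizer = Normalizer(input_case="cased", lang=lang)

    def __call__(self, text: str) -> str:
        # Apply the character mapping before handing the text to NeMo.
        for src, dst in CHAR_MAP.items():
            text = text.replace(src, dst)
        return self.normalizer.normalize(text, verbose=False)
```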
2 The phonemization (G2P) and tokenization processes have been changed.
I tried to build the same tokenization as Dunky11 but failed because of the vocab: it is not possible to reproduce the same vocab in the same order, and the vocab I ended up with was missing some IPA tokens. Changing the vocab order means losing all the progress made during the training steps, so it makes no sense to build my own tokenization and lose the benefits of training. I decided to use the tokenization from DeepPhonemizer instead (a minimal usage sketch follows the snippet below). Maybe Dunky11 didn't find it; I don't understand why he built his own solution. Perhaps it was because of the [SILENCE] token from here:
```python
for sentence in sentence_tokenizer.tokenize(text):
    symbol_ids = []
    sentence = text_normalizer(sentence)
    for word in word_tokenizer.tokenize(sentence):
        word = word.lower()
        if word.strip() == "":
            continue
        elif word in [".", "?", "!"]:
            # Sentence-final punctuation keeps its own symbol id.
            symbol_ids.append(self.symbol2id[word])
        elif word in [",", ";"]:
            # Mid-sentence pauses are collapsed into the SILENCE token.
            symbol_ids.append(self.symbol2id["SILENCE"])
```
3 A training framework instead of tricky training spaghetti.
The training loop now runs on PyTorch Lightning instead of wild hacks.
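
As an illustration only, here is a minimal sketch of the PyTorch Lightning pattern. The `TTSModule` class, the learning rate, and the assumption that the wrapped model returns its loss are hypothetical, not this repo's actual module:

```python
# Sketch of a Lightning training module; names are illustrative assumptions.
import pytorch_lightning as pl
import torch

class TTSModule(pl.LightningModule):
    def __init__(self, model: torch.nn.Module, lr: float = 1e-4):
        super().__init__()
        self.model = model
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # Assumption: the model computes and returns its loss for a batch.
        loss = self.model(batch)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

# Lightning then replaces the hand-written loop:
# trainer = pl.Trainer(max_epochs=100, accelerator="auto")
# trainer.fit(TTSModule(model), train_dataloader)
```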