How much data

#1
by anhnct - opened

Hi, I see you replaced the old vocab with a new Russian IPA vocab. How much data did you use to train this model? Thank you.

Hello. The library contains one statistical model (for generating IPA transcriptions) and two BERT models for accentuation. To train the statistical model, I used mainly words from Wiktionary and Wikipedia (the Russian version of which contains IPA transcriptions). To train the BERT models, I used ~3 GB of text data in which the correct accents were marked for ambiguous words. I am currently working on increasing the amount of training data in order to resolve accentuation ambiguities more accurately.
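To make the ambiguity concrete: a plain dictionary is enough for unambiguous words, but homographs admit several stress placements, and choosing between them requires sentence context, which is why a trained context model (such as BERT) is needed. A toy sketch of the problem (example words only, not the library's internal data or API):

```python
# Illustration of the homograph problem that context-based accentuation solves.
# (Toy data; not the library's dictionaries or API.)

HOMOGRAPHS = {
    # word -> possible stressed forms (acute accent marks the stress)
    "замок": ["за́мок", "замо́к"],  # "castle" vs. "lock"
    "мука": ["му́ка", "мука́"],    # "torment" vs. "flour"
}

UNAMBIGUOUS = {
    "старый": "ста́рый",
}

def accent_candidates(word: str) -> list[str]:
    """Return all possible accented forms of a word.

    Unambiguous words are resolved by lookup; homographs return several
    candidates, and picking the right one needs the surrounding sentence,
    which is exactly what a trained context model provides.
    """
    if word in UNAMBIGUOUS:
        return [UNAMBIGUOUS[word]]
    return HOMOGRAPHS.get(word, [word])
```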

Thank you for your reply. What I meant was: how much audio data did you use to train XTTS?

I'm sorry, I should have guessed. It was a small experiment, just to see whether it makes sense to use transcription and accents for speech synthesis. I used ~60 hours of speech for training. In the README, I listed the acoustic data I used: https://github.com/omogr/omogre/blob/main/README_eng.md. The model was trained on the RUSLAN and Common Voice datasets.

https://ruslan-corpus.github.io/
https://commonvoice.mozilla.org/ru

Thank you

Did you freeze the original layers to train new model weights, or did you fine-tune using the default weights?

Regarding your model: the stress on the letter "И" (in Russian) is not always placed correctly, and the letter "Ч" is pronounced like "Ш". Also, the pauses between sentences should be slightly longer.

Additionally, you should add number processing to the code, converting numbers into words — currently, the library simply removes them.

Thank you for your thoughtful feedback on my library! I appreciate you taking the time to share your observations.

  1. Model Training Approach

"Did you freeze the original layers to train new model weights, or did you fine-tune using the default weights?"

The model was fine-tuned using the default weights without freezing the original layers.
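In generic PyTorch terms (an illustrative sketch, not the actual XTTS training code), the difference between the two approaches comes down to which parameters keep gradients:

```python
# Sketch of "freeze the original layers" vs. "fine-tune everything from the
# default weights". The tiny model below merely stands in for a pretrained
# backbone plus a replaced vocab embedding.
import torch.nn as nn

model = nn.Sequential(
    nn.Embedding(100, 16),  # stands in for the (replaced) vocab embedding
    nn.Linear(16, 16),      # stands in for the pretrained backbone
)

def trainable_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

# Option A: freeze the backbone, train only the new embedding.
for p in model[1].parameters():
    p.requires_grad_(False)

# Option B (what was done here): fine-tune everything from the default
# weights, i.e. simply leave requires_grad=True on all parameters.
for p in model.parameters():
    p.requires_grad_(True)
```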

  2. Pronunciation Observations

"The stress on 'И' isn't always correct, 'Ч' sounds like 'Ш', and pauses between sentences could be longer."

I haven't observed these systematic errors in my testing, but I'd be very interested to investigate specific phrases where you've encountered these issues. The model's pronunciation is heavily influenced by the reference audio used during inference; different reference files can produce noticeably different articulation patterns. If you could share examples of problematic sentences (and optionally a reference audio clip that demonstrates your desired pronunciation), I'd be happy to explore this further.

  3. Text Normalization (Numbers/Symbols)

"Add number-to-words conversion instead of removing them."

You're absolutely right that robust text preprocessing should handle numbers, abbreviations, special symbols, and mixed-language text. However, implementing comprehensive text normalization is nontrivial and highly domain-dependent (e.g., dates/currency/units require context-aware conversion).

For this implementation, I consciously decided to focus on core TTS functionality while leveraging existing specialized libraries for text normalization. Solutions like ruNorm (Russian-specific) or multilingual tools like those in the XTTS framework could serve as viable starting points. While some implementations might appear overly complex due to their multilingual support, they could be adapted for specific use cases through targeted modifications.
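To illustrate what even a minimal number-normalization pass involves, here is a hand-rolled sketch for Russian cardinals in the nominative case only (not part of the library; real normalization also needs declension, ordinals, dates, currency, and units, which is why dedicated libraries exist):

```python
import re

# Minimal number-to-words preprocessing sketch for Russian TTS input.
# Covers cardinals 0..999 in nominative case only.

UNITS = ["ноль", "один", "два", "три", "четыре", "пять",
         "шесть", "семь", "восемь", "девять"]
TEENS = ["десять", "одиннадцать", "двенадцать", "тринадцать", "четырнадцать",
         "пятнадцать", "шестнадцать", "семнадцать", "восемнадцать", "девятнадцать"]
TENS = ["", "", "двадцать", "тридцать", "сорок", "пятьдесят",
        "шестьдесят", "семьдесят", "восемьдесят", "девяносто"]
HUNDREDS = ["", "сто", "двести", "триста", "четыреста", "пятьсот",
            "шестьсот", "семьсот", "восемьсот", "девятьсот"]

def number_to_words(n: int) -> str:
    if n > 999:
        return str(n)  # outside this sketch's range; pass through unchanged
    parts = []
    if n >= 100:
        parts.append(HUNDREDS[n // 100])
        n %= 100
    if n >= 20:
        parts.append(TENS[n // 10])
        n %= 10
    if n >= 10:
        parts.append(TEENS[n - 10])
        n = 0
    if n > 0 or not parts:
        parts.append(UNITS[n])
    return " ".join(parts)

def normalize_numbers(text: str) -> str:
    """Replace each digit run with its spelled-out form before synthesis."""
    return re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
```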

Here are some examples:
"Нечто" (Nechto): It should be [nʲetɕtə] (pronounced like "nechta" with a clear "ch"), but the transliterator outputs [nʲeʂtə] (like "neshta" with "sh"). While this might sound similar during fast reading, no one actually says "neshto" today — it's incorrect.
"Аура" (Aura): It should be [aʊrə] (with the diphthong "au"), but the transliterator gives [arə] (like "ara"), and [aurə] sounds like "ura".
The issue is the absence of tokens for [tɕtə] (for the "chte" sound) and [aʊ] (for the "au" diphthong).
We need transliteration that reflects the orthographic pronunciation, not just the phonetic one.
There are many similar errors. Is there a transliteration model dictionary where corrections can be manually added? That would help fix these mistakes and improve the output quality.

One more example: «Яо Чанъин» (Yao Changying) comes out as [ao tɕɪnʲin], i.e. "O Chenin"; no matter how I tried, it never produced the "Yao".
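One common way to support such manual corrections is an exception dictionary consulted before the automatic model; a minimal sketch of the idea (function names are hypothetical, not the library's actual API):

```python
# Sketch of a manual exception dictionary layered on top of an automatic
# transcriber. `transcribe` is a placeholder for whatever function produces
# IPA from the statistical model (not the library's real API).

EXCEPTIONS = {
    # word (lowercase) -> hand-corrected IPA
    "нечто": "nʲetɕtə",
    "аура": "aʊrə",
}

def transcribe(word: str) -> str:
    # Placeholder for the statistical model's output.
    return "<model-ipa:" + word + ">"

def transcribe_with_exceptions(word: str) -> str:
    """Manual corrections take priority over the model's output."""
    return EXCEPTIONS.get(word.lower(), transcribe(word))
```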

Thank you so much, I totally agree, yes, this needs to be fixed. I'll figure it out. It might take me some time.
