---
license: cc-by-nc-4.0
language:
- zh
- en
- de
- fr
- ja
- ko
- nl
- es
- it
- pt
- pl
base_model:
- meta-llama/Llama-3.2-1B-Instruct
tags:
- Text-to-Speech
pipeline_tag: text-to-speech
---

[![arXiv](https://img.shields.io/badge/arXiv-Paper-red.svg)](https://arxiv.org/abs/2502.04128)

**Main Idea:**
This model extends the previous Llasa TTS models by incorporating multilingual data. It leverages the Llama-initialized text BPE tokenizer, which handles multilingual text directly, without the need to design language-specific G2P (grapheme-to-phoneme) systems. Because the multilingual training data is limited to the MLS and Emilia datasets, performance may be suboptimal for some languages due to data scarcity. Even so, the model can serve as a base TTS model: it is particularly suitable for fine-tuning on a specific language, since text in any language is processed uniformly by Llama's BPE tokenizer.

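As a quick illustration of why no G2P frontend is needed, the base Llama tokenizer maps text in any of the supported languages straight to BPE ids. This is a minimal sketch, assuming you have access to the gated checkpoint listed under `base_model`:

```python
from transformers import AutoTokenizer

# One shared BPE vocabulary covers all languages, so there is no
# language-specific grapheme-to-phoneme step anywhere in the pipeline.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

for text in ["Hello, world.", "你好,世界。", "Bonjour le monde."]:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    print(f"{text!r} -> {ids}")
```
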
This model is not mentioned in the paper, but it follows the same methodology:

LLaSA: Scaling Train-Time and Inference-Time Compute for LLaMA-based Speech Synthesis

- **Train from Scratch**: If you want to train the model from scratch, use the [LLaSA Training Repository](https://github.com/zhenye234/LLaSA_training).

- **Scale for Test-Time Computation**: If you want to experiment with scaling test-time compute, use the [LLaSA Testing Repository](https://github.com/zhenye234/LLaSA_inference).

## How to use
Install [XCodec2](https://huggingface.co/HKUST-Audio/xcodec2). (Please use the newer version, xcodec2==0.1.3.)
```bash
conda create -n xcodec2 python=3.9
conda activate xcodec2
pip install xcodec2==0.1.3
```
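
A quick way to confirm that the pinned version is the one actually installed (standard library only, no assumptions about the package's own API):

```python
from importlib.metadata import version

# The examples below are written against xcodec2==0.1.3.
print(version("xcodec2"))  # expected output: 0.1.3
```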

**1. Speech synthesis solely from input text**
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

llasa_1b = 'HKUST-Audio/Llasa-1B-Multilingual'

tokenizer = AutoTokenizer.from_pretrained(llasa_1b)
model = AutoModelForCausalLM.from_pretrained(llasa_1b)
model.eval()
model.to('cuda')

model_path = "HKUST-Audio/xcodec2"

Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.eval().cuda()

input_text = 'Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection, intending to safeguard some from the harsh truths. One day, I hope you understand the reasons behind my actions. Until then, Anna, please, bear with me.'
# input_text = '突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道:"我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"'

def ids_to_speech_tokens(speech_ids):
    # Convert int 12345 to token <|s_12345|>
    return [f"<|s_{speech_id}|>" for speech_id in speech_ids]

def extract_speech_ids(speech_tokens_str):
    # Convert token <|s_23456|> back to int 23456
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith('<|s_') and token_str.endswith('|>'):
            speech_ids.append(int(token_str[4:-2]))
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids

# TTS start!
with torch.no_grad():
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

    # Tokenize the text
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
    ]

    input_ids = tokenizer.apply_chat_template(
        chat,
        tokenize=True,
        return_tensors='pt',
        continue_final_message=True
    )
    input_ids = input_ids.to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')

    # Generate the speech tokens autoregressively
    outputs = model.generate(
        input_ids,
        max_length=2048,  # We trained our model with a max length of 2048
        eos_token_id=speech_end_id,
        do_sample=True,
        top_p=1,          # Adjusts the diversity of generated content
        temperature=0.8,  # Controls randomness in output
    )

    # Extract the speech tokens (drop the text prompt and the final EOS token)
    generated_ids = outputs[0][input_ids.shape[1]:-1]
    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    # Convert token <|s_23456|> to int 23456
    speech_tokens = extract_speech_ids(speech_tokens)
    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)

    # Decode the speech tokens to a speech waveform
    gen_wav = Codec_model.decode_code(speech_tokens)

sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
```

**2. Speech synthesis utilizing a given speech prompt**

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

llasa_1b = 'HKUST-Audio/Llasa-1B-Multilingual'

tokenizer = AutoTokenizer.from_pretrained(llasa_1b)
model = AutoModelForCausalLM.from_pretrained(llasa_1b)
model.eval()
model.to('cuda')

model_path = "HKUST-Audio/xcodec2"

Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.eval().cuda()

# Only 16 kHz speech is supported!
prompt_wav, sr = sf.read("太乙真人.wav")  # you can find the wav in Files
# prompt_wav, sr = sf.read("Anna.wav")  # English prompt
prompt_wav = torch.from_numpy(prompt_wav).float().unsqueeze(0)

prompt_text = "对,这就是我万人敬仰的太乙真人,虽然有点婴儿肥,但也掩不住我逼人的帅气。"
# prompt_text = "A chance to leave him alone, but... No. She just wanted to see him again. Anna, you don't know how it feels to lose a sister. Anna, I'm sorry, but your father asked me not to tell you anything."
target_text = '突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道:"我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"'
# target_text = "Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection, intending to safeguard some from the harsh truths. One day, I hope you understand the reasons behind my actions. Until then, Anna, please, bear with me."
input_text = prompt_text + target_text

def ids_to_speech_tokens(speech_ids):
    # Convert int 12345 to token <|s_12345|>
    return [f"<|s_{speech_id}|>" for speech_id in speech_ids]

def extract_speech_ids(speech_tokens_str):
    # Convert token <|s_23456|> back to int 23456
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith('<|s_') and token_str.endswith('|>'):
            speech_ids.append(int(token_str[4:-2]))
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids

# TTS start!
with torch.no_grad():
    # Encode the prompt wav into codec tokens
    vq_code_prompt = Codec_model.encode_code(input_waveform=prompt_wav)
    print("Prompt Vq Code Shape:", vq_code_prompt.shape)

    vq_code_prompt = vq_code_prompt[0, 0, :]
    # Convert int 12345 to token <|s_12345|>
    speech_ids_prefix = ids_to_speech_tokens(vq_code_prompt)

    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

    # Tokenize the text and the speech prefix
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + ''.join(speech_ids_prefix)}
    ]

    input_ids = tokenizer.apply_chat_template(
        chat,
        tokenize=True,
        return_tensors='pt',
        continue_final_message=True
    )
    input_ids = input_ids.to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')

    # Generate the speech tokens autoregressively
    outputs = model.generate(
        input_ids,
        max_length=2048,  # We trained our model with a max length of 2048
        eos_token_id=speech_end_id,
        do_sample=True,
        top_p=1,
        temperature=0.8,
    )

    # Extract the speech tokens, keeping the prompt prefix but dropping the final EOS token
    generated_ids = outputs[0][input_ids.shape[1] - len(speech_ids_prefix):-1]
    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    # Convert token <|s_23456|> to int 23456
    speech_tokens = extract_speech_ids(speech_tokens)
    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)

    # Decode the speech tokens to a speech waveform
    gen_wav = Codec_model.decode_code(speech_tokens)

    # If only the generated part is needed:
    # gen_wav = gen_wav[:, :, prompt_wav.shape[1]:]

sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
```
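
The codec expects 16 kHz input (see the comment in the script above). If your prompt recording has a different sample rate, resample it before encoding; here is a minimal sketch using torchaudio, an extra dependency not required by the scripts above, with a hypothetical file name:

```python
import torch
import torchaudio
import soundfile as sf

prompt_wav, sr = sf.read("prompt.wav")  # hypothetical mono recording
prompt_wav = torch.from_numpy(prompt_wav).float().unsqueeze(0)
if sr != 16000:
    # Resample to the 16 kHz rate expected by XCodec2
    prompt_wav = torchaudio.functional.resample(prompt_wav, orig_freq=sr, new_freq=16000)
```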

## Disclaimer

This model is licensed under the CC BY-NC-ND 4.0 license, which prohibits commercial use because of ethics and privacy concerns; detected violations will result in legal consequences.

Using this codebase for any illegal purpose, in any country or region, is strictly prohibited. Please refer to the DMCA and other laws applicable in your local jurisdiction.