---
license: cc-by-nc-nd-4.0
language:
- zh
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
tags:
- Text-to-Speech
pipeline_tag: text-to-speech
---

## Paper
LLaSA: Scaling Train-Time and Test-Time Compute for LLaMA-based Speech Synthesis (coming soon)

- **Train from Scratch**: If you want to train the model from scratch, use the [LLaSA Training Repository](https://github.com/zhenye234/LLaSA_training).

- **Scale for Test-Time Computation**: If you want to experiment with scaling test-time computation, use the [LLaSA Testing Repository](https://github.com/zhenye234/LLaSA_inference).

## Model Information
Our model, Llasa, is a text-to-speech (TTS) system that extends the text-based LLaMA (1B, 3B, and 8B) language models by incorporating speech tokens from the XCodec2 codebook, which contains 65,536 tokens. We trained Llasa on a dataset comprising 250,000 hours of Chinese-English speech data. The model can generate speech **either solely from input text or by utilizing a given speech prompt.**

The method is seamlessly compatible with the Llama framework, making TTS training much like LLM training: audio is converted into single-codebook tokens and simply treated as a special language. This opens up the possibility of applying existing LLM methods for compression, acceleration, and fine-tuning to TTS.

**A brief overview of XCodec and XCodec2** can be found at
https://huggingface.co/HKUSTAudio/Llasa-3B/discussions/11

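To make the "special language" idea concrete, here is a minimal sketch (an illustration only, not the actual training code) of how XCodec2 codebook IDs map to the speech tokens added to the LLaMA vocabulary:

```python
# Illustration only (not the training code): each XCodec2 codebook ID k is
# rendered as a text token <|s_k|>, so a speech utterance becomes an ordinary
# token sequence the LLM can model with next-token prediction.
codec_ids = [523, 19044, 65535]  # hypothetical codec output

speech_token_strings = [f"<|s_{i}|>" for i in codec_ids]
print(speech_token_strings)  # ['<|s_523|>', '<|s_19044|>', '<|s_65535|>']

# With all 65,536 tokens <|s_0|> ... <|s_65535|> added to the tokenizer,
# TTS training reduces to standard LLM training over mixed text + speech
# token sequences.
```
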
## How to use
Install [XCodec2](https://huggingface.co/HKUST-Audio/xcodec2) (please use the newer version, xcodec2==0.1.3):
```bash
conda create -n xcodec2 python=3.9
conda activate xcodec2
pip install xcodec2==0.1.3
```
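
Optionally, as a quick sanity check of the install, you can round-trip a waveform through the codec. This sketch is not part of the original recipe; it reuses only the `encode_code` / `decode_code` calls from the examples below, and `your_16khz_mono.wav` is a placeholder for any 16 kHz mono file:

```python
# Sanity-check sketch: encode a 16 kHz waveform to discrete codes and decode
# it back. "your_16khz_mono.wav" is a placeholder input file.
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

codec = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2").eval().cuda()

wav, sr = sf.read("your_16khz_mono.wav")          # must be 16 kHz mono
wav = torch.from_numpy(wav).float().unsqueeze(0)  # (1, num_samples)

with torch.no_grad():
    codes = codec.encode_code(input_waveform=wav)  # integer codes, shape (1, 1, T)
    recon = codec.decode_code(codes)               # waveform, shape (1, 1, N)

sf.write("recon.wav", recon[0, 0, :].cpu().numpy(), 16000)
```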

**1. Speech synthesis solely from input text**
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf

llasa_3b = 'HKUST-Audio/Llasa-3B'

tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b)
model.eval()
model.to('cuda')

from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUST-Audio/xcodec2"

Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.eval().cuda()

input_text = 'Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection, intending to safeguard some from the harsh truths. One day, I hope you understand the reasons behind my actions. Until then, Anna, please, bear with me.'
# Chinese example:
# input_text = '突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道:"我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"'

def ids_to_speech_tokens(speech_ids):
    # Convert int codec IDs to token strings, e.g. 23456 -> <|s_23456|>
    speech_tokens_str = []
    for speech_id in speech_ids:
        speech_tokens_str.append(f"<|s_{speech_id}|>")
    return speech_tokens_str

def extract_speech_ids(speech_tokens_str):
    # Convert token strings back to int codec IDs, e.g. <|s_23456|> -> 23456
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith('<|s_') and token_str.endswith('|>'):
            num_str = token_str[4:-2]
            num = int(num_str)
            speech_ids.append(num)
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids

# TTS start!
with torch.no_grad():
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

    # Tokenize the text
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
    ]

    input_ids = tokenizer.apply_chat_template(
        chat,
        tokenize=True,
        return_tensors='pt',
        continue_final_message=True
    )
    input_ids = input_ids.to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')

    # Generate the speech tokens autoregressively
    outputs = model.generate(
        input_ids,
        max_length=2048,  # We trained our model with a max length of 2048
        eos_token_id=speech_end_id,
        do_sample=True,
        top_p=1,          # Adjusts the diversity of generated content
        temperature=0.8,  # Controls randomness in output
    )
    # Extract the speech tokens (drop the text prompt and the trailing EOS)
    generated_ids = outputs[0][input_ids.shape[1]:-1]

    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    # Convert token <|s_23456|> to int 23456
    speech_tokens = extract_speech_ids(speech_tokens)

    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)

    # Decode the speech tokens to a speech waveform
    gen_wav = Codec_model.decode_code(speech_tokens)

sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
```

**2. Speech synthesis utilizing a given speech prompt**

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf

llasa_3b = 'HKUST-Audio/Llasa-3B'

tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b)
model.eval()
model.to('cuda')

from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUST-Audio/xcodec2"

Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.eval().cuda()

# Only 16 kHz speech is supported!
prompt_wav, sr = sf.read("太乙真人.wav")  # you can find this wav under Files
# prompt_wav, sr = sf.read("Anna.wav")  # English prompt
prompt_wav = torch.from_numpy(prompt_wav).float().unsqueeze(0)

# Transcript of the prompt audio (Chinese example):
prompt_text = "对,这就是我万人敬仰的太乙真人,虽然有点婴儿肥,但也掩不住我逼人的帅气。"
# prompt_text = "A chance to leave him alone, but... No. She just wanted to see him again. Anna, you don't know how it feels to lose a sister. Anna, I'm sorry, but your father asked me not to tell you anything."
target_text = '突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道:"我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"'
# target_text = "Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection, intending to safeguard some from the harsh truths. One day, I hope you understand the reasons behind my actions. Until then, Anna, please, bear with me."
input_text = prompt_text + target_text

def ids_to_speech_tokens(speech_ids):
    # Convert int codec IDs to token strings, e.g. 12345 -> <|s_12345|>
    speech_tokens_str = []
    for speech_id in speech_ids:
        speech_tokens_str.append(f"<|s_{speech_id}|>")
    return speech_tokens_str

def extract_speech_ids(speech_tokens_str):
    # Convert token strings back to int codec IDs, e.g. <|s_23456|> -> 23456
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith('<|s_') and token_str.endswith('|>'):
            num_str = token_str[4:-2]
            num = int(num_str)
            speech_ids.append(num)
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids

# TTS start!
with torch.no_grad():
    # Encode the prompt wav
    vq_code_prompt = Codec_model.encode_code(input_waveform=prompt_wav)
    print("Prompt Vq Code Shape:", vq_code_prompt.shape)

    vq_code_prompt = vq_code_prompt[0, 0, :]
    # Convert int 12345 to token <|s_12345|>
    speech_ids_prefix = ids_to_speech_tokens(vq_code_prompt)

    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

    # Tokenize the text and the speech prefix
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + ''.join(speech_ids_prefix)}
    ]

    input_ids = tokenizer.apply_chat_template(
        chat,
        tokenize=True,
        return_tensors='pt',
        continue_final_message=True
    )
    input_ids = input_ids.to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')

    # Generate the speech tokens autoregressively
    outputs = model.generate(
        input_ids,
        max_length=2048,  # We trained our model with a max length of 2048
        eos_token_id=speech_end_id,
        do_sample=True,
        top_p=1,
        temperature=0.8,
    )
    # Extract the speech tokens, keeping the prompt's speech prefix so the
    # decoded waveform covers prompt + generated speech (drop the trailing EOS)
    generated_ids = outputs[0][input_ids.shape[1] - len(speech_ids_prefix):-1]

    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    # Convert token <|s_23456|> to int 23456
    speech_tokens = extract_speech_ids(speech_tokens)

    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)

    # Decode the speech tokens to a speech waveform
    gen_wav = Codec_model.decode_code(speech_tokens)

    # If you only need the newly generated part:
    # gen_wav = gen_wav[:, :, prompt_wav.shape[1]:]

sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
```
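
The decoded waveform covers the prompt speech followed by the newly generated speech. If you only want the new part, crop by the prompt's sample count as the commented line above suggests; a minimal follow-up sketch (assuming prompt and output are both 16 kHz, so sample counts line up):

```python
# Keep only the newly generated speech: the decoded waveform reproduces the
# prompt first, so drop the first prompt_wav.shape[1] samples (both 16 kHz).
gen_only = gen_wav[:, :, prompt_wav.shape[1]:]
sf.write("gen_only.wav", gen_only[0, 0, :].cpu().numpy(), 16000)
```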

## Disclaimer

This model is licensed under the CC BY-NC-ND 4.0 License, which prohibits commercial use due to ethics and privacy concerns; detected violations will result in legal consequences.

This codebase is strictly prohibited from being used for any illegal purpose in any country or region. Please refer to your local laws regarding the DMCA and other related laws.