---
datasets:
- pkufool/libriheavy
language:
- en
pipeline_tag: text-to-speech
---

# SLED-TTS: Efficient Speech Language Modeling via Energy Distance in Continuous Space

[![HuggingFace](https://img.shields.io/badge/HuggingFace-FEC200?style=flat&logo=Hugging%20Face)](https://huggingface.co/collections/ICTNLP/sled-tts-680253e19c889010a1a376ac) [![WeChat AI](https://img.shields.io/badge/WeChat%20AI-4CAF50?style=flat&logo=wechat)](https://www.wechat.com) [![ICT/CAS](https://img.shields.io/badge/ICT%2FCAS-0066cc?style=flat&logo=school)](https://ict.cas.cn)

## Code: https://github.com/ictnlp/SLED-TTS

## Key features

- **Autoregressive Continuous Modeling**: SLED models speech in a continuous latent space, using a special type of maximum mean discrepancy (the energy distance) as the training objective; see the sketch after the demo link below.
- **Streaming Synthesis**: SLED supports streaming synthesis, so speech generation can start as soon as the text stream begins.
- **Voice Cloning**: SLED can generate speech conditioned on a 3-second prefix or a reference utterance used as a prompt.

## Demo

You can see SLED in action on the [demo page](https://sled-demo.github.io/).
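For intuition about the objective named in the title: the energy distance is the special case of maximum mean discrepancy obtained with a distance kernel, and it can be estimated from two independent model samples per target. The PyTorch sketch below is a minimal, generic estimator under that definition, not SLED's actual implementation; the function and variable names are ours, and SLED's exact weighting and sampling scheme may differ.

``` python
import torch

def energy_distance_loss(x1: torch.Tensor, x2: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Monte-Carlo estimate of the (generalized) energy distance.

    x1, x2: two independent samples from the model, shape (batch, dim)
    y:      target continuous latents, shape (batch, dim)

    ED = 2 * E||x - y|| - E||x - x'|| - E||y - y'||; the last term does not
    depend on the model parameters, so it is dropped from the loss.
    """
    attraction = (x1 - y).norm(dim=-1) + (x2 - y).norm(dim=-1)  # pull samples toward the target
    repulsion = (x1 - x2).norm(dim=-1)                          # penalize collapsed, deterministic samples
    return (attraction - repulsion).mean()
```

Because the energy distance is a proper scoring rule, minimizing it encourages the model to match the full conditional distribution of latents rather than just its mean.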
## Available Models on Hugging Face

We have made SLED available on [Hugging Face](https://huggingface.co/collections/ICTNLP/sled-tts-680253e19c889010a1a376ac), currently offering two English models for different use cases:

1. **[SLED-TTS-Libriheavy](https://huggingface.co/ICTNLP/SLED-TTS-Libriheavy)**: Trained on the Libriheavy dataset, this model provides high-quality offline text-to-speech synthesis.
2. **[SLED-TTS-Streaming-Libriheavy](https://huggingface.co/ICTNLP/SLED-TTS-Streaming-Libriheavy)**: This variant supports **streaming decoding**, generating a 0.6-second speech chunk for every 5 text tokens received. It is ideal for applications that require low-latency audio generation.

Mandarin models are on the way! Alternatively, you can train your own SLED-TTS models by following the guidelines below.

## Usage

**We provide the training and inference code for SLED-TTS.**

### Installation

``` sh
git clone https://github.com/ictnlp/SLED-TTS.git
cd SLED-TTS
pip install -e ./
```

We currently use the sum of the first 8 embedding vectors from [Encodec_24khz](https://huggingface.co/facebook/encodec_24khz) as the continuous latent vector (a minimal extraction sketch follows the inference examples). Before proceeding, make sure [Encodec_24khz](https://huggingface.co/facebook/encodec_24khz) is downloaded and cached in your Hugging Face cache directory.

### Inference

- Set the `CHECKPOINT` variable to the path of the cached **[SLED-TTS-Libriheavy](https://huggingface.co/ICTNLP/SLED-TTS-Libriheavy)** or **[SLED-TTS-Streaming-Libriheavy](https://huggingface.co/ICTNLP/SLED-TTS-Streaming-Libriheavy)** model.
- Diverse generation results can be obtained by varying the `SEED` variable.
- `CFG` controls the strength of classifier-free guidance (see the guidance note after the examples below).

``` sh
CHECKPOINT=/path/to/checkpoint
CFG=2.0
SEED=0
```

***Offline Inference***

``` sh
python scripts/run_offline.py \
    --model_name_or_path ${CHECKPOINT} \
    --cfg ${CFG} \
    --input "My remark pleases him, but I soon prove to him that it is not the right way to speak. However perfect may have been the language of that ancient writer." \
    --seed ${SEED}
```

***Streaming Inference***

``` sh
python scripts/run_stream.py \
    --model_name_or_path ${CHECKPOINT} \
    --cfg ${CFG} \
    --input "My remark pleases him, but I soon prove to him that it is not the right way to speak. However perfect may have been the language of that ancient writer." \
    --seed ${SEED}

# Note: run_stream.py simulates generation in a streaming environment so that
# streaming quality can be evaluated; the existing code does not yet expose an
# actual streaming API.
```

***Voice Cloning***

You can set the prompt speech via `--prompt_text` and `--prompt_audio`.

``` sh
python scripts/run_voice_clone.py \
    --prompt_text "Were I in the warm room with all the splendor and magnificence!" \
    --prompt_audio "example_prompt.flac" \
    --model_name_or_path ${CHECKPOINT} \
    --cfg ${CFG} \
    --input "Perhaps the other trees from the forest will come to look at me!" \
    --seed ${SEED}
```
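As noted under Installation, the continuous latent is the sum of the first 8 embedding vectors from Encodec_24khz. The sketch below shows one way to extract such a latent with the `transformers` Encodec implementation (at 24 kHz, a bandwidth of 6.0 kbps selects exactly the first 8 RVQ codebooks); the dummy waveform is ours, and the exact preprocessing in SLED's data pipeline may differ.

``` python
import torch
from transformers import AutoProcessor, EncodecModel

# Load Encodec and its feature processor (cached from the Hugging Face Hub).
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
model = EncodecModel.from_pretrained("facebook/encodec_24khz").eval()

# A dummy 1-second, 24 kHz waveform stands in for real speech.
waveform = torch.zeros(24000)
inputs = processor(raw_audio=waveform.numpy(), sampling_rate=24000, return_tensors="pt")

with torch.no_grad():
    # bandwidth=6.0 kbps keeps exactly the first 8 residual codebooks.
    encoded = model.encode(inputs["input_values"], inputs["padding_mask"], bandwidth=6.0)
    frame_codes = encoded.audio_codes[0]  # (batch, num_quantizers=8, frames)
    # quantizer.decode sums the embeddings of the provided codebooks,
    # i.e. the sum of the first 8 embedding vectors used as the latent.
    latent = model.quantizer.decode(frame_codes.transpose(0, 1))  # (batch, 128, frames)

print(latent.shape)  # roughly 75 latent frames per second of audio
```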
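A note on `CFG`: we do not reproduce SLED's exact guidance rule here, but a standard classifier-free guidance formulation looks like the sketch below. Correspondingly, `--training_cfg 0.1` in the training commands presumably drops the text condition for about 10% of training examples so that an unconditional prediction is available at inference time (an assumption on our part).

``` python
import torch

def apply_cfg(cond: torch.Tensor, uncond: torch.Tensor, cfg: float = 2.0) -> torch.Tensor:
    # Standard classifier-free guidance (assumed form, not necessarily SLED's rule):
    # cfg = 1.0 recovers the conditional prediction; larger values push the
    # output further from the unconditional prediction, i.e. follow the
    # input text more strictly.
    return uncond + cfg * (cond - uncond)
```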
### Training

***Data Processing***

#TODO

***Training Offline Model***

``` sh
OUTPUT_DIR=./runs/libriheavy
mkdir -p $OUTPUT_DIR
LOG_FILE=${OUTPUT_DIR}/log

BATCH_SIZE=8
UPDATE_FREQ=8
# Assuming 8 processes per node, the effective batch size is
# WORLD_SIZE * 8 * BATCH_SIZE * UPDATE_FREQ == 512 (e.g., 1 * 8 * 8 * 8 on a single node).
torchrun --nnodes ${WORLD_SIZE} --node_rank ${RANK} --nproc_per_node 8 --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} \
    ./scripts/train_libriheavy.py \
    --training_cfg 0.1 \
    --num_hidden_layers 12 --diffloss_d 6 --noise_channels 128 \
    --dataloader_num_workers 8 \
    --dataloader_pin_memory True \
    --remove_unused_columns False \
    --label_names audio_inputs \
    --group_by_speech_length \
    --do_train \
    --do_eval \
    --eval_strategy steps \
    --eval_steps 10000 \
    --prediction_loss_only \
    --per_device_train_batch_size ${BATCH_SIZE} \
    --per_device_eval_batch_size 24 \
    --gradient_accumulation_steps ${UPDATE_FREQ} \
    --bf16 \
    --learning_rate 5e-4 \
    --weight_decay 0.01 \
    --adam_beta1 0.9 \
    --adam_beta2 0.999 \
    --adam_epsilon 1e-8 \
    --max_grad_norm 1.0 \
    --max_steps 300000 \
    --lr_scheduler_type "linear" \
    --warmup_steps 32000 \
    --logging_first_step \
    --logging_steps 100 \
    --save_steps 10000 \
    --save_total_limit 10 \
    --output_dir ${OUTPUT_DIR} \
    --report_to tensorboard \
    --disable_tqdm True \
    --ddp_timeout 3600 --overwrite_output_dir
```

***Training Streaming Model***

``` sh
OUTPUT_DIR=./runs/libriheavy_stream
mkdir -p $OUTPUT_DIR
LOG_FILE=${OUTPUT_DIR}/log

BATCH_SIZE=8
UPDATE_FREQ=8
# Assuming 8 processes per node, the effective batch size is
# WORLD_SIZE * 8 * BATCH_SIZE * UPDATE_FREQ == 512.
# --finetune_path: initialize from the final offline checkpoint.
# --stream_n 5 --stream_m 45: emit 45 latent frames (~0.6 s at Encodec's ~75 Hz
# frame rate) for every 5 text tokens, matching the streaming model above.
torchrun --nnodes ${WORLD_SIZE} --node_rank ${RANK} --nproc_per_node 8 --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} \
    ./scripts/train_libriheavy_stream.py \
    --finetune_path ./runs/libriheavy/checkpoint-300000/model.safetensors \
    --stream_n 5 --stream_m 45 \
    --training_cfg 0.1 \
    --num_hidden_layers 12 --diffloss_d 6 --noise_channels 128 \
    --dataloader_num_workers 8 \
    --dataloader_pin_memory True \
    --remove_unused_columns False \
    --label_names audio_inputs \
    --group_by_speech_length \
    --do_train \
    --do_eval \
    --eval_strategy steps \
    --eval_steps 10000 \
    --prediction_loss_only \
    --per_device_train_batch_size ${BATCH_SIZE} \
    --per_device_eval_batch_size 24 \
    --gradient_accumulation_steps ${UPDATE_FREQ} \
    --bf16 \
    --learning_rate 3e-4 \
    --weight_decay 0.01 \
    --adam_beta1 0.9 \
    --adam_beta2 0.999 \
    --adam_epsilon 1e-8 \
    --max_grad_norm 1.0 \
    --max_steps 100000 \
    --lr_scheduler_type "linear" \
    --warmup_steps 10000 \
    --logging_first_step \
    --logging_steps 100 \
    --save_steps 10000 \
    --save_total_limit 10 \
    --output_dir ${OUTPUT_DIR} \
    --report_to tensorboard \
    --disable_tqdm True \
    --ddp_timeout 3600 --overwrite_output_dir
```

## Code Contributors

- [Zhengrui Ma](https://scholar.google.com/citations?user=dUgq6tEAAAAJ)
- [Chenze Shao](https://scholar.google.com/citations?user=LH_rZf8AAAAJ)

## Acknowledgement

This work is inspired by the following great works:

- A Proper Loss Is All You Need: Autoregressive Image Generation in Continuous Space via Score Maximization
- Autoregressive Image Generation without Vector Quantization
- A Spectral Energy Distance for Parallel Speech Synthesis

## Citation

#TODO