Qwen2.5-7B-Instruct-Add-Speech-Token-4096-Nostrip

Introduction

This repo contains the Qwen2.5-7B-Instruct-Add-Speech-Token-4096-Nostrip model used to train the EMOVA series of models. Starting from the original Qwen2.5-7B-Instruct checkpoint, we insert speech tokens into its vocabulary for end-to-end omni-modal alignment, as shown below. The EMOVA speech tokenizer uses a total of 4096 speech tokens. This checkpoint should therefore be used as the initialization for Stage 2 (Omni-modal text-centric alignment) of EMOVA training.

# Source code can be found at https://github.com/emova-ollm/EMOVA#insert-speech-tokens-into-llm-vocabulary
python scripts/insert_speech_token.py \
  --origin_model_path Qwen/Qwen2.5-7B-Instruct \
  --saved_model_path ./Qwen2.5-7B-Instruct_add_speech_token_4096_nostrip \
  --num_speech_tokens 4096
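
Conceptually, the script extends the tokenizer vocabulary and the model's embedding matrices of the base checkpoint. Below is a minimal sketch of that idea using the Hugging Face transformers API; the actual token strings and saving logic are defined in scripts/insert_speech_token.py in the EMOVA repo, and the <|speech_i|> names here are placeholders.

# Illustrative sketch only -- the official logic lives in scripts/insert_speech_token.py.
from transformers import AutoModelForCausalLM, AutoTokenizer

origin = "Qwen/Qwen2.5-7B-Instruct"
num_speech_tokens = 4096

tokenizer = AutoTokenizer.from_pretrained(origin)
model = AutoModelForCausalLM.from_pretrained(origin)

# Append one new token per discrete speech unit (placeholder naming).
speech_tokens = [f"<|speech_{i}|>" for i in range(num_speech_tokens)]
tokenizer.add_tokens(speech_tokens, special_tokens=True)

# Grow the input/output embedding matrices to cover the enlarged vocabulary.
model.resize_token_embeddings(len(tokenizer))

tokenizer.save_pretrained("./Qwen2.5-7B-Instruct_add_speech_token_4096_nostrip")
model.save_pretrained("./Qwen2.5-7B-Instruct_add_speech_token_4096_nostrip")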

Usage

To train EMOVA with Qwen2.5-7B-Instruct_add_speech_token_4096_nostrip, we need to create a new model config and set its language_model parameters as follows. An example is provided here. See our GitHub repo for more details on training EMOVA.

language_model=dict(
  type='EmovaQwen2ForCausalLM',                                              # Wrapper class type for EMOVA
  pretrained_model_name_or_path='Emova-ollm/Qwen2.5-7B-Instruct_add_speech_token_4096_nostrip',  # HuggingFace repo of the pre-trained LLM
  attn_implementation="flash_attention_2",                                   # Attention implementation
  from_pretrained=True,                                                      # Load pre-trained weights
),
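
Outside the EMOVA training pipeline, the released checkpoint can also be loaded directly with transformers to confirm that the vocabulary has been extended. This is only an illustrative sanity check, not part of the official configs.

# Sanity check: load the released checkpoint and inspect the extended vocabulary.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Emova-ollm/Qwen2.5-7B-Instruct_add_speech_token_4096_nostrip"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto")

print(len(tokenizer))                             # base Qwen2.5 vocabulary + 4096 speech tokens
print(model.get_input_embeddings().weight.shape)  # embedding matrix covers the enlarged vocabulary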

Citation

@article{chen2024emova,
  title={Emova: Empowering language models to see, hear and speak with vivid emotions},
  author={Chen, Kai and Gou, Yunhao and Huang, Runhui and Liu, Zhili and Tan, Daxin and Xu, Jing and Wang, Chunwei and Zhu, Yi and Zeng, Yihan and Yang, Kuo and others},
  journal={arXiv preprint arXiv:2409.18042},
  year={2024}
}

@article{qwen2.5,
    title   = {Qwen2.5 Technical Report}, 
    author  = {An Yang and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoran Wei and Huan Lin and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jingren Zhou and Junyang Lin and Kai Dang and Keming Lu and Keqin Bao and Kexin Yang and Le Yu and Mei Li and Mingfeng Xue and Pei Zhang and Qin Zhu and Rui Men and Runji Lin and Tianhao Li and Tingyu Xia and Xingzhang Ren and Xuancheng Ren and Yang Fan and Yang Su and Yichang Zhang and Yu Wan and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zihan Qiu},
    journal = {arXiv preprint arXiv:2412.15115},
    year    = {2024}
}