Add Appendix B: Fine-tuning Korean speech

#40
Files changed (1): README.md (+143 −0)
README.md CHANGED
@@ -653,3 +653,146 @@ The model was evaluated across a breadth of public and internal benchmarks to un
  Red Team:
  Responses to prompts provided by AI Red Team at Microsoft
  </details>

## Appendix B: Fine-tuning Korean speech

<details>
<summary>Click to view detail descriptions</summary>

### Overview and Datasets

Phi-4-multimodal was not originally designed for Korean speech-to-text, but it can be fine-tuned for the task using your own data or public Korean speech datasets.

We fine-tuned the Phi-4-multimodal model for Korean speech-to-text using the following datasets:

- kresnik/zeroth_korean
- mozilla-foundation/common_voice_17_0 (Korean speech only)
- PolyAI/minds14 (Korean speech only)
- Custom dataset. The speech is a mix of fast and slow speech (technical blog content and presentations that the author has posted), with some modulation applied using [audiomentations](https://github.com/iver56/audiomentations) and [this script](https://github.com/daekeun-ml/azure-genai-utils/blob/main/azure_genai_utils/stt/augment.py).

In total there are 35K samples. Each sample is a pair of Korean speech and its transcription, sampled at 16 kHz.
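
As a minimal sketch of this data format, the snippet below loads one of the public datasets and resamples the audio to 16 kHz with the `datasets` library. The column names (`audio`, `text`) and split follow the kresnik/zeroth_korean dataset card; the prompt construction and batching used for actual fine-tuning are omitted.

```python
from datasets import Audio, load_dataset

# Load the Zeroth-Korean training split and decode audio at 16 kHz,
# the sampling rate used for fine-tuning.
zeroth = load_dataset("kresnik/zeroth_korean", split="train")
zeroth = zeroth.cast_column("audio", Audio(sampling_rate=16000))

sample = zeroth[0]
waveform = sample["audio"]["array"]               # float waveform
sampling_rate = sample["audio"]["sampling_rate"]  # 16000 after casting
transcription = sample["text"]                    # Korean transcription

print(f"{len(waveform) / sampling_rate:.1f}s audio:", transcription)
```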

You can download the fine-tuned model [here](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech). Please refer to the Jupyter notebook and video clips in the [demo folder](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech/tree/main/demos). They are not production-quality, as the model was fine-tuned only for PoC purposes, but you can see that it transcribes and translates with high accuracy even when a native speaker speaks quite quickly.
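
For reference, the following is a minimal inference sketch for the fine-tuned checkpoint, following the prompt and processor conventions shown on the [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) model card; the audio file path and the transcription instruction text are placeholders.

```python
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_id = "daekeun-ml/Phi-4-multimodal-finetune-ko-speech"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    _attn_implementation="flash_attention_2",  # or "eager" if flash-attn is unavailable
    device_map="cuda",
)
generation_config = GenerationConfig.from_pretrained(model_id)

# Placeholder 16 kHz Korean speech clip and ASR instruction
audio, samplerate = sf.read("korean_sample.wav")
prompt = "<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>"

inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors="pt").to("cuda")
generate_ids = model.generate(**inputs, max_new_tokens=256, generation_config=generation_config)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```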

### Requirements
Based on Python 3.10, the following packages are required, and an A100/H100 GPU is recommended.
```
torch==2.6.0
transformers==4.48.2
accelerate==1.4.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.14.0
datasets==3.3.2
pandas==2.2.3
flash_attn==2.7.4.post1
evaluate==0.4.3
sacrebleu==2.5.1
```

### Training
The model was trained on a single A100 80GB GPU for 4 epochs with a batch size of 16, using the `sample_finetune_speech.py` script from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).

The fine-tuning script and command line are basically the same as [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-main-py), but you need to prepare your own dataset. To unfreeze the audio encoder, refer to the code snippet below, which is taken from [the fine-tuning Colab notebook](https://colab.research.google.com/drive/1JAQdpX3BtIgDmTLlnHgstKfGw7HjSfej?usp=sharing).

```python
from transformers import AutoProcessor

# `accelerator`, `create_model`, and `args` come from the surrounding fine-tuning
# script (see sample_finetune_speech.py / the Colab notebook linked above).
with accelerator.local_main_process_first():
    processor = AutoProcessor.from_pretrained(
        "microsoft/Phi-4-multimodal-instruct",
        trust_remote_code=True,
    )
    model = create_model(
        args.model_name_or_path,
        use_flash_attention=args.use_flash_attention,
    )


def unfreeze_speech_components(model):
    """Unfreeze the audio embed module, audio encoder, and audio projection."""
    # 1. Audio embed module
    audio_embed = model.model.embed_tokens_extend.audio_embed

    # 2. Entire audio encoder
    audio_encoder = audio_embed.encoder  # direct access

    # 3. Audio projection
    audio_projection = audio_embed.audio_projection

    # Unfreeze ONLY these three components
    for component in [audio_embed, audio_encoder, audio_projection]:
        for param in component.parameters():
            param.requires_grad = True
    return model


model = unfreeze_speech_components(model)

# Verify unfrozen parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable_params:,}")

# After unfreezing, sanity-check that the encoder and projection are trainable
encoder_params = list(model.model.embed_tokens_extend.audio_embed.encoder.parameters())
proj_params = list(model.model.embed_tokens_extend.audio_embed.audio_projection.parameters())

assert any(p.requires_grad for p in encoder_params), "Encoder params frozen!"
assert any(p.requires_grad for p in proj_params), "Projection params frozen!"
print("Components properly unfrozen ✅")
```
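
Optionally, to see at a glance which sub-modules end up trainable after unfreezing, a small helper like the one below (a hypothetical utility, not part of the original script) can group trainable parameter counts by module prefix:

```python
from collections import defaultdict


def trainable_breakdown(model, depth=3):
    """Print trainable parameter counts grouped by the first `depth` parts of each parameter name."""
    counts = defaultdict(int)
    for name, param in model.named_parameters():
        if param.requires_grad:
            prefix = ".".join(name.split(".")[:depth])
            counts[prefix] += param.numel()
    for prefix, numel in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{prefix:<60} {numel:>15,}")


trainable_breakdown(model)
```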

An example command to run the fine-tuning script is as follows:
```bash
python main.py
```

The latest version of the uploaded model was fine-tuned by **unfreezing the audio encoder**, and its ASR performance improved significantly compared to the baseline LoRA adapter-based fine-tuning.
Comparing full fine-tuning with LoRA fine-tuning, the CER on the zeroth-test set is **1.61%** vs. 2.72%, and the WER is **3.54%** vs. 7.19%, respectively. Please refer to [Experimental Settings and Results](#experimental-settings-and-results) for more details.

### Experimental Settings and Results
The purpose of this benchmark is to evaluate the model's basic performance on Korean speech and audio understanding tasks, namely automatic speech recognition (ASR) and automatic speech translation (AST). Evaluation was done on the following test sets:

+ ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) and WER (Word Error Rate) on the [zeroth-test set (457 samples)](https://huggingface.co/datasets/kresnik/zeroth_korean).
+ AST (Automatic Speech Translation): evaluated with the BLEU score on the [fleurs ko <-> en speech translation test set (270 samples)](https://huggingface.co/datasets/seastar105/fleurs_ko_en_test).

The evaluation script is taken from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py); a minimal sketch of the metric computation is shown below.
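
The sketch below only illustrates how the reported metrics can be computed from model outputs with the `evaluate` and `sacrebleu` packages listed in the requirements; the prediction and reference lists are placeholders, and the actual prompt construction and generation loop live in the evaluation script linked above.

```python
import evaluate

# Placeholder lists: model transcriptions/translations and their references
asr_predictions = ["안녕하세요 만나서 반갑습니다"]
asr_references = ["안녕하세요 만나서 반갑습니다"]
ast_predictions = ["Hello, nice to meet you."]
ast_references = ["Hello, nice to meet you."]

# ASR: character and word error rates (lower is better); these metrics also need `jiwer` installed
cer = evaluate.load("cer").compute(predictions=asr_predictions, references=asr_references)
wer = evaluate.load("wer").compute(predictions=asr_predictions, references=asr_references)
print(f"CER: {cer:.2%}, WER: {wer:.2%}")

# AST: corpus BLEU via sacrebleu (higher is better); one list of references per prediction
bleu = evaluate.load("sacrebleu").compute(
    predictions=ast_predictions,
    references=[[ref] for ref in ast_references],
)
print(f"BLEU: {bleu['score']:.2f}")
```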

We used [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor) as the baseline, as it showed a significant performance improvement with just 1 epoch of training. Note that the baseline was trained on [22K Zeroth Korean speech samples](https://huggingface.co/datasets/kresnik/zeroth_korean) for 1 epoch. Based on this baseline, we conducted additional experiments with our 35K training samples under the following scenarios:

+ [Case 1] LoRA finetune (1 epoch): LoRA adapter-based fine-tuning for 1 epoch
+ [Case 2] LoRA finetune (4 epochs): LoRA adapter-based fine-tuning for 4 epochs
+ [Case 3] Unfreeze audio encoder finetune (4 epochs): full fine-tuning for 4 epochs

The results of the experiments are as follows:
+ CER and WER on the zeroth-test set (lower is better)
  + Case 1: CER 3.80% and WER 11.52%, better than the baseline (7.02% and 17.31%).
  + Case 2: CER 2.72% and WER 7.19%, better than Case 1.
  + Case 3: CER 1.61% and WER 3.54%, the best among the cases.

+ BLEU scores on the fleurs ko <-> en speech translation test set (higher is better)
  + Case 1 does not improve on the baseline; in particular, the BLEU score for fleurs-ko2en-cot decreases.
  + Case 2 improves slightly over Case 1 and is the best among the cases.
  + Case 3 does not improve on the baseline or Case 2.

| Model | zeroth CER (%) | zeroth WER (%) | fleurs-ko2en (BLEU) | fleurs-ko2en-cot (BLEU) | fleurs-en2ko (BLEU) | fleurs-en2ko-cot (BLEU) |
|----------------------------------------|----------------|----------------|---------------------|-------------------------|---------------------|-------------------------|
| original (Phi-4-multimodal-instruct) | 99.16 | 99.63 | 5.63 | 2.42 | 6.86 | 4.17 |
| Ours - speech full finetune (4 epochs) | 1.61 | 3.54 | 7.67 | 8.38 | 12.31 | 9.69 |
| LoRA finetune (4 epochs) | 2.72 | 7.19 | 7.11 | 9.95 | 13.22 | 10.45 |
| LoRA finetune (1 epoch) | 3.80 | 11.52 | 7.03 | 7.04 | 12.50 | 9.54 |
| Phi-4-mm-inst-zeroth-kor | 7.02 | 17.31 | 7.07 | 9.19 | 13.08 | 9.35 |

### Cautions

Note that this model is for PoC/experimental purposes only and is not intended for production use. More high-quality data, tuning, ablation studies, and experiments are needed.

The Phi-4-multimodal model is strong in multimodal tasks, especially speech-to-text, and shows high potential for Korean language tasks. If you are interested in Korean speech-to-text, this model can be a good starting point.

### References

- https://huggingface.co/microsoft/Phi-4-multimodal-instruct
- https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor

</details>