Add Appendix B: Fine-tuning Korean speech

#40
Files changed (1): README.md (+143 −0)
README.md CHANGED
@@ -653,3 +653,146 @@ The model was evaluated across a breadth of public and internal benchmarks to un
  Red Team:
  Responses to prompts provided by AI Red Team at Microsoft
  </details>

## Appendix B: Fine-tuning Korean speech

<details>
<summary>Click to view detail descriptions</summary>

### Overview and Datasets

Phi-4-multimodal was not originally designed for Korean speech-to-text, but it can be fine-tuned for the task using your own data or public Korean speech datasets.

We fine-tuned the Phi-4-multimodal model for Korean speech-to-text using the following datasets:

- kresnik/zeroth_korean
- mozilla-foundation/common_voice_17_0 (Korean speech only)
- PolyAI/minds14 (Korean speech only)
- Custom dataset. The speech is a mix of fast and slow speech (technical blog content and presentations that the author has posted), with some modulation applied using [audiomentations](https://github.com/iver56/audiomentations) and [this script](https://github.com/daekeun-ml/azure-genai-utils/blob/main/azure_genai_utils/stt/augment.py).

In total there are 35K samples. Each sample is a pair of Korean speech and its transcription, sampled at 16 kHz.
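
As a minimal sketch of this data format, the snippet below loads one of the public datasets and resamples the audio to 16 kHz with the `datasets` library. The column names (`audio`, `text`) and split follow the kresnik/zeroth_korean dataset card; the prompt construction and batching used for actual fine-tuning are omitted.

```python
from datasets import Audio, load_dataset

# Load the Zeroth-Korean training split and decode audio at 16 kHz,
# the sampling rate used for fine-tuning.
zeroth = load_dataset("kresnik/zeroth_korean", split="train")
zeroth = zeroth.cast_column("audio", Audio(sampling_rate=16000))

sample = zeroth[0]
waveform = sample["audio"]["array"]               # float waveform
sampling_rate = sample["audio"]["sampling_rate"]  # 16000 after casting
transcription = sample["text"]                    # Korean transcription

print(f"{len(waveform) / sampling_rate:.1f}s audio:", transcription)
```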

You can download the fine-tuned model [here](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech). Please refer to the Jupyter notebook and video clips in the [demo folder](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech/tree/main/demos). They are not production-quality, as the model was fine-tuned only for PoC purposes, but you can see that it transcribes and translates with high accuracy even when a native speaker speaks quite quickly.
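
For reference, the following is a minimal inference sketch for the fine-tuned checkpoint, following the prompt and processor conventions shown on the [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) model card; the audio file path and the transcription instruction text are placeholders.

```python
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_id = "daekeun-ml/Phi-4-multimodal-finetune-ko-speech"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    _attn_implementation="flash_attention_2",  # or "eager" if flash-attn is unavailable
    device_map="cuda",
)
generation_config = GenerationConfig.from_pretrained(model_id)

# Placeholder 16 kHz Korean speech clip and ASR instruction
audio, samplerate = sf.read("korean_sample.wav")
prompt = "<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>"

inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors="pt").to("cuda")
generate_ids = model.generate(**inputs, max_new_tokens=256, generation_config=generation_config)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```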

### Requirements
Based on Python 3.10, the following packages are required, and an A100/H100 GPU is recommended.
```
torch==2.6.0
transformers==4.48.2
accelerate==1.4.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.14.0
datasets==3.3.2
pandas==2.2.3
flash_attn==2.7.4.post1
evaluate==0.4.3
sacrebleu==2.5.1
```

### Training
The model was trained on a single A100 80GB GPU for 4 epochs with a batch size of 16, using the `sample_finetune_speech.py` script from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).

The fine-tuning script and command line are basically the same as [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-main-py), but you need to prepare your own dataset. To unfreeze the audio encoder, refer to the code snippet below, which is taken from [the fine-tuning Colab notebook](https://colab.research.google.com/drive/1JAQdpX3BtIgDmTLlnHgstKfGw7HjSfej?usp=sharing).

```python
from transformers import AutoProcessor

# `accelerator`, `create_model`, and `args` come from the surrounding fine-tuning
# script (see sample_finetune_speech.py / the Colab notebook linked above).
with accelerator.local_main_process_first():
    processor = AutoProcessor.from_pretrained(
        "microsoft/Phi-4-multimodal-instruct",
        trust_remote_code=True,
    )
    model = create_model(
        args.model_name_or_path,
        use_flash_attention=args.use_flash_attention,
    )


def unfreeze_speech_components(model):
    """Unfreeze the audio embed module, audio encoder, and audio projection."""
    # 1. Audio embed module
    audio_embed = model.model.embed_tokens_extend.audio_embed

    # 2. Entire audio encoder
    audio_encoder = audio_embed.encoder  # direct access

    # 3. Audio projection
    audio_projection = audio_embed.audio_projection

    # Unfreeze ONLY these three components
    for component in [audio_embed, audio_encoder, audio_projection]:
        for param in component.parameters():
            param.requires_grad = True
    return model


model = unfreeze_speech_components(model)

# Verify unfrozen parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable_params:,}")

# After unfreezing, sanity-check that the encoder and projection are trainable
encoder_params = list(model.model.embed_tokens_extend.audio_embed.encoder.parameters())
proj_params = list(model.model.embed_tokens_extend.audio_embed.audio_projection.parameters())

assert any(p.requires_grad for p in encoder_params), "Encoder params frozen!"
assert any(p.requires_grad for p in proj_params), "Projection params frozen!"
print("Components properly unfrozen ✅")
```
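
Optionally, to see at a glance which sub-modules end up trainable after unfreezing, a small helper like the one below (a hypothetical utility, not part of the original script) can group trainable parameter counts by module prefix:

```python
from collections import defaultdict


def trainable_breakdown(model, depth=3):
    """Print trainable parameter counts grouped by the first `depth` parts of each parameter name."""
    counts = defaultdict(int)
    for name, param in model.named_parameters():
        if param.requires_grad:
            prefix = ".".join(name.split(".")[:depth])
            counts[prefix] += param.numel()
    for prefix, numel in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{prefix:<60} {numel:>15,}")


trainable_breakdown(model)
```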

An example command to run the fine-tuning script is as follows:
```bash
python main.py
```

The latest version of the uploaded model was fine-tuned by **unfreezing the audio encoder**, and its ASR performance improved significantly compared to the baseline LoRA adapter-based fine-tuning.
Comparing full fine-tuning with LoRA fine-tuning, the CER on the zeroth-test set is **1.61%** vs. 2.72%, and the WER is **3.54%** vs. 7.19%, respectively. Please refer to [Experimental Settings and Results](#experimental-settings-and-results) for more details.

### Experimental Settings and Results
The purpose of this benchmark is to evaluate the model's basic performance on Korean speech and audio understanding tasks, namely automatic speech recognition (ASR) and automatic speech translation (AST). Evaluation was done on the following test sets:

+ ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) and WER (Word Error Rate) on the [zeroth-test set (457 samples)](https://huggingface.co/datasets/kresnik/zeroth_korean).
+ AST (Automatic Speech Translation): evaluated with the BLEU score on the [fleurs ko <-> en speech translation test set (270 samples)](https://huggingface.co/datasets/seastar105/fleurs_ko_en_test).

The evaluation script is taken from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py); a minimal sketch of the metric computation is shown below.
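
The sketch below only illustrates how the reported metrics can be computed from model outputs with the `evaluate` and `sacrebleu` packages listed in the requirements; the prediction and reference lists are placeholders, and the actual prompt construction and generation loop live in the evaluation script linked above.

```python
import evaluate

# Placeholder lists: model transcriptions/translations and their references
asr_predictions = ["안녕하세요 만나서 반갑습니다"]
asr_references = ["안녕하세요 만나서 반갑습니다"]
ast_predictions = ["Hello, nice to meet you."]
ast_references = ["Hello, nice to meet you."]

# ASR: character and word error rates (lower is better); these metrics also need `jiwer` installed
cer = evaluate.load("cer").compute(predictions=asr_predictions, references=asr_references)
wer = evaluate.load("wer").compute(predictions=asr_predictions, references=asr_references)
print(f"CER: {cer:.2%}, WER: {wer:.2%}")

# AST: corpus BLEU via sacrebleu (higher is better); one list of references per prediction
bleu = evaluate.load("sacrebleu").compute(
    predictions=ast_predictions,
    references=[[ref] for ref in ast_references],
)
print(f"BLEU: {bleu['score']:.2f}")
```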

We used [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor) as the baseline, as it showed a significant performance improvement with just 1 epoch of training. Note that the baseline was trained on [22K Zeroth Korean speech samples](https://huggingface.co/datasets/kresnik/zeroth_korean) for 1 epoch. Based on this baseline, we conducted additional experiments with our 35K training samples under the following scenarios:

+ [Case 1] LoRA finetune (1 epoch): LoRA adapter-based fine-tuning for 1 epoch
+ [Case 2] LoRA finetune (4 epochs): LoRA adapter-based fine-tuning for 4 epochs
+ [Case 3] Unfreeze audio encoder finetune (4 epochs): full fine-tuning for 4 epochs

The results of the experiments are as follows:
+ CER and WER on the zeroth-test set (lower is better)
  + Case 1: CER 3.80% and WER 11.52%, better than the baseline (7.02% and 17.31%).
  + Case 2: CER 2.72% and WER 7.19%, better than Case 1.
  + Case 3: CER 1.61% and WER 3.54%, the best among the cases.

+ BLEU scores on the fleurs ko <-> en speech translation test set (higher is better)
  + Case 1 does not improve on the baseline; in particular, the BLEU score for fleurs-ko2en-cot decreases.
  + Case 2 improves slightly over Case 1 and is the best among the cases.
  + Case 3 does not improve on the baseline or Case 2.

| Model | zeroth CER (%) | zeroth WER (%) | fleurs-ko2en (BLEU) | fleurs-ko2en-cot (BLEU) | fleurs-en2ko (BLEU) | fleurs-en2ko-cot (BLEU) |
|----------------------------------------|----------------|----------------|---------------------|-------------------------|---------------------|-------------------------|
| original (Phi-4-multimodal-instruct) | 99.16 | 99.63 | 5.63 | 2.42 | 6.86 | 4.17 |
| Ours - speech full finetune (4 epochs) | 1.61 | 3.54 | 7.67 | 8.38 | 12.31 | 9.69 |
| LoRA finetune (4 epochs) | 2.72 | 7.19 | 7.11 | 9.95 | 13.22 | 10.45 |
| LoRA finetune (1 epoch) | 3.80 | 11.52 | 7.03 | 7.04 | 12.50 | 9.54 |
| Phi-4-mm-inst-zeroth-kor | 7.02 | 17.31 | 7.07 | 9.19 | 13.08 | 9.35 |

### Cautions

Note that this model is for PoC/experimental purposes only and is not intended for production use. More high-quality data, tuning, ablation studies, and experiments are needed.

The Phi-4-multimodal model is strong in multimodal tasks, especially speech-to-text, and shows high potential for Korean language tasks. If you are interested in Korean speech-to-text, this model can be a good starting point.

### References

- https://huggingface.co/microsoft/Phi-4-multimodal-instruct
- https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor

</details>