Files changed (1)
  1. README.md +120 -0
README.md CHANGED
@@ -653,3 +653,123 @@ The model was evaluated across a breadth of public and internal benchmarks to un
  + Red Team:
  + Responses to prompts provided by AI Red Team at Microsoft
  </details>
+
+
+ ## Appendix B: Fine-tuning Korean speech
+
+ <details>
+ <summary>Click to view the detailed description</summary>
+
+ ### Overview and Datasets
+
+ Phi-4-multimodal was not originally designed for Korean speech-to-text, but it can be fine-tuned for this task using your own data or public Korean speech datasets.
+
+ We fine-tuned the Phi-4-multimodal model for Korean speech-to-text using the following datasets:
+
+ - kresnik/zeroth_korean
+ - mozilla-foundation/common_voice_17_0 (Korean speech only)
+ - PolyAI/minds14 (Korean speech only)
+ - Custom dataset: a mix of fast and slow speech (technical blog content and presentations the author has posted), augmented with [audiomentations](https://github.com/iver56/audiomentations) and [this script](https://github.com/daekeun-ml/azure-genai-utils/blob/main/azure_genai_utils/stt/augment.py); a minimal augmentation sketch is shown below
+
+ In total there are 35K samples; each sample is a pair of Korean speech and its transcription, sampled at 16 kHz.
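+
+ The waveform-level augmentation mentioned above can be approximated with a few lines of [audiomentations](https://github.com/iver56/audiomentations). The snippet below is only a minimal sketch: the transforms, parameters, and file names are illustrative assumptions, not the exact pipeline of the linked augmentation script.
+
+ ```python
+ import numpy as np
+ import soundfile as sf
+ from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch
+
+ # Illustrative augmentation pipeline (assumed transforms and parameters).
+ augment = Compose([
+     AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
+     TimeStretch(min_rate=0.9, max_rate=1.1, p=0.5),
+     PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
+ ])
+
+ # Placeholder file name; the training audio is mono and sampled at 16 kHz.
+ samples, sr = sf.read("korean_sample.wav")
+ augmented = augment(samples=samples.astype(np.float32), sample_rate=sr)
+ sf.write("korean_sample_aug.wav", augmented, sr)
+ ```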
+
+ You can download the fine-tuned model [here](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech). Please refer to the Jupyter notebook and video clips in the [demo folder](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech/tree/main/demos). They are not production-quality, as the model was fine-tuned only for PoC purposes, but you can see that it transcribes and translates with high accuracy even when a native speaker speaks quite quickly.
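+
+ For a quick check of the downloaded model, Korean ASR can be run with the speech-prompt pattern documented in the upstream [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) model card. The snippet below is a minimal sketch: the audio file name is a placeholder and the generation settings are illustrative, so check the notebooks in the demo folder for the exact usage.
+
+ ```python
+ import soundfile as sf
+ import torch
+ from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
+
+ model_id = "daekeun-ml/Phi-4-multimodal-finetune-ko-speech"
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id, torch_dtype="auto", trust_remote_code=True, device_map="cuda"
+ )
+ generation_config = GenerationConfig.from_pretrained(model_id)
+
+ # Speech prompt format from the upstream model card: <|user|><|audio_1|>...<|end|><|assistant|>
+ prompt = "<|user|><|audio_1|>Transcribe the audio to text.<|end|><|assistant|>"
+ audio, sample_rate = sf.read("korean_sample.wav")  # placeholder 16 kHz clip
+
+ inputs = processor(text=prompt, audios=[(audio, sample_rate)], return_tensors="pt").to(model.device)
+ with torch.inference_mode():
+     generate_ids = model.generate(**inputs, max_new_tokens=256, generation_config=generation_config)
+ generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
+ print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
+ ```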
+
+ ### Training
+ The model was trained on a single A100 80GB GPU for 4 epochs with a batch size of 16, using the `sample_finetune_speech.py` script from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).
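+ Assuming no gradient accumulation, 35K samples with a batch size of 16 works out to roughly 35,000 / 16 ≈ 2,188 optimizer steps per epoch, or about 8,750 steps over 4 epochs.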
+
+ The latest uploaded version of the model was fine-tuned by **unfreezing the audio encoder**, which significantly improved ASR performance compared to the baseline LoRA adapter-based fine-tuning.
+ Comparing full fine-tuning with LoRA fine-tuning on the zeroth-test set, the CER is **1.61%** vs. 2.72% and the WER is **3.54%** vs. 7.19%, respectively.
+
+ ### Experimental Settings and Results
+ The purpose of this benchmark setup is to evaluate the model's basic Korean performance on speech and audio understanding tasks. We evaluated automatic speech recognition and automatic speech translation on the following test sets:
+
+ ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) and WER (Word Error Rate) on the [zeroth-test set (457 samples)](https://huggingface.co/datasets/kresnik/zeroth_korean).
+ AST (Automatic Speech Translation): evaluated with the BLEU score on the [fleurs ko <-> en speech translation test set (270 samples)](https://huggingface.co/datasets/seastar105/fleurs_ko_en_test).
+
+ The evaluation script was retrieved from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py); a simplified sketch of the metric computation is shown below.
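+
+ The linked gist is the authoritative evaluation script; the snippet below only sketches how the reported metrics can be computed, using `jiwer` for CER/WER and `sacrebleu` for BLEU. The reference/hypothesis lists are placeholders and would come from running the model on the test sets above.
+
+ ```python
+ import jiwer
+ import sacrebleu
+
+ # Placeholder references and hypotheses; in practice these come from model inference on the test sets.
+ asr_refs = ["안녕하세요 만나서 반갑습니다"]
+ asr_hyps = ["안녕하세요 만나서 반갑습니다"]
+ ast_refs = ["Nice to meet you."]
+ ast_hyps = ["Nice to meet you."]
+
+ cer = jiwer.cer(asr_refs, asr_hyps) * 100  # Character Error Rate, %
+ wer = jiwer.wer(asr_refs, asr_hyps) * 100  # Word Error Rate, %
+ bleu = sacrebleu.corpus_bleu(ast_hyps, [ast_refs]).score
+
+ print(f"CER: {cer:.2f}%  WER: {wer:.2f}%  BLEU: {bleu:.2f}")
+ ```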
+
+ We used [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor) as the baseline, as it showed a significant performance improvement after just 1 epoch. Note that the baseline was trained on [22K samples of Zeroth Korean speech data](https://huggingface.co/datasets/kresnik/zeroth_korean) for 1 epoch. Starting from this baseline and using our 35K training samples, we conducted additional experiments with the following scenarios:
+
+ [Case 1] LoRA fine-tuning (1 epoch): LoRA adapter-based fine-tuning for 1 epoch
+ [Case 2] LoRA fine-tuning (4 epochs): LoRA adapter-based fine-tuning for 4 epochs
+ [Case 3] Unfrozen audio encoder fine-tuning (4 epochs): full fine-tuning for 4 epochs
+
+ The code snippets for unfreezing the audio encoder were retrieved from [the fine-tuning Colab notebook](https://colab.research.google.com/drive/1JAQdpX3BtIgDmTLlnHgstKfGw7HjSfej?usp=sharing):
+ ```python
+ with accelerator.local_main_process_first():
+     processor = AutoProcessor.from_pretrained(
+         "microsoft/Phi-4-multimodal-instruct",
+         trust_remote_code=True,
+     )
+     model = create_model(
+         args.model_name_or_path,
+         use_flash_attention=args.use_flash_attention,
+     )
+
+ def unfreeze_speech_components(model):
+     """Directly target verified components from your debug logs"""
+     # 1. Audio Embed Module (confirmed exists)
+     audio_embed = model.model.embed_tokens_extend.audio_embed
+
+     # 2. Entire Audio Encoder (simplified)
+     audio_encoder = audio_embed.encoder  # Direct access
+
+     # 3. Audio Projection (from debug logs)
+     audio_projection = audio_embed.audio_projection
+
+     # Unfreeze ONLY these 3 components
+     for component in [audio_embed, audio_encoder, audio_projection]:
+         for param in component.parameters():
+             param.requires_grad = True
+     return model
+
+ model = unfreeze_speech_components(model)
+
+ # Verify unfrozen parameters
+ trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+ print(f"Trainable parameters: {trainable_params:,}")
+
+ # After unfreezing
+ encoder_params = list(model.model.embed_tokens_extend.audio_embed.encoder.parameters())
+ proj_params = list(model.model.embed_tokens_extend.audio_embed.audio_projection.parameters())
+
+ assert any(p.requires_grad for p in encoder_params), "Encoder params frozen!"
+ assert any(p.requires_grad for p in proj_params), "Projection params frozen!"
+ print("Components properly unfrozen ✅")
+ ```
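+
+ As a follow-up illustration (not part of the original snippet), the optimizer can be built over only the unfrozen parameters; the learning rate below is an arbitrary placeholder, and the training script may handle this differently.
+
+ ```python
+ from torch.optim import AdamW
+
+ # `model` is the unfrozen model from the snippet above; optimize only trainable parameters.
+ optimizer = AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-5)
+ ```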
+
+ The results of the experiments are as follows:
+ CER and WER on the zeroth-test set (lower is better):
+ Case 1's CER and WER are 3.80% and 11.52%, respectively, which are better than the baseline (7.02% and 17.31%).
+ Case 2's CER and WER are 2.72% and 7.19%, respectively, which are better than Case 1.
+ Case 3's CER and WER are 1.61% and 3.54%, respectively, which are the best among the cases.
+
+ BLEU score on the fleurs ko <-> en speech translation test set (higher is better):
+ Case 1 does not improve on the baseline; in particular, its BLEU score on fleurs-ko2en-cot is lower than the baseline's.
+ Case 2 is slightly better than Case 1 and is the best among the cases.
+ Case 3 does not improve on the baseline or on Case 2.
+
+ | Model | zeroth CER (%) | zeroth WER (%) | fleurs-ko2en (BLEU) | fleurs-ko2en-cot (BLEU) | fleurs-en2ko (BLEU) | fleurs-en2ko-cot (BLEU) |
+ |--------------------------------------------|-------|-------|------|------|-------|-------|
+ | Phi-4-multimodal-instruct (original)       | 99.16 | 99.63 | 5.63 | 2.42 | 6.86  | 4.17  |
+ | Ours - speech full fine-tuning (4 epochs)  | 1.61  | 3.54  | 7.67 | 8.38 | 12.31 | 9.69  |
+ | Ours - LoRA fine-tuning (4 epochs)         | 2.72  | 7.19  | 7.11 | 9.95 | 13.22 | 10.45 |
+ | Ours - LoRA fine-tuning (1 epoch)          | 3.80  | 11.52 | 7.03 | 7.04 | 12.50 | 9.54  |
+ | Phi-4-mm-inst-zeroth-kor (baseline)        | 7.02  | 17.31 | 7.07 | 9.19 | 13.08 | 9.35  |
+
+
+ ### Cautions
+
+ Note that this model is for PoC/experimental purposes only and is not intended for production use. More high-quality data, tuning, ablation studies, and experiments are needed.
+
+ Phi-4-multimodal is strong on multimodal tasks, especially speech-to-text, and shows high potential for Korean language tasks. If you are interested in Korean speech-to-text, this model can be a good starting point.
+
+ ### References
+
+ - https://huggingface.co/microsoft/Phi-4-multimodal-instruct
+ - https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor
+
+ </details>