Upload folder using huggingface_hub

- README.md +65 -3
- adapter_config.json +21 -0
- adapter_model.bin +3 -0
- original_args.json +26 -0

README.md
CHANGED
@@ -1,3 +1,65 @@
----
-
-
+---
+language: en
+license: apache-2.0
+tags:
+- audio
+- speech
+- transcription
+- librispeech
+- llama-3
+datasets:
+- librispeech_asr
+---
+
+# Audio-LLaMA: LoRA Adapter for Audio Transcription
+
+This model is a LoRA adapter fine-tuned for audio transcription. It must be loaded on top of its Llama base model to be used.
+
+## Model Details
+
+- **Base Model**: meta-llama/Llama-3.2-3B-Instruct
+- **Audio Model**: openai/whisper-large-v3-turbo
+- **LoRA Rank**: 32
+- **Task**: Audio transcription on the LibriSpeech dataset
+- **Training Framework**: PEFT (Parameter-Efficient Fine-Tuning)
+
+## Usage
+
+This is a PEFT (LoRA) adapter that needs to be combined with the base Llama model:
+
+```python
+import torch
+from peft import PeftModel, PeftConfig
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Load the LoRA configuration
+config = PeftConfig.from_pretrained("cdreetz/audio-llama")
+
+# Load the base model
+model = AutoModelForCausalLM.from_pretrained(
+    config.base_model_name_or_path,
+    torch_dtype=torch.float16,
+    device_map="auto"
+)
+
+# Load the tokenizer
+tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
+
+# Load the LoRA adapter
+model = PeftModel.from_pretrained(model, "cdreetz/audio-llama")
+
+# Run inference (text-only here; audio inputs need the Whisper preprocessing noted under Limitations)
+prompt = "Transcribe this audio:"
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=100)
+response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(response)
+```
+
+## Training
+
+This model was fine-tuned with LoRA on audio transcription tasks. It starts from a Llama 3 base model and uses Whisper-processed audio features for audio understanding.
+
+## Limitations
+
+This model requires additional code to process audio with Whisper before it can be passed to the Llama model. See the [Audio-LLaMA repository](https://github.com/cdreetz/audio-llama) for full usage instructions.
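The Training and Limitations sections above describe Whisper-processed audio features being fed to the Llama model. As a rough sketch of that preprocessing step only, the snippet below extracts Whisper encoder features with the `transformers` Whisper classes; the placeholder waveform and the omitted projection into Llama's embedding space are assumptions here, since the actual audio projector lives in the Audio-LLaMA repository.

```python
import numpy as np
import torch
from transformers import WhisperModel, WhisperProcessor

# Sketch only: extract Whisper encoder features of the kind the README describes.
# The projection of these features into Llama's embedding space is handled by the
# Audio-LLaMA repository code and is not reproduced here.
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3-turbo")
encoder = WhisperModel.from_pretrained("openai/whisper-large-v3-turbo").get_encoder()

# Placeholder input: one second of silence at 16 kHz. In practice this would be a
# LibriSpeech clip loaded with `datasets` or `librosa` and resampled to 16 kHz.
audio_array = np.zeros(16_000, dtype=np.float32)

features = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    audio_embeddings = encoder(features.input_features).last_hidden_state
print(audio_embeddings.shape)  # (batch, num_frames, hidden_size)
```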
adapter_config.json
ADDED
@@ -0,0 +1,21 @@
+{
+  "base_model_name_or_path": "meta-llama/Llama-3.2-3B-Instruct",
+  "bias": "none",
+  "enable_lora": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "lora_alpha": 32,
+  "lora_dropout": 0.05,
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "r": 32,
+  "target_modules": [
+    "q_proj",
+    "k_proj",
+    "v_proj",
+    "gate_proj",
+    "up_proj",
+    "down_proj"
+  ],
+  "task_type": "CAUSAL_LM"
+}
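For reference, the adapter settings above correspond to roughly the following PEFT `LoraConfig`; this is a sketch for anyone recreating the setup, not code taken from the training repository.

```python
from peft import LoraConfig

# Mirrors adapter_config.json: rank-32 LoRA (alpha 32, dropout 0.05) applied to the
# attention and MLP projection layers of the Llama base model.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```

A config like this would typically be attached to the base model with `peft.get_peft_model` before training.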
adapter_model.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6f63fd54f85cd58730463ed2ef5c15a657fed4c174fed433966eddda933dd9ab
+size 172603978
original_args.json
ADDED
@@ -0,0 +1,26 @@
+{
+  "llama_path": "meta-llama/Llama-3.2-3B-Instruct",
+  "whisper_path": "openai/whisper-large-v3-turbo",
+  "data_path": "./audio_instruction_examples.json",
+  "audio_dir": "./",
+  "output_dir": "./checkpoints",
+  "batch_size": 16,
+  "eval_batch_size": 16,
+  "grad_accum_steps": 4,
+  "num_epochs": 5,
+  "learning_rate": 5e-05,
+  "weight_decay": 0.01,
+  "warmup_steps": 500,
+  "max_grad_norm": 1.0,
+  "lora_rank": 32,
+  "save_steps": 1000,
+  "eval_steps": 500,
+  "log_steps": 100,
+  "max_audio_length": 30,
+  "text_max_length": 512,
+  "use_wandb": false,
+  "wandb_project": "audio-llm",
+  "seed": 42,
+  "fp16": true,
+  "num_workers": 4
+}
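The arguments above imply an effective batch size of 16 × 4 = 64 sequences per optimizer step (batch_size × grad_accum_steps). Below is a minimal, illustrative way to load the file and read such values; the loader is not the repository's own code.

```python
import json
from types import SimpleNamespace

# Illustrative only: read original_args.json into attribute-style access.
with open("original_args.json") as f:
    args = SimpleNamespace(**json.load(f))

effective_batch = args.batch_size * args.grad_accum_steps  # 16 * 4 = 64
print(f"LoRA rank {args.lora_rank}, lr {args.learning_rate}, effective batch {effective_batch}")
```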