nielsr (HF Staff) committed · Commit 7b9dd1f · verified · 1 Parent(s): b7daaf6

Add pipeline tag and library name


This PR adds the `pipeline_tag` and `library_name` to the model card metadata. The `pipeline_tag` is set to `audio-text-to-text` to accurately reflect the model's capabilities. The `library_name` is set to `transformers` based on the provided code example.
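
As a rough illustration of what this metadata change amounts to, the sketch below reads the top-level keys from a README's front matter and checks that `pipeline_tag` and `library_name` are present. It is a minimal sketch using only the Python standard library: the helper name `read_front_matter` and the `SAMPLE` string are hypothetical, the parser handles only flat `key: value` lines (not full YAML), and it is not the validation the Hub itself performs.

```python
# Hypothetical helper: a deliberately small front-matter reader, not a YAML parser.
REQUIRED_KEYS = ("pipeline_tag", "library_name")


def read_front_matter(readme_text):
    """Return top-level keys from the '---'-delimited block at the top of a README."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the front matter block
        # Skip list items and indented continuation lines; keep only top-level keys.
        if line.startswith((" ", "-")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta


# Front matter as it appears on the new side of this diff (assumed complete here).
SAMPLE = """---
base_model:
- meta-llama/Llama-3.2-1B-Instruct
datasets:
- VocalNet/VoiceAssitant-430K-vocalnet
- VocalNet/UltraChat-vocalnet
language:
- en
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: transformers
---

## VocalNet-1B Model Card
"""

meta = read_front_matter(SAMPLE)
missing = [key for key in REQUIRED_KEYS if not meta.get(key)]
print("missing keys:", missing)
```

Keys whose values are YAML lists (such as `datasets`) come back as empty strings in this sketch, which is sufficient for a presence check on the two scalar keys this PR adds.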

Files changed (1)
  1. README.md +56 -244
README.md CHANGED
@@ -1,12 +1,14 @@
1
  ---
2
- license: apache-2.0
 
3
  datasets:
4
  - VocalNet/VoiceAssitant-430K-vocalnet
5
  - VocalNet/UltraChat-vocalnet
6
  language:
7
  - en
8
- base_model:
9
- - meta-llama/Llama-3.2-1B-Instruct
 
10
  ---
11
 
12
  ## 🎧 VocalNet-1B Model Card
@@ -277,263 +279,73 @@ VocalNet-1B was evaluated on [OpenAudioBench](https://huggingface.co/datasets/ba
277
  <td style="padding: 10px; border: 1px solid #ddd;"><u>6.22</u></td>
278
  </tr>
279
  <tr>
280
- <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Minmo*</td>
281
- <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
282
- <td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
283
- <td style="padding: 10px; border: 1px solid #ddd;">-</td>
284
- <td style="padding: 10px; border: 1px solid #ddd;">78.9</td>
285
- <td style="padding: 10px; border: 1px solid #ddd;">4.83</td>
286
- <td style="padding: 10px; border: 1px solid #ddd;">5.50</td>
287
- </tr>
288
- <tr>
289
- <td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
290
- <td style="padding: 10px; border: 1px solid #ddd;"><b>6.48</b></td>
291
- <td style="padding: 10px; border: 1px solid #ddd;">64.1</td>
292
- <td style="padding: 10px; border: 1px solid #ddd;">3.75</td>
293
- <td style="padding: 10px; border: 1px solid #ddd;">3.99</td>
294
- </tr>
295
- <tr>
296
- <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Qwen2.5-Omni</td>
297
- <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
298
- <td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
299
- <td style="padding: 10px; border: 1px solid #ddd;">6.01</td>
300
- <td style="padding: 10px; border: 1px solid #ddd;"><u>79.0</u></td>
301
- <td style="padding: 10px; border: 1px solid #ddd;">5.89</td>
302
- <td style="padding: 10px; border: 1px solid #ddd;"><u>6.88</u></td>
303
- </tr>
304
- <tr>
305
- <td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
306
- <td style="padding: 10px; border: 1px solid #ddd;">5.73</td>
307
- <td style="padding: 10px; border: 1px solid #ddd;"><b>76.3</b></td>
308
- <td style="padding: 10px; border: 1px solid #ddd;"><u>5.59</u></td>
309
- <td style="padding: 10px; border: 1px solid #ddd;"><b>6.70</b></td>
310
- </tr>
311
- <tr>
312
- <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">VocalNet-8B (VA)</td>
313
- <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
314
- <td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
315
- <td style="padding: 10px; border: 1px solid #ddd;"><u>7.05</u></td>
316
- <td style="padding: 10px; border: 1px solid #ddd;">77.1</td>
317
- <td style="padding: 10px; border: 1px solid #ddd;">6.15</td>
318
- <td style="padding: 10px; border: 1px solid #ddd;">6.34</td>
319
- </tr>
320
- <tr>
321
- <td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
322
- <td style="padding: 10px; border: 1px solid #ddd;">6.30</td>
323
- <td style="padding: 10px; border: 1px solid #ddd;">71.4</td>
324
- <td style="padding: 10px; border: 1px solid #ddd;">5.24</td>
325
- <td style="padding: 10px; border: 1px solid #ddd;">5.81</td>
326
- </tr>
327
- <tr>
328
- <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">VocalNet-8B</td>
329
- <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
330
- <td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
331
- <td style="padding: 10px; border: 1px solid #ddd;"><b>7.12</b></td>
332
- <td style="padding: 10px; border: 1px solid #ddd;"><b>79.5</b></td>
333
- <td style="padding: 10px; border: 1px solid #ddd;"><u>6.24</u></td>
334
- <td style="padding: 10px; border: 1px solid #ddd;">6.48</td>
335
- </tr>
336
- <tr>
337
- <td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
338
- <td style="padding: 10px; border: 1px solid #ddd;"><u>6.37</u></td>
339
- <td style="padding: 10px; border: 1px solid #ddd;"><u>73.1</u></td>
340
- <td style="padding: 10px; border: 1px solid #ddd;"><b>5.67</b></td>
341
- <td style="padding: 10px; border: 1px solid #ddd;">6.16</td>
342
- </tr>
343
- </tbody>
344
- </table>
345
- </div>
346
-
347
- #### Response Alignment and Acoustic Quality
348
- <div align="center">
349
- <table style="margin: 0 auto; text-align: center; border-collapse: collapse; font-size: 14px;">
350
- <tbody>
351
- <tr style="background-color: #f2f2f2;">
352
- <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Model</td>
353
- <td colspan="2" style="padding: 10px; border: 1px solid #ddd;">AlpacaEval</td>
354
- <td colspan="2" style="padding: 10px; border: 1px solid #ddd;">LLaMA Questions</td>
355
- <td colspan="2" style="padding: 10px; border: 1px solid #ddd;">TriviaQA</td>
356
- <td colspan="2" style="padding: 10px; border: 1px solid #ddd;">Web Questions</td>
357
- <td colspan="2" style="padding: 10px; border: 1px solid #ddd;">Avg</td>
358
- </tr>
359
- <tr>
360
- <td style="padding: 10px; border: 1px solid #ddd;">WER</td>
361
- <td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
362
- <td style="padding: 10px; border: 1px solid #ddd;">WER</td>
363
- <td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
364
- <td style="padding: 10px; border: 1px solid #ddd;">WER</td>
365
- <td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
366
- <td style="padding: 10px; border: 1px solid #ddd;">WER</td>
367
- <td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
368
- <td style="padding: 10px; border: 1px solid #ddd;">WER</td>
369
- <td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
370
- </tr>
371
- <tr>
372
- <td colspan="11" style="padding: 10px; border: 1px solid #ddd; font-weight: bold; background-color: #e6f3ff;">Tiny Models</td>
373
- </tr>
374
- <tr>
375
- <td style="padding: 10px; border: 1px solid #ddd;">Mini-Omni</td>
376
- <td style="padding: 10px; border: 1px solid #ddd;">20.78</td>
377
- <td style="padding: 10px; border: 1px solid #ddd;">4.429</td>
378
- <td style="padding: 10px; border: 1px solid #ddd;">5.20</td>
379
- <td style="padding: 10px; border: 1px solid #ddd;">4.428</td>
380
- <td style="padding: 10px; border: 1px solid #ddd;">7.43</td>
381
- <td style="padding: 10px; border: 1px solid #ddd;">4.428</td>
382
- <td style="padding: 10px; border: 1px solid #ddd;">8.51</td>
383
- <td style="padding: 10px; border: 1px solid #ddd;">4.433</td>
384
- <td style="padding: 10px; border: 1px solid #ddd;">8.66</td>
385
- <td style="padding: 10px; border: 1px solid #ddd;">4.430</td>
386
- </tr>
387
- <tr>
388
- <td style="padding: 10px; border: 1px solid #ddd;">SLAM-Omni</td>
389
- <td style="padding: 10px; border: 1px solid #ddd;">5.52</td>
390
- <td style="padding: 10px; border: 1px solid #ddd;">4.439</td>
391
- <td style="padding: 10px; border: 1px solid #ddd;">5.55</td>
392
- <td style="padding: 10px; border: 1px solid #ddd;">4.467</td>
393
- <td style="padding: 10px; border: 1px solid #ddd;">6.16</td>
394
- <td style="padding: 10px; border: 1px solid #ddd;">4.470</td>
395
- <td style="padding: 10px; border: 1px solid #ddd;">6.50</td>
396
- <td style="padding: 10px; border: 1px solid #ddd;">4.461</td>
397
- <td style="padding: 10px; border: 1px solid #ddd;">6.17</td>
398
- <td style="padding: 10px; border: 1px solid #ddd;">4.464</td>
399
  </tr>
400
  <tr>
401
- <td style="padding: 10px; border: 1px solid #ddd;">VocalNet-1B (VA)</td>
402
- <td style="padding: 10px; border: 1px solid #ddd;"><b>3.43</b></td>
403
- <td style="padding: 10px; border: 1px solid #ddd;"><b>4.495</b></td>
404
- <td style="padding: 10px; border: 1px solid #ddd;">3.65</td>
405
- <td style="padding: 10px; border: 1px solid #ddd;"><b>4.498</b></td>
406
- <td style="padding: 10px; border: 1px solid #ddd;"><b>5.97</b></td>
407
- <td style="padding: 10px; border: 1px solid #ddd;"><b>4.499</b></td>
408
- <td style="padding: 10px; border: 1px solid #ddd;">6.40</td>
409
- <td style="padding: 10px; border: 1px solid #ddd;">4.489</td>
410
- <td style="padding: 10px; border: 1px solid #ddd;">5.66</td>
411
- <td style="padding: 10px; border: 1px solid #ddd;"><b>4.495</b></td>
412
  </tr>
413
  <tr>
414
- <td style="padding: 10px; border: 1px solid #ddd;">VocalNet-1B</td>
415
- <td style="padding: 10px; border: 1px solid #ddd;"><b>3.43</b></td>
416
- <td style="padding: 10px; border: 1px solid #ddd;">4.491</td>
417
- <td style="padding: 10px; border: 1px solid #ddd;"><b>3.27</b></td>
418
- <td style="padding: 10px; border: 1px solid #ddd;">4.497</td>
419
- <td style="padding: 10px; border: 1px solid #ddd;">6.73</td>
420
- <td style="padding: 10px; border: 1px solid #ddd;">4.486</td>
421
- <td style="padding: 10px; border: 1px solid #ddd;"><b>4.88</b></td>
422
- <td style="padding: 10px; border: 1px solid #ddd;"><b>4.493</b></td>
423
- <td style="padding: 10px; border: 1px solid #ddd;"><b>5.31</b></td>
424
- <td style="padding: 10px; border: 1px solid #ddd;">4.491</td>
425
- </tr>
426
- <tr>
427
- <td colspan="11" style="padding: 10px; border: 1px solid #ddd; font-weight: bold; background-color: #e6f3ff;">Base Models</td>
428
- </tr>
429
- <tr>
430
- <td style="padding: 10px; border: 1px solid #ddd;">LLaMA-Omni</td>
431
- <td style="padding: 10px; border: 1px solid #ddd;">6.00</td>
432
- <td style="padding: 10px; border: 1px solid #ddd;">3.942</td>
433
- <td style="padding: 10px; border: 1px solid #ddd;">10.00</td>
434
- <td style="padding: 10px; border: 1px solid #ddd;">4.003</td>
435
- <td style="padding: 10px; border: 1px solid #ddd;">20.93</td>
436
- <td style="padding: 10px; border: 1px solid #ddd;">3.965</td>
437
- <td style="padding: 10px; border: 1px solid #ddd;">14.60</td>
438
- <td style="padding: 10px; border: 1px solid #ddd;">3.935</td>
439
- <td style="padding: 10px; border: 1px solid #ddd;">15.90</td>
440
- <td style="padding: 10px; border: 1px solid #ddd;">3.956</td>
441
- </tr>
442
- <tr>
443
- <td style="padding: 10px; border: 1px solid #ddd;">Freeze-Omni</td>
444
- <td style="padding: 10px; border: 1px solid #ddd;">14.33</td>
445
- <td style="padding: 10px; border: 1px solid #ddd;">4.377</td>
446
- <td style="padding: 10px; border: 1px solid #ddd;">14.20</td>
447
- <td style="padding: 10px; border: 1px solid #ddd;">4.417</td>
448
- <td style="padding: 10px; border: 1px solid #ddd;">20.39</td>
449
- <td style="padding: 10px; border: 1px solid #ddd;">4.404</td>
450
- <td style="padding: 10px; border: 1px solid #ddd;">18.25</td>
451
- <td style="padding: 10px; border: 1px solid #ddd;">4.398</td>
452
- <td style="padding: 10px; border: 1px solid #ddd;">18.31</td>
453
- <td style="padding: 10px; border: 1px solid #ddd;">4.401</td>
454
- </tr>
455
- <tr>
456
- <td style="padding: 10px; border: 1px solid #ddd;">GLM-4-Voice</td>
457
- <td style="padding: 10px; border: 1px solid #ddd;">18.71</td>
458
- <td style="padding: 10px; border: 1px solid #ddd;">4.025</td>
459
- <td style="padding: 10px; border: 1px solid #ddd;">14.45</td>
460
- <td style="padding: 10px; border: 1px solid #ddd;">4.152</td>
461
- <td style="padding: 10px; border: 1px solid #ddd;">8.33</td>
462
- <td style="padding: 10px; border: 1px solid #ddd;">4.306</td>
463
- <td style="padding: 10px; border: 1px solid #ddd;">6.08</td>
464
- <td style="padding: 10px; border: 1px solid #ddd;">4.214</td>
465
- <td style="padding: 10px; border: 1px solid #ddd;">8.99</td>
466
- <td style="padding: 10px; border: 1px solid #ddd;">4.228</td>
467
- </tr>
468
- <tr>
469
- <td style="padding: 10px; border: 1px solid #ddd;">Baichuan-Omni-1.5</td>
470
- <td style="padding: 10px; border: 1px solid #ddd;">20.84</td>
471
- <td style="padding: 10px; border: 1px solid #ddd;">4.082</td>
472
- <td style="padding: 10px; border: 1px solid #ddd;">22.82</td>
473
- <td style="padding: 10px; border: 1px solid #ddd;">4.332</td>
474
- <td style="padding: 10px; border: 1px solid #ddd;">22.36</td>
475
- <td style="padding: 10px; border: 1px solid #ddd;">4.401</td>
476
- <td style="padding: 10px; border: 1px solid #ddd;">23.29</td>
477
- <td style="padding: 10px; border: 1px solid #ddd;">4.350</td>
478
- <td style="padding: 10px; border: 1px solid #ddd;">22.67</td>
479
- <td style="padding: 10px; border: 1px solid #ddd;">4.347</td>
480
- </tr>
481
- <tr>
482
- <td style="padding: 10px; border: 1px solid #ddd;">MiniCPM-o</td>
483
- <td style="padding: 10px; border: 1px solid #ddd;">15.35</td>
484
- <td style="padding: 10px; border: 1px solid #ddd;">4.102</td>
485
- <td style="padding: 10px; border: 1px solid #ddd;">5.73</td>
486
- <td style="padding: 10px; border: 1px solid #ddd;">4.228</td>
487
- <td style="padding: 10px; border: 1px solid #ddd;">8.08</td>
488
- <td style="padding: 10px; border: 1px solid #ddd;">4.128</td>
489
- <td style="padding: 10px; border: 1px solid #ddd;">8.94</td>
490
- <td style="padding: 10px; border: 1px solid #ddd;">4.125</td>
491
- <td style="padding: 10px; border: 1px solid #ddd;">8.72</td>
492
- <td style="padding: 10px; border: 1px solid #ddd;">4.137</td>
493
  </tr>
494
  <tr>
495
- <td style="padding: 10px; border: 1px solid #ddd;">Qwen2.5-Omni</td>
496
- <td style="padding: 10px; border: 1px solid #ddd;"><b>2.41</b></td>
497
- <td style="padding: 10px; border: 1px solid #ddd;">4.299</td>
498
- <td style="padding: 10px; border: 1px solid #ddd;"><b>0.93</b></td>
499
- <td style="padding: 10px; border: 1px solid #ddd;">4.315</td>
500
- <td style="padding: 10px; border: 1px solid #ddd;"><b>1.13</b></td>
501
- <td style="padding: 10px; border: 1px solid #ddd;">4.339</td>
502
- <td style="padding: 10px; border: 1px solid #ddd;">4.68</td>
503
- <td style="padding: 10px; border: 1px solid #ddd;">4.363</td>
504
- <td style="padding: 10px; border: 1px solid #ddd;"><b>2.63</b></td>
505
- <td style="padding: 10px; border: 1px solid #ddd;">4.342</td>
506
  </tr>
507
  <tr>
508
  <td style="padding: 10px; border: 1px solid #ddd;">VocalNet-8B (VA)</td>
509
- <td style="padding: 10px; border: 1px solid #ddd;"><u>2.65</u></td>
510
- <td style="padding: 10px; border: 1px solid #ddd;"><b>4.490</b></td>
511
- <td style="padding: 10px; border: 1px solid #ddd;">3.00</td>
512
- <td style="padding: 10px; border: 1px solid #ddd;"><b>4.503</b></td>
513
- <td style="padding: 10px; border: 1px solid #ddd;">5.02</td>
514
- <td style="padding: 10px; border: 1px solid #ddd;"><b>4.499</b></td>
515
- <td style="padding: 10px; border: 1px solid #ddd;"><u>4.21</u></td>
516
- <td style="padding: 10px; border: 1px solid #ddd;"><u>4.485</u></td>
517
- <td style="padding: 10px; border: 1px solid #ddd;">4.26</td>
518
- <td style="padding: 10px; border: 1px solid #ddd;"><b>4.493</b></td>
519
  </tr>
 
520
  <tr>
521
- <td style="padding: 10px; border: 1px solid #ddd;">VocalNet-8B</td>
522
- <td style="padding: 10px; border: 1px solid #ddd;">4.71</td>
523
- <td style="padding: 10px; border: 1px solid #ddd;"><u>4.489</u></td>
524
- <td style="padding: 10px; border: 1px solid #ddd;"><u>2.68</u></td>
525
- <td style="padding: 10px; border: 1px solid #ddd;"><u>4.500</u></td>
526
- <td style="padding: 10px; border: 1px solid #ddd;"><u>4.04</u></td>
527
- <td style="padding: 10px; border: 1px solid #ddd;"><u>4.482</u></td>
528
- <td style="padding: 10px; border: 1px solid #ddd;"><b>3.11</b></td>
529
- <td style="padding as: 10px; border: 1px solid #ddd;"><b>4.492</b></td>
530
- <td style="padding: 10px; border: 1px solid #ddd;"><u>3.56</u></td>
531
- <td style="padding: 10px; border: 1px solid #ddd;"><u>4.489</u></td>
532
  </tr>
 
533
  </tbody>
534
  </table>
535
  </div>
536
 
 
 
537
  ### ✍️ Citation
538
  If you find our work useful, please cite:
539
  ```bib
 
1
  ---
2
+ base_model:
3
+ - meta-llama/Llama-3.2-1B-Instruct
4
  datasets:
5
  - VocalNet/VoiceAssitant-430K-vocalnet
6
  - VocalNet/UltraChat-vocalnet
7
  language:
8
  - en
9
+ license: apache-2.0
10
+ pipeline_tag: audio-text-to-text
11
+ library_name: transformers
12
  ---
13
 
14
  ## 🎧 VocalNet-1B Model Card
 
279
  <td style="padding: 10px; border: 1px solid #ddd;"><u>6.22</u></td>
280
  </tr>
281
  <tr>
282
+ <td style="padding: 10px; border: 1px solid #ddd;">Minmo*</td>
283
+ <td style="padding: 10px; border: 1px solid #ddd;">8B</td>
284
+ <td>s→t</td>
285
+ <td>-</td>
286
+ <td>78.9</td>
287
+ <td>4.83</td>
288
+ <td>5.50</td>
289
  </tr>
290
  <tr>
291
+ <td>s→s</td>
292
+ <td><b>6.48<br></td>
293
+ <td>64.1</td>
294
+ <td>3.75</td>
295
+ <td>3.99</td>
296
  </tr>
297
  <tr>
298
+ <td style="padding: 10px; border: 1px solid #ddd;">Qwen2.5-Omni</td>
299
+ <td>s→t</td>
300
+ <td>6.01</td>
301
+ <td><u>79.0</u></td>
302
+ <td>5.89</td>
303
+ <td><u>6.88</u></td>
304
  </tr>
305
  <tr>
306
+ <td>s→s</td>
307
+ <td>5.73</td>
308
+ <td><b>76.3<br></td>
309
+ <td><u>5.59</u></td>
310
+ <td><b>6.70<br></td>
311
  </tr>
312
  <tr>
313
  <td style="padding: 10px; border: 1px solid #ddd;">VocalNet-8B (VA)</td>
314
+ <td><u>7.05</u></td>
315
+ <td><b>4.490<br></td>
316
+ <td>77.1</td>
317
+ <td>4.503</td>
318
+ <td>6.15</td>
319
+ <td><b>4.499<br></td>
320
+ <td><u>4.21</u></td>
321
+ <td><u>4.485</u></td>
322
+ <td>4.26</td>
323
+ <td><b>4.493<br></td>
324
+ </tr>
325
+ <tr>
326
+ <td>VocalNet-8B</td>
327
+ <td><b>7.12<br></td>
328
+ <td><u>4.489</u></td>
329
+ <td><b>79.5<br></td>
330
+ <td>4.500</td>
331
+ <td><u>6.24</u></td>
332
+ <td>4.482</td>
333
+ <td>3.11</td>
334
+ <td>4.492</td>
335
+ <td><u>3.56</u></td>
336
+ <td><u>4.489</u></td>
337
  </tr>
338
+ <thead>
339
  <tr>
340
+ <th class="tg-c3ow" colspan="11"></th>
341
  </tr>
342
+ </thead>
343
  </tbody>
344
  </table>
345
  </div>
346
 
347
+ </details>
348
+
349
  ### ✍️ Citation
350
  If you find our work useful, please cite:
351
  ```bib