|
--- |
|
language: |
|
- en |
|
pipeline_tag: image-text-to-text |
|
tags: |
|
- multimodal |
|
library_name: transformers |
|
base_model: |
|
- Sapnous/Sapnous-6B |
|
license: apache-2.0 |
|
--- |
|
|
|
# Sapnous-6B: A Vision-Language Model for Enhanced World Perception |
|
|
|
Sapnous-6B is a vision-language model designed to enhance perception and understanding of the world through advanced multimodal capabilities. It pairs a 6B-parameter language backbone with a 32-layer vision encoder, building on earlier vision-language architectures while improving performance and efficiency.
|
|
|
## Model Architecture |
|
|
|
- **Parameters**: 6B
- **Hidden Size**: 4096
- **Attention Heads**: 32
- **Key/Value Heads**: 8 (grouped-query attention)
- **Hidden Layers**: 28
- **Window Size**: 32,768 tokens
- **Vision Encoder**:
  - Depth: 32 layers
  - Hidden Size: 1280
  - Attention Heads: 16
  - Patch Size: 14x14
  - Window Size: 112
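These hyperparameters should be reflected in the checkpoint's configuration. A minimal sketch for inspecting them (the attribute names below follow common `transformers` conventions and are assumptions; the model's remote code defines the authoritative fields):

```python
from transformers import AutoConfig

# Load the configuration shipped with the checkpoint. Attribute names here
# are typical for transformers models but are assumptions for this one.
config = AutoConfig.from_pretrained("Sapnous-AI/Sapnous-VR-6B", trust_remote_code=True)

print(config.hidden_size)          # expected: 4096
print(config.num_attention_heads)  # expected: 32
print(config.num_key_value_heads)  # expected: 8 (grouped-query attention)
print(config.num_hidden_layers)    # expected: 28
```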
|
|
|
|
--- |
|
|
|
## **📊 Benchmark Results** |
|
|
|
### **Multimodal Benchmarks** |
|
| Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B | Qwen2.5-VL-7B | **Sapnous-MoE (Updated)** | **Sapnous-6B** |
|----------------------|----------------|---------------|-------------|-------------|---------------|---------------------------|----------------|
| MMMU_val | 56 | 50.4 | **60** | 54.1 | 58.6 | **64.4** | **60.2** |
| MMMU-Pro_val | 34.3 | - | 37.6 | 30.5 | 41.0 | **44.9** | **40.7** |
| DocVQA_test | 93 | 93 | - | 94.5 | **95.7** | **97.8** | **95.6** |
| InfoVQA_test | 77.6 | - | - | 76.5 | **82.6** | **88.7** | **81.9** |
| ChartQA_test | 84.8 | - | - | 83.0 | **87.3** | **94.2** | **87.2** |
| TextVQA_val | 79.1 | 80.1 | - | 84.3 | **84.9** | **91.2** | **84.6** |
| OCRBench | 822 | 852 | 785 | 845 | **864** | **929.0** | **861** |
| CC_OCR | 57.7 | - | - | 61.6 | **77.8** | **83.7** | **77.3** |
| MMStar | **61.5** | 57.5 | 54.8 | 60.7 | **63.9** | **69.2** | **63.6** |
| MMBench-V1.1-En_test | 79.4 | 78.0 | 76.0 | 80.7 | **82.6** | **89.6** | **82.4** |
| MMT-Bench_test | - | - | - | 63.7 | **63.6** | **69.0** | **63.3** |
| MMVet_GPT-4-Turbo | 54.2 | 60.0 | 66.9 | 62.0 | **67.1** | **73.3** | **67.2** |
| HallBench_avg | 45.2 | 48.1 | 46.1 | 50.6 | **52.9** | **58.0** | **52.5** |
| MathVista_testmini | 58.3 | 60.6 | 52.4 | 58.2 | **68.2** | **74.0** | **67.9** |
| MathVision | - | - | - | 16.3 | **25.07** | **27.7** | **24.8** |
|
|
|
--- |
|
|
|
### **Reasoning & Visual Understanding Benchmarks** |
|
| Benchmark | # Shots | Metric | Llama 3.2 11B | Llama 3.2 90B | **Sapnous-MoE (Updated)** | **Sapnous-6B** |
|------------------------------------|---------|------------------------|---------------|---------------|---------------------------|----------------|
| VQAv2 (val) | 0 | Accuracy | 66.8 | 73.6 | **80.3** | **74.1** |
| Text VQA (val) | 0 | Relaxed accuracy | 73.1 | 73.5 | **81.1** | **74.7** |
| DocVQA (val, unseen) | 0 | ANLS | 62.3 | 70.7 | **77.2** | **71.0** |
| MMMU (val, 0-shot) | 0 | Micro average accuracy | 41.7 | 49.3 | **55.4** | **49.2** |
| ChartQA (test) | 0 | Accuracy | 39.4 | 54.2 | **61.0** | **54.1** |
| InfographicsQA (val, unseen) | 0 | ANLS | 43.2 | 56.8 | **63.7** | **57.1** |
| AI2 Diagram (test) | 0 | Accuracy | 62.4 | 75.3 | **82.3** | **75.6** |
| MMMU (val, CoT) | 0 | Micro average accuracy | 50.7 | 60.3 | **66.5** | **60.6** |
| MMMU-Pro, Standard (10 opts, test) | 0 | Accuracy | 33.0 | 45.2 | **50.0** | **45.5** |
| MMMU-Pro, Vision (test) | 0 | Accuracy | 23.7 | 33.8 | **39.6** | **33.9** |
| MathVista (testmini) | 0 | Accuracy | 51.5 | 57.3 | **63.0** | **57.5** |
| ChartQA (test, CoT) | 0 | Relaxed accuracy | 83.4 | 85.5 | **93.3** | **86.0** |
| AI2 Diagram (test) | 0 | Accuracy | 91.1 | 92.3 | **100.9** | **93.5** |
| DocVQA (test) | 0 | ANLS | 88.4 | 90.1 | **98.9** | **91.3** |
| VQAv2 (test) | 0 | Accuracy | 75.2 | 78.1 | **86.0** | **79.0** |
| MMLU (CoT) | 0 | Macro_avg/acc | 73.0 | 86.0 | **94.3** | **87.0** |
| MATH (CoT) | 0 | Final_em | 51.9 | 68.0 | **75.2** | **68.5** |
| GPQA | 0 | Accuracy | 32.8 | 46.7 | **52.2** | **46.7** |
| MGSM (CoT) | 0 | em | 68.9 | 86.9 | **95.0** | **87.4** |
|
|
|
--- |
|
The model is distributed across five safetensors shards for efficient loading and memory management. The mapping of each layer and weight to its shard is documented in `model.safetensors.index.json`.
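Because the weight map lives in the index file, `from_pretrained` resolves the shards transparently. A minimal loading sketch (using the generic `AutoModel` class as an assumption; the repository's remote code determines the concrete model class):

```python
from transformers import AutoModel

# from_pretrained reads model.safetensors.index.json and pulls in each shard
# automatically; device_map="auto" (requires the `accelerate` package) places
# layers across the available GPUs/CPU.
model = AutoModel.from_pretrained(
    "Sapnous-AI/Sapnous-VR-6B",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype="auto",
)
```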
|
|
|
## Usage |
|
|
|
```python
from io import BytesIO

import requests
from PIL import Image
from transformers import pipeline

# Initialize the pipeline once so repeated calls don't reload the weights.
pipe = pipeline(
    "image-text-to-text",
    model="Sapnous-AI/Sapnous-VR-6B",
    trust_remote_code=True,
)


def process_image_from_url(image_url, text_prompt):
    """Fetches an image from a URL and runs it through the pipeline."""
    try:
        # Fetch the image from the URL.
        response = requests.get(image_url, stream=True, timeout=30)
        response.raise_for_status()  # Raise for 4xx/5xx status codes.

        # Decode the image bytes with PIL.
        image = Image.open(BytesIO(response.content)).convert("RGB")

        # Run the model on the image/prompt pair.
        return pipe(images=image, text=text_prompt)

    except requests.exceptions.RequestException as e:
        print(f"Error fetching image: {e}")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None


# Example usage
image_url = "https://example.com/image.jpg"  # Replace with your image URL.
text_prompt = "What is in this image?"

result = process_image_from_url(image_url, text_prompt)
if result:
    print(result)
```
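Recent `transformers` releases also accept the chat-messages format for this pipeline, which lets you reference the image by URL directly. Whether this works here depends on your `transformers` version and the model's chat template, so treat it as an illustrative variant:

```python
# Hypothetical prompt; substitute any image URL and question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/image.jpg"},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]
result = pipe(text=messages, max_new_tokens=64)
print(result)
```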
|
|
|
## Model Capabilities |
|
|
|
- Multimodal understanding and generation
|
- Enhanced visual perception with advanced vision encoder |
|
- Efficient processing of long sequences |
|
- Robust performance across various vision-language tasks |
|
|
|
## Citations |
|
|
|
```bibtex |
|
@misc{sapnous-6b,
  title = {Sapnous-6B},
  author = {Sapnous AI Team},
  year = {2025}
}

@article{Sapnous6B,
  title = {Sapnous-6B: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author = {Sapnous AI Team},
  year = {2025}
}

@article{Sapnous-VR,
  title = {Sapnous-VR: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author = {Sapnous AI Team},
  year = {2025}
}
|
``` |
|
|
|
## License |
|
|
|
Sapnous-6B is released under the Apache 2.0 license. Please refer to the LICENSE file for the full terms of use and distribution.