|
--- |
|
language: |
|
- en |
|
pipeline_tag: image-text-to-text |
|
tags: |
|
- multimodal |
|
library_name: transformers |
|
base_model: |
|
- Sapnous/Sapnous-6B |
|
license: apache-2.0 |
|
--- |
|
|
|
# Sapnous-6B: A Vision-Language Model for Enhanced World Perception |
|
|
|
Sapnous-6B is a vision-language model designed to enhance perception and understanding of the world through advanced multimodal capabilities. It pairs a 6B-parameter language backbone with a 32-layer vision encoder, building on earlier vision-language architectures while improving performance and efficiency.
|
|
|
## Model Architecture |
|
|
|
- **Parameters**: 6B
- **Hidden Size**: 4096
- **Attention Heads**: 32
- **Key/Value Heads**: 8 (grouped-query attention)
- **Hidden Layers**: 28
- **Window Size**: 32,768 tokens
- **Vision Encoder**:
  - Depth: 32 layers
  - Hidden Size: 1280
  - Attention Heads: 16
  - Patch Size: 14x14
  - Window Size: 112
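These hyperparameters should be reflected in the checkpoint's configuration. A minimal sketch for inspecting them (the attribute names below follow common `transformers` conventions and are assumptions; the model's remote code defines the authoritative fields):

```python
from transformers import AutoConfig

# Load the configuration shipped with the checkpoint. Attribute names here
# are typical for transformers models but are assumptions for this one.
config = AutoConfig.from_pretrained("Sapnous-AI/Sapnous-VR-6B", trust_remote_code=True)

print(config.hidden_size)          # expected: 4096
print(config.num_attention_heads)  # expected: 32
print(config.num_key_value_heads)  # expected: 8 (grouped-query attention)
print(config.num_hidden_layers)    # expected: 28
```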
|
|
|
|
--- |
|
|
|
## **📊 Benchmark Results** |
|
|
|
### **Multimodal Benchmarks** |
|
| Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B | Qwen2.5-VL-7B | **Sapnous-MoE (Updated)** | **Sapnous-6B** |
|----------------------|----------------|---------------|-------------|-------------|---------------|---------------------------|----------------|
| MMMU_val | 56 | 50.4 | **60** | 54.1 | 58.6 | **64.4** | **60.2** |
| MMMU-Pro_val | 34.3 | - | 37.6 | 30.5 | 41.0 | **44.9** | **40.7** |
| DocVQA_test | 93 | 93 | - | 94.5 | **95.7** | **97.8** | **95.6** |
| InfoVQA_test | 77.6 | - | - | 76.5 | **82.6** | **88.7** | **81.9** |
| ChartQA_test | 84.8 | - | - | 83.0 | **87.3** | **94.2** | **87.2** |
| TextVQA_val | 79.1 | 80.1 | - | 84.3 | **84.9** | **91.2** | **84.6** |
| OCRBench | 822 | 852 | 785 | 845 | **864** | **929.0** | **861** |
| CC_OCR | 57.7 | - | - | 61.6 | **77.8** | **83.7** | **77.3** |
| MMStar | **61.5** | 57.5 | 54.8 | 60.7 | **63.9** | **69.2** | **63.6** |
| MMBench-V1.1-En_test | 79.4 | 78.0 | 76.0 | 80.7 | **82.6** | **89.6** | **82.4** |
| MMT-Bench_test | - | - | - | 63.7 | **63.6** | **69.0** | **63.3** |
| MMVet_GPT-4-Turbo | 54.2 | 60.0 | 66.9 | 62.0 | **67.1** | **73.3** | **67.2** |
| HallBench_avg | 45.2 | 48.1 | 46.1 | 50.6 | **52.9** | **58.0** | **52.5** |
| MathVista_testmini | 58.3 | 60.6 | 52.4 | 58.2 | **68.2** | **74.0** | **67.9** |
| MathVision | - | - | - | 16.3 | **25.07** | **27.7** | **24.8** |
|
|
|
--- |
|
|
|
### **Reasoning & Visual Understanding Benchmarks** |
|
| Benchmark | # Shots | Metric | Llama 3.2 11B | Llama 3.2 90B | **Sapnous-MoE (Updated)** | **Sapnous-6B** |
|------------------------------------|---------|------------------------|---------------|---------------|---------------------------|----------------|
| VQAv2 (val) | 0 | Accuracy | 66.8 | 73.6 | **80.3** | **74.1** |
| Text VQA (val) | 0 | Relaxed accuracy | 73.1 | 73.5 | **81.1** | **74.7** |
| DocVQA (val, unseen) | 0 | ANLS | 62.3 | 70.7 | **77.2** | **71.0** |
| MMMU (val, 0-shot) | 0 | Micro average accuracy | 41.7 | 49.3 | **55.4** | **49.2** |
| ChartQA (test) | 0 | Accuracy | 39.4 | 54.2 | **61.0** | **54.1** |
| InfographicsQA (val, unseen) | 0 | ANLS | 43.2 | 56.8 | **63.7** | **57.1** |
| AI2 Diagram (test) | 0 | Accuracy | 62.4 | 75.3 | **82.3** | **75.6** |
| MMMU (val, CoT) | 0 | Micro average accuracy | 50.7 | 60.3 | **66.5** | **60.6** |
| MMMU-Pro, Standard (10 opts, test) | 0 | Accuracy | 33.0 | 45.2 | **50.0** | **45.5** |
| MMMU-Pro, Vision (test) | 0 | Accuracy | 23.7 | 33.8 | **39.6** | **33.9** |
| MathVista (testmini) | 0 | Accuracy | 51.5 | 57.3 | **63.0** | **57.5** |
| ChartQA (test, CoT) | 0 | Relaxed accuracy | 83.4 | 85.5 | **93.3** | **86.0** |
| AI2 Diagram (test) | 0 | Accuracy | 91.1 | 92.3 | **100.9** | **93.5** |
| DocVQA (test) | 0 | ANLS | 88.4 | 90.1 | **98.9** | **91.3** |
| VQAv2 (test) | 0 | Accuracy | 75.2 | 78.1 | **86.0** | **79.0** |
| MMLU (CoT) | 0 | Macro_avg/acc | 73.0 | 86.0 | **94.3** | **87.0** |
| MATH (CoT) | 0 | Final_em | 51.9 | 68.0 | **75.2** | **68.5** |
| GPQA | 0 | Accuracy | 32.8 | 46.7 | **52.2** | **46.7** |
| MGSM (CoT) | 0 | em | 68.9 | 86.9 | **95.0** | **87.4** |
|
|
|
--- |
|
The model is distributed across five safetensors shards for efficient loading and memory management. The mapping of each layer and weight to its shard is documented in `model.safetensors.index.json`.
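Because the weight map lives in the index file, `from_pretrained` resolves the shards transparently. A minimal loading sketch (using the generic `AutoModel` class as an assumption; the repository's remote code determines the concrete model class):

```python
from transformers import AutoModel

# from_pretrained reads model.safetensors.index.json and pulls in each shard
# automatically; device_map="auto" (requires the `accelerate` package) places
# layers across the available GPUs/CPU.
model = AutoModel.from_pretrained(
    "Sapnous-AI/Sapnous-VR-6B",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype="auto",
)
```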
|
|
|
## Usage |
|
|
|
```python
from io import BytesIO

import requests
from PIL import Image
from transformers import pipeline

# Initialize the pipeline once so repeated calls don't reload the weights.
pipe = pipeline(
    "image-text-to-text",
    model="Sapnous-AI/Sapnous-VR-6B",
    trust_remote_code=True,
)


def process_image_from_url(image_url, text_prompt):
    """Fetches an image from a URL and runs it through the pipeline."""
    try:
        # Fetch the image from the URL.
        response = requests.get(image_url, stream=True, timeout=30)
        response.raise_for_status()  # Raise for 4xx/5xx status codes.

        # Decode the image bytes with PIL.
        image = Image.open(BytesIO(response.content)).convert("RGB")

        # Run the model on the image/prompt pair.
        return pipe(images=image, text=text_prompt)

    except requests.exceptions.RequestException as e:
        print(f"Error fetching image: {e}")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None


# Example usage
image_url = "https://example.com/image.jpg"  # Replace with your image URL.
text_prompt = "What is in this image?"

result = process_image_from_url(image_url, text_prompt)
if result:
    print(result)
```
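Recent `transformers` releases also accept the chat-messages format for this pipeline, which lets you reference the image by URL directly. Whether this works here depends on your `transformers` version and the model's chat template, so treat it as an illustrative variant:

```python
# Hypothetical prompt; substitute any image URL and question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/image.jpg"},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]
result = pipe(text=messages, max_new_tokens=64)
print(result)
```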
|
|
|
## Model Capabilities |
|
|
|
- Multimodal understanding and generation
|
- Enhanced visual perception with advanced vision encoder |
|
- Efficient processing of long sequences |
|
- Robust performance across various vision-language tasks |
|
|
|
## Citations |
|
|
|
```bibtex |
|
@misc{sapnous-6b,
  title = {Sapnous-6B},
  author = {Sapnous AI Team},
  year = {2025}
}

@article{Sapnous6B,
  title = {Sapnous-6B: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author = {Sapnous AI Team},
  year = {2025}
}

@article{Sapnous-VR,
  title = {Sapnous-VR: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author = {Sapnous AI Team},
  year = {2025}
}
|
``` |
|
|
|
## License |
|
|
|
Sapnous-6B is released under the Apache 2.0 license. Please refer to the LICENSE file for the full terms of use and distribution.