---
license: mit
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
base_model:
- Sapnous/Sapnous-6B
---

![icon.png](https://cdn-uploads.huggingface.co/production/uploads/675d3ca88d0f15d76e49d5ea/YhcU9ACkEsJXPAgQZz1bX.png)

# Sapnous-6B: A Vision-Language Model for Enhanced World Perception

Sapnous-6B is a state-of-the-art vision-language model designed to enhance perception and understanding of the world through advanced multimodal capabilities. It builds on previous vision-language architectures while introducing improvements in performance and efficiency.

## Model Architecture

- **Base Architecture**: 6B parameters
- **Hidden Size**: 4096
- **Attention Heads**: 32
- **Key/Value Heads**: 8
- **Hidden Layers**: 28
- **Window Size**: 32768
- **Vision Encoder**:
  - Depth: 32 layers
  - Hidden Size: 1280
  - Attention Heads: 16
  - Patch Size: 14x14
  - Window Size: 112
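To make the layout above concrete, the same hyperparameters are collected below into a transformers-style configuration dictionary. This is a minimal sketch for illustration only: the field names (`num_key_value_heads`, `sliding_window`, `vision_config`, and so on) are assumptions following common Hugging Face conventions, not the model's actual config schema.

```python
# Illustrative sketch only: the architecture hyperparameters listed above,
# arranged as a transformers-style config dict. Field names are assumptions,
# not the model's actual configuration schema.
SAPNOUS_6B_CONFIG = {
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,      # grouped-query attention: 8 KV heads shared by 32 query heads
    "num_hidden_layers": 28,
    "sliding_window": 32768,       # attention window size
    "vision_config": {
        "depth": 32,               # vision encoder layers
        "hidden_size": 1280,
        "num_heads": 16,
        "patch_size": 14,          # 14x14 pixel patches
        "window_size": 112,
    },
}
```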
## **📊 Benchmark Results**

### **Multimodal Benchmarks**

| Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B | Qwen2.5-VL-7B | **Sapnous-MoE** | **Sapnous-6B** |
|----------------------------|---------------|--------------|-------------|-------------|---------------|---------------|---------------|
| MMMU_val | 56 | 50.4 | **60** | 54.1 | 58.6 | **61.3** | **60.2** |
| MMMU-Pro_val | 34.3 | - | 37.6 | 30.5 | 41.0 | **41.9** | **40.7** |
| DocVQA_test | 93 | 93 | - | 94.5 | **95.7** | **96.8** | **95.6** |
| InfoVQA_test | 77.6 | - | - | 76.5 | **82.6** | **83.2** | **81.9** |
| ChartQA_test | 84.8 | - | - | 83.0 | **87.3** | **88.5** | **87.2** |
| TextVQA_val | 79.1 | 80.1 | - | 84.3 | **84.9** | **85.8** | **84.6** |
| OCRBench | 822 | 852 | 785 | 845 | **864** | **872** | **861** |
| CC_OCR | 57.7 | - | - | 61.6 | **77.8** | **78.5** | **77.3** |
| MMStar | 62.8 | - | - | 60.7 | **63.9** | **64.9** | **63.6** |
| MMBench-V1.1-En_test | 79.4 | 78.0 | 76.0 | 80.7 | **82.6** | **83.7** | **82.4** |
| MMT-Bench_test | - | - | - | 63.7 | **63.6** | **64.5** | **63.3** |
| MMStar | **61.5** | 57.5 | 54.8 | 60.7 | **63.9** | **64.9** | **63.6** |
| MMVet_GPT-4-Turbo | 54.2 | 60.0 | 66.9 | 62.0 | **67.1** | **68.5** | **67.2** |
| HallBench_avg | 45.2 | 48.1 | 46.1 | 50.6 | **52.9** | **53.8** | **52.5** |
| MathVista_testmini | 58.3 | 60.6 | 52.4 | 58.2 | **68.2** | **69.1** | **67.9** |
| MathVision | - | - | - | 16.3 | **25.07** | **25.9** | **24.8** |

---

### **Reasoning & Visual Understanding Benchmarks**

| Benchmark | # Shots | Metric | Llama 3.2 11B | Llama 3.2 90B | **Sapnous-MoE** | **Sapnous-6B** |
|----------------------------|---------|--------------------------|--------------|--------------|--------------|--------------|
| VQAv2 (val) | 0 | Accuracy | 66.8 | 73.6 | **75.3** | **74.1** |
| Text VQA (val) | 0 | Relaxed accuracy | 73.1 | 73.5 | **75.9** | **74.7** |
| DocVQA (val, unseen) | 0 | ANLS | 62.3 | 70.7 | **72.1** | **71.0** |
| MMMU (val, 0-shot) | 0 | Micro average accuracy | 41.7 | 49.3 | **50.4** | **49.2** |
| ChartQA (test) | 0 | Accuracy | 39.4 | 54.2 | **55.3** | **54.1** |
| InfographicsQA (val, unseen) | 0 | ANLS | 43.2 | 56.8 | **58.3** | **57.1** |
| AI2 Diagram (test) | 0 | Accuracy | 62.4 | 75.3 | **76.9** | **75.6** |
| MMMU (val, CoT) | 0 | Micro average accuracy | 50.7 | 60.3 | **61.9** | **60.6** |
| MMMU-Pro, Standard (10 opts, test) | 0 | Accuracy | 33.0 | 45.2 | **46.7** | **45.5** |
| MMMU-Pro, Vision (test) | 0 | Accuracy | 23.7 | 33.8 | **35.1** | **33.9** |
| MathVista (testmini) | 0 | Accuracy | 51.5 | 57.3 | **58.8** | **57.5** |
| ChartQA (test, CoT) | 0 | Relaxed accuracy | 83.4 | 85.5 | **87.2** | **86.0** |
| AI2 Diagram (test) | 0 | Accuracy | 91.1 | 92.3 | **94.8** | **93.5** |
| DocVQA (test) | 0 | ANLS | 88.4 | 90.1 | **92.5** | **91.3** |
| VQAv2 (test) | 0 | Accuracy | 75.2 | 78.1 | **80.2** | **79.0** |
| MMLU (CoT) | 0 | Macro_avg/acc | 73.0 | 86.0 | **88.2** | **87.0** |
| MATH (CoT) | 0 | Final_em | 51.9 | 68.0 | **69.7** | **68.5** |
| GPQA | 0 | Accuracy | 32.8 | 46.7 | **47.9** | **46.7** |
| MGSM (CoT) | 0 | em | 68.9 | 86.9 | **88.7** | **87.4** |

---

## Model Structure

The model weights are distributed across five safetensors files for efficient loading and memory management. The mapping from each weight tensor to its shard file is documented in `model.safetensors.index.json`.

## Usage

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Load model and processor
model = AutoModelForCausalLM.from_pretrained("path/to/Sapnous-6B")
processor = AutoProcessor.from_pretrained("path/to/Sapnous-6B")

# Prepare inputs: an image plus a text prompt
image = Image.open("path/to/image.jpg")
prompt = "Describe this image."
inputs = processor(images=image, text=prompt, return_tensors="pt")

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens from each sequence before decoding
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

## Model Capabilities

- Multi-modal understanding and generation
- Enhanced visual perception with an advanced vision encoder
- Efficient processing of long sequences
- Robust performance across a range of vision-language tasks

## Citations

```bibtex
@misc{sapnous-6b,
  title  = {Sapnous-6B},
  author = {Sapnous AI Team},
  year   = {2025}
}

@article{Sapnous6B,
  title  = {Sapnous-6B: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author = {Sapnous AI Team},
  year   = {2025}
}

@article{Sapnous-VR,
  title  = {Sapnous-VR: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author = {Sapnous AI Team},
  year   = {2025}
}
```

## License

Please refer to the LICENSE file for terms of use and distribution.