nanoVLM is a minimal and lightweight Vision-Language Model (VLM) designed for efficient training and experimentation. Built in pure PyTorch, the entire model architecture and training logic fit within ~750 lines of code. It combines a ViT-based image encoder (SigLIP-B/16-512-86M) with a lightweight causal language model (SmolLM2-135M), resulting in a compact 256M-parameter model. The model achieves ~x% accuracy on MMStar after training for 6 hours on a single H100 GPU with 1.7M samples from the the_cauldron dataset, making it a strong baseline for low-resource VLM research.
The model is ideal for researchers and developers interested in exploring VLM training with minimal computational overhead, and serves as a perfect starting point for tinkering with multimodal architectures.
Model Architecture:
- Vision Transformer (SigLIP-B/16)
- Causal Language Model (SmolLM2)
- Modality Projection Layer
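To illustrate how these three components fit together, here is a minimal, self-contained PyTorch sketch. The class and module names are placeholders and the backbones are simplified stand-ins, not the repository's actual implementation (the real SigLIP-B/16 encoder and SmolLM2-135M decoder live in the repository's models/ package); only the widths and vocabulary size are taken from the referenced backbones.

```python
# Illustrative composition of a nanoVLM-style model; the modules here are simplified stand-ins.
import torch
import torch.nn as nn

VIT_DIM, LM_DIM, VOCAB = 768, 576, 49152  # SigLIP-B/16 width, SmolLM2-135M width, SmolLM2 vocab size

class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for the ViT image encoder: split the image into 16x16 patches and embed them.
        self.vision_encoder = nn.Sequential(nn.Conv2d(3, VIT_DIM, kernel_size=16, stride=16), nn.Flatten(2))
        # Modality projection: map image patch embeddings into the language model's embedding space.
        self.modality_projector = nn.Linear(VIT_DIM, LM_DIM)
        # Stand-in for the causal language model (no causal mask here; shapes only).
        self.text_embedding = nn.Embedding(VOCAB, LM_DIM)
        layer = nn.TransformerEncoderLayer(d_model=LM_DIM, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(LM_DIM, VOCAB)

    def forward(self, pixel_values, input_ids):
        patches = self.vision_encoder(pixel_values).transpose(1, 2)  # (B, num_patches, VIT_DIM)
        image_tokens = self.modality_projector(patches)              # (B, num_patches, LM_DIM)
        text_tokens = self.text_embedding(input_ids)                 # (B, seq_len, LM_DIM)
        sequence = torch.cat([image_tokens, text_tokens], dim=1)     # image tokens prefix the text
        return self.lm_head(self.decoder(sequence))                  # next-token logits

logits = ToyVLM()(torch.randn(1, 3, 224, 224), torch.randint(0, VOCAB, (1, 8)))
print(logits.shape)  # torch.Size([1, 204, 49152]): 196 image tokens + 8 text tokens
```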
Training:
- Trained on ~1.7M samples from the the_cauldron dataset
- 6 hours on a single NVIDIA H100 GPU
- Resulting model size: 256M parameters
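The training data is available on the Hugging Face Hub as HuggingFaceM4/the_cauldron, organized into task-specific subsets. A minimal loading sketch is below; the vqav2 subset name is only an example and does not describe the exact training mixture.

```python
from datasets import load_dataset

# the_cauldron is organized into many subsets; "vqav2" is used here purely as an example.
ds = load_dataset("HuggingFaceM4/the_cauldron", "vqav2", split="train")
print(ds[0].keys())  # each sample pairs one or more images with question/answer style texts
```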
Evaluation:
- MMStar Accuracy: ~x%
Usage:
The model is usable through the nanoVLM repository: https://github.com/huggingface/nanoVLM. Clone the repository and run from its root so that the models package used below can be imported.
```python
from huggingface_hub import hf_hub_download
# These imports assume the nanoVLM repository layout (its models/ package); run from a clone of the repo.
from models.vision_language_model import VisionLanguageModel as VLM
from models import config as cfg

path_to_hf_file = hf_hub_download(repo_id="lusxvr/nanoVLM-256M", filename="nanoVLM-256M.pth")
model = VLM(cfg.VLMConfig())
model.load_checkpoint(path_to_hf_file)
```
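For generating text with the loaded model, the repository's generate.py is the reference implementation; the sketch below only approximates that flow. The processor checkpoints (HuggingFaceTB/SmolLM2-135M, google/siglip-base-patch16-512), the example image path, and the model.generate(...) call and its arguments are assumptions here and may differ from the current repository API.

```python
# Rough inference sketch; consult generate.py in the nanoVLM repository for the exact API.
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoImageProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Assumption: reuse the backbones' Hub processors for tokenization and image preprocessing.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
image_processor = AutoImageProcessor.from_pretrained("google/siglip-base-patch16-512")

pixel_values = image_processor(images=Image.open("example.jpg"), return_tensors="pt")["pixel_values"].to(device)
input_ids = tokenizer("What is in this image?", return_tensors="pt").input_ids.to(device)

with torch.no_grad():
    generated = model.generate(input_ids, pixel_values, max_new_tokens=20)  # assumption: signature may differ
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```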