
UNDER DEVELOPMENT

---
license: apache-2.0
tags:
- vision-language
- mixture-of-experts
- text-generation
- multimodal
- transformer
datasets:
- HuggingFaceH4/llava-instruct-mix-vsft
metrics:
- loss
---

Dumb AI - OptimizedMoEVisionModel

Model Overview

Dumb AI is a lightweight, multimodal Mixture-of-Experts (MoE) transformer model designed for vision and text generation tasks. Created by Damienchakma, it integrates a frozen CLIP vision encoder with a custom MoE text decoder to process images and generate text autoregressively. At approximately 321 million parameters (~170M trainable, ~151M frozen), it is comparable to models like GPT-2 Small (124M) or a minimally fine-tuned CLIP + small text decoder setup on specific multimodal tasks.

Key Features

  • Multimodal: Combines CLIP ViT-B/32 for vision with a text generation transformer.
  • MoE Efficiency: Uses 8 experts with top-2 routing for sparse computation.
  • Causal Attention: Supports autoregressive text generation (recently optimized); see the mask sketch after this list.
  • Lightweight: Optimized for resource-constrained environments (e.g., 2x Tesla T4 GPUs).
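
As a quick illustration of the causal attention used for autoregressive generation, here is a generic lower-triangular mask sketch; this is not code from the repo, just the standard construction.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: token i may only attend to tokens 0..i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```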

Model Details

Architecture

  • Vision Encoder: CLIP ViT-B/32 (frozen, ~151M parameters).
  • Text Decoder:
    • 6 transformer layers.
    • d_model: 768.
    • num_heads: 12.
    • MoE MLP with 8 experts, expert_dim: 512, top-2 routing (see the sketch after this list).
    • Rotary positional embeddings.
    • Causal attention for text generation.
  • Tokenizer: GPT-2 tokenizer (vocab_size: 50,257).
  • Total Parameters: 321M (170M trainable, ~151M frozen from CLIP).
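
The MoE MLP can be pictured with a short sketch. This is a minimal illustration of top-2 routing over 8 experts using the listed dimensions (d_model: 768, expert_dim: 512); the class and variable names are hypothetical and are not taken from the repo's OptimizedMoEVisionModel code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoEMLP(nn.Module):
    """Illustrative top-2 Mixture-of-Experts MLP (names and shapes assumed, not the repo's code)."""
    def __init__(self, d_model=768, expert_dim=512, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)      # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, expert_dim), nn.GELU(), nn.Linear(expert_dim, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                  # x: (batch, seq, d_model)
        logits = self.router(x)                            # (batch, seq, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)     # keep the 2 best experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                  # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Example: route a dummy batch of 4 tokens through the sparse MLP.
y = Top2MoEMLP()(torch.randn(1, 4, 768))
print(y.shape)  # torch.Size([1, 4, 768])
```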

Training

  • Dataset: Fine-tuned on 5,000 samples from HuggingFaceH4/llava-instruct-mix-vsft (train split).
  • Training Setup (mirrored in the configuration sketch after this list):
    • 3 epochs (~15,000 steps).
    • Batch size: 1 per device, gradient accumulation steps: 4 (effective batch size: 4).
    • Optimizer: AdamW, learning rate: 2e-4.
    • FP16 mixed precision.
    • Gradient checkpointing for memory efficiency.
  • Hardware: 2x NVIDIA Tesla T4 GPUs (15 GB VRAM each).
  • Training Time: Roughly several hours to a few days, depending on iteration speed (~0.1-0.5 it/s after optimization).
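
For reference, the reported hyperparameters map roughly onto a Hugging Face TrainingArguments configuration like the sketch below; the output directory and any unlisted options are assumptions, not taken from the actual training script.

```python
from transformers import TrainingArguments

# Sketch of the reported settings; paths and unlisted options are placeholders.
training_args = TrainingArguments(
    output_dir="./dumb-ai-checkpoints",   # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,        # effective batch size 4
    learning_rate=2e-4,
    fp16=True,                            # mixed precision
    gradient_checkpointing=True,          # trade compute for memory on the T4s
    optim="adamw_torch",
)
```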

Intended Use

  • Primary Task: Generate text descriptions from images or continue text prompts with visual context.
  • Use Cases: Image captioning, vision-guided storytelling, small-scale multimodal experiments.
  • Limitations: Not pre-trained on a large corpus, so it lacks broad language generalization compared to models like LLaMA or GPT-3.

Performance

  • Competitors: Comparable to GPT-2 Small (124M) or a fine-tuned CLIP + small decoder (~200-300M) on specific multimodal tasks.
  • Strengths: Efficient multimodal processing, tailored to the llava-instruct-mix-vsft dataset.
  • Weaknesses: Limited text generation quality due to small training data and no pre-training. Coherence may degrade beyond short sequences.

Metrics

  • Training Loss: ~36.17 at step 10 (early training, expected to decrease further).
  • Evaluation: No formal benchmarks yet; qualitative testing suggests task-specific competence.

Usage

Installation

pip install transformers datasets torch
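
A minimal inference sketch is shown below. Because the repo has no library tag, the custom OptimizedMoEVisionModel class has to be imported from the repository's own code; the import path, repo id, and generate signature here are assumptions based on this card, not a documented API. The CLIP image processor and GPT-2 tokenizer calls are standard transformers usage.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, GPT2Tokenizer

# Hypothetical import: the model class ships with the repo, not with transformers.
# from modeling_dumb_ai import OptimizedMoEVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

image = Image.open("example.jpg")                       # placeholder image path
pixel_values = processor(images=image, return_tensors="pt").pixel_values
prompt_ids = tokenizer("Describe the image:", return_tensors="pt").input_ids

# model = OptimizedMoEVisionModel.from_pretrained("Damienchakma/dumb-ai")  # hypothetical repo id
# with torch.no_grad():
#     output_ids = model.generate(pixel_values=pixel_values, input_ids=prompt_ids, max_new_tokens=50)
# print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```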