
UNDER DEVELOPMENT

---
license: apache-2.0
tags:
- vision-language
- mixture-of-experts
- text-generation
- multimodal
- transformer
datasets:
- HuggingFaceH4/llava-instruct-mix-vsft
metrics:
- loss
---

Dumb AI - OptimizedMoEVisionModel

Model Overview

Dumb AI is a lightweight, multimodal Mixture-of-Experts (MoE) transformer model designed for vision and text generation tasks. Created by Damienchakma, it integrates a frozen CLIP vision encoder with a custom MoE text decoder to process images and generate text autoregressively. At approximately 321 million parameters (~170M trainable, ~151M frozen), it is comparable to models like GPT-2 Small (124M) or a minimally fine-tuned CLIP + small text decoder setup on specific multimodal tasks.

Key Features

  • Multimodal: Combines CLIP ViT-B/32 for vision with a text generation transformer.
  • MoE Efficiency: Uses 8 experts with top-2 routing for sparse computation.
  • Causal Attention: Supports autoregressive text generation (recently optimized); see the mask sketch after this list.
  • Lightweight: Optimized for resource-constrained environments (e.g., 2x Tesla T4 GPUs).
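
As a quick illustration of the causal attention used for autoregressive generation, here is a generic lower-triangular mask sketch; this is not code from the repo, just the standard construction.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: token i may only attend to tokens 0..i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```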

Model Details

Architecture

  • Vision Encoder: CLIP ViT-B/32 (frozen, ~151M parameters).
  • Text Decoder:
    • 6 transformer layers.
    • d_model: 768.
    • num_heads: 12.
    • MoE MLP with 8 experts, expert_dim: 512, top-2 routing (see the sketch after this list).
    • Rotary positional embeddings.
    • Causal attention for text generation.
  • Tokenizer: GPT-2 tokenizer (vocab_size: 50,257).
  • Total Parameters: 321M (170M trainable, ~151M frozen from CLIP).
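
The MoE MLP can be pictured with a short sketch. This is a minimal illustration of top-2 routing over 8 experts using the listed dimensions (d_model: 768, expert_dim: 512); the class and variable names are hypothetical and are not taken from the repo's OptimizedMoEVisionModel code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoEMLP(nn.Module):
    """Illustrative top-2 Mixture-of-Experts MLP (names and shapes assumed, not the repo's code)."""
    def __init__(self, d_model=768, expert_dim=512, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)      # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, expert_dim), nn.GELU(), nn.Linear(expert_dim, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                  # x: (batch, seq, d_model)
        logits = self.router(x)                            # (batch, seq, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)     # keep the 2 best experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                  # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Example: route a dummy batch of 4 tokens through the sparse MLP.
y = Top2MoEMLP()(torch.randn(1, 4, 768))
print(y.shape)  # torch.Size([1, 4, 768])
```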

Training

  • Dataset: Fine-tuned on 5,000 samples from HuggingFaceH4/llava-instruct-mix-vsft (train split).
  • Training Setup (mirrored in the configuration sketch after this list):
    • 3 epochs (~15,000 steps).
    • Batch size: 1 per device, gradient accumulation steps: 4 (effective batch size: 4).
    • Optimizer: AdamW, learning rate: 2e-4.
    • FP16 mixed precision.
    • Gradient checkpointing for memory efficiency.
  • Hardware: 2x NVIDIA Tesla T4 GPUs (15 GB VRAM each).
  • Training Time: Roughly several hours to a few days, depending on iteration speed (~0.1-0.5 it/s after optimization).
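
For reference, the reported hyperparameters map roughly onto a Hugging Face TrainingArguments configuration like the sketch below; the output directory and any unlisted options are assumptions, not taken from the actual training script.

```python
from transformers import TrainingArguments

# Sketch of the reported settings; paths and unlisted options are placeholders.
training_args = TrainingArguments(
    output_dir="./dumb-ai-checkpoints",   # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,        # effective batch size 4
    learning_rate=2e-4,
    fp16=True,                            # mixed precision
    gradient_checkpointing=True,          # trade compute for memory on the T4s
    optim="adamw_torch",
)
```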

Intended Use

  • Primary Task: Generate text descriptions from images or continue text prompts with visual context.
  • Use Cases: Image captioning, vision-guided storytelling, small-scale multimodal experiments.
  • Limitations: Not pre-trained on a large corpus, so it lacks broad language generalization compared to models like LLaMA or GPT-3.

Performance

  • Competitors: Comparable to GPT-2 Small (124M) or a fine-tuned CLIP + small decoder (~200-300M) on specific multimodal tasks.
  • Strengths: Efficient multimodal processing, tailored to the llava-instruct-mix-vsft dataset.
  • Weaknesses: Limited text generation quality due to small training data and no pre-training. Coherence may degrade beyond short sequences.

Metrics

  • Training Loss: ~36.17 at step 10 (early training, expected to decrease further).
  • Evaluation: No formal benchmarks yet; qualitative testing suggests task-specific competence.

Usage

Installation

pip install transformers datasets torch
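
A minimal inference sketch is shown below. Because the repo has no library tag, the custom OptimizedMoEVisionModel class has to be imported from the repository's own code; the import path, repo id, and generate signature here are assumptions based on this card, not a documented API. The CLIP image processor and GPT-2 tokenizer calls are standard transformers usage.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, GPT2Tokenizer

# Hypothetical import: the model class ships with the repo, not with transformers.
# from modeling_dumb_ai import OptimizedMoEVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

image = Image.open("example.jpg")                       # placeholder image path
pixel_values = processor(images=image, return_tensors="pt").pixel_values
prompt_ids = tokenizer("Describe the image:", return_tensors="pt").input_ids

# model = OptimizedMoEVisionModel.from_pretrained("Damienchakma/dumb-ai")  # hypothetical repo id
# with torch.no_grad():
#     output_ids = model.generate(pixel_values=pixel_values, input_ids=prompt_ids, max_new_tokens=50)
# print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```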