---
license: apache-2.0
tags:
- vision-language
- mixture-of-experts
- text-generation
- multimodal
- transformer
datasets:
- HuggingFaceH4/llava-instruct-mix-vsft
metrics:
- loss
---

UNDER DEVELOPMENT
# Dumb AI - OptimizedMoEVisionModel

## Model Overview
Dumb AI is a lightweight, multimodal Mixture of Experts (MoE) transformer model designed for vision and text generation tasks. Created by Damienchakma, it integrates a frozen CLIP vision encoder with a custom MoE architecture to process images and generate text autoregressively. With approximately 321 million parameters (~170M trainable, ~151M frozen), it competes with models like GPT-2 Small (124M) or a minimally fine-tuned CLIP + small text decoder setup for specific multimodal tasks.
## Key Features
- Multimodal: Combines CLIP ViT-B/32 for vision with a text generation transformer.
- MoE Efficiency: Uses 8 experts with top-2 routing for sparse computation.
- Causal Attention: Supports autoregressive text generation (recently optimized).
- Lightweight: Optimized for resource-constrained environments (e.g., 2x Tesla T4 GPUs).
## Model Details

### Architecture
- Vision Encoder: CLIP ViT-B/32 (frozen, ~151M parameters).
- Text Decoder:
  - 6 transformer layers.
  - `d_model`: 768.
  - `num_heads`: 12.
  - MoE MLP with 8 experts, `expert_dim`: 512, top-2 routing (see the sketch after this list).
  - Rotary positional embeddings.
  - Causal attention for text generation.
- Tokenizer: GPT-2 tokenizer (`vocab_size`: 50,257).
- Total Parameters: ~321M (~170M trainable, ~151M frozen from CLIP).
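For concreteness, the block below is a minimal sketch of a top-2 MoE MLP using the dimensions listed above (`d_model=768`, `expert_dim=512`, 8 experts). The class name `MoEMLP` and its internals are illustrative assumptions, not the released `OptimizedMoEVisionModel` code, which may differ (e.g., load-balancing losses or fused expert kernels).

```python
# Illustrative top-2 mixture-of-experts MLP block (assumed structure, not the released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEMLP(nn.Module):
    def __init__(self, d_model=768, expert_dim=512, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # token-wise routing logits
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, expert_dim),
                nn.GELU(),
                nn.Linear(expert_dim, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (batch, seq, d_model)
        logits = self.router(x)                              # (batch, seq, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)   # keep the two best experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., k] == e                  # tokens routed to expert e at rank k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Because only the top two experts run per token, per-token compute stays close to that of a single dense 768→512→768 MLP, even though eight experts' worth of parameters are stored.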
### Training
- Dataset: Fine-tuned on 5,000 samples from `HuggingFaceH4/llava-instruct-mix-vsft` (train split).
- Training Setup (see the configuration sketch after this list):
- 3 epochs (~15,000 steps).
- Batch size: 1 per device, gradient accumulation steps: 4 (effective batch size: 4).
- Optimizer: AdamW, learning rate: 2e-4.
- FP16 mixed precision.
- Gradient checkpointing for memory efficiency.
- Hardware: 2x NVIDIA Tesla T4 GPUs (15 GB VRAM each).
- Training Time: Roughly several hours to a few days, depending on iteration speed (~0.1-0.5 it/s after optimization).
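The training script itself is not published with this card, so the following is only a sketch of how the listed hyperparameters map onto the standard `datasets`/`transformers` APIs; the output directory and logging interval are placeholders.

```python
# Sketch only: mirrors the hyperparameters listed above using the standard Trainer API.
# The actual training script for this model is not included here.
from datasets import load_dataset
from transformers import TrainingArguments

# 5,000 samples from the train split, as described above
train_ds = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft", split="train").select(range(5000))

training_args = TrainingArguments(
    output_dir="dumb-ai-moe",        # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective batch size 4
    learning_rate=2e-4,
    optim="adamw_torch",             # AdamW
    fp16=True,                       # mixed precision
    gradient_checkpointing=True,     # trade compute for memory on the T4s
    logging_steps=10,                # placeholder
)
```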
## Intended Use
- Primary Task: Generate text descriptions from images or continue text prompts with visual context.
- Use Cases: Image captioning, vision-guided storytelling, small-scale multimodal experiments.
- Limitations: Not pre-trained on a large corpus, so it lacks broad language generalization compared to models like LLaMA or GPT-3.
## Performance
- Competitors: Comparable to GPT-2 Small (124M) or a fine-tuned CLIP + small decoder (~200-300M) on specific multimodal tasks.
- Strengths: Efficient multimodal processing, tailored to the `llava-instruct-mix-vsft` dataset.
- Weaknesses: Limited text generation quality due to small training data and no pre-training. Coherence may degrade beyond short sequences.
### Metrics
- Training Loss: ~36.17 at step 10 (early training, expected to decrease further).
- Evaluation: No formal benchmarks yet; qualitative testing suggests task-specific competence.
## Usage

### Installation
```bash
pip install transformers datasets torch
```
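### Inference

The repository has no library tag, so the model cannot be loaded through a standard `AutoModel` entry point and the exact inference API is not documented here. The snippet below is only a hypothetical sketch: the `OptimizedMoEVisionModel` import, the checkpoint loading call, and the `generate` signature are assumptions, while the CLIP ViT-B/32 processor and GPT-2 tokenizer match the architecture described above.

```python
# Hypothetical inference sketch. The custom model class, checkpoint id, and
# generate() signature below are assumptions, not a confirmed API of this repo.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, GPT2Tokenizer

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

image = Image.open("example.jpg").convert("RGB")   # placeholder image path
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
prompt_ids = tokenizer("Describe the image:", return_tensors="pt").input_ids

# from dumb_ai import OptimizedMoEVisionModel           # assumed custom module/class
# model = OptimizedMoEVisionModel.from_pretrained(...)  # assumed checkpoint loading
# with torch.no_grad():
#     output_ids = model.generate(pixel_values=pixel_values,
#                                 input_ids=prompt_ids,
#                                 max_new_tokens=64)
# print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```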