After months of experimentation, I'm excited to share Aurea - a novel adaptive spatial-range attention mechanism that approaches multimodal fusion from a fundamentally different angle.

Most vision-language models use a single vision encoder followed by simple projection layers, creating a bottleneck that forces rich visual information through a single representational "funnel" before language integration.
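For reference, that conventional connector looks roughly like the sketch below - one encoder, one small MLP, and every visual feature has to squeeze through it before the language model sees anything. The dimensions are illustrative, not any particular model's configuration:

```python
import torch.nn as nn

# Schematic of the common single-encoder VLM connector (illustrative only):
# all visual information passes through one small MLP "funnel" before
# being handed to the language model.
class SingleFunnelProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens):    # (batch, patches, vision_dim)
        return self.proj(vision_tokens)  # (batch, patches, llm_dim)
```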

What if we could integrate multiple visual perspectives throughout the modeling process?

The key innovation in Aurea isn't just using multiple encoders (DINOv2 + SigLIP2) - it's how we fuse them. The spatial-range attention mechanism preserves both spatial relationships and semantic information.
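Here's a minimal sketch of the idea - simplified for illustration, not the exact implementation in the repo. Attention logits combine a range term (content similarity between the two encoders' tokens) with a spatial term (a Gaussian penalty on patch-grid distance), so a token attends to what is both semantically related and spatially nearby:

```python
import torch
import torch.nn as nn

class SpatialRangeFusion(nn.Module):
    """Simplified spatial-range attention fusion (illustrative sketch).

    x_a: tokens from encoder A (e.g. DINOv2),  shape (B, N, D)
    x_b: tokens from encoder B (e.g. SigLIP2), shape (B, N, D)
    coords: shared (N, 2) patch-grid positions for the N tokens
    """
    def __init__(self, dim, sigma_spatial=2.0):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.sigma_spatial = sigma_spatial
        self.scale = dim ** -0.5

    def forward(self, x_a, x_b, coords):
        q, k, v = self.q(x_a), self.k(x_b), self.v(x_b)
        # Range term: standard content-based attention logits
        range_logits = (q @ k.transpose(-2, -1)) * self.scale   # (B, N, N)
        # Spatial term: Gaussian penalty on squared patch-grid distance
        d2 = torch.cdist(coords, coords).pow(2)                 # (N, N)
        spatial_logits = -d2 / (2 * self.sigma_spatial ** 2)
        attn = (range_logits + spatial_logits).softmax(dim=-1)
        # Fuse encoder-B content into the encoder-A token stream
        return x_a + attn @ v

# Example: fuse 16x16 = 256 patch tokens from two encoders
grid = torch.stack(torch.meshgrid(
    torch.arange(16), torch.arange(16), indexing="ij"), dim=-1)
coords = grid.reshape(-1, 2).float()
fuse = SpatialRangeFusion(dim=1024)
out = fuse(torch.randn(2, 256, 1024), torch.randn(2, 256, 1024), coords)
```

The spatial term acts like the locality prior in a bilateral filter: it keeps fusion anchored to where a patch sits in the image while the range term decides what content gets mixed in.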

This dual awareness yields richer representations that can be used for any downstream task. For instance, Aurea can better capture the relative positions of objects, fine-grained details, and complex spatial hierarchies.

I've integrated Aurea into a language model (Phi-4 Mini) via basic pre-training and instruction-tuning. Everything is available - code, weights, and documentation. The CUDA implementation is particularly interesting if you enjoy high-performance computing.
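Loading should look roughly like the standard transformers pattern below - this is a hypothetical snippet, so check the repo README for the actual entry point and preprocessing:

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# Hypothetical loading sketch; the repo documents the real usage.
model = AutoModelForCausalLM.from_pretrained(
    "Dcas89/Aurea", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "Dcas89/Aurea", trust_remote_code=True
)
```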

I'd love to see what the community builds with this foundation and would appreciate your feedback. Whether you're interested in theoretical aspects of multimodal fusion or practical applications, there's something in Aurea for you.
