After months of experimentation, I'm excited to share Aurea - a novel adaptive spatial-range attention mechanism that approaches multimodal fusion from a fundamentally different angle.

Most vision-language models use a single vision encoder followed by simple projection layers, creating a bottleneck that forces rich visual information through a single representational "funnel" before language integration.
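For reference, that conventional connector looks roughly like the sketch below - one encoder, one small MLP, and every visual feature has to squeeze through it before the language model sees anything. The dimensions are illustrative, not any particular model's configuration:

```python
import torch.nn as nn

# Schematic of the common single-encoder VLM connector (illustrative only):
# all visual information passes through one small MLP "funnel" before
# being handed to the language model.
class SingleFunnelProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens):    # (batch, patches, vision_dim)
        return self.proj(vision_tokens)  # (batch, patches, llm_dim)
```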

What if we could integrate multiple visual perspectives throughout the modeling process?

The key innovation in Aurea isn't just using multiple encoders (DINOv2 + SigLIP2) - it's how we fuse them. The spatial-range attention mechanism preserves both spatial relationships and semantic information.
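Here's a minimal sketch of the idea - simplified for illustration, not the exact implementation in the repo. Attention logits combine a range term (content similarity between the two encoders' tokens) with a spatial term (a Gaussian penalty on patch-grid distance), so a token attends to what is both semantically related and spatially nearby:

```python
import torch
import torch.nn as nn

class SpatialRangeFusion(nn.Module):
    """Simplified spatial-range attention fusion (illustrative sketch).

    x_a: tokens from encoder A (e.g. DINOv2),  shape (B, N, D)
    x_b: tokens from encoder B (e.g. SigLIP2), shape (B, N, D)
    coords: shared (N, 2) patch-grid positions for the N tokens
    """
    def __init__(self, dim, sigma_spatial=2.0):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.sigma_spatial = sigma_spatial
        self.scale = dim ** -0.5

    def forward(self, x_a, x_b, coords):
        q, k, v = self.q(x_a), self.k(x_b), self.v(x_b)
        # Range term: standard content-based attention logits
        range_logits = (q @ k.transpose(-2, -1)) * self.scale   # (B, N, N)
        # Spatial term: Gaussian penalty on squared patch-grid distance
        d2 = torch.cdist(coords, coords).pow(2)                 # (N, N)
        spatial_logits = -d2 / (2 * self.sigma_spatial ** 2)
        attn = (range_logits + spatial_logits).softmax(dim=-1)
        # Fuse encoder-B content into the encoder-A token stream
        return x_a + attn @ v

# Example: fuse 16x16 = 256 patch tokens from two encoders
grid = torch.stack(torch.meshgrid(
    torch.arange(16), torch.arange(16), indexing="ij"), dim=-1)
coords = grid.reshape(-1, 2).float()
fuse = SpatialRangeFusion(dim=1024)
out = fuse(torch.randn(2, 256, 1024), torch.randn(2, 256, 1024), coords)
```

The spatial term acts like the locality prior in a bilateral filter: it keeps fusion anchored to where a patch sits in the image while the range term decides what content gets mixed in.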

This dual awareness yields richer representations that can be used for any downstream task. For instance, Aurea can better capture the relative positions of objects, fine-grained details, and complex spatial hierarchies.

I've integrated Aurea into a language model (Phi-4 Mini) via basic pre-training and instruction-tuning. Everything is available - code, weights, and documentation. The CUDA implementation is particularly interesting if you enjoy high-performance computing.
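Loading should look roughly like the standard transformers pattern below - this is a hypothetical snippet, so check the repo README for the actual entry point and preprocessing:

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# Hypothetical loading sketch; the repo documents the real usage.
model = AutoModelForCausalLM.from_pretrained(
    "Dcas89/Aurea", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "Dcas89/Aurea", trust_remote_code=True
)
```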

I'd love to see what the community builds with this foundation and would appreciate your feedback. Whether you're interested in theoretical aspects of multimodal fusion or practical applications, there's something in Aurea for you.
