arxiv:2502.12488

Enhancing Audio-Visual Spiking Neural Networks through Semantic-Alignment and Cross-Modal Residual Learning

Published on Feb 18, 2025

Abstract

AI-generated summary: A semantic-alignment cross-modal residual learning framework for multimodal audio-visual processing using spiking neural networks achieves state-of-the-art performance.

Humans interpret and perceive the world by integrating sensory information from multiple modalities, such as vision and hearing. Spiking Neural Networks (SNNs), as brain-inspired computational models, exhibit unique advantages in emulating the brain's information processing mechanisms. However, existing SNN models primarily focus on unimodal processing and lack efficient cross-modal information fusion, thereby limiting their effectiveness in real-world multimodal scenarios. To address this challenge, we propose a semantic-alignment cross-modal residual learning (S-CMRL) framework, a Transformer-based multimodal SNN architecture designed for effective audio-visual integration. S-CMRL leverages a spatiotemporal spiking attention mechanism to extract complementary features across modalities and incorporates a cross-modal residual learning strategy to enhance feature integration. Additionally, a semantic alignment optimization mechanism is introduced to align cross-modal features within a shared semantic space, improving their consistency and complementarity. Extensive experiments on three benchmark datasets (CREMA-D, UrbanSound8K-AV, and MNISTDVS-NTIDIGITS) demonstrate that S-CMRL significantly outperforms existing multimodal SNN methods, achieving state-of-the-art performance. The code is publicly available at https://github.com/Brain-Cog-Lab/S-CMRL.
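The core idea described in the abstract, attending from one modality to the other and adding the attended features back as a residual before a spiking non-linearity, can be sketched in PyTorch as below. This is a minimal illustrative sketch, not the authors' implementation (see the GitHub repository above): the class names (SpikingNeuron, CrossModalResidualFusion), the hard-threshold spiking neuron, and all dimensions and hyperparameters are assumptions made for the example.

# Hypothetical sketch of cross-modal residual fusion with a simple spiking
# non-linearity, loosely following the abstract's description of S-CMRL.
# Names and hyperparameters are illustrative assumptions; the authors'
# implementation is at https://github.com/Brain-Cog-Lab/S-CMRL.
import torch
import torch.nn as nn


class SpikingNeuron(nn.Module):
    """Hard-threshold spiking activation (illustrative only)."""

    def __init__(self, threshold: float = 1.0):
        super().__init__()
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Binary spikes via a hard threshold; a trainable SNN would replace
        # this with a surrogate-gradient spiking neuron.
        return (x >= self.threshold).float()


class CrossModalResidualFusion(nn.Module):
    """Fuse audio features into visual features via attention + residual."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.spike = SpikingNeuron()

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # Visual tokens query complementary audio tokens ...
        attended, _ = self.cross_attn(query=visual, key=audio, value=audio)
        # ... and the attended features are added back as a residual
        # before normalization and spiking.
        fused = self.norm(visual + attended)
        return self.spike(fused)


if __name__ == "__main__":
    vis = torch.randn(2, 16, 256)  # (batch, visual tokens, feature dim)
    aud = torch.randn(2, 32, 256)  # (batch, audio tokens, feature dim)
    out = CrossModalResidualFusion()(vis, aud)
    print(out.shape)  # torch.Size([2, 16, 256])

The semantic alignment optimization mentioned in the abstract (aligning the two modalities' features in a shared semantic space) would be applied as an additional training objective on top of such fused features; it is not shown here.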

