Prof. Jinwoo Choi's Lab (Vision and Learning Lab) Has a Multi-Modal Learning Paper Accepted to TPAMI (IF 18.6, JCR Top 1.1%)
A paper co-first-authored by Jongseo Lee and Joohyun Chang, master's students in the Department of Computer Engineering at Prof. Jinwoo Choi's lab (Vision and Learning Lab), has been accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI, Impact Factor 18.6, JCR top 1.1%), one of the most prestigious journals in computer vision, machine learning, and artificial intelligence.
The paper proposes a framework that enables efficient and unified video multi-modal learning through information exchange among experts across three contexts: audio, space, and time.
[Paper Information]
Title: CA2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition
Authors: Jongseo Lee*, Joohyun Chang*, Dongho Lee, Jinwoo Choi† (* equal contribution, † corresponding author)
Venue: IEEE Transactions on Pattern Analysis and Machine Intelligence
TL;DR
We propose Cross-Attention in Audio, Space, and Time (CA2ST), a transformer-based method for holistic video recognition.
Abstract
We propose Cross-Attention in Audio, Space, and Time (CA2ST), a transformer-based method for holistic video recognition. Recognizing actions in videos requires both spatial and temporal understanding, yet most existing models lack a balanced spatio-temporal understanding of videos. To address this, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), using only RGB input. In each layer of CAST, Bottleneck Cross-Attention (B-CA) enables spatial and temporal experts to exchange information and make synergistic predictions. For holistic video understanding, we extend CAST by integrating an audio expert, forming Cross-Attention in Visual and Audio (CAVA). We validate CAST on benchmarks with different characteristics (EPIC-KITCHENS-100, Something-Something-V2, Kinetics-400, ActivityNet, and HD-EPIC) to show its balanced performance. We also validate CAVA on audio-visual action recognition benchmarks, including UCF-101, VGG-Sound, KineticsSound, EPIC-SOUNDS, and HD-EPIC-SOUNDS. CAVA shows favorable performance on these datasets, demonstrating the effective information exchange among multiple experts within the B-CA module. Finally, CA2ST combines CAST and CAVA by employing spatial, temporal, and audio experts through cross-attention, achieving balanced and holistic video understanding.
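To illustrate the bottleneck idea behind B-CA, here is a minimal, illustrative sketch in plain Python: one expert's token sequence is first compressed into a few bottleneck tokens, and the other expert then cross-attends to that compact summary rather than to every token. This is only a toy sketch; the actual CA2ST uses full transformer layers with learned projections, and all function names, dimensions, and the fixed bottleneck queries here are hypothetical simplifications.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(queries, keys, values):
    """Scaled dot-product attention: each query produces a convex
    combination of the value vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        w = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def bottleneck_cross_attention(tokens_a, tokens_b, n_bottleneck=2):
    """Toy B-CA-style exchange: expert B's tokens are compressed into
    n_bottleneck summary tokens (here via fixed one-hot bottleneck
    queries, a placeholder for learned ones), and expert A's tokens
    then attend only to that summary, keeping the exchange cheap."""
    d = len(tokens_b[0])
    bottleneck_queries = [[1.0 if j == i else 0.0 for j in range(d)]
                          for i in range(n_bottleneck)]
    summary_b = attend(bottleneck_queries, tokens_b, tokens_b)
    return attend(tokens_a, summary_b, summary_b)

# Example: 3 "spatial" tokens exchanging information with 2 "temporal" tokens.
spatial = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.3, 0.3]]
temporal = [[0.2, 0.1, 0.4, 0.3], [0.0, 0.5, 0.1, 0.2]]
updated = bottleneck_cross_attention(spatial, temporal)
```

Because both attention steps output convex combinations of the temporal tokens, each coordinate of the result stays within the range spanned by the temporal features, and the cost of the exchange scales with the small bottleneck size rather than with the full token count of the other expert.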
2025.11.03