LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos

CVPR 2025

Abstract

Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models (LLMs) for omni-modality fine-grained temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.
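As a rough illustration of what "omni-modal events with precise temporal boundaries and detailed relation-aware captions" could look like in code, the sketch below models one event as a time span plus a caption and compares two spans with temporal intersection-over-union. The `OmniEvent` class, its field names, and the sample captions are hypothetical. This page does not describe the released annotation schema, and temporal IoU is shown only as a common way to compare time spans, not as the metric confirmed by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OmniEvent:
    """One annotated event: a time span in seconds plus a caption.

    Hypothetical schema for illustration; not the dataset's actual format.
    """
    start: float
    end: float
    caption: str


def temporal_iou(a: OmniEvent, b: OmniEvent) -> float:
    """Intersection-over-union of two time spans; 0.0 when they do not overlap."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


# Made-up example spans and captions, not taken from the dataset.
reference = OmniEvent(10.0, 20.0, "A man speaks to the camera while music plays.")
predicted = OmniEvent(15.0, 25.0, "A man talks over background music.")
print(temporal_iou(reference, predicted))
```

Here the two spans share 5 seconds out of a combined 15 seconds, so the overlap ratio is one third.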


Annotation Pipeline of LongVALE

[Figure: annotation pipeline of LongVALE]

Statistics

[Figure: LongVALE dataset statistics]

License

The LongVALE dataset is available for download under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The copyright remains with the original video owners. Please contact the authors if you have any questions regarding the dataset.

BibTeX


@article{geng2024longvale,
  title={{LongVALE}: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos},
  author={Geng, Tiantian and Zhang, Jinrui and Wang, Qingni and Wang, Teng and Duan, Jinming and Zheng, Feng},
  journal={arXiv preprint arXiv:2411.19772},
  year={2024}
}