Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

1Seoul National University 2NAVER AI Lab
*These authors contributed equally to this work.
TL;DR: This paper presents a systematic analysis of where and how information flows in VideoLLMs for temporal reasoning in VideoQA, revealing key patterns and effective pathways.

Summary of our findings on VideoLLMs' information flow.

(a) Temporal reasoning begins with cross-frame interactions within video tokens at early-to-middle layers [green], followed by video-language integration into temporal keywords in the question [purple]. This information is conveyed to the last token at middle-to-late layers [orange], where answer generation occurs [yellow].

(b) These effective pathways are identified via Attention Knockout, which disconnects selected attention pairs and tracks the resulting drop in the probability of the final answer to quantify their impact.
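A minimal sketch of the Attention Knockout operation is shown below (Python/PyTorch). It works directly on a tensor of pre-softmax attention scores; in practice the masking would be injected into the model's attention layers (e.g., via forward hooks) over a chosen range of layers. The tensor shapes, function names, and edge list here are illustrative assumptions rather than the paper's exact implementation.

import torch

def knock_out_edges(attn_scores: torch.Tensor,
                    blocked_edges: list[tuple[int, int]]) -> torch.Tensor:
    """Sever (query, key) attention pairs by masking their pre-softmax
    scores, so no information flows along those edges."""
    scores = attn_scores.clone()                  # (heads, seq_q, seq_k)
    neg_inf = torch.finfo(scores.dtype).min
    for q, k in blocked_edges:
        scores[:, q, k] = neg_inf                 # disconnect edge: key k -> query q
    return scores

def probability_drop(p_clean: float, p_knocked: float) -> float:
    """Relative drop in the final-answer probability, used to quantify
    how much the blocked edges contribute to the prediction."""
    return (p_clean - p_knocked) / max(p_clean, 1e-12)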

(c) Layer-wise answer probability rises immediately after video-language integration, indicating that the model is ready to predict correct answers after the middle layers.

Based on our analysis, we show that VideoLLMs can retain their VideoQA performance by selecting effective information pathways while suppressing a substantial fraction of attention edges, e.g., 58% in LLaVA-NeXT-7B-Video-FT.

Abstract

Video Large Language Models (VideoLLMs) extend the capabilities of vision-language models to spatiotemporal inputs, enabling tasks such as video question answering (VideoQA). Despite recent advances in VideoLLMs, their internal mechanisms, i.e., where and how they extract and propagate video and textual information, remain underexplored. In this study, we investigate the internal information flow of VideoLLMs using mechanistic interpretability techniques. Our analysis reveals consistent patterns across diverse VideoQA tasks: (1) temporal reasoning in VideoLLMs begins with active cross-frame interactions in the early-to-middle layers; (2) this is followed by progressive video-language integration in the middle layers, facilitated by alignment between video representations and linguistic embeddings that carry temporal concepts; (3) once this integration is complete, the model is ready to generate correct answers in the middle-to-late layers; (4) building on this analysis, we show that VideoLLMs can retain their VideoQA performance by selecting these effective information pathways while suppressing a substantial fraction of attention edges, e.g., 58% in LLaVA-NeXT-7B-Video-FT. These findings provide a blueprint for how VideoLLMs perform temporal reasoning and offer practical insights for improving model interpretability and downstream generalization.

Research Questions

In this study, we aim to provide a complete blueprint that reveals the systematic behaviors of VideoLLMs on temporal reasoning tasks, with a focus on the information flow across different layers and modalities.

To understand how VideoLLMs generate an answer from a given (video, question) pair, we decompose the temporal reasoning process into several stages and investigate the following key questions:

  1. How do VideoLLMs encode spatiotemporal information from the given flattened sequence of video tokens?
  2. How are temporal concepts extracted from the video tokens and propagated to the text tokens in the question?
  3. At what stage does the model become ready to generate an answer?
  4. Can we identify effective information flow pathways sufficient to solve VideoQA?

Our Findings

1. Active Temporal Interaction Within Video Tokens in Early-to-Middle Layers


Training with VideoQA data boosts cross-frame interactions in the early-to-middle layers. We use Attention Knockout to selectively disconnect attention edges and quantify their impact. Blocking cross-frame interactions in the early-to-middle layers significantly harms LLaVA-NeXT-7B-Video-FT’s predictions, while LLaVA-NeXT-7B remains mostly unaffected, showing that this capability is acquired specifically through VideoQA instruction tuning of the base ImageLLM.


Impact of cross-frame attention on answer generation. We block cross-frame attention in the first half of the total layers and measure the resulting accuracy drop. Without cross-frame attention, the model generates answers that are incorrect or even opposite to the content of the given videos. This suggests that VideoLLMs rely heavily on cross-frame interactions in the early stage to reason about temporal events.
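As an illustration of how such an intervention can be expressed, the sketch below builds a boolean mask that disconnects attention between video tokens belonging to different frames; applying it in the first half of the layers (by adding a large negative value to the blocked pre-softmax scores) mirrors the spirit of this experiment. The contiguous per-frame token layout and the argument names are assumptions, since the real layout depends on the VideoLLM's visual tokenization.

import torch

def cross_frame_block_mask(seq_len: int, video_start: int,
                           n_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean (seq, seq) mask; True marks cross-frame attention edges to block."""
    frame_id = torch.full((seq_len,), -1, dtype=torch.long)
    for f in range(n_frames):                     # assume contiguous frame blocks
        start = video_start + f * tokens_per_frame
        frame_id[start:start + tokens_per_frame] = f
    is_video = frame_id >= 0
    # Block (query, key) pairs that are both video tokens but from different frames.
    return (is_video[:, None] & is_video[None, :]
            & (frame_id[:, None] != frame_id[None, :]))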

2. Video-Language Integration on Temporal Keywords in Middle Layers


Overall cross-modal information flow in VideoLLMs. We analyze changes in the prediction probability when intervening on attention edges between video, question, and last token (i.e., the starting position for answer generation). Information from the video tokens is conveyed to the question tokens in the early-to-middle layers, followed by the transfer of information from the question tokens to the last token in the middle-to-late layers.
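The sketch below spells out the three attention-edge groups involved in this analysis. The span indices are placeholder assumptions; in practice they are read off from the tokenized (video, question) prompt, and each group is knocked out over a window of layers while the change in the answer probability is recorded.

# Illustrative token spans (actual indices depend on the prompt and the
# model's video tokenization).
video_span    = range(5, 20)      # flattened video tokens
question_span = range(20, 28)     # question tokens
last_token    = 28                # starting position of answer generation

# Attention edges are (query, key) pairs; information flows from key to query.
edge_groups = {
    "video->question": [(q, k) for q in question_span for k in video_span],
    "question->last":  [(last_token, k) for k in question_span],
    "video->last":     [(last_token, k) for k in video_span],
}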


Emergence of temporal concepts in video tokens. Analyzing the semantic concepts carried by video tokens through the Logit Lens shows that temporal concepts emerge within the video tokens in the vocabulary space. Interestingly, spatial concepts start to appear in the very early layers, whereas temporal concepts develop later, in the middle layers.
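A minimal Logit Lens sketch for this probe is given below. It assumes the hidden states were collected with output_hidden_states=True and that final_norm and unembed stand in for the model's final normalization layer and unembedding head; those module names vary across VideoLLMs and are assumptions here.

import torch

@torch.no_grad()
def logit_lens_topk(hidden_states, layer, video_positions,
                    final_norm, unembed, tokenizer, k=5):
    """Project intermediate video-token states into the vocabulary space and
    return the top-k decoded tokens per video position."""
    h = hidden_states[layer][0, video_positions]   # (n_video_tokens, d_model)
    logits = unembed(final_norm(h))                # (n_video_tokens, vocab)
    top_ids = logits.topk(k, dim=-1).indices
    return [tokenizer.convert_ids_to_tokens(row.tolist()) for row in top_ids]

One can then check at which layer temporal words begin to appear among the top-k tokens, compared with spatial words such as object names.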


Video-language alignment enables selective spatiotemporal propagation. In these video-to-question attention maps, (a) with spatiotemporal interactions intact, each question token attends to semantically relevant regions: "begins" focuses on the blue sphere at the start, while "ends" focuses on the blue sphere and the green square at the end. (b) When temporal interactions among video tokens are blocked, video-text alignment fails and the text tokens instead attend to positionally proximate regions rather than semantically relevant ones.
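The attention maps themselves can be read out as sketched below, assuming attention weights were collected with output_attentions=True (one (batch, heads, seq, seq) tensor per layer); the position arguments and the frame/grid layout are illustrative assumptions.

import torch

def keyword_to_video_attention(attentions, layer, keyword_pos,
                               video_start, n_frames, grid_h, grid_w):
    """Head-averaged attention from one question token to every video token,
    reshaped to (n_frames, grid_h, grid_w) for per-frame visualization."""
    attn = attentions[layer][0].mean(dim=0)        # (seq, seq), averaged over heads
    n_video = n_frames * grid_h * grid_w
    row = attn[keyword_pos, video_start:video_start + n_video]
    return row.reshape(n_frames, grid_h, grid_w)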

3. Answer Generation at Middle-to-Late Layers


Tracing the layer-wise answer probability at the last token reveals that the model is prepared to generate the correct answer immediately after video-language integration concludes in the middle layers.
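A sketch of this layer-wise readout is given below. As with the Logit Lens probe above, final_norm, unembed, and answer_token_id are placeholders for the model's final normalization layer, unembedding head, and the tokenized gold answer.

import torch

@torch.no_grad()
def answer_probability_per_layer(hidden_states, final_norm, unembed, answer_token_id):
    """Probability assigned to the gold answer token at the last position,
    read out after every transformer layer."""
    probs = []
    for h in hidden_states[1:]:                    # skip the input embeddings
        logits = unembed(final_norm(h[0, -1]))     # readout at the last token
        probs.append(torch.softmax(logits, dim=-1)[answer_token_id].item())
    return probs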

4. Effective Information Flow Pathways Are Sufficient for Solving VideoQA Tasks


To validate the above findings, we disable all information pathways except those identified as critical. Evaluation on VideoQA benchmarks shows that the models retain performance comparable to baselines, demonstrating that these effective pathways suffice for accurate answer generation.
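A hedged sketch of how such a pathway-only configuration can be assembled is shown below: per-layer boolean keep-masks retain cross-frame and video-to-question edges in the first half of the layers and question-to-last edges in the second half, and every other edge is suppressed. The layer split, the token spans, and the handling of the causal mask are simplifications, so the resulting suppression ratio will not exactly reproduce the reported 58%.

import torch

def pathway_keep_masks(n_layers, seq_len, video, question, last_pos):
    """Boolean (seq, seq) masks per layer: True = edge kept, False = suppressed.
    `video` and `question` are slices over token positions (assumed layout)."""
    masks = []
    for layer in range(n_layers):
        keep = torch.zeros(seq_len, seq_len, dtype=torch.bool)
        keep.fill_diagonal_(True)                  # always keep self-attention
        if layer < n_layers // 2:                  # early-to-middle layers
            keep[video, video] = True              # cross-frame interaction
            keep[question, video] = True           # video -> question integration
        else:                                      # middle-to-late layers
            keep[last_pos, question] = True        # question -> last token
        masks.append(keep)
    return masks

def suppressed_fraction(masks):
    """Fraction of attention edges disabled when only the kept pathways remain."""
    kept = sum(int(m.sum()) for m in masks)
    total = sum(m.numel() for m in masks)
    return 1.0 - kept / total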

BibTeX

@article{kim2025map,
  author    = {Kim, Minji and Kim, Taekyung and Han, Bohyung},
  title     = {Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs},
  journal   = {arXiv preprint},
  year      = {2025},
}