MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention

Microsoft Corporation, University of Surrey
yucheng.li@surrey.ac.uk, hjiang@microsoft.com

MMInference is a modality-aware, permutation-based dynamic sparse attention method built through bottom-up system-algorithm co-design. It processes 1M-token video inputs up to 8.3x faster on a single A100 with long-context VLMs such as LongVila, Llava-Video, VideoChat-Flash, and Qwen2.5-VL, while matching or even improving accuracy. Try MMInference now!

News

  1. 🌳  [25/4/23] We will present MMInference at the Microsoft Booth and at FM-Wild at ICLR'25. See you in Singapore!



Abstract

The integration of long-context capabilities with visual understanding unlocks unprecedented potential for Vision Language Models (VLMs). However, the quadratic attention complexity during the pre-filling phase remains a significant obstacle to real-world deployment. To overcome this limitation, we introduce MMInference (Multimodality Million tokens Inference), a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. Simultaneously, VLMs exhibit markedly different sparse distributions across modalities. We introduce a permutation-based method to leverage the unique Grid pattern and handle the modality-boundary issue. By searching offline for the optimal sparse pattern of each head, MMInference constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computation. Notably, MMInference integrates seamlessly into existing VLM pipelines without any model modifications or fine-tuning. Experiments on multi-modal benchmarks—including Video QA, Captioning, VisionNIAH, and Mixed-Modality NIAH—with state-of-the-art long-context VLMs (LongVila, Llava-Video, VideoChat-Flash, Qwen2.5-VL) show that MMInference accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining competitive performance.


Insights

  1. Hardware-efficient sparse attention imposes structural constraints: sparsity must be gathered into contiguous blocks to benefit from dense tensor-core computation;

  2. Multi-modality attention is dynamically sparse;

  3. Grid-Shape: vision inputs exhibit unique inductive biases, such as locality across the spatial and temporal dimensions, which give rise to grid-like sparse patterns;

  4. Mix-Modality Boundary: the interaction and processing patterns between modalities vary considerably, leading to distinct modality boundaries in attention.


Why MMInference?

Long-context capability enables LLMs and VLMs to power advanced applications such as long-video understanding, reasoning, robotics, autonomous driving, and healthcare. However, the quadratic attention cost makes processing long multi-modal inputs (i.e., pre-filling) extremely slow—often taking minutes. While prior works use dynamic sparse attention to accelerate text-only LLMs, they overlook the unique sparse patterns of VLMs and struggle with interleaved modalities, so they cannot deliver meaningful speedups without performance trade-offs.


Figure 1. Visualization of sparse attention patterns in VLMs before and after permutation.


Unlike long-text inputs, visual inputs in VLMs exhibit spatiotemporal locality, leading to grid-like attention patterns with regular vertical and horizontal structures (Fig. 1a). In mixed-modality settings, clear modality boundaries emerge, where cross-modality attention deviates notably from intra-modality patterns (Fig. 1b). These characteristics introduce unique challenges for leveraging sparsity to accelerate the pre-fill stage.


To address this gap, we introduce MMInference, a permutation-based dynamic sparse attention method that significantly reduces attention FLOPs and accelerates the pre-filling stage of long-context VLMs. 1) MMInference identifies grid heads and applies a row- and column-wise permutation to gather the sparse grid into dense blocks for efficient hardware computation; 2) it detects Query-boundary and 2D-boundary patterns to address inter-modality boundary issues and applies a modality-wise permutation to isolate intra-modality regions; 3) a Modality-Aware Sparse Attention Search Algorithm tunes both inter- and intra-modality patterns offline to optimize performance with minimal overhead.
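
To make the row- and column-wise permutation in 1) concrete, here is a minimal plain-PyTorch sketch (not the actual MMInference kernels). It assumes video tokens are stored frame-major (all tokens of frame 0, then frame 1, ...) and reorders them position-major, so that the strided stripes of a grid head become contiguous and can be covered by dense blocks; the grid_permutation helper and the toy shapes are illustrative assumptions.

    import torch

    def grid_permutation(seq_len: int, tokens_per_frame: int) -> torch.Tensor:
        # Reorder a frame-major video token sequence into position-major order,
        # so tokens that share a spatial position across frames become contiguous.
        assert seq_len % tokens_per_frame == 0
        num_frames = seq_len // tokens_per_frame
        idx = torch.arange(seq_len).view(num_frames, tokens_per_frame)  # (T, S), frame-major
        return idx.t().reshape(-1)                                      # (S * T,), position-major

    # Toy example: 4 frames x 3 tokens per frame.
    perm = grid_permutation(seq_len=12, tokens_per_frame=3)

    # Applying the same permutation to queries and keys gathers the strided grid
    # of a grid head into (approximately) dense diagonal blocks, which a
    # block-sparse FlashAttention-style kernel can then compute efficiently.
    q = torch.randn(1, 12, 64)
    k = torch.randn(1, 12, 64)
    scores = q[:, perm] @ k[:, perm].transpose(-1, -2)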


Figure 2. The framework of MMInference, encompassing both inter- and intra-modality sparse attention patterns.
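
For the modality-wise permutation in 2), the sketch below shows one simple way to group an interleaved video-text sequence by modality and recover the contiguous intra-modality regions. It is a conceptual illustration only; the modality_ids input format (0 = text, 1 = vision) is an assumption, not the MMInference interface.

    import torch

    def modality_permutation(modality_ids: torch.Tensor):
        # A stable sort groups same-modality tokens contiguously while preserving
        # their relative order; the inverse permutation restores the original
        # layout after attention.
        perm = torch.sort(modality_ids, stable=True).indices
        inverse = torch.argsort(perm)
        sorted_ids = modality_ids[perm]
        # One (start, end) span per contiguous intra-modality region.
        boundaries = torch.nonzero(sorted_ids[1:] != sorted_ids[:-1]).flatten() + 1
        starts = torch.cat([torch.tensor([0]), boundaries])
        ends = torch.cat([boundaries, torch.tensor([len(sorted_ids)])])
        return perm, inverse, list(zip(starts.tolist(), ends.tolist()))

    # Toy interleaved input: text (0) and vision (1) tokens.
    modality_ids = torch.tensor([0, 0, 1, 1, 1, 0, 1, 1, 0, 0])
    perm, inverse, regions = modality_permutation(modality_ids)
    print(regions)  # [(0, 5), (5, 10)]
    # Intra-modality sparse patterns (e.g. Grid or Vertical-Slash) are applied
    # within each region; cross-modality blocks are handled by the boundary patterns.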


Specifically, we implement three distinct permutation-based sparse attention mechanisms, built on FlashAttention, FlashDecoding, and PIT, to address the grid patterns of vision inputs and the modality boundary issues of mixed-modality scenarios, as detailed in Algorithms 5-7.
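
Which intra-modality pattern a given head uses is decided offline. The toy example below only illustrates the general idea of assigning one pattern per head from a calibration sample; it is not the Modality-Aware Sparse Attention Search Algorithm from the paper, and a real search would compare candidates under a matched FLOP budget rather than with these crude masks.

    import torch

    def grid_mask(n, stride):
        # Keep rows and columns at a fixed stride (a crude stand-in for the Grid pattern).
        keep = torch.zeros(n, n, dtype=torch.bool)
        keep[:, ::stride] = True
        keep[::stride, :] = True
        return keep

    def vertical_slash_mask(n, num_vertical, num_slash):
        # Keep a few global columns plus a few diagonals (Vertical-Slash pattern).
        keep = torch.zeros(n, n, dtype=torch.bool)
        keep[:, :num_vertical] = True
        for d in range(num_slash):
            keep |= torch.diag(torch.ones(n - d), -d).bool()
        return keep

    def assign_pattern(attn, candidates):
        # Pick, for one head, the candidate that recovers the most attention mass.
        recall = {name: attn[mask].sum().item() for name, mask in candidates.items()}
        return max(recall, key=recall.get)

    # Toy calibration attention map for a single head.
    n = 64
    attn = torch.softmax(torch.randn(n, n), dim=-1)
    candidates = {
        "grid": grid_mask(n, stride=8),
        "vertical_slash": vertical_slash_mask(n, num_vertical=4, num_slash=4),
    }
    print(assign_pattern(attn, candidates))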




Experimental Results on Long-Video Benchmarks

We evaluate MMInference across diverse scenarios, including long-video understanding, Video Needle-in-a-Haystack, and Video-Text Needle-in-a-Haystack. The video understanding benchmark comprises six long-video tasks on SoTA long-context VLMs—LongVILA, Llava-Video, and Qwen2.5-VL—using 110 to 256 frames.

| Methods | FLOPs | VideoDC | ActNet-QA | EgoSchema | Next-QA | PerceptionTest | VideoMME w/o sub. | VideoMME w/ sub. | Avg. |
|---------|-------|---------|-----------|-----------|---------|----------------|-------------------|------------------|------|
| LongVILA-7B | 100% | 2.76 | 59.5 | 61.9 | 80.7 | 58.1 | 60.1 | 65.1 | 55.5 |
| SF-fixed | 2.2% | 1.99 | 51.3 | 59.6 | 76.5 | 55.5 | 57.1 | 63.0 | 52.1 |
| SF-strided | 26.6% | 2.58 | 56.0 | 61.4 | 76.7 | 55.5 | 53.6 | 59.2 | 52.2 |
| A-shape | 29.1% | 2.75 | 56.6 | 60.9 | 75.0 | 55.3 | 49.1 | 59.6 | 51.3 |
| Tri-shape | 29.3% | 2.63 | 58.1 | 62.0 | 77.8 | 56.2 | 59.3 | 63.3 | 54.2 |
| VisionZip | OOM | — | — | — | — | — | — | — | — |
| MInference | 47.0% | 2.77 | 59.7 | 62.2 | 79.1 | 57.8 | 60.0 | 65.2 | 55.2 |
| Ours | 31.8% | 2.84 | 60.2 | 62.2 | 79.4 | 57.8 | 60.0 | 65.5 | 55.4 |

Table 1. Performance (%) of different methods on video understanding tasks evaluated at 256 frames using LongVILA-7B.


To further evaluate performance across different context lengths and positions of key information within the prompt, we tested various models and methods on the Video Needle-in-a-Haystack (V-NIAH) task. As shown in Fig. 3, MMInference performs well across different models, context windows, and needle positions, maintaining or even slightly improving performance relative to the original models.


Figure 3. Video Needle In A Haystack results using LongVila-Qwen2-7B-1M.


Beyond V-NIAH, we introduce a Mixed-Modality NIAH (MM-NIAH) test to evaluate different sparse methods on interleaved video-text inputs, shown in Fig. 4. Mixed-modality inputs lead to more pronounced performance degradation across all methods. However, by incorporating inter-modality sparse patterns, our method stays close to full attention, particularly compared with MInference and with our method without inter-modality patterns. Notably, Tri-shape and MInference show significant drops at 1.8k frames (i.e., 440K tokens) and 2.7k frames (i.e., 660K tokens).


Figure 4. Mixed-Modality Needle In A Haystack results using LongVila-Qwen2-7B-1M.



Latency Benchmarks

Fig. 5 and Fig. 6 present end-to-end and kernel-level latency across different context sizes. The grid pattern is significantly sparser than the vertical-slash pattern, yielding a 2-3x speedup over it even at 1M tokens. Overall, the grid pattern achieves an end-to-end speedup of up to 8.3x and a kernel-level speedup of up to 12x.


Figure 5. End-to-End Latency.


Figure 6. Latency breakdown of a single attention kernel for four sparse attention patterns and FlashAttention across different context windows on a single A100, including the index time for dynamic sparse approximation and for building the dynamic sparsity. At 1M tokens, the latency of the Grid pattern is 358 ms.



Transition of Sparse Patterns Across Modalities

LLMs and VLMs exhibit distinct sparse patterns, necessitating tailored strategies. As shown in Fig. 7, Llava-Video-7B primarily adopts the Vertical-Slash pattern for text-only inputs, but switches to a Grid pattern upon receiving visual input—aligning with the modality boundary. This shift reflects the spatial structure of visual content and underscores the need for specialized sparsity designs in visual and mixed-modality settings, rather than directly reusing text-based sparse patterns.


Figure 7. Transition of sparse patterns from textual context to visual context. (a) The vertical-slash pattern for all textual context. (b) Grid pattern appears when visual modality is appended. (c) Grid pattern dominates.



Integrate with Token Compression Methods

As shown in Table 2, our method integrates seamlessly with token compression, achieving near-lossless performance while enabling longer or higher-resolution video inputs. VideoChat-Flash reduces tokens per frame from 196 to 16 in the ViT stage, and our method further accelerates the LLM decoder via sparse attention, maintaining strong performance across benchmarks.

| Methods | VideoDC | ActNet-QA | EgoSchema | Next-QA | PerceptionTest | VideoMME w/o sub. | VideoMME w/ sub. | Avg. |
|---------|---------|-----------|-----------|---------|----------------|-------------------|------------------|------|
| VideoChat-Flash | 3.21 | 53.6 | 57.0 | 81.2 | 69.1 | 63.2 | 70.5 | 56.8 |
| w/ MMInference | 3.19 | 54.3 | 57.3 | 79.8 | 69.1 | 63.0 | 70.2 | 56.7 |

Table 2. Performance (%) on video understanding tasks evaluated at 512 frames with 8k tokens using VideoChat-Flash.



Sparse Attention in DiT

Recent efficient DiT methods (Hassani et al., 2023; Xi et al., 2025; Zhang et al., 2025; Xu et al., 2025b) adopt sparse attention to accelerate long video generation. We highlight that these approaches can further benefit from permutation-based transformations for kernel-efficient implementations. For instance, the 2D/3D sliding window attention in NATTEN can be converted into dense tensor core operations via permutation (Fig. 8). Similarly, the temporal head in Sparse VideoGen and the anti-diagonal structure in xAttention can be permuted to enable sparse loading with dense computation, significantly accelerating DiT inference in long-context settings.


Figure 8. Permutation-based implementation of 2D/3D sliding window attention optimization for DiT architectures.
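
To illustrate the kind of permutation Fig. 8 refers to, the sketch below reorders a raster-scan token grid into tile-major order, so that most of each token's 2D sliding-window neighborhood lands in a few dense blocks near the diagonal. It is a conceptual sketch under the assumption that height and width are divisible by the tile size; real kernels (e.g. for NATTEN-style attention) would also need to handle the halo across tile edges.

    import torch

    def tile_permutation(height: int, width: int, tile: int) -> torch.Tensor:
        # Reorder row-major image tokens into tile-major order: tokens inside the
        # same (tile x tile) patch become contiguous in the sequence.
        idx = torch.arange(height * width).view(height, width)
        tiles = idx.view(height // tile, tile, width // tile, tile)
        return tiles.permute(0, 2, 1, 3).reshape(-1)

    # Toy example: a 16x16 token grid with 4x4 tiles. Applying the same
    # permutation to queries and keys lets sparse loading feed dense
    # tensor-core blocks for the sliding-window attention.
    perm = tile_permutation(height=16, width=16, tile=4)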



Additional Sparse Attention Pattern Visualization

We further analyze the sparse patterns of Qwen2.5-VL with dynamic-resolution inputs and of VideoChat-Flash under visual token compression, across both video benchmarks and mixed-modality inputs, as shown in Fig. 9.


Figure 9. Visualization of sparse attention patterns in Qwen2.5-VL with dynamic-resolution input and VideoChat-Flash with visual token compression across different benchmarks.



BibTeX

If you find this project helpful, please cite the following paper:

@article{li2025mminference,
    title={MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention},
    author={Li, Yucheng and Jiang, Huiqiang and Zhang, Chengruidong and Wu, Qianhui and Luo, Xufang and Ahn, Surin and Abdi, Amir H and Li, Dongsheng and Gao, Jianfeng and Yang, Yuqing and Qiu, Lili},
    journal={arXiv preprint arXiv:2504.16083},
    year={2025}
}

© 2025 Microsoft
Website template borrowed from NeRFies, VIMA, and L2R.