Autoregressive image generation models like Janus-Pro produce high-quality images, but at the cost of high memory consumption and computation that grows with the large number of visual tokens. While KV cache compression has been extensively studied in language modeling, it remains largely unexplored for the image generation domain.
In this work, we identify a distinct and prominent attention phenomenon, which we term spatial locality and emergent semantic sink. Leveraging this insight, we introduce SSD (Spatial-Semantic head Decoupling), a novel KV cache compression framework.
Specifically, we compress the KV cache for all visual tokens by adaptively decoupling attention heads into two separate types: for spatial-locality heads, our method maintains a short recent token window; for semantic-sink heads, it strategically preserves a compact set of highly-attended tokens.
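The per-head eviction policy above can be sketched as follows. This is a minimal illustration, not the authors' released implementation: the function name, the per-token accumulated attention scores, and the budget parameters are all hypothetical choices for exposition.

```python
import numpy as np

def compress_kv(keys, values, attn_scores, head_type, window=64, topk=64):
    """Hypothetical per-head KV cache compression sketch.

    keys, values : (T, d) arrays of cached keys/values for one head.
    attn_scores  : (T,) accumulated attention mass each cached token
                   has received (an assumed importance proxy).
    head_type    : "spatial" or "semantic".
    Returns the compressed keys/values and the kept token indices.
    """
    T = keys.shape[0]
    if head_type == "spatial":
        # Spatial-locality heads: keep only a short recent-token window.
        idx = np.arange(max(0, T - window), T)
    else:
        # Semantic-sink heads: keep a compact set of highly attended
        # tokens, restored to their original sequence order.
        idx = np.sort(np.argsort(attn_scores)[-topk:])
    return keys[idx], values[idx], idx
```

In practice the two budgets (`window`, `topk`) would be set so the total kept tokens match the target cache ratio (e.g. 20% or 50%).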
Our key empirical observation is that semantic information from textual prompts is preferentially injected into specific spatial regions, particularly the margin columns of the raster-scanned image token sequence. This finding naturally leads to the identification of two distinct types of attention heads with specialized roles: semantic heads that capture global context and spatial heads that handle local dependencies.
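One simple way to operationalize this head typing is to measure where each head's attention mass concentrates. The rule below is a hypothetical sketch (the threshold, window size, and function name are illustrative assumptions, not the paper's exact criterion): a head whose attention pools on recent tokens is treated as spatial, otherwise as semantic.

```python
import numpy as np

def classify_head(attn, recent=16, thresh=0.5):
    """Hypothetical head-classification rule.

    attn : (T,) average attention distribution of one head over the
           cached token positions (sums to ~1).
    A head placing at least `thresh` of its mass on the last `recent`
    tokens is labeled "spatial"; otherwise (e.g. mass pooled on a few
    sink tokens such as margin columns) it is labeled "semantic".
    """
    local_mass = attn[-recent:].sum()
    return "spatial" if local_mass >= thresh else "semantic"
```

Classification like this can be done once offline from profiling runs, so it adds no per-step cost at generation time.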
We evaluate our framework across two memory scenarios: a low setting (20% token budget) and a high setting (50% token budget). Under both budgets, SSD achieves better GenEval and DPG-Bench scores than H2O and StreamingLLM, and remains comparable to the vanilla full-cache baseline.
We further demonstrate that SSD maintains robust performance across compression ratios, with minimal degradation even at 20% cache size, aligning with our hypothesis that combining window-based and attention-based compression leverages both local and global dependencies effectively.
SSD is markedly more efficient than the full cache, achieving up to 6.6× higher throughput and 5× lower memory use, which highlights its practical advantage in real-world deployment. With the buffer method, SSD reaches up to 10.7× higher throughput than the full cache while reducing memory consumption by approximately 80%.
@article{jian2025ssdspatialsemanticheaddecoupling,
  title={SSD: Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation},
  author={Jian, Siyong and Wang, Huan},
  journal={arXiv preprint arXiv:2510.18716},
  year={2025}
}