Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping

Dwip Dalal¹, Gautam Vashishtha², Utkarsh Mishra³, Jeonghwan Kim¹,
Madhav Kanda¹, Hyeonjeong Ha¹, Svetlana Lazebnik¹, Heng Ji¹, Unnat Jain⁴
¹University of Illinois Urbana–Champaign  ²Skan AI  ³Texas A&M University  ⁴University of California, Irvine


TL;DR

MLLMs often miss small details and spatial relations in cluttered scenes, leading to errors in fine-grained perceptual grounding. We introduce AttWarp, a lightweight test-time method that uses an MLLM's own cross-modal attention to rectilinearly warp the input image, allocating more spatial resolution to query-relevant regions while compressing less informative ones, without changing model weights or architecture. Because the warp is rectilinear, it preserves all original image information and the global layout, merely redistributing resolution non-uniformly, so small objects and subtle relationships become easier for the same model to read. AttWarp consistently improves accuracy, strengthens compositional reasoning, and reduces hallucinations.