Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, rarely probing the nuanced, knowledge-intensive reasoning essential for practical, real-world scenarios.
To bridge this critical gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a comprehensive benchmark encompassing four diverse global cities, specifically constructed to assess raw MLLM-driven agents in city navigation.
Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points without additional environmental annotations or specialized architectural modifications. Crucially, agents must autonomously achieve localization through interpreting city-specific cues and recognizing landmarks, perform spatial reasoning, and strategically plan and execute routes to their destinations.
Through extensive evaluations, we demonstrate that current state-of-the-art MLLMs and standard reasoning techniques (e.g., Chain-of-Thought, Reflection) significantly underperform in this challenging setting. To address this, we propose Verbalization of Path (VoP), which grounds the agent's internal reasoning by eliciting an explicit cognitive map from the MLLM, substantially improving navigation success.
Sparsely Grounded Long-Range Navigation: The agent must navigate without any landmark annotations or explicit city navigation instructions, relying exclusively on images observed at each intersection. This task requires agents to leverage their intrinsic world knowledge to facilitate spatial understanding, accurate self-positioning, and sequential decision-making to reach the goal.
Agents must retrieve relevant knowledge, reason about spatial relationships, plan actions, execute them, and revise plans based on new visual evidence—all without external guidance.
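To make this loop concrete, here is a minimal sketch of the observe-reason-act cycle. The environment interface (`env.observe`, `env.step`, `env.at_goal`) and `query_mllm` are hypothetical stand-ins, not the benchmark's actual API:

```python
# A minimal sketch of the observe-reason-act loop described above.
# env.observe / env.step / env.at_goal and query_mllm are hypothetical
# stand-ins, not the benchmark's actual interface.

ACTIONS = ("forward", "left", "right", "turn_around", "stop")

def query_mllm(images: list, prompt: str) -> str:
    """Placeholder for a multimodal LLM call; wire in any vision-language API."""
    raise NotImplementedError

def parse_action(reply: str) -> str:
    """Take the last action keyword mentioned in the model's reply."""
    mentioned = [a for a in ACTIONS if a in reply.lower()]
    return mentioned[-1] if mentioned else "forward"

def navigate(env, goal: str, max_steps: int = 120) -> bool:
    plan = "none yet"
    for _ in range(max_steps):
        images = env.observe()  # intersection views: the only grounding available
        prompt = (
            f"Destination: {goal}\n"
            f"Working plan: {plan}\n"
            "From the images, infer where you are, revise the plan if the "
            "visual evidence contradicts it, then end with one action: "
            + " / ".join(ACTIONS)
        )
        reply = query_mllm(images, prompt)
        plan = reply  # carry the verbalized reasoning forward as the new plan
        action = parse_action(reply)
        if action == "stop":
            return env.at_goal()
        env.step(action)
    return False
```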
Routes span approximately 2km with 50+ decision points, requiring sustained reasoning over extended sequences—far beyond existing benchmarks limited to ~350m.
Four cities—New York, Tokyo, Vienna, São Paulo—test adaptation to diverse languages, signage systems, street layouts, and architectural styles.
Watch AgentNav, powered by GPT-5, successfully navigate to iconic landmarks using only visual observations and internal reasoning.
We propose Verbalization of Path (VoP), a mechanism designed to explicitly extract and leverage the latent world knowledge internalized by MLLMs. By prompting agents to verbalize navigation paths, VoP substantially enhances the performance of MLLM-based agents on long-range navigation tasks; a minimal prompt sketch follows the three probes below.
"Write the exact location of the destination" — explicitly defines the navigation goal, anchoring the agent's decision-making to a clear terminal state.
"Write the current estimated exact location" — compels the agent to continuously estimate and update its position, serving as a precise initial condition.
"Write walking directions from current position to destination" — leverages the agent's world knowledge to generate actionable spatial instructions.
**Markovian Memory.** The agent emits a compact memory state at each step, turning the POMDP into an approximately Markovian process. This removes the need for full episodic memory, reducing computational cost by roughly 100×.
**Decision History.** Maintains a structured record of the action chosen at each intersection, enabling reasoning about prior choices, route corrections, and avoidance of repeated loops.
**Previous Visit.** Tracks visit counts for each node. As revisits accumulate, the agent is discouraged from repeating actions, promoting exploration and preventing cyclic behavior. These three components are combined in the sketch below.
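As an illustration, a minimal sketch of how the three memory components might fit together; class, field, and method names are hypothetical, not taken from the paper's code:

```python
# Illustrative sketch of the three memory components above. All names
# here are hypothetical, not the paper's actual implementation.

from collections import Counter
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    state: str = ""                                   # compact per-step summary (Markovian Memory)
    history: list = field(default_factory=list)       # (node_id, action) pairs (Decision History)
    visits: Counter = field(default_factory=Counter)  # node visit counts (Previous Visit)

    def update(self, node_id: str, action: str, new_state: str) -> None:
        self.state = new_state  # only the latest summary is carried forward
        self.history.append((node_id, action))
        self.visits[node_id] += 1

    def to_prompt(self, node_id: str) -> str:
        """Render the memory for inclusion in the next decision prompt."""
        recent = ", ".join(f"{n}: {a}" for n, a in self.history[-5:])
        note = ""
        if self.visits[node_id] > 1:
            note = (f"\nYou have visited this intersection {self.visits[node_id]} "
                    "times; prefer an unexplored direction.")
        return f"Memory: {self.state}\nRecent decisions: {recent}{note}"
```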
| Global Cities | Avg. Path Length | Decision Points | Navigation Tasks |
|---|---|---|---|
| 4 | ~2 km | 50+ | 400 |
| City | Region | Characteristics | Distance (km) | Decision Points |
|---|---|---|---|---|
| New York | USA | Grid-based, well-spaced, rich street signs | 1.8 | 44 |
| São Paulo | Brazil | Non-block structure, Portuguese language | 2.0 | 55 |
| Tokyo | Japan | Short sightlines, narrow alleys, Japanese language | 1.9 | 80 |
| Vienna | Austria | Blocks broken up by rail lines, German language | 2.1 | 60 |
| MLLM | Config | New York Succ.% | New York SPL | New York D.A.% | Tokyo Succ.% | Tokyo SPL | Tokyo D.A.% | Vienna Succ.% | Vienna SPL | Vienna D.A.% | São Paulo Succ.% | São Paulo SPL | São Paulo D.A.% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | AgentNav | 88 | 0.539 | 72.9 | 14 | 0.099 | 40.9 | 26 | 0.170 | 46.3 | 20 | 0.060 | 43.5 |
| GPT-4o | Base | 13 | 0.064 | 39.0 | 4 | 0.046 | 36.8 | 4 | 0.031 | 35.7 | 3 | 0.040 | 34.7 |
| GPT-5 | AgentNav | 94 | 0.711 | 83.0 | 30 | 0.163 | 55.0 | 56 | 0.226 | 54.8 | 29 | 0.126 | 49.0 |
| GPT-5 | Base | 54 | 0.375 | 56.0 | 10 | 0.088 | 41.2 | 11 | 0.092 | 40.7 | 7 | 0.051 | 37.0 |
| GPT-4.1 | AgentNav | 92 | 0.557 | 75.3 | 17 | 0.101 | 43.7 | 32 | 0.182 | 50.0 | 22 | 0.080 | 44.1 |
| GPT-4.1 | Base | 15 | 0.097 | 42.3 | 5 | 0.044 | 38.8 | 2 | 0.037 | 34.7 | 5 | 0.049 | 35.5 |
| Gemini 2.5 | AgentNav | 73 | 0.471 | 74.8 | 17 | 0.066 | 46.9 | 17 | 0.137 | 46.4 | 12 | 0.085 | 43.7 |
| Gemini 2.5 | Base | 12 | 0.060 | 41.6 | 8 | 0.049 | 40.0 | 1 | 0.010 | 29.3 | 5 | 0.049 | 35.8 |
| Qwen 2.5 VL | AgentNav | 32 | 0.153 | 56.4 | 12 | 0.094 | 40.0 | 12 | 0.119 | 44.9 | 9 | 0.059 | 37.8 |
| Qwen 2.5 VL | Base | 7 | 0.089 | 35.1 | 2 | 0.023 | 30.0 | 0 | 0.000 | 26.1 | 2 | 0.011 | 29.9 |
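A note on metrics: the SPL column presumably follows the standard success-weighted-by-path-length measure (Anderson et al., 2018). Assuming that definition, for $N$ episodes with binary success indicator $S_i$, shortest-path length $\ell_i$, and actually traversed path length $p_i$:

$$
\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i\,\frac{\ell_i}{\max(p_i,\ \ell_i)}
$$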
| Method | New York Succ.% | New York SPL | New York D.A.% | Tokyo Succ.% | Tokyo SPL | Tokyo D.A.% | Vienna Succ.% | Vienna SPL | Vienna D.A.% | São Paulo Succ.% | São Paulo SPL | São Paulo D.A.% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4.1 (Base) | 15 | 0.097 | 42.3 | 5 | 0.044 | 38.8 | 2 | 0.037 | 34.7 | 5 | 0.049 | 35.5 |
| Chain-of-Thought | 21 | 0.173 | 44.6 | 9 | 0.077 | 41.1 | 4 | 0.039 | 34.9 | 7 | 0.055 | 37.9 |
| Self-Reflection (GPT-4.1) | 16 | 0.112 | 42.9 | 4 | 0.040 | 36.2 | 3 | 0.042 | 36.3 | 12 | 0.052 | 42.0 |
| Self-Reflection (GPT-5) | 22 | 0.168 | 48.1 | 8 | 0.079 | 41.5 | 5 | 0.045 | 37.8 | 13 | 0.050 | 41.6 |
| AgentNav (Ours) | 92 | 0.557 | 75.3 | 17 | 0.101 | 43.7 | 32 | 0.182 | 50.0 | 22 | 0.080 | 44.1 |
| Method | Success (%) | SPL | D.A. (%) |
|---|---|---|---|
| GPT-4.1 (Base) | 15 | 0.097 | 42.3 |
| + Markovian Memory | 23 | 0.162 | 47.2 |
| + Decision History | 29 | 0.228 | 55.6 |
| + Previous Visit | 35 | 0.298 | 56.7 |
| + Partial Verbalization | 66 | 0.469 | 63.5 |
| AgentNav (Full) | 92 | 0.557 | 75.3 |
@article{dalal2025citynav,
title={City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs},
author={Dalal, Dwip and Mishra, Utkarsh and Ahuja, Narendra and Jojic, Nebojsa},
journal={arXiv preprint},
year={2025}
}