Benchmarking Spatial–Functional Intelligence in Multimodal LLMs
Human-level agentic intelligence extends beyond low-level geometric perception, evolving from recognizing where things are to understanding what they are for. While existing benchmarks effectively evaluate geometric perception capabilities of multimodal large language models (MLLMs), they fall short of probing higher-order cognitive abilities required for grounded intelligence. We introduce the Spatial–Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1,500 expert-annotated questions derived from diverse egocentric indoor video scans. SFI-Bench evaluates two complementary dimensions: (1) Structured Spatial Reasoning, which requires understanding complex layouts and forming coherent spatial representations, and (2) Functional Reasoning, which involves inferring object affordances and their context-dependent utility. Our experiments reveal that current MLLMs consistently struggle to combine spatial memory with functional reasoning and external knowledge, highlighting a critical bottleneck in achieving grounded intelligence.
SFI-Bench evaluates cognitive abilities central to agentic intelligence: Structured Spatial Reasoning (understanding where things are) and Functional Reasoning (understanding what they are for).
Compositional counting with attribute constraints and set-based operations—intersection, union, and group-level aggregation.
Integrating spatial evidence across time and viewpoints to infer relationships not visible in any single frame.
Integrating distributed cues into a coherent global scene layout and reasoning about occlusion relationships.
Inferring affordance relationships between objects through cues such as brand, design, or spatial context.
Searching for device-specific information, interpreting retrieved knowledge, and assembling multi-step action plans.
Diagnosing problems by combining scene understanding with external knowledge via web search.
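The compositional-counting task above combines attribute constraints with set operations across viewpoints. A minimal sketch of the kind of query involved, using hypothetical per-frame object annotations (all object names and attributes here are illustrative, not drawn from SFI-Bench):

```python
from collections import Counter

# Hypothetical per-frame annotations: each frame is a set of
# (category, attribute) pairs observed from that viewpoint.
frames = [
    {("mug", "red"), ("mug", "blue"), ("chair", "black")},
    {("mug", "red"), ("lamp", "white"), ("chair", "black")},
]

# Union across viewpoints: distinct objects seen anywhere in the video.
seen = set().union(*frames)

# Attribute-constrained count: how many distinct red mugs?
red_mugs = {o for o in seen if o == ("mug", "red")}

# Intersection: objects visible in every frame (needs cross-frame memory).
persistent = set.intersection(*frames)

# Group-level aggregation: distinct instances per category.
per_category = Counter(cat for cat, _ in seen)

print(len(red_mugs))       # 1
print(len(persistent))     # 2
print(per_category["mug"]) # 2
```

The point of the sketch is that a correct answer requires holding a deduplicated cross-frame inventory, not counting per-frame detections, which is exactly where single-frame perception falls short.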
Each question requires watching the egocentric video and reasoning about spatial or functional relationships across multiple frames. Correct answers are highlighted.
1,555 questions sourced from 134 videos (avg. 102s). The benchmark covers both spatial and functional reasoning dimensions with carefully balanced task distributions.
23 models evaluated: proprietary APIs, open-source instruct, and open-source reasoning models. GPT-5-high achieves the best overall performance (72.1%), while counting remains the hardest task.
| Methods | Rank | Avg. | GCT. | MPR. | LI. | FA. | OP. | TS. |
|---|---|---|---|---|---|---|---|---|
Larger models produce shorter, more compact reasoning chains and achieve consistently higher accuracy. Overthinking introduces semantic drift and degrades performance.
Longer reasoning chains do not lead to better decisions: once a moderate reasoning budget is reached, additional tokens add semantic drift rather than accuracy.
Cognitive map construction depends strongly on visual evidence. Models exhibit surprising insensitivity to temporal continuity.
On functional tasks, GPT-5 exhibits performance gaps exceeding 20 points depending solely on whether web search is enabled.
Strong reasoning ability is a prerequisite for effective tool use. Low-reasoning variants perform worse with web search enabled.
Conditional counting persists as the hardest task across all model categories, requiring compositional logical reasoning.
Reasoning models show minimal gains over their instruct counterparts, failing to transfer their reasoning capacity to spatial–functional tasks.
@article{zhang2025sfibench,
title = {From Where Things Are to What They Are For:
Benchmarking Spatial-Functional Intelligence
in Multimodal LLMs},
author = {Zhang, Le and Yang, Jihan and Krishnan, Soundarya
and Majmudar, Jimit and Ge, Xiou and Puri, Prasoon
and Saraf, Prathamesh and Bhargava, Shruti
and Piraviperumal, Dhivya and Ling, Yinan
and Pan, Cindy and Yu, Hong
and Agrawal, Aishwarya and Tseng, Bo-Hsiang},
journal = {arXiv preprint},
year = {2025}
}