CVPR 2026

From Where Things Are to What They Are For

Benchmarking Spatial–Functional Intelligence in Multimodal LLMs

Le Zhang1 Jihan Yang2 Soundarya Krishnan3 Jimit Majmudar3 Xiou Ge3 Prasoon Puri3 Prathamesh Saraf3 Shruti Bhargava3 Dhivya Piraviperumal3 Yinan Ling3 Cindy Pan3 Hong Yu3 Aishwarya Agrawal1 Bo-Hsiang Tseng3
1 Mila – Québec AI Institute, UdeM    2 NYU    3 Apple
Paper · Code · Dataset · HuggingFace

Abstract

Human-level agentic intelligence extends beyond low-level geometric perception, evolving from recognizing where things are to understanding what they are for. While existing benchmarks effectively evaluate geometric perception capabilities of multimodal large language models (MLLMs), they fall short of probing higher-order cognitive abilities required for grounded intelligence. We introduce the Spatial–Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1,500 expert-annotated questions derived from diverse egocentric indoor video scans. SFI-Bench evaluates two complementary dimensions: (1) Structured Spatial Reasoning, which requires understanding complex layouts and forming coherent spatial representations, and (2) Functional Reasoning, which involves inferring object affordances and their context-dependent utility. Our experiments reveal that current MLLMs consistently struggle to combine spatial memory with functional reasoning and external knowledge, highlighting a critical bottleneck in achieving grounded intelligence.

1,555 Expert-Annotated Questions · 200 Egocentric Indoor Videos · 6 Core Task Types · 23 Models Benchmarked
Benchmark
Two Dimensions of Spatial–Functional Intelligence

SFI-Bench evaluates cognitive abilities central to agentic intelligence: Structured Spatial Reasoning (understanding where things are) and Functional Reasoning (understanding what they are for).

🔢
Spatial

Global & Conditional Counting

Compositional counting with attribute constraints and set-based operations—intersection, union, and group-level aggregation.

🧭
Spatial

Cross-View Multi-hop Path Reasoning

Integrating spatial evidence across time and viewpoints to infer relationships not visible in any single frame.

🗺
Spatial

Layout Inference

Integrating distributed cues into a coherent global scene layout and reasoning about occlusion relationships.

🔗
Functional

Functional Association

Inferring affordance relationships between objects through cues such as brand, design, or spatial context.

📋
Functional

Operation Planning

Searching for device-specific information, interpreting retrieved knowledge, and assembling multi-step action plans.

🔧
Functional

Causal Troubleshooting

Diagnosing problems by combining scene understanding with external knowledge via web search.

Examples
Task Demonstrations

Each question requires watching the egocentric video and reasoning about spatial or functional relationships across multiple frames. Correct answers are highlighted.

Functional Association
Where is the object operated by the silver device on the table in the middle of the room?
  • A. On the right side of the console table's surface
  • B. On the center of the console table's surface
  • C. Next to the computer monitor
  • D. Underneath the console table
Layout Inference
What objects are on the two sides of the doorway?
  • A. The round swivel chair and the L-shaped sofa
  • B. The L-shaped sofa and the TV console
  • C. The side table and the pile of toys
  • D. The large mirror and the wall art
Spatial Reasoning
Find the wall cabinet mounted to the right of the mirror. On the countertop below that cabinet, there is a charging base. What object is sitting next to that base?
  • A. A tube of toothpaste
  • B. A manual toothbrush
  • C. An electric toothbrush
  • D. A bottle of contact lens solution
Operation Planning
I started the wrong wash cycle. How do I cancel the current program on the washing machine?
  • A. Rotate the Programme Selection knob to the "Off" position
  • B. Press and hold the "Start/Pause" button for 3 seconds
  • C. Press the main "On/Off" button to power down the machine
  • D. Press and hold "Temperature" and "Spin Speed" for 5 seconds
Dataset
Data Statistics

1,555 questions sourced from 134 videos (avg. 102s). The benchmark covers both spatial and functional reasoning dimensions with carefully balanced task distributions.

Figure 1. Task distribution across the six SFI-Bench tasks (left) and video duration histogram (right).
Results
Evaluation on SFI-Bench

23 models evaluated, spanning proprietary APIs, open-source instruct models, and open-source reasoning models. GPT-5-high achieves the best overall performance (72.1%), while counting remains the hardest task.
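Since every SFI-Bench question is four-option multiple choice, scoring reduces to exact-match accuracy over the predicted option letters. A minimal scoring sketch (the field names `answer` and `prediction` are illustrative, not the released dataset schema):

```python
def option_letter(response: str) -> str:
    """Normalize a free-form model response to a single option letter A-D."""
    s = response.strip().upper()
    return s[0] if s and s[0] in "ABCD" else ""

def accuracy(examples) -> float:
    """examples: iterable of dicts with gold 'answer' and model 'prediction'."""
    examples = list(examples)
    hits = sum(option_letter(e["prediction"]) == option_letter(e["answer"])
               for e in examples)
    return hits / len(examples) if examples else 0.0
```

In practice a stricter parser (e.g., rejecting responses that name multiple options) may be preferable; this sketch simply takes the first letter of the normalized response.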

Table 1. Per-task results. Columns: Methods | Rank | Avg. | GC (Global & Conditional Counting) | MPR (Multi-hop Path Reasoning) | LI (Layout Inference) | FA (Functional Association) | OP (Operation Planning) | TS (Troubleshooting).
Figure 2. Left: Radar chart of top models across the six tasks. Right: Tool-augmented vs. standard inference for GPT-5.
Analysis
Reasoning Depth Analysis

Larger models produce shorter, more compact reasoning chains and achieve consistently higher accuracy. Overthinking introduces semantic drift and degrades performance.

Figure 3. Left: Reasoning length for correct vs. wrong answers across model sizes (8B, 32B, 235B). Right: Effect of reasoning compactness on task accuracy; shorter reasoning chains consistently correlate with higher accuracy. Colors denote tasks (Count, Spatial, Layout, Functional); point size encodes model scale.
Insights
Key Findings
1

Reasoning Quality Saturates

Longer reasoning chains do not lead to better decisions. Once a moderate budget is reached, overthinking introduces semantic drift.

2

Visual Evidence Dominates

Cognitive map construction depends strongly on visual evidence. Models exhibit surprising insensitivity to temporal continuity.

3

External Knowledge is Critical

GPT-5 exhibits performance gaps exceeding 20 points depending solely on whether web search is enabled for functional tasks.

4

Reasoning Enables Tool Use

Strong reasoning ability is a prerequisite for effective tool use. Low-reasoning variants perform worse with web search enabled.

5

Counting Bottleneck

Conditional counting persists as the hardest task across all model categories, requiring compositional logical reasoning.

6

Open-source Models Lag

Reasoning models show minimal gains over their instruct counterparts, failing to transfer general reasoning capacity to spatial–functional tasks.

Citation
BibTeX
@article{zhang2025sfibench,
  title   = {From Where Things Are to What They Are For:
             Benchmarking Spatial-Functional Intelligence
             in Multimodal LLMs},
  author  = {Zhang, Le and Yang, Jihan and Krishnan, Soundarya
             and Majmudar, Jimit and Ge, Xiou and Puri, Prasoon
             and Saraf, Prathamesh and Bhargava, Shruti
             and Piraviperumal, Dhivya and Ling, Yinan
             and Pan, Cindy and Yu, Hong
             and Agrawal, Aishwarya and Tseng, Bo-Hsiang},
  journal = {arXiv preprint},
  year    = {2025}
}