CVPR 2026

From Where Things Are to What They Are For

Benchmarking Spatial–Functional Intelligence in Multimodal LLMs

Le Zhang¹ Jihan Yang² Soundarya Krishnan³ Jimit Majmudar³ Xiou Ge³ Prasoon Puri³ Prathamesh Saraf³ Shruti Bhargava³ Dhivya Piraviperumal³ Yinan Ling³ Cindy Pan³ Hong Yu³ Aishwarya Agrawal¹ Bo-Hsiang Tseng³

¹ Mila – Québec AI Institute, UdeM ² NYU ³ Apple

Paper Code Dataset HuggingFace

Abstract

Human-level agentic intelligence extends beyond low-level geometric perception, evolving from recognizing where things are to understanding what they are for. While existing benchmarks effectively evaluate geometric perception capabilities of multimodal large language models (MLLMs), they fall short of probing higher-order cognitive abilities required for grounded intelligence. We introduce the Spatial–Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1,500 expert-annotated questions derived from diverse egocentric indoor video scans. SFI-Bench evaluates two complementary dimensions: (1) Structured Spatial Reasoning, which requires understanding complex layouts and forming coherent spatial representations, and (2) Functional Reasoning, which involves inferring object affordances and their context-dependent utility. Our experiments reveal that current MLLMs consistently struggle to combine spatial memory with functional reasoning and external knowledge, highlighting a critical bottleneck in achieving grounded intelligence.

Benchmark

Two Dimensions of Spatial–Functional Intelligence

SFI-Bench evaluates cognitive abilities central to agentic intelligence: Structured Spatial Reasoning (understanding where things are) and Functional Reasoning (understanding what they are for).

🔢

Spatial

Global & Conditional Counting

Compositional counting with attribute constraints and set-based operations—intersection, union, and group-level aggregation.

🧭

Spatial

Cross-View Multi-hop Path Reasoning

Integrating spatial evidence across time and viewpoints to infer relationships not visible in any single frame.

🗺

Spatial

Layout Inference

Integrating distributed cues into a coherent global scene layout and reasoning about occlusion relationships.

🔗

Functional

Functional Association

Inferring affordance relationships between objects through cues such as brand, design, or spatial context.

📋

Functional

Operation Planning

Searching for device-specific information, interpreting retrieved knowledge, and assembling multi-step action plans.

🔧

Functional

Causal Troubleshooting

Diagnosing problems by combining scene understanding with external knowledge via web search.

Examples

Task Demonstrations

Each question requires watching the egocentric video and reasoning about spatial or functional relationships across multiple frames. Correct answers are highlighted.

Functional Association

Where is the object operated by the silver device on the table in the middle of the room?

AOn the right side of the console table's surface
BOn the center of the console table's surface
CNext to the computer monitor
DUnderneath the console table

Layout Inference

What objects are on the two sides of the doorway?

AThe round swivel chair and the L-shaped sofa
BThe L-shaped sofa and the TV console
CThe side table and the pile of toys
DThe large mirror and the wall art

Spatial Reasoning

Find the wall cabinet mounted to the right of the mirror. On the countertop below that cabinet, there is a charging base. What object is sitting next to that base?

AA tube of toothpaste
BA manual toothbrush
CAn electric toothbrush
DA bottle of contact lens solution

Operation Planning

I started the wrong wash cycle. How do I cancel the current program on the washing machine?

ARotate the Programme Selection knob to the "Off" position
BPress and hold the "Start/Pause" button for 3 seconds
CPress the main "On/Off" button to power down the machine
DPress and hold "Temperature" and "Spin Speed" for 5 seconds

Dataset

Data Statistics

1,555 questions sourced from 134 videos (avg. 102s). The benchmark covers both spatial and functional reasoning dimensions with carefully balanced task distributions.

SFI-Bench

1,555 Questions

Figure 1. Task distribution (left) and video duration histogram (right).

Results

Evaluation on SFI-Bench

23 models evaluated: proprietary APIs, open-source instruct, and open-source reasoning models. GPT-5.4-high achieves the best overall performance (72.1%), while counting remains the hardest task.

Methods	Rank	Avg.	GCT.	MPR.	LI.	FA.	OP.	TS.

GPT-5: Tool-Augmented vs Standard

Tool No Tool

Figure 2. Left: Radar chart of top models across 6 tasks. Right: Tool-augmented vs standard inference for GPT-5.

Analysis

Reasoning Depth Analysis

Larger models produce shorter, more compact reasoning chains and achieve consistently higher accuracy. Overthinking introduces semantic drift and degrades performance.

Reasoning Length vs Model Size

All Wrong (Instruct Correct)

Effect of Reasoning Compactness on Task Accuracy

Count Spatial Layout Functional | Size: 8B 32B 235B

Figure 3. Left: Reasoning length comparison for correct vs wrong answers across model sizes. Right: Shorter reasoning chains consistently correlate with higher accuracy; colors denote tasks, point size encodes model scale.

Insights

Key Findings

Reasoning Quality Saturates

Longer reasoning chains do not lead to better decisions. Once a moderate budget is reached, overthinking introduces semantic drift.

Visual Evidence Dominates

Cognitive map construction depends strongly on visual evidence. Models exhibit surprising insensitivity to temporal continuity.

External Knowledge is Critical

GPT-5 exhibits performance gaps exceeding 20 points depending solely on whether web search is enabled for functional tasks.

Reasoning Enables Tool Use

Strong reasoning ability is a prerequisite for effective tool use. Low-reasoning variants perform worse with web search enabled.

Counting Bottleneck

Conditional counting persists as the hardest task across all model categories, requiring compositional logical reasoning.

Open-source Models Lag

Reasoning models show minimal gains over instruct counterparts, failing to transfer capacity to spatial–functional tasks.

Citation

BibTeX

@article{zhang2025sfibench,
  title   = {From Where Things Are to What They Are For:
             Benchmarking Spatial-Functional Intelligence
             in Multimodal LLMs},
  author  = {Zhang, Le and Yang, Jihan and Krishnan, Soundarya
             and Majmudar, Jimit and Ge, Xiou and Puri, Prasoon
             and Saraf, Prathamesh and Bhargava, Shruti
             and Piraviperumal, Dhivya and Ling, Yinan
             and Pan, Cindy and Yu, Hong
             and Agrawal, Aishwarya and Tseng, Bo-Hsiang},
  journal = {arXiv preprint},
  year    = {2025}
}