Beyond Medical Diagnostics:
How Medical Multimodal Large Language Models Think in Space

Quoc-Huy Trinh*1, Xi Ding*2, Yang Liu*2, Zhenyue Qin3, Xingjian Li2, Gorkem Durak4, Halil Ertugrul Aktas4, Elif Keles4, Ulas Bagci+4, Min Xu+2
*Equal contribution    +Co-advisors
1Aalto University  2Carnegie Mellon University
3Yale University  4Northwestern University
9,782 QA pairs · 2,375 CT scans · 117 anatomical structures · 14 MLLMs evaluated · 6 spatial tasks

3D CT visualizations showing spatial reasoning tasks: directional, distance, volume, and comparative reasoning across organs and tumors.


Abstract

Visual spatial intelligence is critical for medical image interpretation, yet remains largely unexplored in Multimodal Large Language Models (MLLMs) for 3D imaging. This gap persists due to a systemic lack of datasets featuring structured 3D spatial annotations beyond basic labels. In this study, we introduce an agentic pipeline that autonomously synthesizes spatial visual question-answering (VQA) data by orchestrating computational tools such as volume and distance calculators with multi-agent collaboration and expert radiologist validation. We present SpatialMed, the first comprehensive benchmark for evaluating 3D spatial intelligence in medical MLLMs, comprising nearly 10K question-answer pairs across multiple organs and tumor types. Our evaluations on 14 state-of-the-art MLLMs and extensive analyses reveal that current models lack robust spatial reasoning capabilities for medical imaging.


Agentic Data Pipeline

Our pipeline automatically generates clinically meaningful spatial VQA data through a multi-stage process: (1) Spatial Computational Tools — volume calculators, 3D bounding-box extractors, and distance calculators derive precise spatial information from CT segmentation masks; (2) QA Generation — Qwen-2.5 composes question-answer pairs, grounded via retrieval-augmented generation (RAG) over PubMed; (3) Multi-Agent Validation — a clinical validation agent and three medical specialist agents filter trivial and low-quality samples; (4) Expert Radiologist Review — three board-certified radiologists independently validate all QA pairs.
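To make the first stage concrete, here is a minimal sketch of what such spatial computational tools could look like: given a binary CT segmentation mask and its voxel spacing, compute the structure's volume and the center-to-center distance between two structures. The function names and signatures are illustrative assumptions, not the pipeline's actual implementation.

```python
import numpy as np

def mask_volume_ml(mask: np.ndarray, spacing_mm) -> float:
    """Volume of a binary segmentation mask in millilitres.

    Voxel volume is the product of the spacings; 1 mL = 1000 mm^3.
    """
    voxel_mm3 = float(np.prod(spacing_mm))
    return float(mask.sum()) * voxel_mm3 / 1000.0

def centroid_mm(mask: np.ndarray, spacing_mm) -> np.ndarray:
    """Physical-space centroid (in mm, per axis) of a binary mask."""
    idx = np.argwhere(mask)                     # voxel coordinates of foreground
    return idx.mean(axis=0) * np.asarray(spacing_mm, dtype=float)

def center_distance_mm(mask_a: np.ndarray, mask_b: np.ndarray, spacing_mm) -> float:
    """Center-to-center Euclidean distance between two structures."""
    return float(np.linalg.norm(centroid_mm(mask_a, spacing_mm)
                                - centroid_mm(mask_b, spacing_mm)))
```

Working in physical (mm) rather than voxel coordinates matters because CT volumes are usually anisotropic (slice thickness differs from in-plane resolution).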

Overview of the SpatialMed agentic data synthesis pipeline: from spatial computation to multi-agent validation and expert radiologist review.

SpatialMed Benchmark

SpatialMed spans 6 spatial reasoning tasks, organized into multiple-choice answering (MCA) and direct volumetric estimation:

  • DIR — Directional Reasoning (superior/inferior, anterior/posterior, left/right)
  • DIST — Distance Reasoning (center-to-center distances, closest/farthest relationships)
  • EXT — Extent/Size/Shape Reasoning (bounding boxes, axis extent, width/height/depth)
  • VOL — Volume Magnitude Reasoning (numeric values, intervals, threshold decisions)
  • COMP — Comparative Reasoning (pairwise comparison, ranking largest/smallest)
  • Volume Estimation — Absolute volume, volume ratios, cross-case comparisons
Distribution of anatomical structures in SpatialMed.
Volume distribution (log2 scale) across the dataset.

Results

We evaluate 14 state-of-the-art MLLMs: 6 general-domain 2D models, 6 medical-domain 2D models, and 2 3D models. Results reveal substantial gaps in spatial reasoning.

| Method | Size | AVG | DIR | DIST | EXT | VOL | COMP | Volume Est. |
|---|---|---|---|---|---|---|---|---|
| Random | – | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | – |
| 2D Models (Non-medical Pretraining) | | | | | | | | |
| LLaVA-Next (2024) | 7B | 40.80 | 56.96 | 28.18 | 47.82 | 44.68 | 45.81 | 21.35 |
| GLM-4.1V (2025) | 9B | 26.18 | 22.03 | 19.63 | 26.97 | 37.14 | 31.09 | 22.93 |
| Qwen3-VL 4B (2025) | 4B | 50.81 | 54.50 | 49.90 | 46.20 | 63.90 | 56.20 | 34.14 |
| Qwen3-VL 8B (2025) | 8B | 44.04 | 34.59 | 60.05 | 40.56 | 58.18 | 50.14 | 20.72 |
| InternVL3 8B (2025) | 8B | 42.83 | 43.26 | 46.07 | 26.38 | 64.68 | 47.89 | 28.69 |
| InternVL3 9B (2025) | 9B | 47.87 | 51.48 | 56.00 | 37.35 | 50.39 | 56.12 | 35.92 |
| 2D Models (Medical Pretraining) | | | | | | | | |
| MedFlamingo (2023) | 9B | 32.14 | 36.07 | 25.40 | 36.24 | 39.22 | 42.00 | 13.93 |
| LLaVA-Med (2024) | 7B | 46.38 | 73.86 | 31.29 | 55.22 | 47.27 | 55.48 | 15.17 |
| HuatuoGPT-Vision (2024) | 7B | 1.34 | 1.60 | 1.04 | 1.76 | 2.08 | 1.56 | NaN |
| MedMoE (2024) | 4B | 15.84 | 22.15 | 5.08 | 23.77 | 11.95 | 15.64 | 16.45 |
| Med-VLM R1 (2025) | 8B | 6.30 | 5.14 | 6.58 | 4.45 | 11.95 | 9.67 | NaN |
| Med-Gemma 4B (2025) | 4B | 43.18 | 54.11 | 24.36 | 50.22 | 47.53 | 57.16 | 25.69 |
| 3D Models | | | | | | | | |
| Med-2E3 (2024) | 3B | 38.60 | 33.68 | 30.25 | 39.32 | 50.65 | 46.09 | 31.58 |
| M3D (2024) | 7B | 8.86 | 0.11 | 0.00 | 31.10 | 1.04 | 12.07 | NaN |

Spatial Reasoning Benchmark: comparison of 2D and 3D vision-language models on SpatialMed (accuracy, %).

Key Findings

  • Distance reasoning is the universal bottleneck — even top models struggle with DIST reasoning tasks.
  • Medical pretraining does not help — some medical models underperform general-domain models on spatial tasks.
  • Scale is non-monotonic — Qwen3-VL 4B outperforms its 8B variant (50.81% vs 44.04%).
  • 65% of reasoning chains are hallucinated — models produce plausible but spatially incorrect explanations.
  • Numeric instability — several models return NaN on volume estimation, i.e., they never commit to a numeric answer.

Analysis

Per-organ model accuracy heatmap.
Reasoning faithfulness analysis: faithful reasoning, decision errors, lucky guesses, and hallucinations.
Human-annotated error taxonomy: visual perception, linguistic, relational, and numeric reasoning errors.
Tumor-wise MCA accuracy across models.
Model performance stratified by anatomical volume ranges.

Citation

@misc{trinh2026medicaldiagnosticsmedicalmultimodal,
      title={Beyond Medical Diagnostics: How Medical Multimodal Large Language Models Think in Space}, 
      author={Quoc-Huy Trinh and Xi Ding and Yang Liu and Zhenyue Qin and Xingjian Li and Gorkem Durak and Halil Ertugrul Aktas and Elif Keles and Ulas Bagci and Min Xu},
      year={2026},
      eprint={2603.13800},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.13800}, 
}

Acknowledgments

This work was supported in part by U.S. NSF grants DBI-2238093, DBI-2422619, IIS-2211597, and MCB-2205148.