Beyond Medical Diagnostics:
How Medical Multimodal Large Language Models Think in Space

Quoc-Huy Trinh*1, Xi Ding*2, Yang Liu*2, Zhenyue Qin3, Xingjian Li2, Gorkem Durak4, Halil Ertugrul Aktas4, Elif Keles4, Ulas Bagci+4, Min Xu+2
*Equal contribution    +Co-advisors
1Aalto University  2Carnegie Mellon University
3Yale University  4Northwestern University
9,782 QA pairs · 2,375 CT scans · 117 anatomical structures · 14 MLLMs evaluated · 6 spatial tasks

3D CT visualizations showing spatial reasoning tasks: directional, distance, volume, and comparative reasoning across organs and tumors.


Abstract

Visual spatial intelligence is critical for medical image interpretation, yet remains largely unexplored in Multimodal Large Language Models (MLLMs) for 3D imaging. This gap persists due to a systemic lack of datasets featuring structured 3D spatial annotations beyond basic labels. In this study, we introduce an agentic pipeline that autonomously synthesizes spatial visual question-answering (VQA) data by orchestrating computational tools such as volume and distance calculators with multi-agent collaboration and expert radiologist validation. We present SpatialMed, the first comprehensive benchmark for evaluating 3D spatial intelligence in medical MLLMs, comprising nearly 10K question-answer pairs across multiple organs and tumor types. Our evaluations on 14 state-of-the-art MLLMs and extensive analyses reveal that current models lack robust spatial reasoning capabilities for medical imaging.


Agentic Data Pipeline

Our pipeline automatically generates clinically meaningful spatial VQA data through a multi-stage process: (1) Spatial Computational Tools — volume calculators, 3D bounding-box extractors, and distance calculators derive precise spatial information from CT segmentation masks; (2) QA Generation — Qwen-2.5 composes question-answer pairs, grounded via retrieval-augmented generation (RAG) over PubMed; (3) Multi-Agent Validation — a clinical validation agent and three medical specialist agents filter trivial and low-quality samples; (4) Expert Radiologist Review — three board-certified radiologists independently validate all QA pairs.
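To make the first stage concrete, here is a minimal sketch of what such spatial computational tools could look like: given a binary CT segmentation mask and its voxel spacing, compute the structure's volume and the center-to-center distance between two structures. The function names and signatures are illustrative assumptions, not the pipeline's actual implementation.

```python
import numpy as np

def mask_volume_ml(mask: np.ndarray, spacing_mm) -> float:
    """Volume of a binary segmentation mask in millilitres.

    Voxel volume is the product of the spacings; 1 mL = 1000 mm^3.
    """
    voxel_mm3 = float(np.prod(spacing_mm))
    return float(mask.sum()) * voxel_mm3 / 1000.0

def centroid_mm(mask: np.ndarray, spacing_mm) -> np.ndarray:
    """Physical-space centroid (in mm, per axis) of a binary mask."""
    idx = np.argwhere(mask)                     # voxel coordinates of foreground
    return idx.mean(axis=0) * np.asarray(spacing_mm, dtype=float)

def center_distance_mm(mask_a: np.ndarray, mask_b: np.ndarray, spacing_mm) -> float:
    """Center-to-center Euclidean distance between two structures."""
    return float(np.linalg.norm(centroid_mm(mask_a, spacing_mm)
                                - centroid_mm(mask_b, spacing_mm)))
```

Working in physical (mm) rather than voxel coordinates matters because CT volumes are usually anisotropic (slice thickness differs from in-plane resolution).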

Overview of the SpatialMed agentic data synthesis pipeline: from spatial computation to multi-agent validation and expert radiologist review.

SpatialMed Benchmark

SpatialMed spans 6 spatial reasoning tasks, organized into multiple-choice answering (MCA) and direct volumetric estimation:

  • DIR — Directional Reasoning (superior/inferior, anterior/posterior, left/right)
  • DIST — Distance Reasoning (center-to-center distances, closest/farthest relationships)
  • EXT — Extent/Size/Shape Reasoning (bounding boxes, axis extent, width/height/depth)
  • VOL — Volume Magnitude Reasoning (numeric values, intervals, threshold decisions)
  • COMP — Comparative Reasoning (pairwise comparison, ranking largest/smallest)
  • Volume Estimation — Absolute volume, volume ratios, cross-case comparisons
Distribution of anatomical structures in SpatialMed.
Volume distribution (log2 scale) across the dataset.

Results

We evaluate 14 state-of-the-art MLLMs: 6 general-domain 2D models, 6 medical-domain 2D models, and 2 3D models. Results reveal substantial gaps in spatial reasoning.

| Method | Size | AVG | DIR | DIST | EXT | VOL | COMP | Volume Est. |
|---|---|---|---|---|---|---|---|---|
| Random | – | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | – |
| 2D Models (Non-medical Pretraining) | | | | | | | | |
| LLaVA-Next (2024) | 7B | 40.80 | 56.96 | 28.18 | 47.82 | 44.68 | 45.81 | 21.35 |
| GLM-4.1V (2025) | 9B | 26.18 | 22.03 | 19.63 | 26.97 | 37.14 | 31.09 | 22.93 |
| Qwen3-VL 4B (2025) | 4B | 50.81 | 54.50 | 49.90 | 46.20 | 63.90 | 56.20 | 34.14 |
| Qwen3-VL 8B (2025) | 8B | 44.04 | 34.59 | 60.05 | 40.56 | 58.18 | 50.14 | 20.72 |
| InternVL3 8B (2025) | 8B | 42.83 | 43.26 | 46.07 | 26.38 | 64.68 | 47.89 | 28.69 |
| InternVL3 9B (2025) | 9B | 47.87 | 51.48 | 56.00 | 37.35 | 50.39 | 56.12 | 35.92 |
| 2D Models (Medical Pretraining) | | | | | | | | |
| MedFlamingo (2023) | 9B | 32.14 | 36.07 | 25.40 | 36.24 | 39.22 | 42.00 | 13.93 |
| LLaVA-Med (2024) | 7B | 46.38 | 73.86 | 31.29 | 55.22 | 47.27 | 55.48 | 15.17 |
| HuatuoGPT-Vision (2024) | 7B | 1.34 | 1.60 | 1.04 | 1.76 | 2.08 | 1.56 | NaN |
| MedMoE (2024) | 4B | 15.84 | 22.15 | 5.08 | 23.77 | 11.95 | 15.64 | 16.45 |
| Med-VLM R1 (2025) | 8B | 6.30 | 5.14 | 6.58 | 4.45 | 11.95 | 9.67 | NaN |
| Med-Gemma 4B (2025) | 4B | 43.18 | 54.11 | 24.36 | 50.22 | 47.53 | 57.16 | 25.69 |
| 3D Models | | | | | | | | |
| Med-2E3 (2024) | 3B | 38.60 | 33.68 | 30.25 | 39.32 | 50.65 | 46.09 | 31.58 |
| M3D (2024) | 7B | 8.86 | 0.11 | 0.00 | 31.10 | 1.04 | 12.07 | NaN |

Spatial Reasoning Benchmark: comparison of 2D and 3D vision-language models on SpatialMed (accuracy, %).

Key Findings

  • Distance reasoning is the universal bottleneck — even top models struggle with DIST reasoning tasks.
  • Medical pretraining does not help — some medical models underperform general-domain models on spatial tasks.
  • Scale is non-monotonic — Qwen3-VL 4B outperforms its 8B variant (50.81% vs 44.04%).
  • 65% of reasoning chains are hallucinated — models produce plausible but spatially incorrect explanations.
  • Numeric instability — several models return NaN on volume estimation, i.e., they never commit to a numeric answer.

Analysis

Per-organ model accuracy heatmap.
Reasoning faithfulness analysis: faithful reasoning, decision errors, lucky guesses, and hallucinations.
Human-annotated error taxonomy: visual perception, linguistic, relational, and numeric reasoning errors.
Tumor-wise MCA accuracy across models.
Model performance stratified by anatomical volume ranges.

Citation

@misc{trinh2026medicaldiagnosticsmedicalmultimodal,
      title={Beyond Medical Diagnostics: How Medical Multimodal Large Language Models Think in Space}, 
      author={Quoc-Huy Trinh and Xi Ding and Yang Liu and Zhenyue Qin and Xingjian Li and Gorkem Durak and Halil Ertugrul Aktas and Elif Keles and Ulas Bagci and Min Xu},
      year={2026},
      eprint={2603.13800},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.13800}, 
}

Acknowledgments

This work was supported in part by U.S. NSF grants DBI-2238093, DBI-2422619, IIS-2211597, and MCB-2205148.