CVPRW 2026

PRS-Med: Position Reasoning Segmentation
in Medical Imaging

1Aalto University,  2Northwestern University,  3Technical University of Denmark,  4Chongqing University of Posts and Telecommunications,  5University of South Dakota
*Corresponding author    Co-advisors
Sample PRS-Med QA pairs and segmentation masks across modalities
PRS-Med takes position-grounded natural language queries and returns precise segmentation masks across six medical imaging modalities — CT, MRI, X-ray, ultrasound, endoscopy, and dermatology.

Abstract

Medical image segmentation requires more than pixel-level pattern recognition — it demands spatial position reasoning that reflects how clinicians communicate findings. When a radiologist says "there is a mass in the upper-left lobe," they are linking anatomical position to pathological presence. Yet existing medical multimodal large language models (MLLMs) and segmentation systems struggle to bridge this gap.

We introduce PRS-Med, a unified framework for Position Reasoning Segmentation in medical imaging, together with PosMed, the first large-scale benchmark dataset designed for this task. PosMed contains 116,000 expert-validated, spatially-grounded QA pairs derived from 38,731 images across six imaging modalities — CT, MRI, X-ray, ultrasound, endoscopy, and dermatology — built via a two-stage automated pipeline with radiologist review.

PRS-Med integrates a TinySAM vision encoder with LLaVA-Med via LoRA adaptation, using a cross-attention fusion module to produce segmentation masks conditioned on free-form spatial language queries. Our Clinical Action Zones approach encodes position using interpretable anatomical quadrants rather than pixel-coordinate regression, aligning model outputs with natural radiologist language. Extensive experiments across six modalities demonstrate state-of-the-art performance in both segmentation accuracy and position reasoning.


PosMed: Position-Reasoning Medical Dataset

The first large-scale benchmark for spatially-grounded QA and segmentation in medical imaging.

116K

Expert-Validated QA Pairs

38,731

Medical Images

6

Imaging Modalities

9

Source Datasets

55

QA Templates

PosMed two-stage dataset creation pipeline
Figure 1. PosMed dataset pipeline. Stage 1 — Template Preparation: 55 doctor-validated, GPT-4-refined QA templates. Stage 2 — Position Extraction: bounding-box centroids mapped to five Clinical Action Zones (top-left, top-right, bottom-left, bottom-right, center). Majority-vote review by board-certified radiologists filters ~310K raw pairs down to 116K validated QA pairs.
1

Template Preparation

55 QA templates co-designed with 3 doctors, then linguistically polished by GPT-4 for naturalness and clinical accuracy.

2

Position Extraction

Bounding-box centroids from ground-truth segmentation masks are mapped to one of five Clinical Action Zones.

3

GPT QA Generation

Templates and extracted positions are combined via GPT to generate ~310K diverse, clinically-phrased QA pairs.

4

Expert Radiologist Validation

Board-certified radiologists review for medical accuracy, positional relevance, and clinical plausibility via majority voting — yielding 116K final pairs.

Source dataset distribution
Source dataset distribution. PosMed aggregates public source datasets: BUSI, Brain MRI, LungCT, LungXray, Kvasir-SEG, ClinicDB, CVC300, ColonDB, ETIS, and ISIC.
Tumor vs anatomy-centric QA distribution
Description type. 84.3% anatomy-centric vs 15.7% tumor-centric QA pairs, reflecting the spatial language distribution found in real clinical reports.

PRS-Med Architecture

A unified framework for position-aware natural language segmentation.

PRS-Med overall architecture diagram
Figure 3. PRS-Med architecture. Given a medical image and a spatial language query, the TinySAM vision encoder and LLaVA-Med (LoRA-adapted) process image and text in parallel. A cross-attention fusion module combines both representations into a shared space, and a transposed-convolution mask decoder upsamples the result to a 1024×1024 segmentation mask.

Two-Stage Training

Stage 1 — Feature Alignment. The vision encoder and LLM are frozen; only the cross-attention fusion module is trained to align visual and language representations.

Stage 2 — End-to-End Fine-tuning. All components are jointly optimized on PosMed using a combined objective: L = LBCE + LDice + 0.5 · LCE, balancing mask quality and language generation.
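The combined objective above can be sketched numerically. This is a minimal NumPy illustration of L = LBCE + LDice + 0.5 · LCE, not the authors' training code (which would use a deep-learning framework); all function and argument names here are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_loss(mask_logits, target):
    # Numerically stable binary cross-entropy with logits, averaged over pixels.
    return np.mean(np.maximum(mask_logits, 0) - mask_logits * target
                   + np.log1p(np.exp(-np.abs(mask_logits))))

def dice_loss(mask_logits, target, eps=1.0):
    # Soft Dice: 1 - 2|P∩G| / (|P| + |G|), with smoothing eps.
    probs = sigmoid(mask_logits)
    inter = (probs * target).sum()
    return 1.0 - (2 * inter + eps) / (probs.sum() + target.sum() + eps)

def ce_loss(text_logits, token_ids):
    # Token-level cross-entropy: text_logits (L, V), token_ids (L,).
    shifted = text_logits - text_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(token_ids)), token_ids].mean()

def combined_loss(mask_logits, mask_target, text_logits, token_ids):
    # L = L_BCE + L_Dice + 0.5 * L_CE, as in the Stage 2 objective.
    return (bce_loss(mask_logits, mask_target)
            + dice_loss(mask_logits, mask_target)
            + 0.5 * ce_loss(text_logits, token_ids))
```

A near-perfect prediction (confident logits matching the target mask and tokens) drives all three terms toward zero, so the combined loss balances mask quality against language quality without either dominating.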

Clinical Action Zones

Rather than regressing pixel coordinates, PRS-Med encodes lesion position using five interpretable quadrants: Top-Left, Top-Right, Bottom-Left, Bottom-Right, and Center — derived from bounding-box centroids of ground-truth masks.

This design mirrors the spatial language radiologists use in clinical reports ("upper-right lobe," "lower-left quadrant") and produces outputs that align naturally with clinical documentation.
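The centroid-to-zone mapping can be sketched as follows. This is an illustrative reconstruction, not the released pipeline code: the center-band width (here the middle third of each axis) is an assumption, as is the function name.

```python
import numpy as np

def mask_to_zone(mask: np.ndarray, center_frac: float = 1 / 3) -> str:
    """Map a binary ground-truth mask (H, W) to a Clinical Action Zone."""
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        raise ValueError("empty mask")
    # Bounding-box centroid, normalised to [0, 1] in each axis.
    cy = (ys.min() + ys.max()) / 2 / mask.shape[0]
    cx = (xs.min() + xs.max()) / 2 / mask.shape[1]
    # Centroids inside the central band of both axes map to "center".
    lo, hi = 0.5 - center_frac / 2, 0.5 + center_frac / 2
    if lo <= cy <= hi and lo <= cx <= hi:
        return "center"
    vert = "top" if cy < 0.5 else "bottom"
    horiz = "left" if cx < 0.5 else "right"
    return f"{vert}-{horiz}"
```

A lesion whose bounding-box centroid falls near the upper-left corner yields "top-left", while one near the image midpoint yields "center"; these zone strings then slot directly into the QA templates.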

| Component | Module | Output Shape | Notes |
| Vision Encoder | TinySAM (TinyViT) | 256 × 16 × 16 | Unfrozen for medical adaptation |
| Multimodal LLM | LLaVA-Med + LoRA | b × l × 4096 | rank=16, α=16; full hidden states used |
| Fusion Module | Cross-Attention + skip | b × 256 × 16 × 16 | Projects both streams to 256-dim |
| Mask Decoder | Transposed Conv | b × 1 × 1024 × 1024 | BCE + Dice loss |
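The fusion step can be sketched at the shape level: image features (b, 256, 16, 16) act as queries over the LLM hidden states (b, l, 4096), with both streams projected to 256 dimensions and a skip connection back to the visual stream. Random matrices stand in for learned projections; this is an illustration of the tensor flow, not the released implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(img_feat, text_hidden, rng):
    """img_feat: (B, 256, 16, 16); text_hidden: (B, L, 4096) -> (B, 256, 16, 16)."""
    B, C, H, W = img_feat.shape
    _, L, D = text_hidden.shape
    # Stand-in projection weights (learned parameters in the real model).
    W_q = rng.standard_normal((C, C)) / np.sqrt(C)
    W_k = rng.standard_normal((D, C)) / np.sqrt(D)
    W_v = rng.standard_normal((D, C)) / np.sqrt(D)
    # Flatten the spatial grid into H*W visual query tokens.
    q = img_feat.reshape(B, C, H * W).transpose(0, 2, 1) @ W_q   # (B, HW, 256)
    k = text_hidden @ W_k                                        # (B, L, 256)
    v = text_hidden @ W_v                                        # (B, L, 256)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C))        # (B, HW, L)
    fused = attn @ v                                             # (B, HW, 256)
    # Skip connection back to the visual stream, per the table above.
    fused = fused + img_feat.reshape(B, C, H * W).transpose(0, 2, 1)
    return fused.transpose(0, 2, 1).reshape(B, C, H, W)
```

The fused (b, 256, 16, 16) map is what the transposed-convolution decoder then upsamples by 64× to the final 1024 × 1024 mask.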

Results

Evaluated across 6 modalities against segmentation and position reasoning baselines, all fine-tuned on PosMed.

Segmentation Performance  (mDice / mIoU)

| Method | Breast US | Brain MRI | Lung CT | Lung X-ray | Polyp | Skin |
| Segmentation Baselines |
| G-DINO + SAM-Med2D | 0.768 / 0.693 | 0.607 / 0.591 | 0.657 / 0.567 | 0.949 / 0.916 | 0.796 / 0.751 | 0.868 / 0.794 |
| BiomedParse | 0.781 / 0.697 | 0.681 / 0.625 | 0.748 / 0.664 | 0.961 / 0.931 | 0.821 / 0.772 | 0.885 / 0.815 |
| LISA-7B | 0.783 / 0.698 | 0.667 / 0.625 | 0.736 / 0.667 | 0.972 / 0.951 | 0.824 / 0.774 | 0.893 / 0.822 |
| LISA-13B | 0.790 / 0.706 | 0.708 / 0.668 | 0.737 / 0.668 | 0.972 / 0.951 | 0.826 / 0.775 | 0.895 / 0.823 |
| PRS-Med (Ours) | 0.817 / 0.729 | 0.803 / 0.757 | 0.968 / 0.943 | 0.973 / 0.952 | 0.843 / 0.791 | 0.901 / 0.833 |

Position Reasoning Performance  (ROUGE / Accuracy)

| Method | Breast US | Brain MRI | Lung CT | Lung X-ray | Polyp | Skin |
| Medical MLLM Baselines |
| LLaVA-Med | 0.58 / 79.75% | 0.61 / 80.17% | 0.65 / 88.17% | 0.59 / 82.08% | 0.65 / 55.64% | 0.71 / 92.08% |
| HuatuoGPT | 0.59 / 81.63% | 0.62 / 81.17% | 0.66 / 88.17% | 0.60 / 84.00% | 0.66 / 56.64% | 0.72 / 92.08% |
| Med-MoE | 0.60 / 81.63% | 0.63 / 81.17% | 0.67 / 88.17% | 0.61 / 84.00% | 0.67 / 56.64% | 0.73 / 92.08% |
| MedVLM-R1 | 0.61 / 81.63% | 0.64 / 81.17% | 0.69 / 88.17% | 0.62 / 84.00% | 0.68 / 55.64% | 0.74 / 92.08% |
| PRS-Med (Ours) | 0.64 / 92.92% | 0.67 / 86.17% | 0.71 / 76.67% | 0.64 / 94.36% | 0.71 / 72.43% | 0.76 / 96.31% |

PRS-Med leads on 5/6 modalities for position reasoning and on all 6 for segmentation.
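The segmentation numbers above are the standard per-image Dice and IoU, averaged over each test set (mDice / mIoU). A minimal sketch of the two metrics, not the paper's evaluation script:

```python
import numpy as np

def dice_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6):
    """Dice and IoU between binarised prediction and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dice = (2 * inter + eps) / (pred.sum() + gt.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return dice, iou
```

Dice and IoU are monotonically related (Dice = 2·IoU / (1 + IoU)), so Dice is always at least as large as IoU, which is why each mDice entry in the table exceeds its paired mIoU.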


Qualitative Examples

PRS-Med takes a position-grounded natural language query and returns a segmentation mask across all six modalities.

PRS-Med qualitative segmentation results across modalities
Figure 4. PRS-Med qualitative results. For each modality, the input image and spatial language query (e.g., "segment the lesion in the upper-left") are shown alongside the predicted segmentation mask and ground truth overlay.

BibTeX

@article{trinh2025prs,
  title={{PRS-Med}: Position Reasoning Segmentation with Vision-Language Model in Medical Imaging},
  author={Trinh, Quoc-Huy and Nguyen, Minh-Van and Zeng, Jung and Bagci, Ulas and Jha, Debesh},
  journal={arXiv preprint arXiv:2505.11872},
  year={2025}
}