Vision-language models as an integrative layer for clinical artificial intelligence in radiology: A systems-level perspective.
Authors
Affiliations (1)
- Southern Hills Hospital and Medical Center, Las Vegas, NV, USA. Electronic address: [email protected].
Abstract
Clinical adoption of artificial intelligence (AI) in radiology has matured through task-specific tools for detection, segmentation, triage, and quantitative measurement, yet most deployed systems operate as isolated applications with distinct interfaces and limited clinical context. Vision-language models (VLMs), which jointly represent visual and textual information, offer a potential remedy. This review proposes that VLMs are most plausibly valuable as a downstream semantic layer on top of existing orchestration infrastructure, consuming structured outputs from task-specific models together with clinical context to produce clinically interpretable summaries. We position this proposal against rule-based orchestration middleware, structured reporting standards (DICOM SR, HL7 FHIR), and medical knowledge graphs, and identify three capabilities that VLMs uniquely add and three in which they currently underperform. A critical appraisal of three state-of-the-art systems (MAIRA-2, RadioRAG, and a representative vision-language foundation model) reveals that even the best-performing radiology VLMs exhibit logical precision of only approximately 52-56% on primary benchmarks, model-dependent retrieval gains that can collapse to zero, fourfold latency increases, and systematic demographic bias exceeding that of board-certified radiologists. We discuss the technical requirements for implementation: interoperability, conflict arbitration, provenance tagging, hallucination mitigation, latency, and regulatory classification under adaptive-AI frameworks. We argue throughout that most of the capabilities described remain at the research or early-pilot stage, and that the value of VLMs lies in filling a specific coordination gap rather than in subsuming existing integration layers.