SpineVLM: A Markdown-Guided Structured Fine-Tuning Framework for Spine X-ray Report Generation
Authors
Abstract
Automated medical report generation in specialized fields like spine radiography is constrained by data scarcity and high annotation costs. Consequently, existing multimodal large language models (MLLMs) struggle in these settings, often missing minute, scattered spinal abnormalities. We introduce SpineVLM, a data-efficient framework for structured spine X-ray report generation. The framework is built upon the newly constructed SXRG dataset, comprising 10,468 image-report pairs produced via a hierarchical AI-assisted annotation pipeline. To optimize learning under limited data, we propose Markdown-Guided Structured Learning (MGSL), which reformulates unconstrained free-text synthesis into a structured completion task, acting as a strong regularizer. Furthermore, an unsupervised Region-Focused Inference (RFI) module powered by the DINOv2 foundation model isolates the vertebral column to enhance the perception of subtle lesions without requiring manual spatial annotations. Evaluated on a 7B-parameter vision-language backbone, SpineVLM achieves strong performance against ten baseline multimodal models on standard linguistic metrics. In a double-blind reader study, the system achieved a diagnostic F1-score of 0.866, comparable to specialist performance, while reducing clinical reporting time by over 41%. By open-sourcing the dataset and codebase, we provide, to our knowledge, the first quantitative benchmark for automated spine radiography report generation, together with a structured framework for this data-limited setting. All data and code will be publicly released at https://github.com/LiuDongDaniel/SpineVLM.
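To make the MGSL idea concrete, the following is a minimal sketch of how free-text reports can be recast as a markdown template-completion target. The section names and field keys below are illustrative assumptions for exposition only; they are not the paper's actual report schema.

```python
# Hypothetical sketch of Markdown-Guided Structured Learning (MGSL):
# rather than training the model to emit unconstrained free text, each
# report is rendered into a fixed markdown skeleton, so the model only
# fills structured slots. All section names here are assumptions.

TEMPLATE = """## Findings
### Alignment
{alignment}
### Vertebral bodies
{vertebrae}
### Disc spaces
{discs}
## Impression
{impression}
"""

def build_target(report_fields: dict) -> str:
    """Render annotated report fields into the markdown template that
    serves as the supervised completion target during fine-tuning."""
    return TEMPLATE.format(**report_fields)

example = build_target({
    "alignment": "Mild dextroscoliosis of the lumbar spine.",
    "vertebrae": "No acute fracture or compression deformity.",
    "discs": "Disc space narrowing at L4-L5.",
    "impression": "Degenerative changes without acute abnormality.",
})
print(example)
```

Constraining generation to a fixed skeleton like this shrinks the output space, which is the regularizing effect the abstract attributes to MGSL under limited training data.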