Benchmarking large language models in breast cancer care: agreement with radiology-led multidisciplinary tumor board decisions.
Authors
Affiliations (2)
- Department of Radiology, Başakşehir Çam and Sakura City Hospital, Istanbul, Turkey.
- Department of Radiology, Başakşehir Çam and Sakura City Hospital, Istanbul, Turkey. [email protected].
Abstract
Multidisciplinary tumor boards (MDTBs) play a central role in breast cancer management by integrating imaging findings with clinical and pathological information to guide treatment decisions. The increasing integration of artificial intelligence into clinical workflows has raised interest in the potential role of large language models (LLMs) as supportive tools in oncologic decision-making. The aim of this study was to evaluate the concordance between treatment recommendations generated by LLMs and decisions made by a radiology-led MDTB in newly diagnosed breast cancer patients, and to identify clinical contexts in which LLM-based recommendations are most reliable. This retrospective study included 286 breast cancer cases reviewed by an institutional MDTB. Standardized clinical and radiological case summaries were provided to three contemporary state-of-the-art LLMs, ChatGPT-4o (OpenAI), Claude 3.7 Sonnet (Anthropic), and Gemini 2.5 Pro (Google DeepMind), using a guideline-referenced prompt aligned with ASCO, ESMO, and NCCN recommendations. MDTB consensus decisions served as the institutional benchmark comparator. Model performance was evaluated using concordance, Cohen's kappa, precision, recall, and F1 scores across treatment categories, disease stages, and molecular subtypes. Subgroup analyses were performed to delineate contexts of consistent model agreement and scenarios requiring more nuanced clinical reasoning. ChatGPT-4o demonstrated the highest overall concordance with MDTB decisions (83.2%), followed by Claude 3.7 Sonnet (79.7%) and Gemini 2.5 Pro (79.4%). Agreement exceeded 90% in HER2-enriched and triple-negative breast cancer, whereas Luminal A tumors showed the lowest concordance (~66%). F1 scores were highest for adjuvant systemic therapy (100) and neoadjuvant chemotherapy (≥91). Performance declined substantially for surgical decisions, including mastectomy (<58) and axillary lymph node dissection (≤23.5).
Stage-based analyses showed heterogeneous concordance patterns, with high agreement in several stage III-IV subgroups and lower agreement in scenarios requiring more complex multimodal or individualized treatment decisions. LLMs demonstrated substantial agreement with MDTB-aligned treatment recommendations in structured, guideline-based breast cancer settings, but performance declined when decisions required individualized clinical judgment, complex multimodal trade-offs, or clinically nuanced interpretation of available findings. These findings support further evaluation of LLMs as decision-support tools in straightforward cases, whereas complex surgical or multimodal treatment planning should remain under expert multidisciplinary oversight.
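The agreement metrics named in the abstract (concordance, Cohen's kappa, and per-category precision, recall, and F1) can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names, the treatment labels, and the six-case toy data are hypothetical, and the MDTB decision is treated as the reference label only by assumption.

```python
from collections import Counter


def concordance(ref, pred):
    """Fraction of cases where the model label equals the reference label."""
    return sum(r == p for r, p in zip(ref, pred)) / len(ref)


def cohens_kappa(ref, pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(ref)
    p_o = concordance(ref, pred)
    ref_counts, pred_counts = Counter(ref), Counter(pred)
    labels = set(ref) | set(pred)
    # Chance agreement from the marginal label frequencies of both raters.
    p_e = sum(ref_counts[c] * pred_counts[c] for c in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters used a single identical label
    return (p_o - p_e) / (1.0 - p_e)


def precision_recall_f1(ref, pred, label):
    """One-vs-rest precision, recall, and F1 for a single treatment category."""
    pairs = list(zip(ref, pred))
    tp = sum(r == label and p == label for r, p in pairs)
    fp = sum(r != label and p == label for r, p in pairs)
    fn = sum(r == label and p != label for r, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical toy data: board decisions vs. one model's recommendations.
    mdtb = ["NAC", "NAC", "surgery", "surgery", "adjuvant", "adjuvant"]
    model = ["NAC", "NAC", "surgery", "NAC", "adjuvant", "adjuvant"]
    print("concordance:", concordance(mdtb, model))
    print("kappa:", cohens_kappa(mdtb, model))
    print("surgery P/R/F1:", precision_recall_f1(mdtb, model, "surgery"))
```

In this toy example a single disagreement on a surgical case leaves overall concordance high while the F1 score for the surgery category drops sharply, which mirrors in miniature why category-level metrics are reported alongside overall agreement.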