Performance of a self-attention-based model in the task of differentiating clear cell renal cell carcinoma from other renal tumors: variable Vision Transformer (vViT).

May 20, 2026

DOI: 10.1093/bjr/tqag118 PMID: 42162958

Authors

Usuzaki T, Takahashi K, Takagi H, Ishikuro M, Obara T, Inamori R, Kamada H, Sato T, Oguro S, Takase K

Affiliations (6)

Department of Diagnostic Radiology, Tohoku University Hospital, Sendai, Japan.
Department of Clinical Imaging, Graduate School of Medicine, Tohoku University, Sendai, Miyagi, Japan.
Department of Advanced MRI Collaborative Research, Graduate School of Medicine, Tohoku University, Sendai, Japan.
Division of Molecular Epidemiology, Graduate School of Medicine, Tohoku University, Sendai, Miyagi, Japan.
Division of Molecular Epidemiology, Department of Preventive Medicine and Epidemiology, Tohoku Medical Megabank Organization, Tohoku University, Sendai, Japan.
Department of Pharmaceutical Sciences, Tohoku University Hospital, Sendai, Japan.

Abstract

This study examined the performance of the variable Vision Transformer (vViT) in comparison with that of convolutional neural networks (CNNs) in differentiating clear cell renal cell carcinoma (ccRCC) from non-ccRCC using computed tomography (CT) images.

The vViT was designed to take three inputs: patient characteristics, radiomic features extracted from arterial-phase CT images, and the arterial-phase CT images themselves. The training and test datasets were both constructed from the training set of the 2019 Kidney and Kidney Tumour Segmentation Challenge (C4KC-KiTS) dataset. The training dataset contained 153 patients with 1,636 images (818 ccRCC, 818 non-ccRCC), and the test dataset contained 39 patients with 402 images (201 ccRCC, 201 non-ccRCC). After training, metrics including accuracy and the area under the receiver operating characteristic curve (AUC-ROC) were calculated on the test dataset for vViT, Residual Network (ResNet), AlexNet, and GoogleNet in a patient-based approach, using CT images containing kidney and tumour. The AUC-ROC of vViT was compared with those of the other models using the DeLong test.

vViT, ResNet, AlexNet, and GoogleNet achieved accuracies of 0.82 (95% confidence interval, 0.72-0.86), 0.66 (0.56-0.73), 0.54 (0.45-0.62), and 0.61 (0.52-0.69), respectively. The corresponding AUC-ROC values were 0.91 (0.81-0.99), 0.61 (0.41-0.82), 0.53 (0.34-0.72), and 0.66 (0.46-0.85). The AUC-ROC of vViT was higher than those of ResNet, AlexNet, and GoogleNet (p < 0.05).

vViT achieved performance comparable to that of CNNs using CT images in differentiating ccRCC from non-ccRCC.

We developed a new deep learning model, termed vViT, that simultaneously analyzes non-image information and medical images. vViT can evaluate associations between input factors and outcome.
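The patient-based evaluation described in the abstract can be illustrated with a minimal sketch. The abstract does not say how image-level outputs were combined into a patient-level score, so the mean-per-patient rule below is an assumption, as are all function names, the 0.5 threshold, and the toy data. AUC-ROC is computed here from its rank interpretation (the probability that a positive case scores higher than a negative one); confidence intervals and the DeLong test are not shown.

```python
from collections import defaultdict


def patient_level_scores(patient_ids, image_scores):
    """Average image-level scores per patient (assumed aggregation rule)."""
    grouped = defaultdict(list)
    for pid, score in zip(patient_ids, image_scores):
        grouped[pid].append(score)
    return {pid: sum(vals) / len(vals) for pid, vals in grouped.items()}


def auc_roc(labels, scores):
    """AUC as P(positive score > negative score); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of cases whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy data (hypothetical): two images per patient, label 1 = ccRCC.
patient_ids = ["A", "A", "B", "B", "C", "C", "D", "D"]
image_scores = [0.9, 0.7, 0.2, 0.4, 0.6, 0.8, 0.85, 0.65]
patient_labels = {"A": 1, "B": 0, "C": 1, "D": 0}

per_patient = patient_level_scores(patient_ids, image_scores)
ids = sorted(per_patient)
y = [patient_labels[p] for p in ids]
s = [per_patient[p] for p in ids]

print("patient-level AUC-ROC:", auc_roc(y, s))
print("patient-level accuracy:", accuracy(y, s))
```

Aggregating before scoring means each patient contributes one data point, so patients with many images do not dominate the metric. Other aggregation rules, such as majority vote over images, would be equally plausible readings of "patient-based".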

Topics

Journal Article
