Performance of a self-attention-based model in the task of differentiating clear cell renal cell carcinoma from other renal tumors: variable Vision Transformer (vViT).

May 20, 2026 (PubMed)

Authors

Usuzaki T, Takahashi K, Takagi H, Ishikuro M, Obara T, Inamori R, Kamada H, Sato T, Oguro S, Takase K

Affiliations (6)

  • Department of Diagnostic Radiology, Tohoku University Hospital, Sendai, Japan.
  • Department of Clinical Imaging, Graduate School of Medicine, Tohoku University, Sendai, Miyagi, Japan.
  • Department of Advanced MRI Collaborative Research, Graduate School of Medicine, Tohoku University, Sendai, Japan.
  • Division of Molecular Epidemiology, Graduate School of Medicine, Tohoku University, Sendai, Miyagi, Japan.
  • Division of Molecular Epidemiology, Department of Preventive Medicine and Epidemiology, Tohoku Medical Megabank Organization, Tohoku University, Sendai, Japan.
  • Department of Pharmaceutical Sciences, Tohoku University Hospital, Sendai, Japan.

Abstract

This study examined the performance of the variable Vision Transformer (vViT) in comparison with that of convolutional neural networks (CNNs) in differentiating clear cell renal cell carcinoma (ccRCC) from non-ccRCC using computed tomography (CT) images.

The vViT was designed to take three inputs: patient characteristics, radiomic features extracted from arterial-phase CT images, and the arterial-phase CT images themselves. The training and test datasets were constructed from the training set of the 2019 Kidney and Kidney Tumour Segmentation Challenge (C4KC-KiTS) dataset. The training dataset contained 153 patients with 1,636 images (818 ccRCC, 818 non-ccRCC), and the test dataset contained 39 patients with 402 images (201 ccRCC, 201 non-ccRCC). After training, metrics including accuracy and the area under the receiver operating characteristic curve (AUC-ROC) were calculated on the test dataset for vViT, Residual Network (ResNet), AlexNet, and GoogleNet in a patient-based approach, using CT images containing kidney and tumour. The AUC-ROC of vViT was compared with those of the other models using the DeLong test.

vViT, ResNet, AlexNet, and GoogleNet achieved accuracies of 0.82 (95% confidence interval, 0.72-0.86), 0.66 (0.56-0.73), 0.54 (0.45-0.62), and 0.61 (0.52-0.69), respectively. The corresponding AUC-ROC values were 0.91 (0.81-0.99), 0.61 (0.41-0.82), 0.53 (0.34-0.72), and 0.66 (0.46-0.85). The AUC-ROC of vViT was higher than those of ResNet, AlexNet, and GoogleNet (p < 0.05).

vViT achieved performance comparable to that of CNNs using CT images in differentiating ccRCC from non-ccRCC. We developed a new deep learning model, termed vViT, that simultaneously analyzes non-image information and medical images. vViT can evaluate associations between input factors and outcome.
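As a rough illustration of what a patient-based evaluation can look like, the sketch below pools image-level scores into one score per patient and then computes accuracy and AUC-ROC. The abstract does not state how image-level outputs were aggregated, so the mean-score rule, the 0.5 threshold, the function names, and the toy records are all assumptions made for illustration; this is not the authors' code.

```python
from collections import defaultdict


def patient_scores(image_records):
    """Average image-level scores per patient (assumed aggregation rule).

    image_records: iterable of (patient_id, label, score), where label is
    1 for ccRCC and 0 for non-ccRCC, and score is the model's ccRCC score.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    labels = {}
    for patient_id, label, score in image_records:
        sums[patient_id] += score
        counts[patient_id] += 1
        labels[patient_id] = label
    return {pid: (labels[pid], sums[pid] / counts[pid]) for pid in sums}


def auc_roc(labels, scores):
    """AUC-ROC as the probability that a positive case outranks a negative
    one, counting ties as 0.5 (the Mann-Whitney formulation)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one case of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of cases whose thresholded score matches the label."""
    correct = sum((s >= threshold) == (l == 1) for l, s in zip(labels, scores))
    return correct / len(labels)


# Hypothetical toy data: (patient_id, label, image-level score).
records = [
    ("p1", 1, 0.9), ("p1", 1, 0.7),
    ("p2", 1, 0.4), ("p2", 1, 0.5),
    ("p3", 0, 0.2), ("p3", 0, 0.6),
    ("p4", 0, 0.1),
]

per_patient = patient_scores(records)
y_true = [label for label, _ in per_patient.values()]
y_score = [score for _, score in per_patient.values()]

print("patient-level accuracy:", accuracy(y_true, y_score))
print("patient-level AUC-ROC:", auc_roc(y_true, y_score))
```

Confidence intervals and the DeLong comparison reported in the abstract are not covered here; they would need a dedicated statistical routine on top of these per-patient scores.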

Topics

Journal Article
