Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Medical Image Classification Through Win vs. Lose Model Comparisons
Authors
Affiliations
- University of the West of England Bristol, Gloucestershire, UK. [email protected].
Abstract
High-performing deep learning models such as ResNet, originally optimized for large-scale natural image datasets, often fail to generalize when applied directly to medical imaging tasks. This study investigates the limitations of "off-the-shelf" models in the context of skin lesion classification using the DermaMNIST dataset. Through a systematic evaluation of 35 architectural configurations across varying image resolutions and depths, the analysis reveals that mid-depth architectures (3-4 layers) and intermediate resolutions (128 × 128) achieve the best balance between accuracy and generalization. The top-performing RevNet-layer3 model (accuracy = 0.766) significantly outperforms deeper ResNet baselines (p < 0.05), demonstrating that increased depth or resolution does not guarantee improved performance in the medical domain. Beyond accuracy metrics, this study introduces a cross-architectural interpretability framework that compares Grad-CAM heatmaps between "Win" and "Lose" models, supported by quantitative perceptual metrics such as fractal dimension, entropy, and symmetry. The results show that fractal dimension consistently discriminates between effective and ineffective attention patterns, offering a more objective measure of model interpretability. These findings challenge the assumption of a universal top-performing model and highlight the need for domain-specific architectural probing and interpretability-driven evaluation. The work contributes a novel methodology for understanding performance-interpretability trade-offs in medical image classification, promoting more transparent and reliable deployment of deep learning systems in healthcare.
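The abstract names fractal dimension, entropy, and symmetry as perceptual metrics for comparing Grad-CAM heatmaps between "Win" and "Lose" models, but does not give the formulas. The sketch below shows one plausible way to compute such metrics with NumPy, assuming a box-counting fractal dimension of the binarized attention pattern, Shannon entropy of the intensity histogram, and a simple left-right symmetry score; the function names, the 0.5 binarization threshold, and the random placeholder heatmaps are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of perceptual metrics on a Grad-CAM heatmap (values assumed in [0, 1]).
# Assumptions: box-counting fractal dimension, Shannon entropy of the intensity
# histogram, and a left-right symmetry score. Names and thresholds are illustrative.
import numpy as np


def box_counting_fractal_dimension(heatmap: np.ndarray, threshold: float = 0.5) -> float:
    """Estimate the fractal dimension of the thresholded attention pattern."""
    binary = heatmap >= threshold * heatmap.max()
    side = min(binary.shape)
    sizes = 2 ** np.arange(1, int(np.log2(side)))  # box sizes: 2, 4, 8, ...
    counts = []
    for s in sizes:
        h, w = binary.shape
        trimmed = binary[: h - h % s, : w - w % s]
        blocks = trimmed.reshape(trimmed.shape[0] // s, s, trimmed.shape[1] // s, s)
        # Count boxes of side s that contain at least one "on" pixel.
        counts.append(np.count_nonzero(blocks.any(axis=(1, 3))))
    # Slope of log(count) vs. log(1/size) gives the box-counting dimension.
    slope, _ = np.polyfit(np.log(1.0 / sizes), np.log(np.maximum(counts, 1)), 1)
    return float(slope)


def shannon_entropy(heatmap: np.ndarray, bins: int = 64) -> float:
    """Entropy (in bits) of the heatmap intensity distribution."""
    hist, _ = np.histogram(heatmap, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def lr_symmetry(heatmap: np.ndarray) -> float:
    """1.0 = perfectly left-right symmetric attention; lower = more asymmetric."""
    diff = np.abs(heatmap - np.fliplr(heatmap))
    return float(1.0 - diff.mean() / max(heatmap.max(), 1e-8))


if __name__ == "__main__":
    # Placeholder heatmaps standing in for the "Win" and "Lose" models' Grad-CAM
    # outputs on one lesion image; in practice these would come from the CAMs.
    rng = np.random.default_rng(0)
    win_cam, lose_cam = rng.random((128, 128)), rng.random((128, 128))
    for name, cam in [("win", win_cam), ("lose", lose_cam)]:
        print(name,
              round(box_counting_fractal_dimension(cam), 3),
              round(shannon_entropy(cam), 3),
              round(lr_symmetry(cam), 3))
```

Under this reading, a "Win" model's heatmap that hugs the lesion boundary would tend to show a different fractal dimension and lower entropy than a diffuse "Lose" heatmap, which is the kind of contrast the abstract reports fractal dimension capturing most consistently.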