Beyond Diagnosis: Cross-Dataset Evaluation of Risk Factors for Thyroid Cancer Recurrence

Mehmet Ali DURSUN; Pınar ÖZEN KAVAS

doi:10.30855/ais.2025.08.01.03

Authors

Mehmet Ali DURSUN Kutahya Dumlupınar University
Pınar ÖZEN KAVAS

Keywords:

Thyroid malignancy, Recurrence risk, Computer-aided diagnosis, Demographic stratification

Abstract

This study comparatively evaluates machine learning algorithms for the classification of thyroid diseases. Using five datasets with differing statistical structures and class imbalances, it analyzes the performance of nine algorithms: CatBoost, XGBoost, LightGBM, Random Forest, Artificial Neural Network (ANN), KNN, SVM, Stacking, and GridSearch-Tuned Logistic Regression (gst-LR). Model performance was assessed not only on accuracy but also on F1-score, precision, recall, and specificity. Stratified K-Fold cross-validation was applied during model validation to preserve class representation in each fold and to improve generalizability. The findings show that boosting-based algorithms (particularly CatBoost, XGBoost, and LightGBM) delivered high and stable accuracy across several datasets. The Random Forest model stood out for its consistent performance even on imbalanced data, whereas the ANN model fluctuated notably depending on the structural properties of the dataset. Classical methods such as KNN and SVM achieved competitive results only when the data exhibited well-defined decision boundaries and were limited on more complex distributions. The systematic approach adopted in this study offers a multilayered classification framework that supports not only model comparison but also the evaluation of explainability, reproducibility, and contextual suitability. Overall, no single model dominates across all scenarios; classification success depends strongly on data characteristics such as class distribution, dimensionality, and feature separability. Random Forest and the boosting algorithms consistently performed well on both accuracy and F1-score, with accuracy exceeding 98% and F1-score exceeding 95% on certain datasets. These findings underscore the importance of context-aware model selection and reinforce the need for multi-metric evaluation in real-world clinical decision support applications.
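The multi-metric evaluation and stratified splitting described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `binary_metrics` and `stratified_folds`, the toy labels, and the round-robin fold assignment are all assumptions made for illustration, and the sketch uses only the Python standard library rather than any particular machine learning package. It shows why accuracy alone can mislead on imbalanced data and how stratification keeps the class ratio equal across folds.

```python
from collections import defaultdict
import random


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, specificity and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Return 0.0 when a metric is undefined (empty denominator).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


def stratified_folds(labels, k=5, seed=0):
    """Return k lists of indices, spreading each class round-robin over the folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical imbalanced case: 18 negatives, 2 positives,
# and a degenerate model that always predicts the negative class.
y_true = [0] * 18 + [1] * 2
y_pred = [0] * 20
scores = binary_metrics(y_true, y_pred)

# Hypothetical dataset with a 90/10 class split, divided into 5 stratified folds.
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, k=5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

In the toy case the degenerate model reaches high accuracy and perfect specificity while its recall and F1-score are zero, which is the kind of failure that a single-metric comparison would hide. Each of the five folds also receives the same number of positive samples, so every validation split reflects the overall class distribution.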