Beyond Diagnosis: Cross-Dataset Evaluation of Risk Factors for Thyroid Cancer Recurrence
DOI:
https://doi.org/10.30855/ais.2025.08.01.03
Keywords:
Thyroid malignancy, Recurrence risk, Computer-aided diagnosis, Demographic stratification
Abstract
This study comparatively evaluates machine learning algorithms developed for the classification of thyroid diseases. Using five datasets with differing statistical structures and class imbalances, it analyzes the performance of nine algorithms: CatBoost, XGBoost, LightGBM, Random Forest, Artificial Neural Network (ANN), KNN, SVM, Stacking, and GridSearch-Tuned Logistic Regression (gst-LR). Model performance was assessed not only on accuracy but also on F1-score, precision, recall, and specificity. Stratified K-Fold cross-validation was applied during model validation to preserve class representation in each fold and to improve generalizability. The findings show that boosting-based algorithms (particularly CatBoost, XGBoost, and LightGBM) delivered high and stable accuracy across several datasets. The Random Forest model performed consistently even on imbalanced data, whereas the performance of the ANN model fluctuated notably with the structural properties of the dataset. Classical methods such as KNN and SVM achieved competitive results only when the data exhibited well-defined decision boundaries, and showed limitations on more complex distributions. The systematic approach adopted in this study provides a multilayered classification framework that supports not only model comparison but also the evaluation of explainability, reproducibility, and contextual suitability. Overall, the results indicate that no single model dominates across all scenarios; rather, classification success depends strongly on data characteristics such as class distribution, dimensionality, and feature separability. Random Forest and the boosting algorithms consistently performed well in terms of both accuracy and F1-score, with scores exceeding 98% and 95%, respectively, on certain datasets.
These findings underscore the importance of context-aware model selection and reinforce the need for multi-metric evaluations in real-world clinical decision support applications.
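As an illustration of why the abstract pairs stratified splitting with multi-metric evaluation, the following is a minimal sketch in plain Python. It is not the authors' pipeline: the function names `binary_metrics` and `stratified_folds`, the round-robin fold assignment, and the toy labels are all assumptions made for this example, and the study's datasets and models are not reproduced here.

```python
from collections import defaultdict


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, specificity and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Report 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


def stratified_folds(labels, k):
    """Assign sample indices to k folds, round-robin within each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Toy, made-up labels (hypothetical): 8 negatives and 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # a classifier that always predicts 0

metrics = binary_metrics(y_true, y_pred)
folds = stratified_folds(y_true, k=2)
```

On these toy labels the always-negative classifier reaches 80% accuracy while its recall and F1-score are both zero, which is the kind of gap that accuracy alone hides on imbalanced data. The stratified assignment places one of the two positive samples in each fold, so every validation split still contains the minority class.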
License
Copyright (c) 2025 Artificial Intelligence Studies

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Artificial Intelligence Studies (AIS) publishes open access articles under a Creative Commons Attribution 4.0 International License (CC BY). This license permits users to freely share (copy, distribute, and transmit) and adapt the contribution, including for commercial purposes, as long as the author is properly attributed.
For all licenses mentioned above, authors can retain copyright and all publication rights without restriction.