Machine learning models have been extensively applied across various domains, often without a deep understanding of their underlying mechanisms. Black-box models, such as Deep Neural Networks, present significant challenges in counterfactual analysis, interpretability, and explainability.
In this presentation, we introduce a novel binary classification model, the Connected Cloud of Spheres, formulated as a Mixed-Integer Nonlinear Programming (MINLP) problem. The method is particularly effective for datasets with highly non-linear and non-convex structures, yet it remains applicable to linearly separable cases. Unlike neural networks, our approach operates directly in the original feature space, with no need for kernel functions or extensive hyperparameter tuning.
Although designed primarily for binary classification, the method can be extended to anomaly detection, particularly in scenarios where negative examples are unavailable at the outset. We also discuss heuristic strategies for outlier identification and explainability, and show how the approach improves model transparency and interpretability.
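To make the geometric idea concrete, the following is a minimal sketch of how a classifier built from spheres might label new points. It rests on an assumption not stated in the abstract: that the positive class is covered by a union of spheres in the original feature space, and that a point is labelled positive when at least one sphere contains it. The centers and radii here are hand-picked for illustration; in the actual model they would presumably come from solving the MINLP. The function names `inside_any_sphere` and `predict` are hypothetical.

```python
import math

def inside_any_sphere(point, spheres):
    """Return True if `point` lies inside at least one (center, radius) sphere."""
    return any(math.dist(point, center) <= radius for center, radius in spheres)

def predict(points, spheres):
    """Label a point 1 (positive) if some sphere covers it, else 0 (assumed rule)."""
    return [1 if inside_any_sphere(p, spheres) else 0 for p in points]

# Hypothetical spheres: two overlapping balls whose union is a non-convex region.
spheres = [((0.0, 0.0), 1.0), ((1.5, 0.0), 1.0)]
labels = predict([(0.2, 0.1), (1.4, 0.5), (3.0, 3.0)], spheres)
```

Under this assumed rule, a prediction can be explained by naming the sphere that covers the point, which is one way the transparency claimed above could arise.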
This seminar compares traditional time series forecasting techniques, such as Holt, Holt-Winters, and ARIMA, with modern machine learning and deep learning approaches, including XGBoost, Random Forest, CNNs, RNNs, LSTMs, and Transformers. As a practical application, we explore a case study on predicting monthly sales of a strategic health insurance product. The study evaluates how incorporating exogenous macroeconomic indicators, including consumer price index variation and unemployment rates, affects model accuracy. Particular emphasis is placed on identifying and correcting outliers and on assessing their influence on both classical and machine learning models. Results show that the ARIMAX model applied to an outlier-adjusted dataset achieved the best forecasting performance, demonstrating the value of integrating external variables. While machine learning models performed competitively, their accuracy was notably lower when outliers remained untreated. Overall, the findings highlight the complementary strengths of classical statistical models and modern learning algorithms, and encourage a hybrid approach to improve the robustness and reliability of sales forecasting in the insurance sector.
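The abstract does not say which outlier-correction procedure was used. As one possible illustration, the sketch below applies a Hampel-style filter: each observation is compared with the median of its neighbours and replaced by that median when it deviates by more than a multiple of the scaled median absolute deviation. The function name `correct_outliers`, the `window` and `threshold` defaults, and the sales figures are all hypothetical.

```python
from statistics import median

def correct_outliers(series, window=3, threshold=3.0):
    """Replace points far from their local median with that median (Hampel-style)."""
    cleaned = list(series)
    for i in range(len(series)):
        lo, hi = max(0, i - window), min(len(series), i + window + 1)
        neighbours = series[lo:hi]
        local_median = median(neighbours)
        # Median absolute deviation, scaled to be comparable to a standard deviation.
        mad = 1.4826 * median(abs(x - local_median) for x in neighbours)
        if mad > 0 and abs(series[i] - local_median) > threshold * mad:
            cleaned[i] = local_median
    return cleaned

# Hypothetical monthly sales with a single spike at index 5.
sales = [100, 102, 98, 101, 99, 250, 103, 97, 100, 102]
adjusted = correct_outliers(sales)
```

In this toy series only the spike at index 5 is replaced (by the local median, 100); the remaining values pass through unchanged. An adjusted series of this kind could then be passed to any of the forecasting models being compared.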
The analysis of real data is often vulnerable to violations of the underlying model assumptions, a problem aggravated by data errors and outliers. In linear regression, even a single outlier can disrupt the normality assumption, compromising parameter estimation and the inference that depends on it. Machine learning methods, including Random Forests (RF), are not immune to data contamination, and the literature has recognized the need for robust statistical techniques to address it, particularly in high-dimensional data analysis, which includes variable selection and prediction. Although contamination can occur in both the response (output) and the covariates (features), this work focuses primarily on the former. We assess the predictive performance of the classical RF method through simulations on a synthetic animal dataset from the literature, into which we introduce several contamination levels involving different types of outliers.
In parallel, we evaluate several robust strategies proposed in the literature, incorporating them into the RF framework to assess whether, and to what extent, they enhance its stability and predictive accuracy under contamination. The aim of this study is to clarify the practical potential of such robust adaptations as complementary tools to the classical RF algorithm in routine genomic prediction workflows, as well as to identify a simple and practical robust strategy that delivers improved prediction in the presence of outliers.
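The abstract does not name the robust strategies it evaluates. Purely as an illustration of why response contamination matters, the sketch below contaminates a set of simulated responses with shift outliers and compares two ways of summarizing them: the mean, which classical regression forests use when averaging responses, and the median, used here as an assumed example of a simple robust alternative. The helper `contaminate_response`, the contamination rate, the shift size, and the simulated values are hypothetical.

```python
import random
from statistics import mean, median

def contaminate_response(y, rate, shift, rng):
    """Return a copy of `y` with a fraction `rate` of values shifted by `shift`."""
    y_out = list(y)
    n_outliers = int(rate * len(y))
    for i in rng.sample(range(len(y)), n_outliers):
        y_out[i] += shift
    return y_out

rng = random.Random(0)
true_value = 10.0
# Hypothetical responses sharing one leaf: true value plus Gaussian noise.
y_clean = [true_value + rng.gauss(0, 1) for _ in range(200)]
y_dirty = contaminate_response(y_clean, rate=0.10, shift=50.0, rng=rng)

mean_error = abs(mean(y_dirty) - true_value)      # mean is pulled toward the outliers
median_error = abs(median(y_dirty) - true_value)  # median moves far less
```

With 10% of the responses shifted by 50 units, the mean moves by roughly 5 units while the median stays close to the true value, which is the kind of stability gain the robust adaptations are meant to deliver.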