Seminars

Date: TBA

Title: Integrating Statistical and Machine Learning Models for Enhanced Time Series Forecasting

Speaker: Jorge Caiado, ISEG Lisbon School of Economics and Management, Universidade de Lisboa, Lisbon, Portugal

Abstract:

This seminar presents a comparative analysis of traditional time series forecasting techniques, such as Holt, Holt-Winters, and ARIMA, with modern machine learning and deep learning approaches, including XGBoost, Random Forest, CNNs, RNNs, LSTMs, and Transformers. As a practical application, we explore a case study focused on predicting monthly sales of a strategic health insurance product. The study evaluates the impact of incorporating exogenous macroeconomic indicators, including consumer price index variation and unemployment rates, on model accuracy. Particular emphasis is placed on identifying and correcting outliers, assessing their influence on both classical and machine learning models. Results show that the ARIMAX model applied to an outlier-adjusted dataset achieved the best forecasting performance, demonstrating the value of integrating external variables. While machine learning models performed competitively, their accuracy was notably lower when outliers remained untreated. Overall, the findings highlight the complementary strengths of classical statistical models and modern learning algorithms, encouraging a hybrid approach to enhance the robustness and reliability of sales forecasting in the insurance sector.

Date: TBA

Title: Random Forests in Genomic Prediction & Selection: Challenges and Paths Toward Robustness

Speaker: Vanda Lourenço, NOVA FCT, Portugal

Abstract:

The analysis of real data is often vulnerable to violations of the underlying model assumptions, a problem that data misspecifications such as errors or outliers can make considerably worse. In linear regression, even a single outlier can violate the normality assumption, compromising parameter estimation and, in turn, the inferential results that depend on it. Machine learning methods, including Random Forests (RF), are not immune to data contamination, and the existing literature has recognized the need for robust statistical techniques to address this issue, particularly in high-dimensional data analysis, which includes variable selection and prediction. While data contamination can occur at both the response (output) and covariate (feature) levels, this work focuses primarily on the former. We assess the predictive performance of the classical RF method through simulations using a synthetic animal dataset from the literature, into which we introduce several contamination levels involving different types of outliers.

In parallel, we evaluate several robust strategies proposed in the literature, incorporating them into the RF framework to assess whether, and to what extent, they enhance its stability and predictive accuracy under contamination. The aim of this study is to clarify the practical potential of such robust adaptations as complementary tools to the classical RF algorithm in routine genomic prediction workflows, as well as to identify a simple and practical robust strategy that delivers improved prediction in the presence of outliers.