Seminars

15 June, 14:15, Library Auditorium

Title: A Connected Cloud of Spheres Classification Method

Speaker: Paula Amaral, NOVA FCT, Portugal

Abstract:

Machine learning models have been extensively applied across various domains, often without a deep understanding of their underlying mechanisms. Black-box models, such as Deep Neural Networks, present significant challenges in counterfactual analysis, interpretability, and explainability. 

In this presentation, we introduce a novel binary classification model called the Connected Cloud of Spheres. This model is formulated as a Mixed-Integer Nonlinear Programming (MINLP) problem. The method is particularly effective for datasets with highly non-linear and non-convex structures while remaining adaptable to linearly separable cases. Unlike neural networks, our approach operates directly in the original feature space, eliminating the need for kernel functions or extensive hyperparameter tuning. 

Although primarily designed for binary classification, this method can be extended for anomaly detection, particularly in scenarios where negative examples are unavailable at the outset. Additionally, we discuss heuristic strategies for outlier identification and explainability, offering insights into how this approach enhances model transparency and interpretability.
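
For readers unfamiliar with sphere-based classifiers, the prediction step can be pictured with a minimal sketch. It is not the speaker's formulation: the sphere centers and radii below are made-up values (in the talk they would come from solving the MINLP), and reading "connected" as "the overlap graph of the spheres is connected" is an assumption made here for illustration.

```python
from math import dist

# Hypothetical spheres as (center, radius) pairs; illustrative values only.
spheres = [((0.0, 0.0), 1.0), ((1.5, 0.0), 1.0)]


def predict(point, spheres):
    """Return 1 if the point lies inside at least one sphere, else 0."""
    return int(any(dist(point, center) <= radius for center, radius in spheres))


def is_connected(spheres):
    """Check whether the spheres form a single connected cloud.

    Assumption: two spheres are linked when they touch or overlap, i.e. the
    distance between centers is at most the sum of the radii.
    """
    if not spheres:
        return True
    seen = {0}
    stack = [0]
    while stack:
        i = stack.pop()
        center_i, radius_i = spheres[i]
        for j, (center_j, radius_j) in enumerate(spheres):
            if j not in seen and dist(center_i, center_j) <= radius_i + radius_j:
                seen.add(j)
                stack.append(j)
    return len(seen) == len(spheres)
```

Because membership is decided by distances in the original feature space, each positive prediction can be traced back to a specific sphere, which is one way such a model lends itself to explanation.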


18 June, 10:00, Library Auditorium

Title: Integrating Statistical and Machine Learning Models for Enhanced Time Series Forecasting

Speaker: Jorge Caiado, ISEG Lisbon School of Economics and Management, Universidade de Lisboa, Lisbon, Portugal

Abstract:

This seminar presents a comparative analysis of traditional time series forecasting techniques, such as Holt, Holt-Winters, and ARIMA, with modern machine learning and deep learning approaches, including XGBoost, Random Forest, CNNs, RNNs, LSTMs, and Transformers. As a practical application, we explore a case study focused on predicting monthly sales of a strategic health insurance product. The study evaluates the impact of incorporating exogenous macroeconomic indicators, including consumer price index variation and unemployment rates, on model accuracy. Particular emphasis is placed on identifying and correcting outliers, assessing their influence on both classical and machine learning models. Results show that the ARIMAX model applied to an outlier-adjusted dataset achieved the best forecasting performance, demonstrating the value of integrating external variables. While machine learning models performed competitively, their accuracy was notably lower when outliers remained untreated. Overall, the findings highlight the complementary strengths of classical statistical models and modern learning algorithms, encouraging a hybrid approach to enhance the robustness and reliability of sales forecasting in the insurance sector.
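
As background on the classical side of this comparison, Holt's linear trend method can be sketched in a few lines. This is a generic textbook version, not the speaker's code: the function name `holt_forecast`, the initialization from the first two observations, and the smoothing parameters are assumptions chosen for illustration, and the input series is toy data rather than the insurance sales data.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Forecast `horizon` steps ahead with Holt's linear trend method.

    alpha smooths the level and beta smooths the trend (both in (0, 1]).
    Assumption: level and trend are initialized from the first two points.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        # Blend the new observation with the one-step-ahead prediction.
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        # Blend the latest change in level with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]


# Toy series with a constant increase of 2 per period.
forecast = holt_forecast([10, 12, 14, 16, 18], alpha=0.5, beta=0.5, horizon=3)
```

A single large outlier in `series` shifts both the level and the trend for several periods, which gives some intuition for why outlier adjustment matters for this family of models.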

  

18 June, 11:00, Library Auditorium

Title: Random Forests in Genomic Prediction & Selection: Challenges and Paths Toward Robustness

Speaker: Vanda Lourenço, NOVA FCT, Portugal

Abstract:

The analysis of real data is often vulnerable to violations of the underlying model assumptions, a problem that is exacerbated by data misspecifications such as errors or outliers. In the context of linear regression, the presence of even a single outlier can disrupt the normality assumption, compromising parameter estimation and, in turn, the inferential results that depend on it. Machine learning methods, including Random Forests (RF), are not immune to data contamination, and the existing literature has recognized the need for robust statistical techniques to address this issue, particularly in high-dimensional data analysis, which includes variable selection and prediction. While data contamination can occur at both the response (output) and covariate (feature) levels, this work primarily focuses on the former. We assess the predictive performance of the classical RF method through simulations using a synthetic animal dataset from the literature, to which we introduce several contamination levels involving different types of outliers.

In parallel, we evaluate several robust strategies proposed in the literature, incorporating them into the RF framework to assess whether, and to what extent, they enhance its stability and predictive accuracy under contamination. The aim of this study is to clarify the practical potential of such robust adaptations as complementary tools to the classical RF algorithm in routine genomic prediction workflows, as well as to identify a simple and practical robust strategy that delivers improved prediction in the presence of outliers.
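
To make the idea of response-level contamination concrete, the sketch below injects shift outliers into a chosen fraction of a response vector. It is an illustration only, not the simulation design used in the study: the function name `contaminate_response`, the shift of 5 standard deviations, and the toy response values are all assumptions.

```python
import random
import statistics


def contaminate_response(y, level, shift_sd=5.0, seed=0):
    """Return a copy of y with a fraction `level` of entries shifted upward.

    Assumption: an outlier is created by adding shift_sd standard deviations
    of y to the original value. Covariates are left untouched, so the
    contamination affects the response only.
    """
    rng = random.Random(seed)
    n_outliers = round(level * len(y))
    outlier_idx = sorted(rng.sample(range(len(y)), n_outliers))
    sd = statistics.pstdev(y)
    y_contaminated = list(y)
    for i in outlier_idx:
        y_contaminated[i] = y[i] + shift_sd * sd
    return y_contaminated, outlier_idx


# Toy response vector with 10% contamination.
y_clean = [float(i) for i in range(20)]
y_dirty, idx = contaminate_response(y_clean, level=0.10)
```

Fitting the same model to `y_clean` and `y_dirty` and comparing prediction error on clean test data is the usual way to measure how much a method degrades as the contamination level grows.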


22 June 2026, 14:00, Library Auditorium (online)

Title: Large Language Models and Data Economy

Speaker: Anna Rogers, IT University of Copenhagen

Abstract:

In the 'bitter lesson' paradigm of NLP technology development, what matters is computation, and the data comes free. This assumption has consequences for the creator economy and, in the long run, for NLP technology itself. In a domain where further improvements depend on high-quality data contributions, learning approaches need to focus on data attribution as a way to provide social and economic incentives to creators.


22 June 2026, 15:00, Library Auditorium

Title: Transformer-based CoVaR: Systemic Risk in Textual Information

Speaker: Weining Wang, University of Bristol

Abstract:

Conditional Value-at-Risk (CoVaR) quantifies systemic financial risk by measuring the loss quantile of one asset, conditional on another asset experiencing distress. We develop a Transformer-based methodology that integrates financial news articles directly with market data to improve CoVaR estimates. Unlike approaches that use predefined sentiment scores, our method incorporates raw text embeddings generated by a large language model (LLM). We prove explicit error bounds for our Transformer CoVaR estimator, showing that accurate CoVaR learning is possible even with small datasets. Using U.S. market returns and Reuters news items from 2006 to 2013, our out-of-sample results show that textual information impacts the CoVaR forecasts. Alongside better predictive performance, the Transformer-based CoVaR shows a pronounced negative dip during market stress periods across several equity assets when compared with both the CoVaR without text and the CoVaR using traditional sentiment measures. Our results show that textual data can be used to effectively model systemic risk without requiring prohibitively large datasets.
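
The CoVaR definition in the first sentence can be illustrated with a simple empirical estimate. This is a plain sample-quantile sketch, not the Transformer-based estimator from the talk: it assumes that "distress" means asset i's return is at or below its own q-level VaR, and the function names and toy return series are hypothetical.

```python
import math


def empirical_quantile(values, q):
    """Lower empirical q-quantile: the smallest sample value v such that
    at least a fraction q of the sample is <= v."""
    ordered = sorted(values)
    k = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[k]


def empirical_covar(returns_j, returns_i, q):
    """q-quantile of asset j's returns on days when asset i is in distress.

    Assumption: asset i is in distress when its return is at or below its
    own empirical q-level VaR.
    """
    var_i = empirical_quantile(returns_i, q)
    distressed = [rj for rj, ri in zip(returns_j, returns_i) if ri <= var_i]
    return empirical_quantile(distressed, q)


# Toy daily returns (in percent); asset j does badly when asset i does.
returns_i = [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4]
returns_j = [-10, -6, 1, 1, 1, 1, 1, 1, 1, 1]

var_j = empirical_quantile(returns_j, 0.2)            # unconditional quantile
covar_j = empirical_covar(returns_j, returns_i, 0.2)  # quantile given distress
```

In this toy example the conditional quantile `covar_j` is lower than the unconditional quantile `var_j`, which is the kind of gap that CoVaR-based measures use as an indicator of systemic risk.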