R. Alvarez Lucendo

2 records found

What controls memorization rate? From entropy to conditional entropy or conditioning structure

Master thesis (2026) - R. Alvarez Lucendo, Kubilay Atasu, J.C. van Gemert, Jérémie Decouchant, Madhur Panwar
Large language models (LLMs) can reproduce passages from their training data verbatim, raising privacy and copyright concerns. Prior work attributes memorization to factors such as model size, sequence entropy, context length, and repetition, but these findings lack a unified explanation. This thesis proposes a disambiguation complexity framework: memorization speed is governed not by the information content of a sequence, but by the difficulty of identifying it, specifically by the complexity of the minimal conditioning structure the model must extract from context to uniquely determine the correct continuation.

We demonstrate a counterintuitive regime in which random token sequences are memorized faster than structured natural language, contradicting standard explanations. We formalize a hierarchy of conditioning levels and introduce K-arity, a scalar complexity measure counting the number of prefix tokens jointly required to make a continuation deterministic. Through controlled experiments on synthetic datasets, we show that conditioning level and K-arity are predictive of memorization behavior. Attention analysis reveals that disambiguating cues are most clearly visible in early attention patterns. Natural language experiments show that, in text rich with redundant linguistic cues, isolated manipulations of conditioning complexity do not produce detectable differences, highlighting the gap between synthetic and naturalistic settings. This single principle connects input representation, entropy, identifying tokens, and context length within a common theoretical lens. ...
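As a rough illustration of the K-arity idea described above, the following minimal sketch finds the smallest number of prefix positions that jointly make the continuation deterministic on a toy dataset. The function name `k_arity`, the brute-force search over position subsets, and the fixed-length prefix assumption are illustrative choices made here; the thesis may define and compute the measure differently.

```python
from itertools import combinations

def k_arity(examples):
    """Smallest k such that some set of k prefix positions determines the continuation.

    `examples` is a list of (prefix, continuation) pairs with equal-length prefixes.
    Hypothetical helper for illustration; brute force, so only suitable for toy data.
    Returns (k, positions), or None if identical prefixes have different continuations.
    """
    n = len(examples[0][0])
    for k in range(n + 1):
        for positions in combinations(range(n), k):
            seen = {}
            consistent = True
            for prefix, continuation in examples:
                key = tuple(prefix[i] for i in positions)
                # The same key must always map to the same continuation.
                if seen.setdefault(key, continuation) != continuation:
                    consistent = False
                    break
            if consistent:
                return k, positions
    return None

# One identifying token is enough: position 0 decides the continuation.
single = [(("a", "x"), "1"), (("b", "x"), "2"), (("a", "y"), "1")]

# XOR-style data: neither position alone is enough, both are needed jointly.
xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print(k_arity(single))
print(k_arity(xor))
```

In this toy setting, the XOR-style dataset has a higher K-arity than the single-identifier dataset even though both have the same prefix length, which is the kind of distinction the abstract attributes to conditioning complexity rather than information content.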

Implementing a UNet Architecture to evaluate the differences between both settings

Forecasting algal blooms using remote sensing data is less labour-intensive and has better coverage in time and space than direct water sampling. The paper implements a deep learning technique, the UNet Architecture, to predict the chlorophyll concentration, which is a good indicator for algal bloom in the Rio Negro water reservoirs of Uruguay. The research question focuses on the differences between classification and regression in algal bloom forecasting. The experiments show that the regression implementation achieves better accuracy and lower mean squared error than the classification implementation that uses cross-entropy loss and four pre-fixed bins. Different loss functions that account for the class imbalance in the data do not improve the model’s performance. Finally, a quantile-based binning strategy that considers the data’s underlying distribution achieves the highest accuracy in both settings. ...
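To make the contrast between pre-fixed bins and quantile-based bins concrete, here is a minimal sketch using only the Python standard library. The synthetic skewed values, the equal-width edges `[37.0, 74.0, 111.0]`, and the helper names are assumptions for illustration; they are not the paper's data, bin edges, or code.

```python
import math
from bisect import bisect_right
from statistics import quantiles

def assign_bins(values, edges):
    """Map each value to a class index: the number of edges that are <= value."""
    return [bisect_right(edges, v) for v in values]

def bin_counts(labels, n_bins):
    """Count how many samples fall in each class."""
    counts = [0] * n_bins
    for label in labels:
        counts[label] += 1
    return counts

# Synthetic, right-skewed stand-in for chlorophyll concentrations (assumption).
values = [math.exp(5 * i / 99) for i in range(100)]

# Pre-fixed, equal-width edges (hypothetical values, chosen for this toy range).
fixed_edges = [37.0, 74.0, 111.0]
fixed_counts = bin_counts(assign_bins(values, fixed_edges), 4)

# Quantile-based edges: three cut points that split the data into four groups.
quantile_edges = quantiles(values, n=4)
quantile_counts = bin_counts(assign_bins(values, quantile_edges), 4)

print("fixed bins:   ", fixed_counts)
print("quantile bins:", quantile_counts)
```

On skewed data like this, fixed edges put most samples in the lowest class, while quantile edges follow the empirical distribution and give roughly balanced classes. This shows only the binning step, not the UNet training or evaluation.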