A.D. Manolache

6 records found

On Learning RNN Gates with RNNs

Gated recurrent neural networks are commonly explained by their ability to create additive copy paths through time, which can preserve information and gradients over long sequences. This explanation is correct, but incomplete: useful gate values must themselves be learned, and this gate-learning process is also performed through recurrent computation. We study this missing learning step with controlled sequence classification tasks. We show that gated architectures do not solve long-range dependencies by architecture alone: when all training samples require long-range memory from the start, gated and non-gated recurrent models both fail. However, when training also contains short-dependency samples in which the same label relation can be learned over shorter temporal gaps, gated models can first learn a selective update behavior and then apply it to long-range samples. Diagnostic probes show that during successful training, larger gradients reach early recurrent states, and state updates depend more clearly on the input. A multi-class extension further shows that the learned behavior transfers partially beyond the subset of informative inputs that receives short-dependency samples. Overall, our results suggest that gates help not because they automatically solve long-range dependencies, but because they provide a mechanism that can be learned once the data makes the gate-learning problem simple enough. This reframes gated recurrence from an automatic solution to a learnable scaffold, and suggests that training data should be designed to expose gate-learning signals before relying on gates for long-range memory. ...
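The additive copy path mentioned above can be illustrated with a minimal scalar sketch. It assumes a single update gate `z` with hand-set values; the function names and the distractor sequence are hypothetical and not taken from the paper, where gate values are learned by the network.

```python
def gated_step(h_prev, candidate, z):
    """One scalar gated update: z = 0 copies the old state, z = 1 overwrites it."""
    return (1.0 - z) * h_prev + z * candidate


def run(inputs, gates, h0=0.0):
    """Apply the gated update over a whole sequence and return the final state."""
    h = h0
    for x, z in zip(inputs, gates):
        h = gated_step(h, x, z)
    return h


# Hypothetical sequence: one informative input followed by 99 distractors.
inputs = [1.0] + [-0.5] * 99

# Selective gating: write once, then keep the gate closed.
selective = run(inputs, [1.0] + [0.0] * 99)

# Non-selective gating: the gate stays slightly open, so the memory leaks away.
leaky = run(inputs, [1.0] + [0.1] * 99)
```

With the gate closed the first input survives all 99 steps, while a gate that stays slightly open lets the state drift to the distractor value. The sketch only shows why a selective gate is useful; it says nothing about how such a gate is learned, which is the question the abstract addresses.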

Assessing the Data Efficiency of Masked Autoencoders in Resource-Constrained Environments

Visual foundation models based on Vision Transformers often depend on large datasets and substantial computational resources, limiting their accessibility for resource-constrained research settings. This paper investigates the data efficiency of Masked Autoencoders (MAE) by studying how pre-training dataset size and mask ratio affect downstream representation quality. An MAE model is pre-trained on nested subsets of the same dataset ranging from 1k to 100k images, using different mask ratios, and then evaluated on a different downstream task dataset. The results show that MAE learns transferable representations even from small unlabeled datasets, with downstream accuracy increasing steadily as more pre-training data is used. The experiments also show that the optimal masking difficulty depends on the data regime: lower masking improves validation accuracy for the smallest subsets, while the original 75% MAE mask ratio becomes stronger as the dataset size increases. These findings suggest that mask ratio should not be treated as a fixed default in MAE training. Instead, reducing the mask ratio can improve data efficiency when pre-training data is limited, while higher masking remains effective when more visual variation is available. ...
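As a rough illustration of what the mask ratio controls, the sketch below picks a random subset of patch indices to hide. It assumes uniform random masking over a flat list of patches; the function name, the seed, and the patch count of 196 are hypothetical choices, not details from the paper.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Split patch indices into visible and masked sets for a given mask ratio."""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    visible = [i for i in range(num_patches) if i not in masked]
    return visible, sorted(masked)


# Assumed example: 196 patches per image, compared at two mask ratios.
visible_75, masked_75 = random_patch_mask(196, 0.75)
visible_50, masked_50 = random_patch_mask(196, 0.50)
```

At a ratio of 0.75 the encoder sees 49 of the 196 patches, and at 0.50 it sees 98, so a lower ratio gives the encoder more context per image at the cost of an easier reconstruction task.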

Data-Efficiency of Self-Supervised Learning with DINO Multi-Crop

Self-supervised learning (SSL) lets computer vision models learn from unlabelled image datasets. Most DINO benchmarks pretrain on ImageNet — a million-image dataset that takes days of multi-GPU training per run, out of reach for the rapid iteration cycles smaller research groups rely on. This leaves practitioners with smaller datasets unsure whether DINO is worth running, or which of its design choices still hold at this scale.

We pretrain a small Vision Transformer (ViT-Tiny/8) using DINO on Tiny-ImageNet subsets from 1K to 100K images at 64x64 resolution, evaluated on downstream classification tasks. Downstream accuracy grows steadily with pretraining-set size and approaches the accuracy of a fully supervised baseline at the largest scale.

Our main contribution is a multi-crop ablation across data scale, training duration, and downstream task category. We find that multi-crop's benefit at sub-ImageNet scale is delayed rather than absent, and that the optimal multi-crop count depends on the downstream task category — no single setting wins across all tasks.

These findings show that the canonical DINO recipe does not transfer cleanly to sub-ImageNet scale. We recommend choosing the multi-crop count based on training budget and downstream task type, rather than copying the ImageNet default. ...
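To make the multi-crop count concrete, the sketch below enumerates teacher and student view pairs under the usual DINO arrangement, in which the teacher sees only global crops and the student sees every crop. The function name and the crop counts used in the example are hypothetical and are not the settings studied in the paper.

```python
def multicrop_pairs(num_global, num_local):
    """List (teacher_view, student_view) pairs, skipping pairs of the same view."""
    views = [f"global_{i}" for i in range(num_global)]
    views += [f"local_{i}" for i in range(num_local)]
    teacher_views = views[:num_global]  # the teacher only sees global crops
    return [(t, s) for t in teacher_views for s in views if s != t]


# Assumed settings: two global crops, with and without six local crops.
pairs_without_local = multicrop_pairs(2, 0)
pairs_with_local = multicrop_pairs(2, 6)
```

Each extra local crop adds one loss term per global crop, so the number of pairs, and with it the cost of a training step, grows linearly with the multi-crop count. This is one reason the count is worth tuning against the training budget.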

Data-Efficiency of Self-Supervised Learning with Momentum Contrast

Self-supervised contrastive learning is a popular way to pre-train vision foundation models. So far, it has mostly been studied with large pre-training datasets, and it is most accessible to organizations with massive computational resources. In this work we evaluate the data-efficiency of one such method, Momentum Contrast (MoCo), and investigate how to make it work better when less data is available. We pre-train a Vision Transformer with MoCo on subsets of Tiny-ImageNet ranging from 1,000 to 100,000 images, and evaluate the learned representations on a diverse set of downstream tasks using linear probing. We investigate how the training parameters of MoCo should be chosen for a given amount of data, how the downstream accuracy scales with the amount of pre-training data, and how this scaling differs across types of downstream tasks. We find that the best parameters depend on the amount of data: the optimal number of negatives used for the contrastive objective grows with the size of the dataset, while the momentum coefficient has no single best value. We also find that pre-training is beneficial even with very little data, the downstream accuracy grows approximately log-linearly with the size of the pre-training set, and the data-efficiency growth rate is larger for tasks that are similar to the pre-training data. ...
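The two training parameters discussed above, the number of negatives and the momentum coefficient, can be sketched in a few lines. This is a simplified illustration that assumes parameters stored as plain lists of floats and a queue of placeholder strings; the names and values are hypothetical and not the paper's configuration.

```python
from collections import deque


def momentum_update(key_params, query_params, m):
    """Move the key encoder slowly toward the query encoder (momentum coefficient m)."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


# The key encoder keeps most of its old value when m is close to 1.
updated = momentum_update([0.0, 1.0], [1.0, 1.0], m=0.99)

# The negatives live in a fixed-size FIFO queue: new keys push out the oldest.
queue = deque(maxlen=4)
for batch_id in range(6):
    queue.append(f"keys_{batch_id}")
```

The queue length sets how many negatives each query is contrasted against, independently of the batch size. That is the quantity the abstract reports as growing with dataset size.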

A small-compute characterization with a ViT-Tiny on Tiny-ImageNet subsets

Modern computer vision often reuses a single model, trained once on many images, as a starting point for new tasks. Because labels are expensive, a common way to train such a model is self-supervised learning (SSL), which learns from unlabeled images. SSL normally uses millions of images, and it is unclear how well it works when far fewer are available. We study one SSL method, Barlow Twins, in that case. We pre-train a small vision transformer (5.4M parameters) on parts of Tiny-ImageNet, from 1k to 100k unlabeled images, and train every run for the same 1000 epochs, so the only thing that changes is the amount of data. We then freeze each model and measure how well its features transfer to the 19 VTAB-1k tasks. Pre-training helps at every dataset size: the VTAB-1k average rises from 33.7% with 1k images to 39.2% with 100k, well above a 24.4% untrained baseline. But this average hides large differences between tasks: accuracy on natural-image tasks keeps rising with data, while accuracy on more specialized and structured tasks (medical, satellite, and geometric images) changes little. On the smallest dataset, training too long even lowers accuracy. And as the dataset grows, the checkpoint that scores best on the pre-training data moves further from the one that transfers best. At this small scale, then, the amount of data is not the only thing that matters: the kind of downstream task and the checkpoint we keep matter just as much. ...
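For readers unfamiliar with Barlow Twins, the sketch below computes its objective on a toy batch: the cross-correlation matrix between two views' embeddings is pushed toward the identity. It is a pure-Python illustration that assumes embeddings given as lists of rows; the function names, the toy batch, and the off-diagonal weight `lam=0.005` are illustrative assumptions, not the paper's implementation.

```python
import math


def standardize(z):
    """Normalize each embedding dimension to zero mean and unit std over the batch."""
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
        for i in range(n):
            out[i][j] = (z[i][j] - mean) / std
    return out


def barlow_twins_loss(z_a, z_b, lam=0.005):
    """Sum of (1 - C_ii)^2 on the diagonal plus lam * C_ij^2 off the diagonal."""
    a, b = standardize(z_a), standardize(z_b)
    n, d = len(a), len(a[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(a[k][i] * b[k][j] for k in range(n)) / n
            loss += (1.0 - c_ij) ** 2 if i == j else lam * c_ij ** 2
    return loss


# Toy batch of four 2-D embeddings whose two dimensions are uncorrelated.
view_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
swapped = [[row[1], row[0]] for row in view_a]

loss_identical = barlow_twins_loss(view_a, view_a)
loss_swapped = barlow_twins_loss(view_a, swapped)
```

Two identical views with decorrelated dimensions give a loss of zero. Swapping the dimensions in one view moves the correlation off the diagonal and the loss rises, which is the behavior the objective penalizes.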

Optimizing I-JEPA for Data Efficiency

Self-supervised learning eliminates the need for image labels to learn meaningful visual representations, but it does not remove the need for large pretraining datasets. This work studies how Image-based Joint-Embedding Predictive Architecture (I-JEPA) behaves when pretraining data is deliberately limited. We train I-JEPA on stratified Tiny ImageNet subsets and evaluate the frozen representations with CIFAR-10 linear probing. The results show a steep improvement from the smallest subsets to the medium-data regime, followed by a plateau around the largest subsets under the standard final-checkpoint protocol. We also test two architectural modifications motivated by I-JEPA's design: reducing predictor capacity, to test whether an over-expressive predictor absorbs the pretext task instead of forcing useful encoder features, and adding shared photometric augmentation, to test whether extra input variation helps in low-data training. The shallow predictor improves transfer at 32k and 64k images but is neutral or harmful at the smallest and largest splits. The augmentation decreases downstream accuracy at 16k and is neutral at 32k. Additional controls (predictor depth sweeps, fixed-update budgets, and intermediate checkpoint analysis) suggest that the largest-split plateau is partly a training-dynamics issue rather than a pure data-efficiency ceiling. A cross-method comparison with Barlow Twins, MoCo, DINO, and MAE under the shared protocol contextualizes I-JEPA's data efficiency among SSL alternatives. ...