M. Fregonara
2 records found
“Don’t tune hyperparameters on the test set” is often stated in machine learning textbooks. Violating it is considered a cardinal sin that produces misleadingly optimistic results, corrupts benchmark integrity, and thus can even be interpreted as scientific fraud. Yet evidence suggests that test set hyperparameter tuning does occur in practice, making it all the more important to understand its actual consequences. So how bad is it, really? In this work we question this dogma and put it to an empirical test. We systematically study the magnitude of the performance inflation caused by tuning the hyperparameters on the test set for MNIST-1D and CIFAR-10. Our experiments show that while the effect is real and significant, it is frequently small relative to other sources of noise. In many cases, we find that tuning on the test set recovers exactly the same model as when tuning on the validation set. Most importantly, we find that the rankings of models remain preserved after tuning on the test set, and therefore that consistent test-set tuning does not invalidate benchmarks or model selection. Our results call for a more nuanced view of tuning hyperparameters on the test set, stimulating researchers to openly report test tuning.
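The selection effect described in this abstract can be illustrated with a toy simulation. The sketch below is not the authors' experimental protocol and does not reproduce their MNIST-1D or CIFAR-10 results. The function name `simulate_tuning`, the range of true accuracies, the number of configurations, and the evaluation-set size are all assumptions chosen for illustration, and measured accuracy is modelled with a normal approximation.

```python
import math
import random
import statistics


def simulate_tuning(n_configs=20, n_eval=1000, n_trials=300, seed=0):
    """Compare score inflation when a configuration is selected on the test
    set versus on a separate validation set (all numbers are hypothetical)."""
    rng = random.Random(seed)
    inflation_test_tuned = []
    inflation_val_tuned = []
    same_choice = 0

    def noisy(acc):
        # Normal approximation to the accuracy measured on n_eval examples.
        return rng.gauss(acc, math.sqrt(acc * (1 - acc) / n_eval))

    for _ in range(n_trials):
        # Assumed true accuracies of the candidate hyperparameter settings.
        true_acc = [rng.uniform(0.80, 0.90) for _ in range(n_configs)]
        val_acc = [noisy(a) for a in true_acc]
        test_acc = [noisy(a) for a in true_acc]

        best_on_test = max(range(n_configs), key=test_acc.__getitem__)
        best_on_val = max(range(n_configs), key=val_acc.__getitem__)

        # Inflation = reported test accuracy minus the true accuracy
        # of the configuration that was selected.
        inflation_test_tuned.append(test_acc[best_on_test] - true_acc[best_on_test])
        inflation_val_tuned.append(test_acc[best_on_val] - true_acc[best_on_val])
        same_choice += best_on_test == best_on_val

    return (statistics.mean(inflation_test_tuned),
            statistics.mean(inflation_val_tuned),
            same_choice / n_trials)


test_tuned, val_tuned, agreement = simulate_tuning()
print(f"mean inflation, tuned on test set:       {test_tuned:+.4f}")
print(f"mean inflation, tuned on validation set: {val_tuned:+.4f}")
print(f"fraction of trials with the same choice: {agreement:.2f}")
```

In this toy setting the reported score is biased upward only when the same noisy estimate is used both to select and to report. How large that bias is relative to other noise depends entirely on the assumed parameters.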
With the development of new technologies and approaches in the field of social signal processing, concerns about privacy when recording conversations have arisen. One of the main ways to preserve the privacy of the speakers in recorded conversations is to decimate the recordings, that is, to reduce the sample frequency and the frequency bandwidth of the audio. In theory, this makes the verbal content of the conversation (the words themselves) unintelligible while still preserving other useful non-verbal social cues such as laughter, pitch modulation and frequency of speech, amongst others. However, this has not been experimentally verified. This research paper addresses this knowledge gap by exploring the performance of laughter detection machine learning models on decimated audio. An existing pre-trained state-of-the-art laughter detection model was employed, and its performance was evaluated on a dataset of decimated audio with sample frequencies ranging from 300 Hz to 44100 Hz.
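As a rough illustration of what decimation does to a signal, the sketch below low-pass filters a sequence of samples and then keeps every `factor`-th sample. It is a minimal pure-Python example, not the processing pipeline used in the paper. The helper names `lowpass_fir` and `decimate`, the Hamming-windowed sinc filter, the 0.4 cutoff margin, and the 8000 Hz example rate are all assumptions made for illustration.

```python
import math


def lowpass_fir(cutoff, num_taps=101):
    """Hamming-windowed sinc low-pass filter.

    cutoff is a fraction of the sampling rate (0 < cutoff < 0.5);
    num_taps is assumed to be odd.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalise to unity gain at 0 Hz


def decimate(signal, factor, num_taps=101):
    """Low-pass filter, then keep every factor-th sample (zero-padded edges)."""
    # Cutoff placed below the new Nyquist frequency (0.5 / factor) to leave
    # room for the filter's transition band.
    taps = lowpass_fir(0.4 / factor, num_taps)
    half = (num_taps - 1) // 2
    out = []
    for i in range(0, len(signal), factor):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(signal):
                acc += t * signal[j]
        out.append(acc)
    return out


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


# Hypothetical example: one second of audio at 8000 Hz, reduced to 1000 Hz.
fs, factor = 8000, 8
low_tone = [math.sin(2 * math.pi * 100 * n / fs) for n in range(fs)]
high_tone = [math.sin(2 * math.pi * 2000 * n / fs) for n in range(fs)]

print("samples after decimation:", len(decimate(low_tone, factor)))
print("RMS of 100 Hz tone after decimation: ", rms(decimate(low_tone, factor)[100:900]))
print("RMS of 2000 Hz tone after decimation:", rms(decimate(high_tone, factor)[100:900]))
```

Content above the reduced bandwidth is strongly attenuated while low-frequency content passes through, which is the property the privacy argument relies on. Whether speech actually becomes unintelligible at a given rate is the empirical question the abstract raises, and this sketch does not answer it.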