L. Poenaru-Olaru

4 records found

How well do Margin Density-based concept drift detectors identify concept drift in case of synthetic/real-world data?

When deployed in production, machine learning models sometimes lose accuracy over time because the distribution of the incoming data changes, so that the model no longer reflects reality. This change, and the loss of accuracy it causes, is called concept drift. Drift detectors are algorithms that detect such drifts. They are important because they reveal when a classification model has become inaccurate, and they can even be used to detect adversarial attacks on machine learning algorithms. The detectors discussed in this paper are Margin Density drift detectors. They are evaluated in an unsupervised context, where we assume no testing labels are available. In real-world applications of machine learning models this is often the case, as obtaining labels is costly. The experiments in this paper show that margin density detectors can be useful tools for detecting the first drift in synthetic data, although parameter tuning is needed to achieve high accuracy on some datasets. In an unsupervised environment with more than one drift, the detectors are unreliable, as the experiments involving real-world data showed. This paper is accompanied by an implementation of margin density detectors. ...
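To illustrate why a margin-density signal needs no labels, here is a minimal sketch. It is not the thesis implementation: it assumes a fixed linear classifier with hypothetical weights `w` and bias `b`, an SVM-style margin of 1, and a hypothetical fixed `threshold` on the change in margin density. The function names are illustrative only.

```python
import numpy as np


def margin_density(X, w, b, margin=1.0):
    """Fraction of samples whose score w.x + b falls inside the margin.

    Only the features X are needed; no labels are used.
    """
    scores = X @ w + b
    return float(np.mean(np.abs(scores) <= margin))


def drift_suspected(md_reference, md_current, threshold):
    """Flag drift when margin density moves away from its reference value.

    The fixed threshold is an assumption made for this sketch.
    """
    return abs(md_current - md_reference) > threshold


# Hypothetical linear classifier that separates points along the first feature.
w = np.array([1.0, 0.0])
b = 0.0

rng = np.random.default_rng(0)
# Reference data: two clusters far from the decision boundary.
reference = np.vstack([
    rng.normal([-3.0, 0.0], 0.5, (500, 2)),
    rng.normal([3.0, 0.0], 0.5, (500, 2)),
])
# Later batch: both clusters have moved close to the decision boundary.
drifted = np.vstack([
    rng.normal([-0.5, 0.0], 0.5, (500, 2)),
    rng.normal([0.5, 0.0], 0.5, (500, 2)),
])

md_ref = margin_density(reference, w, b)
md_new = margin_density(drifted, w, b)
print(md_ref, md_new, drift_suspected(md_ref, md_new, threshold=0.2))
```

In this toy setup the second batch places far more points inside the margin, so the change is flagged without looking at a single label. How the threshold is chosen is exactly the kind of parameter tuning the abstract mentions.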
Various techniques have been studied to handle unexpected changes in data streams, a phenomenon called concept drift. When the incoming data is not labeled and the labels cannot be obtained with reasonable effort, detecting these drifts becomes harder. This study evaluates how well two label-independent drift detection methods based on the data distribution, SyncStream and Statistical Change Detection for Multi-Dimensional Data, detect concept drift. This is done by implementing the algorithms and evaluating them side by side on both synthetic and real-world datasets. The metrics used for synthetic datasets are False Positive Rate and Latency; for real-world datasets, Accuracy is used instead of Latency. The experiments show that both drift detectors perform significantly worse on real-world data than on synthetic data. ...
Concept drift is an unforeseeable change in the underlying data distribution of streaming data; because of such a change, classifiers deployed on that data show a drop in accuracy. Concept drift detectors are algorithms capable of detecting such a drift, and unsupervised ones detect drift without needing the data’s actual labels, which can be expensive to obtain. This work is concerned with the implementation and evaluation of two existing unsupervised concept drift detectors based on clustering, UCDD and MSSW, on both synthetic and real-world data. Our biggest contribution is making the implementations publicly available. Our evaluation also shows that UCDD detects drift earlier for simple numerical synthetic datasets, that MSSW detects drift earlier for more complex synthetic datasets with categorical features, and that neither seems suitable for real-world datasets. ...
Bachelor thesis (2023) - T. Zamfirescu, L. Poenaru-Olaru, J.S. Rellermeyer
Label-independent concept drift detectors represent an emerging topic in machine learning research, especially in models deployed in a production environment where obtaining labels can become increasingly difficult and costly. Concept drift refers to unforeseeable changes in the distribution of data streams, which directly impact the performance of a model trained on historical data. This paper initially focuses on two mixed label-independent drift detectors, SQSI and UDetect, which are implemented and evaluated on a specific setup using synthetic and real-world data sets. Next, multiple label-dependent drift detectors are evaluated on real-world data sets, and the results are compared to those of the label-independent detectors. This paper presents a framework for comparing multiple concept drift detectors on different data sets and configurations, checking whether they can be reliably used in a production environment. ...