Y. Chen
67 records found
To study such vulnerabilities, this thesis considers two kinds of parties: clients and servers. Clients act as data owners that perform localized computations and share only model parameters, thereby preserving raw data privacy, yet they introduce vulnerabilities through potentially malicious behavior (e.g., data or model poisoning attacks) or unreliable contributions caused by poor data quality. In contrast, the server, while facilitating model convergence through aggregation, poses inherent privacy risks by potentially inferring sensitive client information from shared gradients, even without direct data access. These two parties create a dual-threat landscape: clients may compromise model performance through adversarial manipulations, while servers may break confidentiality via reconstruction methods....
To address these challenges, this thesis poses five research questions, combining theoretical analysis with empirical validation across diverse machine learning scenarios. The first concerns noisy crowdsourced labels, where non-professional workers introduce errors that degrade model performance; it calls for online aggregation methods that process data incrementally rather than over the whole set at once. The second involves black-box model distillation without real data, where efficiently generating high-quality synthetic queries remains difficult. The third extends this setting by incorporating semantic information from public data, aiming to reduce the number of queries typically required for effective distillation. The fourth investigates generative model distillation, asking whether dark knowledge (inference probabilities) exists beyond final outputs and how it improves generalization. The fifth examines diffusion models, whose multi-step Markov chain structure introduces unique difficulties for distillation and sampling acceleration.
Chapter 2 tackles distilling knowledge from noisy crowdsourced labels. Unlike offline aggregation methods, which require all labels at once, we propose BILA, an online framework that processes label chunks incrementally using a confusion-matrix-based neural network model that can be trained with first-order stochastic optimizers. BILA achieves higher accuracy than existing offline algorithms, enabling robust real-time label cleaning.
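To make the idea of confusion-matrix-based online aggregation concrete, the following is a minimal sketch, not BILA itself. It assumes a simple count-based (Dawid-Skene-style) update in place of the neural network and stochastic optimizer mentioned above. The names `make_state`, `posterior`, and `process_chunk`, as well as the pseudo-count parameters, are hypothetical and chosen only for illustration.

```python
import math


def make_state(n_workers, n_classes, prior=1.0, diag_boost=2.0):
    """Per-worker confusion pseudo-counts: state[w][t][o] is how often
    worker w reported label o when the true label was t.
    The diagonal boost encodes a mild prior that workers beat chance."""
    return [
        [
            [prior + (diag_boost if t == o else 0.0) for o in range(n_classes)]
            for t in range(n_classes)
        ]
        for _ in range(n_workers)
    ]


def posterior(state, votes, n_classes):
    """Posterior over the true label of one item, given its votes
    as a list of (worker_id, observed_label), under a uniform class prior."""
    log_p = [0.0] * n_classes
    for w, o in votes:
        for t in range(n_classes):
            row = state[w][t]
            log_p[t] += math.log(row[o] / sum(row))
    m = max(log_p)
    exp_p = [math.exp(v - m) for v in log_p]
    z = sum(exp_p)
    return [v / z for v in exp_p]


def process_chunk(state, chunk, n_classes):
    """Consume one chunk of items, return estimated labels, and update
    the confusion counts in place so later chunks benefit from earlier ones."""
    estimates = []
    for votes in chunk:
        post = posterior(state, votes, n_classes)
        estimates.append(max(range(n_classes), key=post.__getitem__))
        for w, o in votes:
            for t in range(n_classes):
                state[w][t][o] += post[t]  # soft count update
    return estimates


if __name__ == "__main__":
    state = make_state(n_workers=3, n_classes=2)
    chunk = [
        [(0, 0), (1, 0), (2, 1)],  # workers 0 and 1 agree, worker 2 disagrees
        [(0, 1), (1, 1), (2, 0)],
    ]
    print(process_chunk(state, chunk, n_classes=2))
```

Each chunk is seen once and then discarded; only the per-worker counts persist, which is what distinguishes this online style from offline methods that need all labels up front.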
Chapter 3 addresses black-box distillation without access to real training data. Existing methods explore the input space inefficiently. We propose TANDEMGAN, which pairs exploration, generating diverse synthetic queries, with exploitation, focusing on high-confidence queries. This tandem architecture enables effective substitute model training in general adversarial scenarios where only class labels are available.
Chapter 4 further improves black-box efficiency by incorporating semantic information from public data. We introduce AEDM, which leverages pre-trained diffusion models to generate semantically rich query images resembling real data. By optimizing the input noise of the diffusion model based on substitute model feedback, AEDM achieves superior distillation accuracy with significantly fewer queries and extends to federated learning settings.
Learn together over time
Distributed Multi-frequency time series framework
Transformer-Based Synthetic Relational Data
Closing the Gap Between Diffusion-Based and Transformer-Based Synthetic Relational Data Generation
Recent advances in synthetic data generation have demonstrated considerable success for single-table datasets, with emerging research extending these capabilities to multi-table relational scenarios.
While transformer and diffusion architectures achieve state-of-the-art performance in single-table generation, a notable performance gap emerges when applied to relational data, where diffusion approaches consistently outperform transformer-based methods.
This thesis examines the factors contributing to this performance difference, conducting an evaluation using multiple baselines across both single and relational tabular datasets, with REaLTabformer and ClavaDDPM as state-of-the-art transformer- and diffusion-based approaches, respectively.
Our investigation reveals that the performance gap can mainly be attributed to the inadequate processing of contextual relationships and suboptimal strategies for representing inter-table dependencies in transformer-based models.
To close this gap, we introduce two changes for transformer-based models: layer sharing to enhance parameter utilization and contextual encoding to better preserve the relational structure.
These changes provide insight into the key design principles behind effective synthetic relational data generation using transformer-based models, particularly the need for architectures that account for context and facilitate practical knowledge transfer.
The proposed methods result in substantial performance improvements, with a 1.52-fold improvement in Logistic Detection and a 1.94-fold reduction in the Discriminator Measure metric.
A comparative study between GPT-based models and diffusion models showed that diffusion models produce synthetic data of higher quality. LDMs were then evaluated as a potential alternative, but their reliance on a variational autoencoder led to low-quality outputs. Hence, standard diffusion models were selected as the superior watermarking candidate. Finally, we introduced an extended set of time-, frequency-, and time-frequency-domain attacks to assess watermark robustness. TimeWak emerged as the most robust watermark. However, our extended attack suite revealed new vulnerabilities in all watermarks, highlighting the importance of comprehensive robustness evaluations.
preserving inference protocol for Hybrid BNs, (iii) an optimized message-passing scheme that
improves communication efficiency even in the purely discrete domain. Our extensive evaluation
shows that Hybrid CCJT improves the predictive accuracy of continuous target variables by an average of 32% in Mean Squared Error and reduces the communication cost by up to 86-fold compared with the best state-of-the-art baseline.
Attacking Federated Time Series Forecasting Models
Reconstructing Private Household Energy Data during Federated Learning with Gradient Inversion Attacks
Unfortunately, privacy risks in federated learning persist, as servers can potentially reconstruct clients' training data through gradient inversion attacks.
While gradient inversion attacks have been demonstrated for image, text, and tabular classification tasks, little is known about time series regression tasks.
In this paper, we first conduct an extensive empirical study on inverting time series data across 4 time series forecasting models and 4 datasets, identifying the unique challenges of reconstructing both observations and targets of time series data.
We then propose TS-Inverse, a novel gradient inversion attack that improves the inversion of time series data through (i) learning a gradient inversion model that outputs quantile predictions, (ii) a unique loss function incorporating periodicity and trend regularization, and (iii) regularization according to the quantile predictions. Our evaluations demonstrate the strong performance of TS-Inverse, which improves sMAPE by at least 2x-10x over existing gradient inversion attack methods on time series data.
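Since the reported improvement is stated in terms of sMAPE, a short sketch of that metric may help. The abstract does not say which sMAPE variant is used; the version below assumes the common definition with range 0 to 200 percent, and treats a term with a zero denominator as zero error. The function name `smape` is illustrative.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent (0 to 200).

    Assumed variant: mean of 2*|a - p| / (|a| + |p|), with terms whose
    denominator is zero (both values exactly zero) counted as zero error.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = abs(a) + abs(p)
        if denom > 0:
            total += 2.0 * abs(a - p) / denom
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    original = [1.0, 2.0, 3.0, 4.0]        # hypothetical private series
    reconstructed = [1.1, 1.9, 3.2, 3.8]   # hypothetical attack output
    print(f"sMAPE: {smape(original, reconstructed):.2f}%")
```

For a reconstruction attack, a lower sMAPE between the original and reconstructed series means a more successful inversion, so under this reading a "2x improvement" corresponds to halving the error.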
Watermarking Diffusion Graph Models
GUISE: Graph GaUssIan Shading watErmark
We conduct several experiments using the LDM-3DG model on the publicly available QM9 and Drugs datasets to assess the robustness and effectiveness of our technique. Our results demonstrate that the watermarked molecules maintain statistical parity in 9 out of 10 performance metrics compared to the original. Moreover, they exhibit a 100% detection rate and a 99% extraction rate in a 2D decoded pipeline, while also showing robustness against post-editing attacks.
Time's Up!
Robust Watermarking in Large Language Models for Time Series Generation
Through comprehensive experiments on four real-world datasets (Abalone, Adult, Default, and Diabetes), we demonstrate that the adapted watermarking technique incurs a negligible drop of 3.5% in data quality, measured through correlations between real and synthetic distributions, performance on downstream machine learning tasks, and discriminability between the real and synthetic data. This is better than the 12.46% drop in data quality incurred with a circle mask. Ellipse introduces a non-significant average drop of 0.4% in detection efficiency compared to a circle mask. Our implementation also offers resilience against value skewing and deletion attacks on the rows and columns of the dataset. When exposed to attacks, Ellipse has a higher Area Under the Curve (AUC) than the circular mask of Tree-Ring by an average of 7.17%. The code for Ellipse is publicly available at https://github.com/6toma/ellipse-watermark.
Go With The Flow: Fault-Tolerant Decentralized Training of Large Language Models
Decentralised Training of Large Language Models
By incorporating watermarks into the model’s process, we guarantee resilience and undetectability. Our approach preserves the LDCast model’s predictive accuracy while still allowing the origins of the data to be verified. We confirm the efficacy of our method through comprehensive evaluation, underscoring its potential to improve the security and integrity of time series forecasting models.
the Fourier-transformed latent of the table, extending gradually across a large portion of the space. The watermark can be detected by calculating the distance between the Fourier-transformed tabular latent and the ground-truth watermark patch. Additionally, we develop post-editing attacks, including row/column/value deletion and distortion, to evaluate the robustness of the watermark. Our evaluation on four datasets demonstrates that our watermarking scheme effectively preserves the quality of synthetic tables in terms of resemblance, discriminability, and downstream utility. The average quality difference is less than 0.6% compared to non-watermarked data, while maintaining high detectability, with average statistical p-values over 25× lower than 0.02. Additionally, our robustness analysis shows that the watermark is resilient against various post-editing actions, with 85% of the p-values remaining below 0.05 across all 18 attack settings on four datasets.
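The detection step described above (comparing the Fourier-transformed latent with a ground-truth patch) can be illustrated with a small sketch. This is a simplified stand-in, not the scheme from the abstract: it assumes a one-dimensional real-valued latent, a naive discrete Fourier transform, and a raw mean distance instead of a statistical p-value. The helper names `dft`, `idft`, `embed`, and `distance`, and the patch values, are hypothetical.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex list."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(X):
    """Inverse of dft."""
    n = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def embed(latent, patch):
    """Overwrite selected frequency coefficients with the watermark patch.

    patch maps a frequency index k (0 < k < n/2) to a complex value.
    The mirrored coefficient is set to the conjugate so the result stays real.
    """
    n = len(latent)
    X = dft(latent)
    for k, v in patch.items():
        X[k] = v
        X[n - k] = v.conjugate()
    return [z.real for z in idft(X)]


def distance(latent, patch):
    """Mean distance between the latent's Fourier coefficients and the patch;
    small values suggest the watermark is present."""
    X = dft(latent)
    return sum(abs(X[k] - v) for k, v in patch.items()) / len(patch)


if __name__ == "__main__":
    latent = [float(t % 4) for t in range(16)]    # toy stand-in for a latent
    patch = {2: 3 + 0j, 3: 0 + 3j, 4: -3 + 0j}    # hypothetical watermark patch
    marked = embed(latent, patch)
    print("clean :", distance(latent, patch))
    print("marked:", distance(marked, patch))
```

In practice a threshold or significance test on this distance decides whether a table is flagged as watermarked, and post-editing attacks are evaluated by how much they push the distance back toward the clean value.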
Exploring the Impact of Single-Character Attacks in Federated Learning Language Classification
Introducing the Novel Single-Character Strike
Time-Series Forecasting with Hybrid Federated Learning
A Personalized Approach to Collaboration
Existing solutions have addressed some of these challenges for forecasting or for other purposes, but a comprehensive approach that handles all of them for time series forecasting is lacking. To solve this problem, we introduce Time-series-based Personalized Hybrid Federated Learning (TPHFL), a hybrid federated learning (FL) strategy that combines Horizontal FL and Vertical FL to enable multi-level knowledge exchange while preserving data privacy. All participants use a personalisation mechanism to make predictions that better suit their underlying data distribution.
Our approach employs a distributed model to handle vertical privacy constraints and addresses data heterogeneity across equipment through a personalisation mechanism. Through extensive experiments on four public datasets and one industry-specific dataset, we show that TPHFL outperforms independent learning scenarios by 27.2%, providing a strong incentive for parties to collaborate.
We demonstrate the effectiveness of personalisation by showing an accuracy improvement of up to 42.7% when comparing TPHFL with personalisation to TPHFL without personalisation, and 32.7% when comparing traditional HFL methods to HFL with personalisation. Additionally, we evaluate a different configuration for personalisation and perform a detailed hyperparameter analysis to better understand the behaviour of TPHFL in different contexts.