C. Hong
To address these challenges, this thesis poses five research questions, combining theoretical analysis with empirical validation across diverse machine learning scenarios. The first concerns noisy crowdsourced labels, where non-professional workers introduce errors that degrade model performance; it calls for online aggregation methods that process data incrementally rather than over the whole set at once. The second concerns black-box model distillation without real data, where efficiently generating high-quality synthetic queries remains difficult. The third extends this by incorporating semantic information from public data, aiming to reduce the number of queries typically required for effective distillation. The fourth investigates generative model distillation, asking whether dark knowledge (inference probabilities) exists beyond final outputs and how it improves generalization. The fifth examines diffusion models, whose multi-step Markov chain structure introduces unique difficulties for distillation and sampling acceleration.
Chapter 2 tackles distilling knowledge from noisy crowdsourced labels. Unlike offline aggregation methods, which require all labels at once, we propose BILA, an online framework that processes label chunks incrementally using a confusion-matrix-based neural network model that can be trained with first-order stochastic optimizers. BILA achieves higher accuracy than existing offline algorithms, enabling robust real-time label cleaning.
Chapter 3 addresses black-box distillation without access to real training data. Existing methods explore the input space inefficiently. We propose TANDEMGAN, which pairs exploration, generating diverse synthetic queries, with exploitation, focusing on high-confidence queries. This tandem architecture enables effective substitute model training in general adversarial scenarios where only class labels are available.
Chapter 4 further improves black-box efficiency by incorporating semantic information from public data. We introduce AEDM, which leverages pre-trained diffusion models to generate semantically rich query images resembling real data. By optimizing the input noise of the diffusion model based on substitute model feedback, AEDM achieves superior distillation accuracy with significantly fewer queries and extends to federated learning settings.
While diffusion models generate remarkable synthetic images, a key limitation is their inference inefficiency: generation requires numerous sampling steps. To accelerate inference while maintaining high-quality synthesis, teacher-student distillation is applied to compress diffusion models progressively, halving the number of steps with each round of retraining, e.g., reducing a 1024-step model to a 128-step model in 3 folds. In this paper, we propose a single-fold distillation algorithm, SFDDM, which can flexibly compress the teacher diffusion model into a student model with any desired number of steps, based on a reparameterization of the intermediate inputs from the teacher model. To train the student diffusion model, we minimize not only the distance between the teacher and student outputs but also the distance between the distributions of their hidden variables. Extensive experiments on four datasets demonstrate that a student model trained with SFDDM can sample high-quality data with the number of steps reduced to less than 1%, thereby reducing inference time. These results indicate that SFDDM effectively transfers knowledge in a single fold, achieving semantic consistency and meaningful image interpolation.
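The fold count quoted in the abstract can be checked with a little arithmetic. The sketch below only illustrates the progressive halving baseline described above, not SFDDM itself; the function name `halving_schedule` is hypothetical and introduced here for illustration.

```python
def halving_schedule(teacher_steps: int, student_steps: int) -> list[int]:
    """Step counts visited by progressive (binary) distillation.

    Each fold retrains a student with half the sampling steps of the
    previous model. Illustrative helper, not taken from the paper.
    """
    if student_steps < 1 or teacher_steps < student_steps:
        raise ValueError("need 1 <= student_steps <= teacher_steps")
    schedule = [teacher_steps]
    # Keep halving until the target step count is reached or undershot.
    while schedule[-1] > student_steps:
        schedule.append(max(1, schedule[-1] // 2))
    return schedule


schedule = halving_schedule(1024, 128)
folds = len(schedule) - 1  # one retraining round per halving
print(schedule, folds)
```

For the 1024-step to 128-step example, the schedule passes through 512 and 256, which gives the 3 folds mentioned in the abstract. A target that is not reachable by repeated halving is undershot by this baseline, which is the inflexibility a single-fold method aims to avoid.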
GIDM
Gradient Inversion of Federated Diffusion Models
Maverick Matters
Client Contribution and Selection in Federated Learning
Federated learning (FL) enables collaborative learning between parties, called clients, without sharing the original and potentially sensitive data. To ensure fast convergence in the presence of heterogeneous clients, it is imperative to select, in a timely manner, the clients who can effectively contribute to learning. A realistic but overlooked case of heterogeneous clients is that of Mavericks, who monopolize the possession of certain data types, e.g., children's hospitals possess most of the data on pediatric cardiology. In this paper, we establish the importance of Mavericks and tackle the challenges they pose by exploring two types of client selection strategies. First, we show theoretically and through simulations that the common contribution-based approach, Shapley Value, underestimates the contribution of Mavericks and is hence not an effective measure for selecting clients. Then, we propose FedEMD, an adaptive strategy with competitive overhead based on the Wasserstein distance, supported by a proven convergence bound. Because FedEMD adapts the selection probability such that Mavericks are preferentially selected when the model benefits from improvement on rare classes, it consistently ensures fast convergence in the presence of different types of Mavericks. Compared to existing strategies, including Shapley Value-based ones, FedEMD improves the convergence speed of neural network classifiers with FedAvg aggregation by 26.9%, and its performance is consistent across various levels of heterogeneity.
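To make the distance underlying FedEMD concrete, the sketch below computes a one-dimensional Wasserstein (earth mover's) distance between a client's label distribution and a global one. It is a minimal illustration under stated assumptions, not the selection rule from the paper: treating class indices as ordered positions, and the names `normalize` and `emd_1d`, are assumptions made here.

```python
from itertools import accumulate


def normalize(counts):
    """Turn per-class sample counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def emd_1d(p, q):
    """1-D Wasserstein-1 distance between two distributions on classes 0..K-1.

    Assumes unit spacing between adjacent class indices, so the distance
    equals the sum of absolute differences between the two CDFs.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same classes")
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


global_dist = normalize([100, 100, 100, 100])  # balanced global data
maverick = normalize([0, 0, 0, 50])            # owns only the last class
regular = normalize([10, 10, 10, 10])          # mirrors the global mix

print(emd_1d(maverick, global_dist), emd_1d(regular, global_dist))
```

In this toy setting the Maverick-like client sits far from the global distribution while the regular client has distance zero, which is the kind of signal a distance-based selection strategy can adapt to.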
Online label aggregation
A variational Bayesian approach
Noisy labeled data is more the norm than a rarity for crowdsourced content. Aggregating the results from crowd workers is an effective way to filter out noise and infer correct labels. To keep results timely and cope with slow worker responses, online label aggregation is increasingly in demand, calling for solutions that can incrementally infer the true label distribution from subsets of data items. In this paper, we propose a novel online label aggregation framework, BiLA, which employs a variational Bayesian inference method and designs a novel stochastic optimization scheme for incremental training. BiLA is flexible enough to accommodate any generating distribution of labels through the exact computation of its posterior distribution. We also derive the convergence bound of the proposed optimizer. We compare BiLA with state-of-the-art methods based on minimax entropy, neural networks, and expectation maximization algorithms, on synthetic and real-world datasets. Our evaluation results on various online scenarios show that BiLA can effectively infer the true labels, with an error rate reduction of at least 10 and 1.5 percentage points for synthetic and real-world datasets, respectively.
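The idea of aggregating labels chunk by chunk can be sketched with a much simpler stand-in than BiLA. The example below is a streaming weighted vote that keeps running reliability counts per worker; it is not the variational Bayesian method of the paper, and the class name `OnlineWeightedVote`, the priors, and the uniform-error assumption are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


class OnlineWeightedVote:
    """Streaming label aggregation with per-worker reliability counts."""

    def __init__(self, num_classes, prior_correct=2.0, prior_wrong=1.0):
        self.num_classes = num_classes
        # Pseudo-counts act as a prior that workers are somewhat reliable.
        self.correct = defaultdict(lambda: prior_correct)
        self.wrong = defaultdict(lambda: prior_wrong)

    def _weight(self, worker):
        acc = self.correct[worker] / (self.correct[worker] + self.wrong[worker])
        acc = min(max(acc, 1e-3), 1 - 1e-3)  # keep the logarithm finite
        # Log-odds weight, assuming errors are spread uniformly over wrong classes.
        return math.log(acc * (self.num_classes - 1) / (1 - acc))

    def process_chunk(self, chunk):
        """chunk maps item -> list of (worker, label); returns item -> label."""
        inferred = {}
        for item, votes in chunk.items():
            scores = [0.0] * self.num_classes
            for worker, label in votes:
                scores[label] += self._weight(worker)
            inferred[item] = max(range(self.num_classes), key=scores.__getitem__)
        # Update reliability counts using the labels inferred for this chunk only.
        for item, votes in chunk.items():
            for worker, label in votes:
                if label == inferred[item]:
                    self.correct[worker] += 1
                else:
                    self.wrong[worker] += 1
        return inferred


agg = OnlineWeightedVote(num_classes=2)
first = agg.process_chunk({
    "x1": [("a", 0), ("b", 0), ("c", 1)],
    "x2": [("a", 1), ("b", 1), ("c", 0)],
})
second = agg.process_chunk({"x3": [("a", 1), ("c", 0)]})
print(first, second)
```

After the first chunk, worker `c` has disagreed with the consensus twice, so in the second chunk the otherwise tied vote on `x3` is resolved in favor of worker `a`. Earlier chunks are never revisited, which is the property that distinguishes online from offline aggregation.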