Accurate Differentially Private Deep Learning on the Edge

Deep learning (DL) models are increasingly built on federated edge participants holding local data. To enable insight extraction without the risk of information leakage, DL training is usually combined with differential privacy (DP). The core theme is to trade off learning accuracy against privacy by adding statistically calibrated noise, particularly to the local gradients of edge learners, during model training. However, this privacy guarantee unfortunately degrades model accuracy due to edge learners' local noises and the global noise aggregated at the central server. Existing DP frameworks for the edge focus on local noise calibration via gradient clipping techniques, overlooking the heterogeneity and dynamic changes of local gradients, and their aggregated impact on accuracy. In this article, we present a systematic analysis that unveils the influential factors capable of mitigating local and aggregated noises, and design PrivateDL to leverage these factors in noise calibration so as to improve model accuracy while fulfilling the privacy guarantee. PrivateDL features: (i) sampling-based sensitivity estimation for local noise calibration and (ii) combining large batch sizes and critical data identification in global training. We implement PrivateDL on the popular Laplace/Gaussian DP mechanisms and demonstrate its effectiveness using Intel BigDL workloads, considerably improving model accuracy by up to 5X compared with existing DP frameworks.


INTRODUCTION
DEEP learning (DL) has gained unquestionable success in many domains (e.g., image classification, video object recognition) and continues to flourish. Today, the emergence of edge computing [1] presents a new paradigm that trains DL models on multiple edge nodes/participants holding local private data. Federated learning is a prevalent framework to support such collaborative learning using datasets distributed across multiple participants (e.g., edge nodes) [2], [3], [4], [5], [6]. With federated learning, edge nodes compute the model updates using decentralized data and contribute to the global model updates, e.g., exchanging the gradients under the orchestration of a central server. However, it has been shown that there is still a risk of information leakage even when only intermediate model updates are exchanged between edge nodes and the server [7], [8], [9]. Differential privacy [10], [11], [12] is one prevalent technique to protect data privacy by adding statistically calibrated noise at different learning stages based on the estimated risk of privacy loss [2], [13], [14], [15]. The privacy guarantee here unfortunately comes at the cost of accuracy loss: a lower risk of privacy leakage also means lower DL accuracy.
Example. Fig. 1 shows a typical example of distributed DL training, which performs global aggregation of gradients across multiple edge nodes in an iterative manner. With differential privacy, each node first computes its local gradients, adds noise to them, and then sends the noise-added gradients, instead of the actual values, to a central server. One can see that the DL model accuracy can be degraded first by the local noise injected at each node and then through the aggregation of these noises at the central server. Our evaluations of BigDL jobs on the KubeEdge platform (in Sections 3.2 and 3.3) show that either local noises or their aggregation leads to more than 50 percent accuracy degradation compared to training without noises.
Differentially Private DL. The core task of differentially private DL is to determine when, what type, and how much noise to add given a privacy budget ε, which determines how much information is leaked by a differential privacy mechanism. Conceptually, this budget defines the theoretical upper bound (e^ε − 1) of privacy leakage [16]. That is, a smaller budget denotes a lower bound of privacy leakage and thus requires larger noise to perturb the model learning process. In practice, two types of statistical noises are typically considered, Laplace and Gaussian, and the amount of noise is jointly determined by the privacy budget and the data sensitivity. The latter measures how sensitively individual data reacts to perturbation, and it is defined by the pair-wise norm of gradients in DL training. Essentially, data that has a higher sensitivity needs larger noise to prevent privacy leakage.
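To make this calibration concrete, the following sketch (our own illustration, not PrivateDL code) draws Laplace noise whose scale b = Δ/ε grows with the sensitivity Δ and shrinks with the budget ε; a smaller budget therefore produces larger perturbations.

```python
import numpy as np

def laplace_noise(sensitivity, epsilon, size, rng=np.random.default_rng(0)):
    # Laplace mechanism calibration: scale b grows with sensitivity
    # and shrinks with the privacy budget epsilon.
    b = sensitivity / epsilon
    return rng.laplace(loc=0.0, scale=b, size=size)

# A smaller budget (stronger privacy) yields larger noise for the same sensitivity.
low_eps = laplace_noise(sensitivity=1.0, epsilon=0.1, size=100_000)
high_eps = laplace_noise(sensitivity=1.0, epsilon=1.0, size=100_000)
assert low_eps.std() > high_eps.std()
```

The standard deviation of Lap(b) is √2·b, so lowering ε from 1.0 to 0.1 multiplies the noise magnitude roughly tenfold.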
Challenges of High-Privacy and Accurate DL Training. Edge-based DL training usually requires small privacy budgets to guarantee high privacy levels, while at the same time requiring low noise to mitigate model accuracy degradation. Achieving these objectives gives rise to three major challenges in practice.
First, it is necessary to systematically analyze the influential factors, in addition to the privacy budget, of both local and aggregated noises. Specifically, the extent of local noise is mainly determined by the data sensitivity given a privacy budget. Deducing the influential factors of sensitivity, therefore, is the prerequisite to appropriately estimating it. Moreover, unveiling the influential factors that minimize the aggregated noise at the central server is another imperative and unsolved challenge in a federated learning setting.
Second, the state-of-the-art of differentially private DL [2], [15], [17], [18] employs gradient clipping techniques that bound the data sensitivity and thus further bound the noise added to gradients. However, real-world DL training tasks usually have to handle varying models, datasets, and parameters (initial model parameters and hyperparameters in training), which lead to large discrepancies of gradients among different tasks as well as high fluctuations of gradients during a task's iterative training process (for example, the gradients at iteration 10k can be 10000x larger than those at iteration 100 [19]). Hence there is no "one-size-fits-all" best clip bound for different training tasks, and even within a task, the proper bound dynamically changes across training iterations because it is determined by the current range of gradients.
Finally, existing techniques address the question of how to estimate sensitivity for local noises in individual nodes, but do not consider issues relating to the overall model accuracy degraded by aggregated noises in a federated learning setting. Hence how to reduce aggregated noises without increasing the model training overhead is another challenge to be addressed.
Motivated by these challenges, this paper proposes PrivateDL, a novel framework that reduces local noises by accurately estimating data sensitivity locally without pre-defined clip bounds, while further reducing aggregated noises via virtual batch size amplification with a critical set. We specifically consider the federated learning setting, where edge participants train their local models on local data and exchange gradients via the central server. The design of this framework is based on a systematic analysis that unveils the complex dependency of model accuracy on local and aggregated noises, given a privacy budget. The core feature of PrivateDL thus is to tune the factors that can minimize the impact of local and aggregated noises, that is, sensitivity estimation, large batch sizes (to reduce aggregated noises), and training on a subset of critical input data (to improve performance). In detail, we make the following technical contributions.

We analyze the characteristics of distributed DL model training and deduce the influential factors of local and aggregated noises for both the ε-differential privacy and (ε, δ)-differential privacy noise mechanisms. Experimental evaluations on real DL models and datasets show that these factors indeed have significant influences on noises, and hence on model accuracies (Section 3).

We design two PrivateDL modules for efficient and accurate model training (Section 4). First, the sensitivity estimation module reduces the local noise injected in each edge node by dynamically sampling the range of gradients at each iteration, avoiding predefined gradient clipping bounds. Second, to mitigate model accuracy degradation due to aggregated noises, the virtual batch size amplification module increases the batch size to reduce aggregated noises. This module also employs a redundant input data removal technique to decrease the computational cost of gradient calculation.
We implement PrivateDL on KubeEdge [20], an emerging edge computing platform in the Kubernetes ecosystem [21], and incorporate it with the DL algorithms in Intel BigDL [22] and PyTorch [23] (Section 4.4). By applying PrivateDL to both the ε-differential privacy and (ε, δ)-differential privacy noise mechanisms, comparative experiments against existing clipping techniques show: (i) under different DL training settings of local noises, PrivateDL increases model accuracy by an average of 411.65 percent; in particular, the accuracy increase is 565.87 percent for the smallest privacy budgets (i.e., the highest privacy level); (ii) under low privacy levels, where the accuracy improvement from local noise reduction is small, PrivateDL still improves the model accuracy by an average of 131.88 percent via aggregated noise reduction (Section 5).

BACKGROUND
This section first explains the basic concepts of data privacy violation in DL (Section 2.1) and then introduces differential privacy (Section 2.2). We summarize the notations in Table 1.

Data Privacy and Inference Attacks in DL
DL models are susceptible to various data privacy leakages as they remember information about their training data, both in the model parameters and in the parameter updates during training. Within this context, inference attacks on DL algorithms fall into two major categories [24]: tracing/membership attacks that infer if a particular data point was included in the training samples; and reconstruction attacks that infer attributes of data points in the training data.
Membership Attack. This attack can be divided into two types. In the black-box setting, an attacker can only quantify membership information leakage using the prediction outputs of the target model [25]. Hence an attack model is trained to distinguish the target model's behavior on the entire training data from its behavior on the data without some training samples. In the white-box setting, an attacker can observe the activation functions of the target model, and thus either passively observes the model updates or actively influences the training process to extract more information [24].
Reconstruction Attack. This attack learns from the target model about the properties that characterize the entire class [26] or a subset of classes [27], making it possible to construct representatives of the learned classes. For example, federated learning is designed to support model learning using private training data from different participants. Reconstruction attacks allow an attacker to infer properties of other participants' training data [9].
To prevent the above attacks and mask the contribution (privacy information) of any individual participant, differential privacy is one major privacy-preserving technique that introduces uncertainty into the target model. The notion of differential privacy was initially proposed by Dwork et al. [28] to bound the probability of data privacy leakage in databases by adding Laplacian noise. Differentially private mechanisms were first adopted by the database community, e.g., sublinear query (SuLQ) database models [29] and PINQ (Privacy INtegrated Queries) systems [14]. The core step of differentially private algorithms is to decide the level of injected statistical noise based on the privacy budget and how sensitive the data is to the perturbation. Typical differentially private DL algorithms are introduced in the following section.

Differentially Private DL
Given a DL task, let D and D′ denote arbitrary adjacent datasets (that is, they differ in just one record), and let M be a differentially private algorithm that adds statistical noise to gradient updates, according to a given privacy budget ε and data sensitivity Δ.
Definition 1: (ε-Differential Privacy). Let P_M be the domain of all possible outputs of algorithm M and S_M be any subset of P_M. M provides ε-differential privacy if it satisfies the following formula:

Pr[M(D) ∈ S_M] ≤ e^ε × Pr[M(D′) ∈ S_M],   (1)

where budget ε denotes the level of privacy leakage. Conceptually, smaller values of ε mean lower tolerance to privacy leakage and hence require higher levels of noise perturbation, which degrade the model accuracy.
Definition 2: ((ε, δ)-Differential Privacy). Algorithm M provides (ε, δ)-differential privacy if it satisfies the following formula:

Pr[M(D) ∈ S_M] ≤ e^ε × Pr[M(D′) ∈ S_M] + δ,   (2)

where parameter δ denotes the probability that the standard ε-differential privacy is broken. Hence, this definition provides a weaker privacy guarantee. Note that the parameters ε and δ decide the theoretical upper bound of privacy degradation [30]. Specifically, in ε-differential privacy, the upper bound is e^ε − 1. In (ε, δ)-differential privacy, the bound is e^ε − 1 + a × δ, where a ≥ 1 represents the number of algorithms used to produce machine learning models. When applied to DL, the values of these parameters depend on the trained model. For example, our empirical tests show that when setting δ to 1e−6, ε ranges from 0.03 to 1 in LeNet-5, and from 10 to 1e3 in AlexNet. Parameters smaller than these values inject too much noise and hence considerably decrease model accuracy; parameters larger than these values have negligible impact on model accuracy because little noise is injected.
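As a quick numeric illustration of these bounds, the small helper below (our own sketch, not part of any DP library) evaluates e^ε − 1 + a × δ for the budgets quoted above:

```python
import math

def privacy_degradation_bound(epsilon, delta=0.0, a=1):
    # Theoretical upper bound of privacy degradation:
    # e^eps - 1 for pure eps-DP, and e^eps - 1 + a*delta for (eps, delta)-DP.
    return math.exp(epsilon) - 1 + a * delta

# LeNet-5's smallest reported budget, with and without the delta term.
print(privacy_degradation_bound(0.03))          # about 0.0305
print(privacy_degradation_bound(0.03, 1e-6))    # marginally larger
```

The tiny δ = 1e−6 term barely moves the bound at small ε, which matches the intuition that (ε, δ)-DP is only slightly weaker than pure ε-DP for small δ.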
Noise Mechanism. In ε-differential privacy [15] and (ε, δ)-differential privacy [2], the prevalent methods inject noise into sensitive data according to a predetermined distribution, such as the Laplace and Gaussian distributions. These methods, therefore, are termed the Laplace mechanism [28] and the Gaussian mechanism, which employ the L1-norm and L2-norm, respectively, to estimate the differences between gradient outcomes with and without noise perturbation.
Definition 3: (Laplace Mechanism). The mechanism controls the noise injection using the Laplace distribution and the L1-norm sensitivity:

M(D) = f(D) + Lap(m, b),  b = Δ₁/ε,   (3)

where m is the mean of the Laplace distribution, and b is the scale parameter calculated from the L1-sensitivity Δ₁ and the privacy budget ε. In Eq. (3), we use Lap(b) or Lap(0, b) to represent a random value satisfying the Laplace distribution p(x) = (1/2b) exp(−|x − m|/b).

Definition 4: (Gaussian Mechanism). The mechanism controls the noise injection using the Gaussian distribution and the L2-norm sensitivity:

M(D) = f(D) + N(m, σ²),   (4)

where m and σ² are the mean and variance of the normal distribution, and σ is calibrated from the L2-sensitivity Δ₂ and the privacy budget ε.

Definition 5: (Sensitivity). The sensitivity of function f over two adjacent datasets D and D′ is calculated as the maximal distance between the function's outputs f(D) and f(D′):

Δf = max |f(D) − f(D′)|,   (5)

where |f(D) − f(D′)| represents the norm distance. In the L1-norm (‖·‖₁) and the L2-norm (‖·‖₂), the sensitivity is calculated as Δ₁f = max ‖f(D) − f(D′)‖₁ and Δ₂f = max ‖f(D) − f(D′)‖₂, respectively. Sensitivity estimation is deemed the most computationally expensive component of differential privacy algorithms because it requires exhaustive computation over all possible pairs (D, D′). Several estimation schemes have been proposed in related work [31].

Table 1. Notations
B: the total number of data points in T, namely the batch size
g: a gradient that corresponds to a model parameter
n: the number of model parameters/gradients
Y: a vector (g₁, g₂, ..., g_n) of gradients in a DL model of n parameters
f(·): the function applied on a dataset for gradient computation
γ: ratio of input data reduction
ε: privacy budget
δ: the parameter in (ε, δ)-differential privacy
Noise(m, σ): a noise distribution with mean value m and standard deviation σ
σ_aggregate: the standard deviation of aggregated noise
Δ_p: L_p-sensitivity (p = 1 or 2)
C: clip bound
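The norm-based sensitivities of Definition 5 for one adjacent pair can be sketched as follows (an illustrative helper of our own; the true sensitivity maximizes over all adjacent dataset pairs):

```python
import numpy as np

def l1_sensitivity(f_D, f_Dp):
    # ||f(D) - f(D')||_1 for one adjacent pair (D, D'); the exhaustive
    # sensitivity takes the maximum over all such pairs.
    return np.abs(f_D - f_Dp).sum()

def l2_sensitivity(f_D, f_Dp):
    # ||f(D) - f(D')||_2 for one adjacent pair.
    return np.sqrt(((f_D - f_Dp) ** 2).sum())

# Hypothetical gradient outputs on two adjacent datasets.
g = np.array([0.2, -0.1, 0.4])
g_prime = np.array([0.1, -0.3, 0.4])
print(l1_sensitivity(g, g_prime))   # 0.1 + 0.2 + 0.0 = 0.3
print(l2_sensitivity(g, g_prime))   # sqrt(0.01 + 0.04) ~= 0.2236
```

Note that the L1 distance always dominates the L2 distance, which is why the Laplace mechanism (L1-calibrated) and the Gaussian mechanism (L2-calibrated) generally inject different noise magnitudes for the same gradient change.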

INFLUENTIAL FACTORS OF PRIVACY GUARANTEE IN DIFFERENTIALLY PRIVATE MODEL TRAINING
In this section, we first introduce the threat model within the context of distributed DL on edge nodes, and explain how differential privacy preserves privacy by injecting noises into the released gradient updates (Section 3.1). Within this context, we systematically analyze the impact of noises from the perspective of local edge nodes (Section 3.2) and the entire training system (Section 3.3). Using image benchmarks, we show the challenges of achieving high accuracy given the local privacy budget and uncover factors that can minimize the overall accuracy loss and adhere to the privacy guarantee.

Threat Model in Edge-Based DL
We assume that K participants (where K ≥ 2) jointly train a DL model in a federated learning setting, in which each participant generates gradients through interactions with her/his local edge node. In a typical model averaging setting [26], model updates correspond to a two-step aggregation at each iteration. First, every participant aggregates the gradients calculated using the local batch of data on the edge node. Second, a central server aggregates the gradients from all K participants. In the threat model, either the server or one of the participants can be the attacker that aims to infer information about other participants' training data during iterative model training. This inference is conducted by inspecting the gradient updates received from the server. For example, attackers can reconstruct the original images from gradient information [32]. Differential privacy addresses the above inference attacks by first applying a differentially private transformation to each participant's gradients and then transferring them to the central server [2], [3], [4]. Specifically, at each iteration, differential privacy adds noise to a participant's gradients according to the pre-specified privacy budget and other model training settings. All these variables may influence the amount of noise injected and thus affect the final model accuracy. In the following two sections, we explain the influential factors of noises at the edge node level (local noises) and at the master server level (that is, the aggregated noises from all edge nodes).

Influential Factors of Local Noises Per Edge Node
To study the impact of the local privacy budget, we first rewrite the noise obfuscation level as a function of the privacy budget (ε), the data sensitivity (Δ), and the parameters of the statistical distribution, as follows.
In the Laplace mechanism (Definition 3), the noise follows the Laplace distribution:

Noise ~ Lap(0, Δ₁/ε).   (6)

Similarly, the noise follows the Gaussian distribution in the Gaussian mechanism (Definition 4):

Noise ~ N(0, σ²), where σ is proportional to Δ₂/ε.   (7)

Gradient Clipping Techniques. Data sensitivity is defined as the maximum L1 or L2 norm of the difference between two adjacent gradient vectors. As sensitivity can have a wide range, prior art proposes to clip the gradient ranges such that the sensitivity is analytically bounded, which also eases its computational overhead. In traditional database systems, the minimum and maximum values of data points in D and D′ (Eq. (5)) are known before calculating sensitivity. However, when applying differential privacy to DL (that is, adding noise to gradients), it is difficult to know these values because the gradients vary across different models, datasets, and training iterations. To address this issue, clipping techniques are proposed to enforce a range on gradient values in sensitivity calculation. We now define two prevalent clipping techniques.
First, the clip-by-norm1 technique [15] is designed to calculate the L1-sensitivity following the Laplace mechanism. It utilizes a fixed, input-independent bound C to bound an original gradient g_original in sensitivity estimation:

g = max(−C, min(g_original, C)).   (8)

Each clipped gradient coordinate thus lies in [−C, C], so in the Laplace mechanism, parameter b can be set as b = 2C/ε. Second, the clip-by-norm2 technique [2] scales the original gradient vector g_original to gradient g as follows:

g = g_original / max(1, ‖g_original‖₂ / C).   (9)

Note that this technique also scales down the noise by the same factor max(1, ‖g_original‖₂/C). Let Y = (g₁, g₂, ..., g_n) and Y′ = (g′₁, g′₂, ..., g′_n) be two vectors of gradients, where each g_i (g′_i) is independent of the others for 1 ≤ i ≤ n. According to Eq. (5), the sensitivity in the clip-by-norm1 technique can be estimated as:

Δ₁ = max ‖Y − Y′‖₁ ≤ 2nC.   (10)

Similarly, in the clip-by-norm2 technique, the sensitivity is estimated as:

Δ₂ = max ‖Y − Y′‖₂ ≤ 2C.   (11)

Evaluation of Accuracy Degradation Under Different Clip Bounds and Privacy Budgets. We take LeNet-5 and the MNIST dataset [33] as an example to show the impact of clip bounds on the overall model accuracy. We tested both clip-by-norm1 and clip-by-norm2 using two privacy budgets (0.1 and 0.3) and four clip bounds (1, 0.1, 0.001, and 0.000001). The evaluation results in Fig. 2 show that: (i) the best clip bound varies across clipping techniques and budgets. For example, bound 0.000001 achieves the highest accuracy when the budget is 0.3 in the clip-by-norm1 technique (Fig. 2b); in contrast, bound 0.001 is the best one when the budget is 0.1 in the clip-by-norm2 technique (Fig. 2c); (ii) the setting of any bound considerably deteriorates the model accuracy compared to training without noise injection (Fig. 2d). The average accuracy loss is 53.81 percent when considering all bounds.
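The two clipping rules can be sketched as minimal implementations (our own illustration): per-coordinate clamping to [−C, C] for clip-by-norm1, and whole-vector L2 rescaling for clip-by-norm2.

```python
import numpy as np

def clip_by_norm1(g, C):
    # clip-by-norm1: clamp each gradient coordinate to [-C, C], so each
    # coordinate's contribution to the L1-sensitivity is at most 2C.
    return np.clip(g, -C, C)

def clip_by_norm2(g, C):
    # clip-by-norm2: rescale the whole vector so its L2 norm is at most C;
    # vectors already within the bound are left unchanged.
    return g / max(1.0, np.linalg.norm(g) / C)

g = np.array([3.0, -4.0])  # L2 norm 5
print(clip_by_norm1(g, 1.0))                      # [ 1. -1.]
print(np.linalg.norm(clip_by_norm2(g, 1.0)))      # 1.0
print(clip_by_norm2(np.array([0.3, 0.4]), 1.0))   # unchanged: [0.3 0.4]
```

The key practical difference is that clip-by-norm1 distorts the gradient direction (large coordinates saturate independently), whereas clip-by-norm2 preserves the direction and only shrinks the magnitude.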
Challenges. There is no "one-size-fits-all" best bound in existing clipping techniques. The challenge of determining the optimal bound is further exacerbated when considering differentiated noise injection mechanisms across different DL models, datasets, and training iterations.

Influential Factors of Global Aggregated Noises
In this section, we first analytically capture the aggregated accuracy loss from the perspective of aggregating local noises, and then experimentally demonstrate their impact on the overall accuracy. Note that in distributed model training, the aggregated privacy loss from multiple edge nodes differs from the sum of privacy losses in centralized training. In the latter, the privacy losses come from different stages of the training, and these stages share the same privacy barrier, which is the prerequisite for applying the composition theorem [12] in its privacy summarization. However, in the distributed training scenario, each edge node has its own privacy barrier and hence the composition theorem is not applicable. To the best of our knowledge, it is not clear how such aggregated obfuscation (noise) will impact the overall accuracy of distributed DL. We thus conduct a quantitative analysis that unveils the factors that influence the extent of aggregated noises and their resulting accuracy loss.
We let the total batch size per iteration across all nodes be B. Suppose the noises injected on each node follow the same distribution Noise(m, σ), and the batch of data points processed by the kth node is {t_{i_k}, t_{i_k+1}, ..., t_{i_k+b_k−1}}. That is, the batch size of this node is b_k and Σ_{k=1}^{K} b_k = B. We have: (1) the sum of gradients reported by the kth node is Σ_{i=i_k}^{i_k+b_k−1} g_i + Noise_k(m, σ); (2) the final gradient computed at the central server is:

g̅ = (1/B) (Σ_{i=1}^{B} g_i + Σ_{k=1}^{K} Noise_k(m, σ)).   (12)

According to the Central-Limit Theorem [34], we have:

Σ_{k=1}^{K} Noise_k(m, σ) ≈ N(K × m, K × σ²);   (13)

therefore, if K is large enough, the formula above can be written as:

g̅ ≈ (1/B) Σ_{i=1}^{B} g_i + Noise(K × m / B, σ_aggregate),   (14)

where the standard deviation σ_aggregate is:

σ_aggregate = (√K / B) × σ.   (15)

Hence, when fixing the standard deviation σ of the local noise distribution and the edge node number K, the aggregated noise is inversely proportional to the batch size B. Based on Eq. (15), and since the standard deviation of Lap(Δ₁/ε) is √2 × Δ₁/ε, the aggregated noise in the Laplace mechanism is:

σ_aggregate = √(2K) × Δ₁ / (ε × B).   (16)

Similarly, the aggregated noise in the Gaussian mechanism, whose local standard deviation σ is calibrated from Δ₂ and ε, is:

σ_aggregate = (√K / B) × σ.   (17)

Evaluation of Aggregated Noises. We now empirically demonstrate how different batch sizes affect the aggregated noise and the model accuracy. Taking clip bound 0.001 and budget 0.3 (per node and per iteration) as an example, we tested four batch sizes (480, 720, 2,400, and 3,600), and Fig. 3 shows the fluctuating model accuracies across the training iterations. We can see that the batch size considerably affects the model accuracy in each evaluation, in which different numbers of training samples are processed. In particular, batch size 3,600 has the highest accuracy and its accuracy loss is only 5.85 percent compared to training without gradient noise. This is because this batch size yields the lowest noise according to Eqs. (16) and (17). For batch sizes 480, 720, 2,400, and 3,600, the standard deviations of aggregated noises are 0.72, 0.48, 0.14, and 0.10 under the Laplace mechanism, and 14.72E−05, 9.81E−05, 2.94E−05, and 1.96E−05 under the Gaussian mechanism, respectively. Lower noise causes less turbulence to the model and thus brings higher model accuracy.
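The inverse relationship between batch size and aggregated noise can be checked with a small simulation (our own illustration, with a hypothetical K = 10 nodes and unit local Gaussian noise): the empirical standard deviation of the server-side noise term tracks √K × σ / B, so doubling B halves the aggregated noise.

```python
import numpy as np

rng = np.random.default_rng(0)
K, sigma, trials = 10, 1.0, 20_000

def aggregated_noise_std(B):
    # Server-side noise term of the averaged gradient:
    # (1/B) * sum of K independent local Gaussian noises.
    noise = rng.normal(0.0, sigma, size=(trials, K)).sum(axis=1) / B
    return noise.std()

for B in (480, 960):
    empirical = aggregated_noise_std(B)
    theoretical = np.sqrt(K) * sigma / B
    print(B, empirical, theoretical)  # empirical closely matches sqrt(K)*sigma/B
```

The same scaling explains the 0.72 → 0.10 drop observed when moving from batch size 480 to 3,600 in the evaluation above.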
Challenges. We note that in distributed DL model training, a large batch size significantly decreases the aggregated noise in differential privacy. However, real edge nodes such as mobile devices usually have limited resources to process large datasets. Hence how to maintain a large batch size while efficiently processing large datasets on resource-constrained edge nodes is another major challenge to be addressed.

ACCURATE DIFFERENTIALLY PRIVATE DL
In this section, we first describe the design overview of PrivateDL in Section 4.1, followed by explanations of its unique modules: (i) reducing the aggregated accuracy loss by virtual batch size amplification (Section 4.2) and (ii) sensitivity estimation to inject local gradient noise without clip bounds (Section 4.3). Finally, we introduce our implementation of PrivateDL on KubeEdge, Spark, and PyTorch (Section 4.4).

Overview of PrivateDL
For a DL model trained across K edge nodes, PrivateDL is designed to reduce the noise perturbation of differential privacy during the model training process. Specifically, at each iteration, it reduces aggregated noises by dynamically amplifying batch sizes according to the ratio of critical input data, and injects smaller local gradient noise on each edge node based on accurate sampling-based sensitivity estimation. The four steps on an edge node are shown in Fig. 4.
Virtual Batch Size Amplification via Critical Set. In this module, step 1 first amplifies the batch size on the node to reduce the aggregated noise.
Step 2 then identifies the critical data points in this batch of data, namely the data points relevant to model parameter updating. In the following gradient calculation, step 3 only uses these points to save computational cost, while achieving very similar results to the entire batch of data.
Sampling-Based Sensitivity Estimation.
Step 4 adds noises to these gradients according to the differential privacy mechanism. To reduce the noise perturbation, PrivateDL proposes a sampling method that accurately estimates sensitivity according to the latest model parameters and input data at each iteration.

Virtual Batch Size Amplification Via Critical Set
This module aims to increase the batch size such that the aggregated gradient noise and its impact on the overall model accuracy can be reduced (see Eq. (15)), without exceeding the resource constraints of a single edge node. The enabling feature here is to only use the critical data in the amplified dataset in gradient calculation, thus lowering the computational time and resources on each edge node. Given a DL model trained across K edge nodes, Algorithm 1 details the steps of this process. At each iteration, it first amplifies the batch size on node i according to the ratio γ_i of critical input data at the previous iteration (line 6). At the first iteration, this ratio is set to 1 (lines 3 to 5). At other iterations, the original batch size B_i is amplified as:

B_i^amplified = B_i / γ_i^(t−1),   (18)

where γ_i^(t−1) is node i's critical ratio at the previous iteration t − 1. This amplification is based on the observation that the ratio changes gradually across iterations (e.g., from 50 to 70 percent in AlexNet), and the ratios of two adjacent iterations are very similar [19]. Hence the algorithm first samples a set T_i of size B_i / γ_i^(t−1) (line 7), and then only uses the critical set, whose size is close to B_i, in model training (line 8). Finally, the central server updates the aggregated noises according to the amplified datasets on all K edge nodes (line 11). That is, the standard deviation of aggregated noise (Eq. (15)) is calculated as:

σ_aggregate = √K × σ / Σ_{i=1}^{K} (B_i / γ_i^(t−1)).   (19)

Algorithm 1. Virtual Batch Size Amplification Via Critical Set
Require: i: the index of an edge node (1 ≤ i ≤ K); T_i: the local batch of input data on the ith edge node; B_i: the original batch size on the ith node; C_i: the critical set of input data processed on the ith node; γ_i: the ratio of critical input data on the ith node.

In Algorithm 1, function CriticalSet() identifies and extracts the critical subset from the batch of input data points; it is developed based on SlimML [19] for three reasons. (1) Low overheads. SlimML generates aggregated data points to approximate the original input data points. That is, each aggregated data point corresponds to multiple original ones (e.g., 10 of them). The generation and processing of aggregated points, therefore, incurs small overheads (less than 5 percent of model training time). (2) High precision of identification. By calculating the sum of model parameters' gradients for each aggregated data point, SlimML identifies redundant data points that have negligible effects on model parameter updating, for example, a data point whose cumulative effect is smaller than 1 percent of the total effect. These points' corresponding original data points are removed, and the retained input data is critical. (3) Applicability to SGD-based model training. SlimML is developed for gradient descent based model training and hence is applicable to a wide range of DL applications. Note that importance sampling [35] and coreset [36] techniques also employ data points' loss values or gradients to select a subset of important points to accelerate training. However, this subset is kept unchanged and thus cannot adapt to the changing gradients during the iterative training process. In contrast, SlimML calculates data points' gradients at the beginning of each training iteration and dynamically decides the ratio of critical input data (namely the amplified batch size).
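As an illustration of the amplify-then-filter loop, here is a minimal Python sketch; amplified_batch, critical_set, and the random importance scores are our own hypothetical stand-ins (SlimML scores points by their aggregated-point gradients, not randomly).

```python
import numpy as np

rng = np.random.default_rng(0)

def amplified_batch(data, base_batch_size, prev_critical_ratio):
    # Amplify the batch by the previous iteration's critical ratio, so that
    # after filtering roughly base_batch_size critical points remain.
    amplified_size = int(base_batch_size / prev_critical_ratio)
    idx = rng.choice(len(data), size=min(amplified_size, len(data)), replace=False)
    return data[idx]

def critical_set(batch, scores, keep_ratio):
    # Hypothetical stand-in for CriticalSet(): keep the points whose
    # (approximate) gradient contribution scores are largest.
    k = max(1, int(len(batch) * keep_ratio))
    top = np.argsort(scores)[-k:]
    return batch[top]

data = rng.normal(size=(10_000, 8))
batch = amplified_batch(data, base_batch_size=480, prev_critical_ratio=0.6)
scores = rng.random(len(batch))  # placeholder importance scores
critical = critical_set(batch, scores, keep_ratio=0.6)
print(len(batch), len(critical))  # prints: 800 480
```

With a critical ratio of 0.6, a base batch of 480 is amplified to 800 samples, and the filtered critical set shrinks back to 480, so gradient computation cost stays near the original batch size while the aggregated noise is divided by the amplified size.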

Sampling-Based Sensitivity Estimation
Existing differential privacy techniques suffer from finding proper clipping bounds for gradients, because any bound may become oversized or undersized across the training iterations. To this end, PrivateDL designs a sampling-based module to calculate the data sensitivity of differential privacy without clipping bounds. This module is based on the observation of the 3-sigma rule [37]: for an arbitrary distribution of gradients, most values (e.g., at least 88.89 percent) range from (μ − 3σ) to (μ + 3σ). For example, in LeNet and AlexNet, when gradients follow a unimodal distribution, over 95 percent of the values fall into (μ − 3σ, μ + 3σ).
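This observation is easy to verify empirically; the sketch below (our own illustration, with a hypothetical normally distributed gradient population) measures the fraction of values inside the 3-sigma interval:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical gradients drawn from a unimodal (normal) distribution,
# as commonly observed in practice; the size is illustrative.
grads = rng.normal(loc=0.0, scale=0.01, size=100_000)

mu, sigma = grads.mean(), grads.std()
within = np.mean(np.abs(grads - mu) <= 3 * sigma)
print(within)  # about 0.997 for a normal distribution
```

For a normal distribution the fraction is about 99.7 percent, and even for arbitrary distributions Chebyshev-style bounds guarantee at least 88.89 percent, which is what makes μ + 3σ a usable stand-in for the maximum in sensitivity estimation.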
Definition 6: (L_p-Sensitivity Δ_p in the Sampling Method). Suppose Δ_p = max ‖Y − Y′‖_p, where Y and Y′ are two arbitrary gradient vectors computed by function f(·) and p ≥ 1. Let ‖Y − Y′‖_p be a random variable whose expectation is E(‖Y − Y′‖_p) and standard deviation is σ_p; sensitivity Δ_p is calculated as the maximal value of ‖Y − Y′‖_p according to the 3-sigma rule:

Δ_p = E(‖Y − Y′‖_p) + 3σ_p.   (20)

Let n be the number of gradients, where each gradient g satisfies the same (arbitrary) distribution with expectation μ and standard deviation σ. We now deduce the calculation of the L1- and L2-sensitivity.
Proposition 1: Assume that two gradients g and g′ are independent of each other; then E((g − g′)²) = 2σ².

Proof: E((g − g′)²) = E(g²) − 2E(g)E(g′) + E(g′²) = 2(σ² + μ²) − 2μ² = 2σ². □

Proof (L1-sensitivity): Let Y = (g₁, g₂, ..., g_n) and Y′ = (g′₁, g′₂, ..., g′_n) be two gradient vectors, where each g_i (g′_i) is independent of the others for 1 ≤ i ≤ n. The L1-norm utilizes the Manhattan distance between Y and Y′: MD = |g₁ − g′₁| + |g₂ − g′₂| + ... + |g_n − g′_n|. By the linearity of expectation, we have:

E(MD) = n × E(|g − g′|).   (22)

Given that Var(|g − g′|) = E((g − g′)²) − E²(|g − g′|) ≥ 0 and E((g − g′)²) = 2σ² (Proposition 1), we have:

E(|g − g′|) ≤ √2 × σ.   (23)

By substituting Eq. (23) into Eq. (22) and applying Eq. (20), the L1-sensitivity is estimated as:

Δ₁ ≤ √2 × n × σ + 3σ₁,   (24)

where σ₁ is the standard deviation of ‖Y − Y′‖₁. □

Proof (L2-sensitivity): Similar to the computation of the L1-sensitivity, the distance between Y and Y′ in the L2-norm is calculated as ‖Y − Y′‖₂ = √((g₁ − g′₁)² + (g₂ − g′₂)² + ... + (g_n − g′_n)²). The L2-sensitivity is calculated as:

Δ₂ = max ‖Y − Y′‖₂.   (25)

In Eq. (25), E(‖Y − Y′‖₂²) = E(Σ_{i=1}^{n} (g_i − g′_i)²) = Σ_{i=1}^{n} E((g_i − g′_i)²) = 2nσ² (Proposition 1). According to the arithmetic mean-geometric mean (AM-GM) inequality, E(‖Y − Y′‖₂) ≤ √(2n) × σ; applying Eq. (20), we have:

Δ₂ ≤ √(2n) × σ + 3σ₂,   (26)

where σ₂ is the standard deviation of ‖Y − Y′‖₂. □

At each training iteration, the module calculates the standard deviation σ of gradients on each edge node, and then estimates its L1-norm sensitivity (Eq. (24)) or L2-norm sensitivity (Eq. (26)). Hence the estimation error of sensitivity is determined by the estimation error of the standard deviation σ in this module.
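Dropping the 3-sigma deviation terms for simplicity, the leading estimates Δ₁ ≈ √2 × n × σ and Δ₂ ≈ √(2n) × σ can be computed from a sample of gradient coordinates. The sketch below is our own illustration with hypothetical sizes (a LeNet-5-scale gradient vector, a 10k sample):

```python
import numpy as np

def estimate_sensitivities(gradients, m, rng=np.random.default_rng(0)):
    # Estimate the gradient standard deviation sigma from m sampled
    # coordinates, then apply the leading terms of the bounds above:
    #   Delta_1 ~ sqrt(2) * n * sigma,  Delta_2 ~ sqrt(2n) * sigma.
    n = len(gradients)
    sample = rng.choice(gradients, size=min(m, n), replace=False)
    sigma = sample.std()
    return np.sqrt(2) * n * sigma, np.sqrt(2 * n) * sigma

# Hypothetical gradient vector: 60k parameters (LeNet-5 scale), sigma = 0.01.
grads = np.random.default_rng(1).normal(0.0, 0.01, size=60_000)
d1, d2 = estimate_sensitivities(grads, m=10_000)
print(d1, d2)  # roughly sqrt(2)*600 ~= 848 and sqrt(120000)*0.01 ~= 3.46
```

Note that no clipping bound appears anywhere: the estimate adapts automatically as the gradient distribution (and hence σ) shifts across training iterations.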
Let $\sigma$ be the actual standard deviation computed over the entire gradient space, and $S$ its estimate obtained from $m$ samples ($m$ is smaller than the number $n$ of model parameters); the estimation error $\epsilon$ is then defined through the ratio $S/\sigma$ (Eq. (27)). Given that DL models usually have thousands or even millions of parameters (i.e., gradients), the sampling size $m$ can be large. According to the Central Limit Theorem [34], the distribution of $S^2$ can therefore be approximated by a normal distribution. Given a confidence level $\alpha$, Eq. (32) gives the confidence interval of $S/\sigma$: when the sampling size is $m$, with probability $\alpha$ the ratio $S/\sigma$ is larger than the lower bound of the interval. According to Eq. (27), the estimation error $\epsilon$ is accordingly smaller than one minus this lower bound.

Example of Estimation Error. Based on the above analysis, we consider two factors when applying our sampling-based approach in DL training. (1) Confidence level $\alpha$: the training process usually consists of hundreds of thousands of iterations, so we need a sufficiently high confidence level $\alpha$ to guarantee a small estimation error. (2) Sampling size $m$: DL models usually have a large number of model parameters (gradients), e.g., 60k parameters in LeNet-5 and 61.5 million parameters in AlexNet, thus allowing a large sampling size $m$. In this example, we set $\alpha = 99.99999\%$ and $m = 10\mathrm{k}$; the estimation error is $\epsilon = 3.5\%$ when the gradients follow the normal distribution (that is, $\mu_4/\sigma^4 = 3$). Similarly, $\epsilon = 2.53\%$ when $\mu_4/\sigma^4 = 2$ and $\epsilon = 4.43\%$ when $\mu_4/\sigma^4 = 4$. These results show that our sampling-based approach can estimate the gradient variance with low error at a sufficiently high probability.
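To make the estimation procedure concrete, the following is a minimal Python/NumPy sketch of sampling-based sensitivity estimation. The closed forms $\Delta_1 \approx \sqrt{2}\,n\,\sigma$ and $\Delta_2 \approx \sqrt{2n}\,\sigma$ are our reading of the derivation above (the exact constants in the paper's Eqs. (24) and (26) may differ), and the function name is hypothetical:

```python
import numpy as np

def estimate_sensitivity(grads, m=10_000, norm="l2", rng=None):
    """Estimate the L1-/L2-sensitivity from m sampled gradient values.

    Assumed forms derived in the text: E(|g - g'|) <= sqrt(2)*sigma suggests
    Delta_1 ~ sqrt(2)*n*sigma, and E(||Y - Y'||_2^2) = 2*n*sigma^2 suggests
    Delta_2 ~ sqrt(2*n)*sigma.
    """
    rng = rng or np.random.default_rng(0)
    n = grads.size
    m = min(m, n)
    # Sample m of the n gradients; the std computation is O(m) per iteration.
    sample = rng.choice(grads.ravel(), size=m, replace=False)
    sigma = sample.std()
    if norm == "l1":
        return np.sqrt(2) * n * sigma       # assumed form of Eq. (24)
    return np.sqrt(2 * n) * sigma           # assumed form of Eq. (26)
```

With 10k samples out of LeNet-5-scale 60k gradients, the sampled standard deviation tracks the true one within a few percent, consistent with the error analysis above.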
In addition, our batch size amplification approach influences the estimation error of gradient variance (namely, data sensitivity) from another angle: in gradient calculation, large batch sizes decrease fluctuations in gradient values and thus further reduce estimation errors.
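As a toy illustration of this effect (not the paper's code), averaging per-example gradients over a larger batch shrinks the standard deviation of the batch gradient roughly by the square root of the batch-size ratio:

```python
import numpy as np

rng = np.random.default_rng(1)

def batch_grad_std(batch_size, trials=2000):
    # A mini-batch gradient is the mean over per-example gradients
    # (std 1.0 here); its std shrinks as 1/sqrt(batch_size).
    batch_grads = rng.normal(0.0, 1.0, size=(trials, batch_size)).mean(axis=1)
    return batch_grads.std()
```

Growing the batch 10x (e.g., 60 to 600) reduces the fluctuation by about √10, which is why amplified batches yield steadier variance estimates.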

Implementation
PrivateDL is implemented in Java and Scala and is designed for KubeEdge [20], an emerging edge computing platform extended from Kubernetes [21]. Kubernetes is a dominant engine for orchestrating containers and scheduling jobs in today's cloud datacenters [38]. Its loosely coupled architecture allows components to interact with each other only through the API-Server. The current version of Kubernetes only supports nodes with rich computational resources; KubeEdge is therefore designed to manage edge devices with limited resources.
PrivateDL currently targets DL jobs running in KubeEdge. Fig. 5 demonstrates how the edge and cloud parts of KubeEdge collaboratively complete a training iteration using a list of components. Specifically, on the edge side, KubeEdge uses the MetaManager component to hold a partition of model parameters and a local input dataset on each edge node. At the beginning of each iteration, the input dataset is partitioned across multiple containers (in a pod) for parallel execution. Each container executes a parallel DL task, which processes a subset $t_i$ of the input data to compute gradients of the model partition. After all tasks complete, the gradients are aggregated and injected with noises to preserve privacy; our sensitivity estimation and virtual batch size amplification modules are implemented in this step. Finally, the cloud (central server) part collects gradients from all $K$ edge nodes. During collection, the communication between edge nodes and the master node is implemented using the EdgeHub and CloudHub components. The cloud part then averages the gradients and uses the result to update the global model parameters.
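The iteration described above can be sketched as follows. This is a simplified, hypothetical model of the edge/cloud exchange (a scalar parameter, a toy quadratic loss, and Laplace noise), not the KubeEdge implementation:

```python
import numpy as np

def edge_step(w, local_data, sensitivity, eps, rng):
    # Each edge node computes a gradient on its local partition (toy
    # quadratic loss: minimize (w - x)^2) and injects Laplace noise
    # calibrated to sensitivity/eps before reporting to the server.
    grad = w - local_data.mean()
    return grad + rng.laplace(0.0, sensitivity / eps)

def cloud_step(w, noisy_grads, lr=0.1):
    # The central server averages the K noisy reports and updates
    # the global model parameter.
    return w - lr * np.mean(noisy_grads)
```

Running a few hundred rounds with four simulated nodes drives the global parameter toward the data mean despite the per-node noise.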
We choose to implement the above distributed training process in KubeEdge with three objectives. (1) Efficient synchronization between cloud and edge parts. During model training, KubeEdge supports efficient synchronization between the server and edge nodes using message packing and WebSocket [39]. Specifically, an edge node uses the EdgeHub component to report local gradients, while the server uses the CloudHub component to collect gradients and send back the latest global parameters. (2) Support for devices with limited resources. Inherited from Kubernetes, KubeEdge develops a lightweight version of the kubelet component to enable DL tasks to run on resource-constrained devices (e.g., NVIDIA Jetson). (3) Applicability to heterogeneous machines. KubeEdge employs the EventBus component to translate different machine/device protocols into a unified one (MQTT [40]); hence heterogeneous edge nodes can communicate with each other during DL model training.
Implemented Noise Mechanisms and Distributed DL Applications. To evaluate the effectiveness of our approach, we incorporated it into two representative noise mechanisms in differential privacy: $\epsilon$-differential privacy [15] (Laplace distribution) and $(\epsilon, \delta)$-differential privacy [2] (Gaussian distribution). The original versions of these mechanisms only support DL model training on a standalone node. To this end, we implemented distributed versions of these noise mechanisms based on Apache Spark [41] (a mainstream platform built upon the MapReduce paradigm that supports in-memory computing using the resilient distributed dataset (RDD) data structure [42]) and PyTorch [23] (an open-source DL library designed for GPUs). Moreover, we incorporated PrivateDL into the DL applications of Intel BigDL [22], a library that provides rich DL applications and supports their training with the parameter server paradigm and acceleration techniques [43].

EVALUATION
Based on the implementation of PrivateDL on KubeEdge, PyTorch, and BigDL, our evaluation on real DL applications and datasets has three objectives. First, we compare PrivateDL against existing gradient clipping techniques to evaluate the proposed sensitivity estimation and its impact on overall accuracy from the perspective of local noise reduction (Section 5.2). Second, we present comparison results across multiple edge nodes, further highlighting the effectiveness of PrivateDL in reducing aggregated noises (Section 5.3). Finally, we discuss the effectiveness and applicability of our approach (Section 5.4). Tables 2 and 3 summarize the improvement of PrivateDL over the state-of-the-art, i.e., how much accuracy improvement can be achieved under different privacy levels (privacy budgets). We can see that with our approach, the reduction of local noises yields larger accuracy improvements at higher privacy levels. Moreover, we also tested low privacy levels, where the influence of local noises is small; the results show that the reduction of aggregated noises still improves accuracy by an average of 120.37 percent.

Tested Workloads and Datasets. We test three DL models (LeNet-5, AlexNet, and ResNet-18) based on the implementation of PrivateDL on Intel BigDL and PyTorch. All models consist of multiple layers, in which the pooling/ReLU layers implement fixed functions and each convolutional/fully connected (FC) layer has multiple neurons ranging from 128 to 4,096. In the evaluation, LeNet-5 [44] (60k parameters), AlexNet [45] (61.5 million parameters), and ResNet-18 [46] are tested using the MNIST dataset [33], the Cifar-10 dataset [47], and the ImageNet32×32 dataset [48], respectively. Both the MNIST and Cifar-10 datasets have 60k data points, with a training/testing split of 0.8/0.2. ImageNet32×32 has 1.28 million data points (downsampled 32×32 images) from 1,000 classes and 50k testing points (50 per class).
Baseline Comparisons. To the best of our knowledge, our approach is the first technique that performs sampling-based sensitivity estimation. Hence, we compare against baselines with representative clip techniques developed for the $L_1$-norm and $L_2$-norm. Specifically, in the clip-by-norm1 technique, the sensitivity is estimated as $\Delta_1 = 2nC$ (Eq. (10)), where $n$ is the number of model parameters and $C$ is the clip bound. In the clip-by-norm2 technique, the sensitivity is estimated as $\Delta_2 = 2\sqrt{n}\,C$ (Eq. (11)). In the evaluation, we therefore set smaller bounds $C$ for models with a larger number $n$ of parameters (e.g., AlexNet), and also set larger budgets for the clip-by-norm1 technique, because it injects larger noises than the clip-by-norm2 technique for the same values of $n$ and $C$.
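The baseline sensitivities can be written as a one-line helper; this is a small sketch of Eqs. (10) and (11) as quoted above, with a hypothetical function name:

```python
import math

def clip_sensitivity(n, C, norm):
    """Baseline sensitivities from the text: clip-by-norm1 gives
    Delta_1 = 2*n*C (Eq. (10)); clip-by-norm2 gives Delta_2 = 2*sqrt(n)*C
    (Eq. (11))."""
    return 2 * n * C if norm == "norm1" else 2 * math.sqrt(n) * C
```

For LeNet-5 ($n$ = 60k, $C$ = 0.001), clip-by-norm1 yields a sensitivity of 120 versus roughly 0.49 for clip-by-norm2, a factor of $\sqrt{n}$, which is why the former is evaluated with larger budgets.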
Evaluation Metrics. For a fair comparison, we report the accuracy of different techniques under the same training time and differential privacy budget. The accuracy metric is the top-1 classification accuracy on the test set: the top-1 predicted class (the one with the highest probability) matches the target/actual class label.
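Top-1 accuracy as used here is straightforward to compute from the model's class-probability outputs; a minimal sketch:

```python
import numpy as np

def top1_accuracy(probs, labels):
    # A prediction counts as correct when the class with the highest
    # probability equals the target label.
    return float(np.mean(np.argmax(probs, axis=1) == np.asarray(labels)))
```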

Evaluation of Local Noise Reduction
The effectiveness of PrivateDL is considerably impacted by its ability to find an appropriate standard deviation of gradients in sensitivity estimation (Eqs. (22) and (25)). Given a privacy budget, this estimation decides the intensity of noise obfuscation and thus affects the model accuracy.
Evaluation Scenarios. We test different cases of differentially private training from three aspects. (1) Six model training settings, covering three DL models (LeNet-5, AlexNet, and ResNet-18) and two clip techniques (clip-by-norm1 and clip-by-norm2) that correspond to two noise mechanisms (Laplace and Gaussian). (2) Six privacy budgets. Each training setting has a distinct range of budgets, which represent its privacy levels in model training. Note that model training will not converge for budgets smaller than this range, and budgets larger than this range have a negligible impact on model accuracy. Specifically, the ranges of budgets are 0.1 to 1 for LeNet-5 (clip-by-norm1), 0.03 to 1 for LeNet-5 (clip-by-norm2), 5e5 to 1e7 for AlexNet (clip-by-norm1), 10 to 1e3 for AlexNet (clip-by-norm2), 1e4 to 1e6 for ResNet-18 (clip-by-norm1), and 1 to 50 for ResNet-18 (clip-by-norm2). (3) Three clip bounds. In existing clip techniques, the clip bound must be set manually according to the values of gradients in model training. Hence each DL model is tested using three bounds: 0.00001, 0.001, and 0.1 for LeNet-5; 0.01, 0.03, and 0.1 for AlexNet; and 0.1, 0.01, and 0.001 for ResNet-18. We set these values according to our empirical evaluations: for the upper bound, we decide its value in a sequence of steps. The first step starts from a large clip bound and the following steps gradually decrease its value (e.g., the value is reduced by 10 times between two consecutive steps). Given that large bounds incur high noises (Eqs. (10) and (11)) and thus may result in non-convergence of model training, each step uses five epochs to test the convergence tendency. The time complexity of each epoch is $O(n \cdot i \cdot d)$, where $n$, $i$, and $d$ are the number of model parameters, the number of training samples, and the dimensionality of each sample, respectively. We decide the lower bound in a similar way.
An undersized bound also leads to non-convergence in model training because it clips most of the useful gradients and only retains gradients of small values (that is, values with small impact on model parameter updates). Assuming five steps for either the upper or the lower bound, the evaluation results show that this preprocessing takes 897.5, 310,386.67, and 15,091.7 seconds for LeNet-5, AlexNet, and ResNet-18, respectively. In contrast, our method needs no preprocessing and calculates the standard deviation at each iteration of model training using $m$ samples of the gradients. The time complexity of this calculation is $O(m)$, and the evaluation results show that it completes within 2 milliseconds.
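The stepwise bound search described above can be sketched as a small loop; the `converges(bound)` callback (standing in for the five-epoch probe training) and the function name are hypothetical:

```python
def search_clip_bound(converges, start=1.0, steps=5, shrink=10.0):
    # Probe a geometric ladder of clip bounds, largest first; keep the
    # first bound under which a short (five-epoch) probe training shows
    # a convergent tendency. Each probe costs O(n*i*d) per epoch.
    bound = start
    for _ in range(steps):
        if converges(bound):
            return bound
        bound /= shrink
    return None  # no tested bound led to convergence
```

This makes the cost asymmetry explicit: each probe is a multi-epoch training run, whereas the sampling-based estimate is a single O(m) pass per iteration.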
Model Training Settings. The Adam method [49] is used in model training, and the learning rate is set to 0.001 for LeNet-5 and 0.0002 for AlexNet and ResNet-18. Other hyperparameters follow the default values of the BigDL library [22]. At each iteration, the batch size is 2,400 for LeNet-5, 120 for AlexNet, and 256 for ResNet-18. The trainings of LeNet-5, AlexNet, and ResNet-18 take 1,000, 8,340, and 12,000 iterations to converge, respectively. In comparative evaluations, we use the same hyperparameters and initial model parameters. During training, we test the model accuracy after each epoch and then resume training. In LeNet-5, AlexNet, and ResNet-18, the testing and retraining occur every 20, 6,000, and 175 seconds, respectively.

Evaluation Results. Fig. 6 compares the model accuracies of PrivateDL and the clip techniques under different experimental settings. We can see that PrivateDL consistently provides higher accuracies because it dynamically estimates the sensitivity according to the latest gradients at each iteration. In contrast, the sensitivity estimation is fixed in the clip techniques (Eqs. (10) and (11)), and there is no "one-size-fits-all" best bound for different training tasks. For example, bound 0.0001 achieves the highest accuracy among the three bounds when the budget is 0.03 in LeNet-5 (Fig. 6b1), but results in the lowest accuracy when the budget is 0.1 (Fig. 6b2). In addition, a smaller budget represents a higher level of privacy guarantee, but also incurs larger noises and lower model accuracies. The evaluation results show that PrivateDL suffers less from small budgets because it better estimates the sensitivity by dynamically adapting to the changing gradients during iterative training, and hence injects the lowest noises according to Eqs. (6) and (7). We also observe that model accuracy generally increases across the training iterations.
During the test phases, the stability of model accuracy is inversely related to the amount of injected noises. For example, Figs. 6a1, 6b1, and 6d1 show that the models with the highest noises have the largest fluctuations in test accuracies.
Results. When considering different DL training settings, PrivateDL increases model accuracy by an average of 411.65 percent compared to the existing clip techniques, and the accuracy increase reaches 565.87 percent at the highest privacy level (the smallest privacy budget).

Evaluation of Aggregated Noise Reduction
In distributed model training, aggregated noise has two major influential factors: the batch size B that represents the total number of input data points processed across all edge nodes, and the number K of nodes. In this section, we evaluate PrivateDL's effectiveness in reducing aggregated noises with consideration of these factors.
Evaluation of Batch Size B. The evaluation in the previous section shows that the accuracy improvements are small when the privacy budgets are high. Here, we extend this comparison to four edge nodes under low privacy levels (high privacy budgets) where local noises are small. Specifically, in LeNet-5, the clip bound is set to 0.001 and the privacy budget to 0.3; in AlexNet, the clip bound is set to 0.03 and the privacy budget is 1e7 for the clip-by-norm1 technique and 1e3 for the clip-by-norm2 technique; in ResNet-18, the clip bound is set to 0.001 and the privacy budget is 1e5 for the clip-by-norm1 technique and 50 for the clip-by-norm2 technique. For the aggregated noise, we evaluate two batch sizes for each model: 480 and 720 for LeNet-5; 60 and 120 for AlexNet; and 128 and 256 for ResNet-18. The evaluation of AlexNet uses the smallest batch sizes because it has the largest number of parameters and uses the most resources in gradient calculation. In redundant input data removal, each aggregated data point corresponds to 10 original data points, and an aggregated data point is removed if its summarized gradient is smaller than 1 percent of the total gradients.
Evaluation Results. Figs. 7, 8, and 9 show the comparison results for the three DL models. In all cases, a larger batch size B brings a higher model accuracy, which verifies the analysis of aggregated noises in Eqs. (16) and (17). Hence, for the same batch size, PrivateDL achieves a higher accuracy because it amplifies the batch size before model training. At the same time, our approach removes redundant data points at the gradient calculation stage to save computation costs. As a result, the gradient calculation time of our approach is very similar to that of the clip techniques, for two reasons: (1) the aggregated data points are generated only once before training, and the generation time is two to three orders of magnitude shorter than the model training time; at each iteration, the processing of aggregated data points takes less than 5 percent of the total model training time; (2) the amplification ratio is set according to the removal ratio of the previous iteration, such that the redundant input data removal efficiently decreases the required calculations over data points.
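The 1-percent removal rule described above can be sketched as a mask over aggregated data points; the function name and the use of per-point gradient magnitudes are our illustrative assumptions:

```python
import numpy as np

def redundant_mask(agg_grad_norms, threshold=0.01):
    # Keep an aggregated data point only if its summarized gradient is at
    # least `threshold` (1 percent by default) of the total gradient
    # magnitude; the rest are dropped to save computation.
    norms = np.asarray(agg_grad_norms, dtype=float)
    return norms >= threshold * norms.sum()
```

The fraction of points removed at one iteration can then drive the amplification ratio of the next, as described in reason (2) above.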
Evaluation of Node Number K. In this evaluation, we take AlexNet as an example and test its model training in a cluster of 100 edge nodes (each with two CPU cores and 10 GB of memory). For both the clip-by-norm1 and clip-by-norm2 techniques, we tested two privacy budgets and two batch sizes, as shown in Fig. 10. The evaluation results show that in this larger training deployment, the model accuracies of both PrivateDL and the clip techniques are considerably lower than those (Figs. 6c1 to 6d4) of the smaller deployment (four nodes) under the same or larger privacy budgets. With larger injected noises, the accuracies of the clip-by-norm1 and clip-by-norm2 techniques start decreasing in the intermediate stage of model training. PrivateDL suffers less from such noise perturbation; in particular, it provides sufficiently small noises in the $L_1$-norm (Figs. 10a and 10b) and avoids model accuracy degradation. Figs. 10a and 10b also show that when the batch size increases from 400 to 1,000, the accuracy improvement of PrivateDL is small. This is because, in addition to batch size, the accuracy improvement (namely the noise reduction) also depends on the privacy budget according to Eqs. (16) and (17). The budgets in the $L_1$-norm are 5e3 to 1e4 times larger than those of the $L_2$-norm; hence both the noises and the decreases in noise (when batch size increases) in the $L_1$-norm are small. We note that the accuracy improvement becomes even smaller as the batch size further increases, because the amount of noise reduction gradually diminishes with batch size. In Fig. 3a's example, the noise reduction is 0.24 when the batch size increases from 480 to 720, but only 0.04 when it increases from 2,400 to 3,600.
Discussion of Training Time. We can also observe that in Fig. 10's large cluster, the training time is much longer than in Fig. 6's small cluster. This is because when the node number is small (K = 4), each iteration's training time mainly consists of two parts: the gradient calculation time (determined by model size and training samples) and the noise injection time (taking 10 percent of the whole training time). When the node number increases to 100, the synchronization among these nodes becomes the bottleneck, because the standard Bulk Synchronous Parallel (BSP) model used here only allows training to move to the next iteration when all nodes complete their current one. This synchronization is time-consuming for large clusters. In the future, we plan to study asynchronous training techniques such as Asynchronous Parallel (ASP) [50] and Stale Synchronous Parallel (SSP) [51], which relax BSP by allowing training to proceed to the next iteration once all nodes' iterations are within a pre-defined limit of each other. For ASP and SSP, the calculation of aggregated noise (Eqs. (16) and (17)) needs to be revisited according to the K nodes' synchronization setting.
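The BSP/SSP distinction reduces to a simple admission condition per node; a minimal sketch (function and parameter names are our own):

```python
def can_start_next_iteration(node_iters, node_id, staleness=0):
    # BSP is the special case staleness = 0: a node may advance only when
    # it is not ahead of the slowest node. SSP relaxes this by tolerating
    # a gap of up to `staleness` iterations between fastest and slowest.
    return node_iters[node_id] - min(node_iters) <= staleness
```

Under BSP every straggler blocks the whole cluster, which is the bottleneck observed at K = 100; a positive staleness lets fast nodes continue while the noise aggregation analysis is adjusted accordingly.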
Results. By reducing aggregated noises, PrivateDL increases the model accuracy by an average of 131.88 percent in the evaluations of the LeNet-5, AlexNet, and ResNet-18 workloads.

Discussions
Differential privacy provides a trade-off between model accuracy and data privacy via noise injection. PrivateDL exploits this trade-off in distributed DL training, in which the intensity of noises is mainly influenced by the calculation of sensitivity (Eqs. (6) and (7)). In this section, we first discuss several practical factors that influence the sensitivity calculation. We then discuss the privacy preservation achieved by the proposed approach and its applicability to mobile environments.
Discussion of Clip Bounds and Batch Size. In existing clip techniques, data sensitivity is calculated using pre-defined clip bounds (Eqs. (10) and (11)). In contrast, at each iteration of model training, PrivateDL dynamically derives the "clip bound" from the standard deviation of gradients and directly uses the standard deviation to compute the sensitivity. Fig. 11a shows that the standard deviation gradually decreases across the training iterations, because the values of gradients become smaller as training approaches convergence. According to Lemmas 1 and 2, both the $L_1$-sensitivity and $L_2$-sensitivity are linearly proportional to the standard deviation; thus Fig. 11b shows that both types of sensitivity follow a trend similar to the standard deviation. Moreover, Fig. 11c displays the ratios of critical input data during the training process. According to Eq. (18), a smaller ratio brings a larger amplification in batch size, as shown in Fig. 11d, thus achieving larger reductions in aggregated noises.
Discussion of Sampling Interval. In PrivateDL, the sampling interval is a key parameter that determines the range of gradients used to calculate the sensitivity. Variations in the sampling interval therefore affect the noise injection and the model accuracy during training. In the previous scenarios, the sampling interval is set to $(\mu - 3\sigma, \mu + 3\sigma)$, which covers about 95 percent of the gradients. In this evaluation, we take LeNet-5 as an example and test five sampling intervals, ranging from $(\mu - 1\sigma, \mu + 1\sigma)$ to $(\mu - 10\sigma, \mu + 10\sigma)$, following the evaluation settings of the previous sections. The results in Fig. 12 show that smaller intervals bring slightly higher accuracies at early iterations of model training. This is because DL models usually have an even distribution of gradients at early training stages, as shown by Fig. 14a's gradient distribution at iteration 10, together with some gradient outliers. Hence a smaller sampling interval such as $(\mu - 1\sigma, \mu + 1\sigma)$ can avoid the disturbance of these outliers in the sensitivity calculation. Figs. 14b and 14c also show that at later iterations, most of the gradients fall within a small interval around the mean and there are few outliers. This observation explains the results in Fig. 12: different sampling intervals have small influences on model accuracy at later training stages, because all intervals cover most of the gradients.
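Restricting the sample to the interval $(\mu - k\sigma, \mu + k\sigma)$ amounts to a simple filter over the gradient values; a minimal sketch (the function name is ours):

```python
import numpy as np

def gradients_in_interval(grads, k=3.0):
    # Keep only the gradients inside (mu - k*sigma, mu + k*sigma) so that
    # outliers do not disturb the sensitivity calculation; k = 3 matches
    # the default interval used in the evaluations.
    mu, sigma = grads.mean(), grads.std()
    return grads[(grads > mu - k * sigma) & (grads < mu + k * sigma)]
```

With a tight interval (small k), a single extreme outlier is excluded, which is exactly the early-training behavior discussed above.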
Discussion of Integration With Clip Techniques. We integrate our approach with the two clip techniques for DL [2], [15]. At each iteration, we first define a bound on gradients and then only sample within the bound. In the evaluation, five bounds ranging from 0.0000001 to 0.1 are tested, and other settings follow the previous sections (the privacy budget is 0.3). Fig. 13 shows that the two smallest bounds (0.0000001 and 0.00001) result in the lowest accuracies. This is because the values of gradients range from -0.02 to 0.02 (as shown in Fig. 14), and these two bounds restrict the gradients to a much smaller range. In contrast, the largest bound (0.1) imposes no restriction on the gradients. The two medium bounds (0.001 and 0.01) achieve the highest accuracies, because they set appropriate ranges for gradients while avoiding possible outliers. We note that although setting proper bounds improves accuracy, it is difficult to know which bounds are appropriate in practice before training starts.
Discussion of Privacy Leakages in Inference Attacks. We compare the privacy preservation of PrivateDL and the clip techniques using a prevalent membership inference attack against DL models [25]. In the attack, the attack model (a binary classifier) judges whether a data point belongs to the training sample used by the target model. Privacy leakage is used as the evaluation metric; it denotes the difference between the true positive rate (TPR) and the false positive rate (FPR) of the attack model [24]. In the evaluation, we compare the privacy leakage of PrivateDL and the clip techniques, given that the target models trained with these techniques have the same model accuracies. Fig. 15 shows that PrivateDL always achieves lower privacy leakage than the clip techniques, because it maintains the same model accuracy using smaller privacy budgets (i.e., higher privacy levels). In addition, the privacy leakage in the $L_1$-norm is much larger than that of the $L_2$-norm, because the former mechanism injects much larger noises for the same privacy budget and thus requires much larger privacy budgets to reach the same model accuracy. On average, our approach reduces privacy leakage by 23.33 times under the same model accuracies. Fig. 16 further compares the theoretical upper bounds of the two techniques. In the evaluation, parameter a is set to 1 (one algorithm is used in model training) and the y-axis shows the logarithm of the upper bounds. We can see that under the same model accuracies, PrivateDL has tighter bounds on privacy degradation than the clip techniques, which is consistent with the privacy leakage results in Fig. 15.
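The leakage metric used above (TPR minus FPR of the attack model) can be computed directly from the attack's binary decisions; a minimal sketch:

```python
import numpy as np

def privacy_leakage(attack_preds, is_member):
    # Leakage = TPR - FPR of the membership-inference attack model:
    # 1.0 means a perfect attack, 0.0 means the attack is no better
    # than guessing.
    preds = np.asarray(attack_preds, dtype=bool)
    member = np.asarray(is_member, dtype=bool)
    tpr = preds[member].mean()    # members correctly flagged
    fpr = preds[~member].mean()   # non-members wrongly flagged
    return float(tpr - fpr)
```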
Discussion of Applicability to Mobile Scenarios. We apply PrivateDL to training MobileNet, a compact DL model designed for mobile and embedded vision systems [52], on the Jetson node. In this evaluation, we tested PrivateDL's effectiveness in reducing local noises using three privacy budgets and three batch sizes for both the clip-by-norm1 and clip-by-norm2 techniques. The results in Fig. 17 show that our approach consistently obtains higher model accuracies than the clip techniques under different evaluation settings, and it achieves larger accuracy improvements when the privacy budget becomes smaller. Overall, PrivateDL improves model accuracy by 116.97 percent under the same model training time.

RELATED WORK
Within the context of federated learning, numerous efforts have been devoted to enhancing privacy during the learning process, including secure multi-party computation (SMC) [53], homomorphic encryption [54], and differential privacy [10], [11]. Specifically, SMC enables multiple participants to collaboratively and securely compute a common function without accessing each other's raw datasets. This technique has been applied at both the training stage [53] and the inference stage [55] of DL. Homomorphic encryption aims to make the result computed on encrypted data match the one computed on raw data; for example, Cryptonets replaces the conventional activation function (e.g., ReLU) with the square function to implement this technique [54]. In contrast, differential privacy prevents privacy leakage during the model training process by adding noises to data or gradients [10], [11]. We now review its application in two major types of infrastructures.
Cloud-Based Machine Learning/DL Applications. Techniques in this category can be divided into interactive and non-interactive ones. The interactive technique is usually integrated with specific systems such as recommendation systems [56] and stream processing systems [57], or with specific training algorithms such as the ID3 decision tree [58]. In contrast, the non-interactive technique can generate a noise-perturbed dataset for different training algorithms [10]. For example, a black-box scheme has been developed to integrate any SGD-based training algorithm into big data systems such as Hadoop or Spark, with noises added only at the end of each iteration [59].

1. https://github.com/spring-epfl/mia
Edge-Based DL Applications. In this scenario, models are trained using data from multiple participants (edge nodes). In a federated learning setting, current differential privacy techniques add noises to gradients before reporting to a third-party curator [60]. Two techniques have been developed for DL applications. (1) $\epsilon$-differential privacy [15]: this technique adds noise according to the Laplace mechanism and often adopts the Manhattan distance ($L_1$-norm) to compute the sensitivity. A fixed interval is used in the computation: gradients whose values fall outside the interval are replaced with the corresponding bound value, while gradients inside the interval are kept unchanged. (2) $(\epsilon, \delta)$-differential privacy [2]: this technique clips each gradient in the $L_2$-norm using an interval bounded by $C$, then adds Gaussian noise to the clipped gradient. At each iteration, it scales down gradients whose moduli are greater than the bound to be of norm $C$. Moreover, an approach has been proposed to improve the noise injection mechanism of differential privacy [61]: motivated by the fact that the default mechanism injects equal noises into all training samples, it injects more noises into samples that are less relevant to model accuracy, thus improving the privacy level with less accuracy degradation [61]. Other techniques combine differential privacy with homomorphic encryption or blockchain. For example, $\epsilon$-differential privacy is combined with an adaptive homomorphic encryption technique to prevent data privacy leakage at the central server [17]; similarly, blockchain is combined to deal with threats from both a malicious server and poisoning participants [18].
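The two clipping styles just described differ in a simple but consequential way; the following sketch contrasts them (standard formulations of the described behavior, not the cited papers' code):

```python
import numpy as np

def clip_per_value(grads, C):
    # Laplace-mechanism style (L1 setting): values outside [-C, C] are
    # replaced by the bound; values inside are kept unchanged.
    return np.clip(grads, -C, C)

def clip_by_l2_norm(grads, C):
    # Gaussian-mechanism style (L2 setting): scale the whole vector down
    # so its L2-norm is at most C; vectors already within the bound are
    # untouched.
    norm = np.linalg.norm(grads)
    return grads if norm <= C else grads * (C / norm)
```

Per-value clipping distorts only the outliers, while norm clipping rescales every coordinate, which is part of why the two mechanisms behave differently under the same bound.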
When applied to DL applications, existing differential privacy techniques rely on pre-specified bounds to clip gradients in noise calibration. Due to the heterogeneous and dynamic nature of gradients, these techniques suffer from oversized or undersized bounds, which incur larger accuracy losses (due to larger noises) or worse training performance (due to lower convergence rates). In contrast, our approach needs no pre-defined bounds and can thus adapt to the changing gradients in various training scenarios.
Subsampling in Differential Privacy. Subsampling techniques amplify privacy by selecting a subset of the input data using random/Poisson sampling with or without replacement [62], [63], [64]. Our approach is orthogonal to these techniques: it first samples more data points at a training iteration because large batch sizes mitigate noises, and then decreases the training cost by using only the critical points.

CONCLUSION
In this paper, we empirically and analytically show the new challenges of local and global statistical noise calibration when applying differential privacy to distributed DL. We address the research question of how to improve overall DL model accuracy while adhering to local privacy constraints and factoring in the heterogeneity of training tasks and edge nodes. To this end, we designed PrivateDL, a novel learning framework that effectively calibrates local noise via sampling and minimizes the impact of global noise via batch size amplification. We implemented our approach in KubeEdge, and extensively evaluated and demonstrated its practical effectiveness using Intel BigDL workloads, e.g., an accuracy improvement of up to 5X compared to the state-of-the-art. Our future work includes implementing PrivateDL in TensorFlow [65], which provides convenient interfaces for our approach: in TensorFlow 2.2, eager mode supports injecting noises into gradients during model training, and tf.GradientTape and tf.keras.optimizers provide interfaces to calculate gradients and update model parameters.

Dong Li is working toward the master's degree at the School of Computer Science and Technology, Beijing Institute of Technology. His work focuses on the security and privacy of computer systems.
Junyan Ouyang is working toward the graduate degree at the School of Computer Science and Technology, Beijing Institute of Technology. His work focuses on the optimization of big data systems for machine learning and deep learning workloads.
Chi Harold Liu (Senior Member, IEEE) received the BEng degree from Tsinghua University, China, in 2006, and the PhD degree from Imperial College, U.K., in 2010. He is currently a full professor and vice dean at the School of Computer Science and Technology, Beijing Institute of Technology, China. He is also the director of the IBM Mainframe Excellence Center (Beijing) and the director of the IBM Big Data Technology Center. Before moving to academia, he joined IBM Research - China as a staff researcher and project manager, after working as a postdoctoral researcher at Deutsche Telekom Laboratories, Germany, and a visiting scholar at the IBM T. J. Watson Research Center, USA. His current research interests include big data analytics, mobile computing, and deep learning. He