Deep Learning with Differential Privacy
θ_{t+1} ← θ_t − η_t g̃_t
Output θ_T and compute the overall privacy cost (ε, δ) using a privacy accounting method.

Norm clipping: Proving the differential privacy guarantee of Algorithm 1 requires bounding the influence of each individual example on g̃_t. Since there is no a priori bound on the size of the gradients, we clip each gradient in ℓ2 norm; i.e., the gradient vector g is replaced by g / max(1, ‖g‖₂/C), for a clipping threshold C. This clipping ensures that if ‖g‖₂ ≤ C, then g is preserved, whereas if ‖g‖₂ > C, it gets scaled down to be of norm C. We remark that gradient clipping of this form is a popular ingredient of SGD for deep networks for non-privacy reasons, though in that setting it usually suffices to clip after averaging.

Per-layer and time-dependent parameters: The pseudocode for Algorithm 1 groups all the parameters into a single input θ of the loss function L(·). For multi-layer neural networks, we consider each layer separately, which allows setting different clipping thresholds C and noise scales σ for different layers.
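For illustration, the clipping rule and noise addition can be written in a few lines of NumPy; this is a minimal sketch of the sanitized gradient computation (per_example_grads, C, and sigma stand for the per-example gradients, the clipping threshold, and the noise scale), not the TensorFlow implementation described in Section 4.

import numpy as np

def dp_sgd_step(per_example_grads, C, sigma, rng=np.random.default_rng()):
    """One sanitized gradient estimate: clip each per-example gradient to
    L2 norm at most C, sum, add N(0, (sigma*C)^2 I) noise, and average."""
    clipped = [g / max(1.0, np.linalg.norm(g) / C) for g in per_example_grads]
    total = np.sum(clipped, axis=0)
    noisy = total + rng.normal(0.0, sigma * C, size=total.shape)
    return noisy / len(per_example_grads)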
Moments accountant: For a suitable choice of the noise scale σ, each step is (ε, δ)-differentially private with respect to the lot. Since the lot itself is a random sample from the database, the privacy amplification theorem [33, 8] implies that each step is (O(qε), qδ)-differentially private with respect to the full database, where q = L/N is the sampling ratio per lot and ε ≤ 1. The result in the literature that yields the best overall bound is the strong composition theorem [22].

However, the strong composition theorem can be loose, and does not take into account the particular noise distribution under consideration. In our work, we invent a stronger accounting method, which we call the moments accountant. It allows us to prove that Algorithm 1 is (O(qε√T), δ)-differentially private for appropriately chosen settings of the noise scale and the clipping threshold. Compared to what one would obtain by the strong composition theorem, our bound is tighter in two ways: it saves a √(log(1/δ)) factor in the ε part and a Tq factor in the δ part. Since we expect δ to be small and T ≫ 1/q (i.e., each example is examined multiple times), the saving provided by our bound is quite significant. This result is one of our main contributions.

Theorem 1. There exist constants c1 and c2 so that given the sampling probability q = L/N and the number of steps T, for any ε < c1 q²T, Algorithm 1 is (ε, δ)-differentially private for any δ > 0 if we choose

    σ ≥ c2 · q √(T log(1/δ)) / ε.

If we use the strong composition theorem, we will then need to choose σ = Ω(q √(T log(1/δ) log(T/δ)) / ε). Note that we save a factor of √(log(T/δ)) in our asymptotic bound. The moments accountant is beneficial in theory, as this result indicates, and also in practice, as can be seen from Figure 2 in Section 5. For example, with L = 0.01N, σ = 4, δ = 10⁻⁵, and T = 10000, we have ε ≈ 1.26 using the moments accountant. As a comparison, we would get a much larger ε ≈ 9.34 using the strong composition theorem.
3.2 The Moments Accountant: Details

The moments accountant keeps track of a bound on the moments of the privacy loss random variable (defined below in Eq. (1)). It generalizes the standard approach of tracking (ε, δ) and using the strong composition theorem. While such an improvement was known previously for composing Gaussian mechanisms, we show that it applies also for composing Gaussian mechanisms with random sampling and can provide a much tighter estimate of the privacy loss of Algorithm 1.

Privacy loss is a random variable dependent on the random noise added to the algorithm. That a mechanism M is (ε, δ)-differentially private is equivalent to a certain tail bound on M's privacy loss random variable. While the tail bound is very useful information on a distribution, composing directly from it can result in quite loose bounds. We instead compute the log moments of the privacy loss random variable, which compose linearly. We then use the moments bound, together with the standard Markov inequality, to obtain the tail bound, that is, the privacy loss in the sense of differential privacy.

More specifically, for neighboring databases d, d′ ∈ Dⁿ, a mechanism M, auxiliary input aux, and an outcome o ∈ R, define the privacy loss at o as

    c(o; M, aux, d, d′) ≜ log ( Pr[M(aux, d) = o] / Pr[M(aux, d′) = o] ).    (1)

A common design pattern, which we use extensively in the paper, is to update the state by sequentially applying differentially private mechanisms. This is an instance of adaptive composition, which we model by letting the auxiliary input of the kth mechanism Mk be the output of all the previous mechanisms.

For a given mechanism M, we define the λth moment αM(λ; aux, d, d′) as the log of the moment generating function evaluated at the value λ:

    αM(λ; aux, d, d′) ≜ log E_{o∼M(aux,d)}[exp(λ c(o; M, aux, d, d′))].    (2)

In order to prove privacy guarantees of a mechanism, it is useful to bound all possible αM(λ; aux, d, d′). We define

    αM(λ) ≜ max_{aux,d,d′} αM(λ; aux, d, d′),

where the maximum is taken over all possible aux and all the neighboring databases d, d′.

We state the properties of α that we use for the moments accountant.

Theorem 2. Let αM(λ) be defined as above. Then

1. [Composability] Suppose that a mechanism M consists of a sequence of adaptive mechanisms M1, . . . , Mk, where Mi : ∏_{j=1}^{i−1} Rj × D → Ri. Then, for any λ,

    αM(λ) ≤ Σ_{i=1}^{k} αMi(λ).

2. [Tail bound] For any ε > 0, the mechanism M is (ε, δ)-differentially private for

    δ = min_λ exp(αM(λ) − λε).

In particular, Theorem 2.1 holds when the mechanisms themselves are chosen based on the (public) output of the previous mechanisms.

By Theorem 2, it suffices to compute, or bound, αMi(λ) at each step and sum them to bound the moments of the mechanism overall. We can then use the tail bound to convert the moments bound to the (ε, δ)-differential privacy guarantee.

The main challenge that remains is to bound the value of αMt(λ) for each step. In the case of a Gaussian mechanism with random sampling, it suffices to estimate the following moments. Let μ0 denote the probability density function (pdf) of N(0, σ²), and μ1 denote the pdf of N(1, σ²). Let μ be the mixture of the two Gaussians μ = (1 − q)μ0 + qμ1. Then we need to compute α(λ) = log max(E1, E2), where

    E1 = E_{z∼μ0}[(μ0(z)/μ(z))^λ],    (3)
    E2 = E_{z∼μ}[(μ(z)/μ0(z))^λ].     (4)

In the implementation of the moments accountant, we carry out numerical integration to compute α(λ). In addition, we can show the asymptotic bound

    α(λ) ≤ q²λ(λ + 1)/((1 − q)σ²) + O(q³/σ³).

Together with Theorem 2, the above bound implies our main Theorem 1. The details can be found in the Appendix.
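As a concrete illustration of the accountant's core computation, the sketch below numerically evaluates E1 and E2 from Eqs. (3) and (4) for one step and then applies Theorems 2.1 and 2.2. It is a simplified stand-alone example for the moderate parameter ranges we use, not the released implementation.

import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

def step_log_moment(q, sigma, lam, half_width=30.0, points=400001):
    """alpha(lambda) = log max(E1, E2) for one sampled Gaussian step, Eqs. (3)-(4).
    The fixed grid is adequate for moderate q, sigma, and lambda <= 32."""
    z = np.linspace(-half_width * sigma, half_width * sigma, points)
    mu0 = stats.norm.pdf(z, loc=0.0, scale=sigma)
    mu1 = stats.norm.pdf(z, loc=1.0, scale=sigma)
    mix = (1.0 - q) * mu0 + q * mu1
    e1 = trapezoid(mu0 * (mu0 / mix) ** lam, z)   # Eq. (3)
    e2 = trapezoid(mix * (mix / mu0) ** lam, z)   # Eq. (4)
    return float(np.log(max(e1, e2)))

def moments_epsilon(q, sigma, steps, delta, max_lambda=32):
    """Compose T identical steps (Theorem 2.1) and convert the moments bound
    into the smallest epsilon for the target delta (Theorem 2.2)."""
    eps = []
    for lam in range(1, max_lambda + 1):
        alpha = steps * step_log_moment(q, sigma, lam)
        eps.append((alpha + np.log(1.0 / delta)) / lam)
    return min(eps)

With q = 0.01, σ = 4, δ = 10⁻⁵, and T = 10000, this computation should come out close to the ε ≈ 1.26 quoted above for the moments accountant, up to discretization error.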
3.3 Hyperparameter Tuning

We identify characteristics of models relevant for privacy and, specifically, hyperparameters that we can tune in order to balance privacy, accuracy, and performance. In particular, through experiments, we observe that model accuracy is more sensitive to training parameters such as batch size and noise level than to the structure of a neural network.

If we try several settings for the hyperparameters, we can trivially add up the privacy costs of all the settings, possibly via the moments accountant. However, since we care only about the setting that gives us the most accurate model, we can do better, such as applying a version of a result from Gupta et al. [27], restated as Theorem D.1 in the Appendix.

We can use insights from theory to reduce the number of hyperparameter settings that need to be tried. While differentially private optimization of convex objective functions is best achieved using batch sizes as small as 1, non-convex learning, which is inherently less stable, benefits from aggregation into larger batches. At the same time, Theorem 1 suggests that making batches too large increases the privacy cost, and a reasonable tradeoff is to take the number of batches per epoch to be of the same order as the desired number of epochs. The learning rate in non-private training is commonly adjusted downwards carefully as the model converges to a local optimum. In contrast, we never need to decrease the learning rate to a very small value, because differentially private training never reaches a regime where it would be justified. On the other hand, in our experiments, we do find that there is a small benefit to starting with a relatively large learning rate, then linearly decaying it to a smaller value in a few epochs, and keeping it constant afterwards.
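The two rules of thumb above are easy to encode; the sketch below is only illustrative, with the learning-rate defaults taken from the MNIST settings in Section 5.2.

def suggested_lot_size(num_examples, target_epochs):
    # Batches per epoch of the same order as the number of epochs: L ~ N / E.
    return max(1, num_examples // target_epochs)

def learning_rate(epoch, lr_start=0.1, lr_final=0.052, decay_epochs=10):
    # Start relatively high, decay linearly for a few epochs, then keep constant.
    if epoch >= decay_epochs:
        return lr_final
    return lr_start + (epoch / decay_epochs) * (lr_final - lr_start)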
4. IMPLEMENTATION

We have implemented the differentially private SGD algorithms in TensorFlow. The source code is available under an Apache 2.0 license from github.com/tensorflow/models.

For privacy protection, we need to "sanitize" the gradient before using it to update the parameters. In addition, we need to keep track of the "privacy spending" based on how the sanitization is done. Hence our implementation mainly consists of two components: sanitizer, which preprocesses the gradient to protect privacy, and privacy_accountant, which keeps track of the privacy spending over the course of training.

Figure 1 contains the TensorFlow code snippet (in Python) of DPSGD_Optimizer, which minimizes a loss function using a differentially private SGD, and DPTrain, which iteratively invokes DPSGD_Optimizer using a privacy accountant to bound the total privacy loss.

class DPSGD_Optimizer():
  def __init__(self, accountant, sanitizer):
    self._accountant = accountant
    self._sanitizer = sanitizer

  def Minimize(self, loss, params,
               batch_size, noise_options):
    # Accumulate privacy spending before computing
    # and using the gradients.
    priv_accum_op = self._accountant.AccumulatePrivacySpending(
        batch_size, noise_options)
    with tf.control_dependencies(priv_accum_op):
      # Compute per example gradients
      px_grads = per_example_gradients(loss, params)
      # Sanitize gradients
      sanitized_grads = self._sanitizer.Sanitize(
          px_grads, noise_options)
      # Take a gradient descent step
      return apply_gradients(params, sanitized_grads)

def DPTrain(loss, params, batch_size, noise_options):
  accountant = PrivacyAccountant()
  sanitizer = Sanitizer()
  dp_opt = DPSGD_Optimizer(accountant, sanitizer)
  sgd_op = dp_opt.Minimize(
      loss, params, batch_size, noise_options)
  eps, delta = (0, 0)
  # Carry out the training as long as the privacy
  # is within the pre-set limit.
  while within_limit(eps, delta):
    sgd_op.run()
    eps, delta = accountant.GetSpentPrivacy()

Figure 1: Code snippet of DPSGD_Optimizer and DPTrain.

In many cases, the neural network model may benefit from the processing of the input by projecting it on the principal directions (PCA) or by feeding it through a convolutional layer. We implement differentially private PCA and apply pre-trained convolutional layers (learned on public data).
Sanitizer. In order to achieve privacy protection, the sanitizer needs to perform two operations: (1) limit the sensitivity of each individual example by clipping the norm of the gradient for each example; and (2) add noise to the gradient of a batch before updating the network parameters.

In TensorFlow, the gradient computation is batched for performance reasons, yielding g_B = (1/|B|) Σ_{x∈B} ∇_θ L(θ, x) for a batch B of training examples. To limit the sensitivity of updates, we need to access each individual ∇_θ L(θ, x). To this end, we implemented the per_example_gradient operator in TensorFlow, as described by Goodfellow [25]. This operator can compute a batch of individual ∇_θ L(θ, x). With this implementation there is only a modest slowdown in training, even for larger batch sizes. Our current implementation supports batched computation for the loss function L, where each x_i is singly connected to L, allowing us to handle most hidden layers but not, for example, convolutional layers. Once we have access to the per-example gradients, it is easy to use TensorFlow operators to clip their norms and to add noise.
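The per_example_gradient operator itself is not shown here; as a rough modern stand-in (our assumption, using today's TensorFlow 2 API rather than the original implementation), per-example gradients can be obtained from the Jacobian of the vector of per-example losses.

import tensorflow as tf

def per_example_gradients(model, loss_fn, x_batch, y_batch):
    # loss_fn must return one loss value per example (no reduction), so the
    # Jacobian below has shape [batch, *variable.shape] for each variable.
    with tf.GradientTape() as tape:
        per_example_loss = loss_fn(y_batch, model(x_batch, training=True))
    return tape.jacobian(per_example_loss, model.trainable_variables)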
Privacy accountant. The main component in our implementation is PrivacyAccountant, which keeps track of privacy spending over the course of training. As discussed in Section 3, we implemented the moments accountant that additively accumulates the log of the moments of the privacy loss at each step. Depending on the noise distribution, one can compute α(λ) by either applying an asymptotic bound, evaluating a closed-form expression, or applying numerical integration. The first option would recover the generic advanced composition theorem, and the latter two give a more accurate accounting of the privacy loss.

For the Gaussian mechanism we use, α(λ) is defined according to Eqs. (3) and (4). In our implementation, we carry out numerical integration to compute both E1 and E2 in those equations. Also we compute α(λ) for a range of λ's so we can compute the best possible (ε, δ) values using Theorem 2.2. We find that for the parameters of interest to us, it suffices to compute α(λ) for λ ≤ 32.

At any point during training, one can query the privacy loss in the more interpretable notion of (ε, δ)-privacy using Theorem 2.2. Rogers et al. [47] point out risks associated with adaptive choice of privacy parameters. We avoid their attacks and negative results by fixing the number of iterations and privacy parameters ahead of time. More general implementations of a privacy accountant must correctly distinguish between two modes of operation—as a privacy odometer or a privacy filter (see [47] for more details).

Differentially private PCA. Principal component analysis (PCA) is a useful method for capturing the main features of the input data. We implement the differentially private PCA algorithm as described in [23]. More specifically, we take a random sample of the training examples, treat them as vectors, and normalize each vector to unit ℓ2 norm to form the matrix A, where each vector is a row in the matrix. We then add Gaussian noise to the covariance matrix AᵀA and compute the principal directions of the noisy covariance matrix. Then for each input example we apply the projection to these principal directions before feeding it into the neural network.

We incur a privacy cost due to running a PCA. However, we find it useful both for improving the model quality and for reducing the training time, as suggested by our experiments on the MNIST data. See Section 5 for details.
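For illustration, the noisy-covariance computation can be sketched as follows; this is a simplified rendering of the idea in [23], and the calibration of the PCA noise scale sigma_p to the privacy budget is omitted.

import numpy as np

def dp_pca_directions(examples, k, sigma_p, rng=np.random.default_rng()):
    """Return k principal directions computed from a noisy covariance matrix."""
    # Normalize each example to unit L2 norm so its contribution is bounded.
    A = np.stack([x / np.linalg.norm(x) for x in examples])
    cov = A.T @ A
    noise = rng.normal(0.0, sigma_p, size=cov.shape)
    noisy_cov = cov + (noise + noise.T) / 2.0        # keep the matrix symmetric
    eigvals, eigvecs = np.linalg.eigh(noisy_cov)
    return eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k directions

# Each input is then projected before entering the network, e.g.:
# projected = x @ dp_pca_directions(sample, k=60, sigma_p=7.0)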
Convolutional layers. Convolutional layers are useful for deep neural networks. However, an efficient per-example gradient computation for convolutional layers remains a challenge within the TensorFlow framework, which motivates creating a separate workflow. For example, some recent work argues that even random convolutions often suffice [46, 13, 49, 55, 14].

Alternatively, we explore the idea of learning convolutional layers on public data, following Jarrett et al. [30]. Such convolutional layers can be based on GoogLeNet or AlexNet features [54, 35] for image models or on pretrained word2vec or GloVe embeddings in language models [41, 44].
5. EXPERIMENTAL RESULTS

This section reports on our evaluation of the moments accountant, and results on two popular image datasets: MNIST and CIFAR-10.

5.1 Applying the Moments Accountant

As shown by Theorem 1, the moments accountant provides a tighter bound on the privacy loss compared to the generic strong composition theorem. Here we compare them using some concrete values. The overall privacy loss (ε, δ) can be computed from the noise level σ, the sampling ratio of each lot q = L/N (so each epoch consists of 1/q batches), and the number of epochs E (so the number of steps is T = E/q). We fix the target δ = 10⁻⁵, the value used for our MNIST and CIFAR experiments.

In our experiment, we set q = 0.01, σ = 4, and δ = 10⁻⁵, and compute the value of ε as a function of the training epoch E. Figure 2 shows two curves corresponding to, respectively, using the strong composition theorem and the moments accountant. We can see that we get a much tighter estimation of the privacy loss by using the moments accountant. For example, when E = 100, the values are 9.34 and 1.26 respectively, and for E = 400, the values are 24.22 and 2.55 respectively. That is, using the moments bound, we achieve (2.55, 10⁻⁵)-differential privacy, whereas previous techniques only obtain the significantly worse guarantee of (24.22, 10⁻⁵).

Figure 2: The ε value as a function of epoch E for q = 0.01, σ = 4, δ = 10⁻⁵, using the strong composition theorem and the moments accountant, respectively.
5.2 MNIST

We conduct experiments on the standard MNIST dataset for handwritten digit recognition consisting of 60,000 training examples and 10,000 testing examples [36]. Each example is a 28 × 28 gray-level image. We use a simple feed-forward neural network with ReLU units and a softmax over 10 classes (corresponding to the 10 digits) with cross-entropy loss and an optional PCA input layer.

Baseline model.
Our baseline model uses a 60-dimensional PCA projection layer and a single hidden layer with 1,000 hidden units. Using a lot size of 600, we can reach an accuracy of 98.30% in about 100 epochs. This result is consistent with what can be achieved with a vanilla neural network [36].

Differentially private model.
For the differentially private version, we experiment with the same architecture with a 60-dimensional PCA projection layer, a single 1,000-unit ReLU hidden layer, and a lot size of 600. To limit sensitivity, we clip the gradient norm of each layer at 4. We report results for three choices of the noise scale, which we call small (σ = 2, σp = 4), medium (σ = 4, σp = 7), and large (σ = 8, σp = 16). Here σ represents the noise level for training the neural network, and σp the noise level for the PCA projection. The learning rate is set at 0.1 initially and linearly decreased to 0.052 over 10 epochs and then fixed to 0.052 thereafter. We have also experimented with multi-hidden-layer networks. For MNIST, we found that one hidden layer combined with PCA works better than a two-layer network.
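For concreteness, the private MNIST architecture can be written down as follows. This is a hedged sketch in the modern tf.keras API (our assumption, not the original TensorFlow code); the DP-SGD training loop with clipping, noise, and the accountant is not shown.

import tensorflow as tf

def mnist_private_model(pca_directions):
    # pca_directions: fixed 784 x 60 projection from differentially private PCA.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(60, use_bias=False, trainable=False,
                              input_shape=(784,), name="pca"),
        tf.keras.layers.Dense(1000, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.get_layer("pca").set_weights([pca_directions])
    return model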
Figure 3 shows the results for different noise levels. In each plot, we show the evolution of the training and testing accuracy as a function of the number of epochs as well as the corresponding δ value, keeping ε fixed. We achieve 90%, 95%, and 97% test set accuracy for (0.5, 10⁻⁵), (2, 10⁻⁵), and (8, 10⁻⁵)-differential privacy respectively.

One attractive consequence of applying differentially private SGD is the small difference between the model's accuracy on the training and the test sets, which is consistent with the theoretical argument that differentially private training generalizes well [6]. In contrast, the gap between training and testing accuracy in non-private training, i.e., evidence of overfitting, increases with the number of epochs.

By using the moments accountant, we can obtain a δ value for any given ε. We record the accuracy for different (ε, δ) pairs in Figure 4. In the figure, each curve corresponds to the best accuracy achieved for a fixed δ, as it varies between 10⁻⁵ and 10⁻². For example, we can achieve 90% accuracy for ε = 0.25 and δ = 0.01. As can be observed from the figure, for a fixed δ, varying the value of ε can have a large impact on accuracy, but for any fixed ε, there is less difference with different δ values.
Effect of the parameters.
Classification accuracy is determined by multiple factors.

Figure 3: Results on the accuracy for different noise levels on the MNIST dataset; panels: (1) large noise, (2) medium noise, (3) small noise. In all the experiments, the network uses a 60-dimensional PCA projection, 1,000 hidden units, and is trained using lot size 600 and clipping threshold 4. The noise levels (σ, σp) for training the neural network and for PCA projection are set at (8, 16), (4, 7), and (2, 4), respectively, for the three experiments.

Figure 5: MNIST accuracy when one parameter varies and the others are fixed at reference values; panels: (4) variable learning rate, (5) variable gradient clipping norm, (6) variable noise level.
Clipping bound. If the clipping threshold C is set too small, the clipped gradients may point in a very different direction from the true gradient. On the other hand, increasing the norm bound C forces us to add more noise to the gradients (and hence the parameters), since we add noise based on σC. In practice, a good way to choose a value for C is by taking the median of the norms of the unclipped gradients over the course of training.

Noise level. By adding more noise, the per-step privacy loss is proportionally smaller, so we can run more epochs within a given cumulative privacy budget. In Figure 5(5), the x-axis is the noise level σ. The choice of this value has a large impact on accuracy.

From the experiments, we observe the following.

1. The PCA projection improves both model accuracy and training performance. Accuracy is quite stable over a large range of choices for the projection dimensions and the noise level used in the PCA stage.

2. The accuracy is fairly stable over the network size. When we can only run a smaller number of epochs, it is more beneficial to use a larger network.

3. The training parameters, especially the lot size and the noise scale σ, have a large impact on the model accuracy. They both determine the "noise-to-signal" ratio of the sanitized gradients as well as the number of epochs we are able to go through the data before reaching the privacy limit.
Our framework allows for adaptive control of the training parameters, such as the lot size, the gradient norm bound C, and the noise level σ. Our initial experiments with decreasing noise as training progresses did not show a significant improvement, but it is interesting to consider more sophisticated schemes for adaptively choosing these parameters.

5.3 CIFAR

We also conduct experiments on the CIFAR-10 dataset, which consists of color images classified into 10 classes such as ships, cats, and dogs, and partitioned into 50,000 training examples and 10,000 test examples [1]. Each example is a 32 × 32 image with three channels (RGB). For this learning task, nearly all successful networks use convolutional layers. The CIFAR-100 dataset has similar parameters, except that images are classified into 100 classes; the examples and the image classes are different from those of CIFAR-10.

We use the network architecture from the TensorFlow convolutional neural networks tutorial [2]. Each 32 × 32 image is first cropped to a 24 × 24 one by taking the center patch. The network architecture consists of two convolutional layers followed by two fully connected layers. The convolutional layers use 5 × 5 convolutions with stride 1, followed by a ReLU and 2 × 2 max pools, with 64 channels each. Thus the first convolution outputs a 12 × 12 × 64 tensor for each image, and the second outputs a 6 × 6 × 64 tensor. The latter is flattened to a vector that gets fed into a fully connected layer with 384 units, and another one of the same size.

This architecture, non-privately, can get to about 86% accuracy in 500 epochs. Its simplicity makes it an appealing choice for our work. We should note however that by using deeper networks with different non-linearities and other advanced techniques, one can obtain significantly better accuracy, with the state-of-the-art being about 96.5% [26].
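A hedged tf.keras sketch of this architecture is shown below; the modern Keras API and the exact padding and pooling choices are our assumptions, picked so that the layer output sizes match the 12 × 12 × 64 and 6 × 6 × 64 tensors described above.

import tensorflow as tf

def cifar_model(num_classes=10):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 5, strides=1, padding="same", activation="relu",
                               input_shape=(24, 24, 3)),   # center-cropped input
        tf.keras.layers.MaxPooling2D(2),                    # 24x24x64 -> 12x12x64
        tf.keras.layers.Conv2D(64, 5, strides=1, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),                    # 12x12x64 -> 6x6x64
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(384, activation="relu"),
        tf.keras.layers.Dense(384, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])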
As is standard for such image datasets, we use data augmentation during training. For each training image, we generate a new distorted image by randomly picking a 24 × 24 patch from the image, randomly flipping the image along the left-right direction, and randomly distorting the brightness and the contrast of the image. In each epoch, these distortions are done independently. We refer the reader to the TensorFlow tutorial [2] for additional details.

Figure 6: Results on accuracy for different noise levels on CIFAR-10; panels: (1) ε = 2, (2) ε = 4, (3) ε = 8. With δ set to 10⁻⁵, we achieve accuracy 67%, 70%, and 73%, with ε being 2, 4, and 8, respectively. The first graph uses a lot size of 2,000, (2) and (3) use a lot size of 4,000. In all cases, σ is set to 6, and clipping is set to 3.
As the convolutional layers have shared parameters, computing per-example gradients has a larger computational overhead. Previous work has shown that convolutional layers are often transferable: parameters learned from one dataset can be used on another one without retraining [30]. We treat the CIFAR-100 dataset as a public dataset and use it to train a network with the same architecture. We use the convolutions learned from training this dataset. Retraining only the fully connected layers with this architecture for about 250 epochs with a batch size of 120 gives us approximately 80% accuracy, which is our non-private baseline.
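In modern Keras terms, reusing the public convolutional features amounts to freezing those layers; the sketch below is our own illustration of that step (the differentially private version described next also initializes the fully connected layers from the pre-trained network).

import tensorflow as tf

def prepare_for_retraining(pretrained_model):
    # Keep the convolutional features learned on the public CIFAR-100 data
    # fixed, and leave only the fully connected and softmax layers trainable.
    for layer in pretrained_model.layers:
        layer.trainable = isinstance(layer, tf.keras.layers.Dense)
    return pretrained_model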
Differentially private version.
For the differentially private version, we use the same architecture. As discussed above, we use pre-trained convolutional layers. The fully connected layers are initialized from the pre-trained network as well. We train the softmax layer, and either the top or both fully connected layers. Based on looking at gradient norms, the softmax layer gradients are roughly twice as large as those of the other two layers, and we keep this ratio when we try clipping at a few different values between 3 and 10. The lot size is an additional knob that we tune: we tried 600, 2,000, and 4,000. With these settings, the per-epoch training time increases from approximately 40 seconds to 180 seconds.

In Figure 6, we show the evolution of the accuracy and the privacy cost, as a function of the number of epochs, for a few different parameter settings.

The various parameters influence the accuracy one gets, in ways not too different from those in the MNIST experiments. A lot size of 600 leads to poor results on this dataset and we need to increase it to 2,000 or more for the results reported in Figure 6.

Compared to the MNIST dataset, where the difference in accuracy between a non-private baseline and a private model is about 1.3%, the corresponding drop in accuracy in our CIFAR-10 experiment is much larger (about 7%). We leave closing this gap as an interesting test for future research in differentially private machine learning.
6. RELATED WORK

The problem of privacy-preserving data mining, or machine learning, has been a focus of active work in several research communities since the late 90s [5, 37]. The existing literature can be broadly classified along several axes: the class of models, the learning algorithm, and the privacy guarantees.

Privacy guarantees. Early works on privacy-preserving learning were done in the framework of secure function evaluation (SFE) and secure multi-party computation (MPC), where the input is split between two or more parties, and the focus is on minimizing information leaked during the joint computation of some agreed-to functionality. In contrast, we assume that data is held centrally, and we are concerned with leakage from the functionality's output (i.e., the model).

Another approach, k-anonymity and closely related notions [53], seeks to offer a degree of protection to underlying data by generalizing and suppressing certain identifying attributes. The approach has strong theoretical and empirical limitations [4, 9] that make it all but inapplicable to de-anonymization of high-dimensional, diverse input datasets. Rather than pursue input sanitization, we keep the underlying raw records intact and perturb derived data instead.

The theory of differential privacy, which provides the analytical framework for our work, has been applied to a large collection of machine learning tasks that differed from ours either in the training mechanism or in the target model. The moments accountant is closely related to the notion of Rényi differential privacy [42], which proposes (scaled) α(λ) as a means of quantifying privacy guarantees. In a concurrent and independent work, Bun and Steinke [10] introduce a relaxation of differential privacy (generalizing the work of Dwork and Rothblum [21]) defined via a linear upper bound on α(λ). Taken together, these works demonstrate that the moments accountant is a useful technique for theoretical and empirical analyses of complex privacy-preserving algorithms.

Learning algorithm. A common target for learning with privacy is a class of convex optimization problems amenable to a wide variety of techniques [18, 11, 34]. In concurrent work, Wu et al. achieve 83% accuracy on MNIST via convex empirical risk minimization [57]. Training multi-layer neural networks is non-convex, and is typically solved by an application of SGD, whose theoretical guarantees are poorly understood. For the CIFAR neural network we incorporate differentially private training of the PCA projection matrix [23], which is used to reduce the dimensionality of the inputs.

Model class. The first end-to-end differentially private system was evaluated on the Netflix Prize dataset [39], a version of a collaborative filtering problem. Although the problem shared many similarities with ours—high-dimensional inputs, non-convex objective function—the approach taken by McSherry and Mironov differed significantly. They identified the core of the learning task, effectively sufficient statistics, that can be computed in a differentially private manner via a Gaussian mechanism. In our approach no such sufficient statistics exist.

In a recent work, Shokri and Shmatikov [50] designed and evaluated a system for distributed training of a deep neural network. Participants, who hold their data closely, communicate sanitized updates to a central authority. The sanitization relies on an additive-noise mechanism, based on a sensitivity estimate, which could be improved to a hard sensitivity guarantee. They compute privacy loss per parameter (not for an entire model). By our preferred measure, the total privacy loss per participant on the MNIST dataset exceeds several thousand.

A different, recent approach towards differentially private deep learning is explored by Phan et al. [45]. This work focuses on learning autoencoders. Privacy is based on perturbing the objective functions of these autoencoders.
7. CONCLUSIONS

We demonstrate the training of deep neural networks with differential privacy, incurring a modest total privacy loss, computed over entire models with many parameters. In our experiments for MNIST, we achieve 97% training accuracy and for CIFAR-10 we achieve 73% accuracy, both with (8, 10⁻⁵)-differential privacy. Our algorithms are based on a differentially private version of stochastic gradient descent; they run on the TensorFlow software library for machine learning. Since our approach applies directly to gradient computations, it can be adapted to many other classical and more recent first-order optimization methods, such as NAG [43], Momentum [48], AdaGrad [15], or SVRG [31].

A new tool, which may be of independent interest, is a mechanism for tracking privacy loss, the moments accountant. It permits tight automated analysis of the privacy loss of complex composite mechanisms that are currently beyond the reach of advanced composition theorems.

A number of avenues for further work are attractive. In particular, we would like to consider other classes of deep networks. Our experience with MNIST and CIFAR-10 should be helpful, but we see many opportunities for new research, for example in applying our techniques to LSTMs used for language modeling tasks. In addition, we would like to obtain additional improvements in accuracy. Many training datasets are much larger than those of MNIST and CIFAR-10; accuracy should benefit from their size.

8. ACKNOWLEDGMENTS

We are grateful to Úlfar Erlingsson and Dan Ramage for many useful discussions, and to Mark Bun and Thomas Steinke for sharing a draft of [10].
9. REFERENCES

[1] CIFAR-10 and CIFAR-100 datasets. www.cs.toronto.edu/~kriz/cifar.html.
[2] TensorFlow convolutional neural networks tutorial. www.tensorflow.org/tutorials/deep_cnn.
[3] TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[4] C. C. Aggarwal. On k-anonymity and the curse of dimensionality. In VLDB, pages 901–909, 2005.
[5] R. Agrawal and R. Srikant. Privacy-preserving data mining. In SIGMOD, pages 439–450. ACM, 2000.
[6] R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. Ullman. Algorithmic stability for adaptive data analysis. In STOC, pages 1046–1059. ACM, 2016.
[7] R. Bassily, A. D. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464–473. IEEE, 2014.
[8] A. Beimel, H. Brenner, S. P. Kasiviswanathan, and K. Nissim. Bounds on the sample complexity for private learning and private data release. Machine Learning, 94(3):401–437, 2014.
[9] J. Brickell and V. Shmatikov. The cost of privacy: Destruction of data-mining utility in anonymized data publishing. In KDD, pages 70–78. ACM, 2008.
[10] M. Bun and T. Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In TCC-B.
[11] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. J. Machine Learning Research, 12:1069–1109, 2011.
[12] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
[13] D. D. Cox and N. Pinto. Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In FG 2011, pages 8–15. IEEE, 2011.
[14] A. Daniely, R. Frostig, and Y. Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. CoRR, abs/1602.05897, 2016.
[15] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Machine Learning Research, 12:2121–2159, July 2011.
[16] C. Dwork. A firm foundation for private data analysis. Commun. ACM, 54(1):86–95, Jan. 2011.
[17] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486–503. Springer, 2006.
[18] C. Dwork and J. Lei. Differential privacy and robust statistics. In STOC, pages 371–380. ACM, 2009.
[19] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284. Springer, 2006.
[20] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
[21] C. Dwork and G. N. Rothblum. Concentrated differential privacy. CoRR, abs/1603.01887, 2016.
[22] C. Dwork, G. N. Rothblum, and S. Vadhan. Boosting and differential privacy. In FOCS, pages 51–60. IEEE, 2010.
[23] C. Dwork, K. Talwar, A. Thakurta, and L. Zhang. Analyze Gauss: Optimal bounds for privacy-preserving principal component analysis. In STOC, pages 11–20. ACM, 2014.
[24] M. Fredrikson, S. Jha, and T. Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In CCS, pages 1322–1333. ACM, 2015.
[25] I. Goodfellow. Efficient per-example gradient computations. CoRR, abs/1510.01799v2, 2015.
[26] B. Graham. Fractional max-pooling. CoRR, abs/1412.6071, 2014.
[27] A. Gupta, K. Ligett, F. McSherry, A. Roth, and K. Talwar. Differentially private combinatorial optimization. In SODA, pages 1106–1125, 2010.
[28] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, pages 1026–1034. IEEE, 2015.
[29] R. Ierusalimschy, L. H. de Figueiredo, and W. Filho. Lua—an extensible extension language. Software: Practice and Experience, 26(6):635–652, 1996.
[30] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV, pages 2146–2153. IEEE, 2009.
[31] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.
[32] P. Kairouz, S. Oh, and P. Viswanath. The composition theorem for differential privacy. In ICML, pages 1376–1385. ACM, 2015.
[33] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? SIAM J. Comput., 40(3):793–826, 2011.
[34] D. Kifer, A. D. Smith, and A. Thakurta. Private convex optimization for empirical risk minimization with applications to high-dimensional regression. In COLT, pages 25.1–25.40, 2012.
[35] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[36] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
[37] Y. Lindell and B. Pinkas. Privacy preserving data mining. In CRYPTO, pages 36–54. Springer, 2000.
[38] C. J. Maddison, A. Huang, I. Sutskever, and D. Silver. Move evaluation in Go using deep convolutional neural networks. In ICLR, 2015.
[39] F. McSherry and I. Mironov. Differentially private recommender systems: Building privacy into the Netflix Prize contenders. In KDD, pages 627–636. ACM, 2009.
[40] F. D. McSherry. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In SIGMOD, pages 19–30. ACM, 2009.
[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[42] I. Mironov. Rényi differential privacy. Private communication, 2016.
[43] Y. Nesterov. Introductory Lectures on Convex Optimization. A Basic Course. Springer, 2004.
[44] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
[45] N. Phan, Y. Wang, X. Wu, and D. Dou. Differential privacy preservation for deep auto-encoders: an application of human behavior prediction. In AAAI, pages 1309–1316, 2016.
[46] N. Pinto, Z. Stone, T. E. Zickler, and D. Cox. Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on Facebook. In CVPR, pages 35–42. IEEE, 2011.
[47] R. M. Rogers, A. Roth, J. Ullman, and S. P. Vadhan. Privacy odometers and filters: Pay-as-you-go composition. In NIPS.
[48] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, Oct. 1986.
[49] A. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Ng. On random weights and unsupervised feature learning. In ICML, pages 1089–1096. ACM, 2011.
[50] R. Shokri and V. Shmatikov. Privacy-preserving deep learning. In CCS, pages 1310–1321. ACM, 2015.
[51] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[52] S. Song, K. Chaudhuri, and A. Sarwate. Stochastic gradient descent with differentially private updates. In GlobalSIP Conference, 2013.
[53] L. Sweeney. k-anonymity: A model for protecting privacy. International J. of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002.
[54] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9. IEEE, 2015.
[55] S. Tu, R. Roelofs, S. Venkataraman, and B. Recht. Large scale kernel learning using block coordinate descent. CoRR, abs/1602.05310, 2016.
[56] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. E. Hinton. Grammar as a foreign language. In NIPS, pages 2773–2781, 2015.
[57] X. Wu, A. Kumar, K. Chaudhuri, S. Jha, and J. F. Naughton. Differentially private stochastic gradient descent for in-RDBMS analytics. CoRR, abs/1606.04722, 2016.
APPENDIX

A. PROOF OF THEOREM 2

Here we restate and prove Theorem 2.

Theorem 2. Let αM(λ) be defined as

    αM(λ) ≜ max_{aux,d,d′} αM(λ; aux, d, d′),

where the maximum is taken over all auxiliary inputs and neighboring databases d, d′. Then

1. [Composability] Suppose that a mechanism M consists of a sequence of adaptive mechanisms M1, . . . , Mk, where Mi : ∏_{j=1}^{i−1} Rj × D → Ri. Then, for any λ,

    αM(λ) ≤ Σ_{i=1}^{k} αMi(λ).

2. [Tail bound] For any ε > 0, the mechanism M is (ε, δ)-differentially private for

    δ = min_λ exp(αM(λ) − λε).

Proof. Composition of moments. For brevity, let M1:i denote (M1, . . . , Mi), and similarly let o1:i denote (o1, . . . , oi). For neighboring databases d, d′ ∈ Dⁿ and a sequence of outcomes o1, . . . , ok we write

    c(o1:k; M1:k, o1:(k−1), d, d′)
      = log ( Pr[M1:k(d; o1:(k−1)) = o1:k] / Pr[M1:k(d′; o1:(k−1)) = o1:k] )
      = log ∏_{i=1}^{k} ( Pr[Mi(d) = oi | M1:(i−1)(d) = o1:(i−1)] / Pr[Mi(d′) = oi | M1:(i−1)(d′) = o1:(i−1)] )
      = Σ_{i=1}^{k} log ( Pr[Mi(d) = oi | M1:(i−1)(d) = o1:(i−1)] / Pr[Mi(d′) = oi | M1:(i−1)(d′) = o1:(i−1)] )
      = Σ_{i=1}^{k} c(oi; Mi, o1:(i−1), d, d′).

Thus

    E_{o′1:k ∼ M1:k(d)}[ exp(λ c(o′1:k; M1:k, d, d′)) | ∀i < k : o′i = oi ]
      = E_{o′1:k ∼ M1:k(d)}[ exp( λ Σ_{i=1}^{k} c(o′i; Mi, o1:(i−1), d, d′) ) ]
      = E_{o′1:k ∼ M1:k(d)}[ ∏_{i=1}^{k} exp( λ c(o′i; Mi, o1:(i−1), d, d′) ) ]
      = ∏_{i=1}^{k} E_{o′i ∼ Mi(d)}[ exp( λ c(o′i; Mi, o1:(i−1), d, d′) ) ]    (by independence of noise)
      = ∏_{i=1}^{k} exp( αMi(λ; o1:(i−1), d, d′) )
      = exp( Σ_{i=1}^{k} αi(λ; o1:(i−1), d, d′) ).

The claim follows.

Tail bound by moments. The proof is based on the standard Markov's inequality argument used in proofs of measure concentration. We have

    Pr_{o∼M(d)}[c(o) ≥ ε] = Pr_{o∼M(d)}[exp(λc(o)) ≥ exp(λε)]
      ≤ E_{o∼M(d)}[exp(λc(o))] / exp(λε)
      ≤ exp(αM(λ) − λε).

Let B = {o : c(o) ≥ ε}. Then for any S,

    Pr[M(d) ∈ S] = Pr[M(d) ∈ S ∩ Bᶜ] + Pr[M(d) ∈ S ∩ B]
      ≤ exp(ε) Pr[M(d′) ∈ S ∩ Bᶜ] + Pr[M(d) ∈ B]
      ≤ exp(ε) Pr[M(d′) ∈ S] + exp(αM(λ) − λε).

The second part follows by an easy calculation.

The proof demonstrates a tail bound on the privacy loss, making it stronger than differential privacy for a fixed value of ε, δ.
B. PROOF OF LEMMA 3

The proof of the main theorem relies on the following moments bound on the Gaussian mechanism with random sampling.

Lemma 3. Suppose that f : D → ℝᵖ with ‖f(·)‖₂ ≤ 1. Let σ ≥ 1 and let J be a sample from [n] where each i ∈ [n] is chosen independently with probability q < 1/(16σ). Then for any positive integer λ ≤ σ² ln(1/(qσ)), the mechanism M(d) = Σ_{i∈J} f(di) + N(0, σ²I) satisfies

    αM(λ) ≤ q²λ(λ + 1)/((1 − q)σ²) + O(q³λ³/σ³).

Proof. Fix d′ and let d = d′ ∪ {dn}. Without loss of generality, f(dn) = e1 and Σ_{i∈J\{n}} f(di) = 0. Thus M(d) and M(d′) are distributed identically except for the first coordinate and hence we have a one-dimensional problem. Let μ0 denote the pdf of N(0, σ²) and let μ1 denote the pdf of N(1, σ²). Thus:

    M(d′) ∼ μ0,
    M(d) ∼ μ ≜ (1 − q)μ0 + qμ1.

We want to show that

    E_{z∼μ}[(μ(z)/μ0(z))^λ] ≤ α  and  E_{z∼μ0}[(μ0(z)/μ(z))^λ] ≤ α,

for some explicit α to be determined later.

We will use the same method to prove both bounds. Assume we have two distributions ν0 and ν1, and we wish to bound

    E_{z∼ν0}[(ν0(z)/ν1(z))^λ] = E_{z∼ν1}[(ν0(z)/ν1(z))^{λ+1}].

Using binomial expansion, we have

    E_{z∼ν1}[(ν0(z)/ν1(z))^{λ+1}]
      = E_{z∼ν1}[(1 + (ν0(z) − ν1(z))/ν1(z))^{λ+1}]
      = Σ_{t=0}^{λ+1} C(λ+1, t) · E_{z∼ν1}[((ν0(z) − ν1(z))/ν1(z))^t].    (5)

The first term in (5) is 1, and the second term is

    E_{z∼ν1}[(ν0(z) − ν1(z))/ν1(z)] = ∫_{−∞}^{∞} ν1(z) · (ν0(z) − ν1(z))/ν1(z) dz
      = ∫_{−∞}^{∞} ν0(z) dz − ∫_{−∞}^{∞} ν1(z) dz = 1 − 1 = 0.

To prove the lemma it suffices to show that for both ν0 = μ, ν1 = μ0 and ν0 = μ0, ν1 = μ, the third term is bounded by q²λ(λ + 1)/((1 − q)σ²) and that this bound dominates the sum of the remaining terms. We will prove the more difficult second case (ν0 = μ0, ν1 = μ); the proof of the other case is similar.

To upper bound the third term in (5), we note that μ(z) ≥ (1 − q)μ0(z), and write

    E_{z∼μ}[((μ0(z) − μ(z))/μ(z))²] = q² E_{z∼μ}[((μ0(z) − μ1(z))/μ(z))²]
      = q² ∫_{−∞}^{∞} (μ0(z) − μ1(z))²/μ(z) dz
      ≤ (q²/(1 − q)) ∫_{−∞}^{∞} (μ0(z) − μ1(z))²/μ0(z) dz
      = (q²/(1 − q)) E_{z∼μ0}[((μ0(z) − μ1(z))/μ0(z))²].

An easy fact is that for any a ∈ ℝ, E_{z∼μ0}[exp(2az/(2σ²))] = exp(a²/(2σ²)). Thus,

    E_{z∼μ0}[((μ0(z) − μ1(z))/μ0(z))²] = E_{z∼μ0}[(1 − exp((2z − 1)/(2σ²)))²]
      = 1 − 2 E_{z∼μ0}[exp((2z − 1)/(2σ²))] + E_{z∼μ0}[exp((4z − 2)/(2σ²))]
      = 1 − 2 exp(1/(2σ²)) · exp(−1/(2σ²)) + exp(4/(2σ²)) · exp(−2/(2σ²))
      = exp(1/σ²) − 1.

Thus the third term in the binomial expansion (5) satisfies

    C(λ+1, 2) · E_{z∼μ}[((μ0(z) − μ(z))/μ(z))²] ≤ λ(λ + 1)q²/((1 − q)σ²).

To bound the remaining terms, we first note that by standard calculus we get:

    ∀z ≤ 0:      |μ0(z) − μ1(z)| ≤ −(z − 1)μ0(z)/σ²,
    ∀z ≥ 1:      |μ0(z) − μ1(z)| ≤ zμ1(z)/σ²,
    ∀0 ≤ z ≤ 1: |μ0(z) − μ1(z)| ≤ μ0(z)(exp(1/(2σ²)) − 1) ≤ μ0(z)/σ².

We can then write

    E_{z∼μ}[((μ0(z) − μ(z))/μ(z))^t]
      ≤ ∫_{−∞}^{0} μ(z) |(μ0(z) − μ(z))/μ(z)|^t dz
        + ∫_{0}^{1} μ(z) |(μ0(z) − μ(z))/μ(z)|^t dz
        + ∫_{1}^{∞} μ(z) |(μ0(z) − μ(z))/μ(z)|^t dz.

We consider these terms individually. We repeatedly make use of three observations: (1) μ0 − μ = q(μ0 − μ1), (2) μ ≥ (1 − q)μ0, and (3) E_{μ0}[|z|^t] ≤ σ^t (t − 1)!!. The first term can then be bounded by

    (q^t / ((1 − q)^{t−1} σ^{2t})) ∫_{−∞}^{0} μ0(z)|z − 1|^t dz ≤ (2q)^t (t − 1)!! / (2(1 − q)^{t−1} σ^t).

The second term is at most

    (q^t/(1 − q)^t) ∫_{0}^{1} μ(z) |(μ0(z) − μ1(z))/μ0(z)|^t dz
      ≤ (q^t/(1 − q)^t) ∫_{0}^{1} μ(z)/σ^{2t} dz
      ≤ q^t / ((1 − q)^t σ^{2t}).

Similarly, the third term is at most

    (q^t / ((1 − q)^{t−1} σ^{2t})) ∫_{1}^{∞} μ0(z) (zμ1(z)/μ0(z))^t dz
      ≤ (q^t / ((1 − q)^{t−1} σ^{2t})) ∫_{1}^{∞} μ0(z) exp((2tz − t)/(2σ²)) z^t dz
      ≤ (q^t exp((t² − t)/(2σ²)) / ((1 − q)^{t−1} σ^{2t})) ∫_{0}^{∞} μ0(z − t) z^t dz
      ≤ (2q)^t exp((t² − t)/(2σ²)) (σ^t (t − 1)!! + t^t) / (2(1 − q)^{t−1} σ^{2t}).

Under the assumptions on q, σ, and λ, it is easy to check that the three terms, and their sum, drop off geometrically fast in t for t > 3. Hence the binomial expansion (5) is dominated by the t = 3 term, which is O(q³λ³/σ³). The claim follows.
To derive Theorem 1, we use the above moments bound along with the tail bound from Theorem 2, optimizing over the choice of λ.

Theorem 1. There exist constants c1 and c2 so that given the sampling probability q = L/N and the number of steps T, for any ε < c1 q²T, Algorithm 1 is (ε, δ)-differentially private for any δ > 0 if we choose

    σ ≥ c2 · q √(T log(1/δ)) / ε.

Proof. Assume for now that σ and λ satisfy the conditions in Lemma 3. By Theorem 2.1 and Lemma 3, the log moment of Algorithm 1 can be bounded as follows: α(λ) ≤ T q²λ²/σ². By Theorem 2, to guarantee that Algorithm 1 is (ε, δ)-differentially private, it suffices that

    T q²λ²/σ² ≤ λε/2,
    exp(−λε/2) ≤ δ.

In addition, we need

    λ ≤ σ² log(1/(qσ)).

It is now easy to verify that when ε = c1 q²T, we can satisfy all these conditions by setting

    σ = c2 · q √(T log(1/δ)) / ε

for some explicit constants c1 and c2.

C. FROM DIFFERENTIAL PRIVACY TO MOMENTS BOUNDS

One can also translate a differential privacy guarantee into a moment bound.

Lemma C.1. Let M be ε-differentially private. Then for any λ > 0, M satisfies

    α(λ) ≤ λε(e^ε − 1) + λ²ε²e^{2ε}/2.

Proof. Let Z denote the random variable c(M(d)). Then differential privacy implies that

  • μ ≜ E[Z] ≤ ε(exp(ε) − 1);
  • |Z| ≤ ε, so that |Z − μ| ≤ ε exp(ε).

Then E[exp(λZ)] = exp(λμ) · E[exp(λ(Z − μ))]. Since Z is in a bounded range [−ε exp(ε), ε exp(ε)] and f(x) = exp(λx) is convex, we can bound f(x) by a linear interpolation between the values at the two endpoints of the range. Basic calculus then implies that

    E[f(Z)] ≤ f(E[Z]) · exp(λ²ε² exp(2ε)/2),

which concludes the proof.

D. HYPERPARAMETER SEARCH

Here we state Theorem 10.2 from [27] that we use to account for the cost of hyperparameter search.

Theorem D.1 (Gupta et al. [27]). Let M be an ε-differentially private mechanism such that for a query function q with sensitivity 1 and a parameter Q, it holds that Pr_{r∼M(d)}[q(d, r) ≥ Q] ≥ p for some p ∈ (0, 1). Then for any δ > 0 and any ε₀ ∈ (0, 1/2), there is a mechanism M′ which satisfies the following properties:

  • Pr_{r∼M′(d)}[q(d, r) ≥ Q − (4/ε₀) log(1/(ε₀δp))] ≥ 1 − δ.
  • M′ makes (1/(ε₀δp))² log(1/(ε₀δp)) calls to M.
  • M′ is (ε + 8ε₀)-differentially private.

Suppose that we have a differentially private mechanism Mi for each of K choices of hyperparameters. Let M̃ be the mechanism that picks a random choice of hyperparameters and runs the corresponding Mi. Let q(d, r) denote the number of examples from the validation set that r labels correctly, and let Q be a target accuracy. Assuming that one of the hyperparameter settings gets accuracy at least Q, M̃ satisfies the pre-conditions of the theorem for p = 1/K. Then with high probability, the mechanism implied by the theorem gets accuracy close to Q. We remark that the proof of Theorem D.1 actually implies a stronger max(ε, 8ε₀)-differential privacy for the setting of interest here.

Putting in some numbers, for a target accuracy of 95% on a validation set of size 10,000, we get Q = 9,500. Thus, if, for instance, we allow ε₀ = 0.5 and δ = 0.05, we lose at most 1% in accuracy as long as 100 > 8 ln(40/p). This is satisfied as long as p ≥ 1/6700. In other words, one can try 6,700 different parameter settings at privacy cost ε = 4 for the validation set. In our experiments, we tried no more than a hundred settings, so that this bound is easily satisfied. In practice, as our graphs show, p for our hyperparameter search is significantly larger than 1/K, so that a slightly smaller ε₀ should suffice.
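For a quick numeric check of the bound above, the accuracy slack (4/ε₀) log(1/(ε₀δp)) = 8 ln(40/p) can be evaluated directly in a few lines of Python:

import math

eps0, delta = 0.5, 0.05
slack = lambda p: (4 / eps0) * math.log(1 / (eps0 * delta * p))  # = 8 ln(40 / p)
print(slack(1 / 6700))  # about 99.98 < 100, i.e. at most 1% of a 10,000-example set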