
A preliminary version of this paper appears in the proceedings of the 23rd ACM Conference on Computer and Communications Security (CCS 2016). This is a full version.

Deep Learning with Differential Privacy

Martín Abadi*  Andy Chu*  Ian Goodfellow†  H. Brendan McMahan*  Ilya Mironov*  Kunal Talwar*  Li Zhang*
* Google.   † OpenAI. Work done while at Google.

October 25, 2016
arXiv:1607.00133v2 [stat.ML] 24 Oct 2016

ABSTRACT

Machine learning techniques based on neural networks are achieving remarkable results in a wide variety of domains. Often, the training of models requires large, representative datasets, which may be crowdsourced and contain sensitive information. The models should not expose private information in these datasets. Addressing this goal, we develop new algorithmic techniques for learning and a refined analysis of privacy costs within the framework of differential privacy. Our implementation and experiments demonstrate that we can train deep neural networks with non-convex objectives, under a modest privacy budget, and at a manageable cost in software complexity, training efficiency, and model quality.

1. INTRODUCTION

Recent progress in neural networks has led to impressive successes in a wide range of applications, including image classification, language representation, move selection for Go, and many more (e.g., [54, 28, 56, 38, 51]). These advances are enabled, in part, by the availability of large and representative datasets for training neural networks. These datasets are often crowdsourced, and may contain sensitive information. Their use requires techniques that meet the demands of the applications while offering principled and rigorous privacy guarantees.

In this paper, we combine state-of-the-art machine learning methods with advanced privacy-preserving mechanisms, training neural networks within a modest ("single-digit") privacy budget. We treat models with non-convex objectives, several layers, and tens of thousands to millions of parameters. (In contrast, previous work obtains strong results on convex models with smaller numbers of parameters, or treats complex neural networks but with a large privacy loss.) For this purpose, we develop new algorithmic techniques, a refined analysis of privacy costs within the framework of differential privacy, and careful implementation strategies:

1. We demonstrate that, by tracking detailed information (higher moments) of the privacy loss, we can obtain much tighter estimates on the overall privacy loss, both asymptotically and empirically.

2. We improve the computational efficiency of differentially private training by introducing new techniques. These techniques include efficient algorithms for computing gradients for individual training examples, subdividing tasks into smaller batches to reduce memory footprint, and applying differentially private principal projection at the input layer.

3. We build on the machine learning framework TensorFlow [3] for training models with differential privacy. We evaluate our approach on two standard image classification tasks, MNIST and CIFAR-10. We chose these two tasks because they are based on public datasets and have a long record of serving as benchmarks in machine learning. Our experience indicates that privacy protection for deep neural networks can be achieved at a modest cost in software complexity, training efficiency, and model quality.

Machine learning systems often comprise elements that contribute to protecting their training data. In particular, regularization techniques, which aim to avoid overfitting to the examples used for training, may hide details of those examples. On the other hand, explaining the internal representations in deep neural networks is notoriously difficult, and their large capacity entails that these representations may potentially encode fine details of at least some of the training data. In some cases, a determined adversary may be able to extract parts of the training data. For example, Fredrikson et al. demonstrated a model-inversion attack that recovers images from a facial recognition system [24].

While the model-inversion attack requires only "black-box" access to a trained model (that is, interaction with the model via inputs and outputs), we consider adversaries with additional capabilities, much like Shokri and Shmatikov [50]. Our approach offers protection against a strong adversary with full knowledge of the training mechanism and access to the model's parameters. This protection is attractive, in particular, for applications of machine learning on mobile phones, tablets, and other devices. Storing models on-device enables power-efficient, low-latency inference, and may contribute to privacy since inference does not require communicating user data to a central server; on the other hand, we must assume that the model parameters themselves may be exposed to hostile inspection. Furthermore, when we are
concerned with preserving the privacy of one record in the training data, we allow for the possibility that the adversary controls some or even all of the rest of the training data. In practice, this possibility cannot always be excluded, for example when the data is crowdsourced.

The next section reviews background on deep learning and on differential privacy. Sections 3 and 4 explain our approach and implementation. Section 5 describes our experimental results. Section 6 discusses related work, and Section 7 concludes. Deferred proofs appear in the Appendix.

2. BACKGROUND

In this section we briefly recall the definition of differential privacy, introduce the Gaussian mechanism and composition theorems, and overview basic principles of deep learning.

2.1 Differential Privacy

Differential privacy [19, 16, 20] constitutes a strong standard for privacy guarantees for algorithms on aggregate databases. It is defined in terms of the application-specific concept of adjacent databases. In our experiments, for instance, each training dataset is a set of image-label pairs; we say that two of these sets are adjacent if they differ in a single entry, that is, if one image-label pair is present in one set and absent in the other.

Definition 1. A randomized mechanism M : D → R with domain D and range R satisfies (ε, δ)-differential privacy if for any two adjacent inputs d, d′ ∈ D and for any subset of outputs S ⊆ R it holds that

    Pr[M(d) ∈ S] ≤ e^ε Pr[M(d′) ∈ S] + δ.

The original definition of ε-differential privacy does not include the additive term δ. We use the variant introduced by Dwork et al. [17], which allows for the possibility that plain ε-differential privacy is broken with probability δ (which is preferably smaller than 1/|d|).

Differential privacy has several properties that make it particularly useful in applications such as ours: composability, group privacy, and robustness to auxiliary information. Composability enables modular design of mechanisms: if all the components of a mechanism are differentially private, then so is their composition. Group privacy implies graceful degradation of privacy guarantees if datasets contain correlated inputs, such as the ones contributed by the same individual. Robustness to auxiliary information means that privacy guarantees are not affected by any side information available to the adversary.

A common paradigm for approximating a deterministic real-valued function f : D → R with a differentially private mechanism is via additive noise calibrated to f's sensitivity Sf, which is defined as the maximum of the absolute distance |f(d) − f(d′)| where d and d′ are adjacent inputs. (The restriction to a real-valued function is intended to simplify this review, but is not essential.) For instance, the Gaussian noise mechanism is defined by

    M(d) ≜ f(d) + N(0, Sf²·σ²),

where N(0, Sf²·σ²) is the normal (Gaussian) distribution with mean 0 and standard deviation Sf·σ. A single application of the Gaussian mechanism to function f of sensitivity Sf satisfies (ε, δ)-differential privacy if δ ≥ (4/5)·exp(−(σε)²/2) and ε < 1 [20, Theorem 3.22]. Note that this analysis of the mechanism can be applied post hoc, and, in particular, that there are infinitely many (ε, δ) pairs that satisfy this condition.

Differential privacy for repeated applications of additive-noise mechanisms follows from the basic composition theorem [17, 18], or from advanced composition theorems and their refinements [22, 32, 21, 10]. The task of keeping track of the accumulated privacy loss in the course of execution of a composite mechanism, and enforcing the applicable privacy policy, can be performed by the privacy accountant, introduced by McSherry [40].

The basic blueprint for designing a differentially private additive-noise mechanism that implements a given functionality consists of the following steps: approximating the functionality by a sequential composition of bounded-sensitivity functions; choosing parameters of additive noise; and performing privacy analysis of the resulting mechanism. We follow this approach in Section 3.

2.2 Deep Learning

Deep neural networks, which are remarkably effective for many machine learning tasks, define parameterized functions from inputs to outputs as compositions of many layers of basic building blocks, such as affine transformations and simple nonlinear functions. Commonly used examples of the latter are sigmoids and rectified linear units (ReLUs). By varying parameters of these blocks, we can "train" such a parameterized function with the goal of fitting any given finite set of input/output examples.

More precisely, we define a loss function L that represents the penalty for mismatching the training data. The loss L(θ) on parameters θ is the average of the loss over the training examples {x1, . . . , xN}, so L(θ) = (1/N) Σi L(θ, xi). Training consists in finding θ that yields an acceptably small loss, hopefully the smallest loss (though in practice we seldom expect to reach an exact global minimum).

For complex networks, the loss function L is usually non-convex and difficult to minimize. In practice, the minimization is often done by the mini-batch stochastic gradient descent (SGD) algorithm. In this algorithm, at each step, one forms a batch B of random examples and computes gB = (1/|B|) Σ_{x∈B} ∇θ L(θ, x) as an estimation to the gradient ∇θ L(θ). Then θ is updated following the gradient direction −gB towards a local minimum.

Several systems have been built to support the definition of neural networks, to enable efficient training, and then to perform efficient inference (execution for fixed parameters) [29, 12, 3]. We base our work on TensorFlow, an open-source dataflow engine released by Google [3]. TensorFlow allows the programmer to define large computation graphs from basic operators, and to distribute their execution across a heterogeneous distributed system. TensorFlow automates the creation of the computation graphs for gradients; it also makes it easy to batch computation.
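To fix notation for what follows, here is a minimal NumPy sketch (ours, not code from the paper) of the ordinary mini-batch SGD step described above; grad_loss stands in for the per-example gradient ∇θ L(θ, x):

    import numpy as np

    def sgd_step(theta, batch, grad_loss, lr):
        # g_B = (1/|B|) * sum of per-example gradients over the batch.
        g_B = np.mean([grad_loss(theta, x) for x in batch], axis=0)
        # Move against the gradient direction toward a local minimum.
        return theta - lr * g_B

The differentially private variant introduced next modifies exactly this step: it bounds each per-example gradient and perturbs the average with calibrated noise.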
3. OUR APPROACH

This section describes the main components of our approach toward differentially private training of neural networks: a differentially private stochastic gradient descent (SGD) algorithm, the moments accountant, and hyperparameter tuning.

3.1 Differentially Private SGD Algorithm

One might attempt to protect the privacy of training data by working only on the final parameters that result from the training process, treating this process as a black box. Unfortunately, in general, one may not have a useful, tight characterization of the dependence of these parameters on the training data; adding overly conservative noise to the parameters, where the noise is selected according to the worst-case analysis, would destroy the utility of the learned model. Therefore, we prefer a more sophisticated approach in which we aim to control the influence of the training data during the training process, specifically in the SGD computation. This approach has been followed in previous works (e.g., [52, 7]); we make several modifications and extensions, in particular in our privacy accounting.

Algorithm 1 outlines our basic method for training a model with parameters θ by minimizing the empirical loss function L(θ). At each step of the SGD, we compute the gradient ∇θ L(θ, xi) for a random subset of examples, clip the ℓ2 norm of each gradient, compute the average, add noise in order to protect privacy, and take a step in the opposite direction of this average noisy gradient. At the end, in addition to outputting the model, we will also need to compute the privacy loss of the mechanism based on the information maintained by the privacy accountant. Next we describe in more detail each component of this algorithm and our refinements.

Algorithm 1 Differentially private SGD (Outline)
  Input: Examples {x1, . . . , xN}, loss function L(θ) = (1/N) Σi L(θ, xi). Parameters: learning rate ηt, noise scale σ, group size L, gradient norm bound C.
  Initialize θ0 randomly
  for t ∈ [T] do
    Take a random sample Lt with sampling probability L/N
    Compute gradient: For each i ∈ Lt, compute gt(xi) ← ∇θt L(θt, xi)
    Clip gradient: ḡt(xi) ← gt(xi) / max(1, ‖gt(xi)‖2 / C)
    Add noise: g̃t ← (1/L) (Σi ḡt(xi) + N(0, σ²C²I))
    Descent: θt+1 ← θt − ηt g̃t
  Output θT and compute the overall privacy cost (ε, δ) using a privacy accounting method.

Norm clipping: Proving the differential privacy guarantee of Algorithm 1 requires bounding the influence of each individual example on g̃t. Since there is no a priori bound on the size of the gradients, we clip each gradient in ℓ2 norm; i.e., the gradient vector g is replaced by g / max(1, ‖g‖2 / C), for a clipping threshold C. This clipping ensures that if ‖g‖2 ≤ C, then g is preserved, whereas if ‖g‖2 > C, it gets scaled down to be of norm C. We remark that gradient clipping of this form is a popular ingredient of SGD for deep networks for non-privacy reasons, though in that setting it usually suffices to clip after averaging.

Per-layer and time-dependent parameters: The pseudocode for Algorithm 1 groups all the parameters into a single input θ of the loss function L(·). For multi-layer neural networks, we consider each layer separately, which allows setting different clipping thresholds C and noise scales σ for different layers. Additionally, the clipping and noise parameters may vary with the number of training steps t. In results presented in Section 5 we use constant settings for C and σ.

Lots: Like the ordinary SGD algorithm, Algorithm 1 estimates the gradient of L by computing the gradient of the loss on a group of examples and taking the average. This average provides an unbiased estimator, the variance of which decreases quickly with the size of the group. We call such a group a lot, to distinguish it from the computational grouping that is commonly called a batch. In order to limit memory consumption, we may set the batch size much smaller than the lot size L, which is a parameter of the algorithm. We perform the computation in batches, then group several batches into a lot for adding noise. In practice, for efficiency, the construction of batches and lots is done by randomly permuting the examples and then partitioning them into groups of the appropriate sizes. For ease of analysis, however, we assume that each lot is formed by independently picking each example with probability q = L/N, where N is the size of the input dataset.

As is common in the literature, we normalize the running time of a training algorithm by expressing it as the number of epochs, where each epoch is the (expected) number of batches required to process N examples. In our notation, an epoch consists of N/L lots.

Privacy accounting: For differentially private SGD, an important issue is computing the overall privacy cost of the training. The composability of differential privacy allows us to implement an "accountant" procedure that computes the privacy cost at each access to the training data, and accumulates this cost as the training progresses. Each step of training typically requires gradients at multiple layers, and the accountant accumulates the cost that corresponds to all of them.

Moments accountant: Much research has been devoted to studying the privacy loss for a particular noise distribution as well as the composition of privacy losses. For the Gaussian noise that we use, if we choose σ in Algorithm 1 to be √(2·log(1.25/δ)) / ε, then by standard arguments [20] each step is (ε, δ)-differentially private with respect to the lot. Since the lot itself is a random sample from the database, the privacy amplification theorem [33, 8] implies that each step is (O(qε), qδ)-differentially private with respect to the full database where q = L/N is the sampling ratio per lot and ε ≤ 1. The result in the literature that yields the best overall bound is the strong composition theorem [22].

However, the strong composition theorem can be loose, and does not take into account the particular noise distribution under consideration. In our work, we invent a stronger accounting method, which we call the moments accountant. It allows us to prove that Algorithm 1 is (O(qε√T), δ)-differentially private for appropriately chosen settings of the noise scale and the clipping threshold. Compared to what one would obtain by the strong composition theorem, our bound is tighter in two ways: it saves a √(log(1/δ)) factor in the ε part and a Tq factor in the δ part. Since we expect δ to be small and T ≫ 1/q (i.e., each example is examined multiple times), the saving provided by our bound is quite significant. This result is one of our main contributions.
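Before the formal statement, here is a minimal NumPy sketch (ours, not the released implementation) of the per-example clipping and noise addition performed in one step of Algorithm 1; grad_loss again stands in for ∇θ L(θ, x):

    import numpy as np

    def dp_sgd_step(theta, lot, grad_loss, lr, C, sigma):
        clipped = []
        for x in lot:
            g = grad_loss(theta, x)
            # Clip each per-example gradient to l2 norm at most C.
            g = g / max(1.0, np.linalg.norm(g) / C)
            clipped.append(g)
        # Add Gaussian noise N(0, sigma^2 C^2 I) to the sum, then average over the lot size.
        L = len(lot)
        noisy = (np.sum(clipped, axis=0) +
                 np.random.normal(0.0, sigma * C, size=theta.shape)) / L
        # Descend along the noisy average gradient.
        return theta - lr * noisy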
Theorem 1. There exist constants c1 and c2 so that given the sampling probability q = L/N and the number of steps T, for any ε < c1·q²T, Algorithm 1 is (ε, δ)-differentially private for any δ > 0 if we choose

    σ ≥ c2 · q·√(T·log(1/δ)) / ε.

If we use the strong composition theorem, we will then need to choose σ = Ω(q·√(T·log(1/δ)·log(T/δ)) / ε). Note that we save a factor of √(log(T/δ)) in our asymptotic bound. The moments accountant is beneficial in theory, as this result indicates, and also in practice, as can be seen from Figure 2 in Section 4. For example, with L = 0.01N, σ = 4, δ = 10⁻⁵, and T = 10000, we have ε ≈ 1.26 using the moments accountant. As a comparison, we would get a much larger ε ≈ 9.34 using the strong composition theorem.

3.2 The Moments Accountant: Details

The moments accountant keeps track of a bound on the moments of the privacy loss random variable (defined below in Eq. (1)). It generalizes the standard approach of tracking (ε, δ) and using the strong composition theorem. While such an improvement was known previously for composing Gaussian mechanisms, we show that it applies also for composing Gaussian mechanisms with random sampling and can provide a much tighter estimate of the privacy loss of Algorithm 1.

Privacy loss is a random variable dependent on the random noise added to the algorithm. That a mechanism M is (ε, δ)-differentially private is equivalent to a certain tail bound on M's privacy loss random variable. While the tail bound is very useful information on a distribution, composing directly from it can result in quite loose bounds. We instead compute the log moments of the privacy loss random variable, which compose linearly. We then use the moments bound, together with the standard Markov inequality, to obtain the tail bound, that is the privacy loss in the sense of differential privacy.

More specifically, for neighboring databases d, d′ ∈ Dⁿ, a mechanism M, auxiliary input aux, and an outcome o ∈ R, define the privacy loss at o as

    c(o; M, aux, d, d′) ≜ log ( Pr[M(aux, d) = o] / Pr[M(aux, d′) = o] ).   (1)

A common design pattern, which we use extensively in the paper, is to update the state by sequentially applying differentially private mechanisms. This is an instance of adaptive composition, which we model by letting the auxiliary input of the kth mechanism Mk be the output of all the previous mechanisms.

For a given mechanism M, we define the λth moment αM(λ; aux, d, d′) as the log of the moment generating function evaluated at the value λ:

    αM(λ; aux, d, d′) ≜ log E_{o∼M(aux,d)}[exp(λ·c(o; M, aux, d, d′))].   (2)

In order to prove privacy guarantees of a mechanism, it is useful to bound all possible αM(λ; aux, d, d′). We define

    αM(λ) ≜ max_{aux,d,d′} αM(λ; aux, d, d′),

where the maximum is taken over all possible aux and all the neighboring databases d, d′.

We state the properties of α that we use for the moments accountant.

Theorem 2. Let αM(λ) be defined as above. Then

1. [Composability] Suppose that a mechanism M consists of a sequence of adaptive mechanisms M1, . . . , Mk where Mi : (Π_{j<i} Rj) × D → Ri. Then, for any λ

    αM(λ) ≤ Σ_{i=1}^{k} αMi(λ).

2. [Tail bound] For any ε > 0, the mechanism M is (ε, δ)-differentially private for

    δ = min_λ exp(αM(λ) − λε).

In particular, Theorem 2.1 holds when the mechanisms themselves are chosen based on the (public) output of the previous mechanisms.

By Theorem 2, it suffices to compute, or bound, αMi(λ) at each step and sum them to bound the moments of the mechanism overall. We can then use the tail bound to convert the moments bound to the (ε, δ)-differential privacy guarantee.

The main challenge that remains is to bound the value of αMt(λ) for each step. In the case of a Gaussian mechanism with random sampling, it suffices to estimate the following moments. Let µ0 denote the probability density function (pdf) of N(0, σ²), and µ1 denote the pdf of N(1, σ²). Let µ be the mixture of two Gaussians µ = (1 − q)µ0 + qµ1. Then we need to compute α(λ) = log max(E1, E2) where

    E1 = E_{z∼µ0}[(µ0(z)/µ(z))^λ],   (3)
    E2 = E_{z∼µ}[(µ(z)/µ0(z))^λ].   (4)

In the implementation of the moments accountant, we carry out numerical integration to compute α(λ). In addition, we can show the asymptotic bound

    α(λ) ≤ q²λ(λ + 1)/((1 − q)σ²) + O(q³/σ³).

Together with Theorem 2, the above bound implies our main Theorem 1. The details can be found in the Appendix.

3.3 Hyperparameter Tuning

We identify characteristics of models relevant for privacy and, specifically, hyperparameters that we can tune in order to balance privacy, accuracy, and performance. In particular, through experiments, we observe that model accuracy is more sensitive to training parameters such as batch size and noise level than to the structure of a neural network.

If we try several settings for the hyperparameters, we can trivially add up the privacy costs of all the settings, possibly via the moments accountant. However, since we care only about the setting that gives us the most accurate model, we can do better, such as applying a version of a result from Gupta et al. [27] restated as Theorem D.1 in the Appendix.

We can use insights from theory to reduce the number of hyperparameter settings that need to be tried. While differentially private optimization of convex objective functions is best achieved using batch sizes as small as 1, non-convex learning, which is inherently less stable, benefits from aggregation into larger batches. At the same time, Theorem 1 suggests that making batches too large increases the privacy cost, and a reasonable tradeoff is to take the number
of batches per epoch to be of the same order as the desired number of epochs. The learning rate in non-private training is commonly adjusted downwards carefully as the model converges to a local optimum. In contrast, we never need to decrease the learning rate to a very small value, because differentially private training never reaches a regime where it would be justified. On the other hand, in our experiments, we do find that there is a small benefit to starting with a relatively large learning rate, then linearly decaying it to a smaller value in a few epochs, and keeping it constant afterwards.

4. IMPLEMENTATION

We have implemented the differentially private SGD algorithms in TensorFlow. The source code is available under an Apache 2.0 license from github.com/tensorflow/models.

For privacy protection, we need to "sanitize" the gradient before using it to update the parameters. In addition, we need to keep track of the "privacy spending" based on how the sanitization is done. Hence our implementation mainly consists of two components: sanitizer, which preprocesses the gradient to protect privacy, and privacy_accountant, which keeps track of the privacy spending over the course of training.

Figure 1 contains the TensorFlow code snippet (in Python) of DPSGD_Optimizer, which minimizes a loss function using a differentially private SGD, and DPTrain, which iteratively invokes DPSGD_Optimizer using a privacy accountant to bound the total privacy loss.

    class DPSGD_Optimizer():
        def __init__(self, accountant, sanitizer):
            self._accountant = accountant
            self._sanitizer = sanitizer

        def Minimize(self, loss, params,
                     batch_size, noise_options):
            # Accumulate privacy spending before computing
            # and using the gradients.
            priv_accum_op = self._accountant.AccumulatePrivacySpending(
                batch_size, noise_options)
            with tf.control_dependencies(priv_accum_op):
                # Compute per example gradients
                px_grads = per_example_gradients(loss, params)
                # Sanitize gradients
                sanitized_grads = self._sanitizer.Sanitize(
                    px_grads, noise_options)
                # Take a gradient descent step
                return apply_gradients(params, sanitized_grads)

    def DPTrain(loss, params, batch_size, noise_options):
        accountant = PrivacyAccountant()
        sanitizer = Sanitizer()
        dp_opt = DPSGD_Optimizer(accountant, sanitizer)
        sgd_op = dp_opt.Minimize(
            loss, params, batch_size, noise_options)
        eps, delta = (0, 0)
        # Carry out the training as long as the privacy
        # is within the pre-set limit.
        while within_limit(eps, delta):
            sgd_op.run()
            eps, delta = accountant.GetSpentPrivacy()

Figure 1: Code snippet of DPSGD_Optimizer and DPTrain.

In many cases, the neural network model may benefit from the processing of the input by projecting it on the principal directions (PCA) or by feeding it through a convolutional layer. We implement differentially private PCA and apply pre-trained convolutional layers (learned on public data).

Sanitizer. In order to achieve privacy protection, the sanitizer needs to perform two operations: (1) limit the sensitivity of each individual example by clipping the norm of the gradient for each example; and (2) add noise to the gradient of a batch before updating the network parameters.

In TensorFlow, the gradient computation is batched for performance reasons, yielding gB = (1/|B|) Σ_{x∈B} ∇θ L(θ, x) for a batch B of training examples. To limit the sensitivity of updates, we need to access each individual ∇θ L(θ, x). To this end, we implemented a per_example_gradient operator in TensorFlow, as described by Goodfellow [25]. This operator can compute a batch of individual ∇θ L(θ, x). With this implementation there is only a modest slowdown in training, even for larger batch size. Our current implementation supports batched computation for the loss function L, where each xi is singly connected to L, allowing us to handle most hidden layers but not, for example, convolutional layers.

Once we have the access to the per-example gradient, it is easy to use TensorFlow operators to clip its norm and to add noise.

Privacy accountant. The main component in our implementation is PrivacyAccountant, which keeps track of privacy spending over the course of training. As discussed in Section 3, we implemented the moments accountant that additively accumulates the log of the moments of the privacy loss at each step. Dependent on the noise distribution, one can compute α(λ) by either applying an asymptotic bound, evaluating a closed-form expression, or applying numerical integration. The first option would recover the generic advanced composition theorem, and the latter two give a more accurate accounting of the privacy loss.

For the Gaussian mechanism we use, α(λ) is defined according to Eqs. (3) and (4). In our implementation, we carry out numerical integration to compute both E1 and E2 in those equations. Also we compute α(λ) for a range of λ's so we can compute the best possible (ε, δ) values using Theorem 2.2. We find that for the parameters of interest to us, it suffices to compute α(λ) for λ ≤ 32.

At any point during training, one can query the privacy loss in the more interpretable notion of (ε, δ) privacy using Theorem 2.2. Rogers et al. [47] point out risks associated with adaptive choice of privacy parameters. We avoid their attacks and negative results by fixing the number of iterations and privacy parameters ahead of time. More general implementations of a privacy accountant must correctly distinguish between two modes of operation—as a privacy odometer or a privacy filter (see [47] for more details).
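The following NumPy/SciPy sketch (ours; the released accountant is more careful about numerical accuracy) estimates α(λ) for the sampled Gaussian mechanism by numerical integration of Eqs. (3) and (4), composes over T steps via Theorem 2.1, and applies the tail bound of Theorem 2.2 to obtain ε for a fixed δ:

    import numpy as np
    from scipy import integrate
    from scipy.stats import norm

    def alpha(lam, q, sigma):
        # pdfs of N(0, sigma^2), N(1, sigma^2), and the mixture mu = (1-q) mu0 + q mu1.
        mu0 = lambda z: norm.pdf(z, loc=0.0, scale=sigma)
        mu1 = lambda z: norm.pdf(z, loc=1.0, scale=sigma)
        mu = lambda z: (1 - q) * mu0(z) + q * mu1(z)
        # E1 = E_{z~mu0}[(mu0/mu)^lam], E2 = E_{z~mu}[(mu/mu0)^lam]  (Eqs. (3) and (4)).
        e1, _ = integrate.quad(lambda z: mu0(z) * (mu0(z) / mu(z)) ** lam, -np.inf, np.inf)
        e2, _ = integrate.quad(lambda z: mu(z) * (mu(z) / mu0(z)) ** lam, -np.inf, np.inf)
        return np.log(max(e1, e2))

    def epsilon(q, sigma, T, delta, max_lam=32):
        # Compose the per-step moments over T steps (Theorem 2.1), then apply the
        # tail bound (Theorem 2.2): eps = min over lam of (T*alpha(lam) - log delta) / lam.
        return min((T * alpha(lam, q, sigma) - np.log(delta)) / lam
                   for lam in range(1, max_lam + 1))

With q = 0.01, σ = 4, T = 10000, and δ = 10⁻⁵, this computation should yield an ε in the low single digits, in line with the ε ≈ 1.26 reported in Section 5.1 for those parameters.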
Differentially private PCA. Principal component analysis (PCA) is a useful method for capturing the main features of the input data. We implement the differentially private PCA algorithm as described in [23]. More specifically, we take a random sample of the training examples, treat them as vectors, and normalize each vector to unit ℓ2 norm to form the matrix A, where each vector is a row in the matrix. We then add Gaussian noise to the covariance matrix AᵀA and compute the principal directions of the noisy covariance matrix. Then for each input example we apply the projection to these principal directions before feeding it into the neural network.

We incur a privacy cost due to running a PCA. However, we find it useful for both improving the model quality and for reducing the training time, as suggested by our experiments on the MNIST data. See Section 5 for details.
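A minimal NumPy sketch of this preprocessing step, under our reading of [23] (samples is assumed to be a list of flattened example vectors; the noise standard deviation noise_std is left symbolic, since calibrating it to the sensitivity of AᵀA is part of the privacy analysis and is not shown here):

    import numpy as np

    def dp_pca_projection(samples, k, noise_std):
        # Normalize each sampled example to unit l2 norm; these are the rows of A.
        A = np.array([x / np.linalg.norm(x) for x in samples])
        d = A.shape[1]
        # Add (symmetrized) Gaussian noise to the covariance matrix A^T A.
        noise = np.random.normal(0.0, noise_std, size=(d, d))
        cov = A.T @ A + (noise + noise.T) / 2
        # Principal directions of the noisy covariance: top-k eigenvectors.
        _, vecs = np.linalg.eigh(cov)
        return vecs[:, -k:]  # project an input with x @ vecs[:, -k:]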
Convolutional layers. Convolutional layers are useful for deep neural networks. However, an efficient per-example gradient computation for convolutional layers remains a challenge within the TensorFlow framework, which motivates creating a separate workflow. For example, some recent work argues that even random convolutions often suffice [46, 13, 49, 55, 14].

Alternatively, we explore the idea of learning convolutional layers on public data, following Jarrett et al. [30]. Such convolutional layers can be based on GoogLeNet or AlexNet features [54, 35] for image models or on pretrained word2vec or GloVe embeddings in language models [41, 44].
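As an illustration of this setup (our sketch, not code from the paper): a feature extractor trained on public data is kept frozen and incurs no privacy cost, while only the layers above it are updated with the differentially private step sketched earlier; public_conv_features is a hypothetical helper and dp_sgd_step refers to the earlier sketch:

    def private_training_epoch(theta_dense, lots, public_conv_features,
                               grad_loss, lr, C, sigma):
        # The frozen feature extractor was learned on public data, so only the
        # dense parameters are trained on (and see) the private examples.
        for lot in lots:
            features = [public_conv_features(x) for x in lot]
            theta_dense = dp_sgd_step(theta_dense, features, grad_loss,
                                      lr, C, sigma)
        return theta_dense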
5. EXPERIMENTAL RESULTS

This section reports on our evaluation of the moments accountant, and results on two popular image datasets: MNIST and CIFAR-10.

5.1 Applying the Moments Accountant

As shown by Theorem 1, the moments accountant provides a tighter bound on the privacy loss compared to the generic strong composition theorem. Here we compare them using some concrete values. The overall privacy loss (ε, δ) can be computed from the noise level σ, the sampling ratio of each lot q = L/N (so each epoch consists of 1/q lots), and the number of epochs E (so the number of steps is T = E/q). We fix the target δ = 10⁻⁵, the value used for our MNIST and CIFAR experiments.

In our experiment, we set q = 0.01, σ = 4, and δ = 10⁻⁵, and compute the value of ε as a function of the training epoch E. Figure 2 shows two curves corresponding to, respectively, using the strong composition theorem and the moments accountant. We can see that we get a much tighter estimation of the privacy loss by using the moments accountant. For example, when E = 100, the values are 9.34 and 1.26 respectively, and for E = 400, the values are 24.22 and 2.55 respectively. That is, using the moments bound, we achieve (2.55, 10⁻⁵)-differential privacy, whereas previous techniques only obtain the significantly worse guarantee of (24.22, 10⁻⁵).

Figure 2: The ε value as a function of epoch E for q = 0.01, σ = 4, δ = 10⁻⁵, using the strong composition theorem and the moments accountant respectively.

5.2 MNIST

We conduct experiments on the standard MNIST dataset for handwritten digit recognition consisting of 60,000 training examples and 10,000 testing examples [36]. Each example is a 28 × 28 size gray-level image. We use a simple feed-forward neural network with ReLU units and softmax of 10 classes (corresponding to the 10 digits) with cross-entropy loss and an optional PCA input layer.

Baseline model.
Our baseline model uses a 60-dimensional PCA projection layer and a single hidden layer with 1,000 hidden units. Using the lot size of 600, we can reach accuracy of 98.30% in about 100 epochs. This result is consistent with what can be achieved with a vanilla neural network [36].

Differentially private model.
For the differentially private version, we experiment with the same architecture with a 60-dimensional PCA projection layer, a single 1,000-unit ReLU hidden layer, and a lot size of 600. To limit sensitivity, we clip the gradient norm of each layer at 4. We report results for three choices of the noise scale, which we call small (σ = 2, σp = 4), medium (σ = 4, σp = 7), and large (σ = 8, σp = 16). Here σ represents the noise level for training the neural network, and σp the noise level for PCA projection. The learning rate is set at 0.1 initially and linearly decreased to 0.052 over 10 epochs and then fixed to 0.052 thereafter. We have also experimented with multi-hidden-layer networks. For MNIST, we found that one hidden layer combined with PCA works better than a two-layer network.

Figure 3 shows the results for different noise levels. In each plot, we show the evolution of the training and testing accuracy as a function of the number of epochs as well as the corresponding δ value, keeping ε fixed. We achieve 90%, 95%, and 97% test set accuracy for (0.5, 10⁻⁵), (2, 10⁻⁵), and (8, 10⁻⁵)-differential privacy respectively.

One attractive consequence of applying differentially private SGD is the small difference between the model's accuracy on the training and the test sets, which is consistent with the theoretical argument that differentially private training generalizes well [6]. In contrast, the gap between training and testing accuracy in non-private training, i.e., evidence of overfitting, increases with the number of epochs.

By using the moments accountant, we can obtain a δ value for any given ε. We record the accuracy for different (ε, δ) pairs in Figure 4. In the figure, each curve corresponds to the best accuracy achieved for a fixed δ, as it varies between 10⁻⁵ and 10⁻². For example, we can achieve 90% accuracy for ε = 0.25 and δ = 0.01. As can be observed from the figure, for a fixed δ, varying the value of ε can have large impact on accuracy, but for any fixed ε, there is less difference with different δ values.

Effect of the parameters.
Classification accuracy is determined by multiple factors
Figure 3 (panels: (1) Large noise, (2) Medium noise, (3) Small noise): Results on the accuracy for different noise levels on the MNIST dataset. In all the experiments, the network uses 60 dimension PCA projection, 1,000 hidden units, and is trained using lot size 600 and clipping threshold 4. The noise levels (σ, σp) for training the neural network and for PCA projection are set at (8, 16), (4, 7), and (2, 4), respectively, for the three experiments.

Figure 4: Accuracy of various (ε, δ) privacy values on the MNIST dataset. Each curve corresponds to a different δ value.

that must be carefully tuned for optimal performance. These factors include the topology of the network, the number of PCA dimensions and the number of hidden units, as well as parameters of the training procedure such as the lot size and the learning rate. Some parameters are specific to privacy, such as the gradient norm clipping bound and the noise level.

To demonstrate the effects of these parameters, we manipulate them individually, keeping the rest constant. We set the reference values as follows: 60 PCA dimensions, 1,000 hidden units, 600 lot size, gradient norm bound of 4, initial learning rate of 0.1 decreasing to a final learning rate of 0.052 in 10 epochs, and noise σ equal to 4 and 7 respectively for training the neural network parameters and for the PCA projection. For each combination of values, we train until the point at which (2, 10⁻⁵)-differential privacy would be violated (so, for example, a larger σ allows more epochs of training). The results are presented in Figure 5.

PCA projection. In our experiments, the accuracy is fairly stable as a function of the PCA dimension, with the best results achieved for 60. (Not doing PCA reduces accuracy by about 2%.) Although in principle the PCA projection layer can be replaced by an additional hidden layer, we achieve better accuracy by training the PCA layer separately. By reducing the input size from 784 to 60, PCA leads to an almost 10× reduction in training time. The result is fairly stable over a large range of the noise levels for the PCA projection and consistently better than the accuracy using random projection, which is at about 92.5% and shown as a horizontal line in the plot.

Number of hidden units. Including more hidden units makes it easier to fit the training set. For non-private training, it is often preferable to use more units, as long as we employ techniques to avoid overfitting. However, for differentially private training, it is not a priori clear if more hidden units improve accuracy, as more hidden units increase the sensitivity of the gradient, which leads to more noise added at each update.

Somewhat counterintuitively, increasing the number of hidden units does not decrease accuracy of the trained model. One possible explanation that calls for further analysis is that larger networks are more tolerant to noise. This property is quite encouraging as it is common in practice to use very large networks.

Lot size. According to Theorem 1, we can run N/L epochs while staying within a constant privacy budget. Choosing the lot size must balance two conflicting objectives. On the one hand, smaller lots allow running more epochs, i.e., passes over data, improving accuracy. On the other hand, for a larger lot, the added noise has a smaller relative effect.

Our experiments show that the lot size has a relatively large impact on accuracy. Empirically, the best lot size is roughly √N where N is the number of training examples.

Learning rate. Accuracy is stable for a learning rate in the range of [0.01, 0.07] and peaks at 0.05, as shown in Figure 5(4). However, accuracy decreases significantly if the learning rate is too large. Some additional experiments suggest that, even for large learning rates, we can reach similar levels of accuracy by reducing the noise level and, accordingly, by training less in order to avoid exhausting the privacy budget.
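For concreteness, the reference learning-rate schedule described above (0.1 decayed linearly to 0.052 over the first 10 epochs, then held constant) can be written as the following small sketch; the function name is ours:

    def reference_learning_rate(epoch, initial=0.1, final=0.052, decay_epochs=10):
        # Linear decay from `initial` to `final` over the first `decay_epochs`
        # epochs, then constant afterwards.
        if epoch >= decay_epochs:
            return final
        return initial + (final - initial) * (epoch / decay_epochs)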
Figure 5 (panels: (1) variable projection dimensions, (2) variable hidden units, (3) variable lot size, (4) variable learning rate, (5) variable gradient clipping norm, (6) variable noise level): MNIST accuracy when one parameter varies, and the others are fixed at reference values.
Clipping bound. Limiting the gradient norm has two opposing effects: clipping destroys the unbiasedness of the gradient estimate, and if the clipping parameter is too small, the average clipped gradient may point in a very different direction from the true gradient. On the other hand, increasing the norm bound C forces us to add more noise to the gradients (and hence the parameters), since we add noise based on σC. In practice, a good way to choose a value for C is by taking the median of the norms of the unclipped gradients over the course of training.

Noise level. By adding more noise, the per-step privacy loss is proportionally smaller, so we can run more epochs within a given cumulative privacy budget. In Figure 5(5), the x-axis is the noise level σ. The choice of this value has a large impact on accuracy.

From the experiments, we observe the following.

1. The PCA projection improves both model accuracy and training performance. Accuracy is quite stable over a large range of choices for the projection dimensions and the noise level used in the PCA stage.

2. The accuracy is fairly stable over the network size. When we can only run a smaller number of epochs, it is more beneficial to use a larger network.

3. The training parameters, especially the lot size and the noise scale σ, have a large impact on the model accuracy. They both determine the "noise-to-signal" ratio of the sanitized gradients as well as the number of epochs we are able to go through the data before reaching the privacy limit.

Our framework allows for adaptive control of the training parameters, such as the lot size, the gradient norm bound C, and noise level σ. Our initial experiments with decreasing noise as training progresses did not show a significant improvement, but it is interesting to consider more sophisticated schemes for adaptively choosing these parameters.

5.3 CIFAR

We also conduct experiments on the CIFAR-10 dataset, which consists of color images classified into 10 classes such as ships, cats, and dogs, and partitioned into 50,000 training examples and 10,000 test examples [1]. Each example is a 32 × 32 image with three channels (RGB). For this learning task, nearly all successful networks use convolutional layers. The CIFAR-100 dataset has similar parameters, except that images are classified into 100 classes; the examples and the image classes are different from those of CIFAR-10.

We use the network architecture from the TensorFlow convolutional neural networks tutorial [2]. Each 32 × 32 image is first cropped to a 24 × 24 one by taking the center patch. The network architecture consists of two convolutional layers followed by two fully connected layers. The convolutional layers use 5 × 5 convolutions with stride 1, followed by a ReLU and 2 × 2 max pools, with 64 channels each. Thus the first convolution outputs a 12 × 12 × 64 tensor for each image, and the second outputs a 6 × 6 × 64 tensor. The latter is flattened to a vector that gets fed into a fully connected layer with 384 units, and another one of the same size.

This architecture, non-privately, can get to about 86% accuracy in 500 epochs. Its simplicity makes it an appealing choice for our work. We should note however that by using deeper networks with different non-linearities and other advanced techniques, one can obtain significantly better accuracy, with the state-of-the-art being about 96.5% [26].

As is standard for such image datasets, we use data augmentation during training. For each training image, we generate a new distorted image by randomly picking a 24 × 24 patch from the image, randomly flipping the image along the left-right direction, and randomly distorting the brightness and the contrast of the image. In each epoch, these
Figure 6 (panels: (1) ε = 2, (2) ε = 4, (3) ε = 8): Results on accuracy for different noise levels on CIFAR-10. With δ set to 10⁻⁵, we achieve accuracy 67%, 70%, and 73%, with ε being 2, 4, and 8, respectively. The first graph uses a lot size of 2,000, (2) and (3) use a lot size of 4,000. In all cases, σ is set to 6, and clipping is set to 3.

distortions are done independently. We refer the reader to the TensorFlow tutorial [2] for additional details.

As the convolutional layers have shared parameters, computing per-example gradients has a larger computational overhead. Previous work has shown that convolutional layers are often transferable: parameters learned from one dataset can be used on another one without retraining [30]. We treat the CIFAR-100 dataset as a public dataset and use it to train a network with the same architecture. We use the convolutions learned from training this dataset. Retraining only the fully connected layers with this architecture for about 250 epochs with a batch size of 120 gives us approximately 80% accuracy, which is our non-private baseline.

Differentially private version.
For the differentially private version, we use the same architecture. As discussed above, we use pre-trained convolutional layers. The fully connected layers are initialized from the pre-trained network as well. We train the softmax layer, and either the top or both fully connected layers. Based on looking at gradient norms, the softmax layer gradients are roughly twice as large as the other two layers, and we keep this ratio when we try clipping at a few different values between 3 and 10. The lot size is an additional knob that we tune: we tried 600, 2,000, and 4,000. With these settings, the per-epoch training time increases from approximately 40 seconds to 180 seconds.

In Figure 6, we show the evolution of the accuracy and the privacy cost, as a function of the number of epochs, for a few different parameter settings.

The various parameters influence the accuracy one gets, in ways not too different from that in the MNIST experiments. A lot size of 600 leads to poor results on this dataset and we need to increase it to 2,000 or more for results reported in Figure 6.

Compared to the MNIST dataset, where the difference in accuracy between a non-private baseline and a private model is about 1.3%, the corresponding drop in accuracy in our CIFAR-10 experiment is much larger (about 7%). We leave closing this gap as an interesting test for future research in differentially private machine learning.

6. RELATED WORK

The problem of privacy-preserving data mining, or machine learning, has been a focus of active work in several research communities since the late 90s [5, 37]. The existing literature can be broadly classified along several axes: the class of models, the learning algorithm, and the privacy guarantees.

Privacy guarantees. Early works on privacy-preserving learning were done in the framework of secure function evaluation (SFE) and secure multi-party computations (MPC), where the input is split between two or more parties, and the focus is on minimizing information leaked during the joint computation of some agreed-to functionality. In contrast, we assume that data is held centrally, and we are concerned with leakage from the functionality's output (i.e., the model).

Another approach, k-anonymity and closely related notions [53], seeks to offer a degree of protection to underlying data by generalizing and suppressing certain identifying attributes. The approach has strong theoretical and empirical limitations [4, 9] that make it all but inapplicable to de-anonymization of high-dimensional, diverse input datasets. Rather than pursue input sanitization, we keep the underlying raw records intact and perturb derived data instead.

The theory of differential privacy, which provides the analytical framework for our work, has been applied to a large collection of machine learning tasks that differed from ours either in the training mechanism or in the target model. The moments accountant is closely related to the notion of Rényi differential privacy [42], which proposes (scaled) α(λ) as a means of quantifying privacy guarantees. In a concurrent and independent work Bun and Steinke [10] introduce a relaxation of differential privacy (generalizing the work of Dwork and Rothblum [21]) defined via a linear upper bound on α(λ). Taken together, these works demonstrate that the moments accountant is a useful technique for theoretical and empirical analyses of complex privacy-preserving algorithms.

Learning algorithm. A common target for learning with privacy is a class of convex optimization problems amenable to a wide variety of techniques [18, 11, 34]. In concurrent work, Wu et al. achieve 83% accuracy on MNIST via convex empirical risk minimization [57]. Training multi-layer neural networks is non-convex, and typically solved by an application of SGD, whose theoretical guarantees are poorly understood.
For the CIFAR neural network we incorporate differentially private training of the PCA projection matrix [23], which is used to reduce dimensionality of inputs.

Model class. The first end-to-end differentially private system was evaluated on the Netflix Prize dataset [39], a version of a collaborative filtering problem. Although the problem shared many similarities with ours—high-dimensional inputs, non-convex objective function—the approach taken by McSherry and Mironov differed significantly. They identified the core of the learning task, effectively sufficient statistics, that can be computed in a differentially private manner via a Gaussian mechanism. In our approach no such sufficient statistics exist.

In a recent work Shokri and Shmatikov [50] designed and evaluated a system for distributed training of a deep neural network. Participants, who hold their data closely, communicate sanitized updates to a central authority. The sanitization relies on an additive-noise mechanism, based on a sensitivity estimate, which could be improved to a hard sensitivity guarantee. They compute privacy loss per parameter (not for an entire model). By our preferred measure, the total privacy loss per participant on the MNIST dataset exceeds several thousand.

A different, recent approach towards differentially private deep learning is explored by Phan et al. [45]. This work focuses on learning autoencoders. Privacy is based on perturbing the objective functions of these autoencoders.

7. CONCLUSIONS

We demonstrate the training of deep neural networks with differential privacy, incurring a modest total privacy loss, computed over entire models with many parameters. In our experiments for MNIST, we achieve 97% training accuracy and for CIFAR-10 we achieve 73% accuracy, both with (8, 10⁻⁵)-differential privacy. Our algorithms are based on a differentially private version of stochastic gradient descent; they run on the TensorFlow software library for machine learning. Since our approach applies directly to gradient computations, it can be adapted to many other classical and more recent first-order optimization methods, such as NAG [43], Momentum [48], AdaGrad [15], or SVRG [31].

A new tool, which may be of independent interest, is a mechanism for tracking privacy loss, the moments accountant. It permits tight automated analysis of the privacy loss of complex composite mechanisms that are currently beyond the reach of advanced composition theorems.

A number of avenues for further work are attractive. In particular, we would like to consider other classes of deep networks. Our experience with MNIST and CIFAR-10 should be helpful, but we see many opportunities for new research, for example in applying our techniques to LSTMs used for language modeling tasks. In addition, we would like to obtain additional improvements in accuracy. Many training datasets are much larger than those of MNIST and CIFAR-10; accuracy should benefit from their size.

8. ACKNOWLEDGMENTS

We are grateful to Úlfar Erlingsson and Dan Ramage for many useful discussions, and to Mark Bun and Thomas Steinke for sharing a draft of [10].

9. REFERENCES

[1] CIFAR-10 and CIFAR-100 datasets. www.cs.toronto.edu/~kriz/cifar.html.
[2] TensorFlow convolutional neural networks tutorial. www.tensorflow.org/tutorials/deep_cnn.
[3] TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[4] C. C. Aggarwal. On k-anonymity and the curse of dimensionality. In VLDB, pages 901–909, 2005.
[5] R. Agrawal and R. Srikant. Privacy-preserving data mining. In SIGMOD, pages 439–450. ACM, 2000.
[6] R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. Ullman. Algorithmic stability for adaptive data analysis. In STOC, pages 1046–1059. ACM, 2016.
[7] R. Bassily, A. D. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464–473. IEEE, 2014.
[8] A. Beimel, H. Brenner, S. P. Kasiviswanathan, and K. Nissim. Bounds on the sample complexity for private learning and private data release. Machine Learning, 94(3):401–437, 2014.
[9] J. Brickell and V. Shmatikov. The cost of privacy: Destruction of data-mining utility in anonymized data publishing. In KDD, pages 70–78. ACM, 2008.
[10] M. Bun and T. Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In TCC-B.
[11] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. J. Machine Learning Research, 12:1069–1109, 2011.
[12] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
[13] D. D. Cox and N. Pinto. Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In FG 2011, pages 8–15. IEEE, 2011.
[14] A. Daniely, R. Frostig, and Y. Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. CoRR, abs/1602.05897, 2016.
[15] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Machine Learning Research, 12:2121–2159, July 2011.
[16] C. Dwork. A firm foundation for private data analysis. Commun. ACM, 54(1):86–95, Jan. 2011.
[17] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486–503. Springer, 2006.
[18] C. Dwork and J. Lei. Differential privacy and robust statistics. In STOC, pages 371–380. ACM, 2009.
[19] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284. Springer, 2006.
[20] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
[21] C. Dwork and G. N. Rothblum. Concentrated differential privacy. CoRR, abs/1603.01887, 2016.
[22] C. Dwork, G. N. Rothblum, and S. Vadhan. Boosting and differential privacy. In FOCS, pages 51–60. IEEE, 2010.
[23] C. Dwork, K. Talwar, A. Thakurta, and L. Zhang. Analyze Gauss: Optimal bounds for privacy-preserving principal component analysis. In STOC, pages 11–20. ACM, 2014.
[24] M. Fredrikson, S. Jha, and T. Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In CCS, pages 1322–1333. ACM, 2015.
[25] I. Goodfellow. Efficient per-example gradient computations. CoRR, abs/1510.01799v2, 2015.
[26] B. Graham. Fractional max-pooling. CoRR, abs/1412.6071, 2014.
[27] A. Gupta, K. Ligett, F. McSherry, A. Roth, and K. Talwar. Differentially private combinatorial optimization. In SODA, pages 1106–1125, 2010.
[28] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, pages 1026–1034. IEEE, 2015.
[29] R. Ierusalimschy, L. H. de Figueiredo, and W. Filho. Lua—an extensible extension language. Software: Practice and Experience, 26(6):635–652, 1996.
[30] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV, pages 2146–2153. IEEE, 2009.
[31] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.
[32] P. Kairouz, S. Oh, and P. Viswanath. The composition theorem for differential privacy. In ICML, pages 1376–1385. ACM, 2015.
[33] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? SIAM J. Comput., 40(3):793–826, 2011.
[34] D. Kifer, A. D. Smith, and A. Thakurta. Private convex optimization for empirical risk minimization with applications to high-dimensional regression. In COLT, pages 25.1–25.40, 2012.
[35] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[36] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
[37] Y. Lindell and B. Pinkas. Privacy preserving data mining. In CRYPTO, pages 36–54. Springer, 2000.
[38] C. J. Maddison, A. Huang, I. Sutskever, and D. Silver. Move evaluation in Go using deep convolutional neural networks. In ICLR, 2015.
[39] F. McSherry and I. Mironov. Differentially private recommender systems: Building privacy into the Netflix Prize contenders. In KDD, pages 627–636. ACM, 2009.
[40] F. D. McSherry. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In SIGMOD, pages 19–30. ACM, 2009.
[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[42] I. Mironov. Rényi differential privacy. Private communication, 2016.
[43] Y. Nesterov. Introductory Lectures on Convex Optimization. A Basic Course. Springer, 2004.
[44] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
[45] N. Phan, Y. Wang, X. Wu, and D. Dou. Differential privacy preservation for deep auto-encoders: an application of human behavior prediction. In AAAI, pages 1309–1316, 2016.
[46] N. Pinto, Z. Stone, T. E. Zickler, and D. Cox. Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on Facebook. In CVPR, pages 35–42. IEEE, 2011.
[47] R. M. Rogers, A. Roth, J. Ullman, and S. P. Vadhan. Privacy odometers and filters: Pay-as-you-go composition. In NIPS.
[48] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, Oct. 1986.
[49] A. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Ng. On random weights and unsupervised feature learning. In ICML, pages 1089–1096. ACM, 2011.
[50] R. Shokri and V. Shmatikov. Privacy-preserving deep learning. In CCS, pages 1310–1321. ACM, 2015.
[51] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[52] S. Song, K. Chaudhuri, and A. Sarwate. Stochastic gradient descent with differentially private updates. In GlobalSIP Conference, 2013.
[53] L. Sweeney. k-anonymity: A model for protecting privacy. International J. of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002.
[54] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9. IEEE, 2015.
[55] S. Tu, R. Roelofs, S. Venkataraman, and B. Recht. Large scale kernel learning using block coordinate descent. CoRR, abs/1602.05310, 2016.
[56] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. E. Hinton. Grammar as a foreign language. In NIPS, pages 2773–2781, 2015.
[57] X. Wu, A. Kumar, K. Chaudhuri, S. Jha, and J. F. Naughton. Differentially private stochastic gradient descent for in-RDBMS analytics. CoRR, abs/1606.04722, 2016.
APPENDIX

A. PROOF OF THEOREM 2
Here we restate and prove Theorem 2.

Theorem 2. Let α_M(λ) be defined as

  α_M(λ) = max_{aux, d, d′} α_M(λ; aux, d, d′),

where the maximum is taken over all auxiliary inputs and neighboring databases d, d′. Then

1. [Composability] Suppose that a mechanism M consists of a sequence of adaptive mechanisms M_1, ..., M_k where M_i : ∏_{j=1}^{i−1} R_j × D → R_i. Then, for any λ,

  α_M(λ) ≤ Σ_{i=1}^{k} α_{M_i}(λ).

2. [Tail bound] For any ε > 0, the mechanism M is (ε, δ)-differentially private for

  δ = min_λ exp(α_M(λ) − λε).

Proof. Composition of moments. For brevity, let M_{1:i} denote (M_1, ..., M_i), and similarly let o_{1:i} denote (o_1, ..., o_i). For neighboring databases d, d′ ∈ D^n, and a sequence of outcomes o_1, ..., o_k we write

  c(o_{1:k}; M_{1:k}, o_{1:(k−1)}, d, d′)
    = log ( Pr[M_{1:k}(d; o_{1:(k−1)}) = o_{1:k}] / Pr[M_{1:k}(d′; o_{1:(k−1)}) = o_{1:k}] )
    = log ∏_{i=1}^{k} ( Pr[M_i(d) = o_i | M_{1:(i−1)}(d) = o_{1:(i−1)}] / Pr[M_i(d′) = o_i | M_{1:(i−1)}(d′) = o_{1:(i−1)}] )
    = Σ_{i=1}^{k} log ( Pr[M_i(d) = o_i | M_{1:(i−1)}(d) = o_{1:(i−1)}] / Pr[M_i(d′) = o_i | M_{1:(i−1)}(d′) = o_{1:(i−1)}] )
    = Σ_{i=1}^{k} c(o_i; M_i, o_{1:(i−1)}, d, d′).

Thus

  E_{o′_{1:k} ∼ M_{1:k}(d)} [ exp(λ c(o′_{1:k}; M_{1:k}, d, d′)) | ∀i < k : o′_i = o_i ]
    = E_{o′_{1:k} ∼ M_{1:k}(d)} [ exp( λ Σ_{i=1}^{k} c(o′_i; M_i, o_{1:(i−1)}, d, d′) ) ]
    = E_{o′_{1:k} ∼ M_{1:k}(d)} [ ∏_{i=1}^{k} exp( λ c(o′_i; M_i, o_{1:(i−1)}, d, d′) ) ]
    = ∏_{i=1}^{k} E_{o′_i ∼ M_i(d)} [ exp( λ c(o′_i; M_i, o_{1:(i−1)}, d, d′) ) ]   (by independence of noise)
    = ∏_{i=1}^{k} exp( α_{M_i}(λ; o_{1:(i−1)}, d, d′) )
    = exp( Σ_{i=1}^{k} α_{M_i}(λ; o_{1:(i−1)}, d, d′) ).

The claim follows.

Tail bound by moments. The proof is based on the standard Markov's inequality argument used in proofs of measure concentration. We have

  Pr_{o ∼ M(d)}[c(o) ≥ ε] = Pr_{o ∼ M(d)}[exp(λ c(o)) ≥ exp(λε)]
    ≤ E_{o ∼ M(d)}[exp(λ c(o))] / exp(λε)
    ≤ exp(α − λε).

Let B = {o : c(o) ≥ ε}. Then for any S,

  Pr[M(d) ∈ S] = Pr[M(d) ∈ S ∩ B^c] + Pr[M(d) ∈ S ∩ B]
    ≤ exp(ε) Pr[M(d′) ∈ S ∩ B^c] + Pr[M(d) ∈ B]
    ≤ exp(ε) Pr[M(d′) ∈ S] + exp(α − λε).

The second part follows by an easy calculation.

The proof demonstrates a tail bound on the privacy loss, making it stronger than differential privacy for a fixed value of ε, δ.
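To make the two parts of Theorem 2 concrete, the following Python sketch shows one way a moments accountant based on them could be organized: per-step log moment bounds α_{M_i}(λ) are accumulated over a fixed grid of integer λ (composability), and the tail bound is then minimized over that grid to report δ for a target ε, or ε for a target δ. This is only an illustrative sketch under the assumption that valid per-step bounds are supplied by the caller; the class and method names are ours, and this is not the TensorFlow implementation described in the paper.

    import math

    class MomentsAccountant:
        """Minimal sketch of a moments accountant built on Theorem 2 (illustrative only)."""

        def __init__(self, lambdas=range(1, 33)):
            # log_moments[lam] accumulates an upper bound on alpha_M(lam) for the composed mechanism.
            self.log_moments = {lam: 0.0 for lam in lambdas}

        def accumulate_step(self, step_log_moment_fn):
            # Composability: alpha_M(lam) <= sum_i alpha_{M_i}(lam).
            for lam in self.log_moments:
                self.log_moments[lam] += step_log_moment_fn(lam)

        def get_delta(self, eps):
            # Tail bound: delta = min_lam exp(alpha_M(lam) - lam * eps), over the tracked lambdas.
            return min(math.exp(alpha - lam * eps)
                       for lam, alpha in self.log_moments.items())

        def get_epsilon(self, delta):
            # Equivalent use of the tail bound for a fixed delta: the smallest eps such that
            # exp(alpha_M(lam) - lam * eps) <= delta for some tracked lam.
            return min((alpha + math.log(1.0 / delta)) / lam
                       for lam, alpha in self.log_moments.items())

A per-step bound such as the one given by Lemma 3 below can be plugged in as step_log_moment_fn.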
B. PROOF OF LEMMA 3
The proof of the main theorem relies on the following moments bound on the Gaussian mechanism with random sampling.

Lemma 3. Suppose that f : D → R^p with ‖f(·)‖_2 ≤ 1. Let σ ≥ 1 and let J be a sample from [n] where each i ∈ [n] is chosen independently with probability q < 1/(16σ). Then for any positive integer λ ≤ σ² ln(1/(qσ)), the mechanism M(d) = Σ_{i∈J} f(d_i) + N(0, σ²I) satisfies

  α_M(λ) ≤ q²λ(λ + 1) / ((1 − q)σ²) + O(q³λ³/σ³).

Proof. Fix d′ and let d = d′ ∪ {d_n}. Without loss of generality, f(d_n) = e_1 and Σ_{i∈J\{n}} f(d_i) = 0. Thus M(d) and M(d′) are distributed identically except for the first coordinate and hence we have a one-dimensional problem. Let µ_0 denote the pdf of N(0, σ²) and let µ_1 denote the pdf of N(1, σ²). Thus:

  M(d′) ∼ µ_0,
  M(d) ∼ µ := (1 − q)µ_0 + qµ_1.

We want to show that

  E_{z∼µ}[(µ(z)/µ_0(z))^λ] ≤ α,  and  E_{z∼µ_0}[(µ_0(z)/µ(z))^λ] ≤ α,

for some explicit α to be determined later.

We will use the same method to prove both bounds. Assume we have two distributions ν_0 and ν_1, and we wish to bound

  E_{z∼ν_0}[(ν_0(z)/ν_1(z))^λ] = E_{z∼ν_1}[(ν_0(z)/ν_1(z))^{λ+1}].
Using binomial expansion, we have

  E_{z∼ν_1}[(ν_0(z)/ν_1(z))^{λ+1}]
    = E_{z∼ν_1}[(1 + (ν_0(z) − ν_1(z))/ν_1(z))^{λ+1}]
    = Σ_{t=0}^{λ+1} (λ+1 choose t) E_{z∼ν_1}[((ν_0(z) − ν_1(z))/ν_1(z))^t].   (5)

The first term in (5) is 1, and the second term is

  E_{z∼ν_1}[ (ν_0(z) − ν_1(z))/ν_1(z) ]
    = ∫_{−∞}^{∞} ν_1(z) · (ν_0(z) − ν_1(z))/ν_1(z) dz
    = ∫_{−∞}^{∞} ν_0(z) dz − ∫_{−∞}^{∞} ν_1(z) dz
    = 1 − 1 = 0.

To prove the lemma it suffices to show that for both ν_0 = µ, ν_1 = µ_0 and ν_0 = µ_0, ν_1 = µ, the third term is bounded by q²λ(λ + 1)/((1 − q)σ²) and that this bound dominates the sum of the remaining terms. We will prove the more difficult second case (ν_0 = µ_0, ν_1 = µ); the proof of the other case is similar.

To upper bound the third term in (5), we note that µ(z) ≥ (1 − q)µ_0(z), and write

  E_{z∼µ}[ ((µ_0(z) − µ(z))/µ(z))² ]
    = q² E_{z∼µ}[ ((µ_0(z) − µ_1(z))/µ(z))² ]
    = q² ∫_{−∞}^{∞} (µ_0(z) − µ_1(z))² / µ(z) dz
    ≤ (q²/(1 − q)) ∫_{−∞}^{∞} (µ_0(z) − µ_1(z))² / µ_0(z) dz
    = (q²/(1 − q)) E_{z∼µ_0}[ ((µ_0(z) − µ_1(z))/µ_0(z))² ].

An easy fact is that for any a ∈ R, E_{z∼µ_0} exp(2az/(2σ²)) = exp(a²/(2σ²)). Thus,

  E_{z∼µ_0}[ ((µ_0(z) − µ_1(z))/µ_0(z))² ]
    = E_{z∼µ_0}[ (1 − exp((2z − 1)/(2σ²)))² ]
    = 1 − 2 E_{z∼µ_0}[ exp((2z − 1)/(2σ²)) ] + E_{z∼µ_0}[ exp((4z − 2)/(2σ²)) ]
    = 1 − 2 exp(1/(2σ²)) · exp(−1/(2σ²)) + exp(4/(2σ²)) · exp(−2/(2σ²))
    = exp(1/σ²) − 1.

Thus the third term in the binomial expansion (5) satisfies

  (λ+1 choose 2) E_{z∼µ}[ ((µ_0(z) − µ(z))/µ(z))² ] ≤ λ(λ + 1)q² / ((1 − q)σ²).

To bound the remaining terms, we first note that by standard calculus, we get:

  ∀z ≤ 0 :     |µ_0(z) − µ_1(z)| ≤ −(z − 1)µ_0(z)/σ²,
  ∀z ≥ 1 :     |µ_0(z) − µ_1(z)| ≤ zµ_1(z)/σ²,
  ∀0 ≤ z ≤ 1 : |µ_0(z) − µ_1(z)| ≤ µ_0(z)(exp(1/(2σ²)) − 1) ≤ µ_0(z)/σ².

We can then write

  E_{z∼µ}[ ((µ_0(z) − µ(z))/µ(z))^t ]
    ≤ ∫_{−∞}^{0} µ(z) |(µ_0(z) − µ(z))/µ(z)|^t dz
      + ∫_{0}^{1} µ(z) |(µ_0(z) − µ(z))/µ(z)|^t dz
      + ∫_{1}^{∞} µ(z) |(µ_0(z) − µ(z))/µ(z)|^t dz.

We consider these terms individually. We repeatedly make use of three observations: (1) µ_0 − µ = q(µ_0 − µ_1), (2) µ ≥ (1 − q)µ_0, and (3) E_{µ_0}[|z|^t] ≤ σ^t (t − 1)!!. The first term can then be bounded by

  (q^t / ((1 − q)^{t−1} σ^{2t})) ∫_{−∞}^{0} µ_0(z)|z − 1|^t dz ≤ (2q)^t (t − 1)!! / (2(1 − q)^{t−1} σ^t).

The second term is at most

  (q^t / (1 − q)^t) ∫_{0}^{1} µ(z) |(µ_0(z) − µ_1(z))/µ_0(z)|^t dz
    ≤ (q^t / (1 − q)^t) ∫_{0}^{1} µ(z) (1/σ^{2t}) dz
    ≤ q^t / ((1 − q)^t σ^{2t}).

Similarly, the third term is at most

  (q^t / ((1 − q)^{t−1} σ^{2t})) ∫_{1}^{∞} µ_0(z) (zµ_1(z)/µ_0(z))^t dz
    ≤ (q^t / ((1 − q)^{t−1} σ^{2t})) ∫_{1}^{∞} µ_0(z) exp((2tz − t)/(2σ²)) z^t dz
    ≤ (q^t exp((t² − t)/(2σ²)) / ((1 − q)^{t−1} σ^{2t})) ∫_{0}^{∞} µ_0(z − t) z^t dz
    ≤ (2q)^t exp((t² − t)/(2σ²)) (σ^t (t − 1)!! + t^t) / (2(1 − q)^{t−1} σ^{2t}).

Under the assumptions on q, σ, and λ, it is easy to check that the three terms, and their sum, drop off geometrically fast in t for t > 3. Hence the binomial expansion (5) is dominated by the t = 3 term, which is O(q³λ³/σ³). The claim follows.
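The bound of Lemma 3 lends itself to a quick numerical sanity check. The sketch below (an illustration under our own choice of parameters, not part of the paper's implementation) computes the log moment of the one-dimensional mixture µ = (1 − q)µ_0 + qµ_1 from the proof by numerical integration and compares it with the leading term q²λ(λ + 1)/((1 − q)σ²); the parameters are chosen to satisfy q < 1/(16σ) and λ ≤ σ² ln(1/(qσ)).

    import numpy as np
    from scipy import integrate
    from scipy.stats import norm

    q, sigma, lam = 0.01, 4.0, 8   # chosen so that q < 1/(16*sigma) and lam <= sigma^2 * ln(1/(q*sigma))

    mu0 = norm(loc=0.0, scale=sigma).pdf   # pdf of N(0, sigma^2)
    mu1 = norm(loc=1.0, scale=sigma).pdf   # pdf of N(1, sigma^2)
    mu = lambda z: (1 - q) * mu0(z) + q * mu1(z)

    # log E_{z~mu}[(mu(z)/mu0(z))^lam], one of the two moments bounded in the proof.
    # The finite integration range is wide enough for sigma = 4 that the truncation error is negligible.
    val, _ = integrate.quad(lambda z: mu(z) * (mu(z) / mu0(z)) ** lam, -60, 60)
    log_moment = np.log(val)

    leading_term = q ** 2 * lam * (lam + 1) / ((1 - q) * sigma ** 2)
    print(log_moment, leading_term)   # for these parameters the computed log moment lies below the leading term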
To derive Theorem 1, we use the above moments bound along with the tail bound from Theorem 2, optimizing over the choice of λ.

Theorem 1. There exist constants c_1 and c_2 so that given the sampling probability q = L/N and the number of steps T, for any ε < c_1 q²T, Algorithm 1 is (ε, δ)-differentially private for any δ > 0 if we choose

  σ ≥ c_2 q √(T log(1/δ)) / ε.

Proof. Assume for now that σ, λ satisfy the conditions in Lemma 3. By Theorem 2.1 and Lemma 3, the log moment of Algorithm 1 can be bounded as follows: α(λ) ≤ Tq²λ²/σ². By Theorem 2, to guarantee Algorithm 1 to be (ε, δ)-differentially private, it suffices that

  Tq²λ²/σ² ≤ λε/2,
  exp(−λε/2) ≤ δ.

In addition, we need

  λ ≤ σ² log(1/(qσ)).

It is now easy to verify that when ε = c_1 q²T, we can satisfy all these conditions by setting

  σ = c_2 q √(T log(1/δ)) / ε

for some explicit constants c_1 and c_2.
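The recipe in this proof is easy to carry out numerically. The sketch below is illustrative only: the helper names are ours, and it uses the leading term of Lemma 3 with the higher-order O(q³λ³/σ³) term dropped, so the value it prints is an approximation rather than a certified guarantee. It composes the per-step bound over T steps and applies the tail bound of Theorem 2 to report an ε for a given δ.

    import math

    def lemma3_log_moment(q, sigma, lam):
        """Leading term of the per-step moment bound from Lemma 3 (higher-order term dropped)."""
        assert q < 1.0 / (16 * sigma) and lam <= sigma ** 2 * math.log(1.0 / (q * sigma))
        return q ** 2 * lam * (lam + 1) / ((1 - q) * sigma ** 2)

    def epsilon_for(q, sigma, T, delta, max_lam=64):
        """Compose T steps (Theorem 2, part 1) and convert to epsilon via the tail bound (part 2)."""
        best = float("inf")
        for lam in range(1, max_lam + 1):
            if lam > sigma ** 2 * math.log(1.0 / (q * sigma)):
                break                                                # outside the range where Lemma 3 applies
            alpha = T * lemma3_log_moment(q, sigma, lam)             # composability
            best = min(best, (alpha + math.log(1.0 / delta)) / lam)  # tail bound
        return best

    # Example with sampling ratio q = 0.01, sigma = 4, T = 10,000 steps, delta = 1e-5.
    print(epsilon_for(q=0.01, sigma=4.0, T=10000, delta=1e-5))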
C. FROM DIFFERENTIAL PRIVACY TO MOMENTS BOUNDS
One can also translate a differential privacy guarantee into a moment bound.

Lemma C.1. Let M be ε-differentially private. Then for any λ > 0, M satisfies

  α_λ ≤ λε(e^ε − 1) + λ²ε²e^{2ε}/2.

Proof. Let Z denote the random variable c(M(d)). Then differential privacy implies that
  • µ = E[Z] ≤ ε(exp(ε) − 1);
  • |Z| ≤ ε, so that |Z − µ| ≤ ε exp(ε).
Then E[exp(λZ)] = exp(λµ) · E[exp(λ(Z − µ))]. Since Z is in a bounded range [−ε exp(ε), ε exp(ε)] and f(x) = exp(λx) is convex, we can bound f(x) by a linear interpolation between the values at the two endpoints of the range. Basic calculus then implies that

  E[f(Z)] ≤ f(E[Z]) · exp(λ²ε² exp(2ε)/2),

which concludes the proof.

Lemma C.1 and Theorem 2 give a way of getting a composition theorem for differentially private mechanisms, which is roughly equivalent to unrolling the proof of the strong composition theorem of [22]. The power of the moments accountant comes from the fact that, for many mechanisms of choice, directly bounding the moments gives a stronger guarantee than one would get by establishing differential privacy and applying Lemma C.1.
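As a rough illustration of this remark, the following sketch (with hypothetical parameters and helper names of our choosing, not an experiment from the paper) feeds the bound of Lemma C.1 into the composability and tail-bound machinery of Theorem 2 for k copies of an ε-differentially private mechanism, and compares the result with the basic composition bound kε.

    import math

    def eps_dp_log_moment(eps, lam):
        # Lemma C.1: alpha_lambda <= lam*eps*(e^eps - 1) + lam^2 * eps^2 * e^(2*eps) / 2.
        return lam * eps * (math.exp(eps) - 1) + lam ** 2 * eps ** 2 * math.exp(2 * eps) / 2

    def compose_eps_dp(eps, k, delta, max_lam=128):
        # Theorem 2: sum the per-mechanism moment bounds, then apply the tail bound.
        return min((k * eps_dp_log_moment(eps, lam) + math.log(1.0 / delta)) / lam
                   for lam in range(1, max_lam + 1))

    eps, k, delta = 0.1, 100, 1e-5
    print(compose_eps_dp(eps, k, delta), k * eps)   # for these values the moments route beats basic composition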


D. HYPERPARAMETER SEARCH
Here we state Theorem 10.2 from [27] that we use to account for the cost of hyperparameter search.

Theorem D.1 (Gupta et al. [27]). Let M be an ε-differentially private mechanism such that for a query function q with sensitivity 1, and a parameter Q, it holds that Pr_{r∼M(d)}[q(d, r) ≥ Q] ≥ p for some p ∈ (0, 1). Then for any δ > 0 and any ε_0 ∈ (0, 1/2), there is a mechanism M′ which satisfies the following properties:
  • Pr_{r∼M′(d)}[ q(d, r) ≥ Q − (4/ε_0) log(1/(ε_0 δp)) ] ≥ 1 − δ.
  • M′ makes (1/(ε_0 δp))² log(1/(ε_0 δp)) calls to M.
  • M′ is (ε + 8ε_0)-differentially private.

Suppose that we have a differentially private mechanism M_i for each of K choices of hyperparameters. Let M̃ be the mechanism that picks a random choice of hyperparameters, and runs the corresponding M_i. Let q(d, r) denote the number of examples from the validation set that r labels correctly, and let Q be a target accuracy. Assuming that one of the hyperparameter settings gets accuracy at least Q, M̃ satisfies the pre-conditions of the theorem for p = 1/K. Then with high probability, the mechanism implied by the theorem gets accuracy close to Q. We remark that the proof of Theorem D.1 actually implies a stronger max(ε, 8ε_0)-differential privacy for the setting of interest here.

Putting in some numbers, for a target accuracy of 95% on a validation set of size 10,000, we get Q = 9,500. Thus, if, for instance, we allow ε_0 = 0.5 and δ = 0.05, we lose at most 1% in accuracy as long as 100 > 8 ln(40/p). This is satisfied as long as p ≥ 1/6700. In other words, one can try 6,700 different parameter settings at privacy cost ε = 4 for the validation set. In our experiments, we tried no more than a hundred settings, so that this bound is easily satisfied. In practice, as our graphs show, p for our hyperparameter search is significantly larger than 1/K, so that a slightly smaller ε_0 should suffice.
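The arithmetic in the preceding paragraph can be reproduced in a few lines; the sketch below (hypothetical helper name, assuming nothing beyond the formulas above) evaluates the guarantees of Theorem D.1 for the stated values.

    import math

    def hyperparameter_search_cost(eps0, delta, p, Q):
        inv = 1.0 / (eps0 * delta * p)
        accuracy_loss = (4.0 / eps0) * math.log(inv)   # first bullet of Theorem D.1
        num_calls = inv ** 2 * math.log(inv)           # second bullet
        extra_privacy = 8 * eps0                       # additive privacy cost from the third bullet
        return Q - accuracy_loss, num_calls, extra_privacy

    # Values from the text: Q = 9,500 on a validation set of 10,000, eps0 = 0.5, delta = 0.05, p = 1/6700.
    print(hyperparameter_search_cost(eps0=0.5, delta=0.05, p=1.0 / 6700, Q=9500))
    # The accuracy-loss term 8*ln(40/p) comes out just below 100, i.e., at most 1% of the validation set.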
