Foundations and Trends® in Theoretical Computer Science
Vol. 9, Nos. 3–4 (2014) 211–407
© 2014 C. Dwork and A. Roth
DOI: 10.1561/0400000042

The Algorithmic Foundations of Differential Privacy

Cynthia Dwork, Microsoft Research, USA
Aaron Roth, University of Pennsylvania, USA

Contents

Preface
1 The Promise of Differential Privacy
  1.1 Privacy-preserving data analysis
  1.2 Bibliographic notes
2 Basic Terms
  2.1 The model of computation
  2.2 Towards defining private data analysis
  2.3 Formalizing differential privacy
  2.4 Bibliographic notes
3 Basic Techniques and Composition Theorems
  3.1 Useful probabilistic tools
  3.2 Randomized response
  3.3 The Laplace mechanism
  3.4 The exponential mechanism
  3.5 Composition theorems
  3.6 The sparse vector technique
  3.7 Bibliographic notes
4 Releasing Linear Queries with Correlated Error
  4.1 An offline algorithm: SmallDB
  4.2 An online mechanism: private multiplicative weights
  4.3 Bibliographical notes
5 Generalizations
  5.1 Mechanisms via α-nets
  5.2 The iterative construction mechanism
  5.3 Connections
  5.4 Bibliographical notes
6 Boosting for Queries
  6.1 The boosting for queries algorithm
  6.2 Base synopsis generators
  6.3 Bibliographical notes
7 When Worst-Case Sensitivity is Atypical
  7.1 Subsample and aggregate
  7.2 Propose-test-Release
  7.3 Stability and privacy
8 Lower Bounds and Separation Results
  8.1 Reconstruction attacks
  8.2 Lower bounds for differential privacy
  8.3 Bibliographic notes
9 Differential Privacy and Computational Complexity
  9.1 Polynomial time curators
  9.2 Some hard-to-Syntheticize distributions
  9.3 Polynomial time adversaries
  9.4 Bibliographic notes
10 Differential Privacy and Mechanism Design
  10.1 Differential privacy as a solution concept
  10.2 Differential privacy as a tool in mechanism design
  10.3 Mechanism design for privacy aware agents
  10.4 Bibliographical notes
11 Differential Privacy and Machine Learning
  11.1 The sample complexity of differentially private machine learning
  11.2 Differentially private online learning
  11.3 Empirical risk minimization
  11.4 Bibliographical notes
12 Additional Models
  12.1 The local model
  12.2 Pan-private streaming model
  12.3 Continual observation
  12.4 Average case error for query release
  12.5 Bibliographical notes
13 Reflections
  13.1 Toward practicing privacy
  13.2 The differential privacy lens
Appendices
A The Gaussian Mechanism
  A.1 Bibliographic notes
B Composition Theorems for (ε, δ)-DP
  B.1 Extension of Theorem 3.16
Acknowledgments
References

Abstract

The problem of privacy-preserving data analysis has a long history spanning multiple disciplines. As electronic data about individuals becomes increasingly detailed, and as technology enables ever more powerful collection and curation of these data, the need increases for a robust, meaningful, and mathematically rigorous definition of privacy, together with a computationally rich class of algorithms that satisfy this definition. Differential Privacy is such a definition.

After motivating and discussing the meaning of differential privacy, the preponderance of this monograph is devoted to fundamental techniques for achieving differential privacy, and application of these techniques in creative combinations, using the query-release problem as an ongoing example. A key point is that, by rethinking the computational goal, one can often obtain far better results than would be achieved by methodically replacing each step of a non-private computation with a differentially private implementation. Despite some astonishingly powerful computational results, there are still fundamental limitations, not just on what can be achieved with differential privacy but on what can be achieved with any method that protects against a complete breakdown in privacy. Virtually all the algorithms discussed herein maintain differential privacy against adversaries of arbitrary computational power. Certain algorithms are computationally intensive, others are efficient. Computational complexity for the adversary and the algorithm are both discussed.

We then turn from fundamentals to applications other than query-release, discussing differentially private methods for mechanism design and machine learning. The vast majority of the literature on differentially private algorithms considers a single, static database that is subject to many analyses. Differential privacy in other models, including distributed databases and computations on data streams, is discussed.

Finally, we note that this work is meant as a thorough introduction to the problems and techniques of differential privacy, but is not intended to be an exhaustive survey: there is by now a vast amount of work in differential privacy, and we can cover only a small portion of it.

C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends® in Theoretical Computer Science, vol. 9, nos. 3–4, pp. 211–407, 2014. DOI: 10.1561/0400000042.

Preface

The problem of privacy-preserving data analysis has a long history spanning multiple disciplines. As electronic data about individuals becomes increasingly detailed, and as technology enables ever more powerful collection and curation of these data, the need increases for a robust, meaningful, and mathematically rigorous definition of privacy, together with a computationally rich class of algorithms that satisfy this definition. Differential Privacy is such a definition.

After motivating and discussing the meaning of differential privacy, the preponderance of the book is devoted to fundamental techniques for achieving differential privacy, and application of these techniques in creative combinations (Sections 3–7), using the query-release problem as an ongoing example. A key point is that, by rethinking the computational goal, one can often obtain far better results than would be achieved by methodically replacing each step of a non-private computation with a differentially private implementation.

Despite some astonishingly powerful computational results, there are still fundamental limitations, not just on what can be achieved with differential privacy but on what can be achieved with any method that protects against a complete breakdown in privacy (Section 8).

Virtually all the algorithms discussed in this book maintain differential privacy against adversaries of arbitrary computational power. Certain algorithms are computationally intensive, others are efficient. Computational complexity for the adversary and the algorithm are both discussed in Section 9.

In Sections 10 and 11 we turn from fundamentals to applications other than query-release, discussing differentially private methods for mechanism design and machine learning. The vast majority of the literature on differentially private algorithms considers a single, static database that is subject to many analyses. Differential privacy in other models, including distributed databases and computations on data streams, is discussed in Section 12.

Finally, we note that this book is meant as a thorough introduction to the problems and techniques of differential privacy, but is not intended to be an exhaustive survey: there is by now a vast amount of work in differential privacy, and we can cover only a small portion of it.

1 The Promise of Differential Privacy

“Differential privacy” describes a promise, made by a data holder, or curator, to a data subject: “You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources are available.” At their best, differentially private database mechanisms can make confidential data widely available for accurate data analysis, without resorting to data clean rooms, data usage agreements, data protection plans, or restricted views. Nonetheless, data utility will eventually be consumed: the Fundamental Law of Information Recovery states that overly accurate answers to too many questions will destroy privacy in a spectacular way.¹ The goal of algorithmic research on differential privacy is to postpone this inevitability as long as possible.

¹ This result, proved in Section 8.1, applies to all techniques for privacy-preserving data analysis, and not just to differential privacy.

Differential privacy addresses the paradox of learning nothing about an individual while learning useful information about a population. A medical database may teach us that smoking causes cancer, affecting an insurance company’s view of a smoker’s long-term medical costs. Has the smoker been harmed by the analysis? Perhaps: his insurance premiums may rise, if the insurer knows he smokes. He may also be helped: learning of his health risks, he enters a smoking cessation program. Has the smoker’s privacy been compromised? It is certainly the case that more is known about him after the study than was known before, but was his information “leaked”? Differential privacy will take the view that it was not, with the rationale that the impact on the smoker is the same independent of whether or not he was in the study. It is the conclusions reached in the study that affect the smoker, not his presence or absence in the data set.

Differential privacy ensures that the same conclusions, for example, smoking causes cancer, will be reached, independent of whether any individual opts into or opts out of the data set. Specifically, it ensures that any sequence of outputs (responses to queries) is “essentially” equally likely to occur, independent of the presence or absence of any individual. Here, the probabilities are taken over random choices made by the privacy mechanism (something controlled by the data curator), and the term “essentially” is captured by a parameter, ε. A smaller ε will yield better privacy (and less accurate responses).

Differential privacy is a definition, not an algorithm. For a given computational task T and a given value of ε there will be many differentially private algorithms for achieving T in an ε-differentially private manner. Some will have better accuracy than others. When ε is small, finding a highly accurate ε-differentially private algorithm for T can be difficult, much as finding a numerically stable algorithm for a specific computational task can require effort.

1.1 Privacy-preserving data analysis

Differential privacy is a definition of privacy tailored to the problem of privacy-preserving data analysis. We briefly address some concerns with other approaches to this problem.

Data Cannot be Fully Anonymized and Remain Useful.
Generally speaking, the richer the data, the more interesting and useful it is. This has led to notions of “anonymization” and “removal of personally identifiable information,” where the hope is that portions of the data records can be suppressed and the remainder published and used for analysis. However, the richness of the data enables “naming” an individual by a sometimes surprising collection of fields, or attributes, such as the combination of zip code, date of birth, and sex, or even the names of three movies and the approximate dates on which an individual watched these movies. This “naming” capability can be used in a linkage attack to match “anonymized” records with non-anonymized records in a different dataset. Thus, the medical records of the governor of Massachusetts were identified by matching anonymized medical encounter data with (publicly available) voter registration records, and Netflix subscribers whose viewing histories were contained in a collection of anonymized movie records published by Netflix as training data for a competition on recommendation were identified by linkage with the Internet Movie Database (IMDb).

Differential privacy neutralizes linkage attacks: since being differentially private is a property of the data access mechanism, and is unrelated to the presence or absence of auxiliary information available to the adversary, access to the IMDb would no more permit a linkage attack to someone whose history is in the Netflix training set than to someone not in the training set.

Re-Identification of “Anonymized” Records is Not the Only Risk. Re-identification of “anonymized” data records is clearly undesirable, not only because of the re-identification per se, which certainly reveals membership in the data set, but also because the record may contain compromising information that, were it tied to an individual, could cause harm. A collection of medical encounter records from a specific urgent care center on a given date may list only a small number of distinct complaints or diagnoses.
The additional information that a neighbor visited the facility on the date in question gives a fairly narrow range of possible diagnoses for the neighbor’s condition. The fact that it may not be possible to match a specific record to the neighbor provides minimal privacy protection to the neighbor.

Queries Over Large Sets are Not Protective. Questions about specific individuals cannot be safely answered with accuracy, and indeed one might wish to reject them out of hand (were it computationally feasible to recognize them). Forcing queries to be over large sets is not a panacea, as shown by the following differencing attack. Suppose it is known that Mr. X is in a certain medical database. Taken together, the answers to the two large queries “How many people in the database have the sickle cell trait?” and “How many people, not named X, in the database have the sickle cell trait?” yield the sickle cell status of Mr. X.

Query Auditing Is Problematic. One might be tempted to audit the sequence of queries and responses, with the goal of interdicting any response if, in light of the history, answering the current query would compromise privacy. For example, the auditor may be on the lookout for pairs of queries that would constitute a differencing attack. There are two difficulties with this approach. First, it is possible that refusing to answer a query is itself disclosive. Second, query auditing can be computationally infeasible; indeed if the query language is sufficiently rich there may not even exist an algorithmic procedure for deciding if a pair of queries constitutes a differencing attack.

Summary Statistics are Not “Safe.” In some sense, the failure of summary statistics as a privacy solution concept is immediate from the differencing attack just described. Other problems with summary statistics include a variety of reconstruction attacks against a database in which each individual has a “secret bit” to be protected. The utility goal may be to permit, for example, questions of the form “How many people satisfying property P have secret bit value 1?” The goal of the adversary, on the other hand, is to significantly increase his chance of guessing the secret bits of individuals.
The reconstruction attacks described in Section 8.1 show the difficulty of protecting against even a linear number of queries of this type: unless sufficient inaccuracy is introduced almost all the secret bits can be reconstructed.

A striking illustration of the risks of releasing summary statistics is in an application of a statistical technique, originally intended for confirming or refuting the presence of an individual’s DNA in a forensic mix, to ruling an individual in or out of a genome-wide association study. According to a Web site of the Human Genome Project, “Single nucleotide polymorphisms, or SNPs (pronounced ‘snips’), are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered. For example a SNP might change the DNA sequence AAGGCTAA to ATGGCTAA.” In this case we say there are two alleles: A and T. For such a SNP we can ask, given a particular reference population, what are the frequencies of each of the two possible alleles? Given the allele frequencies for SNPs in the reference population, we can examine how these frequencies may differ for a subpopulation that has a particular disease (the “case” group), looking for alleles that are associated with the disease. For this reason, genome-wide association studies may contain the allele frequencies of the case group for large numbers of SNPs. By definition, these allele frequencies are only aggregated statistics, and the (erroneous) assumption has been that, by virtue of this aggregation, they preserve privacy. However, given the genomic data of an individual, it is theoretically possible to determine if the individual is in the case group (and, therefore, has the disease). In response, the National Institutes of Health and Wellcome Trust terminated public access to aggregate frequency data from the studies they fund.

This is a challenging problem even for differential privacy, due to the large number of measurements involved (hundreds of thousands or even one million) and the relatively small number of individuals in any case group.

“Ordinary” Facts are Not “OK.” Revealing “ordinary” facts, such as purchasing bread, may be problematic if a data subject is followed over time. For example, consider Mr. T, who regularly buys bread, year after year, until suddenly switching to rarely buying bread. An analyst might conclude Mr. T most likely has been diagnosed with Type 2 diabetes. The analyst might be correct, or might be incorrect; either way Mr. T is harmed.
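Returning to the differencing attack described under “Queries Over Large Sets are Not Protective,” the following is a minimal sketch of how two exactly answered large-set counts combine to reveal one person's status. The toy database, the names, and the helper functions are all hypothetical and invented for illustration; the sketch assumes the target is known to be in the database and that names are unique.

```python
# Hypothetical toy database of (name, has_sickle_cell_trait) rows.
# Every record here is invented for illustration only.
database = [
    ("A", False),
    ("B", True),
    ("X", True),
    ("C", False),
    ("D", True),
]


def count_with_trait(rows):
    """Exact answer to: how many people in the database have the trait?"""
    return sum(1 for _name, has_trait in rows if has_trait)


def count_with_trait_excluding(rows, excluded_name):
    """Exact answer to: how many people, not named `excluded_name`, have the trait?"""
    return sum(
        1 for name, has_trait in rows if has_trait and name != excluded_name
    )


def differencing_attack(rows, target_name):
    """Infer the target's status from the difference of two large-set counts.

    Assumes the target is in the database and both counts are answered exactly.
    """
    everyone = count_with_trait(rows)
    everyone_but_target = count_with_trait_excluding(rows, target_name)
    # The two counts differ by exactly 1 if and only if the target has the trait.
    return everyone - everyone_but_target == 1


x_has_trait = differencing_attack(database, "X")
a_has_trait = differencing_attack(database, "A")
```

Neither query singles out an individual on its face, yet their difference does; the same arithmetic works however large the database is, which is why restricting queries to large sets gives no protection when answers are exact.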
“Just a Few.” In some cases a particular technique may in fact provide privacy protection for “typical” members of a data set, or more generally, “most” members. In such cases one often hears the argument that the technique is adequate, as it compromises the privacy of “just a few” participants. Setting aside the concern that outliers may be precisely those people for whom privacy is most important, the “just a few” philosophy is not intrinsically without merit: there is a social judgment, a weighing of costs and benefits, to be made. A well-articulated definition of privacy consistent with the “just a few” philosophy has yet to be developed; however, for a single data set, “just a few” privacy can be achieved by randomly selecting a subset of rows and releasing them in their entirety (Lemma 4.3, Section 4). Sampling bounds describing the quality of statistical analysis that can be carried out on random subsamples govern the number of rows to be released. Differential privacy provides an alternative when the “just a few” philosophy is rejected.

1.2 Bibliographic notes

Sweeney [81] linked voter registration records to “anonymized” medical encounter data; Narayanan and Shmatikov carried out a linkage attack against anonymized ranking data published by Netflix [65]. The work on presence in a forensic mix is due to Homer et al. [46]. The first reconstruction attacks were due to Dinur and Nissim [18].

2 Basic Terms

This section motivates and presents the formal definition of differential privacy, and enumerates some of its key properties.

2.1 The model of computation

We assume the existence of a trusted and trustworthy curator who holds the data of individuals in a database D, typically comprised of some number n of rows. The intuition is that each row contains the data of a single individual, and, still speaking intuitively, the privacy goal is to simultaneously protect every individual row while permitting statistical analysis of the database as a whole.

In the non-interactive, or offline, model the curator produces some kind of object, such as a “synthetic database,” collection of summary statistics, or “sanitized database” once and for all. After this release the curator plays no further role and the original data may be destroyed.

A query is a function to be applied to a database. The interactive, or online, model permits the data analyst to ask queries adaptively, deciding which query to pose next based on the observed responses to previous queries.

The trusted curator can be replaced by a protocol run by the set of individuals, using the cryptographic techniques for secure multiparty protocols, but for the most part we will not be appealing to cryptographic assumptions. Section 12 describes this and other models studied in the literature.

When all the queries are known in advance the non-interactive model should give the best accuracy, as it is able to correlate noise knowing the structure of the queries. In contrast, when no information about the queries is known in advance, the non-interactive model poses severe challenges, as it must provide answers to all possible queries. As we will see, to ensure privacy, or even to prevent privacy catastrophes, accuracy will necessarily deteriorate with the number of questions asked, and providing accurate answers to all possible questions will be infeasible.

A privacy mechanism, or simply a mechanism, is an algorithm that takes as input a database, a universe X of data types (the set of all possible database rows), random bits, and, optionally, a set of queries, and produces an output string. The hope is that the output string can be decoded to produce relatively accurate answers to the queries, if the latter are present. If no queries are presented then we are in the non-interactive case, and the hope is that the output string can be interpreted to provide answers to future queries.

In some cases we may require that the output string be a synthetic database. This is a multiset drawn from the universe X of possible database rows. The decoding method in this case is to carry out the query on the synthetic database and then to apply some sort of simple transformation, such as multiplying by a scaling factor, to obtain an approximation to the true answer to the query.
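The decoding step for a synthetic database can be sketched as follows. This is an illustration under stated assumptions, not a mechanism from the text: the function names are hypothetical, the query is taken to be a counting query, and the scaling factor is assumed to be n/m, where n is the (assumed known) size of the original database and m is the size of the synthetic one. The sketch shows only the decoding; it says nothing about how the synthetic rows were produced or whether that process is private.

```python
def counting_query(rows, predicate):
    """Count the rows that satisfy `predicate`."""
    return sum(1 for row in rows if predicate(row))


def decode_count(synthetic_rows, predicate, original_size):
    """Approximate a count on the original database from a synthetic one.

    Runs the query on the synthetic database, then rescales by n / m.
    The n / m scaling is an assumption made for this sketch.
    """
    m = len(synthetic_rows)
    if m == 0:
        raise ValueError("synthetic database must be non-empty")
    return counting_query(synthetic_rows, predicate) * original_size / m


# Hypothetical synthetic database: a multiset of rows drawn from the universe.
synthetic = ["smoker", "non-smoker", "smoker", "non-smoker"]
estimate = decode_count(synthetic, lambda row: row == "smoker", original_size=1000)
```

Here half of the synthetic rows satisfy the predicate, so the decoded estimate for an original database of 1000 rows is 500.0.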
2.2 Towards defining private data analysis

A natural approach to defining privacy in the context of data analysis is to require that the analyst knows no more about any individual in the data set after the analysis is completed than she knew before the analysis was begun. It is also natural to formalize this goal by requiring that the adversary’s prior and posterior views about an individual (i.e., before and after having access to the database) shouldn’t be “too different,” or that access to the database shouldn’t change the adversary’s views about any individual “too much.” However, if the database teaches anything at all, this notion of privacy is unachievable. For example, suppose the adversary’s (incorrect) prior view is that everyone has 2 left feet. Access to the statistical database teaches that almost everyone has one left foot and one right foot. The adversary now has a very different view of whether or not any given respondent has two left feet.

Part of the appeal of the before/after, or “nothing is learned,” approach to defining privacy is the intuition that if nothing is learned about an individual then the individual cannot be harmed by the analysis. However, the “smoking causes cancer” example shows this intuition to be flawed; the culprit is auxiliary information (Mr. X smokes).

The “nothing is learned” approach to defining privacy is reminiscent of semantic security for a cryptosystem. Roughly speaking, semantic security says that nothing is learned about the plaintext (the unencrypted message) from the ciphertext. That is, anything known about the plaintext after seeing the ciphertext was known before seeing the ciphertext. So if there is auxiliary information saying that the ciphertext is an encryption of either “dog” or “cat,” then the ciphertext leaks no further information about which of “dog” or “cat” has been encrypted. Formally, this is modeled by comparing the ability of the eavesdropper to guess which of “dog” and “cat” has been encrypted to the ability of a so-called adversary simulator, who has the auxiliary information but does not have access to the ciphertext, to guess the same thing.
If for every eavesdropping adversary, and all auxiliary information (to which both the adversary and the simulator are privy), the adversary simulator has essentially the same odds of guessing as does the eavesdropper, then the system enjoys semantic security. Of course, for the system to be useful, the legitimate receiver must be able to correctly decrypt the message; otherwise semantic security can be achieved trivially.

We know that, under standard computational assumptions, semantically secure cryptosystems exist, so why can we not build semantically secure private database mechanisms that yield answers to queries while keeping individual rows secret?

First, the analogy is not perfect: in a semantically secure cryptosystem there are three parties: the message sender (who encrypts the plaintext message), the message receiver (who decrypts the ciphertext), and the eavesdropper (who is frustrated by her inability to learn anything about the plaintext that she did not already know before it was sent). In contrast, in the setting of private data analysis there are only two parties: the curator, who runs the privacy mechanism (analogous to the sender) and the data analyst, who receives the informative responses to queries (like the message receiver) and also tries to squeeze out privacy-compromising information about individuals (like the eavesdropper). Because the legitimate receiver is the same party as the snooping adversary, the analogy to encryption is flawed: denying all information to the adversary means denying all information to the data analyst.

Second, as with an encryption scheme, we require the privacy mechanism to be useful, which means that it teaches the analyst something she did not previously know. This teaching is unavailable to an adversary simulator; that is, no simulator can “predict” what the analyst has learned. We can therefore look at the database as a weak source of random (unpredictable) bits, from which we can extract some very high quality randomness to be used as a random pad. This can be used in an encryption technique in which a secret message is added to a random value (the “random pad”) in order to produce a string that information-theoretically hides the secret. Only someone knowing the random pad can learn the secret; any party that knows nothing about the pad learns nothing at all about the secret, no matter his or her computational power.
Given access to the database, the analyst can learn the random pad, but the adversary simulator, not given access to the database, learns nothing at all about the pad. Thus, given as auxiliary information the encryption of a secret using the random pad, the analyst can decrypt the secret, but the adversary simulator learns nothing at all about the secret. This yields a huge disparity between the ability of the adversary/analyst to learn the secret and the ability of the adversary simulator to do the same thing, eliminating all hope of anything remotely resembling semantic security.

The obstacle in both the smoking causes cancer example and the hope for semantic security is auxiliary information. Clearly, to be meaningful, a privacy guarantee must hold even in the context of “reasonable” auxiliary knowledge, but separating reasonable from arbitrary auxiliary knowledge is problematic. For example, the analyst using a government database might be an employee at a major search engine company. What are “reasonable” assumptions about the auxiliary information available to such a person?

2.3 Formalizing differential privacy

We will begin with the technical definition of differential privacy, and then go on to interpret it. Differential privacy will provide privacy by process; in particular it will introduce randomness. An early example of privacy by randomized process is randomized response, a technique developed in the social sciences to collect statistical information about embarrassing or illegal behavior, captured by having a property P. Study participants are told to report whether or not they have property P as follows:

1. Flip a coin.
2. If tails, then respond truthfully.
3. If heads, then flip a second coin and respond “Yes” if heads and “No” if tails.

“Privacy” comes from the plausible deniability of any outcome; in particular, if having property P corresponds to engaging in illegal behavior, even a “Yes” answer is not incriminating, since this answer occurs with probability at least 1/4 whether or not the respondent actually has property P.
Accuracy comes from an understanding of the noise generation procedure (the introduction of spurious “Yes” and “No” answers from the randomization): the expected number of “Yes” answers is $1/4$ times the number of participants who do not have property $P$ plus $3/4$ times the number having property $P$. Thus, if $p$ is the true fraction of participants having property $P$, the expected fraction of “Yes” answers is
\[ (1/4)(1-p) + (3/4)p = (1/4) + p/2. \]
Thus, we can estimate $p$ as twice the fraction answering “Yes” minus $1/2$, that is, $2((1/4) + p/2) - 1/2$.

Randomization is essential; more precisely, any non-trivial privacy guarantee that holds regardless of all present or even future sources of auxiliary information, including other databases, studies, Web sites, on-line communities, gossip, newspapers, government statistics, and so on, requires randomization. This follows from a simple hybrid argument, which we now sketch. Suppose, for the sake of contradiction, that we have a non-trivial deterministic algorithm. Non-triviality says that there exists a query and two databases that yield different outputs under this query. Changing one row at a time, we see there exists a pair of databases differing only in the value of a single row, on which the same query yields different outputs. An adversary knowing that the database is one of these two almost identical databases learns the value of the data in the unknown row.

We will therefore need to discuss the input and output space of randomized algorithms. Throughout this monograph we work with discrete probability spaces. Sometimes we will describe our algorithms as sampling from continuous distributions, but these should always be discretized to finite precision in an appropriately careful way (see Remark 2.1 below). In general, a randomized algorithm with domain $A$ and (discrete) range $B$ will be associated with a mapping from $A$ to the probability simplex over $B$, denoted $\Delta(B)$:

Definition 2.1 (Probability Simplex). Given a discrete set $B$, the probability simplex over $B$, denoted $\Delta(B)$, is defined to be:
\[ \Delta(B) = \left\{ x \in \mathbb{R}^{|B|} : x_i \ge 0 \text{ for all } i \text{ and } \sum_{i=1}^{|B|} x_i = 1 \right\}. \]

Definition 2.2 (Randomized Algorithm). A randomized algorithm $\mathcal{M}$ with domain $A$ and discrete range $B$ is associated with a mapping $M : A \to \Delta(B)$. On input $a \in A$, the algorithm $\mathcal{M}$ outputs $\mathcal{M}(a) = b$ with probability $(M(a))_b$ for each $b \in B$. The probability space is over the coin flips of the algorithm $\mathcal{M}$.
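As a concrete instance of Definition 2.2, the two-coin randomized response protocol can be viewed as a mapping from a respondent's true answer to a point in the simplex over {“Yes”, “No”}. The following is a minimal sketch, not a reference implementation: the names `RR_DISTRIBUTION`, `randomized_response` and `estimate_fraction`, the population size, and the value `true_p = 0.3` are all illustrative assumptions. It simulates the protocol and applies the estimator "twice the fraction answering Yes, minus 1/2" described above.

```python
import random

# Randomized response as a mapping into the probability simplex over
# {"Yes", "No"}, indexed by the respondent's true answer (illustrative names).
RR_DISTRIBUTION = {
    True:  {"Yes": 0.75, "No": 0.25},   # respondent has property P
    False: {"Yes": 0.25, "No": 0.75},   # respondent does not have property P
}

def randomized_response(has_property, rng):
    """One run of the two-coin protocol; returns True for a "Yes" response."""
    if rng.random() < 0.5:       # first coin is tails: respond truthfully
        return has_property
    return rng.random() < 0.5    # first coin is heads: second coin decides

def estimate_fraction(responses):
    """Estimate p as twice the fraction of "Yes" responses minus 1/2."""
    frac_yes = sum(responses) / len(responses)
    return 2.0 * frac_yes - 0.5

rng = random.Random(0)           # fixed seed, only to make the sketch repeatable
true_p = 0.3                     # assumed true fraction having property P
population = [rng.random() < true_p for _ in range(100_000)]
responses = [randomized_response(truth, rng) for truth in population]
p_hat = estimate_fraction(responses)
print(f"estimated p = {p_hat:.3f}")
```

With a population of this size the estimate should typically land close to the assumed `true_p`, although any individual response remains plausibly deniable.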

We will think of databases $x$ as being collections of records from a universe $\mathcal{X}$. It will often be convenient to represent databases by their histograms: $x \in \mathbb{N}^{|\mathcal{X}|}$, in which each entry $x_i$ represents the number of elements in the database $x$ of type $i \in \mathcal{X}$ (we abuse notation slightly, letting the symbol $\mathbb{N}$ denote the set of all non-negative integers, including zero). In this representation, a natural measure of the distance between two databases $x$ and $y$ will be their $\ell_1$ distance:

Definition 2.3 (Distance Between Databases). The $\ell_1$ norm of a database $x$ is denoted $\|x\|_1$ and is defined to be:
\[ \|x\|_1 = \sum_{i=1}^{|\mathcal{X}|} |x_i|. \]
The $\ell_1$ distance between two databases $x$ and $y$ is $\|x - y\|_1$.

Note that $\|x\|_1$ is a measure of the size of a database $x$ (i.e., the number of records it contains), and $\|x - y\|_1$ is a measure of how many records differ between $x$ and $y$.

Databases may also be represented by multisets of rows (elements of $\mathcal{X}$) or even ordered lists of rows, which is a special case of a set, where the row number becomes part of the name of the element. In this case distance between databases is typically measured by the Hamming distance, i.e., the number of rows on which they differ.

However, unless otherwise noted, we will use the histogram representation described above. (Note, however, that even when the histogram notation is more mathematically convenient, in actual implementations, the multiset representation will often be much more concise.)

We are now ready to formally define differential privacy, which intuitively will guarantee that a randomized algorithm behaves similarly on similar input databases.

Definition 2.4 (Differential Privacy). A randomized algorithm $\mathcal{M}$ with domain $\mathbb{N}^{|\mathcal{X}|}$ is $(\varepsilon, \delta)$-differentially private if for all $S \subseteq \mathrm{Range}(\mathcal{M})$ and for all $x, y \in \mathbb{N}^{|\mathcal{X}|}$ such that $\|x - y\|_1 \le 1$:
\[ \Pr[\mathcal{M}(x) \in S] \le \exp(\varepsilon) \Pr[\mathcal{M}(y) \in S] + \delta, \]

where the probability space is over the coin flips of the mechanism $\mathcal{M}$. If $\delta = 0$, we say that $\mathcal{M}$ is $\varepsilon$-differentially private.

Typically we are interested in values of $\delta$ that are less than the inverse of any polynomial in the size of the database. In particular, values of $\delta$ on the order of $1/\|x\|_1$ are very dangerous: they permit “preserving privacy” by publishing the complete records of a small number of database participants, which is precisely the “just a few” philosophy discussed in Section 1.

Even when $\delta$ is negligible, however, there are theoretical distinctions between $(\varepsilon, 0)$- and $(\varepsilon, \delta)$-differential privacy. Chief among these is what amounts to a switch of quantification order. $(\varepsilon, 0)$-differential privacy ensures that, for every run of the mechanism $\mathcal{M}(x)$, the output observed is (almost) equally likely to be observed on every neighboring database, simultaneously. In contrast, $(\varepsilon, \delta)$-differential privacy says that for every pair of neighboring databases $x, y$, it is extremely unlikely that, ex post facto, the observed value $\mathcal{M}(x)$ will be much more or much less likely to be generated when the database is $x$ than when the database is $y$. However, given an output $\xi \sim \mathcal{M}(x)$ it may be possible to find a database $y$ such that $\xi$ is much more likely to be produced on $y$ than it is when the database is $x$. That is, the mass of $\xi$ in the distribution $\mathcal{M}(y)$ may be substantially larger than its mass in the distribution $\mathcal{M}(x)$.

The quantity
\[ \mathcal{L}^{(\xi)}_{\mathcal{M}(x) \| \mathcal{M}(y)} = \ln\left( \frac{\Pr[\mathcal{M}(x) = \xi]}{\Pr[\mathcal{M}(y) = \xi]} \right) \]
is important to us; we refer to it as the privacy loss incurred by observing $\xi$. This loss might be positive (when an event is more likely under $x$ than under $y$) or it might be negative (when an event is more likely under $y$ than under $x$). As we will see in Lemma 3.17, $(\varepsilon, \delta)$-differential privacy ensures that for all adjacent $x, y$, the absolute value of the privacy loss will be bounded by $\varepsilon$ with probability at least $1 - \delta$. As always, the probability space is over the coins of the mechanism $\mathcal{M}$.

Differential privacy is immune to post-processing: a data analyst, without additional knowledge about the private database, cannot compute a function of the output of a private algorithm $\mathcal{M}$ and make it
less differentially private. That is, if an algorithm protects an individual’s privacy, then a data analyst cannot increase privacy loss, either under the formal definition or even in any intuitive sense, simply by sitting in a corner and thinking about the output of the algorithm. Formally, the composition of a data-independent mapping $f$ with an $(\varepsilon, \delta)$-differentially private algorithm $\mathcal{M}$ is also $(\varepsilon, \delta)$-differentially private:

Proposition 2.1 (Post-Processing). Let $\mathcal{M} : \mathbb{N}^{|\mathcal{X}|} \to R$ be a randomized algorithm that is $(\varepsilon, \delta)$-differentially private. Let $f : R \to R'$ be an arbitrary randomized mapping. Then $f \circ \mathcal{M} : \mathbb{N}^{|\mathcal{X}|} \to R'$ is $(\varepsilon, \delta)$-differentially private.

Proof. We prove the proposition for a deterministic function $f : R \to R'$. The result then follows because any randomized mapping can be decomposed into a convex combination of deterministic functions, and a convex combination of differentially private mechanisms is differentially private.

Fix any pair of neighboring databases $x, y$ with $\|x - y\|_1 \le 1$, and fix any event $S \subseteq R'$. Let $T = \{ r \in R : f(r) \in S \}$. We then have:
\[
\begin{aligned}
\Pr[f(\mathcal{M}(x)) \in S] &= \Pr[\mathcal{M}(x) \in T] \\
&\le \exp(\varepsilon) \Pr[\mathcal{M}(y) \in T] + \delta \\
&= \exp(\varepsilon) \Pr[f(\mathcal{M}(y)) \in S] + \delta,
\end{aligned}
\]
which was what we wanted.

It follows immediately from Definition 2.4 that $(\varepsilon, 0)$-differential privacy composes in a straightforward way: the composition of two $(\varepsilon, 0)$-differentially private mechanisms is $(2\varepsilon, 0)$-differentially private. More generally (Theorem 3.16), “the epsilons and the deltas add up”: the composition of $k$ differentially private mechanisms, where the $i$th mechanism is $(\varepsilon_i, \delta_i)$-differentially private, for $1 \le i \le k$, is $(\sum_i \varepsilon_i, \sum_i \delta_i)$-differentially private.

Group privacy for $(\varepsilon, 0)$-differentially private mechanisms also follows immediately from Definition 2.4, with the strength of the privacy guarantee dropping linearly with the size of the group.
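Proposition 2.1 and the "epsilons add up" rule can be checked numerically on a small discrete mechanism. The sketch below is an illustration under stated assumptions rather than part of the formal development: it represents a mechanism's output distribution on two neighboring inputs as probability vectors (here the randomized response distributions, for which $\varepsilon = \ln 3$), and the helper names `pure_dp_epsilon`, `post_process`, `compose` and `random_kernel` are hypothetical. For discrete outputs and $\delta = 0$, the worst case over events $S$ is attained on a single outcome, so it suffices to compare outcome probabilities.

```python
import math
import random

def pure_dp_epsilon(dist_x, dist_y):
    """Smallest eps with dist_x[o] <= e^eps * dist_y[o] and vice versa, for all o."""
    eps = 0.0
    for px, py in zip(dist_x, dist_y):
        if px == 0.0 and py == 0.0:
            continue                     # outcome impossible under both inputs
        if px == 0.0 or py == 0.0:
            return math.inf              # no finite eps can work
        eps = max(eps, abs(math.log(px / py)))
    return eps

def post_process(dist, kernel):
    """Push a distribution through a randomized map; kernel[r][s] = Pr[f(r) = s]."""
    return [sum(dist[r] * kernel[r][s] for r in range(len(dist)))
            for s in range(len(kernel[0]))]

def compose(dist_a, dist_b):
    """Joint distribution of the outputs of two independent mechanisms."""
    return [pa * pb for pa in dist_a for pb in dist_b]

def random_kernel(n_in, n_out, rng):
    """An arbitrary data-independent randomized mapping (rows sum to 1)."""
    rows = []
    for _ in range(n_in):
        weights = [rng.random() for _ in range(n_out)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows

rr_yes, rr_no = [0.75, 0.25], [0.25, 0.75]   # randomized response on neighbors
eps_rr = pure_dp_epsilon(rr_yes, rr_no)      # equals ln 3

rng = random.Random(0)
eps_after = max(
    pure_dp_epsilon(post_process(rr_yes, k), post_process(rr_no, k))
    for k in (random_kernel(2, 4, rng) for _ in range(100))
)
eps_two = pure_dp_epsilon(compose(rr_yes, rr_yes), compose(rr_no, rr_no))
print(eps_rr, eps_after, eps_two)
```

In this sketch no post-processing kernel pushes the loss above $\ln 3$, while running the mechanism twice on the same input doubles it to $2\ln 3$.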

Theorem 2.2. Any $(\varepsilon, 0)$-differentially private mechanism $\mathcal{M}$ is $(k\varepsilon, 0)$-differentially private for groups of size $k$. That is, for all $\|x - y\|_1 \le k$ and all $S \subseteq \mathrm{Range}(\mathcal{M})$
\[ \Pr[\mathcal{M}(x) \in S] \le \exp(k\varepsilon) \Pr[\mathcal{M}(y) \in S], \]
where the probability space is over the coin flips of the mechanism $\mathcal{M}$.

This addresses, for example, the question of privacy in surveys that include multiple family members.¹

¹ However, as the group gets larger, the privacy guarantee deteriorates, and this is what we want: clearly, if we replace an entire surveyed population, say, of cancer patients, with a completely different group of respondents, say, healthy teenagers, we should get different answers to queries about the fraction of respondents who regularly run three miles each day. Although something similar holds for $(\varepsilon, \delta)$-differential privacy, the approximation term $\delta$ takes a big hit, and we only obtain $(k\varepsilon, k e^{(k-1)\varepsilon}\delta)$-differential privacy for groups of size $k$.

More generally, composition and group privacy are not the same thing, and the improved composition bounds in Section 3.5.2 (Theorem 3.20), which substantially improve upon the factor of $k$, do not (and cannot) yield the same gains for group privacy, even when $\delta = 0$.

2.3.1 What differential privacy promises

An Economic View. Differential privacy promises to protect individuals from any additional harm that they might face due to their data being in the private database $x$ that they would not have faced had their data not been part of $x$. Although individuals may indeed face harm once the results $\mathcal{M}(x)$ of a differentially private mechanism $\mathcal{M}$ have been released, differential privacy promises that the probability of harm was not significantly increased by their choice to participate. This is a very utilitarian definition of privacy, because when an individual is deciding whether or not to include her data in a database that will be used in a differentially private manner, it is exactly this difference that she is considering: the probability of harm given that she participates, as compared to the probability of harm given that she does not participate. She has no control over the remaining contents of the database. Given the promise of differential privacy, she is assured that she should be almost indifferent between participating and not, from the point of view of future harm. Given any incentive, from altruism to monetary reward, differential privacy may convince her to allow her data to be used. This intuition can be formalized in a utility-theoretic sense, which we here briefly sketch.

Consider an individual $i$ who has arbitrary preferences over the set of all possible future events, which we denote by $\mathcal{A}$. These preferences are expressed by a utility function $u_i : \mathcal{A} \to \mathbb{R}_{\ge 0}$, and we say that individual $i$ experiences utility $u_i(a)$ in the event that $a \in \mathcal{A}$ comes to pass. Suppose that $x \in \mathbb{N}^{|\mathcal{X}|}$ is a data-set containing individual $i$’s private data, and that $\mathcal{M}$ is an $\varepsilon$-differentially private algorithm. Let $y$ be a data-set that is identical to $x$ except that it does not include the data of individual $i$ (in particular, $\|x - y\|_1 = 1$), and let $f : \mathrm{Range}(\mathcal{M}) \to \Delta(\mathcal{A})$ be the (arbitrary) function that determines the distribution over future events $\mathcal{A}$, conditioned on the output of mechanism $\mathcal{M}$. By the guarantee of differential privacy, together with the resilience to arbitrary post-processing guaranteed by Proposition 2.1, we have:
\[
\begin{aligned}
\mathbb{E}_{a \sim f(\mathcal{M}(x))}[u_i(a)] &= \sum_{a \in \mathcal{A}} u_i(a) \cdot \Pr_{f(\mathcal{M}(x))}[a] \\
&\le \sum_{a \in \mathcal{A}} u_i(a) \cdot \exp(\varepsilon) \Pr_{f(\mathcal{M}(y))}[a] \\
&= \exp(\varepsilon)\, \mathbb{E}_{a \sim f(\mathcal{M}(y))}[u_i(a)].
\end{aligned}
\]
Similarly,
\[ \mathbb{E}_{a \sim f(\mathcal{M}(x))}[u_i(a)] \ge \exp(-\varepsilon)\, \mathbb{E}_{a \sim f(\mathcal{M}(y))}[u_i(a)]. \]
Hence, by promising a guarantee of $\varepsilon$-differential privacy, a data analyst can promise an individual that his expected future utility will not be harmed by more than an $\exp(\varepsilon) \approx (1 + \varepsilon)$ factor. Note that this promise holds independently of the individual $i$’s utility function $u_i$, and holds simultaneously for multiple individuals who may have completely different utility functions.
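The two-sided utility bound can be illustrated with a toy calculation. Everything concrete below is an assumption made for the example: the mechanism output distributions `out_x` and `out_y` are the randomized response distributions (so $\varepsilon = \ln 3$), the matrix `f_kernel` plays the role of the arbitrary map $f$ from outputs to distributions over three hypothetical future events, and the utilities are drawn at random.

```python
import math
import random

def event_distribution(output_dist, f_kernel):
    """Distribution over future events induced by f applied to the output."""
    return [sum(output_dist[r] * f_kernel[r][a] for r in range(len(output_dist)))
            for a in range(len(f_kernel[0]))]

def expected_utility(event_dist, utility):
    """E[u_i(a)] for a drawn from event_dist, with non-negative utilities."""
    return sum(p * u for p, u in zip(event_dist, utility))

eps = math.log(3)                          # privacy parameter of the toy mechanism
out_x, out_y = [0.75, 0.25], [0.25, 0.75]  # M(x) and M(y) on neighboring databases
f_kernel = [[0.7, 0.2, 0.1],               # hypothetical f: output -> events
            [0.1, 0.3, 0.6]]

events_x = event_distribution(out_x, f_kernel)
events_y = event_distribution(out_y, f_kernel)

rng = random.Random(1)
bounds_hold = True
for _ in range(1000):
    utility = [10.0 * rng.random() for _ in range(3)]   # arbitrary u_i >= 0
    e_x = expected_utility(events_x, utility)
    e_y = expected_utility(events_y, utility)
    bounds_hold &= math.exp(-eps) * e_y - 1e-12 <= e_x <= math.exp(eps) * e_y + 1e-12
print(bounds_hold)
```

The check does not depend on which utility vector is drawn, mirroring the remark that the promise holds independently of $u_i$.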

2.3.2 What differential privacy does not promise

As we saw in the Smoking Causes Cancer example, while differential privacy is an extremely strong guarantee, it does not promise unconditional freedom from harm. Nor does it create privacy where none previously exists. More generally, differential privacy does not guarantee that what one believes to be one’s secrets will remain secret. It merely ensures that one’s participation in a survey will not in itself be disclosed, nor will participation lead to disclosure of any specifics that one has contributed to the survey. It is very possible that conclusions drawn from the survey may reflect statistical information about an individual. A health survey intended to discover early indicators of a particular ailment may produce strong, even conclusive results; that these conclusions hold for a given individual is not evidence of a differential privacy violation; the individual may not even have participated in the survey (again, differential privacy ensures that these conclusive results would be obtained with very similar probability whether or not the individual participated in the survey). In particular, if the survey teaches us that specific private attributes correlate strongly with publicly observable attributes, this is not a violation of differential privacy, since this same correlation would be observed with almost the same probability independent of the presence or absence of any respondent.

Qualitative Properties of Differential Privacy. Having introduced and formally defined differential privacy, we recapitulate its key desirable qualities.

1. Protection against arbitrary risks, moving beyond protection against re-identification.
2. Automatic neutralization of linkage attacks, including all those attempted with all past, present, and future datasets and other forms and sources of auxiliary information.
3. Quantification of privacy loss. Differential privacy is not a binary concept, and has a measure of privacy loss. This permits comparisons among different techniques: for a fixed bound on privacy loss, which technique provides better accuracy? For a fixed accuracy, which technique provides better privacy?
4. Composition. Perhaps most crucially, the quantification of loss also permits the analysis and control of cumulative privacy loss over multiple computations. Understanding the behavior of differentially private mechanisms under composition enables the design and analysis of complex differentially private algorithms from simpler differentially private building blocks.
5. Group Privacy. Differential privacy permits the analysis and control of privacy loss incurred by groups, such as families.
6. Closure Under Post-Processing. Differential privacy is immune to post-processing: a data analyst, without additional knowledge about the private database, cannot compute a function of the output of a differentially private algorithm $\mathcal{M}$ and make it less differentially private. That is, a data analyst cannot increase privacy loss, either under the formal definition or even in any intuitive sense, simply by sitting in a corner and thinking about the output of the algorithm, no matter what auxiliary information is available.

These are the signal attributes of differential privacy. Can we prove a converse? That is, do these attributes, or some subset thereof, imply differential privacy? Can differential privacy be weakened in these respects and still be meaningful? These are open questions.

2.3.3 Final remarks on the definition

The Granularity of Privacy. Claims of differential privacy should be carefully scrutinized to ascertain the level of granularity at which privacy is being promised. Differential privacy promises that the behavior of an algorithm will be roughly unchanged even if a single entry in the database is modified. But what constitutes a single entry in the database? Consider for example a database that takes the form of a graph. Such a database might encode a social network: each individual $i \in [n]$ is represented by a vertex in the graph, and friendships between individuals are represented by edges.

We could consider differential privacy at a level of granularity corresponding to individuals: that is, we could require that differentially
private algorithms be insensitive to the addition or removal of any vertex from the graph. This gives a strong privacy guarantee, but might in fact be stronger than we need. The addition or removal of a single vertex could after all add or remove up to $n$ edges in the graph. Depending on what it is we hope to learn from the graph, insensitivity to $n$ edge removals might be an impossible constraint to meet.

We could on the other hand consider differential privacy at a level of granularity corresponding to edges, and ask our algorithms to be insensitive only to the addition or removal of single, or small numbers of, edges from the graph. This is of course a weaker guarantee, but might still be sufficient for some purposes. Informally speaking, if we promise $\varepsilon$-differential privacy at the level of a single edge, then no data analyst should be able to conclude anything about the existence of any subset of $1/\varepsilon$ edges in the graph. In some circumstances, large groups of social contacts might not be considered sensitive information: for example, an individual might not feel the need to hide the fact that the majority of his contacts are with individuals in his city or workplace, because where he lives and where he works are public information. On the other hand, there might be a small number of social contacts whose existence is highly sensitive (for example a prospective new employer, or an intimate friend). In this case, edge privacy should be sufficient to protect sensitive information, while still allowing a fuller analysis of the data than vertex privacy. Edge privacy will protect such an individual’s sensitive information provided that he has fewer than $1/\varepsilon$ such friends.

As another example, a differentially private movie recommendation system can be designed to protect the data in the training set at the “event” level of single movies, hiding the viewing/rating of any single movie but not, say, hiding an individual’s enthusiasm for cowboy westerns or gore, or at the “user” level of an individual’s entire viewing and rating history.

All Small Epsilons Are Alike. When $\varepsilon$ is small, $(\varepsilon, 0)$-differential privacy asserts that for all pairs of adjacent databases $x, y$ and all outputs $o$, an adversary cannot distinguish which is the true database
on the basis of observing $o$. When $\varepsilon$ is small, failing to be $(\varepsilon, 0)$-differentially private is not necessarily alarming: for example, the mechanism may be $(2\varepsilon, 0)$-differentially private. The nature of the privacy guarantees with differing but small epsilons is quite similar. But what of large values for $\varepsilon$? Failure to be $(15, 0)$-differentially private merely says there exist neighboring databases and an output $o$ for which the ratio of probabilities of observing $o$ conditioned on the database being, respectively, $x$ or $y$, is large. An output of $o$ might be very unlikely (this is addressed by $(\varepsilon, \delta)$-differential privacy); databases $x$ and $y$ might be terribly contrived and unlikely to occur in the “real world”; the adversary may not have the right auxiliary information to recognize that a revealing output has occurred; or may not know enough about the database(s) to determine the value of their symmetric difference. Thus, much as a weak cryptosystem may leak anything from only the least significant bit of a message to the complete decryption key, the failure to be $(\varepsilon, 0)$- or $(\varepsilon, \delta)$-differentially private may range from effectively meaningless privacy breaches to complete revelation of the entire database. A large epsilon is large after its own fashion.

A Few Additional Formalisms. Our privacy mechanism $\mathcal{M}$ will often take some auxiliary parameters $w$ as input, in addition to the database $x$. For example, $w$ may specify a query $q_w$ on the database $x$, or a collection $Q_w$ of queries. The mechanism $\mathcal{M}(w, x)$ might (respectively) respond with a differentially private approximation to $q_w(x)$ or to some or all of the queries in $Q_w$. For all $\delta \ge 0$, we say that a mechanism $\mathcal{M}(\cdot, \cdot)$ satisfies $(\varepsilon, \delta)$-differential privacy if for every $w$, $\mathcal{M}(w, \cdot)$ satisfies $(\varepsilon, \delta)$-differential privacy.

Another example of a parameter that may be included in $w$ is a security parameter $\kappa$ to govern how small $\delta = \delta(\kappa)$ should be. That is, $\mathcal{M}(\kappa, \cdot)$ should be $(\varepsilon, \delta(\kappa))$-differentially private for all $\kappa$. Typically, and throughout this monograph, we require that $\delta$ be a negligible function in $\kappa$, i.e., $\delta = \kappa^{-\omega(1)}$. Thus, we think of $\delta$ as being cryptographically small, whereas $\varepsilon$ is typically thought of as a moderately small constant.

In the case where the auxiliary parameter $w$ specifies a collection $Q_w = \{ q : \mathcal{X}^n \to \mathbb{R} \}$ of queries, we call the mechanism $\mathcal{M}$ a
synopsis generator. A synopsis generator outputs a (differentially private) synopsis $\mathcal{A}$ which can be used to compute answers to all the queries in $Q_w$. That is, we require that there exists a reconstruction procedure $R$ such that for each input $v$ specifying a query $q_v \in Q_w$, the reconstruction procedure outputs $R(\mathcal{A}, v) \in \mathbb{R}$. Typically, we will require that with high probability $\mathcal{M}$ produces a synopsis $\mathcal{A}$ such that the reconstruction procedure, using $\mathcal{A}$, computes accurate answers. That is, for all or most (weighted by some distribution) of the queries $q_v \in Q_w$, the error $|R(\mathcal{A}, v) - q_v(x)|$ will be bounded. We will occasionally abuse notation and refer to the reconstruction procedure taking as input the actual query $q$ (rather than some representation $v$ of it), and outputting $R(\mathcal{A}, q)$.

A special case of a synopsis is a synthetic database. As the name suggests, the rows of a synthetic database are of the same type as rows of the original database. An advantage to synthetic databases is that they may be analyzed using the same software that the analyst would use on the original database, obviating the need for a special reconstruction procedure $R$.

Remark 2.1. Considerable care must be taken when programming real-valued mechanisms, such as the Laplace mechanism, due to subtleties in the implementation of floating point numbers. Otherwise differential privacy can be destroyed, as outputs with non-zero probability on a database $x$ may, because of rounding, have zero probability on adjacent databases $y$. This is just one way in which the implementation of floating point requires scrutiny in the context of differential privacy, and it is not unique.

2.4 Bibliographic notes

The definition of differential privacy is due to Dwork et al. [23]; the precise formulation used here and in the literature first appears in [20] and is due to Dwork and McSherry. The term “differential privacy” was coined by Michael Schroeder. The impossibility of semantic security is due to Dwork and Naor [25]. Composition and group privacy for $(\varepsilon, 0)$-differentially private mechanisms is first addressed in [23].
Composition for $(\varepsilon, \delta)$-differential privacy was first addressed in [21] (but see the corrected proof in Appendix B, due to Dwork and Lei [22]). The vulnerability of differential privacy to inappropriate implementations of floating point numbers was observed by Mironov, who proposed a mitigation [63].

3 Basic Techniques and Composition Theorems

After reviewing a few probabilistic tools, we present the Laplace mechanism, which gives differential privacy for real (vector) valued queries. An application of this leads naturally to the exponential mechanism, which is a method for differentially private selection from a discrete set of candidate outputs. We then analyze the cumulative privacy loss incurred by composing multiple differentially private mechanisms. Finally we give a method, the sparse vector technique, for privately reporting the outcomes of a potentially very large number of computations, provided that only a few are “significant.”

In this section, we describe some of the most basic techniques in differential privacy that we will come back to use again and again. The techniques described here form the basic building blocks for all of the other algorithms that we will develop.

3.1 Useful probabilistic tools

The following concentration inequalities will frequently be useful. We state them in easy to use forms rather than in their strongest forms.

Theorem 3.1 (Additive Chernoff Bound). Let $X_1, \ldots, X_m$ be independent random variables bounded such that $0 \le X_i \le 1$ for all $i$. Let $S = \frac{1}{m}\sum_{i=1}^{m} X_i$ denote their mean, and let $\mu = \mathbb{E}[S]$ denote their expected mean. Then:
\[ \Pr[S > \mu + \varepsilon] \le e^{-2m\varepsilon^2} \]
\[ \Pr[S < \mu - \varepsilon] \le e^{-2m\varepsilon^2} \]

Theorem 3.2 (Multiplicative Chernoff Bound). Let $X_1, \ldots, X_m$ be independent random variables bounded such that $0 \le X_i \le 1$ for all $i$. Let $S = \frac{1}{m}\sum_{i=1}^{m} X_i$ denote their mean, and let $\mu = \mathbb{E}[S]$ denote their expected mean. Then:
\[ \Pr[S > (1 + \varepsilon)\mu] \le e^{-m\mu\varepsilon^2/3} \]
\[ \Pr[S < (1 - \varepsilon)\mu] \le e^{-m\mu\varepsilon^2/2} \]

When we do not have independent random variables, all is not lost. We may still apply Azuma’s inequality:

Theorem 3.3 (Azuma’s Inequality). Let $f$ be a function of $m$ random variables $X_1, \ldots, X_m$, each $X_i$ taking values from a set $A_i$ such that $\mathbb{E}[f]$ is bounded. Let $c_i$ denote the maximum effect of $X_i$ on $f$, i.e., for all $a_i, a_i' \in A_i$:
\[ \left| \mathbb{E}[f \mid X_1, \ldots, X_{i-1}, X_i = a_i] - \mathbb{E}[f \mid X_1, \ldots, X_{i-1}, X_i = a_i'] \right| \le c_i. \]
Then:
\[ \Pr\left[ f(X_1, \ldots, X_m) \ge \mathbb{E}[f] + t \right] \le \exp\left( -\frac{2t^2}{\sum_{i=1}^{m} c_i^2} \right). \]

Theorem 3.4 (Stirling’s Approximation). $n!$ can be approximated by $\sqrt{2n\pi}\,(n/e)^n$:
\[ \sqrt{2n\pi}\,(n/e)^n e^{1/(12n+1)} < n! < \sqrt{2n\pi}\,(n/e)^n e^{1/(12n)}. \]

3.2 Randomized response

Let us recall the simple randomized response mechanism, described in Section 2, for evaluating the frequency of embarrassing or illegal
behaviors. Let XYZ be such an activity. Faced with the query, “Have you engaged in XYZ in the past week?” the respondent is instructed to perform the following steps:

1. Flip a coin.
2. If tails, then respond truthfully.
3. If heads, then flip a second coin and respond “Yes” if heads and “No” if tails.

The intuition behind randomized response is that it provides “plausible deniability.” For example, a response of “Yes” may have been offered because the first and second coin flips were both heads, which occurs with probability $1/4$. In other words, privacy is obtained by process; there are no “good” or “bad” responses. The process by which the responses are obtained affects how they may legitimately be interpreted. As the next claim shows, randomized response is differentially private.

Claim 3.5. The version of randomized response described above is $(\ln 3, 0)$-differentially private.

Proof. Fix a respondent. A case analysis shows that $\Pr[\text{Response} = \text{Yes} \mid \text{Truth} = \text{Yes}] = 3/4$. Specifically, when the truth is “Yes” the outcome will be “Yes” if the first coin comes up tails (probability $1/2$) or the first and second come up heads (probability $1/4$), while $\Pr[\text{Response} = \text{Yes} \mid \text{Truth} = \text{No}] = 1/4$ (first comes up heads and second comes up heads; probability $1/4$). Applying similar reasoning to the case of a “No” answer, we obtain:
\[ \frac{\Pr[\text{Response} = \text{Yes} \mid \text{Truth} = \text{Yes}]}{\Pr[\text{Response} = \text{Yes} \mid \text{Truth} = \text{No}]} = \frac{3/4}{1/4} = \frac{\Pr[\text{Response} = \text{No} \mid \text{Truth} = \text{No}]}{\Pr[\text{Response} = \text{No} \mid \text{Truth} = \text{Yes}]} = 3. \]

3.3 The Laplace mechanism

Numeric queries, functions $f : \mathbb{N}^{|\mathcal{X}|} \to \mathbb{R}^k$, are one of the most fundamental types of database queries. These queries map databases to $k$
real numbers. One of the important parameters that will determine just how accurately we can answer such queries is their $\ell_1$ sensitivity:

Definition 3.1 ($\ell_1$-sensitivity). The $\ell_1$-sensitivity of a function $f : \mathbb{N}^{|\mathcal{X}|} \to \mathbb{R}^k$ is:
\[ \Delta f = \max_{\substack{x, y \in \mathbb{N}^{|\mathcal{X}|} \\ \|x - y\|_1 = 1}} \|f(x) - f(y)\|_1. \]

The $\ell_1$ sensitivity of a function $f$ captures the magnitude by which a single individual’s data can change the function $f$ in the worst case, and therefore, intuitively, the uncertainty in the response that we must introduce in order to hide the participation of a single individual. Indeed, we will formalize this intuition: the sensitivity of a function gives an upper bound on how much we must perturb its output to preserve privacy. One noise distribution naturally lends itself to differential privacy.

Definition 3.2 (The Laplace Distribution). The Laplace Distribution (centered at 0) with scale $b$ is the distribution with probability density function:
\[ \mathrm{Lap}(x \mid b) = \frac{1}{2b} \exp\left( -\frac{|x|}{b} \right). \]
The variance of this distribution is $\sigma^2 = 2b^2$. We will sometimes write $\mathrm{Lap}(b)$ to denote the Laplace distribution with scale $b$, and will sometimes abuse notation and write $\mathrm{Lap}(b)$ simply to denote a random variable $X \sim \mathrm{Lap}(b)$.

The Laplace distribution is a symmetric version of the exponential distribution.

We will now define the Laplace Mechanism. As its name suggests, the Laplace mechanism will simply compute $f$, and perturb each coordinate with noise drawn from the Laplace distribution. The scale of the noise will be calibrated to the sensitivity of $f$ (divided by $\varepsilon$).¹

¹ Alternately, using Gaussian noise with variance calibrated to $\Delta f \ln(1/\delta)/\varepsilon$, one can achieve $(\varepsilon, \delta)$-differential privacy (see Appendix A). Use of the Laplace mechanism is cleaner and the two mechanisms behave similarly under composition (Theorem 3.20).

Definition 3.3 (The Laplace Mechanism). Given any function $f : \mathbb{N}^{|\mathcal{X}|} \to \mathbb{R}^k$, the Laplace mechanism is defined as:
\[ \mathcal{M}_L(x, f(\cdot), \varepsilon) = f(x) + (Y_1, \ldots, Y_k) \]
where $Y_i$ are i.i.d. random variables drawn from $\mathrm{Lap}(\Delta f/\varepsilon)$.

Theorem 3.6. The Laplace mechanism preserves $(\varepsilon, 0)$-differential privacy.

Proof. Let $x \in \mathbb{N}^{|\mathcal{X}|}$ and $y \in \mathbb{N}^{|\mathcal{X}|}$ be such that $\|x - y\|_1 \le 1$, and let $f(\cdot)$ be some function $f : \mathbb{N}^{|\mathcal{X}|} \to \mathbb{R}^k$. Let $p_x$ denote the probability density function of $\mathcal{M}_L(x, f, \varepsilon)$, and let $p_y$ denote the probability density function of $\mathcal{M}_L(y, f, \varepsilon)$. We compare the two at some arbitrary point $z \in \mathbb{R}^k$:
\[
\begin{aligned}
\frac{p_x(z)}{p_y(z)} &= \prod_{i=1}^{k} \left( \frac{\exp\left( -\frac{\varepsilon |f(x)_i - z_i|}{\Delta f} \right)}{\exp\left( -\frac{\varepsilon |f(y)_i - z_i|}{\Delta f} \right)} \right) \\
&= \prod_{i=1}^{k} \exp\left( \frac{\varepsilon \left( |f(y)_i - z_i| - |f(x)_i - z_i| \right)}{\Delta f} \right) \\
&\le \prod_{i=1}^{k} \exp\left( \frac{\varepsilon |f(x)_i - f(y)_i|}{\Delta f} \right) \\
&= \exp\left( \frac{\varepsilon \cdot \|f(x) - f(y)\|_1}{\Delta f} \right) \\
&\le \exp(\varepsilon),
\end{aligned}
\]
where the first inequality follows from the triangle inequality, and the last follows from the definition of sensitivity and the fact that $\|x - y\|_1 \le 1$. That $\frac{p_x(z)}{p_y(z)} \ge \exp(-\varepsilon)$ follows by symmetry.

Example 3.1 (Counting Queries). Counting queries are queries of the form “How many elements in the database satisfy Property $P$?” We will return to these queries again and again, sometimes in this pure form, sometimes in fractional form (“What fraction of the elements in the databases...?”), sometimes with weights (linear queries), and sometimes in slightly more complex forms (e.g., apply $h : \mathbb{N}^{|\mathcal{X}|} \to [0, 1]$ to each element in the database and sum the results). Counting is an
extremely powerful primitive. It captures everything learnable in the statistical queries learning model, as well as many standard data mining tasks and basic statistics. Since the sensitivity of a counting query is 1 (the addition or deletion of a single individual can change a count by at most 1), it is an immediate consequence of Theorem 3.6 that $(\varepsilon, 0)$-differential privacy can be achieved for counting queries by the addition of noise scaled to $1/\varepsilon$, that is, by adding noise drawn from $\mathrm{Lap}(1/\varepsilon)$. The expected distortion, or error, is $1/\varepsilon$, independent of the size of the database.

A fixed but arbitrary list of $m$ counting queries can be viewed as a vector-valued query. Absent any further information about the set of queries, a worst-case bound on the sensitivity of this vector-valued query is $m$, as a single individual might change every count. In this case $(\varepsilon, 0)$-differential privacy can be achieved by adding noise scaled to $m/\varepsilon$ to the true answer to each query.

We sometimes refer to the problem of responding to large numbers of (possibly arbitrary) queries as the query release problem.

Example 3.2 (Histogram Queries). In the special (but common) case in which the queries are structurally disjoint we can do much better: we don't necessarily have to let the noise scale with the number of queries. An example is the histogram query. In this type of query the universe $\mathbb{N}^{|\mathcal{X}|}$ is partitioned into cells, and the query asks how many database elements lie in each of the cells. Because the cells are disjoint, the addition or removal of a single database element can affect the count in exactly one cell, and the difference to that cell is bounded by 1, so histogram queries have sensitivity 1 and can be answered by adding independent draws from $\mathrm{Lap}(1/\varepsilon)$ to the true count in each cell.

To understand the accuracy of the Laplace mechanism for general queries we use the following useful fact:

Fact 3.7. If $Y \sim \mathrm{Lap}(b)$, then:
\[
\Pr[\,|Y| \ge t \cdot b\,] = \exp(-t).
\]
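The Laplace mechanism for counting and histogram queries, and the tail behaviour stated in Fact 3.7, can be sketched in a few lines. This is a minimal illustration, not a hardened implementation: the function names (`laplace_noise`, `noisy_count`, `noisy_histogram`, `empirical_tail`) and the sample counts are assumptions made for this example, and floating-point sampling of this kind is not suitable for production privacy guarantees.

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one sample from Lap(scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng):
    """Counting query (sensitivity 1): add Lap(1/epsilon) noise."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

def noisy_histogram(cell_counts, epsilon, rng):
    """Histogram query (sensitivity 1): independent Lap(1/epsilon) noise per cell."""
    return [c + laplace_noise(1.0 / epsilon, rng) for c in cell_counts]

def empirical_tail(scale, t, trials, rng):
    """Estimate Pr[|Y| >= t * scale] for Y ~ Lap(scale) by simulation."""
    hits = sum(abs(laplace_noise(scale, rng)) >= t * scale for _ in range(trials))
    return hits / trials

rng = random.Random(0)                      # fixed seed, for reproducibility only
hist = noisy_histogram([120, 45, 300], epsilon=0.5, rng=rng)
tail = empirical_tail(scale=2.0, t=1.5, trials=200_000, rng=rng)
predicted = math.exp(-1.5)                  # value given by Fact 3.7
```

The simulated tail frequency `tail` should land close to `predicted`, whatever the scale, since Fact 3.7 depends only on $t$.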

This fact, together with a union bound, gives us a simple bound on the accuracy of the Laplace mechanism:

Theorem 3.8. Let $f : \mathbb{N}^{|\mathcal{X}|} \to \mathbb{R}^k$, and let $y = \mathcal{M}_L(x, f(\cdot), \varepsilon)$. Then $\forall \delta \in (0, 1]$:
\[
\Pr\left[\|f(x) - y\|_\infty \ge \ln\left(\frac{k}{\delta}\right) \cdot \left(\frac{\Delta f}{\varepsilon}\right)\right] \le \delta .
\]

Proof. We have:
\[
\begin{aligned}
\Pr\left[\|f(x) - y\|_\infty \ge \ln\left(\frac{k}{\delta}\right) \cdot \left(\frac{\Delta f}{\varepsilon}\right)\right]
&= \Pr\left[\max_{i \in [k]} |Y_i| \ge \ln\left(\frac{k}{\delta}\right) \cdot \left(\frac{\Delta f}{\varepsilon}\right)\right] \\
&\le k \cdot \Pr\left[|Y_i| \ge \ln\left(\frac{k}{\delta}\right) \cdot \left(\frac{\Delta f}{\varepsilon}\right)\right] \\
&= k \cdot \left(\frac{\delta}{k}\right) \\
&= \delta ,
\end{aligned}
\]
where the inequality is a union bound over the $k$ coordinates, and the equality that follows it uses the fact that each $Y_i \sim \mathrm{Lap}(\Delta f/\varepsilon)$ together with Fact 3.7 (applied with $t = \ln(k/\delta)$).

Example 3.3 (First Names). Suppose we wish to calculate which first names, from a list of 10,000 potential names, were the most common among participants of the 2010 census. This question can be represented as a query $f : \mathbb{N}^{|\mathcal{X}|} \to \mathbb{R}^{10000}$. This is a histogram query, and so has sensitivity $\Delta f = 1$, since every person can only have at most one first name. Using the above theorem, we see that we can simultaneously calculate the frequency of all 10,000 names with $(1, 0)$-differential privacy, and with probability 95%, no estimate will be off by more than an additive error of $\ln(10000/0.05) \approx 12.2$. That's pretty low error for a nation of more than 300,000,000 people!

Differentially Private Selection. The task in Example 3.3 is one of differentially private selection: the space of outcomes is discrete and the task is to produce a "best" answer, in this case the most populous histogram cell.
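The accuracy bound of Theorem 3.8 and the figure quoted in Example 3.3 can be checked numerically. The sketch below is illustrative only: `linf_error_bound` and `failure_rate` are names chosen here, and the simulation uses a smaller $k$ than the example simply to keep the run short.

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one sample from Lap(scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def linf_error_bound(k, sensitivity, epsilon, delta):
    """Theorem 3.8: error threshold ln(k/delta) * (sensitivity/epsilon)."""
    return math.log(k / delta) * sensitivity / epsilon

def failure_rate(k, sensitivity, epsilon, delta, trials, rng):
    """Fraction of simulated runs in which some coordinate reaches the threshold."""
    scale = sensitivity / epsilon
    bound = linf_error_bound(k, sensitivity, epsilon, delta)
    failures = 0
    for _ in range(trials):
        worst = max(abs(laplace_noise(scale, rng)) for _ in range(k))
        if worst >= bound:
            failures += 1
    return failures / trials

# Example 3.3: k = 10,000 names, sensitivity 1, epsilon = 1, delta = 0.05.
first_names_bound = linf_error_bound(k=10_000, sensitivity=1.0, epsilon=1.0, delta=0.05)

# Simulated check of the theorem on a smaller (assumed) instance.
rng = random.Random(1)
rate = failure_rate(k=100, sensitivity=1.0, epsilon=1.0, delta=0.05, trials=2_000, rng=rng)
```

Here `first_names_bound` evaluates to about 12.2, matching the example, and `rate` should stay near or below the target failure probability of 0.05.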

Example 3.4 (Most Common Medical Condition). Suppose we wish to know which condition is (approximately) the most common in the medical histories of a set of respondents, so the set of questions is, for each condition under consideration, whether the individual has ever received a diagnosis of this condition. Since individuals can experience many conditions, the sensitivity of this set of questions can be high. Nonetheless, as we next describe, this task can be addressed using addition of $\mathrm{Lap}(1/\varepsilon)$ noise to each of the counts (note the small scale of the noise, which is independent of the total number of conditions). Crucially, the $m$ noisy counts themselves will not be released (although the "winning" count can be released at no extra privacy cost).

Report Noisy Max. Consider the following simple algorithm to determine which of $m$ counting queries has the highest value: add independently generated Laplace noise $\mathrm{Lap}(1/\varepsilon)$ to each count and return the index of the largest noisy count (we ignore the possibility of a tie). Call this algorithm Report Noisy Max.

Note the "information minimization" principle at work in the Report Noisy Max algorithm: rather than releasing all the noisy counts and allowing the analyst to find the max and its index, only the index corresponding to the maximum is made public. Since the data of an individual can affect all counts, the vector of counts has high $\ell_1$-sensitivity, specifically, $\Delta f = m$, and much more noise would be needed if we wanted to release all of the counts using the Laplace mechanism.

Claim 3.9. The Report Noisy Max algorithm is $(\varepsilon, 0)$-differentially private.

Proof. Fix $D = D' \cup \{a\}$. Let $c$, respectively $c'$, denote the vector of counts when the database is $D$, respectively $D'$. We use two properties:

1. Monotonicity of Counts. For all $j \in [m]$, $c_j \ge c'_j$; and
2. Lipschitz Property. For all $j \in [m]$, $1 + c'_j \ge c_j$.

Fix any $i \in [m]$. We will bound from above and below the ratio of the probabilities that $i$ is selected with $D$ and with $D'$.

Fix $r_{-i}$, a draw from $[\mathrm{Lap}(1/\varepsilon)]^{m-1}$ used for all the noisy counts except the $i$th count. We will argue for each $r_{-i}$ independently. We use the notation $\Pr[i \mid \xi]$ to mean the probability that the output of the Report Noisy Max algorithm is $i$, conditioned on $\xi$.

We first argue that $\Pr[i \mid D, r_{-i}] \le e^{\varepsilon} \Pr[i \mid D', r_{-i}]$. Define
\[
r^* = \min_{r_i} : c_i + r_i > c_j + r_j \quad \forall j \ne i .
\]
Note that, having fixed $r_{-i}$, $i$ will be the output (the argmax noisy count) when the database is $D$ if and only if $r_i \ge r^*$.

We have, for all $1 \le j \ne i \le m$:
\[
\begin{aligned}
& c_i + r^* > c_j + r_j \\
\Rightarrow\ & (1 + c'_i) + r^* \ge c_i + r^* > c_j + r_j \ge c'_j + r_j \\
\Rightarrow\ & c'_i + (r^* + 1) > c'_j + r_j .
\end{aligned}
\]
Thus, if $r_i \ge r^* + 1$, then the $i$th count will be the maximum when the database is $D'$ and the noise vector is $(r_i, r_{-i})$. The probabilities below are over the choice of $r_i \sim \mathrm{Lap}(1/\varepsilon)$.
\[
\begin{aligned}
& \Pr[r_i \ge 1 + r^*] \ge e^{-\varepsilon} \Pr[r_i \ge r^*] = e^{-\varepsilon} \Pr[i \mid D, r_{-i}] \\
\Rightarrow\ & \Pr[i \mid D', r_{-i}] \ge \Pr[r_i \ge 1 + r^*] \ge e^{-\varepsilon} \Pr[r_i \ge r^*] = e^{-\varepsilon} \Pr[i \mid D, r_{-i}],
\end{aligned}
\]
which, after multiplying through by $e^{\varepsilon}$, yields what we wanted to show: $\Pr[i \mid D, r_{-i}] \le e^{\varepsilon} \Pr[i \mid D', r_{-i}]$.

We now argue that $\Pr[i \mid D', r_{-i}] \le e^{\varepsilon} \Pr[i \mid D, r_{-i}]$. Define
\[
r^* = \min_{r_i} : c'_i + r_i > c'_j + r_j \quad \forall j \ne i .
\]
Note that, having fixed $r_{-i}$, $i$ will be the output (argmax noisy count) when the database is $D'$ if and only if $r_i \ge r^*$.

We have, for all $1 \le j \ne i \le m$:
\[
\begin{aligned}
& c'_i + r^* > c'_j + r_j \\
\Rightarrow\ & 1 + c'_i + r^* > 1 + c'_j + r_j \\
\Rightarrow\ & c'_i + (r^* + 1) > (1 + c'_j) + r_j \\
\Rightarrow\ & c_i + (r^* + 1) \ge c'_i + (r^* + 1) > (1 + c'_j) + r_j \ge c_j + r_j .
\end{aligned}
\]
Thus, if $r_i \ge r^* + 1$, then $i$ will be the output (the argmax noisy count) on database $D$ with randomness $(r_i, r_{-i})$. We therefore have, with probabilities taken over choice of $r_i$:
\[
\Pr[i \mid D, r_{-i}] \ge \Pr[r_i \ge r^* + 1] \ge e^{-\varepsilon} \Pr[r_i \ge r^*] = e^{-\varepsilon} \Pr[i \mid D', r_{-i}],
\]

which, after multiplying through by $e^{\varepsilon}$, yields what we wanted to show: $\Pr[i \mid D', r_{-i}] \le e^{\varepsilon} \Pr[i \mid D, r_{-i}]$.

3.4 The exponential mechanism

In both the "most common name" and "most common condition" examples, we estimated the "utility" of a response (name or medical condition, respectively) by adding Laplace noise to counts, and reported the noisy maximum. In both examples the utility of the response is directly related to the noise values generated; that is, the popularity of the name or condition is appropriately measured on the same scale and in the same units as the magnitude of the noise.

The exponential mechanism was designed for situations in which we wish to choose the "best" response but adding noise directly to the computed quantity can completely destroy its value, such as setting a price in an auction, where the goal is to maximize revenue, and adding a small amount of positive noise to the optimal price (in order to protect the privacy of a bid) could dramatically reduce the resulting revenue.

Example 3.5 (Pumpkins). Suppose we have an abundant supply of pumpkins and four bidders: $A, F, I, K$, where $A, F, I$ each bid \$1.00 and $K$ bids \$3.01. What is the optimal price? At \$3.01 the revenue is \$3.01, at \$3.00 the revenue is \$3.00, and at \$1.00 all four bidders buy, so the revenue is \$4.00; but at \$3.02 the revenue is zero, and at \$1.01 it drops to \$1.01!

The exponential mechanism is the natural building block for answering queries with arbitrary utilities (and arbitrary non-numeric range), while preserving differential privacy. Given some arbitrary range $\mathcal{R}$, the exponential mechanism is defined with respect to some utility function $u : \mathbb{N}^{|\mathcal{X}|} \times \mathcal{R} \to \mathbb{R}$, which maps database/output pairs to utility scores. Intuitively, for a fixed database $x$, the user prefers that the mechanism outputs some element of $\mathcal{R}$ with the maximum possible utility score. Note that when we talk about the sensitivity of the utility score $u : \mathbb{N}^{|\mathcal{X}|} \times \mathcal{R} \to \mathbb{R}$, we care only about the sensitivity of $u$ with respect to its database argument; it can be arbitrarily sensitive in its range argument:
\[
\Delta u \equiv \max_{r \in \mathcal{R}} \ \max_{x, y : \|x - y\|_1 \le 1} |u(x, r) - u(y, r)| .
\]

The intuition behind the exponential mechanism is to output each possible $r \in \mathcal{R}$ with probability proportional to $\exp(\varepsilon u(x, r)/\Delta u)$, and so the privacy loss is approximately:
\[
\ln\left(\frac{\exp(\varepsilon u(x, r)/\Delta u)}{\exp(\varepsilon u(y, r)/\Delta u)}\right) = \varepsilon [u(x, r) - u(y, r)]/\Delta u \le \varepsilon .
\]
This intuitive view overlooks some effects of a normalization term which arises when an additional person in the database causes the utilities of some elements $r \in \mathcal{R}$ to decrease and others to increase. The actual mechanism, defined next, reserves half the privacy budget for changes in the normalization term.

Definition 3.4 (The Exponential Mechanism). The exponential mechanism $\mathcal{M}_E(x, u, \mathcal{R})$ selects and outputs an element $r \in \mathcal{R}$ with probability proportional to $\exp\left(\frac{\varepsilon u(x, r)}{2 \Delta u}\right)$.

The exponential mechanism can define a complex distribution over a large arbitrary domain, and so it may not be possible to implement the exponential mechanism efficiently when the range of $u$ is super-polynomially large in the natural parameters of the problem.

Returning to the pumpkin example, utility for a price $p$ on database $x$ is simply the profit obtained when the price is $p$ and the demand curve is as described by $x$. It is important that the range of potential prices is independent of the actual bids. Otherwise there would exist a price with non-zero weight in one dataset and zero weight in a neighboring set, violating differential privacy.

Theorem 3.10. The exponential mechanism preserves $(\varepsilon, 0)$-differential privacy.

Proof. For clarity, we assume the range $\mathcal{R}$ of the exponential mechanism is finite, but this is not necessary. As in all differential privacy proofs, we consider the ratio of the probability that an instantiation of the exponential mechanism outputs some element $r \in \mathcal{R}$ on two neighboring databases $x \in \mathbb{N}^{|\mathcal{X}|}$ and $y \in \mathbb{N}^{|\mathcal{X}|}$ (i.e., $\|x - y\|_1 \le 1$).
\[
\begin{aligned}
\frac{\Pr[\mathcal{M}_E(x, u, \mathcal{R}) = r]}{\Pr[\mathcal{M}_E(y, u, \mathcal{R}) = r]}
&= \frac{\left(\dfrac{\exp\left(\frac{\varepsilon u(x, r)}{2\Delta u}\right)}{\sum_{r' \in \mathcal{R}} \exp\left(\frac{\varepsilon u(x, r')}{2\Delta u}\right)}\right)}{\left(\dfrac{\exp\left(\frac{\varepsilon u(y, r)}{2\Delta u}\right)}{\sum_{r' \in \mathcal{R}} \exp\left(\frac{\varepsilon u(y, r')}{2\Delta u}\right)}\right)} \\
&= \left(\frac{\exp\left(\frac{\varepsilon u(x, r)}{2\Delta u}\right)}{\exp\left(\frac{\varepsilon u(y, r)}{2\Delta u}\right)}\right) \cdot \left(\frac{\sum_{r' \in \mathcal{R}} \exp\left(\frac{\varepsilon u(y, r')}{2\Delta u}\right)}{\sum_{r' \in \mathcal{R}} \exp\left(\frac{\varepsilon u(x, r')}{2\Delta u}\right)}\right) \\
&= \exp\left(\frac{\varepsilon (u(x, r) - u(y, r))}{2\Delta u}\right) \cdot \left(\frac{\sum_{r' \in \mathcal{R}} \exp\left(\frac{\varepsilon u(y, r')}{2\Delta u}\right)}{\sum_{r' \in \mathcal{R}} \exp\left(\frac{\varepsilon u(x, r')}{2\Delta u}\right)}\right) \\
&\le \exp\left(\frac{\varepsilon}{2}\right) \cdot \exp\left(\frac{\varepsilon}{2}\right) \cdot \left(\frac{\sum_{r' \in \mathcal{R}} \exp\left(\frac{\varepsilon u(x, r')}{2\Delta u}\right)}{\sum_{r' \in \mathcal{R}} \exp\left(\frac{\varepsilon u(x, r')}{2\Delta u}\right)}\right) \\
&= \exp(\varepsilon) .
\end{aligned}
\]
Similarly, $\frac{\Pr[\mathcal{M}_E(y, u, \mathcal{R}) = r]}{\Pr[\mathcal{M}_E(x, u, \mathcal{R}) = r]} \ge \exp(-\varepsilon)$ by symmetry.

The exponential mechanism can often give strong utility guarantees, because it discounts outcomes exponentially quickly as their quality score falls off. For a given database $x$ and a given utility measure $u : \mathbb{N}^{|\mathcal{X}|} \times \mathcal{R} \to \mathbb{R}$, let $\mathrm{OPT}_u(x) = \max_{r \in \mathcal{R}} u(x, r)$ denote the maximum utility score of any element $r \in \mathcal{R}$ with respect to database $x$. We will bound the probability that the exponential mechanism returns a "good" element of $\mathcal{R}$, where good will be measured in terms of $\mathrm{OPT}_u(x)$. The result is that it will be highly unlikely that the returned element $r$ has a utility score that is inferior to $\mathrm{OPT}_u(x)$ by more than an additive factor of $O((\Delta u/\varepsilon) \log |\mathcal{R}|)$.

Theorem 3.11. Fixing a database $x$, let $\mathcal{R}_{\mathrm{OPT}} = \{r \in \mathcal{R} : u(x, r) = \mathrm{OPT}_u(x)\}$ denote the set of elements in $\mathcal{R}$ which attain utility score $\mathrm{OPT}_u(x)$. Then:
\[
\Pr\left[u(\mathcal{M}_E(x, u, \mathcal{R})) \le \mathrm{OPT}_u(x) - \frac{2\Delta u}{\varepsilon}\left(\ln\left(\frac{|\mathcal{R}|}{|\mathcal{R}_{\mathrm{OPT}}|}\right) + t\right)\right] \le e^{-t} .
\]

Proof.
\[
\begin{aligned}
\Pr[u(\mathcal{M}_E(x, u, \mathcal{R})) \le c]
&\le \frac{|\mathcal{R}| \exp(\varepsilon c / 2\Delta u)}{|\mathcal{R}_{\mathrm{OPT}}| \exp(\varepsilon\, \mathrm{OPT}_u(x) / 2\Delta u)} \\
&= \frac{|\mathcal{R}|}{|\mathcal{R}_{\mathrm{OPT}}|} \exp\left(\frac{\varepsilon (c - \mathrm{OPT}_u(x))}{2\Delta u}\right) .
\end{aligned}
\]
The inequality follows from the observation that each $r \in \mathcal{R}$ with $u(x, r) \le c$ has un-normalized probability mass at most $\exp(\varepsilon c / 2\Delta u)$, and hence the entire set of such "bad" elements $r$ has total un-normalized probability mass at most $|\mathcal{R}| \exp(\varepsilon c / 2\Delta u)$. In contrast, we know that there exist at least $|\mathcal{R}_{\mathrm{OPT}}| \ge 1$ elements with $u(x, r) = \mathrm{OPT}_u(x)$, and hence un-normalized probability mass $\exp(\varepsilon\, \mathrm{OPT}_u(x) / 2\Delta u)$, and so this is a lower bound on the normalization term.

The theorem follows from plugging in the appropriate value for $c$, namely $c = \mathrm{OPT}_u(x) - \frac{2\Delta u}{\varepsilon}\left(\ln\left(\frac{|\mathcal{R}|}{|\mathcal{R}_{\mathrm{OPT}}|}\right) + t\right)$.

Since we always have $|\mathcal{R}_{\mathrm{OPT}}| \ge 1$, we can more commonly make use of the following simple corollary:

Corollary 3.12. Fixing a database $x$, we have:
\[
\Pr\left[u(\mathcal{M}_E(x, u, \mathcal{R})) \le \mathrm{OPT}_u(x) - \frac{2\Delta u}{\varepsilon}\left(\ln(|\mathcal{R}|) + t\right)\right] \le e^{-t} .
\]

As seen in the proofs of Theorem 3.11 and Corollary 3.12, the Exponential Mechanism can be particularly easy to analyze.

Example 3.6 (Best of Two). Consider the simple question of determining which of exactly two medical conditions $A$ and $B$ is more common. Let the two true counts be 0 for condition $A$ and $c > 0$ for condition $B$. Our notion of utility will be tied to the actual counts, so that conditions with bigger counts have higher utility and $\Delta u = 1$. Thus, the utility of $A$ is 0 and the utility of $B$ is $c$. Using the Exponential Mechanism we can immediately apply Corollary 3.12 to see that the probability of observing (wrong) outcome $A$ is at most $2e^{-c(\varepsilon/(2\Delta u))} = 2e^{-c\varepsilon/2}$.

Analyzing Report Noisy Max appears to be more complicated, as it requires understanding what happens in the (probability 1/4) case when the noise added to the count for $A$ is positive and the noise added to the count for $B$ is negative.

A function is monotonic in the data set if the addition of an element to the data set cannot cause the value of the function to decrease. Counting queries are monotonic; so is the revenue obtained by offering a fixed price to a collection of buyers.

Consider the Report One-Sided Noisy Arg-Max mechanism, which adds noise to the utility of each potential output drawn from the one-sided exponential distribution with parameter $\varepsilon/\Delta u$ in the case of a monotonic utility, or parameter $\varepsilon/2\Delta u$ for the case of a non-monotonic utility, and reports the resulting arg-max.

With this algorithm, whose privacy proof is almost identical to that of Report Noisy Max (but loses a factor of two when the utility is non-monotonic), we immediately obtain in Example 3.6 above that outcome $A$ is exponentially in $c(\varepsilon/\Delta u) = c\varepsilon$ less likely to be selected than outcome $B$.

Theorem 3.13. Report One-Sided Noisy Arg-Max, when run with parameter $\varepsilon/2\Delta u$, yields the same distribution on outputs as the exponential mechanism.

3.5 Composition theorems

Now that we have several building blocks for designing differentially private algorithms, it is important to understand how we can combine them to design more sophisticated algorithms. In order to use these tools, we would like that the combination of two differentially private algorithms be differentially private itself. Indeed, as we will see, this is the case. Of course the parameters $\varepsilon$ and $\delta$ will necessarily degrade: consider repeatedly computing the same statistic using the Laplace mechanism, scaled to give $\varepsilon$-differential privacy each time. The average of the answer given by each instance of the mechanism will eventually converge to the true value of the statistic, and so we cannot avoid that the strength of our privacy guarantee will degrade with repeated use. In this section we give theorems showing how exactly the parameters $\varepsilon$ and $\delta$ compose when differentially private subroutines are combined.

Let us first begin with an easy warm up: we will see that the independent use of an $(\varepsilon_1, 0)$-differentially private algorithm and an $(\varepsilon_2, 0)$-differentially private algorithm, when taken together, is $(\varepsilon_1 + \varepsilon_2, 0)$-differentially private.

Theorem 3.14. Let $\mathcal{M}_1 : \mathbb{N}^{|\mathcal{X}|} \to \mathcal{R}_1$ be an $\varepsilon_1$-differentially private algorithm, and let $\mathcal{M}_2 : \mathbb{N}^{|\mathcal{X}|} \to \mathcal{R}_2$ be an $\varepsilon_2$-differentially private algorithm. Then their combination, defined to be $\mathcal{M}_{1,2} : \mathbb{N}^{|\mathcal{X}|} \to \mathcal{R}_1 \times \mathcal{R}_2$ by the mapping $\mathcal{M}_{1,2}(x) = (\mathcal{M}_1(x), \mathcal{M}_2(x))$, is $(\varepsilon_1 + \varepsilon_2)$-differentially private.

Proof. Let $x, y \in \mathbb{N}^{|\mathcal{X}|}$ be such that $\|x - y\|_1 \le 1$. Fix any $(r_1, r_2) \in \mathcal{R}_1 \times \mathcal{R}_2$. Then:
\[
\begin{aligned}
\frac{\Pr[\mathcal{M}_{1,2}(x) = (r_1, r_2)]}{\Pr[\mathcal{M}_{1,2}(y) = (r_1, r_2)]}
&= \frac{\Pr[\mathcal{M}_1(x) = r_1] \Pr[\mathcal{M}_2(x) = r_2]}{\Pr[\mathcal{M}_1(y) = r_1] \Pr[\mathcal{M}_2(y) = r_2]} \\
&= \left(\frac{\Pr[\mathcal{M}_1(x) = r_1]}{\Pr[\mathcal{M}_1(y) = r_1]}\right) \left(\frac{\Pr[\mathcal{M}_2(x) = r_2]}{\Pr[\mathcal{M}_2(y) = r_2]}\right) \\
&\le \exp(\varepsilon_1) \exp(\varepsilon_2) \\
&= \exp(\varepsilon_1 + \varepsilon_2) .
\end{aligned}
\]
By symmetry, $\frac{\Pr[\mathcal{M}_{1,2}(x) = (r_1, r_2)]}{\Pr[\mathcal{M}_{1,2}(y) = (r_1, r_2)]} \ge \exp(-(\varepsilon_1 + \varepsilon_2))$.

The composition theorem can be applied repeatedly to obtain the following corollary:

Corollary 3.15. Let $\mathcal{M}_i : \mathbb{N}^{|\mathcal{X}|} \to \mathcal{R}_i$ be an $(\varepsilon_i, 0)$-differentially private algorithm for $i \in [k]$. Then if $\mathcal{M}_{[k]} : \mathbb{N}^{|\mathcal{X}|} \to \prod_{i=1}^{k} \mathcal{R}_i$ is defined to be $\mathcal{M}_{[k]}(x) = (\mathcal{M}_1(x), \ldots, \mathcal{M}_k(x))$, then $\mathcal{M}_{[k]}$ is $\left(\sum_{i=1}^{k} \varepsilon_i, 0\right)$-differentially private.

The generalization of this corollary to $(\varepsilon, \delta)$-differential privacy is Theorem 3.16; a proof appears in Appendix B.
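The degradation described at the start of Section 3.5 can be seen directly in simulation: each release is individually private, yet the average of many releases approaches the true value, while the budget from Corollary 3.15 grows linearly. The following is a sketch under assumed values (a true statistic of 42, $\varepsilon = 0.1$, 10,000 repetitions); the helper names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Draw one sample from Lap(scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def basic_composition(epsilons):
    """Corollary 3.15: the epsilons of pure-DP mechanisms add up."""
    return sum(epsilons)

def repeated_release(true_value, epsilon, k, rng):
    """Answer the same sensitivity-1 statistic k times with Lap(1/epsilon) noise."""
    return [true_value + laplace_noise(1.0 / epsilon, rng) for _ in range(k)]

rng = random.Random(2)
k, epsilon, true_value = 10_000, 0.1, 42.0   # illustrative values only
answers = repeated_release(true_value, epsilon, k, rng)
average = sum(answers) / k                    # concentrates around true_value
total_epsilon = basic_composition([epsilon] * k)
```

Each single answer carries noise of typical magnitude $1/\varepsilon = 10$, but `average` sits far closer to the true value, which is consistent with the cumulative guarantee `total_epsilon` being a thousand rather than 0.1.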

Theorem 3.16. Let $\mathcal{M}_i : \mathbb{N}^{|\mathcal{X}|} \to \mathcal{R}_i$ be an $(\varepsilon_i, \delta_i)$-differentially private algorithm for $i \in [k]$. Then if $\mathcal{M}_{[k]} : \mathbb{N}^{|\mathcal{X}|} \to \prod_{i=1}^{k} \mathcal{R}_i$ is defined to be $\mathcal{M}_{[k]}(x) = (\mathcal{M}_1(x), \ldots, \mathcal{M}_k(x))$, then $\mathcal{M}_{[k]}$ is $\left(\sum_{i=1}^{k} \varepsilon_i, \sum_{i=1}^{k} \delta_i\right)$-differentially private.

It is a strength of differential privacy that composition is "automatic," in that the bounds obtained hold without any special effort by the database curator.

3.5.1 Composition: some technicalities

In the remainder of this section, we will prove a more sophisticated composition theorem. To this end, we will need some definitions and lemmas, rephrasing differential privacy in terms of distance measures between distributions. In the fractional quantities below, if the denominator is zero, then we define the value of the fraction to be infinite (the numerators will always be positive).

Definition 3.5 (KL-Divergence). The KL-Divergence, or Relative Entropy, between two random variables $Y$ and $Z$ taking values from the same domain is defined to be:
\[
D(Y \| Z) = \mathbb{E}_{y \sim Y}\left[\ln \frac{\Pr[Y = y]}{\Pr[Z = y]}\right] .
\]
It is known that $D(Y \| Z) \ge 0$, with equality if and only if $Y$ and $Z$ are identically distributed. However, $D$ is not symmetric, does not satisfy the triangle inequality, and can even be infinite, specifically when $\mathrm{Supp}(Y)$ is not contained in $\mathrm{Supp}(Z)$.

Definition 3.6 (Max Divergence). The Max Divergence between two random variables $Y$ and $Z$ taking values from the same domain is defined to be:
\[
D_\infty(Y \| Z) = \max_{S \subseteq \mathrm{Supp}(Y)} \left[\ln \frac{\Pr[Y \in S]}{\Pr[Z \in S]}\right] .
\]
The $\delta$-Approximate Max Divergence between $Y$ and $Z$ is defined to be:
\[
D_\infty^{\delta}(Y \| Z) = \max_{S \subseteq \mathrm{Supp}(Y) : \Pr[Y \in S] \ge \delta} \left[\ln \frac{\Pr[Y \in S] - \delta}{\Pr[Z \in S]}\right] .
\]
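For small discrete distributions, the three divergences in Definitions 3.5 and 3.6 can be computed by direct enumeration. The sketch below is an illustration only: distributions are represented as dictionaries from outcomes to probabilities (an assumption of this example), the subset search is exponential in the support size, and the two test distributions are the outputs of a randomized-response scheme that reports the truth with probability 3/4, chosen here as a convenient worked case.

```python
import itertools
import math

def kl_divergence(p, q):
    """D(Y || Z) for discrete distributions given as {outcome: probability}."""
    total = 0.0
    for y, py in p.items():
        if py == 0:
            continue
        qy = q.get(y, 0.0)
        if qy == 0:
            return math.inf          # Supp(Y) not contained in Supp(Z)
        total += py * math.log(py / qy)
    return total

def approx_max_divergence(p, q, delta):
    """delta-approximate max divergence by brute force over subsets of Supp(Y)."""
    support = [y for y, py in p.items() if py > 0]
    best = -math.inf
    for size in range(1, len(support) + 1):
        for subset in itertools.combinations(support, size):
            p_s = sum(p[y] for y in subset)
            q_s = sum(q.get(y, 0.0) for y in subset)
            if p_s < delta:
                continue             # the definition only ranges over Pr[Y in S] >= delta
            numerator = p_s - delta
            if numerator <= 0:
                value = -math.inf
            elif q_s == 0:
                value = math.inf     # zero denominator: fraction defined to be infinite
            else:
                value = math.log(numerator / q_s)
            best = max(best, value)
    return best

def max_divergence(p, q):
    """D_infinity(Y || Z) is the delta = 0 case."""
    return approx_max_divergence(p, q, 0.0)

# Output distributions when the true answer is "yes" versus "no".
yes_truth = {"yes": 0.75, "no": 0.25}
no_truth = {"yes": 0.25, "no": 0.75}

kl = kl_divergence(yes_truth, no_truth)
d_inf = max_divergence(yes_truth, no_truth)
d_inf_delta = approx_max_divergence(yes_truth, no_truth, 0.05)
```

On this pair the max divergence is $\ln 3$, the KL-divergence is $\tfrac{1}{2}\ln 3$, and allowing $\delta = 0.05$ lowers the max divergence to $\ln 2.8$.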

Remark 3.1. Note that a mechanism $\mathcal{M}$ is

1. $\varepsilon$-differentially private if and only if on every two neighboring databases $x, y$, $D_\infty(\mathcal{M}(x) \| \mathcal{M}(y)) \le \varepsilon$ and $D_\infty(\mathcal{M}(y) \| \mathcal{M}(x)) \le \varepsilon$; and is
2. $(\varepsilon, \delta)$-differentially private if and only if on every two neighboring databases $x, y$: $D_\infty^{\delta}(\mathcal{M}(x) \| \mathcal{M}(y)) \le \varepsilon$ and $D_\infty^{\delta}(\mathcal{M}(y) \| \mathcal{M}(x)) \le \varepsilon$.

One other distance measure that will be useful is the statistical distance between two random variables $Y$ and $Z$, defined as
\[
\Delta(Y, Z) \overset{\text{def}}{=} \max_{S} |\Pr[Y \in S] - \Pr[Z \in S]| .
\]
We say that $Y$ and $Z$ are $\delta$-close if $\Delta(Y, Z) \le \delta$.

We will use the following reformulations of approximate max-divergence in terms of exact max-divergence and statistical distance:

Lemma 3.17.

1. $D_\infty^{\delta}(Y \| Z) \le \varepsilon$ if and only if there exists a random variable $Y'$ such that $\Delta(Y, Y') \le \delta$ and $D_\infty(Y' \| Z) \le \varepsilon$.
2. We have both $D_\infty^{\delta}(Y \| Z) \le \varepsilon$ and $D_\infty^{\delta}(Z \| Y) \le \varepsilon$ if and only if there exist random variables $Y', Z'$ such that $\Delta(Y, Y') \le \delta/(e^{\varepsilon} + 1)$, $\Delta(Z, Z') \le \delta/(e^{\varepsilon} + 1)$, and $D_\infty(Y' \| Z') \le \varepsilon$.

Proof. For Part 1, suppose there exists $Y'$ $\delta$-close to $Y$ such that $D_\infty(Y' \| Z) \le \varepsilon$. Then for every $S$,
\[
\Pr[Y \in S] \le \Pr[Y' \in S] + \delta \le e^{\varepsilon} \cdot \Pr[Z \in S] + \delta ,
\]
and thus $D_\infty^{\delta}(Y \| Z) \le \varepsilon$.

Conversely, suppose that $D_\infty^{\delta}(Y \| Z) \le \varepsilon$. Let $S = \{y : \Pr[Y = y] > e^{\varepsilon} \cdot \Pr[Z = y]\}$. Then
\[
\sum_{y \in S} \left(\Pr[Y = y] - e^{\varepsilon} \cdot \Pr[Z = y]\right) = \Pr[Y \in S] - e^{\varepsilon} \cdot \Pr[Z \in S] \le \delta .
\]

Moreover, if we let $T = \{y : \Pr[Y = y] < \Pr[Z = y]\}$, then we have
\[
\begin{aligned}
\sum_{y \in T} \left(\Pr[Z = y] - \Pr[Y = y]\right)
&= \sum_{y \notin T} \left(\Pr[Y = y] - \Pr[Z = y]\right) \\
&\ge \sum_{y \in S} \left(\Pr[Y = y] - \Pr[Z = y]\right) \\
&\ge \sum_{y \in S} \left(\Pr[Y = y] - e^{\varepsilon} \cdot \Pr[Z = y]\right) .
\end{aligned}
\]
Thus, we can obtain $Y'$ from $Y$ by lowering the probabilities on $S$ and raising the probabilities on $T$ to satisfy:

1. For all $y \in S$, $\Pr[Y' = y] = e^{\varepsilon} \cdot \Pr[Z = y] < \Pr[Y = y]$.
2. For all $y \in T$, $\Pr[Y = y] \le \Pr[Y' = y] \le \Pr[Z = y]$.
3. For all $y \notin S \cup T$, $\Pr[Y' = y] = \Pr[Y = y] \le e^{\varepsilon} \cdot \Pr[Z = y]$.

Then $D_\infty(Y' \| Z) \le \varepsilon$ by inspection, and
\[
\Delta(Y, Y') = \Pr[Y \in S] - \Pr[Y' \in S] = \Pr[Y \in S] - e^{\varepsilon} \cdot \Pr[Z \in S] \le \delta .
\]

We now prove Part 2. Suppose there exist random variables $Y'$ and $Z'$ as stated. Then, for every set $S$,
\[
\begin{aligned}
\Pr[Y \in S]
&\le \Pr[Y' \in S] + \frac{\delta}{e^{\varepsilon} + 1} \\
&\le e^{\varepsilon} \cdot \Pr[Z' \in S] + \frac{\delta}{e^{\varepsilon} + 1} \\
&\le e^{\varepsilon} \cdot \left(\Pr[Z \in S] + \frac{\delta}{e^{\varepsilon} + 1}\right) + \frac{\delta}{e^{\varepsilon} + 1} \\
&= e^{\varepsilon} \cdot \Pr[Z \in S] + \delta .
\end{aligned}
\]
Thus $D_\infty^{\delta}(Y \| Z) \le \varepsilon$, and by symmetry, $D_\infty^{\delta}(Z \| Y) \le \varepsilon$.

Conversely, given $Y$ and $Z$ such that $D_\infty^{\delta}(Y \| Z) \le \varepsilon$ and $D_\infty^{\delta}(Z \| Y) \le \varepsilon$, we proceed similarly to Part 1. However, instead of simply decreasing the probability mass of $Y$ on $S$ to obtain $Y'$ and eliminate the gap with $e^{\varepsilon} \cdot Z$, we also increase the probability mass of $Z$ on $S$. Specifically, for every $y \in S$, we'll take
\[
\begin{aligned}
\Pr[Y' = y] &= e^{\varepsilon} \cdot \Pr[Z' = y] \\
&= \frac{e^{\varepsilon}}{1 + e^{\varepsilon}} \cdot \left(\Pr[Y = y] + \Pr[Z = y]\right) \\
&\in \left[e^{\varepsilon} \cdot \Pr[Z = y],\ \Pr[Y = y]\right] .
\end{aligned}
\]

This also implies that for $y \in S$, we have:
\[
\Pr[Y = y] - \Pr[Y' = y] = \Pr[Z' = y] - \Pr[Z = y] = \frac{\Pr[Y = y] - e^{\varepsilon} \cdot \Pr[Z = y]}{e^{\varepsilon} + 1} ,
\]
and thus
\[
\begin{aligned}
\alpha &\overset{\text{def}}{=} \sum_{y \in S} \left(\Pr[Y = y] - \Pr[Y' = y]\right) \\
&= \sum_{y \in S} \left(\Pr[Z' = y] - \Pr[Z = y]\right) \\
&= \frac{\Pr[Y \in S] - e^{\varepsilon} \cdot \Pr[Z \in S]}{e^{\varepsilon} + 1} \\
&\le \frac{\delta}{e^{\varepsilon} + 1} .
\end{aligned}
\]
Similarly on the set $S' = \{y : \Pr[Z = y] > e^{\varepsilon} \cdot \Pr[Y = y]\}$, we can decrease the probability mass of $Z$ and increase the probability mass of $Y$ by a total of some $\alpha' \le \delta/(e^{\varepsilon} + 1)$ so that for every $y \in S'$, we have $\Pr[Z' = y] = e^{\varepsilon} \cdot \Pr[Y' = y]$.

If $\alpha = \alpha'$, then we can take $\Pr[Z' = y] = \Pr[Z = y]$ and $\Pr[Y' = y] = \Pr[Y = y]$ for all $y \notin S \cup S'$, giving $D_\infty(Y' \| Z') \le \varepsilon$ and $\Delta(Y, Y') = \Delta(Z, Z') = \alpha$. If $\alpha \ne \alpha'$, say $\alpha > \alpha'$, then we need to still increase the probability mass of $Y'$ and decrease the mass of $Z'$ by a total of $\beta = \alpha - \alpha'$ on points outside of $S \cup S'$ in order to ensure that the probabilities sum to 1. That is, if we try to take the "mass functions" $\Pr[Y' = y]$ and $\Pr[Z' = y]$ as defined above, then while we do have the property that for every $y$, $\Pr[Y' = y] \le e^{\varepsilon} \cdot \Pr[Z' = y]$ and $\Pr[Z' = y] \le e^{\varepsilon} \cdot \Pr[Y' = y]$, we also have $\sum_y \Pr[Y' = y] = 1 - \beta$ and $\sum_y \Pr[Z' = y] = 1 + \beta$. However, this means that if we let $R = \{y : \Pr[Y' = y] < \Pr[Z' = y]\}$, then
\[
\sum_{y \in R} \left(\Pr[Z' = y] - \Pr[Y' = y]\right) \ge \sum_{y} \left(\Pr[Z' = y] - \Pr[Y' = y]\right) = 2\beta .
\]
So we can increase the probability mass of $Y'$ on points in $R$ by a total of $\beta$ and decrease the probability mass of $Z'$ on points in $R$ by a total of $\beta$, while retaining the property that for all $y \in R$, $\Pr[Y' = y] \le \Pr[Z' = y]$.

The resulting $Y'$ and $Z'$ have the properties we want: $D_\infty(Y' \| Z') \le \varepsilon$ and $\Delta(Y, Y'), \Delta(Z, Z') \le \alpha$.

Lemma 3.18. Suppose that random variables $Y$ and $Z$ satisfy $D_\infty(Y \| Z) \le \varepsilon$ and $D_\infty(Z \| Y) \le \varepsilon$. Then $D(Y \| Z) \le \varepsilon \cdot (e^{\varepsilon} - 1)$.

Proof. We know that for any $Y$ and $Z$ it is the case that $D(Y \| Z) \ge 0$ (via the "log-sum inequality"), and so it suffices to bound $D(Y \| Z) + D(Z \| Y)$. We get:
\[
\begin{aligned}
D(Y \| Z) &\le D(Y \| Z) + D(Z \| Y) \\
&= \sum_{y} \left[\Pr[Y = y] \cdot \left(\ln \frac{\Pr[Y = y]}{\Pr[Z = y]} + \ln \frac{\Pr[Z = y]}{\Pr[Y = y]}\right) + \left(\Pr[Z = y] - \Pr[Y = y]\right) \cdot \ln \frac{\Pr[Z = y]}{\Pr[Y = y]}\right] \\
&\le \sum_{y} \left[0 + |\Pr[Z = y] - \Pr[Y = y]| \cdot \varepsilon\right] \\
&= \varepsilon \cdot \sum_{y} \left[\max\{\Pr[Y = y], \Pr[Z = y]\} - \min\{\Pr[Y = y], \Pr[Z = y]\}\right] \\
&\le \varepsilon \cdot \sum_{y} \left[(e^{\varepsilon} - 1) \cdot \min\{\Pr[Y = y], \Pr[Z = y]\}\right] \\
&\le \varepsilon \cdot (e^{\varepsilon} - 1) .
\end{aligned}
\]

Lemma 3.19 (Azuma's Inequality). Let $C_1, \ldots, C_k$ be real-valued random variables such that for every $i \in [k]$, $\Pr[|C_i| \le \alpha] = 1$, and for every $(c_1, \ldots, c_{i-1}) \in \mathrm{Supp}(C_1, \ldots, C_{i-1})$, we have
\[
\mathbb{E}[C_i \mid C_1 = c_1, \ldots, C_{i-1} = c_{i-1}] \le \beta .
\]
Then for every $z > 0$, we have
\[
\Pr\left[\sum_{i=1}^{k} C_i > k\beta + z\sqrt{k} \cdot \alpha\right] \le e^{-z^2/2} .
\]

3.5.2 Advanced composition

In addition to allowing the parameters to degrade more slowly, we would like our theorem to be able to handle more complicated forms of composition. However, before we begin, we must discuss what exactly we mean by composition. We would like our definitions to cover the following two interesting scenarios:

1. Repeated use of differentially private algorithms on the same database. This allows both the repeated use of the same mechanism multiple times, as well as the modular construction of differentially private algorithms from arbitrary private building blocks.
2. Repeated use of differentially private algorithms on different databases that may nevertheless contain information relating to the same individual. This allows us to reason about the cumulative privacy loss of a single individual whose data might be spread across multiple data sets, each of which may be used independently in a differentially private way. Since new databases are created all the time, and the adversary may actually influence the makeup of these new databases, this is a fundamentally different problem than repeatedly querying a single, fixed, database.

We want to model composition where the adversary can adaptively affect the databases being input to future mechanisms, as well as the queries to those mechanisms. Let $\mathcal{F}$ be a family of database access mechanisms. (For example $\mathcal{F}$ could be the set of all $\varepsilon$-differentially private mechanisms.) For a probabilistic adversary $A$, we consider two experiments, Experiment 0 and Experiment 1, defined as follows.

Experiment $b$ for family $\mathcal{F}$ and adversary $A$:

For $i = 1, \ldots, k$:

1. $A$ outputs two adjacent databases $x_i^0$ and $x_i^1$, a mechanism $\mathcal{M}_i \in \mathcal{F}$, and parameters $w_i$.
2. $A$ receives $y_i \in_R \mathcal{M}_i(w_i, x_i^b)$.

We allow the adversary $A$ above to be stateful throughout the experiment, and thus it may choose the databases, mechanisms, and the parameters adaptively depending on the outputs of previous mechanisms. We define $A$'s view of the experiment to be $A$'s coin tosses and all of the mechanism outputs $(y_1, \ldots, y_k)$. (The $x_i^j$'s, $\mathcal{M}_i$'s, and $w_i$'s can all be reconstructed from these.)

For intuition, consider an adversary who always chooses $x_i^0$ to hold Bob's data and $x_i^1$ to differ only in that Bob's data are deleted. Then Experiment 0 can be thought of as the "real world," where Bob allows his data to be used in many data releases, and Experiment 1 as an "ideal world," where the outcomes of these data releases do not depend on Bob's data. Our definitions of privacy still require these two experiments to be "close" to each other, in the same way as required by the definitions of differential privacy. The intuitive guarantee to Bob is that the adversary "can't tell", given the output of all $k$ mechanisms, whether Bob's data was ever used.

Definition 3.7. We say that the family $\mathcal{F}$ of database access mechanisms satisfies $\varepsilon$-differential privacy under $k$-fold adaptive composition if for every adversary $A$, we have $D_\infty(V^0 \| V^1) \le \varepsilon$ where $V^b$ denotes the view of $A$ in $k$-fold Composition Experiment $b$ above.

$(\varepsilon, \delta)$-differential privacy under $k$-fold adaptive composition instead requires that $D_\infty^{\delta}(V^0 \| V^1) \le \varepsilon$.

Theorem 3.20 (Advanced Composition). For all $\varepsilon, \delta, \delta' \ge 0$, the class of $(\varepsilon, \delta)$-differentially private mechanisms satisfies $(\varepsilon', k\delta + \delta')$-differential privacy under $k$-fold adaptive composition for:
\[
\varepsilon' = \sqrt{2k \ln(1/\delta')}\, \varepsilon + k\varepsilon(e^{\varepsilon} - 1) .
\]
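To get a feel for the formula in Theorem 3.20, it helps to evaluate it next to the linear bound $k\varepsilon$ of basic composition. The snippet below is a small calculator under assumed inputs ($\varepsilon = 0.01$, $k = 1000$, $\delta' = 10^{-6}$); the function names are chosen for this illustration.

```python
import math

def advanced_composition_epsilon(epsilon, k, delta_prime):
    """epsilon' from Theorem 3.20 for k-fold composition of epsilon-DP mechanisms."""
    return (math.sqrt(2.0 * k * math.log(1.0 / delta_prime)) * epsilon
            + k * epsilon * math.expm1(epsilon))   # expm1(e) = exp(e) - 1

def basic_composition_epsilon(epsilon, k):
    """Linear bound obtained by summing k copies of epsilon."""
    return k * epsilon

eps, k, delta_prime = 0.01, 1000, 1e-6              # illustrative values only
advanced = advanced_composition_epsilon(eps, k, delta_prime)
basic = basic_composition_epsilon(eps, k)
single = advanced_composition_epsilon(eps, 1, delta_prime)
```

With these inputs `advanced` is roughly 1.76 against a `basic` value of 10. For very small $k$ the comparison reverses (`single` exceeds $\varepsilon$), so in practice one would presumably take whichever of the two bounds is smaller.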

Proof. We first treat the case in which every mechanism is $(\varepsilon, 0)$-differentially private; in this part of the argument $\delta$ denotes the failure probability that is written $\delta'$ in the theorem statement. A view of the adversary $A$ consists of a tuple of the form $v = (r, y_1, \ldots, y_k)$, where $r$ is the coin tosses of $A$ and $y_1, \ldots, y_k$ are the outputs of the mechanisms $\mathcal{M}_1, \ldots, \mathcal{M}_k$. Let
\[
B = \{v : \Pr[V^0 = v] > e^{\varepsilon'} \cdot \Pr[V^1 = v]\} .
\]
We will show that $\Pr[V^0 \in B] \le \delta$, and hence for every set $S$, we have
\[
\Pr[V^0 \in S] \le \Pr[V^0 \in B] + \Pr[V^0 \in (S \setminus B)] \le \delta + e^{\varepsilon'} \cdot \Pr[V^1 \in S] .
\]
This is equivalent to saying that $D_\infty^{\delta}(V^0 \| V^1) \le \varepsilon'$.

It remains to show $\Pr[V^0 \in B] \le \delta$. Let random variable $V^0 = (R^0, Y_1^0, \ldots, Y_k^0)$ denote the view of $A$ in Experiment 0 and $V^1 = (R^1, Y_1^1, \ldots, Y_k^1)$ the view of $A$ in Experiment 1. Then for a fixed view $v = (r, y_1, \ldots, y_k)$, we have
\[
\begin{aligned}
\ln\left(\frac{\Pr[V^0 = v]}{\Pr[V^1 = v]}\right)
&= \ln\left(\frac{\Pr[R^0 = r]}{\Pr[R^1 = r]} \cdot \prod_{i=1}^{k} \frac{\Pr[Y_i^0 = y_i \mid R^0 = r, Y_1^0 = y_1, \ldots, Y_{i-1}^0 = y_{i-1}]}{\Pr[Y_i^1 = y_i \mid R^1 = r, Y_1^1 = y_1, \ldots, Y_{i-1}^1 = y_{i-1}]}\right) \\
&= \sum_{i=1}^{k} \ln\left(\frac{\Pr[Y_i^0 = y_i \mid R^0 = r, Y_1^0 = y_1, \ldots, Y_{i-1}^0 = y_{i-1}]}{\Pr[Y_i^1 = y_i \mid R^1 = r, Y_1^1 = y_1, \ldots, Y_{i-1}^1 = y_{i-1}]}\right) \\
&\overset{\text{def}}{=} \sum_{i=1}^{k} c_i(r, y_1, \ldots, y_i) ,
\end{aligned}
\]
where the second equality uses the fact that the coin tosses of $A$ have the same distribution in both experiments, so the first ratio equals 1.

Now for every prefix $(r, y_1, \ldots, y_{i-1})$ we condition on $R^0 = r, Y_1^0 = y_1, \ldots, Y_{i-1}^0 = y_{i-1}$, and analyze the expectation and maximum possible value of the random variable $c_i(R^0, Y_1^0, \ldots, Y_i^0) = c_i(r, y_1, \ldots, y_{i-1}, Y_i^0)$. Once the prefix is fixed, the next pair of databases $x_i^0$ and $x_i^1$, the mechanism $\mathcal{M}_i$, and parameter $w_i$ output by $A$ are also determined (in both Experiment 0 and 1). Thus $Y_i^0$ is distributed according to $\mathcal{M}_i(w_i, x_i^0)$. Moreover for any value $y_i$, we have
\[
c_i(r, y_1, \ldots, y_{i-1}, y_i) = \ln\left(\frac{\Pr[\mathcal{M}_i(w_i, x_i^0) = y_i]}{\Pr[\mathcal{M}_i(w_i, x_i^1) = y_i]}\right) .
\]

By $\varepsilon$-differential privacy this is bounded by $\varepsilon$. We can also reason as follows:
\[
\begin{aligned}
|c_i(r, y_1, \ldots, y_{i-1}, y_i)|
&\le \max\{D_\infty(\mathcal{M}_i(w_i, x_i^0) \| \mathcal{M}_i(w_i, x_i^1)),\ D_\infty(\mathcal{M}_i(w_i, x_i^1) \| \mathcal{M}_i(w_i, x_i^0))\} \\
&\le \varepsilon .
\end{aligned}
\]
By Lemma 3.18, we have:
\[
\begin{aligned}
\mathbb{E}[c_i(R^0, Y_1^0, \ldots, Y_i^0) \mid R^0 = r, Y_1^0 = y_1, \ldots, Y_{i-1}^0 = y_{i-1}]
&= D(\mathcal{M}_i(w_i, x_i^0) \| \mathcal{M}_i(w_i, x_i^1)) \\
&\le \varepsilon(e^{\varepsilon} - 1) .
\end{aligned}
\]
Thus we can apply Azuma's Inequality to the random variables $C_i = c_i(R^0, Y_1^0, \ldots, Y_i^0)$ with $\alpha = \varepsilon$, $\beta = \varepsilon(e^{\varepsilon} - 1)$, and $z = \sqrt{2\ln(1/\delta)}$, to deduce that
\[
\Pr[V^0 \in B] = \Pr\left[\sum_{i} C_i > \varepsilon'\right] < e^{-z^2/2} = \delta ,
\]
as desired.

To extend the proof to composition of $(\varepsilon, \delta)$-differentially private mechanisms, for $\delta > 0$, we use the characterization of approximate max-divergence from Lemma 3.17 (Part 2) to reduce the analysis to the same situation as in the case of $(\varepsilon, 0)$-indistinguishable sequences. Specifically, using Lemma 3.17, Part 2 for each of the differentially private mechanisms selected by the adversary $A$ and the triangle inequality for statistical distance, it follows that $V^0$ is $k\delta$-close to a random variable $W = (R, Z_1, \ldots, Z_k)$ such that for every prefix $r, y_1, \ldots, y_{i-1}$, if we condition on $R = R^1 = r, Z_1 = Y_1^1 = y_1, \ldots, Z_{i-1} = Y_{i-1}^1 = y_{i-1}$, then it holds that $D_\infty(Z_i \| Y_i^1) \le \varepsilon$ and $D_\infty(Y_i^1 \| Z_i) \le \varepsilon$.

This suffices to show that $D_\infty^{\delta'}(W \| V^1) \le \varepsilon'$. Since $V^0$ is $k\delta$-close to $W$, Lemma 3.17, Part 1 gives $D_\infty^{\delta' + k\delta}(V^0 \| V^1) \le \varepsilon'$.

An immediate and useful corollary tells us a safe choice of $\varepsilon$ for each of $k$ mechanisms if we wish to ensure $(\varepsilon', k\delta + \delta')$-differential privacy for a given $\varepsilon', \delta'$.

Corollary 3.21. Given target privacy parameters $0 < \varepsilon' < 1$ and $\delta' > 0$, to ensure $(\varepsilon', k\delta + \delta')$ cumulative privacy loss over $k$ mechanisms, it suffices that each mechanism is $(\varepsilon, \delta)$-differentially private, where
\[
\varepsilon = \frac{\varepsilon'}{2\sqrt{2k \ln(1/\delta')}} .
\]

Proof. Theorem 3.20 tells us the composition will be $(\varepsilon^*, k\delta + \delta')$ for all $\delta'$, where $\varepsilon^* = \sqrt{2k \ln(1/\delta')} \cdot \varepsilon + k\varepsilon^2$. When $\varepsilon' < 1$, we have that $\varepsilon^* \le \varepsilon'$ as desired.

Note that the above corollary gives a rough guide for how to set $\varepsilon$ to get desired privacy parameters under composition. When one cares about optimizing constants (which one does when dealing with actual implementations), $\varepsilon$ can be set more tightly by appealing directly to the composition theorem.

Example 3.7. Suppose, over the course of his lifetime, Bob is a member of $k = 10{,}000$ $(\varepsilon_0, 0)$-differentially private databases. Assuming no coordination among these databases (the administrator of any given database may not even be aware of the existence of the other databases), what should be the value of $\varepsilon_0$ so that, over the course of his lifetime, Bob's cumulative privacy loss is bounded by $\varepsilon = 1$ with probability at least $1 - e^{-32}$? Theorem 3.20 says that, taking $\delta' = e^{-32}$, it suffices to have $\varepsilon_0 \le 1/801$. This turns out to be essentially optimal against an arbitrary adversary, assuming no coordination among distinct differentially private databases.

So how many queries can we answer with non-trivial accuracy? On a database of size $n$ let us say the accuracy is non-trivial if the error is of order $o(n)$. Theorem 3.20 says that for fixed values of $\varepsilon$ and $\delta$, it is possible to answer close to $n^2$ counting queries with non-trivial accuracy. Similarly, one can answer close to $n$ queries while still having noise $o(\sqrt{n})$, that is, noise less than the sampling error.
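The per-mechanism budget of Corollary 3.21 and the numbers in Example 3.7 can be explored with a short calculation. The code below is a hedged sketch: the function names and the bisection search are additions made for this illustration, not part of the text. Evaluating the formula of Theorem 3.20 literally at $\varepsilon_0 = 1/801$ gives a value of roughly 1.014 rather than exactly 1, so the figure in the example is perhaps best read as a rounded approximation.

```python
import math

def advanced_composition_epsilon(epsilon, k, delta_prime):
    """epsilon' from Theorem 3.20 for k-fold composition of epsilon-DP mechanisms."""
    return (math.sqrt(2.0 * k * math.log(1.0 / delta_prime)) * epsilon
            + k * epsilon * math.expm1(epsilon))

def per_mechanism_epsilon(target_epsilon, k, delta_prime):
    """Corollary 3.21: a safe (not tight) per-mechanism epsilon."""
    return target_epsilon / (2.0 * math.sqrt(2.0 * k * math.log(1.0 / delta_prime)))

def largest_epsilon0(target_epsilon, k, delta_prime, iterations=100):
    """Bisection for the largest epsilon0 whose Theorem 3.20 bound stays within target."""
    lo, hi = 0.0, target_epsilon
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if advanced_composition_epsilon(mid, k, delta_prime) <= target_epsilon:
            lo = mid
        else:
            hi = mid
    return lo

# Example 3.7: k = 10,000 databases, delta' = e^{-32}, target epsilon = 1.
k, delta_prime, target = 10_000, math.exp(-32), 1.0
at_book_value = advanced_composition_epsilon(1.0 / 801.0, k, delta_prime)
corollary_eps = per_mechanism_epsilon(target, k, delta_prime)
tight_eps = largest_epsilon0(target, k, delta_prime)
```

Under these assumptions the corollary's choice comes out to $1/1600$, while the bisection suggests the formula itself allows a per-database value of about $1/812$.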
We will see that it is possible to dramatically improve on these results, handling, in some cases, even an exponential number of queries with noise only slightly larger than $\sqrt{n}$, by coordinating the noise added to the individual responses. It turns out that such coordination is essential: without

coordination the bound in the advanced composition theorem is almost tight.

3.5.3 Laplace versus Gauss

An alternative to adding Laplacian noise is to add Gaussian noise. In this case, rather than scaling the noise to the $\ell_1$ sensitivity $\Delta f$, we instead scale to the $\ell_2$ sensitivity:

Definition 3.8 ($\ell_2$-sensitivity). The $\ell_2$-sensitivity of a function $f : \mathbb{N}^{|\mathcal{X}|} \to \mathbb{R}^k$ is:

$$\Delta_2(f) = \max_{\substack{x, y \in \mathbb{N}^{|\mathcal{X}|} \\ \|x - y\|_1 = 1}} \|f(x) - f(y)\|_2.$$

The Gaussian Mechanism with parameter $\sigma$ adds zero-mean Gaussian noise with standard deviation $\sigma$ (variance $\sigma^2$) in each of the $k$ coordinates. The following theorem is proved in Appendix A.

Theorem 3.22. Let $\varepsilon \in (0, 1)$ be arbitrary. For $c^2 > 2\ln(1.25/\delta)$, the Gaussian Mechanism with parameter $\sigma \ge c\,\Delta_2(f)/\varepsilon$ is $(\varepsilon, \delta)$-differentially private.

Among the advantages to Gaussian noise is that the noise added for privacy is of the same type as other sources of noise; moreover, the sum of two Gaussians is a Gaussian, so the effects of the privacy mechanism on the statistical analysis may be easier to understand and correct for.

The two mechanisms yield the same cumulative loss under composition, so even though the privacy guarantee is weaker for each individual computation, the cumulative effects over many computations are comparable. Also, if $\delta$ is sufficiently (e.g., subpolynomially) small, in practice we will never experience the weakness of the guarantee.

That said, there is a theoretical disadvantage to Gaussian noise, relative to what we experience with Laplace noise. Consider Report Noisy Max (with Laplace noise) in a case in which every candidate output has the same quality score on database $x$ as on its neighbor $y$. Independent of the number of candidate outputs, the mechanism yields $(\varepsilon, 0)$-differential privacy. If instead we use Gaussian noise and report the max, and if the number of candidates is large compared to $1/\delta$,

then we will exactly select for the events with large Gaussian noise, that is, noise that occurs with probability less than $\delta$. When we are this far out on the tail of the Gaussian we no longer have a guarantee that the observation is within an $e^{\pm\varepsilon}$ factor as likely to occur on $x$ as on $y$.

3.5.4 Remarks on composition

The ability to analyze cumulative privacy loss under composition gives us a handle on what a world of differentially private databases can offer. A few observations are in order.

Weak Quantification. Assume that the adversary always chooses $x_i^0$ to hold Bob's data, and $x_i^1$ to be the same database but with Bob's data deleted. Theorem 3.20, with appropriate choice of parameters, tells us that an adversary (including one that knows or even selects(!) the database pairs) has little advantage in determining the value of $b \in \{0, 1\}$. This is an inherently weak quantification. We can ensure that the adversary is unlikely to distinguish reality from any given alternative, but we cannot ensure this simultaneously for all alternatives. If there are one zillion databases but Bob is a member of only 10,000 of these, then we are not simultaneously protecting Bob's absence from all zillion minus ten thousand. This is analogous to the quantification in the definition of $(\varepsilon, \delta)$-differential privacy, where we fix in advance a pair of adjacent databases and argue that with high probability the output will be almost equally likely with these two databases.

Humans and Ghosts. Intuitively, an $(\varepsilon, 0)$-differentially private database with a small number of bits per record is less protective than a differentially private database with the same choice of $\varepsilon$ that contains our entire medical histories. So in what sense is our principal privacy measure, $\varepsilon$, telling us the same thing about databases that differ radically in the complexity and sensitivity of the data they store? The answer lies in the composition theorems.
Imagine a world inhabited by two types of beings: ghosts and humans. Both types of beings behave the same, interact with others in the same way, write, study, work, laugh, love, cry, reproduce, become ill, recover, and age in the same fashion. The only difference is that ghosts have no records in

databases, while humans do. The goal of the privacy adversary is to determine whether a given 50-year old, the "target," is a ghost or a human. Indeed, the adversary is given all 50 years to do so. The adversary does not need to remain passive: for example, she can organize clinical trials and enroll patients of her choice; she can create humans to populate databases, effectively creating the worst-case (for privacy) databases; she can expose the target to chemicals at age 25 and again at 35; and so on. She can know everything about the target that could possibly be entered into any database. She can know which databases the target would be in, were the target human. The composition theorems tell us that the privacy guarantees of each database, regardless of the data type, complexity, and sensitivity, give comparable protection for the human/ghost bit.

3.6 The sparse vector technique

The Laplace mechanism can be used to answer adaptively chosen low sensitivity queries, and we know from our composition theorems that the privacy parameter degrades proportionally to the number of queries answered (or its square root). Unfortunately, it will often happen that we have a very large number of questions to answer, too many to yield a reasonable privacy guarantee using independent perturbation techniques, even with the advanced composition theorems of Section 3.5. In some situations however, we will only care to know the identity of the queries that lie above a certain threshold. In this case, we can hope to gain over the naïve analysis by discarding the numeric answer to queries that lie significantly below the threshold, and merely reporting that they do indeed lie below the threshold. (We will be able to get the numeric values of the above-threshold queries as well, at little additional cost, if we so choose.)
This is similar to what we did in the Report Noisy Max mechanism in Section 3.3, and indeed iterating either that algorithm or the exponential mechanism would be an option for the non-interactive, or offline, case. In this section, we show how to analyze a method for this in the online setting. The technique is simple: add noise and report only

whether the noisy value exceeds the threshold. Our emphasis is on the analysis, showing that privacy degrades only with the number of queries which actually lie above the threshold, rather than with the total number of queries. This can be a huge savings if we know that the set of queries that lie above the threshold is much smaller than the total number of queries, that is, if the answer vector is sparse.

In a little more detail, we will consider a sequence of events, one for each query, which occur if a query evaluated on the database exceeds a given (known, public) threshold. Our goal will be to release a bit vector indicating, for each event, whether or not it has occurred. As each query is presented, the mechanism will compute a noisy response, compare it to the (publicly known) threshold, and, if the threshold is exceeded, reveal this fact. For technical reasons in the proof of privacy (Theorem 3.23), the algorithm works with a noisy version $\hat{T}$ of the threshold $T$. While $T$ is public, the noisy version $\hat{T}$ is not.

Rather than incurring a privacy loss for each possible query, the analysis below will result in a privacy cost only for the query values that are near or above the threshold.

The Setting. Let $m$ denote the total number of sensitivity 1 queries, which may be chosen adaptively. Without loss of generality, there is a single threshold $T$ fixed in advance (alternatively, each query can have its own threshold, but the results are unchanged). We will be adding noise to query values and comparing the results to $T$. A positive outcome means that a noisy query value exceeds the threshold. We expect a small number $c$ of noisy values to exceed the threshold, and we are releasing only the noisy values above the threshold. The algorithm will use $c$ in its stopping condition.

We will first analyze the case in which the algorithm halts after $c = 1$ above-threshold query, and show that this algorithm is $\varepsilon$-differentially private no matter how long the total sequence of queries is. We will then analyze the case of $c > 1$ by using our composition theorems, and derive bounds both for $(\varepsilon, 0)$ and $(\varepsilon, \delta)$-differential privacy.

We first argue that AboveThreshold, the algorithm specialized to the case of only one above-threshold query, is private and accurate.
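As a concrete companion to AboveThreshold (Algorithm 1), the following Python sketch follows the same steps: perturb the threshold once with $\mathrm{Lap}(2/\varepsilon)$, perturb each query answer with fresh $\mathrm{Lap}(4/\varepsilon)$ noise, and halt at the first noisy answer that reaches the noisy threshold. The names `laplace` and `above_threshold`, the use of `True`/`False` for $\top$/$\bot$, and the representation of queries as Python callables are assumptions made for illustration. Floating-point Laplace sampling is also known to weaken differential privacy guarantees in practice, so this is a sketch of the logic rather than a hardened implementation.

```python
import random
from typing import Any, Callable, Iterable, Iterator, Optional


def laplace(scale: float, rng: random.Random) -> float:
    """Sample Lap(scale) as the scaled difference of two Exp(1) variables."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def above_threshold(
    database: Any,
    queries: Iterable[Callable[[Any], float]],
    threshold: float,
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> Iterator[bool]:
    """Yield False (bottom) for each below-threshold query, then True (top) once and halt.

    Queries are assumed to have sensitivity 1, as in the text.
    """
    rng = rng or random.Random()
    noisy_threshold = threshold + laplace(2.0 / epsilon, rng)  # drawn once
    for query in queries:
        nu = laplace(4.0 / epsilon, rng)  # fresh noise for every query
        if query(database) + nu >= noisy_threshold:
            yield True
            return  # halt after the first above-threshold answer
        yield False


# Hypothetical usage: a database of bits and three counting queries.
database = [1, 0, 1, 1, 0, 1]
queries = [
    lambda d: sum(d[:2]),   # count of ones among the first two records
    lambda d: sum(d[2:4]),  # count among the next two
    lambda d: sum(d),       # total count
]
answers = list(above_threshold(database, queries, threshold=3.0, epsilon=1.0))
print(answers)
```

Because the queries are consumed lazily, the stream may be generated adaptively: the caller can decide on the next query after seeing each answer.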

Algorithm 1 Input is a private database $D$, an adaptively chosen stream of sensitivity 1 queries $f_1, \dots$, and a threshold $T$. Output is a stream of responses $a_1, \dots$

AboveThreshold($D$, $\{f_i\}$, $T$, $\varepsilon$)
  Let $\hat{T} = T + \mathrm{Lap}(2/\varepsilon)$.
  for each query $i$ do
    Let $\nu_i = \mathrm{Lap}(4/\varepsilon)$
    if $f_i(D) + \nu_i \ge \hat{T}$ then
      Output $a_i = \top$.
      Halt.
    else
      Output $a_i = \bot$.
    end if
  end for

Theorem 3.23. AboveThreshold is $(\varepsilon, 0)$-differentially private.

Proof. Fix any two neighboring databases $D$ and $D'$. Let $A$ denote the random variable representing the output of AboveThreshold($D$, $\{f_i\}$, $T$, $\varepsilon$) and let $A'$ denote the random variable representing the output of AboveThreshold($D'$, $\{f_i\}$, $T$, $\varepsilon$). The output of the algorithm is some realization of these random variables, $a \in \{\top, \bot\}^k$, and has the form that for all $i < k$, $a_i = \bot$ and $a_k = \top$. There are two types of random variables internal to the algorithm: the noisy threshold $\hat{T}$ and the perturbations to each of the $k$ queries, $\{\nu_i\}_{i=1}^k$. For the following analysis, we will fix the (arbitrary) values of $\nu_1, \dots, \nu_{k-1}$ and take probabilities over the randomness of $\nu_k$ and $\hat{T}$. Define the following quantity representing the maximum noisy value of any query $f_1, \dots, f_{k-1}$ evaluated on $D$:

$$g(D) = \max_{i < k} \big(f_i(D) + \nu_i\big)$$

In what follows, we abuse notation and write $\Pr[\hat{T} = t]$ as shorthand for the pdf of $\hat{T}$ evaluated at $t$ (similarly for $\nu_k$), and write $\mathbf{1}[x]$ for the indicator function of the event $x$. Fixing the values of $\nu_1, \dots, \nu_{k-1}$ (which makes $g(D)$ a deterministic quantity), we have:

$$\begin{aligned}
\Pr_{\hat{T}, \nu_k}[A = a] &= \Pr_{\hat{T}, \nu_k}[\hat{T} > g(D) \text{ and } f_k(D) + \nu_k \ge \hat{T}] \\
&= \Pr_{\hat{T}, \nu_k}[\hat{T} \in (g(D), f_k(D) + \nu_k]] \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \Pr[\nu_k = v] \cdot \Pr[\hat{T} = t]\, \mathbf{1}[t \in (g(D), f_k(D) + v]]\, dv\, dt \\
&=: *
\end{aligned}$$

We now make a change of variables. Define:

$$\hat{v} = v + g(D) - g(D') + f_k(D') - f_k(D)$$
$$\hat{t} = t + g(D) - g(D')$$

and note that for any $D, D'$, $|\hat{v} - v| \le 2$ and $|\hat{t} - t| \le 1$. This follows because each query $f_i(D)$ is 1-sensitive, and hence the quantity $g(D)$ is 1-sensitive as well. Applying this change of variables, we have:

$$\begin{aligned}
* &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \Pr[\nu_k = \hat{v}] \cdot \Pr[\hat{T} = \hat{t}]\, \mathbf{1}[(t + g(D) - g(D')) \in (g(D), f_k(D') + v + g(D) - g(D')]]\, dv\, dt \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \Pr[\nu_k = \hat{v}] \cdot \Pr[\hat{T} = \hat{t}]\, \mathbf{1}[t \in (g(D'), f_k(D') + v]]\, dv\, dt \\
&\le \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp(\varepsilon/2) \Pr[\nu_k = v] \cdot \exp(\varepsilon/2) \Pr[\hat{T} = t]\, \mathbf{1}[t \in (g(D'), f_k(D') + v]]\, dv\, dt \\
&= \exp(\varepsilon) \Pr_{\hat{T}, \nu_k}[\hat{T} > g(D') \text{ and } f_k(D') + \nu_k \ge \hat{T}] \\
&= \exp(\varepsilon) \Pr_{\hat{T}, \nu_k}[A' = a]
\end{aligned}$$

where the inequality comes from our bounds on $|\hat{v} - v|$ and $|\hat{t} - t|$ and the form of the pdf of the Laplace distribution.

Definition 3.9 (Accuracy). We will say that an algorithm which outputs a stream of answers $a_1, \dots \in \{\top, \bot\}^*$ in response to a stream of

$k$ queries $f_1, \dots, f_k$ is $(\alpha, \beta)$-accurate with respect to a threshold $T$ if except with probability at most $\beta$, the algorithm does not halt before $f_k$, and for all $a_i = \top$:

$$f_i(D) \ge T - \alpha$$

and for all $a_i = \bot$:

$$f_i(D) \le T + \alpha.$$

What can go wrong in Algorithm 1? The noisy threshold $\hat{T}$ can be very far from $T$, say, $|\hat{T} - T| > \alpha$. In addition a small count $f_i(D) < T - \alpha$ can have so much noise added to it that it is reported as above threshold (even when the threshold is close to correct), and a large count $f_i(D) > T + \alpha$ can be reported as below threshold. All of these happen with probability exponentially small in $\alpha$. In summary, we can have a problem with the choice of the noisy threshold or we can have a problem with one or more of the individual noise values $\nu_i$. Of course, we could have both kinds of errors, so in the analysis below we allocate $\alpha/2$ to each type.

Theorem 3.24. For any sequence of $k$ queries $f_1, \dots, f_k$ such that $|\{i < k : f_i(D) \ge T - \alpha\}| = 0$ (i.e. the only query close to being above threshold is possibly the last one), AboveThreshold($D$, $\{f_i\}$, $T$, $\varepsilon$) is $(\alpha, \beta)$ accurate for:

$$\alpha = \frac{8(\log k + \log(2/\beta))}{\varepsilon}.$$

Proof. Observe that the theorem will be proved if we can show that except with probability at most $\beta$:

$$\max_{i \in [k]} |\nu_i| + |T - \hat{T}| \le \alpha$$

If this is the case, then for any $a_i = \top$, we have:

$$f_i(D) + \nu_i \ge \hat{T} \ge T - |T - \hat{T}|$$

or in other words:

$$f_i(D) \ge T - |T - \hat{T}| - |\nu_i| \ge T - \alpha$$

Similarly, for any $a_i = \bot$ we have:

$$f_i(D) < \hat{T} - \nu_i \le T + |T - \hat{T}| + |\nu_i| \le T + \alpha$$

We will also have that for any $i < k$: $f_i(D) < T - \alpha \le T - |\nu_i| - |T - \hat{T}|$, and so: $f_i(D) + \nu_i < \hat{T}$, meaning $a_i = \bot$. Therefore the algorithm does not halt before $k$ queries are answered.

We now complete the proof. Recall that if $Y \sim \mathrm{Lap}(b)$, then: $\Pr[|Y| \ge t \cdot b] = \exp(-t)$. Therefore we have:

$$\Pr\Big[|T - \hat{T}| \ge \frac{\alpha}{2}\Big] = \exp\Big(-\frac{\varepsilon\alpha}{4}\Big)$$

Setting this quantity to be at most $\beta/2$, we find that we require $\alpha \ge \frac{4\log(2/\beta)}{\varepsilon}$. Similarly, by a union bound, we have:

$$\Pr\Big[\max_{i \in [k]} |\nu_i| \ge \alpha/2\Big] \le k \cdot \exp\Big(-\frac{\varepsilon\alpha}{8}\Big)$$

Setting this quantity to be at most $\beta/2$, we find that we require $\alpha \ge \frac{8(\log(2/\beta) + \log k)}{\varepsilon}$. These two claims combine to prove the theorem.

We now show how to handle multiple "above threshold" queries using composition.

The Sparse algorithm can be thought of as follows: As queries come in, it makes repeated calls to AboveThreshold. Each time an above threshold query is reported, the algorithm simply restarts the remaining stream of queries on a new instantiation of AboveThreshold. It halts after it has restarted AboveThreshold $c$ times (i.e. after $c$ above threshold queries have appeared). Each instantiation of AboveThreshold is $(\varepsilon, 0)$-private, and so the composition theorems apply.

Theorem 3.25. Sparse is $(\varepsilon, \delta)$-differentially private.

Proof. We observe that Sparse is exactly equivalent to the following procedure: We run AboveThreshold($D$, $\{f_i\}$, $T$, $\varepsilon'$) on our stream of queries $\{f_i\}$ setting

$$\varepsilon' = \begin{cases} \dfrac{\varepsilon}{c}, & \text{if } \delta = 0; \\[2mm] \dfrac{\varepsilon}{\sqrt{8c\ln\frac{1}{\delta}}}, & \text{otherwise,} \end{cases}$$

using the answers supplied by AboveThreshold. When AboveThreshold halts (after 1 above threshold query), we simply restart AboveThreshold($D$, $\{f_i\}$, $T$, $\varepsilon'$) on the remaining stream, and continue in this manner until we have restarted AboveThreshold $c$ times. After the $c$'th restart of AboveThreshold halts, we halt as well. We have already proven that AboveThreshold($D$, $\{f_i\}$, $T$, $\varepsilon'$) is $(\varepsilon', 0)$-differentially private. Finally, by the advanced composition theorem (Theorem 3.20), $c$ applications of an $\varepsilon'$-differentially private algorithm with $\varepsilon' = \frac{\varepsilon}{\sqrt{8c\ln\frac{1}{\delta}}}$ are $(\varepsilon, \delta)$-differentially private, and $c$ applications of an $\varepsilon'$-differentially private algorithm with $\varepsilon' = \varepsilon/c$ are $(\varepsilon, 0)$-private, as desired.

Algorithm 2 Input is a private database $D$, an adaptively chosen stream of sensitivity 1 queries $f_1, \dots$, a threshold $T$, and a cutoff point $c$. Output is a stream of answers $a_1, \dots$

Sparse($D$, $\{f_i\}$, $T$, $c$, $\varepsilon$, $\delta$)
  If $\delta = 0$ Let $\sigma = \frac{2c}{\varepsilon}$. Else Let $\sigma = \frac{\sqrt{32c\ln\frac{1}{\delta}}}{\varepsilon}$
  Let $\hat{T}_0 = T + \mathrm{Lap}(\sigma)$
  Let count $= 0$
  for each query $i$ do
    Let $\nu_i = \mathrm{Lap}(2\sigma)$
    if $f_i(D) + \nu_i \ge \hat{T}_{\text{count}}$ then
      Output $a_i = \top$.
      Let count $=$ count $+ 1$.
      Let $\hat{T}_{\text{count}} = T + \mathrm{Lap}(\sigma)$
    else
      Output $a_i = \bot$.
    end if
    if count $\ge c$ then
      Halt.
    end if
  end for

It remains to prove accuracy for Sparse, by again observing that Sparse consists only of $c$ calls to AboveThreshold. We note that if each

of these calls to AboveThreshold is $(\alpha, \beta/c)$-accurate, then Sparse will be $(\alpha, \beta)$-accurate.

Theorem 3.26. For any sequence of $k$ queries $f_1, \dots, f_k$ such that $L(T) \equiv |\{i : f_i(D) \ge T - \alpha\}| \le c$, if $\delta > 0$, Sparse is $(\alpha, \beta)$ accurate for:

$$\alpha = \frac{\big(\ln k + \ln\frac{2c}{\beta}\big)\sqrt{512\, c \ln\frac{1}{\delta}}}{\varepsilon}.$$

If $\delta = 0$, Sparse is $(\alpha, \beta)$ accurate for:

$$\alpha = \frac{8c(\ln k + \ln(2c/\beta))}{\varepsilon}$$

Proof. We simply apply Theorem 3.24 setting $\beta$ to be $\beta/c$, and $\varepsilon$ to be $\frac{\varepsilon}{\sqrt{8c\ln\frac{1}{\delta}}}$ and $\varepsilon/c$, depending on whether $\delta > 0$ or $\delta = 0$, respectively.

Finally, we give a version of Sparse that actually outputs the numeric values of the above threshold queries, which we can do with only a constant factor loss in accuracy. We call this algorithm NumericSparse, and it is simply a composition of Sparse with the Laplace mechanism. Rather than outputting a vector $a \in \{\top, \bot\}^*$, it outputs a vector $a \in (\mathbb{R} \cup \{\bot\})^*$.

We observe that NumericSparse is private:

Theorem 3.27. NumericSparse is $(\varepsilon, \delta)$-differentially private.

Proof. Observe that if $\delta = 0$, NumericSparse($D$, $\{f_i\}$, $T$, $c$, $\varepsilon$, $0$) is simply the adaptive composition of Sparse($D$, $\{f_i\}$, $T$, $c$, $\frac{8}{9}\varepsilon$, $0$), together with the Laplace mechanism with privacy parameters $(\varepsilon', \delta) = (\frac{1}{9}\varepsilon, 0)$. If $\delta > 0$, then NumericSparse($D$, $\{f_i\}$, $T$, $c$, $\varepsilon$, $\delta$) is the composition of Sparse($D$, $\{f_i\}$, $T$, $c$, $\frac{\sqrt{512}}{\sqrt{512}+1}\varepsilon$, $\delta/2$) together with the Laplace mechanism with privacy parameters $(\varepsilon', \delta) = (\frac{1}{\sqrt{512}+1}\varepsilon, \delta/2)$. Hence the privacy of NumericSparse follows from simple composition.

To discuss accuracy, we must define what we mean by the accuracy of a mechanism that outputs a stream $a \in (\mathbb{R} \cup \{\bot\})^*$ in response to a sequence of numeric valued queries:

Definition 3.10 (Numeric Accuracy). We will say that an algorithm which outputs a stream of answers $a_1, \dots \in (\mathbb{R} \cup \{\bot\})^*$ in response to a stream of $k$ queries $f_1, \dots, f_k$ is $(\alpha, \beta)$-accurate with respect to a threshold $T$ if except with probability at most $\beta$, the algorithm does not halt before $f_k$, and for all $a_i \in \mathbb{R}$:

$$|f_i(D) - a_i| \le \alpha$$

and for all $a_i = \bot$:

$$f_i(D) \le T + \alpha.$$

Algorithm 3 Input is a private database $D$, an adaptively chosen stream of sensitivity 1 queries $f_1, \dots$, a threshold $T$, and a cutoff point $c$. Output is a stream of answers $a_1, \dots$

NumericSparse($D$, $\{f_i\}$, $T$, $c$, $\varepsilon$, $\delta$)
  If $\delta = 0$ Let $\varepsilon_1 \leftarrow \frac{8}{9}\varepsilon$, $\varepsilon_2 \leftarrow \frac{2}{9}\varepsilon$. Else Let $\varepsilon_1 = \frac{\sqrt{512}}{\sqrt{512}+1}\varepsilon$, $\varepsilon_2 = \frac{2}{\sqrt{512}+1}\varepsilon$
  If $\delta = 0$ Let $\sigma(\varepsilon) = \frac{2c}{\varepsilon}$. Else Let $\sigma(\varepsilon) = \frac{\sqrt{32c\ln\frac{2}{\delta}}}{\varepsilon}$
  Let $\hat{T}_0 = T + \mathrm{Lap}(\sigma(\varepsilon_1))$
  Let count $= 0$
  for each query $i$ do
    Let $\nu_i = \mathrm{Lap}(2\sigma(\varepsilon_1))$
    if $f_i(D) + \nu_i \ge \hat{T}_{\text{count}}$ then
      Let $\upsilon_i \leftarrow \mathrm{Lap}(\sigma(\varepsilon_2))$
      Output $a_i = f_i(D) + \upsilon_i$.
      Let count $=$ count $+ 1$.
      Let $\hat{T}_{\text{count}} = T + \mathrm{Lap}(\sigma(\varepsilon_1))$
    else
      Output $a_i = \bot$.
    end if
    if count $\ge c$ then
      Halt.
    end if
  end for

Theorem 3.28. For any sequence of $k$ queries $f_1, \dots, f_k$ such that $L(T) \equiv |\{i : f_i(D) \ge T - \alpha\}| \le c$, if $\delta > 0$, NumericSparse is $(\alpha, \beta)$

accurate for:

$$\alpha = \frac{\big(\ln k + \ln\frac{4c}{\beta}\big)\sqrt{c\ln\frac{2}{\delta}}\,\big(\sqrt{512} + 1\big)}{\varepsilon}.$$

If $\delta = 0$, NumericSparse is $(\alpha, \beta)$ accurate for:

$$\alpha = \frac{9c(\ln k + \ln(4c/\beta))}{\varepsilon}$$

Proof. Accuracy requires two conditions: first, that for all $a_i = \bot$: $f_i(D) \le T + \alpha$. This holds with probability $1 - \beta/2$ by the accuracy theorem for Sparse. Next, for all $a_i \in \mathbb{R}$, it requires $|f_i(D) - a_i| \le \alpha$. This holds with probability $1 - \beta/2$ by the accuracy of the Laplace mechanism.

What did we show in the end? If we are given a sequence of queries together with a guarantee that at most $c$ of them have answers above $T - \alpha$, we can answer those queries that are above a given threshold $T$, up to error $\alpha$. This accuracy is equal, up to constants and a factor of $\log k$, to the accuracy we would get, given the same privacy guarantee, if we knew the identities of these large above-threshold queries ahead of time, and answered them with the Laplace mechanism. That is, the sparse vector technique allowed us to fish out the identities of these large queries almost "for free", paying only logarithmically for the irrelevant queries. This is the same guarantee that we could have gotten by trying to find the large queries with the exponential mechanism and then answering them with the Laplace mechanism. This algorithm, however, is trivial to run, and crucially, allows us to choose our queries adaptively.

3.7 Bibliographic notes

Randomized Response is due to Warner [84] (predating differential privacy by four decades!). The Laplace mechanism is due to Dwork et al. [23]. The exponential mechanism was invented by McSherry and Talwar [60]. Theorem 3.16 (simple composition) was claimed in [21]; the proof appearing in Appendix B is due to Dwork and Lei [22];

McSherry and Mironov obtained a similar proof. The material in Sections 3.5.1 and 3.5.2 is taken almost verbatim from Dwork et al. [32]. Prior to [32] composition was modeled informally, much as we did for the simple composition bounds. For specific mechanisms applied on a single database, there are "evolution of confidence" arguments due to Dinur, Dwork, and Nissim [18, 31] (which pre-date the definition of differential privacy) showing that the privacy parameter in $k$-fold composition need only deteriorate like $\sqrt{k}$ if we are willing to tolerate a (negligible) loss in $\delta$ (for $k < 1/\varepsilon^2$). Theorem 3.20 generalizes those arguments to arbitrary differentially private mechanisms.

The claim that without coordination in the noise the bounds in the composition theorems are almost tight is due to Dwork, Naor, and Vadhan [29]. The sparse vector technique is an abstraction of a technique that was introduced by Dwork, Naor, Reingold, Rothblum, and Vadhan [28] (indicator vectors in the proof of Lemma 4.4). It has subsequently found wide use (e.g. by Roth and Roughgarden [74], Dwork, Naor, Pitassi, and Rothblum [26], and Hardt and Rothblum [44]). In our presentation of the technique, the proof of Theorem 3.23 is due to Salil Vadhan.

4 Releasing Linear Queries with Correlated Error

One of the most fundamental primitives in private data analysis is the ability to answer numeric valued queries on a dataset. In the last section, we began to see tools that would allow us to do this by adding independently drawn noise to the query answers. In this section, we continue this study, and see that by instead adding carefully correlated noise, we can gain the ability to privately answer vastly more queries to high accuracy. Here, we see two specific mechanisms for solving this problem, which we will generalize in the next section.

In this section, we consider algorithms for solving the query release problem with better accuracy than we would get by simply using compositions of the Laplace mechanism. The improvements are possible because the set of queries is handled as a whole (even in the online setting!), permitting the noise on individual queries to be correlated. To immediately see that something along these lines might be possible, consider the pair of queries in the differencing attack described in Section 1: "How many people in the database have the sickle cell trait?" and "How many people, not named X, in the database have the sickle cell trait?" Suppose a mechanism answers the first question using the Laplace mechanism and then, when the second question is posed,

responds "You already know the approximate answer, because you just asked me almost the exact same question." This coordinated response to the pair of questions incurs no more privacy loss than either question would do taken in isolation, so a (small) privacy savings has been achieved.

The query release problem is quite natural: given a class of queries $\mathcal{Q}$ over the database, we wish to release some answer $a_i$ for each query $f_i \in \mathcal{Q}$ such that the error $\max_i |a_i - f_i(x)|$ is as low as possible, while still preserving differential privacy.¹ Recall that for any family of low sensitivity queries, we can apply the Laplace mechanism, which adds fresh, independent noise to the answer to each query. Unfortunately, at a fixed privacy level, for $(\varepsilon, 0)$-privacy guarantees, the magnitude of the noise that we must add with the Laplace mechanism scales with $|\mathcal{Q}|$ because this is the rate at which the sensitivity of the combined queries may grow. Similarly, for $(\varepsilon, \delta)$-privacy guarantees, the noise scales with $\sqrt{|\mathcal{Q}|\ln(1/\delta)}$. For example, suppose that our class of queries $\mathcal{Q}$ consists only of many copies of the same query: $f_i = f^*$ for all $i$. If we use the Laplace mechanism to release the answers, it will add independent noise, and so each $a_i$ will be an independent random variable with mean $f^*(x)$. Clearly, in this regime, the noise rate must grow with $|\mathcal{Q}|$ since otherwise the average of the $a_i$ will converge to the true value $f^*(x)$, which would be a privacy violation. However, in this case, because $f_i = f^*$ for all $i$, it would make more sense to approximate $f^*$ only once with $a^* \approx f^*(x)$ and release $a_i = a^*$ for all $i$. In this case, the noise rate would not have to scale with $|\mathcal{Q}|$ at all. In this section, we aim to design algorithms that are much more accurate than the Laplace mechanism (with error that scales with $\log|\mathcal{Q}|$) by adding non-independent noise as a function of the set of queries.

¹ It is the privacy constraint that makes the problem interesting. Without this constraint, the query release problem is trivially and optimally solved by just outputting exact answers for every query.

Recall that our universe is $\mathcal{X} = \{\chi_1, \chi_2, \dots, \chi_{|\mathcal{X}|}\}$ and that databases are represented by histograms in $\mathbb{N}^{|\mathcal{X}|}$. A linear query is simply a counting query, but generalized to take values in the interval $[0, 1]$ rather than only boolean values. Specifically, a linear query $f$ takes the

form $f : \mathcal{X} \to [0, 1]$, and applied to a database $x$ returns either the average or sum value of the query on the database (we will think of both, depending on which is more convenient for the analysis). When we think of linear queries as returning average values, we will refer to them as normalized linear queries, and say that they take value:

$$f(x) = \frac{1}{\|x\|_1} \sum_{i=1}^{|\mathcal{X}|} x_i \cdot f(\chi_i).$$

When we think of linear queries as returning sum values we will refer to them as un-normalized linear queries, and say that they take value:

$$f(x) = \sum_{i=1}^{|\mathcal{X}|} x_i \cdot f(\chi_i).$$

Whenever we state a bound, it should be clear from context whether we are speaking of normalized or un-normalized queries, because they take values in very different ranges. Note that normalized linear queries take values in $[0, 1]$, whereas un-normalized queries take values in $[0, \|x\|_1]$.

Note that with this definition linear queries have sensitivity $\Delta f \le 1$. Later sections will discuss arbitrary low-sensitivity queries.

We will present two techniques, one each for the offline and online cases. Surprisingly, and wonderfully, the offline technique is an immediate application of the exponential mechanism using well-known sampling bounds from learning theory! The algorithm will simply be to apply the exponential mechanism with range equal to the set of all small databases $y$ and quality function $u(x, y)$ equal to minus the maximum approximation error incurred by querying $y$ to obtain an approximation for $f(x)$:

$$u(x, y) = -\max_{f \in \mathcal{Q}} |f(x) - f(y)|. \qquad (4.1)$$

Sampling bounds (see Lemma 4.3 below) tell us that a random subset of $\ln|\mathcal{Q}|/\alpha^2$ elements of $x$ will very likely give us a good approximation for all $f(x)$ (specifically, with additive error bounded by $\alpha$), so we know it is sufficient to restrict the set of possible outputs to small databases.
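The sampling claim can be explored empirically. The sketch below is illustrative only: it represents a database as a list of records rather than a histogram, draws $m = \lceil \ln|\mathcal{Q}|/\alpha^2 \rceil$ records uniformly with replacement, and reports the worst error over a small query class. The helper names and the toy data are assumptions, and `failure_bound` evaluates the union-bound expression $2|\mathcal{Q}|e^{-2m\alpha^2}$ that appears in the proof of Lemma 4.3.

```python
import math
import random


def normalized_answer(query, records):
    """Average value of a [0, 1]-valued query over the records (a normalized linear query)."""
    return sum(query(r) for r in records) / len(records)


def subsample(records, num_queries, alpha, rng):
    """Draw m = ceil(ln|Q| / alpha^2) records uniformly at random, with replacement."""
    m = math.ceil(math.log(num_queries) / alpha ** 2)
    return [rng.choice(records) for _ in range(m)]


def failure_bound(num_queries, m, alpha):
    """Union bound 2|Q| exp(-2 m alpha^2) on the chance that some query errs by more than alpha."""
    return 2 * num_queries * math.exp(-2 * m * alpha ** 2)


# Hypothetical data: 10,000 records with 8 boolean attributes each.
rng = random.Random(0)
records = [tuple(rng.randint(0, 1) for _ in range(8)) for _ in range(10_000)]
# One query per attribute: "what fraction of records have attribute j set?"
queries = [lambda r, j=j: r[j] for j in range(8)]

alpha = 0.1
small = subsample(records, len(queries), alpha, rng)
worst_error = max(
    abs(normalized_answer(q, records) - normalized_answer(q, small)) for q in queries
)
print(f"sample size {len(small)}, worst error {worst_error:.3f}, "
      f"failure bound {failure_bound(len(queries), len(small), alpha):.3f}")
```

Note that the sample size depends only on $|\mathcal{Q}|$ and $\alpha$, not on the 10,000 records in the original database.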
We don’t actually care that the potential output databases are small, only that they are not too numerous: their number plays a role in the proof of

utility, which is an immediate application of the utility theorem for the exponential mechanism (Theorem 3.11). More specifically, if the total number of potential outputs is not too numerous then, in particular, the total number of low-utility outputs is not too numerous, and therefore the ratio of bad outputs to good outputs (there is at least one) is not too large.

The online mechanism, despite not knowing the entire set of queries in advance, will achieve the same accuracy as the offline mechanism, and will be a direct application of the sparse vector technique. As a result, privacy will be immediate, but utility will require a proof. The key will be to argue that, even for a very large set of counting queries, few queries are "significant"; that is, significant queries will be sparse. As with the sparse vector algorithms, we can scale noise according to the number of significant queries, with little dependence on the total number of queries.

Before we go on and present the mechanisms, we will give just one example of a useful class of linear queries.

Example 4.1. Suppose that elements of the database are represented by $d$ boolean features. For example, the first feature may represent whether the individual is male or female, the second feature may represent whether or not they are a college graduate, the third feature may represent whether or not they are US citizens, etc. That is, our data universe is $\mathcal{X} = \{0, 1\}^d$. Given a subset of these attributes $S \subseteq \{1, \dots, d\}$, we might like to know how many people in the dataset have these attributes (e.g., "What fraction of the dataset consists of male college graduates with a family history of lung cancer?"). This naturally defines a query called a monotone conjunction query, parameterized by a subset of attributes $S$ and defined as $f_S(z) = \prod_{i \in S} z_i$, for $z \in \mathcal{X}$. The class of all such queries is simply $\mathcal{Q} = \{f_S : S \subseteq \{1, \dots, d\}\}$, and has size $|\mathcal{Q}| = 2^d$. A collection of answers to conjunctions is sometimes called a contingency or marginal table, and is a common method of releasing statistical information about a dataset. Oftentimes, we may not be interested in the answers to all conjunctions, but rather just those that ask about subsets of features $S$ of size $|S| = k$ for some fixed $k$. This class of queries $\mathcal{Q}_k = \{f_S : S \subseteq \{1, \dots, d\}, |S| = k\}$ has size $\binom{d}{k}$.
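To make the query class concrete, here is a small Python sketch of monotone conjunction queries over records in $\{0,1\}^d$. The function names are hypothetical, attribute indices are 0-based rather than the 1-based indices used in the text, and the four-record dataset is invented purely for illustration.

```python
from itertools import combinations
from math import comb


def conjunction_query(attrs):
    """Return f_S, where f_S(z) is the product of z_i over i in S (1 iff all are set)."""
    def f(z):
        return int(all(z[i] == 1 for i in attrs))
    return f


def k_way_conjunctions(d, k):
    """Map each size-k attribute subset S of {0, ..., d-1} to its query f_S."""
    return {S: conjunction_query(S) for S in combinations(range(d), k)}


def normalized_answer(query, records):
    """Fraction of records on which the query evaluates to 1."""
    return sum(query(z) for z in records) / len(records)


# Toy dataset with d = 3 boolean attributes per record.
records = [(1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 1, 1)]
two_way = k_way_conjunctions(d=3, k=2)

for S, f in two_way.items():
    print(S, normalized_answer(f, records))
```

For $d = 3$ and $k = 2$ this enumerates the $\binom{3}{2} = 3$ two-way marginals; the empty conjunction ($S = \emptyset$) evaluates to 1 on every record, matching the convention for an empty product.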

This large and useful class of queries is just one example of the sorts of queries that can be accurately answered by the algorithms given in this section. (Note that if we wish to also allow (non-monotone) conjunctions which ask about negated attributes, we can do that as well: simply double the feature space from $d$ to $2d$, and set $z_{d+i} = 1 - z_i$ for all $i \in \{1, \dots, d\}$.)

4.1 An offline algorithm: SmallDB

In this section, we give an algorithm based on the idea of sampling a small database using the exponential mechanism. What we will show is that, for counting queries, it suffices to consider databases that are small: their size will only be a function of the query class and our desired approximation accuracy $\alpha$, and crucially not of $\|x\|_1$, the size of the private database. This is important because it will allow us to simultaneously guarantee, for all sufficiently large databases, that there is at least one database in the range of the exponential mechanism that well approximates $x$ on queries in $\mathcal{Q}$, and that there are not too many databases in the range to dissipate the probability mass placed on this "good" database.

Algorithm 4 The Small Database Mechanism

SmallDB($x$, $\mathcal{Q}$, $\varepsilon$, $\alpha$)
  Let $\mathcal{R} \leftarrow \{y \in \mathbb{N}^{|\mathcal{X}|} : \|y\|_1 = \frac{\log|\mathcal{Q}|}{\alpha^2}\}$
  Let $u : \mathbb{N}^{|\mathcal{X}|} \times \mathcal{R} \to \mathbb{R}$ be defined to be:
    $u(x, y) = -\max_{f \in \mathcal{Q}} |f(x) - f(y)|$
  Sample and output $y \in \mathcal{R}$ with the exponential mechanism $\mathcal{M}_E(x, u, \mathcal{R})$

We first observe that the Small Database mechanism preserves $\varepsilon$-differential privacy.

Proposition 4.1. The Small Database mechanism is $(\varepsilon, 0)$-differentially private.

Proof. The Small Database mechanism is simply an instantiation of the exponential mechanism. Therefore, privacy follows from Theorem 3.10.

We may similarly call on our analysis of the exponential mechanism to understand the utility guarantees of the Small Database mechanism. But first, we must justify our choice of range $\mathcal{R} = \{y \in \mathbb{N}^{|\mathcal{X}|} : \|y\|_1 = \frac{\log|\mathcal{Q}|}{\alpha^2}\}$, the set of all databases of size $\log|\mathcal{Q}|/\alpha^2$.

Theorem 4.2. For any finite class of linear queries $\mathcal{Q}$, if $\mathcal{R} = \{y \in \mathbb{N}^{|\mathcal{X}|} : \|y\|_1 = \frac{\log|\mathcal{Q}|}{\alpha^2}\}$ then for all $x \in \mathbb{N}^{|\mathcal{X}|}$, there exists a $y \in \mathcal{R}$ such that:
$$\max_{f \in \mathcal{Q}} |f(x) - f(y)| \le \alpha.$$

In other words, we will show that for any collection of linear queries $\mathcal{Q}$ and for any database $x$, there is a "small" database $y$ of size $\|y\|_1 = \frac{\log|\mathcal{Q}|}{\alpha^2}$ that approximately encodes the answers to every query in $\mathcal{Q}$, up to error $\alpha$.

Lemma 4.3 (Sampling Bounds). For any $x \in \mathbb{N}^{|\mathcal{X}|}$ and for any collection of linear queries $\mathcal{Q}$, there exists a database $y$ of size
$$\|y\|_1 = \frac{\log|\mathcal{Q}|}{\alpha^2}$$
such that:
$$\max_{f \in \mathcal{Q}} |f(x) - f(y)| \le \alpha.$$

Proof. Let $m = \frac{\log|\mathcal{Q}|}{\alpha^2}$. We will construct a database $y$ by taking $m$ uniformly random samples from the elements of $x$. Specifically, for $i \in \{1, \ldots, m\}$, let $X_i$ be a random variable taking value $\chi_j \in \mathcal{X}$ with probability $x_j/\|x\|_1$, and let $y$ be the database containing elements $X_1, \ldots, X_m$. Now fix any $f \in \mathcal{Q}$ and consider the quantity $f(y)$. We have:
$$f(y) = \frac{1}{\|y\|_1} \sum_{i=1}^{|\mathcal{X}|} y_i \cdot f(\chi_i) = \frac{1}{m} \sum_{i=1}^{m} f(X_i).$$

We note that each term $f(X_i)$ of the sum is a bounded random variable taking values $0 \le f(X_i) \le 1$ with expectation
$$\mathbb{E}[f(X_i)] = \sum_{j=1}^{|\mathcal{X}|} \frac{x_j}{\|x\|_1} f(\chi_j) = f(x),$$
and that the expectation of $f(y)$ is:
$$\mathbb{E}[f(y)] = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}[f(X_i)] = f(x).$$
Therefore, we can apply the Chernoff bound stated in Theorem 3.1, which gives:
$$\Pr\left[|f(y) - f(x)| > \alpha\right] \le 2e^{-2m\alpha^2}.$$
Taking a union bound over all of the linear queries $f \in \mathcal{Q}$, we get:
$$\Pr\left[\max_{f \in \mathcal{Q}} |f(y) - f(x)| > \alpha\right] \le 2|\mathcal{Q}|e^{-2m\alpha^2}.$$
Plugging in $m = \frac{\log|\mathcal{Q}|}{\alpha^2}$ makes the right hand side equal to $2/|\mathcal{Q}|$, which is smaller than 1 so long as $|\mathcal{Q}| > 2$. This proves that there exists a database of size $m$ satisfying the stated bound, which completes the proof of the lemma.

The proof of Theorem 4.2 simply follows from the observation that $\mathcal{R}$ contains all databases of size $\frac{\log|\mathcal{Q}|}{\alpha^2}$.

Proposition 4.4. Let $\mathcal{Q}$ be any class of linear queries. Let $y$ be the database output by SmallDB$(x, \mathcal{Q}, \varepsilon, \alpha)$. Then with probability $1 - \beta$:
$$\max_{f \in \mathcal{Q}} |f(x) - f(y)| \le \alpha + \frac{2\left(\frac{\log|\mathcal{X}| \log|\mathcal{Q}|}{\alpha^2} + \log\frac{1}{\beta}\right)}{\varepsilon\|x\|_1}.$$

Proof. Applying the utility bounds for the exponential mechanism (Theorem 3.11) with $\Delta u = \frac{1}{\|x\|_1}$ and $\mathrm{OPT}_u(x) \ge -\alpha$ (that is, the best database in $\mathcal{R}$ has error at most $\alpha$, which follows from Theorem 4.2), we find:
$$\Pr\left[\max_{f \in \mathcal{Q}} |f(x) - f(y)| \ge \alpha + \frac{2}{\varepsilon\|x\|_1}\left(\log(|\mathcal{R}|) + t\right)\right] \le e^{-t}.$$
We complete the proof by (1) noting that $\mathcal{R}$, which is the set of all databases of size at most $\log|\mathcal{Q}|/\alpha^2$, satisfies $|\mathcal{R}| \le |\mathcal{X}|^{\log|\mathcal{Q}|/\alpha^2}$ and (2) by setting $t = \log\frac{1}{\beta}$.
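The Small Database mechanism can be sketched on a toy universe as follows. This is a minimal illustration under stated assumptions, not a practical implementation: the range is enumerated explicitly (which is only feasible for a tiny universe), the candidate size `m` is passed in directly rather than computed as $\log|\mathcal{Q}|/\alpha^2$, and the names `small_db` and `evaluate` are hypothetical.

```python
import math
import random
from itertools import combinations_with_replacement

def evaluate(query, hist):
    """Normalized linear query: average of query[j] weighted by the histogram."""
    n = sum(hist)
    return sum(q * h for q, h in zip(query, hist)) / n

def small_db(x, queries, eps, m, rng):
    """Sample a size-m histogram y with the exponential mechanism.

    x       : private histogram over the universe (list of counts)
    queries : list of vectors in [0,1]^{|X|}
    m       : size of the candidate databases (assumed given by the caller)
    """
    candidates = []
    for multiset in combinations_with_replacement(range(len(x)), m):
        y = [0] * len(x)
        for j in multiset:
            y[j] += 1
        candidates.append(y)

    def utility(y):
        return -max(abs(evaluate(f, x) - evaluate(f, y)) for f in queries)

    sensitivity = 1.0 / sum(x)  # a normalized query changes by at most 1/||x||_1
    scores = [eps * utility(y) / (2 * sensitivity) for y in candidates]
    top = max(scores)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical example: 1000 records over a universe of size 3.
x = [600, 300, 100]
queries = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
y = small_db(x, queries, eps=1.0, m=10, rng=random.Random(0))
```

In this example the database is large relative to the range, so the probability mass concentrates on the single candidate that reproduces all three query answers exactly, which is the regime the utility analysis describes.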

Finally, we may now state the utility theorem for SmallDB.

Theorem 4.5. By the appropriate choice of $\alpha$, letting $y$ be the database output by SmallDB$(x, \mathcal{Q}, \varepsilon, \frac{\alpha}{2})$, we can ensure that with probability $1 - \beta$:
$$\max_{f \in \mathcal{Q}} |f(x) - f(y)| \le \left(\frac{16\log|\mathcal{X}|\log|\mathcal{Q}| + 4\log\frac{1}{\beta}}{\varepsilon\|x\|_1}\right)^{1/3}. \qquad (4.2)$$
Equivalently, for any database $x$ with
$$\|x\|_1 \ge \frac{16\log|\mathcal{X}|\log|\mathcal{Q}| + 4\log\frac{1}{\beta}}{\varepsilon\alpha^3} \qquad (4.3)$$
with probability $1 - \beta$: $\max_{f \in \mathcal{Q}} |f(x) - f(y)| \le \alpha$.

Proof. By Proposition 4.4, applied with accuracy parameter $\alpha/2$, we get:
$$\max_{f \in \mathcal{Q}} |f(x) - f(y)| \le \frac{\alpha}{2} + \frac{2\left(\frac{4\log|\mathcal{X}|\log|\mathcal{Q}|}{\alpha^2} + \log\frac{1}{\beta}\right)}{\varepsilon\|x\|_1}.$$
Setting this quantity to be at most $\alpha$ and solving for $\|x\|_1$ yields (4.3). Solving for $\alpha$ yields (4.2).

Note that this theorem states that for fixed $\alpha$ and $\varepsilon$, even with $\delta = 0$, it is possible to answer almost exponentially many queries in the size of the database (see footnote 2). This is in contrast to the Laplace mechanism, when we use it directly to answer linear queries, which can only answer linearly many.

Footnote 2: Specifically, solving for $k$ we find that the mechanism can answer $k$ queries for:
$$k \le \exp\left(O\left(\frac{\alpha^3\varepsilon\|x\|_1}{\log|\mathcal{X}|}\right)\right).$$

Note also that in this discussion, it has been most convenient to think about normalized queries. However, we can get the corresponding bounds for unnormalized queries simply by multiplying by $\|x\|_1$:

Theorem 4.6 (Accuracy theorem for un-normalized queries). By the appropriate choice of $\alpha$, letting $y$ be the database output by

SmallDB$(x, \mathcal{Q}, \varepsilon, \frac{\alpha}{2})$, we can ensure that with probability $1 - \beta$:
$$\max_{f \in \mathcal{Q}} |f(x) - f(y)| \le \|x\|_1^{2/3}\left(\frac{16\log|\mathcal{X}|\log|\mathcal{Q}| + 4\log\frac{1}{\beta}}{\varepsilon}\right)^{1/3}. \qquad (4.4)$$

More Refined Bounds. We proved that every set of linear queries $\mathcal{Q}$ has a collection of databases of size at most $|\mathcal{X}|^{\log|\mathcal{Q}|/\alpha^2}$ that well-approximates every database $x$ with respect to $\mathcal{Q}$ with error at most $\alpha$. This is often an over-estimate, however, since it completely ignores the structure of the queries. For example, if $\mathcal{Q}$ simply contains the same query repeated over and over again, each time in a different guise, then there is no reason that the size of the range of the exponential mechanism should grow with $|\mathcal{Q}|$. Similarly, there may even be classes of queries $\mathcal{Q}$ that have infinite cardinality, but nevertheless are well approximated by small databases. For example, queries that correspond to asking whether a point lies within a given interval on the real line form an infinitely large class $\mathcal{Q}$, since there are uncountably many intervals on the real line. Nevertheless, this class of queries exhibits very simple structure that causes it to be well approximated by small databases. By considering more refined structure of our query classes, we will be able to give bounds for differentially private mechanisms which improve over the simple sampling bounds (Lemma 4.3) and can be non-trivial even for doubly exponentially large classes of queries (see footnote 3). We will not fully develop these bounds here, but will instead state several results for the simpler class of counting queries. Recall that a counting query $f : \mathcal{X} \to \{0, 1\}$ maps database points to boolean values, rather than any value in the interval $[0, 1]$ as linear queries do.

Footnote 3: In fact, our complexity measure for a class of queries can be finite even for infinite classes of queries, but here we are dealing with queries over a finite universe, so there do not exist infinitely many distinct queries.

Definition 4.1 (Shattering). A class of counting queries $\mathcal{Q}$ shatters a collection of points $S \subseteq \mathcal{X}$ if for every $T \subseteq S$, there exists an $f \in \mathcal{Q}$ such that $\{x \in S : f(x) = 1\} = T$. That is, $\mathcal{Q}$ shatters $S$ if for every one of the $2^{|S|}$ subsets $T$ of $S$, there is some function in $\mathcal{Q}$ that labels exactly

those elements as positive, and does not label any of the elements in $S \setminus T$ as positive.

Note that for $\mathcal{Q}$ to shatter $S$ it must be the case that $|\mathcal{Q}| \ge 2^{|S|}$, since $\mathcal{Q}$ must contain a function $f$ for each subset $T \subseteq S$. We can now define our complexity measure for counting queries.

Definition 4.2 (Vapnik–Chervonenkis (VC) Dimension). A collection of counting queries $\mathcal{Q}$ has VC-dimension $d$ if there exists some set $S \subseteq \mathcal{X}$ of cardinality $|S| = d$ such that $\mathcal{Q}$ shatters $S$, and $\mathcal{Q}$ does not shatter any set of cardinality $d + 1$. We can denote this quantity by VC-DIM$(\mathcal{Q})$.

Consider again the class of 1-dimensional intervals on the range $[0, \infty]$ defined over the domain $\mathcal{X} = \mathbb{R}$. The function $f_{a,b}$ corresponding to the interval $[a, b]$ is defined such that $f_{a,b}(x) = 1$ if and only if $x \in [a, b]$. This is an infinite class of queries, but its VC-dimension is 2. For any pair of distinct points $x < y$, there is an interval that contains neither point ($a, b < x$), an interval that contains both points ($a < x < y < b$), and an interval that contains each of the points but not the other ($a < x < b < y$ and $x < a < y < b$). However, for any 3 distinct points $x < y < z$, there is no interval $[a, b]$ such that $f_{a,b}(x) = f_{a,b}(z) = 1$ but $f_{a,b}(y) = 0$.

We observe that the VC-dimension of a finite concept class can never be too large.

Lemma 4.7. For any finite class $\mathcal{Q}$, VC-DIM$(\mathcal{Q}) \le \log|\mathcal{Q}|$ (with the logarithm taken base 2).

Proof. If VC-DIM$(\mathcal{Q}) = d$ then $\mathcal{Q}$ shatters some set of items $S \subseteq \mathcal{X}$ of cardinality $|S| = d$. But by the definition of shattering, since $S$ has $2^d$ distinct subsets, $\mathcal{Q}$ must have at least $2^d$ distinct functions in it.

It will turn out that we can essentially replace the term $\log|\mathcal{Q}|$ with the term VC-DIM$(\mathcal{Q})$ in our bounds for the SmallDB mechanism. By the previous lemma, this can only be an improvement for finite classes $\mathcal{Q}$.

Theorem 4.8. For any finite class of linear queries $\mathcal{Q}$, if $\mathcal{R} = \{y \in \mathbb{N}^{|\mathcal{X}|} : \|y\|_1 \in O\left(\frac{\text{VC-DIM}(\mathcal{Q})}{\alpha^2}\right)\}$ then for all $x \in \mathbb{N}^{|\mathcal{X}|}$, there exists a $y \in \mathcal{R}$

such that:
$$\max_{f \in \mathcal{Q}} |f(x) - f(y)| \le \alpha.$$

As a result of this theorem, we get the analogue of Theorem 4.5 with VC-dimension as our measure of query class complexity:

Theorem 4.9. Let $y$ be the database output by SmallDB$(x, \mathcal{Q}, \varepsilon, \frac{\alpha}{2})$. Then with probability $1 - \beta$:
$$\max_{f \in \mathcal{Q}} |f(x) - f(y)| \le O\left(\left(\frac{\log|\mathcal{X}| \cdot \text{VC-DIM}(\mathcal{Q}) + \log\frac{1}{\beta}}{\varepsilon\|x\|_1}\right)^{1/3}\right).$$
Equivalently, for any database $x$ with
$$\|x\|_1 \ge O\left(\frac{\log|\mathcal{X}| \cdot \text{VC-DIM}(\mathcal{Q}) + \log\frac{1}{\beta}}{\varepsilon\alpha^3}\right)$$
with probability $1 - \beta$: $\max_{f \in \mathcal{Q}} |f(x) - f(y)| \le \alpha$.

An analogous (although more cumbersome) measure of query complexity, the "Fat Shattering Dimension," defines the complexity of a class of linear queries, as opposed to simply counting queries. The Fat Shattering Dimension controls the size of the smallest "$\alpha$-net" (Definition 5.2 in Section 5) for a class of linear queries $\mathcal{Q}$ as VC-dimension does for counting queries. This measure can similarly be used to give more refined bounds for mechanisms designed to privately release linear queries.

4.2 An online mechanism: private multiplicative weights

We will now give a mechanism for answering queries that arrive online and may be interactively chosen. The algorithm will be a simple combination of the sparse vector algorithm (which can answer threshold queries adaptively), and the exponentiated gradient descent algorithm for learning linear predictors online.

This latter algorithm is also known as Hedge or more generally the multiplicative weights technique. The idea is the following: When we

view the database $D \in \mathbb{N}^{|\mathcal{X}|}$ as a histogram and are interested only in linear queries (i.e., linear functions of this histogram), then we can view the problem of answering linear queries as the problem of learning the linear function $D$ that defines the query answers $\langle D, q \rangle$, given a query $q \in [0, 1]^{|\mathcal{X}|}$. If the learning algorithm only needs to access the data using privacy-preserving queries, then rather than having a privacy cost that grows with the number of queries we would like to answer, we can have a privacy cost that grows only with the number of queries the learning algorithm needs to make. The "multiplicative weights" algorithm which we present next is a classical example of such a learning algorithm: it can learn any linear predictor by making only a small number of queries. It maintains at all times a current "hypothesis predictor," and accesses the data only by requiring examples of queries on which its hypothesis predictor differs from the (true) private database by a large amount. Its guarantee is that it will always learn the target linear function up to small error, given only a small number of such examples.

How can we find these examples? The sparse vector algorithm that we saw in the previous section allows us to do this on the fly, while paying for only those examples that have high error on the current multiplicative weights hypothesis. As queries come in, we ask whether the true answer to the query differs substantially from the answer to the query on the current multiplicative weights hypothesis. Note that this is a threshold query of the type handled by the sparse vector technique. If the answer is "no" (i.e., the difference, or error, is "below threshold"), then we can respond to the query using the publicly known hypothesis predictor, and have no further privacy loss. If the answer is "yes," meaning that the currently known hypothesis predictor gives rise to an error that is above threshold, then we have found an example appropriate to update our learning algorithm. Because "above threshold" answers correspond exactly to queries needed to update our learning algorithm, the total privacy cost depends only on the learning rate of the algorithm, and not on the total number of queries that we answer.

First we give the multiplicative weights update rule and prove a theorem about its convergence in the language of answering linear queries.

It will be convenient to think of databases $x$ as being probability distributions over the data universe $\mathcal{X}$. That is, letting $\Delta([\mathcal{X}])$ denote the set of probability distributions over the set $[|\mathcal{X}|]$, we have $x \in \Delta([\mathcal{X}])$. Note that we can always scale a database to have this property without changing the normalized value of any linear query.

Algorithm 5 The Multiplicative Weights (MW) Update Rule. It is instantiated with a parameter $\eta \le 1$. In the following analysis, we will take $\eta = \alpha/2$, where $\alpha$ is the parameter specifying our target accuracy.
MW$(x^t, f_t, v_t)$:
  if $v_t < f_t(x^t)$ then
    Let $r_t = f_t$
  else
    Let $r_t = 1 - f_t$ (i.e., for all $\chi_i$, $r_t(\chi_i) = 1 - f_t[\chi_i]$)
  end if
  Update: For all $i \in [|\mathcal{X}|]$ Let
    $\hat{x}^{t+1}_i = \exp(-\eta r_t[i]) \cdot x^t_i$
    $x^{t+1}_i = \frac{\hat{x}^{t+1}_i}{\sum_{j=1}^{|\mathcal{X}|} \hat{x}^{t+1}_j}$
  Output $x^{t+1}$.

Theorem 4.10. Fix a class of linear queries $\mathcal{Q}$ and a database $x \in \Delta([\mathcal{X}])$, and let $x^1 \in \Delta([\mathcal{X}])$ describe the uniform distribution over $\mathcal{X}$: $x^1_i = 1/|\mathcal{X}|$ for all $i$. Now consider a maximal length sequence of databases $x^t$ for $t \in \{2, \ldots, L\}$ generated by setting $x^{t+1} = MW(x^t, f_t, v_t)$ as described in Algorithm 5, where for each $t$, $f_t \in \mathcal{Q}$ and $v_t \in \mathbb{R}$ are such that:
1. $|f_t(x) - f_t(x^t)| > \alpha$, and
2. $|f_t(x) - v_t| < \alpha$.
Then it must be that:
$$L \le 1 + \frac{4\log|\mathcal{X}|}{\alpha^2}.$$
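The update rule and the bound of Theorem 4.10 can be exercised on a toy instance. The sketch below is non-private and uses exact answers for $v_t$ (so condition 2 holds trivially); the helper `run_until_accurate` and the example data are assumptions made for illustration.

```python
import math

def answer(f, p):
    """Value of the linear query f on the distribution p."""
    return sum(fi * pi for fi, pi in zip(f, p))

def mw_update(x_t, f, v, eta):
    """One step of the multiplicative weights update rule (Algorithm 5).

    x_t : current hypothesis, a probability vector over the universe
    f   : linear query, a vector in [0,1]^{|X|}
    v   : (approximate) true answer f(x)
    """
    # If the hypothesis over-estimates the query, penalize elements with large f;
    # otherwise penalize elements with small f.
    r = list(f) if v < answer(f, x_t) else [1.0 - fi for fi in f]
    x_hat = [math.exp(-eta * ri) * xi for ri, xi in zip(r, x_t)]
    total = sum(x_hat)
    return [xi / total for xi in x_hat]

def run_until_accurate(x, queries, alpha):
    """Update on any query with error > alpha until none remains."""
    n = len(x)
    x_t = [1.0 / n] * n  # uniform starting hypothesis
    updates = 0
    while True:
        bad = [f for f in queries if abs(answer(f, x) - answer(f, x_t)) > alpha]
        if not bad:
            return x_t, updates
        x_t = mw_update(x_t, bad[0], answer(bad[0], x), eta=alpha / 2)
        updates += 1

# Hypothetical example: a skewed distribution and the four point queries.
x = [0.7, 0.1, 0.1, 0.1]
queries = [[1.0 if j == i else 0.0 for j in range(4)] for i in range(4)]
alpha = 0.1
hyp, updates = run_until_accurate(x, queries, alpha)
```

When the loop exits, no distinguishing query remains, so `hyp` answers every query in the class to within `alpha`, and `updates` stays below the $4\log|\mathcal{X}|/\alpha^2$ bound.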

Note that if we prove this theorem, we will have proven that for the last database $x^{L+1}$ in the sequence it must be that for all $f \in \mathcal{Q}$: $|f(x) - f(x^{L+1})| \le \alpha$, as otherwise it would be possible to extend the sequence, contradicting maximality. In other words, given distinguishing queries $f_t$, the multiplicative weights update rule learns the private database $x$ with respect to any class of linear queries $\mathcal{Q}$, up to some tolerance $\alpha$, in only a small number ($L$) of steps. We will use this theorem as follows. The Private Online Multiplicative Weights algorithm, described (twice!) below, will at all times have a public approximation $x^t$ to the database $x$. Given an input query $f$, the algorithm will compute a noisy approximation to the difference $|f(x) - f(x^t)|$. If the (noisy) difference is large, the algorithm will provide a noisy approximation $f(x) + \lambda_t$ to the true answer $f(x)$, where $\lambda_t$ is drawn from some appropriately chosen Laplace distribution, and the Multiplicative Weights Update Rule will be invoked with parameters $(x^t, f, f(x) + \lambda_t)$. If the update rule is invoked only when the difference $|f(x) - f(x^t)|$ is truly large (Theorem 4.10, condition 1), and if the approximations $f(x) + \lambda_t$ are sufficiently accurate (Theorem 4.10, condition 2), then we can apply the theorem to conclude that updates are not so numerous (because $L$ is not so large) and the resulting $x^{L+1}$ gives accurate answers to all queries in $\mathcal{Q}$ (because no distinguishing query remains).

Theorem 4.10 is proved by keeping track of a potential function $\Psi$ measuring the similarity between the hypothesis database $x^t$ at time $t$, and the true database $x$. We will show:
1. The potential function does not start out too large.
2. The potential function decreases by a significant amount at each update round.
3. The potential function is always non-negative.
Together, these 3 facts will force us to conclude that there cannot be too many update rounds.

Let us now begin the analysis for the proof of the convergence theorem.

Proof. We must show that any sequence $\{(x^t, f_t, v_t)\}_{t=1,\ldots,L}$ with the property that $|f_t(x^t) - f_t(x)| > \alpha$ and $|v_t - f_t(x)| < \alpha$ cannot have $L > \frac{4\log|\mathcal{X}|}{\alpha^2}$.

We define our potential function as follows. Recall that we here view the database as a probability distribution, that is, we assume $\|x\|_1 = 1$. Of course this does not require actually modifying the real database. The potential function that we use is the relative entropy, or KL divergence, between $x$ and $x^t$ (when viewed as probability distributions):
$$\Psi_t \stackrel{\text{def}}{=} KL(x \| x^t) = \sum_{i=1}^{|\mathcal{X}|} x[i] \log\left(\frac{x[i]}{x^t[i]}\right).$$

We begin with a simple fact:

Proposition 4.11. For all $t$: $\Psi_t \ge 0$, and $\Psi_1 \le \log|\mathcal{X}|$.

Proof. Relative entropy (KL-Divergence) is always a non-negative quantity, by the log-sum inequality, which states that if $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$ are non-negative numbers, then
$$\sum_i a_i \log\frac{a_i}{b_i} \ge \left(\sum_i a_i\right) \log\frac{\sum_i a_i}{\sum_i b_i}.$$
To see that $\Psi_1 \le \log|\mathcal{X}|$, recall that $x^1[i] = 1/|\mathcal{X}|$ for all $i$, and so $\Psi_1 = \sum_{i=1}^{|\mathcal{X}|} x[i] \log(|\mathcal{X}| x[i])$. Noting that $x$ is a probability distribution, we see that this quantity is maximized when $x[1] = 1$ and $x[i] = 0$ for all $i > 1$, giving $\Psi_1 = \log|\mathcal{X}|$.

We will now argue that at each step, the potential function drops by at least $\alpha^2/4$. Because the potential begins at no more than $\log|\mathcal{X}|$, and must always be non-negative, we therefore know that there can be at most $L \le 4\log|\mathcal{X}|/\alpha^2$ steps in the database update sequence. To begin, let us see exactly how much the potential drops at each step:

Lemma 4.12.
$$\Psi_t - \Psi_{t+1} \ge \eta\left(\langle r_t, x^t \rangle - \langle r_t, x \rangle\right) - \eta^2$$

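Proof. Recall that $\sum_{i=1}^{|\mathcal{X}|} x[i] = 1$.
$$\begin{aligned}
\Psi_t - \Psi_{t+1} &= \sum_{i=1}^{|\mathcal{X}|} x[i] \log\left(\frac{x[i]}{x^t_i}\right) - \sum_{i=1}^{|\mathcal{X}|} x[i] \log\left(\frac{x[i]}{x^{t+1}_i}\right) \\
&= \sum_{i=1}^{|\mathcal{X}|} x[i] \log\left(\frac{x^{t+1}_i}{x^t_i}\right) \\
&= \sum_{i=1}^{|\mathcal{X}|} x[i] \log\left(\frac{\hat{x}^{t+1}_i / \sum_j \hat{x}^{t+1}_j}{x^t_i}\right) \\
&= \sum_{i=1}^{|\mathcal{X}|} x[i] \left[\log\left(\frac{x^t_i \exp(-\eta r_t[i])}{x^t_i}\right) - \log\left(\sum_{j=1}^{|\mathcal{X}|} \exp(-\eta r_t[j]) x^t_j\right)\right] \\
&= -\sum_{i=1}^{|\mathcal{X}|} x[i] \, \eta r_t[i] - \log\left(\sum_{j=1}^{|\mathcal{X}|} \exp(-\eta r_t[j]) x^t_j\right) \\
&= -\eta \langle r_t, x \rangle - \log\left(\sum_{j=1}^{|\mathcal{X}|} \exp(-\eta r_t[j]) x^t_j\right) \\
&\ge -\eta \langle r_t, x \rangle - \log\left(\sum_{j=1}^{|\mathcal{X}|} x^t_j \left(1 + \eta^2 - \eta r_t[j]\right)\right) \\
&= -\eta \langle r_t, x \rangle - \log\left(1 + \eta^2 - \eta \langle r_t, x^t \rangle\right) \\
&\ge \eta\left(\langle r_t, x^t \rangle - \langle r_t, x \rangle\right) - \eta^2.
\end{aligned}$$
The first inequality follows from the fact that:
$$\exp(-\eta r_t[j]) \le 1 - \eta r_t[j] + \eta^2 (r_t[j])^2 \le 1 - \eta r_t[j] + \eta^2.$$
The second inequality follows from the fact that $\log(1 + y) \le y$ for $y > -1$.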
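Lemma 4.12 can be sanity-checked numerically. The sketch below draws random distributions, penalty vectors $r \in [0,1]^{|\mathcal{X}|}$ and step sizes $\eta \le 1$, and compares the actual drop in the KL potential against the lower bound of the lemma. The sampling scheme and helper names are assumptions made for illustration.

```python
import math
import random

def kl(p, q):
    """Relative entropy KL(p || q) for strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def inner(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def potential_drop_check(x, x_t, r, eta):
    """Return (actual drop in potential, lower bound from Lemma 4.12)."""
    x_hat = [math.exp(-eta * ri) * xi for ri, xi in zip(r, x_t)]
    total = sum(x_hat)
    x_next = [xi / total for xi in x_hat]
    drop = kl(x, x_t) - kl(x, x_next)
    bound = eta * (inner(r, x_t) - inner(r, x)) - eta ** 2
    return drop, bound

def random_dist(rng, n):
    """A strictly positive random probability vector of length n."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]

rng = random.Random(1)
worst_gap = float("inf")
for _ in range(1000):
    x, x_t = random_dist(rng, 5), random_dist(rng, 5)
    r = [rng.random() for _ in range(5)]
    drop, bound = potential_drop_check(x, x_t, r, eta=rng.random())
    worst_gap = min(worst_gap, drop - bound)
```

The quantity `worst_gap` records the smallest slack observed; the lemma says it should never be negative, up to floating-point error.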

The rest of the proof now follows easily. By the conditions of the database/query sequence (described in the hypothesis for Theorem 4.10 above), for every $t$,
1. $|f_t(x) - f_t(x^t)| \ge \alpha$ and
2. $|v_t - f_t(x)| < \alpha$.
Thus, $f_t(x) < f_t(x^t)$ if and only if $v_t < f_t(x^t)$. In particular, $r_t = f_t$ if $f_t(x^t) - f_t(x) \ge \alpha$, and $r_t = 1 - f_t$ if $f_t(x) - f_t(x^t) \ge \alpha$. Therefore, by Lemma 4.12 and the choice of $\eta = \alpha/2$ as described in the Update Rule,
$$\Psi_t - \Psi_{t+1} \ge \frac{\alpha}{2}\left(\langle r_t, x^t \rangle - \langle r_t, x \rangle\right) - \frac{\alpha^2}{4} \ge \frac{\alpha}{2}(\alpha) - \frac{\alpha^2}{4} = \frac{\alpha^2}{2} - \frac{\alpha^2}{4} = \frac{\alpha^2}{4}.$$
Finally we know:
$$0 \le \Psi_L \le \Psi_0 - L \cdot \frac{\alpha^2}{4} \le \log|\mathcal{X}| - L\frac{\alpha^2}{4}.$$
Solving, we find: $L \le \frac{4\log|\mathcal{X}|}{\alpha^2}$. This completes the proof.

We can now combine the Multiplicative Weights Update Rule with the NumericSparse algorithm to give an interactive query release mechanism. For $(\varepsilon, 0)$ privacy, we essentially (with somewhat worse constants) recover the bound for SmallDB. For $(\varepsilon, \delta)$-differential privacy, we obtain better bounds, by virtue of being able to use the composition theorem. The queries to NumericSparse are asking whether the magnitude of the error given by estimating $f_i(x)$ by applying $f_i$ to the current approximation $x^t$ to $x$ is above an appropriately chosen threshold $T$, that is, they are asking if $|f(x) - f(x^t)|$ is large. For technical reasons this is done by asking about $f(x) - f(x^t)$ (without the absolute value) and about $f(x^t) - f(x)$. Recall that the NumericSparse algorithm responds with either $\perp$ or some (positive) value exceeding $T$. We use the mnemonic $E$ for the responses to emphasize that the query is asking about an error.

Theorem 4.13. The Online Multiplicative Weights Mechanism (via NumericSparse) is $(\varepsilon, 0)$-differentially private.

Algorithm 6 The Online Multiplicative Weights Mechanism (via NumericSparse) takes as input a private database $x$, privacy parameters $\varepsilon, \delta$, accuracy parameters $\alpha$ and $\beta$, and a stream of linear queries $\{f_i\}$ that may be chosen adaptively from a class of queries $\mathcal{Q}$. It outputs a stream of answers $\{a_i\}$.
OnlineMW via NumericSparse $(x, \{f_i\}, \varepsilon, \delta, \alpha, \beta)$
  Let $c \leftarrow \frac{4\log|\mathcal{X}|}{\alpha^2}$
  if $\delta = 0$ then
    Let $T \leftarrow \frac{18c\left(\log(2|\mathcal{Q}|) + \log(4c/\beta)\right)}{\varepsilon\|x\|_1}$
  else
    Let $T \leftarrow \frac{(2 + 32\sqrt{2})\sqrt{c\log\frac{2}{\delta}}\left(\log k + \log\frac{4c}{\beta}\right)}{\varepsilon\|x\|_1}$
  end if
  Initialize NumericSparse$(x, \{f'_i\}, T, c, \varepsilon, \delta)$ with a stream of queries $\{f'_i\}$, outputting a stream of answers $E_i$.
  Let $t \leftarrow 0$, and let $x^0 \in \Delta([\mathcal{X}])$ satisfy $x^0_i = 1/|\mathcal{X}|$ for all $i \in [|\mathcal{X}|]$.
  for each query $f_i$ do
    Let $f'_{2i-1}(\cdot) = f_i(\cdot) - f_i(x^t)$.
    Let $f'_{2i}(\cdot) = f_i(x^t) - f_i(\cdot)$
    if $E_{2i-1} = \perp$ and $E_{2i} = \perp$ then
      Let $a_i = f_i(x^t)$
    else
      if $E_{2i-1} \in \mathbb{R}$ then
        Let $a_i = f_i(x^t) + E_{2i-1}$
      else
        Let $a_i = f_i(x^t) - E_{2i}$
      end if
      Let $x^{t+1} = MW(x^t, f_i, a_i)$
      Let $t \leftarrow t + 1$.
    end if
  end for

Proof. This follows directly from the privacy analysis of NumericSparse, because the OnlineMW algorithm accesses the database only through NumericSparse.
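The control flow of Algorithm 6 can be sketched as follows. This is a simplified illustration and not a vetted private implementation: the sparse-vector step is an inlined stand-in for NumericSparse, the noise scale `sigma` is an assumed value rather than the calibrated one from Section 3.6, the threshold is supplied by the caller, and the second threshold query for $f_i$ is skipped once the first one fires.

```python
import math
import random

def laplace(rng, scale):
    """Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dot(f, p):
    return sum(fi * pi for fi, pi in zip(f, p))

def mw_update(x_t, f, v, eta):
    r = list(f) if v < dot(f, x_t) else [1.0 - fi for fi in f]
    x_hat = [math.exp(-eta * ri) * xi for ri, xi in zip(r, x_t)]
    s = sum(x_hat)
    return [xi / s for xi in x_hat]

def online_mw(hist, queries, eps, alpha, threshold, rng):
    """Answer a stream of linear queries, updating the hypothesis on large errors."""
    n = sum(hist)
    x = [h / n for h in hist]                 # normalized private database
    size = len(hist)
    x_t = [1.0 / size] * size                 # public uniform starting hypothesis
    c = math.ceil(4 * math.log(size) / alpha ** 2)  # update budget (Theorem 4.10)
    sigma = 2.0 * c / (eps * n)               # ASSUMED noise scale, sensitivity 1/n
    noisy_t = threshold + laplace(rng, sigma)
    updates, answers = 0, []
    for f in queries:
        if updates >= c:
            break                             # budget exhausted: stop answering
        err = dot(f, x) - dot(f, x_t)
        # Two threshold queries: the signed error and its negation.
        for sign in (+1, -1):
            if sign * err + laplace(rng, 2 * sigma) >= noisy_t:
                e = sign * err + laplace(rng, sigma)  # noisy numeric error E
                a = dot(f, x_t) + sign * e
                x_t = mw_update(x_t, f, a, eta=alpha / 2)
                updates += 1
                noisy_t = threshold + laplace(rng, sigma)
                break
        else:
            a = dot(f, x_t)                   # below threshold: use the hypothesis
        answers.append(a)
    return answers, x_t, updates

# Hypothetical example: one million records, four point queries asked 20 times.
hist = [700_000, 100_000, 100_000, 100_000]
basis = [[1.0 if j == i else 0.0 for j in range(4)] for i in range(4)]
stream = basis * 20
alpha = 0.2
answers, hypothesis, updates = online_mw(
    hist, stream, eps=1.0, alpha=alpha, threshold=alpha, rng=random.Random(0))
truth = [dot(f, [h / sum(hist) for h in hist]) for f in stream]
```

Only the queries that trigger an update consume the budget `c`; every other query is answered from the public hypothesis, which is the source of the savings discussed in the text.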

Speaking informally, the proof of utility for the Online Multiplicative Weights Mechanism (via NumericSparse) uses the utility theorem for NumericSparse (Theorem 3.28) to conclude that, with high probability, the Multiplicative Weights Update Rule is only invoked when the query $f_i$ is truly a distinguishing query, meaning $|f_i(x) - f_i(x^t)|$ is "large," and the released noisy approximations to $f_i(x)$ are "accurate." Under this assumption, we can apply the convergence theorem (Theorem 4.10) to conclude that the total number of updates is small and therefore the algorithm can answer all queries in $\mathcal{Q}$.

Theorem 4.14. For $\delta = 0$, with probability at least $1 - \beta$, for all queries $f_i$, the Online Multiplicative Weights Mechanism (via NumericSparse) returns an answer $a_i$ such that $|f_i(x) - a_i| \le 3\alpha$ for any $\alpha$ such that:
$$\alpha \ge \frac{36\log|\mathcal{X}|\left(\log(|\mathcal{Q}|) + \log\left(\frac{32\log|\mathcal{X}|}{\alpha^2\beta}\right)\right)}{\varepsilon\alpha^2\|x\|_1}$$

Proof. Recall that, by Theorem 3.28, given $k$ queries and a maximum number $c$ of above-threshold queries, NumericSparse is $(\alpha, \beta)$-accurate for any $\alpha$ such that:
$$\alpha \ge \frac{9c\left(\log k + \log(4c/\beta)\right)}{\varepsilon}.$$
In our case $c = 4\log|\mathcal{X}|/\alpha^2$ and $k = 2|\mathcal{Q}|$, and we have been normalizing, which reduces $\alpha$ by a factor of $\|x\|_1$. With this in mind, we can take
$$\alpha = \frac{36\log|\mathcal{X}|\left(\log(|\mathcal{Q}|) + \log\left(\frac{32\log|\mathcal{X}|}{\alpha^2\beta}\right)\right)}{\varepsilon\alpha^2\|x\|_1}$$
and note that with this value we get $T = 2\alpha$ for the case $\delta = 0$.

Assume we are in this high ($1 - \beta$) probability case. Then for all $i$ such that $f_i$ triggers an update, $|f_i(x) - f_i(x^t)| \ge T - \alpha = \alpha$ (Theorem 4.10, condition 1). Thus, $f_i, a_i$ form a valid pair of query/value updates as required in the hypothesis of Theorem 4.10 and so, by that theorem, there can be at most $c = \frac{4\log|\mathcal{X}|}{\alpha^2}$ such update steps.

In addition, still by the accuracy properties of the Sparse Vector algorithm,
1. at most one of $E_{2i-1}, E_{2i}$ will have a value other than $\perp$;

2. for all $i$ such that no update is triggered ($a_i = f_i(x^t)$) we have $|f_i(x) - f_i(x^t)| \le T + \alpha = 3\alpha$; and
3. for all $i$ such that an update is triggered we have $|f_i(x) - a_i| \le \alpha$ (Theorem 4.10, condition 2).

Optimizing the above expression for $\alpha$ and removing the normalization factor, we find that the OnlineMW mechanism can answer each linear query to accuracy $3\alpha$ except with probability $\beta$ for:
$$\alpha = \|x\|_1^{2/3}\left(\frac{36\log|\mathcal{X}|\left(\log(|\mathcal{Q}|) + \log\left(\frac{32\log|\mathcal{X}|^{1/3}\|x\|_1^{2/3}}{\beta}\right)\right)}{\varepsilon}\right)^{1/3}$$
which is comparable to the SmallDB mechanism.

By repeating the same argument, but instead using the utility theorem for the $(\varepsilon, \delta)$-private version of Sparse Vector (Theorem 3.28), we obtain the following theorem.

Theorem 4.15. For $\delta > 0$, with probability at least $1 - \beta$, for all queries $f_i$, OnlineMW returns an answer $a_i$ such that $|f_i(x) - a_i| \le 3\alpha$ for any $\alpha$ such that:
$$\alpha \ge \frac{(2 + 32\sqrt{2}) \cdot \sqrt{\log|\mathcal{X}|\log\frac{2}{\delta}}\left(\log|\mathcal{Q}| + \log\left(\frac{32\log|\mathcal{X}|}{\alpha^2\beta}\right)\right)}{\alpha\varepsilon\|x\|_1}$$

Again optimizing the above expression for $\alpha$ and removing the normalization factor, we find that the OnlineMW mechanism can answer each linear query to accuracy $3\alpha$ except with probability $\beta$, for:
$$\alpha = \|x\|_1^{1/2}\left(\frac{(2 + 32\sqrt{2}) \cdot \sqrt{\log|\mathcal{X}|\log\frac{2}{\delta}}\left(\log|\mathcal{Q}| + \log\left(\frac{32\|x\|_1}{\beta}\right)\right)}{\varepsilon}\right)^{1/2}$$
which gives better accuracy (as a function of $\|x\|_1$) than the SmallDB mechanism. Intuitively, the greater accuracy comes from the iterative nature of the mechanism, which allows us to take advantage of our composition theorems for $(\varepsilon, \delta)$-privacy. The SmallDB mechanism runs

in just a single shot, and so there is no opportunity to take advantage of composition.

The accuracy of the private multiplicative weights algorithm has dependencies on several parameters, which are worth further discussion. In the end, the algorithm answers queries using the sparse vector technique paired with a learning algorithm for linear functions. As we proved in the last section, the sparse vector technique introduces error that scales like $O(c\log k/(\varepsilon\|x\|_1))$ when a total of $k$ sensitivity $1/\|x\|_1$ queries are made, and at most $c$ of them can have "above threshold" answers, for any threshold $T$. Recall that these error terms arise because the privacy analysis for the sparse vector algorithm allows us to "pay" only for the above threshold queries, and therefore we can add noise $O(c/(\varepsilon\|x\|_1))$ to each query. On the other hand, since we end up adding independent Laplace noise with scale $\Omega(c/(\varepsilon\|x\|_1))$ to $k$ queries in total, we expect that the maximum error over all $k$ queries is larger by a $\log k$ factor. But what is $c$, and what queries should we ask? The multiplicative weights learning algorithm gives us a query strategy and a guarantee that no more than $c = O(\log|\mathcal{X}|/\alpha^2)$ queries will be above a threshold of $T = O(\alpha)$, for any $\alpha$. (The queries we ask are always: "How much does the real answer differ from the predicted answer of the current multiplicative weights hypothesis?" The answers to these questions both give us the true answers to the queries, as well as instructions on how to update the learning algorithm appropriately when a query is above threshold.) Together, this leads us to set the threshold to be $O(\alpha)$, where $\alpha$ is the expression that satisfies: $\alpha = O(\log|\mathcal{X}|\log k/(\varepsilon\|x\|_1\alpha^2))$. This balances the two sources of error: error from the sparse vector technique, and error from failing to update the multiplicative weights hypothesis.
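To make the balancing step concrete, the fixed-point condition can be solved for $\alpha$ directly. The following is a sketch with all constants suppressed into a single assumed constant $C$; it recovers the cube-root dependence that appears in the $\delta = 0$ bound above.

```latex
\begin{align*}
\alpha &= \frac{C \log|\mathcal{X}| \log k}{\varepsilon \|x\|_1 \alpha^2}
  && \text{(sparse vector error with } c = O(\log|\mathcal{X}|/\alpha^2)\text{)} \\
\alpha^3 &= \frac{C \log|\mathcal{X}| \log k}{\varepsilon \|x\|_1} \\
\alpha &= \left(\frac{C \log|\mathcal{X}| \log k}{\varepsilon \|x\|_1}\right)^{1/3}
  && \text{(normalized error)} \\
\|x\|_1 \cdot \alpha &= \|x\|_1^{2/3} \left(\frac{C \log|\mathcal{X}| \log k}{\varepsilon}\right)^{1/3}
  && \text{(un-normalized error)}
\end{align*}
```

The last line has the same $\|x\|_1^{2/3}$ shape as the optimized bound following Theorem 4.14, up to constants and lower-order logarithmic terms.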
4.3 Bibliographical notes

The offline query release mechanism given in this section is from Blum et al. [8], which gave bounds in terms of the VC-Dimension of the query class (Theorem 4.9). The generalization to fat shattering dimension is given in [72].

The online query release mechanism given in this section is from Hardt and Rothblum [44]. This mechanism uses the classic multiplicative weights update method, for which Arora, Hazan and Kale give an excellent survey [1]. Slightly improved bounds for the private multiplicative weights mechanism were given by Gupta et al. [39], and the analysis here follows the presentation from [39].

5 Generalizations

In this section we generalize the query release algorithms of the previous section. As a result, we get bounds for arbitrary low sensitivity queries (not just linear queries), as well as new bounds for linear queries. These generalizations also shed some light on a connection between query release and machine learning.

The SmallDB offline query release mechanism in Section 4 is a special case of what we call the net mechanism. We saw that both mechanisms in that section yield synthetic databases, which provide a convenient means for approximating the value of any query in $\mathcal{Q}$ on the private database: just evaluate the query on the synthetic database and take the result as the noisy answer. More generally, a mechanism can produce a data structure of arbitrary form that, together with a fixed, public algorithm (independent of the database), provides a method for approximating the values of queries.

The Net mechanism is a straightforward generalization of the SmallDB mechanism: First, fix, independent of the actual database, an $\alpha$-net of data structures such that evaluation of any query in $\mathcal{Q}$ using the released data structure gives a good (within an additive $\alpha$ error) estimate of the value of the query on the private database. Next, apply

the exponential mechanism to choose an element of this net, where the quality function minimizes the maximum error, over the queries in $\mathcal{Q}$, for the elements of the net.

We also generalize the online multiplicative weights algorithm so that we can instantiate it with any other online learning algorithm for learning a database with respect to a set of queries. We note that such a mechanism can be run either online, or offline, where the set of queries to be asked to the "online" mechanism is instead selected using a "private distinguisher," which identifies queries on which the current hypothesis of the learner differs substantially from the real database. These are queries that would have yielded an update step in the online algorithm. A "distinguisher" turns out to be equivalent to an agnostic learning algorithm, which sheds light on a source of hardness for efficient query release mechanisms.

In the following sections, we will discuss data structures for classes of queries $\mathcal{Q}$.

Definition 5.1. A data structure $D$ drawn from some class of data structures $\mathcal{D}$ for a class of queries $\mathcal{Q}$ is implicitly endowed with an evaluation function Eval $: \mathcal{D} \times \mathcal{Q} \to \mathbb{R}$ with which we can evaluate any query in $\mathcal{Q}$ on $D$. However, to avoid being encumbered by notation, we will write simply $f(D)$ to denote Eval$(D, f)$ when the meaning is clear from context.

5.1 Mechanisms via $\alpha$-nets

Given a collection of queries $\mathcal{Q}$, we define an $\alpha$-net as follows:

Definition 5.2 ($\alpha$-net). An $\alpha$-net of data structures with respect to a class of queries $\mathcal{Q}$ is a set $\mathcal{N} \subset \mathbb{N}^{|\mathcal{X}|}$ such that for all $x \in \mathbb{N}^{|\mathcal{X}|}$, there exists an element of the $\alpha$-net $y \in \mathcal{N}$ such that:
$$\max_{f \in \mathcal{Q}} |f(x) - f(y)| \le \alpha.$$
We write $\mathcal{N}_\alpha(\mathcal{Q})$ to denote an $\alpha$-net of minimum cardinality among the set of all $\alpha$-nets for $\mathcal{Q}$.

That is, for every possible database $x$, there exists a member of the $\alpha$-net that "looks like" $x$ with respect to all queries in $\mathcal{Q}$, up to an error tolerance of $\alpha$.

Small $\alpha$-nets will be useful for us, because when paired with the exponential mechanism, they will lead directly to mechanisms for answering queries with high accuracy. Given a class of functions $\mathcal{Q}$, we will define an instantiation of the exponential mechanism known as the Net mechanism.

Algorithm 7 The Net Mechanism
NetMechanism$(x, \mathcal{Q}, \varepsilon, \alpha)$
  Let $\mathcal{R} \leftarrow \mathcal{N}_\alpha(\mathcal{Q})$
  Let $q : \mathbb{N}^{|\mathcal{X}|} \times \mathcal{R} \to \mathbb{R}$ be defined to be:
    $q(x, y) = -\max_{f \in \mathcal{Q}} |f(x) - f(y)|$
  Sample and output $y \in \mathcal{R}$ with the exponential mechanism $\mathcal{M}_E(x, q, \mathcal{R})$

We first observe that the Net mechanism preserves $\varepsilon$-differential privacy.

Proposition 5.1. The Net mechanism is $(\varepsilon, 0)$-differentially private.

Proof. The Net mechanism is simply an instantiation of the exponential mechanism. Therefore, privacy follows from Theorem 3.10.

We may similarly call on our analysis of the exponential mechanism to begin understanding the utility guarantees of the Net mechanism:

Proposition 5.2. Let $\mathcal{Q}$ be any class of sensitivity $1/\|x\|_1$ queries. Let $y$ be the database output by NetMechanism$(x, \mathcal{Q}, \varepsilon, \alpha)$. Then with probability $1 - \beta$:
$$\max_{f \in \mathcal{Q}} |f(x) - f(y)| \le \alpha + \frac{2\left(\log(|\mathcal{N}_\alpha(\mathcal{Q})|) + \log\frac{1}{\beta}\right)}{\varepsilon\|x\|_1}.$$

Proof. By applying Theorem 3.11 and noting that $S(q) = \frac{1}{\|x\|_1}$, and that $\mathrm{OPT}_q(x) \ge -\alpha$ by the definition of an $\alpha$-net (the net contains an element whose maximum error is at most $\alpha$), we find:

$$\Pr\left[\max_{f \in \mathcal{Q}} |f(x) - f(y)| \ge \alpha + \frac{2}{\varepsilon \|x\|_1}\left(\log\left(|\mathcal{N}_\alpha(\mathcal{Q})|\right) + t\right)\right] \le e^{-t}.$$

Plugging in $t = \log\left(\frac{1}{\beta}\right)$ completes the proof.

We can therefore see that an upper bound on $|\mathcal{N}_\alpha(\mathcal{Q})|$ for a collection of functions $\mathcal{Q}$ immediately gives an upper bound on the error with which a differentially private mechanism can answer all functions in the class $\mathcal{Q}$ simultaneously.

This is exactly what we did in Section 4.1, where we saw that the key quantity is the VC-dimension of $\mathcal{Q}$, when $\mathcal{Q}$ is a class of linear queries.

5.2 The iterative construction mechanism

In this section, we derive an offline generalization of the private multiplicative weights algorithm, which can be instantiated with any properly defined learning algorithm. Informally, a database update algorithm maintains a sequence of data structures $D^1, D^2, \ldots$ that give increasingly good approximations to the input database $x$ (in a sense that depends on the database update algorithm). Moreover, these mechanisms produce the next data structure in the sequence by considering only one query $f$ that distinguishes the real database in the sense that $f(D^t)$ differs significantly from $f(x)$. The algorithm in this section shows that, up to small factors, solving the query-release problem in a differentially private manner is equivalent to solving the simpler learning or distinguishing problem in a differentially private manner: given a private distinguishing algorithm and a non-private database update algorithm, we get a corresponding private release algorithm. We can plug in the exponential mechanism as a canonical private distinguisher, and the multiplicative weights algorithm as a generic database update algorithm for the general linear query setting, but more efficient distinguishers are possible in special cases.

Syntactically, we will consider functions of the form $U : \mathcal{D} \times \mathcal{Q} \times \mathbb{R} \to \mathcal{D}$, where $\mathcal{D}$ represents a class of data structures on which queries in $\mathcal{Q}$ can be evaluated. The inputs to $U$ are a data structure in $\mathcal{D}$, which represents the current data structure $D^t$; a query $f$, which represents the distinguishing query, and may be restricted to a certain set $\mathcal{Q}$; and also a real number, which estimates $f(x)$. Formally, we define a database update sequence, to capture the sequence of inputs to $U$ used to generate the database sequence $D^1, D^2, \ldots$.

Definition 5.3 (Database Update Sequence). Let $x \in \mathbb{N}^{|\mathcal{X}|}$ be any database and let $\left\{(D^t, f_t, v_t)\right\}_{t=1,\ldots,L} \in (\mathcal{D} \times \mathcal{Q} \times \mathbb{R})^L$ be a sequence of tuples. We say the sequence is a $(U, x, \mathcal{Q}, \alpha, L)$-database update sequence if it satisfies the following properties:

1. $D^1 = U(\bot, \cdot, \cdot)$,
2. for every $t = 1, 2, \ldots, L$, $\left|f_t(x) - f_t(D^t)\right| \ge \alpha$,
3. for every $t = 1, 2, \ldots, L$, $\left|f_t(x) - v_t\right| < \alpha$,
4. and for every $t = 1, 2, \ldots, L - 1$, $D^{t+1} = U(D^t, f_t, v_t)$.

We note that for all of the database update algorithms we consider, the approximate answer $v_t$ is used only to determine the sign of $f_t(x) - f_t(D^t)$, which is the motivation for requiring that the estimate $v_t$ of $f_t(x)$ have error smaller than $\alpha$. The main measure of efficiency we are interested in from a database update algorithm is the maximum number of updates we need to perform before the database $D^t$ approximates $x$ well with respect to the queries in $\mathcal{Q}$. To this end we define a database update algorithm as follows:

Definition 5.4 (Database Update Algorithm). Let $U : \mathcal{D} \times \mathcal{Q} \times \mathbb{R} \to \mathcal{D}$ be an update rule and let $T : \mathbb{R} \to \mathbb{R}$ be a function. We say $U$ is a $T(\alpha)$-database update algorithm for query class $\mathcal{Q}$ if for every database $x \in \mathbb{N}^{|\mathcal{X}|}$, every $(U, x, \mathcal{Q}, \alpha, L)$-database update sequence satisfies $L \le T(\alpha)$.

Note that the definition of a $T(\alpha)$-database update algorithm implies that if $U$ is a $T(\alpha)$-database update algorithm, then given any maximal $(U, x, \mathcal{Q}, \alpha, L)$-database update sequence, the final database $D^L$ must satisfy $\max_{f \in \mathcal{Q}} \left|f(x) - f(D^L)\right| \le \alpha$, or else there would exist another query satisfying property 2 of Definition 5.3, and thus there would exist a $(U, x, \mathcal{Q}, \alpha, L + 1)$-database update sequence, contradicting maximality. That is, the goal of a $T(\alpha)$ database update rule is to generate a maximal database update sequence, and the final data structure in a maximal database update sequence necessarily encodes the approximate answers to every query $f \in \mathcal{Q}$.

Now that we have defined database update algorithms, we can remark that what we really proved in Theorem 4.10 was that the Multiplicative Weights algorithm is a $T(\alpha)$-database update algorithm for $T(\alpha) = 4 \log |\mathcal{X}| / \alpha^2$.

Before we go on, let us build some intuition for what a database update algorithm is. A $T(\alpha)$-database update algorithm begins with some initial guess $D^1$ about what the true database $x$ looks like. Because this guess is not based on any information, it is quite likely that $D^1$ and $x$ bear little resemblance, and that there is some $f \in \mathcal{Q}$ that is able to distinguish between these two databases by at least $\alpha$: that is, that $f(x)$ and $f(D^1)$ differ in value by at least $\alpha$. What a database update algorithm does is to update its hypothesis $D^t$ given evidence that its current hypothesis $D^{t-1}$ is incorrect: at each stage, it takes as input some query in $\mathcal{Q}$ which distinguishes its current hypothesis from the true database, and then it outputs a new hypothesis. The parameter $T(\alpha)$ is an upper bound on the number of times that the database update algorithm will have to update its hypothesis: it is a promise that after at most $T(\alpha)$ distinguishing queries have been provided, the algorithm will finally have produced a hypothesis that looks like the true database with respect to $\mathcal{Q}$, at least up to error $\alpha$.¹ For a database update algorithm, smaller bounds $T(\alpha)$ are more desirable.
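As an illustration of Definitions 5.3 and 5.4, the sketch below builds a maximal database update sequence non-privately, using a multiplicative weights style update as $U$. Several things here are assumptions rather than part of the text: the database is represented as a probability vector, queries are vectors in $[0,1]^{|\mathcal{X}|}$, the learning rate is taken to be $\alpha/2$, the estimate $v_t$ is the exact answer, the distinguishing query is found by brute force, and the names `mw_update` and `maximal_update_sequence` are hypothetical.

```python
import math


def dot(q, p):
    """Evaluate a linear query q on a normalized database p."""
    return sum(a * b for a, b in zip(q, p))


def mw_update(hyp, query, estimate, alpha):
    """One multiplicative weights step U(D^t, f_t, v_t) on a probability vector."""
    eta = alpha / 2.0
    if estimate < dot(query, hyp):
        penalty = query                      # hypothesis overestimates f: shrink where f is large
    else:
        penalty = [1.0 - q for q in query]   # hypothesis underestimates f: shrink where f is small
    new = [p * math.exp(-eta * r) for p, r in zip(hyp, penalty)]
    total = sum(new)
    return [p / total for p in new]


def maximal_update_sequence(x, queries, alpha):
    """Update until no query distinguishes the hypothesis from x by at least alpha."""
    hyp = [1.0 / len(x)] * len(x)            # D^1: the uniform guess
    steps = 0
    while True:
        worst = max(queries, key=lambda q: abs(dot(q, x) - dot(q, hyp)))
        if abs(dot(worst, x) - dot(worst, hyp)) < alpha:
            return hyp, steps                # the sequence is maximal
        hyp = mw_update(hyp, worst, dot(worst, x), alpha)
        steps += 1


x = [0.7, 0.1, 0.1, 0.1]
queries = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 1, 0, 0]]
hyp, steps = maximal_update_sequence(x, queries, alpha=0.1)
print(steps, "updates; bound T(alpha) =", 4 * math.log(len(x)) / 0.1 ** 2)
```

The number of updates actually performed is typically far below the worst-case bound $T(\alpha)$, which only promises that the sequence cannot be longer.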
¹ Imagine that the database update algorithm is attempting to sculpt $x$ out of a block of clay. Initially, its sculpture $D^1$ bears no resemblance to the true database: it is simply a block of clay. However, a helpful distinguisher points out to the sculptor places in which the clay juts out much farther than the true target database: the sculptor dutifully pats down those bumps. If the distinguisher always finds large protrusions, of magnitude at least $\alpha$, the sculpture will be finished soon, and the distinguisher's time will not be wasted!

Database Update Algorithms and Online Learning Algorithms: We remark that database update algorithms are essentially online learning algorithms in the mistake bound model. In the setting of online learning, unlabeled examples arrive in some arbitrary order, and the learning algorithm must attempt to label them.

Background from Learning Theory. In the mistake bound model of learning, labeled examples $(x_i, y_i) \in \mathcal{X} \times \{0, 1\}$ arrive one at a time, in a potentially adversarial order. At time $i$, the learning algorithm $A$ observes $x_i$, and must make a prediction $\hat{y}_i$ about the label for $x_i$. It then sees the true label $y_i$, and is said to make a mistake if its prediction was wrong: i.e., if $y_i \ne \hat{y}_i$. A learning algorithm $A$ for a class of functions $C$ is said to have a mistake bound of $M$, if for all $f \in C$, and for all adversarially selected sequences of examples $(x_1, f(x_1)), \ldots, (x_i, f(x_i)), \ldots$, $A$ never makes more than $M$ mistakes. Without loss of generality, we can think of such a learning algorithm as maintaining some hypothesis $\hat{f} : \mathcal{X} \to \{0, 1\}$ at all times, and updating it only when it makes a mistake. The adversary in this model is quite powerful: it can choose the sequence of labeled examples adaptively, knowing the current hypothesis of the learning algorithm, and its entire history of predictions. Hence, learning algorithms that have finite mistake bounds can be useful in extremely general settings.

It is not hard to see that mistake bounded online learning algorithms always exist for finite classes of functions $C$. Consider, for example, the halving algorithm. The halving algorithm initially maintains a set $S$ of functions from $C$ consistent with the examples that it has seen so far: initially $S = C$. Whenever a new unlabeled example arrives, it predicts according to the majority vote of its consistent hypotheses: that is, it predicts label 1 whenever $|\{f \in S : f(x_i) = 1\}| \ge |S|/2$. Whenever it makes a mistake on an example $x_i$, it updates $S$ by removing any inconsistent function: $S \leftarrow \{f \in S : f(x_i) = y_i\}$.
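The halving algorithm just described can be sketched as follows. This is a minimal illustration, assuming hypotheses are Python callables returning 0 or 1; the closure-based interface and the name `halving_learner` are choices made here, not something fixed by the text.

```python
def halving_learner(hypotheses):
    """Return (predict, update) closures over the current set S of hypotheses."""
    state = {"S": list(hypotheses), "mistakes": 0}

    def predict(example):
        # Majority vote: predict 1 whenever at least half of S labels the example 1.
        votes = sum(h(example) for h in state["S"])
        return 1 if 2 * votes >= len(state["S"]) else 0

    def update(example, label):
        # S changes only on a mistake; every function inconsistent with the label is removed.
        if predict(example) != label:
            state["mistakes"] += 1
            state["S"] = [h for h in state["S"] if h(example) == label]
        return state["mistakes"]

    return predict, update


# Usage: eight threshold functions on {0, ..., 7}; the target is the threshold at 5.
C = [lambda x, k=k: int(x >= k) for k in range(8)]
target = C[5]
predict, update = halving_learner(C)
for example in [3, 6, 0, 5, 4, 7, 1, 2]:
    mistakes = update(example, target(example))
print("mistakes:", mistakes)
```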
Note that whenever it makes a mistake, the size of $S$ is cut at least in half! So long as all examples are labeled by some function $f \in C$, there is at least one function $f \in C$ that is never removed from $S$. Hence, the halving algorithm has a mistake bound of $\log_2 |C|$.

Generalizing beyond boolean labels, we can view database update algorithms as online learning algorithms in the mistake bound model:

here, examples that arrive are the queries (which may come in adversarial order). The labels are the approximate values of the queries when evaluated on the database. The database update algorithm hypothesis $D^t$ makes a mistake on query $f$ if $|f(D^t) - f(x)| \ge \alpha$, in which case we learn the label of $f$ (that is, $v_t$) and allow the database update algorithm to update the hypothesis. Saying that an algorithm $U$ is a $T(\alpha)$-database update algorithm is akin to saying that it has a mistake bound of $T(\alpha)$: no adversarially chosen sequence of queries can ever cause it to make more than $T(\alpha)$ mistakes. Indeed, the database update algorithms that we will see are taken from the online learning literature. The multiplicative weights mechanism is based on an online learning algorithm known as Hedge, which we have already discussed. The Median Mechanism (later in this section) is based on the Halving Algorithm, and the Perceptron algorithm is based (coincidentally) on an algorithm known as Perceptron. We won't discuss Perceptron here, but it operates by making additive updates, rather than the multiplicative updates used by multiplicative weights.

A database update algorithm for a class $\mathcal{Q}$ will be useful together with a corresponding distinguisher, whose job is to output a function that behaves differently on the true database $x$ and the hypothesis $D^t$, that is, to point out a mistake.

Definition 5.5 ($(F(\varepsilon), \gamma)$-Private Distinguisher). Let $\mathcal{Q}$ be a set of queries, let $\gamma \ge 0$ and let $F(\varepsilon) : \mathbb{R} \to \mathbb{R}$ be a function. An algorithm $\mathrm{Distinguish}_\varepsilon : \mathbb{N}^{|\mathcal{X}|} \times \mathcal{D} \to \mathcal{Q}$ is an $(F(\varepsilon), \gamma)$-Private Distinguisher for $\mathcal{Q}$ if for every setting of the privacy parameter $\varepsilon$, on every pair of inputs $x \in \mathbb{N}^{|\mathcal{X}|}$, $D \in \mathcal{D}$ it is $(\varepsilon, 0)$-differentially private with respect to $x$ and it outputs an $f^* \in \mathcal{Q}$ such that $|f^*(x) - f^*(D)| \ge \max_{f \in \mathcal{Q}} |f(x) - f(D)| - F(\varepsilon)$ with probability at least $1 - \gamma$.

Remark 5.1. In machine learning, the goal is to find a function $f : \mathcal{X} \to \{0, 1\}$ from a class of functions $\mathcal{Q}$ that best labels a collection of labeled examples $(x_1, y_1), \ldots, (x_m, y_m) \in \mathcal{X} \times \{0, 1\}$. (Examples $(x, 0)$ are known as negative examples, and examples $(x, 1)$ are known as positive examples.) Each example $x_i$ has a true label $y_i$, and a function $f$ correctly labels $x_i$ if $f(x_i) = y_i$. An agnostic learning algorithm for a class $\mathcal{Q}$ is an algorithm that can find the function in $\mathcal{Q}$ that labels all of the data points approximately as well as the best function in $\mathcal{Q}$, even if no function in $\mathcal{Q}$ can perfectly label them. Note that equivalently, an agnostic learning algorithm is one that maximizes the number of positive examples labeled 1 minus the number of negative examples labeled 1. Phrased in this way, we can see that a distinguisher as defined above is just an agnostic learning algorithm: just imagine that $x$ contains all of the "positive" examples, and that $y$ contains all of the "negative" examples. (Note that it is fine if $x$ and $y$ are not disjoint: in the learning problem, the same example can occur with both a positive and a negative label, since agnostic learning does not require that any function perfectly label every example.) Finally, note also that for classes of linear queries $\mathcal{Q}$, a distinguisher is simply an optimization algorithm. Because for linear queries $f$, $f(x) - f(y) = f(x - y)$, a distinguisher simply seeks to find $\arg\max_{f \in \mathcal{Q}} |f(x - y)|$.

Note that, a priori, a differentially private distinguisher is a weaker object than a differentially private release algorithm: a distinguisher merely finds a query in a set $\mathcal{Q}$ with the approximately largest value, whereas a release algorithm must find the answer to every query in $\mathcal{Q}$. In the algorithm that follows, however, we reduce release to optimization.

We will first analyze the IC algorithm, and then instantiate it with a specific distinguisher and database update algorithm. What follows is a formal analysis, but the intuition for the mechanism is simple: we simply run the iterative database construction algorithm to construct a hypothesis that approximately matches $x$ with respect to the queries $\mathcal{Q}$. If at each round our distinguisher succeeds in finding a query that has high discrepancy between the hypothesis database and the true database, then our database update algorithm will output a database that is $\alpha$-accurate with respect to $\mathcal{Q}$. If the distinguisher ever fails to find such a query, then it must be that there are no such queries, and our database update algorithm has already learned an accurate hypothesis with respect to the queries of interest! This requires at most $T$ iterations, and so we access the data only $2T$ times using $(\varepsilon_0, 0)$-differentially private methods (running the given distinguisher, and then checking its answer with the Laplace mechanism). Privacy will therefore follow from our composition theorems.

Algorithm 8 The Iterative Construction (IC) Mechanism. It takes as input a parameter $\varepsilon_0$, an $(F(\varepsilon_0), \gamma)$-Private Distinguisher Distinguish for $\mathcal{Q}$, together with a $T(\alpha)$-iterative database update algorithm $U$ for $\mathcal{Q}$.
IC($x, \alpha, \varepsilon_0$, Distinguish, $U$):
  Let $D^0 = U(\bot, \cdot, \cdot)$.
  for $t = 1$ to $T(\alpha/2)$ do
    Let $f^{(t)} = \mathrm{Distinguish}(x, D^{t-1})$
    Let $\hat{v}^{(t)} = f^{(t)}(x) + \mathrm{Lap}\left(\frac{1}{\|x\|_1 \varepsilon_0}\right)$.
    if $|\hat{v}^{(t)} - f^{(t)}(D^{t-1})| < 3\alpha/4$ then
      Output $y = D^{t-1}$.
    else
      Let $D^t = U(D^{t-1}, f^{(t)}, \hat{v}^{(t)})$.
    end if
  end for
  Output $y = D^{T(\alpha/2)}$.

The analysis of this algorithm just involves checking the technical details of a simple intuition. Privacy will follow because the algorithm is just the composition of at most $2T(\alpha/2)$ steps, each of which is $(\varepsilon_0, 0)$-differentially private. Accuracy follows because we are always outputting the last database in a maximal database update sequence. If the algorithm has not yet formed a maximal Database Update Sequence, then the distinguishing algorithm will find a distinguishing query to add another step to the sequence.

Theorem 5.3. The IC algorithm is $(\varepsilon, 0)$-differentially private for $\varepsilon_0 \le \varepsilon / (2T(\alpha/2))$. The IC algorithm is $(\varepsilon, \delta)$-differentially private for

$$\varepsilon_0 \le \frac{\varepsilon}{4\sqrt{T(\alpha/2)\log(1/\delta)}}.$$

Proof. The algorithm runs at most $2T(\alpha/2)$ compositions of $\varepsilon_0$-differentially private algorithms. Recall from Theorem 3.20 that $\varepsilon_0$-differentially private algorithms are $2k\varepsilon_0$-differentially private under $2k$-fold composition, and are $(\varepsilon', \delta')$-private for $\varepsilon' = \sqrt{4k \ln(1/\delta')}\,\varepsilon_0 + 2k\varepsilon_0(e^{\varepsilon_0} - 1)$. Plugging in the stated values for $\varepsilon_0$ proves the claim.
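Algorithm 8 can be sketched end to end once a distinguisher and an update rule are fixed. The sketch below is only one possible instantiation under explicit assumptions: the exponential mechanism serves as the distinguisher, a multiplicative weights style rule serves as $U$ (so $T(\alpha) = 4\log|\mathcal{X}|/\alpha^2$), the database is a probability vector `x` with its size `n` passed separately, and queries are vectors in $[0,1]^{|\mathcal{X}|}$. All function names are hypothetical, and `eps0` is taken as given rather than derived from Theorem 5.3.

```python
import math
import random


def dot(q, p):
    return sum(a * b for a, b in zip(q, p))


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def exp_mech_distinguisher(x, hyp, queries, eps0, n, rng):
    """Exponential mechanism over Q with score |f(x) - f(D)| (sensitivity 1/n)."""
    scores = [abs(dot(q, x) - dot(q, hyp)) for q in queries]
    top = max(scores)
    weights = [math.exp(eps0 * n * (s - top) / 2.0) for s in scores]
    return rng.choices(queries, weights=weights, k=1)[0]


def mw_update(hyp, query, estimate, alpha):
    """Multiplicative weights step with learning rate alpha / 2 (an assumption)."""
    eta = alpha / 2.0
    over = estimate < dot(query, hyp)
    new = [p * math.exp(-eta * (q if over else 1.0 - q)) for p, q in zip(hyp, query)]
    total = sum(new)
    return [p / total for p in new]


def iterative_construction(x, n, queries, alpha, eps0, rng):
    half = alpha / 2.0
    max_rounds = math.ceil(4.0 * math.log(len(x)) / half ** 2)  # T(alpha/2) for MW
    hyp = [1.0 / len(x)] * len(x)                               # D^0
    for _ in range(max_rounds):
        f = exp_mech_distinguisher(x, hyp, queries, eps0, n, rng)
        v = dot(f, x) + laplace(1.0 / (n * eps0), rng)          # noisy answer v-hat
        if abs(v - dot(f, hyp)) < 0.75 * alpha:
            return hyp                  # no distinguishing query found: output D^{t-1}
        hyp = mw_update(hyp, f, v, half)
    return hyp


rng = random.Random(1)
x = [0.5, 0.2, 0.2, 0.1]
queries = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 1, 0]]
y = iterative_construction(x, 10 ** 6, queries, alpha=0.2, eps0=1.0, rng=rng)
print(max(abs(dot(q, x) - dot(q, y)) for q in queries))
```

The usage example takes a very large `n` so that the noise is negligible and the output is essentially determined by the update rule; for small databases the accuracy degrades as the theorems in this section quantify.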

Theorem 5.4. Given an $(F(\varepsilon), \gamma)$-private distinguisher, a parameter $\varepsilon_0$, and a $T(\alpha)$-Database Update Algorithm, with probability at least $1 - \beta$, the IC algorithm returns a database $y$ such that $\max_{f \in \mathcal{Q}} |f(x) - f(y)| \le \alpha$ for any $\alpha$ such that:

$$\alpha \ge \max\left[\frac{8 \log(2T(\alpha/2)/\beta)}{\varepsilon_0 \|x\|_1},\; 8F(\varepsilon_0)\right]$$

so long as $\gamma \le \beta / (2T(\alpha/2))$.

Proof. The analysis is straightforward. Recall that if $Y_i \sim \mathrm{Lap}(1/(\varepsilon \|x\|_1))$, we have: $\Pr[|Y_i| \ge t/(\varepsilon \|x\|_1)] = \exp(-t)$. By a union bound, if $Y_1, \ldots, Y_k \sim \mathrm{Lap}(1/(\varepsilon \|x\|_1))$, then $\Pr[\max_i |Y_i| \ge t/(\varepsilon \|x\|_1)] \le k \exp(-t)$. Therefore, because we make at most $T(\alpha/2)$ draws from $\mathrm{Lap}(1/(\varepsilon_0 \|x\|_1))$, except with probability at most $\beta/2$, for all $t$:

$$|\hat{v}^{(t)} - f^{(t)}(x)| \le \frac{1}{\varepsilon_0 \|x\|_1} \log\frac{2T(\alpha/2)}{\beta} \le \frac{\alpha}{8}.$$

Note that by assumption, $\gamma \le \beta / (2T(\alpha/2))$, so we also have that except with probability $\beta/2$:

$$|f^{(t)}(x) - f^{(t)}(D^{t-1})| \ge \max_{f \in \mathcal{Q}} |f(x) - f(D^{t-1})| - F(\varepsilon_0) \ge \max_{f \in \mathcal{Q}} |f(x) - f(D^{t-1})| - \frac{\alpha}{8}.$$

For the rest of the argument, we will condition on both of these events occurring, which is the case except with probability $\beta$.

There are two cases. Either a data structure $D' = D^{T(\alpha/2)}$ is output, or a data structure $D' = D^{t-1}$ for $t < T(\alpha/2)$ is output. First, suppose $D' = D^{T(\alpha/2)}$. Since for all $t < T(\alpha/2)$ it must have been the case that $|\hat{v}^{(t)} - f^{(t)}(D^{t-1})| \ge 3\alpha/4$ and by our conditioning, $|\hat{v}^{(t)} - f^{(t)}(x)| \le \frac{\alpha}{8}$, we know for all $t$: $|f^{(t)}(x) - f^{(t)}(D^{t-1})| \ge \alpha/2$. Therefore, the sequence $(D^t, f^{(t)}, \hat{v}^{(t)})$ formed a maximal $(U, x, \mathcal{Q}, \alpha/2, T(\alpha/2))$-Database Update Sequence (recall Definition 5.3), and we have that $\max_{f \in \mathcal{Q}} |f(x) - f(D')| \le \alpha/2$ as desired.

Next, suppose $D' = D^{t-1}$ for $t < T(\alpha/2)$. Then it must have been the case that for $t$, $|\hat{v}^{(t)} - f^{(t)}(D^{t-1})| < 3\alpha/4$. By our conditioning, in this case it must be that $|f^{(t)}(x) - f^{(t)}(D^{t-1})| < \frac{7\alpha}{8}$, and that therefore by the properties of an $(F(\varepsilon_0), \gamma)$-distinguisher:

$$\max_{f \in \mathcal{Q}} |f(x) - f(D')| < \frac{7\alpha}{8} + F(\varepsilon_0) \le \alpha$$

as desired.

Note that we can use the exponential mechanism as a private distinguisher: take the domain to be $\mathcal{Q}$, and let the quality score be $q(x, f) = |f(x) - f(D^t)|$, which has sensitivity $1/\|x\|_1$. Applying the exponential mechanism utility theorem, we get:

Theorem 5.5. The exponential mechanism is an $(F(\varepsilon), \gamma)$ distinguisher for:

$$F(\varepsilon) = \frac{2}{\|x\|_1 \varepsilon} \log\left(\frac{|\mathcal{Q}|}{\gamma}\right).$$

Therefore, using the exponential mechanism as a distinguisher, Theorem 5.4 gives:

Theorem 5.6. Given a $T(\alpha)$-Database Update Algorithm and a parameter $\varepsilon_0$ together with the exponential mechanism distinguisher, with probability at least $1 - \beta$, the IC algorithm returns a database $y$ such that $\max_{f \in \mathcal{Q}} |f(x) - f(y)| \le \alpha$ where:

$$\alpha \le \max\left[\frac{8 \log(2T(\alpha/2)/\beta)}{\varepsilon_0 \|x\|_1},\; \frac{16}{\|x\|_1 \varepsilon_0} \log\left(\frac{|\mathcal{Q}|}{\gamma}\right)\right]$$

so long as $\gamma \le \beta / (2T(\alpha/2))$.

Plugging in our values of $\varepsilon_0$:

Theorem 5.7. Given a $T(\alpha)$-Database Update Algorithm, together with the exponential mechanism distinguisher, the IC mechanism is $\varepsilon$-differentially private and with probability at least $1 - \beta$, the IC algorithm returns a database $y$ such that $\max_{f \in \mathcal{Q}} |f(x) - f(y)| \le \alpha$ where:

$$\alpha \le \frac{8\,T(\alpha/2)}{\varepsilon \|x\|_1} \log\left(\frac{|\mathcal{Q}|}{\gamma}\right)$$

and $(\varepsilon, \delta)$-differentially private for:

$$\alpha \le \frac{16\sqrt{T(\alpha/2)\log(1/\delta)}}{\varepsilon \|x\|_1} \log\left(\frac{|\mathcal{Q}|}{\gamma}\right)$$

so long as $\gamma \le \beta / (2T(\alpha/2))$.

Note that in the language of this section, what we proved in Theorem 4.10 was exactly that the multiplicative weights algorithm is a $T(\alpha)$-Database Update Algorithm for $T(\alpha) = \frac{4 \log |\mathcal{X}|}{\alpha^2}$. Plugging this bound into Theorem 5.7 recovers the bound we got for the online multiplicative weights algorithm. Note that now, however, we can plug in other database update algorithms as well.

5.2.1 Applications: other database update algorithms

Here we give several other database update algorithms. The first works directly from $\alpha$-nets, and therefore can get non-trivial bounds even for nonlinear queries (unlike multiplicative weights, which only works for linear queries). The second is another database update algorithm for linear queries, but with bounds incomparable to multiplicative weights. (In general, it will yield improved bounds when the dataset has size close to the size of the data universe, whereas multiplicative weights will give better bounds when the dataset is much smaller than the data universe.)

We first discuss the median mechanism, which takes advantage of $\alpha$-nets. The median mechanism does not operate on databases, but instead on median data structures:

Definition 5.6 (Median Data Structure). A median data structure $\mathbf{D}$ is a collection of databases: $\mathbf{D} \subset \mathbb{N}^{|\mathcal{X}|}$. Any query $f$ can be evaluated on a median data structure as follows: $f(\mathbf{D}) = \mathrm{Median}(\{f(x) : x \in \mathbf{D}\})$.

In words, a median data structure is just a set of databases. To evaluate a query on it, we just evaluate the query on every database in the set, and then return the median value. Note that the answers given by the median data structure need not be consistent with any database! However, it will have the useful property that whenever it makes an

error, it will rule out at least half of the data sets in its collection as being inconsistent with the true data set.

The median mechanism is then very simple:

Algorithm 9 The Median Mechanism (MM) Update Rule. It inputs and outputs a median data structure. It is instantiated with an $\alpha$-net $\mathcal{N}_\alpha(\mathcal{Q})$ for a query class $\mathcal{Q}$, and its initial state is $\mathbf{D} = \mathcal{N}_\alpha(\mathcal{Q})$.
$\mathrm{MM}_{\alpha, \mathcal{Q}}(\mathbf{D}^t, f_t, v_t)$:
  if $\mathbf{D}^t = \bot$ then
    Output $\mathbf{D}^0 \leftarrow \mathcal{N}_\alpha(\mathcal{Q})$.
  end if
  if $v_t < f_t(\mathbf{D}^t)$ then
    Output $\mathbf{D}^{t+1} \leftarrow \mathbf{D}^t \setminus \{x \in \mathbf{D}^t : f_t(x) \ge f_t(\mathbf{D}^t)\}$.
  else
    Output $\mathbf{D}^{t+1} \leftarrow \mathbf{D}^t \setminus \{x \in \mathbf{D}^t : f_t(x) \le f_t(\mathbf{D}^t)\}$.
  end if

The intuition for the median mechanism is as follows. It maintains a set of databases that are consistent with the answers to the distinguishing queries it has seen so far. Whenever it receives a query on which its own answer differs substantially from that of the real database, together with an estimate of the true answer, it updates itself to remove all of the databases that are inconsistent with the new information. Because it always chooses its answer as the median among the set of consistent databases it is maintaining, every update step removes at least half of the consistent databases! Moreover, because the set of databases that it chooses initially is an $\alpha$-net with respect to $\mathcal{Q}$, there is always some database that is never removed, because it remains consistent on all queries. This limits how many update rounds the mechanism can perform. How does the median mechanism do?

Theorem 5.8. For any class of queries $\mathcal{Q}$, the Median Mechanism is a $T(\alpha)$-database update algorithm for $T(\alpha) = \log |\mathcal{N}_\alpha(\mathcal{Q})|$.

Proof. We must show that any sequence $\{(\mathbf{D}^t, f_t, v_t)\}_{t=1,\ldots,L}$ with the property that $|f_t(\mathbf{D}^t) - f_t(x)| > \alpha$ and $|v_t - f_t(x)| < \alpha$ cannot have $L > \log |\mathcal{N}_\alpha(\mathcal{Q})|$. First observe that because $\mathbf{D}^0 = \mathcal{N}_\alpha(\mathcal{Q})$ is an $\alpha$-net

for $\mathcal{Q}$, by definition there is at least one $y$ such that $y \in \mathbf{D}^t$ for all $t$. (Recall that the update rule is only invoked on queries with error at least $\alpha$. Since there is guaranteed to be a database $y$ that has error less than $\alpha$ on all queries, it is never removed by an update step.) Thus, we can always answer queries with $\mathbf{D}^t$, and for all $t$, $|\mathbf{D}^t| \ge 1$. Next observe that for each $t$, $|\mathbf{D}^t| \le |\mathbf{D}^{t-1}|/2$. This is because each update step removes at least half of the elements: all of the elements at least as large as, or at most as large as, the median element in $\mathbf{D}^t$ with respect to query $f_t$. Therefore, after $L$ update steps, $|\mathbf{D}^L| \le 1/2^L \cdot |\mathcal{N}_\alpha(\mathcal{Q})|$. Setting $L > \log |\mathcal{N}_\alpha(\mathcal{Q})|$ gives $|\mathbf{D}^L| < 1$, a contradiction.

Remark 5.2. For classes of linear queries $\mathcal{Q}$, we may refer to the upper bound on $\mathcal{N}_\alpha(\mathcal{Q})$ given in Theorem 4.2 to see that the Median Mechanism is a $T(\alpha)$-database update algorithm for $T(\alpha) = \log |\mathcal{Q}| \log |\mathcal{X}| / \alpha^2$. This is worse than the bound we gave for the Multiplicative Weights algorithm by a factor of $\log |\mathcal{Q}|$. On the other hand, nothing about the Median Mechanism algorithm is specific to linear queries: it works just as well for any class of queries that admits a small net. We can take advantage of this fact for nonlinear low sensitivity queries.

Note that if we want a mechanism which promises $(\varepsilon, \delta)$-privacy for $\delta > 0$, we do not even need a particularly small net. In fact, the trivial net that simply includes every database of size $\|x\|_1$ will be sufficient:

Theorem 5.9. For every class of queries $\mathcal{Q}$ and every $\alpha \ge 0$, there is an $\alpha$-net for databases of size $\|x\|_1 = n$ of size $|\mathcal{N}_\alpha(\mathcal{Q})| \le |\mathcal{X}|^n$.

Proof. We can simply let $\mathcal{N}_\alpha(\mathcal{Q})$ be the set of all $|\mathcal{X}|^n$ databases $y$ of size $\|y\|_1 = n$. Then, for every $x$ such that $\|x\|_1 = n$, we have $x \in \mathcal{N}_\alpha(\mathcal{Q})$, and so clearly:

$$\min_{y \in \mathcal{N}_\alpha(\mathcal{Q})} \max_{f \in \mathcal{Q}} |f(x) - f(y)| = 0.$$

We can use this fact to get query release algorithms for arbitrary low sensitivity queries, not just linear queries. Applying Theorem 5.7 to the above bound, we find:

Theorem 5.10. Using the median mechanism, together with the exponential mechanism distinguisher, the IC mechanism is $(\varepsilon, \delta)$-differentially private and with probability at least $1 - \beta$, the IC algorithm returns a database $y$ such that $\max_{f \in \mathcal{Q}} |f(x) - f(y)| \le \alpha$ where:

$$\alpha \le \frac{16\sqrt{\log |\mathcal{X}| \log\frac{1}{\delta}}\,\log\left(\frac{2|\mathcal{Q}|\,n \log |\mathcal{X}|}{\beta}\right)}{\sqrt{n}\,\varepsilon},$$

where $\mathcal{Q}$ can be any family of sensitivity $1/n$ queries, not necessarily linear.

Proof. This follows simply by combining Theorems 5.8 and 5.9 to find that the Median Mechanism is a $T(\alpha)$-Database Update Algorithm for $T(\alpha) = n \log |\mathcal{X}|$ for databases of size $\|x\|_1 = n$ for every $\alpha > 0$ and every class of queries $\mathcal{Q}$. Plugging this into Theorem 5.7 gives the desired bound.

Note that this bound is almost as good as we were able to achieve for the special case of linear queries in Theorem 4.15! However, unlike in the case of linear queries, because arbitrary queries may not have $\alpha$-nets which are significantly smaller than the trivial net used here, we are not able to get nontrivial accuracy guarantees if we want $(\varepsilon, 0)$-differential privacy.

The next database update algorithm we present is again for linear queries, but achieves incomparable bounds to those of the multiplicative weights database update algorithm. It is based on the Perceptron algorithm from online learning (just as multiplicative weights is derived from the hedge algorithm from online learning). Since the algorithm is for linear queries, we treat each query $f_t \in \mathcal{Q}$ as being a vector $f_t \in [0, 1]^{|\mathcal{X}|}$. Note that rather than doing a multiplicative update, as in the MW database update algorithm, here we do an additive update.

Algorithm 10 The Perceptron update rule
$\mathrm{Perceptron}_{\alpha, \mathcal{Q}}(x^t, f_t, v_t)$:
  If: $x^t = \bot$ then: output $x^{t+1} = 0^{|\mathcal{X}|}$
  Else if: $f_t(x^t) > v_t$ then: output $x^{t+1} = x^t - \frac{\alpha}{|\mathcal{X}|} \cdot f_t$
  Else if: $f_t(x^t) \le v_t$ then: output $x^{t+1} = x^t + \frac{\alpha}{|\mathcal{X}|} \cdot f_t$

In the analysis, we will see that this database update algorithm has an exponentially worse dependence (as compared to multiplicative weights) on the size of the universe, but a superior dependence on the size of the database. Thus, it will achieve better performance for databases that are large compared to the size of the data universe, and worse performance for databases that are small compared to the size of the data universe.

Theorem 5.11. Perceptron is a $T(\alpha)$-database update algorithm for:

$$T(\alpha) = \left(\frac{\|x\|_2}{\|x\|_1}\right)^2 \cdot \frac{|\mathcal{X}|}{\alpha^2}.$$

Proof. Unlike for multiplicative weights, it will be more convenient to analyze the Perceptron algorithm without normalizing the database to be a probability distribution, and then prove that it is a $T(\alpha')$ database update algorithm for $T(\alpha') = \frac{\|x\|_2^2 |\mathcal{X}|}{\alpha'^2}$. Plugging in $\alpha' = \alpha \|x\|_1$ will then complete the proof. Recall that since each query $f_t$ is linear, we can view $f_t \in [0, 1]^{|\mathcal{X}|}$ as a vector with the evaluation of $f_t(x)$ being equal to $\langle f_t, x \rangle$.

We must show that any sequence $\{(x^t, f_t, v_t)\}_{t=1,\ldots,L}$ with the property that $|f_t(x^t) - f_t(x)| > \alpha'$ and $|v_t - f_t(x)| < \alpha'$ cannot have $L > \frac{\|x\|_2^2 |\mathcal{X}|}{\alpha'^2}$.

We use a potential argument to show that for every $t = 1, 2, \ldots, L$, $x^{t+1}$ is significantly closer to $x$ than $x^t$. Specifically, our potential function is the squared $L_2$ norm of the database $x - x^t$, defined as

$$\|x\|_2^2 = \sum_{i \in \mathcal{X}} x(i)^2.$$

Observe that $\|x - x^1\|_2^2 = \|x\|_2^2$ since $x^1 = 0$, and $\|x\|_2^2 \ge 0$. Thus it suffices to show that in every step, the potential decreases by $\alpha'^2 / |\mathcal{X}|$. We analyze the case where $f_t(x^t) > v_t$; the analysis for the opposite case will be similar. Let $R^t = x^t - x$. Observe that in this case we have $f_t(R^t) = f_t(x^t) - f_t(x) \ge \alpha'$.

Now we can analyze the drop in potential.

$$
\begin{aligned}
\|R^t\|_2^2 - \|R^{t+1}\|_2^2
&= \|R^t\|_2^2 - \left\|R^t - (\alpha'/|\mathcal{X}|) \cdot f_t\right\|_2^2 \\
&= \sum_{i \in \mathcal{X}} \left( \left(R^t(i)\right)^2 - \left(R^t(i) - (\alpha'/|\mathcal{X}|) \cdot f_t(i)\right)^2 \right) \\
&= \sum_{i \in \mathcal{X}} \left( \frac{2\alpha'}{|\mathcal{X}|} \cdot R^t(i) f_t(i) - \frac{\alpha'^2}{|\mathcal{X}|^2} f_t(i)^2 \right) \\
&= \frac{2\alpha'}{|\mathcal{X}|} f_t(R^t) - \frac{\alpha'^2}{|\mathcal{X}|^2} \sum_{i \in \mathcal{X}} f_t(i)^2 \\
&\ge \frac{2\alpha'}{|\mathcal{X}|} f_t(R^t) - \frac{\alpha'^2}{|\mathcal{X}|^2} |\mathcal{X}| \\
&\ge \frac{2\alpha'^2}{|\mathcal{X}|} - \frac{\alpha'^2}{|\mathcal{X}|} = \frac{\alpha'^2}{|\mathcal{X}|}.
\end{aligned}
$$

This bounds the number of steps by $\|x\|_2^2 |\mathcal{X}| / \alpha'^2$, and completes the proof.

We may now plug this bound into Theorem 5.7 to obtain the following bound on the iterative construction mechanism:

Theorem 5.12. Using the perceptron database update algorithm, together with the exponential mechanism distinguisher, the IC mechanism is $(\varepsilon, \delta)$-differentially private and with probability at least $1 - \beta$, the IC algorithm returns a database $y$ such that $\max_{f \in \mathcal{Q}} |f(x) - f(y)| \le \alpha$ where:

$$\alpha \le \frac{4\sqrt{2}\,\sqrt{\|x\|_2}\,\left(4|\mathcal{X}| \ln(1/\delta)\right)^{1/4} \sqrt{\log\left(\frac{2|\mathcal{Q}||\mathcal{X}| \cdot \|x\|_2^2}{\beta}\right)}}{\sqrt{\varepsilon}\,\|x\|_1},$$

where $\mathcal{Q}$ is a class of linear queries.

If the database $x$ represents the edge set of a graph, for example, we will have $x_i \in [0, 1]$ for all $i$, and so:

$$\frac{\sqrt{\|x\|_2}}{\|x\|_1} \le \left(\frac{1}{\|x\|_1}\right)^{3/4}.$$

Therefore, the perceptron database update algorithm will outperform the multiplicative weights database update algorithm on dense graphs.
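The Perceptron update rule and the step bound from the potential argument can be checked on a small example. The sketch below follows the unnormalized setting used in the proof, with $f(x) = \langle f, x \rangle$ and accuracy parameter $\alpha'$; it assumes exact estimates $v_t$, finds distinguishing queries by brute force, and uses `None` to stand for $\bot$. The function names are hypothetical.

```python
def dot(q, h):
    return sum(a * b for a, b in zip(q, h))


def perceptron_update(hyp, query, estimate, alpha):
    """Additive update on an unnormalized histogram; `hyp is None` plays the role of the empty input."""
    if hyp is None:
        return [0.0] * len(query)
    step = alpha / len(query)
    sign = -1.0 if dot(query, hyp) > estimate else 1.0
    return [h + sign * step * q for h, q in zip(hyp, query)]


def run_to_completion(x, queries, alpha):
    """Apply updates until every query is answered to within alpha; return the step count too."""
    hyp = perceptron_update(None, queries[0], 0.0, alpha)
    steps = 0
    while True:
        worst = max(queries, key=lambda q: abs(dot(q, x) - dot(q, hyp)))
        if abs(dot(worst, x) - dot(worst, hyp)) < alpha:
            return hyp, steps
        hyp = perceptron_update(hyp, worst, dot(worst, x), alpha)
        steps += 1


x = [3.0, 1.0, 0.0, 2.0]
queries = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 1, 0, 0]]
alpha_prime = 0.5
hyp, steps = run_to_completion(x, queries, alpha_prime)
bound = sum(v * v for v in x) * len(x) / alpha_prime ** 2
print(steps, "updates; bound", bound)
```

Each update moves the hypothesis by $\alpha'/|\mathcal{X}|$ along the query vector, which is why the step count scales with $|\mathcal{X}|$ rather than $\log |\mathcal{X}|$.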

5.2.2 Iterative construction mechanisms and online algorithms

In this section, we generalize the iterative construction framework to the online setting by using the NumericSparse algorithm. The online multiplicative weights algorithm which we saw in the last chapter is an instantiation of this approach. One way of viewing the online algorithm is that the NumericSparse algorithm is serving as the private distinguisher in the IC framework, but that the "hard work" of distinguishing is being foisted upon the unsuspecting user. That is: if the user asks a query that does not serve as a good distinguishing query, this is a good case. We cannot use the database update algorithm to update our hypothesis, but we don't need to! By definition, the current hypothesis is a good approximation to the private database with respect to this query. On the other hand, if the user asks a query for which our current hypothesis is not a good approximation to the true database, then by definition the user has found a good distinguishing query, and we are again in a good case: we can run the database update algorithm to update our hypothesis!

The idea of this algorithm is very simple. We will use a database update algorithm to publicly maintain a hypothesis database. Every time a query arrives, we will classify it as either a hard query, or an easy query. An easy query is one for which the answer given by the hypothesis database is approximately correct, and no update step is needed: if we know that a given query is easy, we can simply compute its answer on the publicly known hypothesis database rather than on the private database, and incur no privacy loss. If we know that a query is hard, we can compute and release its answer using the Laplace mechanism, and update our hypothesis using the database update algorithm. This way, our total privacy loss is not proportional to the number of queries asked, but instead proportional to the number of hard queries asked. Because the database update algorithm guarantees that there will not need to be many update steps, we can be guaranteed that the total privacy loss will be small.

Theorem 5.13. OnlineIC is $(\varepsilon, \delta)$-differentially private.

Algorithm 11 The Online Iterative Construction Mechanism, parameterized by a $T(\alpha)$-database update algorithm $U$. It takes as input a private database $x$, privacy parameters $\varepsilon, \delta$, accuracy parameters $\alpha$ and $\beta$, and a stream of queries $\{f_i\}$ that may be chosen adaptively from a class of queries $\mathcal{Q}$. It outputs a stream of answers $\{a_i\}$.

OnlineIC$_U(x, \{f_i\}, \varepsilon, \delta, \alpha, \beta)$
  Let $c \leftarrow T(\alpha)$
  if $\delta = 0$ then
    Let $T \leftarrow \dfrac{18c\,(\log(2|\mathcal{Q}|) + \log(4c/\beta))}{\varepsilon\|x\|_1}$
  else
    Let $T \leftarrow \dfrac{(2 + 32\sqrt{2})\sqrt{c \log\frac{2}{\delta}}\,\left(\log k + \log\frac{4c}{\beta}\right)}{\varepsilon\|x\|_1}$
  end if
  Initialize NumericSparse$(x, \{f'_i\}, T, c, \varepsilon, \delta)$ with a stream of queries $\{f'_i\}$, outputting a stream of answers $a'_i$.
  Let $t \leftarrow 0$, and let $D^0 \in \mathbb{R}^{|\mathcal{X}|}$ be such that $D^0_i = 1/|\mathcal{X}|$ for all $i \in [|\mathcal{X}|]$.
  for each query $f_i$ do
    Let $f'_{2i-1}(\cdot) = f_i(\cdot) - f_i(D^t)$.
    Let $f'_{2i}(\cdot) = f_i(D^t) - f_i(\cdot)$.
    if $a'_{2i-1} = \bot$ and $a'_{2i} = \bot$ then
      Let $a_i = f_i(D^t)$
    else
      if $a'_{2i-1} \in \mathbb{R}$ then
        Let $a_i = f_i(D^t) + a'_{2i-1}$
      else
        Let $a_i = f_i(D^t) - a'_{2i}$
      end if
      Let $D^{t+1} = U(D^t, f_i, a_i)$
      Let $t \leftarrow t + 1$.
    end if
  end for

Proof. This follows directly from the privacy analysis of NumericSparse, because the OnlineIC algorithm accesses the database only through NumericSparse.
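The control flow of Algorithm 11 can be sketched as follows. This is a minimal sketch under stated assumptions, not the algorithm's reference implementation: `sparse_answer` is a hypothetical stand-in for a single NumericSparse response (returning `None` for $\bot$ or a real value), `update` is any database update rule, and the toy oracle below is noise-free. The sketch offers no privacy guarantee on its own, since all privacy comes from how the oracle is implemented.

```python
from typing import Callable, List, Optional, Sequence

import numpy as np

# Hypothetical oracle type: takes a query on the private database and returns
# None (standing for "bottom") or a real-valued answer.
SparseOracle = Callable[[Callable[[np.ndarray], float]], Optional[float]]
UpdateRule = Callable[[np.ndarray, np.ndarray, float], np.ndarray]


def online_ic(universe_size: int, queries: Sequence[np.ndarray],
              sparse_answer: SparseOracle, update: UpdateRule) -> List[float]:
    hypothesis = np.full(universe_size, 1.0 / universe_size)  # D^0, uniform
    answers = []
    for f in queries:
        guess = float(f @ hypothesis)  # f_i(D^t), uses public data only
        # f'_{2i-1}(.) = f_i(.) - f_i(D^t)  and  f'_{2i}(.) = f_i(D^t) - f_i(.)
        over = sparse_answer(lambda db, f=f, g=guess: float(f @ db) - g)
        under = sparse_answer(lambda db, f=f, g=guess: g - float(f @ db))
        if over is None and under is None:
            answers.append(guess)  # easy query: answer from the hypothesis
            continue
        a = guess + over if over is not None else guess - under
        answers.append(a)
        hypothesis = update(hypothesis, f, a)  # hard query: D^{t+1} = U(D^t, f_i, a_i)
    return answers


# Toy usage with a noise-free oracle (illustration only; a private oracle
# would add noise and cap the number of non-bottom answers).
x = np.array([0.7, 0.1, 0.1, 0.1])
THRESHOLD = 0.2


def exact_oracle(g):
    value = g(x)
    return value if value >= THRESHOLD else None


def mw_update(d, f, a, eta=0.5):
    # Multiplicative-weights style step, used here only as an example rule.
    sign = -1.0 if a < float(f @ d) else 1.0
    w = d * np.exp(sign * eta * f)
    return w / w.sum()


qs = [np.array([1.0, 0.0, 0.0, 0.0]),
      np.array([0.0, 1.0, 1.0, 0.0]),
      np.array([1.0, 0.0, 0.0, 0.0])]
out = online_ic(4, qs, exact_oracle, mw_update)
```

With an exact oracle, every easy answer is within the threshold of the truth and every hard answer is exact, which mirrors the two cases in the accuracy analysis.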

Theorem 5.14. For $\delta = 0$, with probability at least $1 - \beta$, for all queries $f_i$, OnlineIC returns an answer $a_i$ such that $|f_i(x) - a_i| \le 3\alpha$ for any $\alpha$ such that:

$$\alpha \ge \frac{9\,T(\alpha)\,(\log(2|\mathcal{Q}|) + \log(4T(\alpha)/\beta))}{\varepsilon\|x\|_1}.$$

Proof. Recall that by Theorem 3.28, given $k$ queries and a maximum number of above-threshold queries of $c$, Sparse Vector is $(\alpha, \beta)$-accurate for:

$$\alpha = \frac{9c\,(\log k + \log(4c/\beta))}{\varepsilon\|x\|_1}.$$

Here, we have $c = T(\alpha)$ and $k = 2|\mathcal{Q}|$. Note that we have set the threshold $T = 2\alpha$ in the algorithm. First let us assume that the sparse vector algorithm does not halt prematurely. In this case, by the utility theorem, except with probability at most $\beta$, we have for all $i$ such that $a_i = f_i(D^t)$: $|f_i(x) - f_i(D^t)| \le T + \alpha = 3\alpha$, as we wanted. Additionally, for all $i$ such that $a_i$ is computed from $a'_{2i-1}$ or $a'_{2i}$, we have $|f_i(x) - a_i| \le \alpha$.

Note that we also have for all $i$ such that $a_i$ is computed from $a'_{2i-1}$ or $a'_{2i}$: $|f_i(x) - f_i(D^t)| \ge T - \alpha = \alpha$, since $T = 2\alpha$. Therefore, $f_i, a_i$ form a valid step in a database update sequence. Therefore, there can be at most $c = T(\alpha)$ such update steps, and so the Sparse Vector algorithm does not halt prematurely.

Similarly, we can prove a corresponding bound for $(\varepsilon, \delta)$-privacy.

Theorem 5.15. For $\delta > 0$, with probability at least $1 - \beta$, for all queries $f_i$, OnlineIC returns an answer $a_i$ such that $|f_i(x) - a_i| \le 3\alpha$ for any $\alpha$ such that:

$$\alpha \ge \frac{(\sqrt{512} + 1)\left(\ln(2|\mathcal{Q}|) + \ln\frac{4T(\alpha)}{\beta}\right)\sqrt{T(\alpha)\ln\frac{2}{\delta}}}{\varepsilon\|x\|_1}.$$

We can recover the bounds we proved for online multiplicative weights by recalling that the MW database update algorithm is a $T(\alpha)$-database update algorithm for $T(\alpha) = \frac{4\log|\mathcal{X}|}{\alpha^2}$. More generally, we have that any algorithm in the iterative construction framework can be converted into an algorithm which works in the interactive setting without loss in accuracy (i.e., we could equally well plug in

the median mechanism database update algorithm or the Perceptron database update algorithm, or any other). Tantalizingly, this means that (at least in the iterative construction framework), there is no gap in the accuracy achievable in the online vs. the offline query release models, despite the fact that the online model seems like it should be more difficult.

5.3 Connections

5.3.1 Iterative construction mechanism and $\alpha$-nets

The Iterative Construction mechanism is implemented differently than the Net mechanism, but at its heart, its analysis is still based on the existence of small $\alpha$-nets for the queries $\mathcal{Q}$. This connection is explicit for the median mechanism, which is parameterized by a net, but it holds for all database update algorithms. Note that the database output by the iterative database construction algorithm is entirely determined by the at most $T$ functions $f_1, \ldots, f_T \in \mathcal{Q}$ fed into it, as selected by the distinguisher while the algorithm is running. Each of these functions can be indexed by at most $\log|\mathcal{Q}|$ bits, and so every database output by the mechanism can be described using only $T\log|\mathcal{Q}|$ bits. In other words, the IC algorithm itself describes an $\alpha$-net for $\mathcal{Q}$ of size at most $N_\alpha(\mathcal{Q}) \le |\mathcal{Q}|^T$. To obtain error $\alpha$ using the Multiplicative Weights algorithm as an iterative database constructor, it suffices by Theorem 4.10 to take $T = 4\log|\mathcal{X}|/\alpha^2$, which gives us $N_\alpha(\mathcal{Q}) \le |\mathcal{Q}|^{4\log|\mathcal{X}|/\alpha^2} = |\mathcal{X}|^{4\log|\mathcal{Q}|/\alpha^2}$. Note that up to the factor of 4 in the exponent, this is exactly the bound we gave using a different $\alpha$-net in Theorem 4.2! There, we constructed an $\alpha$-net by considering all collections of $\log|\mathcal{Q}|/\alpha^2$ data points, each of which could be indexed by $\log|\mathcal{X}|$ bits. Here, we considered all collections of $\log|\mathcal{X}|/\alpha^2$ functions in $\mathcal{Q}$, each of which could be indexed by $\log|\mathcal{Q}|$ bits. Both ways, we got $\alpha$-nets of the same size!
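The equality between the two expressions for the net size is a change of base, spelled out here for convenience. The base $b$ of the logarithm is a symbol introduced for this illustration; the identity holds for any fixed base.

```latex
\[
|\mathcal{Q}|^{4\log_b|\mathcal{X}|/\alpha^2}
  = b^{\,4\log_b|\mathcal{Q}|\cdot\log_b|\mathcal{X}|/\alpha^2}
  = |\mathcal{X}|^{4\log_b|\mathcal{Q}|/\alpha^2}.
\]
```

The middle expression is symmetric in $|\mathcal{Q}|$ and $|\mathcal{X}|$, which is one way to see why a net built from queries and a net built from data points come out the same size.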
Indeed, we could just as well run the Net mechanism using the $\alpha$-net defined by the IC mechanism, to obtain the same utility bounds. In some sense, one net is the “dual” of the other: one is constructed of databases, the other is constructed of queries, yet both nets are of the same size. We will see the same phenomenon in the

“boosting for queries” algorithm in the next section: it too answers a large number of linear queries using a data structure that is entirely determined by a small “net” of queries.

5.3.2 Agnostic learning

One way of viewing what the IC mechanism is doing is that it is reducing the seemingly (information theoretically) more difficult problem of query release to the easier problem of query distinguishing or learning. Recall that the distinguishing problem is to find the query $f \in \mathcal{Q}$ which varies the most between two databases $x$ and $y$. Recall that in learning, the learner is given a collection of labeled examples $(x_1, y_1), \ldots, (x_m, y_m) \in \mathcal{X} \times \{0, 1\}$, where $y_i \in \{0, 1\}$ is the label of $x_i$. If we view $x$ as representing the positive examples in some large data set, and $y$ as representing the negative examples in the same data set, then we can see that the problem of distinguishing is exactly the problem of agnostic learning. That is, a distinguisher finds the query that best labels the positive examples, even when there is no query in the class that is guaranteed to perfectly label them. (Note that in this setting, the same example can appear with both a positive and a negative label, so the reduction still makes sense even when $x$ and $y$ are not disjoint.)

Intuitively, learning should be an information-theoretically easier problem than query release. The query release problem requires that we release the approximate value of every query $f$ in some class $\mathcal{Q}$, evaluated on the database. In contrast, the agnostic learning problem asks only that we return the evaluation and identity of a single query: the query that best labels the dataset. It is clear that information theoretically, the learning problem is no harder than the query release problem. If we can solve the query release problem on databases $x$ and $y$, then we can solve the distinguishing problem without any further access to the true private dataset, merely by checking the approximate evaluations of every query $f \in \mathcal{Q}$ on $x$ and $y$ that are made available to us with our query release algorithm. What we have shown in this section is that the reverse is true as well: given access to a private distinguishing or agnostic learning algorithm, we can solve the query release problem by making a small (i.e., only $\log|\mathcal{X}|/\alpha^2$) number of calls to the

private distinguishing algorithm, with no further access to the private dataset.

What are the implications of this? It tells us that up to small factors, the information complexity of agnostic learning is equal to the information complexity of query release. Computationally, the reduction is only as efficient as our database update algorithm, which, depending on our setting and algorithm, may or may not be efficient. But it tells us that any sort of information theoretic bound we may prove for the one problem can be ported over to the other problem, and vice versa. For example, most of the algorithms that we have seen (and most of the algorithms that we know about!) ultimately access the dataset by making linear queries via the Laplace mechanism. It turns out that any such algorithm can be seen as operating within the so-called statistical query model of data access, defined by Kearns in the context of machine learning. But agnostic learning is very hard in the statistical query model: even ignoring computational considerations, there is no algorithm which can make only a polynomial number of queries to the dataset and agnostically learn conjunctions to subconstant error. For query release this means that, in the statistical query model, there is no algorithm for releasing conjunctions (i.e., contingency tables) that runs in time polynomial in $1/\alpha$, where $\alpha$ is the desired accuracy level. If there is a privacy preserving query release algorithm with this run-time guarantee, it must operate outside of the SQ model, and therefore look very different from the currently known algorithms.

Because privacy guarantees compose linearly, this also tells us that (up to the possible factor of $\log|\mathcal{X}|/\alpha^2$) we should not expect to be able to privately learn to significantly higher accuracy than we can privately perform query release, and vice versa: an accurate algorithm for the one problem automatically gives us an accurate algorithm for the other.

5.3.3 A game theoretic view of query release

In this section, we take a brief sojourn into game theory to interpret some of the query release algorithms we have seen (and will see). Let us consider an interaction between two adversarial players, Alice and Bob.

Alice has some set of actions she might take, $A$, and Bob has a set of actions $B$. The game is played as follows: simultaneously, Alice picks some action $a \in A$ (possibly at random), and Bob picks some action $b \in B$ (possibly at random). Alice experiences a cost $c(a, b) \in [-1, 1]$. Alice wishes to play so as to minimize this cost, and since he is adversarial, Bob wishes to play so as to maximize this cost. This is what is called a zero sum game.

So how should Alice play? First, we consider an easier question. Suppose we handicap Alice and require that she announce her randomized strategy to Bob before she plays it, and allow Bob to respond optimally using this information. If Alice announces that she will draw some action $a \in A$ according to a probability distribution $\mathcal{D}_A$, then Bob will respond optimally so as to maximize Alice’s expected cost. That is, Bob will play:

$$b^* = \arg\max_{b \in B} \mathbb{E}_{a \sim \mathcal{D}_A}[c(a, b)].$$

Hence, once Alice announces her strategy, she knows what her cost will be, since Bob will be able to respond optimally. Therefore, Alice will wish to play a distribution over actions which minimizes her cost once Bob responds. That is, Alice will wish to play the distribution $\mathcal{D}_A$ defined as:

$$\mathcal{D}_A = \arg\min_{\mathcal{D} \in \Delta A} \max_{b \in B} \mathbb{E}_{a \sim \mathcal{D}}[c(a, b)].$$

If she plays $\mathcal{D}_A$ (and Bob responds optimally), Alice will experience the lowest possible cost that she can guarantee, with the handicap that she must announce her strategy ahead of time. Such a strategy for Alice is called a min-max strategy. Let us call the cost that Alice achieves when playing a min-max strategy Alice’s value for the game, denoted $v^A$:

$$v^A = \min_{\mathcal{D} \in \Delta A} \max_{b \in B} \mathbb{E}_{a \sim \mathcal{D}}[c(a, b)].$$

We can similarly ask what Bob should play if we instead place him at the disadvantage and force him to announce his strategy first to Alice. If he does this, he will play the distribution $\mathcal{D}_B$ over actions $b \in B$ that maximizes Alice’s expected cost when Alice responds optimally. We call such a strategy $\mathcal{D}_B$ for Bob a max-min strategy. We can define

Bob’s value for the game, $v^B$, as the maximum cost he can ensure by any strategy he might announce:

$$v^B = \max_{\mathcal{D} \in \Delta B} \min_{a \in A} \mathbb{E}_{b \sim \mathcal{D}}[c(a, b)].$$

Clearly, $v^B \le v^A$, since announcing one’s strategy is only a handicap.

One of the foundational results of game theory is von Neumann’s min-max Theorem, which states that in any zero sum game, $v^A = v^B$.² In other words, there is no disadvantage to “going first” in a zero sum game, and if players play optimally, we can predict exactly Alice’s cost: it will be $v^A = v^B \equiv v$, which we refer to as the value of the game.

² Von Neumann is quoted as saying “As far as I can see, there could be no theory of games ... without that theorem ... I thought there was nothing worth publishing until the Minimax Theorem was proved” [10].

Definition 5.7. In a zero sum game defined by action sets $A$, $B$ and a cost function $c : A \times B \to [-1, 1]$, let $v$ be the value of the game. An $\alpha$-approximate min-max strategy is a distribution $\mathcal{D}_A$ such that:

$$\max_{b \in B} \mathbb{E}_{a \sim \mathcal{D}_A}[c(a, b)] \le v + \alpha.$$

Similarly, an $\alpha$-approximate max-min strategy is a distribution $\mathcal{D}_B$ such that:

$$\min_{a \in A} \mathbb{E}_{b \sim \mathcal{D}_B}[c(a, b)] \ge v - \alpha.$$

If $\mathcal{D}_A$ and $\mathcal{D}_B$ are both $\alpha$-approximate min-max and max-min strategies respectively, then we say that the pair $(\mathcal{D}_A, \mathcal{D}_B)$ is an $\alpha$-approximate Nash equilibrium of the zero sum game.

So how does this relate to query release? Consider a particular zero sum game tailored to the problem of releasing a set of linear queries $\mathcal{Q}$ over a data universe $\mathcal{X}$. First, assume without loss of generality that for every $f \in \mathcal{Q}$, there is a query $\hat{f} \in \mathcal{Q}$ such that $\hat{f} = 1 - f$ (i.e., for each $\chi \in \mathcal{X}$, $\hat{f}(\chi) = 1 - f(\chi)$). Define Alice’s action set to be $A = \mathcal{X}$ and define Bob’s action set to be $B = \mathcal{Q}$. We will refer to Alice as the database player, and to Bob as the query player. Finally, fixing a true private database $x$ normalized to be a probability distribution (i.e., $\|x\|_1 = 1$), define the cost function $c : A \times B \to [-1, 1]$

to be: $c(\chi, f) = f(\chi) - f(x)$. Let us call this game the “Query Release Game.”

We begin with a simple observation:

Proposition 5.16. The value of the query release game is $v = 0$.

Proof. We first show that $v^A = v \le 0$. Consider what happens if we let the database player’s strategy correspond to the true database: $\mathcal{D}_A = x$. Then we have:

$$
\begin{aligned}
v^A &\le \max_{f \in B} \mathbb{E}_{\chi \sim \mathcal{D}_A}[c(\chi, f)] \\
&= \max_{f \in B} \sum_{i=1}^{|\mathcal{X}|} f(\chi_i)\cdot x_i - f(x) \\
&= f(x) - f(x) \\
&= 0.
\end{aligned}
$$

Next we observe that $v = v^B \ge 0$. For point of contradiction, assume that $v < 0$. In other words, that there exists a distribution $\mathcal{D}_A$ such that for all $f \in \mathcal{Q}$:

$$\mathbb{E}_{\chi \sim \mathcal{D}_A}\, c(\chi, f) < 0.$$

Here, we simply note that by definition, if $\mathbb{E}_{\chi \sim \mathcal{D}_A}\, c(\chi, f) = c < 0$ then $\mathbb{E}_{\chi \sim \mathcal{D}_A}\, c(\chi, \hat{f}) = -c > 0$, which is a contradiction since $\hat{f} \in \mathcal{Q}$.

What we have established implies that for any distribution $\mathcal{D}_A$ that is an $\alpha$-approximate min-max strategy for the database player, we have that for all queries $f \in \mathcal{Q}$: $|\mathbb{E}_{\chi \sim \mathcal{D}_A} f(\chi) - f(x)| \le \alpha$. In other words, the distribution $\mathcal{D}_A$ can be viewed as a synthetic database that answers every query in $\mathcal{Q}$ with $\alpha$-accuracy.

How about for nonlinear queries? We can repeat the same argument above if we change the query release game slightly. Rather than letting the database player have strategies corresponding to universe elements $\chi \in \mathcal{X}$, we let the database player have strategies corresponding to databases themselves! Then, $c(y, f) = |f(x) - f(y)|$. It’s not hard to see that this game still has value 0 and that $\alpha$-approximate min-max strategies correspond to synthetic data which give $\alpha$-accurate answers to queries in $\mathcal{Q}$.
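The two facts used in the proof of Proposition 5.16 can be checked numerically on a toy instance. The sketch below is illustrative only: the helper names and the small universe are assumptions, databases are probability vectors, and a linear query is a 0/1 vector over the universe.

```python
import numpy as np


def close_under_complement(queries):
    """Add the complement 1 - f of every query f, as assumed in the text."""
    return [q for f in queries for q in (f, 1.0 - f)]


def max_expected_cost(d, queries, x):
    """max_f E_{chi ~ d}[c(chi, f)] with c(chi, f) = f(chi) - f(x)."""
    return max(float(f @ d - f @ x) for f in queries)


x = np.array([0.5, 0.25, 0.25, 0.0])           # true database, ||x||_1 = 1
Q = close_under_complement([
    np.array([1.0, 0.0, 0.0, 0.0]),
    np.array([1.0, 1.0, 0.0, 0.0]),
])

# Playing the true database guarantees expected cost 0 against every query.
cost_truth = max_expected_cost(x, Q, x)

# For any other strategy, the best response exposes the largest error,
# because each query's complement has exactly the negated cost.
uniform = np.full(4, 0.25)
cost_uniform = max_expected_cost(uniform, Q, x)
worst_error = max(abs(float(f @ uniform - f @ x)) for f in Q)
```

Here `cost_uniform` coincides with `worst_error`, which is the sense in which an approximate min-max strategy is an accurate synthetic database.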

So how do we compute approximate min-max strategies in zero sum games? There are many ways! It is well known that if Alice plays the game repeatedly, updating her distribution on actions using an online-learning algorithm with a no-regret guarantee (defined in Section 11.2), and Bob responds at each round with an approximately-cost-maximizing response, then Alice’s distribution will quickly converge to an approximate min-max strategy. Multiplicative weights is such an algorithm, and one way of understanding the multiplicative weights mechanism is as a strategy for Alice to play in the query release game defined in this section. (The private distinguisher is playing the role of Bob here, picking at each round the query that corresponds to approximately maximizing Alice’s cost.) The median mechanism is another such algorithm, for the game in which Alice’s strategies correspond to databases, rather than universe elements, and so is also computing an approximate min-max solution to the query release game.

However, there are other ways to compute approximate equilibria as well! For example, Bob, the query player, could play the game using a no-regret learning algorithm (such as multiplicative weights), and Alice could repeatedly respond at each round with an approximately-cost-minimizing database! In this case, the average over the databases that Alice plays over the course of this experiment will converge to an approximate min-max solution as well. This is exactly what is being done in Section 6, in which the private base-sanitizer plays the role of Alice, at each round playing an approximately cost-minimizing database given Bob’s distribution over queries.
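The first of these dynamics can be sketched in a few lines. This is a non-private illustration under stated assumptions: Bob best-responds exactly (a private mechanism would use a private distinguisher instead), the learning rate $\sqrt{\ln|\mathcal{X}|/T}$ is one standard choice rather than the one used in the text, and the function name and toy data are hypothetical.

```python
import numpy as np


def mw_query_release(x, queries, rounds):
    """Alice runs multiplicative weights over universe elements while Bob
    best-responds; returns the average of Alice's distributions."""
    n = len(x)
    eta = np.sqrt(np.log(n) / rounds)      # assumed learning rate
    d = np.full(n, 1.0 / n)                # Alice starts from uniform
    avg = np.zeros(n)
    for _ in range(rounds):
        avg += d / rounds
        # Bob picks the query maximizing Alice's expected cost f(d) - f(x).
        f = max(queries, key=lambda q: float(q @ d - q @ x))
        cost = f - float(f @ x)            # c(chi, f) for every element chi
        d = d * np.exp(-eta * cost)        # down-weight costly elements
        d = d / d.sum()
    return avg


x = np.array([0.4, 0.3, 0.1, 0.1, 0.05, 0.05, 0.0, 0.0])
points = [np.eye(8)[i] for i in range(8)]
Q = [q for f in points for q in (f, 1.0 - f)]   # closed under complement

synthetic = mw_query_release(x, Q, rounds=2000)
err = max(abs(float(f @ synthetic - f @ x)) for f in Q)
```

By the usual no-regret argument the maximum error of the averaged distribution shrinks on the order of $\sqrt{\ln|\mathcal{X}|/T}$, so a few thousand rounds already give a small error on this toy instance.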
In fact, a third way of computing an approximate equilibrium of a zero-sum game is to have both Alice and Bob play according to no-regret learning algorithms. We won’t cover this approach here, but this approach has applications in guaranteeing privacy not just to the database, but also to the set of queries being asked, and to privately solving certain types of linear programs.

5.4 Bibliographical notes

The Iterative Construction Mechanism abstraction (together with the perceptron based database update algorithm) was formalized by

Gupta et al. [39], generalizing the median mechanism of Roth and Roughgarden [74] (initially presented as an online algorithm), the online private multiplicative weights mechanism of Hardt and Rothblum [44], and its offline variant of Gupta et al. [38]; see also Hardt et al. [41]. All these algorithms can be seen to be instantiations of the iterative construction framework. The connection between query release and agnostic learning was observed in [38]. The observation that the median mechanism, when analyzed using the composition theorems of Dwork et al. [32] for $(\varepsilon, \delta)$ privacy, can be used to answer arbitrary low sensitivity queries is due to Hardt and Rothblum. The game theoretic view of query release, along with its applications to analyst privacy, is due to Hsu, Roth, and Ullman [48].

6 Boosting for Queries

In the previous sections, we have focused on the problem of private query release in which we insist on bounding the worst-case error over all queries. Would our problem be easier if we instead asked only for low error on average, given some distribution over the queries? In this section, we see that the answer is no: given a mechanism which is able to solve the query release problem with low average error given any distribution on queries, we can “boost” it into a mechanism which solves the query release problem to worst-case error. This both sheds light on the difficulty of private query release, and gives us a new tool for designing private query release algorithms.

Boosting is a general and widely used method for improving the accuracy of learning algorithms. Given a set of labeled training examples

$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\},$$

where each $x_i$ is drawn from an underlying distribution $\mathcal{D}$ on a universe $U$, and each $y_i \in \{+1, -1\}$, a learning algorithm produces a hypothesis $h : U \to \{+1, -1\}$. Ideally, $h$ will not just “describe” the labeling on the given samples, but will also generalize, providing a reasonably accurate method of classifying other elements drawn from the underlying

distribution. The goal of boosting is to convert a weak base learner, which produces a hypothesis that may do just a little better than random guessing, into a strong learner, which yields a very accurate predictor for samples drawn according to $\mathcal{D}$. Many boosting algorithms share the following basic structure. First, an initial (typically uniform) probability distribution is imposed on the sample set. Computation then proceeds in rounds. In each round $t$:

1. The base learner is run on the current distribution, denoted $\mathcal{D}_t$, producing a classification hypothesis $h_t$; and

2. The hypotheses $h_1, \ldots, h_t$ are used to re-weight the samples, defining a new distribution $\mathcal{D}_{t+1}$.

The process halts either after a predetermined number of rounds or when an appropriate combining of the hypotheses is determined to be sufficiently accurate. Thus, given a base learner, the design decisions for a boosting algorithm are (1) how to modify the probability distribution from one round to the next, and (2) how to combine the hypotheses $\{h_t\}_{t=1,\ldots,T}$ to form a final output hypothesis.

In this section we will use boosting on queries (that is, for the purposes of the boosting algorithm the universe $U$ is a set of queries $\mathcal{Q}$) to obtain an offline algorithm for answering large numbers of arbitrary low-sensitivity queries. This algorithm requires less space than the median mechanism, and, depending on the base learner, is potentially more time efficient as well.

The algorithm revolves around a somewhat magical fact (Lemma 6.5): if we can find a synopsis that provides accurate answers on a few selected queries, then in fact this synopsis provides accurate answers on most queries! We apply this fact to the base learner, which samples from a distribution on $\mathcal{Q}$ and produces as output a “weak” synopsis that yields “good” answers for a majority of the weight in $\mathcal{Q}$, boosting, in a differentially private fashion, to obtain a synopsis that is good for all of $\mathcal{Q}$.

Although the boosting is performed over the queries, the privacy is still for the rows of the database. The privacy challenge in boosting for queries comes from the fact that each row in the database affects the

answers to all the queries. This will manifest in the reweighting of the queries: adjacent databases could cause radically different reweightings, which will be observable in the generated $h_t$ that, collectively, will form the synopsis.

The running time of the boosting procedure depends quasi-linearly on the number $|\mathcal{Q}|$ of queries and on the running time of the base synopsis generator, independent of the data universe size $|\mathcal{X}|$. This yields a new avenue for constructing efficient and accurate privacy-preserving mechanisms, analogous to the approach enabled by boosting in the machine learning literature: an algorithm designer can tackle the (potentially much easier) task of constructing a weak privacy-preserving base synopsis generator, and automatically obtain a stronger mechanism.

6.1 The boosting for queries algorithm

We will use the row representation for databases, outlined in Section 2, where we think of the database as a multiset of rows, or elements of $\mathcal{X}$. Fix a database size $n$, a data universe $\mathcal{X}$, and a query set $\mathcal{Q} = \{q : \mathcal{X}^* \to \mathbb{R}\}$ of real-valued queries of sensitivity at most $\rho$.

We assume the existence of a base synopsis generator (in Section 6.2 we will see how to construct these). The property we will need of the base generator, formulated next, is that, for any distribution $\mathcal{D}$ on the query set $\mathcal{Q}$, the output of the base generator can be used for computing accurate answers for a large fraction of the queries, where the “large fraction” is defined in terms of the weights given by $\mathcal{D}$. The base generator is parameterized by $k$, the number of queries to be sampled; $\lambda$, an accuracy requirement for its outputs; $\eta$, a measurement of “large” describing what we mean by a large fraction of the queries; and $\beta$, a failure probability.

Definition 6.1 ($(k, \lambda, \eta, \beta)$-base synopsis generator). For a fixed database size $n$, data universe $\mathcal{X}$ and query set $\mathcal{Q}$, consider a synopsis generator $\mathcal{M}$ that samples $k$ queries independently from a distribution $\mathcal{D}$ on $\mathcal{Q}$ and outputs a synopsis. We say that $\mathcal{M}$ is a $(k, \lambda, \eta, \beta)$-base synopsis generator if for any distribution $\mathcal{D}$ on $\mathcal{Q}$, with all but $\beta$ probability

over the coin flips of $\mathcal{M}$, the synopsis $S$ that $\mathcal{M}$ outputs is $\lambda$-accurate for a $(1/2 + \eta)$-fraction of the mass of $\mathcal{Q}$ as weighted by $\mathcal{D}$:

$$\Pr_{q \sim \mathcal{D}}\left[\,|q(S) - q(x)| \le \lambda\,\right] \ge 1/2 + \eta. \tag{6.1}$$

The query-boosting algorithm can be used for any class of queries and any differentially private base synopsis generator. The running time is inherited from the base synopsis generator. The booster invests additional time that is quasi-linear in $|\mathcal{Q}|$, and in particular its running time does not depend directly on the size of the data universe.

To specify the boosting algorithm we will need to specify a stopping condition, an aggregation mechanism, and an algorithm for updating the current distribution on $\mathcal{Q}$.

Stopping Condition. We will run the algorithm for a fixed number $T$ of rounds; this will be our stopping condition. $T$ will be selected so as to ensure sufficient accuracy (with very high probability); as we will see, $\log|\mathcal{Q}|/\eta^2$ rounds will suffice.

Updating the Distribution. Although the distributions are never directly revealed in the outputs, the base synopses $A_1, A_2, \ldots, A_T$ are revealed, and each $A_i$ can in principle leak information about the queries chosen, from $\mathcal{D}_i$, in constructing $A_i$. We therefore need to constrain the max-divergence between the probability distributions obtained on neighboring databases. This is technically challenging because, given $A_i$, the database is very heavily involved in constructing $\mathcal{D}_{i+1}$.

The initial distribution, $\mathcal{D}_1$, will be uniform over $\mathcal{Q}$. A standard method for updating $\mathcal{D}_t$ is to increase the weight of poorly handled elements, in our case, queries for which $|q(x) - q(A_t)| > \lambda$, by a fixed factor, say, $e$, and decrease the weight of well-handled elements by the same factor. (The weights are then normalized so as to sum to 1.) To get a feel for the difficulty, let $x = y \cup \{\xi\}$, and suppose that all queries $q$ are handled well by $A_t$ when the database is $y$, but the addition of $\xi$ causes this to fail for, say, a $1/10$ fraction of the queries; that is, $|q(y) - q(A_t)| \le \lambda$ for all queries $q$, but $|q(x) - q(A_t)| > \lambda$ for some $|\mathcal{Q}|/10$ queries. Note that, since $A_t$ “does well” on $9/10$ of the queries even

when the database is $x$, it could be returned from the base sanitizer no matter which of $x, y$ is the true data set. Our concern is with the effects of the updating: when the database is $y$ all queries are well handled and there is no reweighting (after normalization), but when the database is $x$ there is a reweighting: one tenth of the queries have their weights increased, the remaining nine tenths have their weights decreased. This difference in reweighting may be detected in the next iteration via $A_{t+1}$, which is observable, and which will be built from samples drawn from rather different distributions depending on whether the database is $x$ or $y$.

For example, suppose we start from the uniform distribution $\mathcal{D}_1$. Then $\mathcal{D}_2^{(y)} = \mathcal{D}_1^{(y)}$, where by $\mathcal{D}_i^{(z)}$ we mean the distribution at round $i$ when the database is $z$. This is because the weight of every query is decreased by a factor of $e$, which disappears in the normalization. So each $q \in \mathcal{Q}$ is assigned weight $1/|\mathcal{Q}|$ in $\mathcal{D}_2^{(y)}$. In contrast, when the database is $x$ the “unhappy” queries have normalized weight

$$\frac{\frac{e}{|\mathcal{Q}|}}{\frac{9}{10}\cdot\frac{1}{|\mathcal{Q}|}\cdot\frac{1}{e} + \frac{1}{10}\cdot\frac{e}{|\mathcal{Q}|}}.$$

Consider any such unhappy query $q$. The ratio $\mathcal{D}_2^{(x)}(q)/\mathcal{D}_2^{(y)}(q)$ is given by

$$\frac{\mathcal{D}_2^{(x)}(q)}{\mathcal{D}_2^{(y)}(q)} = \frac{\;\dfrac{e/|\mathcal{Q}|}{\frac{9}{10}\cdot\frac{1}{|\mathcal{Q}|}\cdot\frac{1}{e} + \frac{1}{10}\cdot\frac{e}{|\mathcal{Q}|}}\;}{\frac{1}{|\mathcal{Q}|}} = \frac{10}{1 + 9e^{-2}} \stackrel{\text{def}}{=} F \approx 4.5085.$$

Now, $\ln F \approx 1.506$, and even though the queries chosen in round 2 by the base generator are not explicitly made public, they may be detectable from the resulting $A_2$, which is made public. Thus, there is a potential privacy loss of up to 1.506 per query (of course, we expect cancellations; we are simply trying to explain the source of the difficulty). This is partially addressed by ensuring that the number of samples used by the base generator is relatively small, although we still have the problem that, over multiple iterations, the distributions $\mathcal{D}_t$ may evolve very differently even on neighboring databases.
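The arithmetic in this example is easy to reproduce. The short sketch below recomputes the ratio $F$ and its logarithm for the naive reweighting rule; the function name and its parameters are introduced for illustration only.

```python
import math


def reweight_ratio(unhappy_fraction: float, factor: float = math.e) -> float:
    """Ratio D2^(x)(q) / D2^(y)(q) for an unhappy query q, starting from the
    uniform distribution, when unhappy queries are multiplied by `factor`
    and all other queries are divided by it."""
    happy_mass = (1.0 - unhappy_fraction) / factor
    unhappy_mass = unhappy_fraction * factor
    # Normalized weight of one unhappy query, relative to the uniform 1/|Q|.
    return factor / (happy_mass + unhappy_mass)


F = reweight_ratio(0.1)
log_F = math.log(F)
print(F, log_F)
```

Varying `unhappy_fraction` shows that the ratio grows as fewer queries are affected, which is why a single row that flips a small set of queries can shift the sampling distribution on exactly those queries so sharply.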

The solution will be to attenuate the re-weighting procedure. Instead of always using a fixed ratio either for decreasing the weight (when the answer is “accurate”) or increasing it (when it is not), we set separate thresholds for “accuracy” ($\lambda$) and “inaccuracy” ($\lambda + \mu$, for an appropriately chosen $\mu$ that scales with the bit size of the output of the base generator; see Lemma 6.5 below). Queries for which the error is below or above these thresholds have their weight decreased or increased, respectively, by a factor of $e$. For queries whose error lies between these two thresholds, we scale the natural logarithm of the weight change linearly: $1 - 2(|q(x) - q(A_t)| - \lambda)/\mu$, so queries with errors of magnitude exceeding $\lambda + \mu/2$ increase in weight, and those with errors of magnitude less than $\lambda + \mu/2$ decrease in weight.

The attenuated scaling reduces the effect of any individual on the re-weighting of any query. This is because an individual can only affect the true answer to a query (and thus also the accuracy of the base synopsis generator’s output $q(A_t)$) by a small amount, and the attenuation divides this amount by a parameter $\mu$ which will be chosen to compensate for the $kT$ samples chosen (total) from the $T$ distributions obtained over the course of the execution of the boosting algorithm. This helps to ensure privacy. Intuitively, we view each of these $kT$ samples as a “mini-mechanism.” We first bound the privacy loss of sampling at any round (Claim 6.4) and then bound the cumulative loss via the composition theorem.

The larger the gap ($\mu$) between the thresholds for “accurate” and “inaccurate,” the smaller the effect of each individual on a query’s weight can be. This means that larger gaps are better for privacy. For accuracy, however, large gaps are bad. If the inaccuracy threshold is large, we can only guarantee that queries for which the base synopsis generator is very inaccurate have their weight substantially increased during re-weighting. This degrades the accuracy guarantee of the boosting algorithm: the errors are roughly equal to the “inaccuracy” threshold ($\lambda + \mu$).

Aggregation. For $t \in [T]$ we will run the base generator to obtain a synopsis $A_t$. The synopses will be aggregated by taking the median: given $A_1, \ldots, A_T$, the quantity $q(x)$ is estimated by taking the $T$

approximate values for $q(x)$ computed using each of the $A_i$, and then computing their median. With this aggregation method we can show accuracy for query $q$ by arguing that a majority of the $A_i$, $1 \le i \le T$, provide $\lambda + \mu$ accuracy (or better) for $q$. This implies that the median value of the $T$ approximations to $q(x)$ will be within $\lambda + \mu$ of the true value.

Notation.

1. Throughout the algorithm’s operation, we keep track of several variables (explicitly or implicitly). Variables indexed by $q \in \mathcal{Q}$ hold information pertaining to query $q$ in the query set. Variables indexed by $t \in [T]$, usually computed in round $t$, will be used to construct the distribution $\mathcal{D}_{t+1}$ used for sampling in time period $t + 1$.

2. For a predicate $P$ we use $[[P]]$ to denote 1 if the predicate is true and 0 if it is false.

3. There is a final tuning parameter $\alpha$ used in the algorithm. It will be chosen (see Corollary 6.3 below) to have value

$$\alpha = \alpha(\eta) = (1/2)\ln\left(\frac{1 + 2\eta}{1 - 2\eta}\right).$$

The algorithm appears in Figure 6.1. The quantity $u_{t,q}$ in Step 2(2b) is the new, un-normalized, weight of the query. For the moment, let us set $\alpha = 1$ (just so that we can ignore any $\alpha$ factors). Letting $a_{j,q}$ be the negated natural logarithm of the weight change in round $j$, $1 \le j \le t$, the new weight is given by:

$$u_{t,q} \leftarrow \exp\left(-\sum_{j=1}^{t} a_{j,q}\right).$$

Thus, at the end of the previous step the un-normalized weight was $u_{t-1,q} = \exp\left(-\sum_{j=1}^{t-1} a_{j,q}\right)$ and the update corresponds to multiplication by $e^{-a_{t,q}}$. When the sum $\sum_{j=1}^{t} a_{j,q}$ is large, the weight is small. Every time a synopsis gives a very good approximation to $q(x)$, we add 1 to this sum; if the approximation is only moderately good (between $\lambda$ and

128 124 Boosting for Queries Boosting for queries. Figure 6.1: μ/ 2 ), we add a positive amount, but less than 1. Conversely, when λ + + accuracy), we subtract 1; λ μ the synopsis is very bad (worse than + μ/ 2 when it is barely acceptable (between λ + μ ), we subtract λ and a smaller amount. In the theorem below we see an inverse relationship between privacy loss due to sampling, captured by between the , and the gap μ ε sample thresholds for accurate and inaccurate. Theorem 6.1. Q be a query family with sensitivity at most ρ . For Let 2 = log an appropriate setting of parameters, and with /η T rounds, |Q| the algorithm of Figure 6.1 is an accurate and differentially private query-boosting algorithm: 1. When instantiated with a ( k, λ, η, β ) -base synopsis generator, the output of the boosting algorithm gives ( + μ ) -accurate answers λ − all to with probability at least 1 the queries in T β , where Q √ √ 3 2 3 / ρ (((log Q | ) O k ∈ log(1 /β ) | ) / ( ε . · η μ )) (6.2) sample

2. If the base synopsis generator is $(\varepsilon_{\text{base}}, \delta_{\text{base}})$-differentially private, then the boosting algorithm is $(\varepsilon_{\text{sample}} + T \cdot \varepsilon_{\text{base}},\ \delta_{\text{sample}} + T\delta_{\text{base}})$-differentially private.

Allowing the constant $\eta$ to be swallowed up into the big-O notation, and taking $\rho = 1$ for simplicity, we get $\mu = O\left(((\log^{3/2}|\mathcal{Q}|)\sqrt{k}\sqrt{\log(1/\beta)})/\varepsilon_{\text{sample}}\right)$. Thus we see that reducing the number $k$ of input queries needed by the base sanitizer improves the quality of the output. Similarly, from the full statement of the theorem, we see that improving the generalization power of the base sanitizer, which corresponds to having a larger value of $\eta$ (a bigger "strong majority"), also improves the accuracy.

Proof of Theorem 6.1. We first prove accuracy, then privacy. We introduce the notation $a^-_{t,q}$ and $a^+_{t,q}$, satisfying

1. $a^-_{t,q}, a^+_{t,q} \in \{-1, 1\}$; and
2. $a^-_{t,q} \le a_{t,q} \le a^+_{t,q}$.

Recall that a larger $a_{t,q}$ indicates a higher quality of the approximation of the synopsis $A_t$ for $q(x)$.

1. $a^-_{t,q}$ is 1 if $A_t$ is $\lambda$-accurate on $q$, and $-1$ otherwise. To check that $a^-_{t,q} \le a_{t,q}$, note that if $a^-_{t,q} = 1$ then $A_t$ is $\lambda$-accurate for $q$, and so by definition $a_{t,q} = 1$ as well. If instead we have $a^-_{t,q} = -1$ then since we always have $a_{t,q} \in [-1, 1]$, we are done.

We will use the $a^-_{t,q}$ to lower bound a measure of the quality of the output of the base generator. By the promise of the base generator, $A_t$ is $\lambda$-accurate for at least a $1/2 + \eta$ fraction of the mass of $\mathcal{D}_t$. Thus,
$$r_t \triangleq \sum_{q \in \mathcal{Q}} \mathcal{D}_t[q] \cdot a^-_{t,q} \ge (1/2 + \eta) - (1/2 - \eta) = 2\eta. \tag{6.3}$$

2. $a^+_{t,q}$ is $-1$ if $A_t$ is $(\lambda + \mu)$-inaccurate for $q$, and 1 otherwise. To check that $a_{t,q} \le a^+_{t,q}$, note that if $a^+_{t,q} = -1$ then $A_t$ is $(\lambda + \mu)$-inaccurate for $q$, so by definition $a_{t,q} = -1$ as well. If instead $a^+_{t,q} = 1$ then since we always have $a_{t,q} \in [-1, 1]$, we are done.

Thus $a^+_{t,q}$ is positive if and only if $A_t$ is at least minimally adequately accurate for $q$. We will use the $a^+_{t,q}$ to prove accuracy of the aggregation. When we sum the values $a^+_{t,q}$, we get a positive number if and only if the majority of the $A_t$ are providing passable (that is, within $\lambda + \mu$) approximations to $q(x)$. In this case the median value will be within $\lambda + \mu$ of $q(x)$.

Lemma 6.2. After $T$ rounds of boosting, with all but $T\beta$ probability, the answers to all but an $\exp(-\eta^2 T)$-fraction of the queries are $(\lambda + \mu)$-accurate.

Proof. In the last round of boosting, we have:
$$\mathcal{D}_{T+1}[q] = \frac{u_{T,q}}{Z_T}. \tag{6.4}$$
Since $a_{t,q} \le a^+_{t,q}$ we have:
$$u^+_{T,q} \triangleq e^{-\alpha \sum_{t=1}^{T} a^+_{t,q}} \le e^{-\alpha \sum_{t=1}^{T} a_{t,q}} = u_{T,q}. \tag{6.5}$$
(The superscript "+" reminds us that this unweighted value was computed using the terms $a^+_{t,q}$.) Note that we always have $u^+_{T,q} \ge 0$. Combining Equations (6.4) and (6.5), for all $q \in \mathcal{Q}$:
$$\mathcal{D}_{T+1}[q] \ge \frac{u^+_{T,q}}{Z_T}. \tag{6.6}$$
Recalling that $[[P]]$ denotes the boolean variable that has value 1 if and only if the predicate $P$ is true, we turn to examining the value $[[A \text{ is } (\lambda + \mu)\text{-inaccurate for } q]]$. If this predicate is 1, then it must be the case that the majority of $\{A_j\}_{j=1}^{T}$ are $(\lambda + \mu)$-inaccurate, as otherwise their median would be $(\lambda + \mu)$-accurate.

From our discussion of the significance of the sign of $\sum_{t=1}^{T} a^+_{t,q}$, we have:
$$A \text{ is } (\lambda + \mu)\text{-inaccurate for } q \;\Rightarrow\; \sum_{t=1}^{T} a^+_{t,q} \le 0 \;\Leftrightarrow\; e^{-\alpha \sum_{t=1}^{T} a^+_{t,q}} \ge 1 \;\Leftrightarrow\; u^+_{T,q} \ge 1.$$
Since $u^+_{T,q} \ge 0$, we conclude that:
$$[[A \text{ is } (\lambda + \mu)\text{-inaccurate for } q]] \le u^+_{T,q}.$$

Using this together with Equation (6.6) yields:
$$\frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} [[A \text{ is } (\lambda + \mu)\text{-inaccurate for } q]] \;\le\; \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} u^+_{T,q} \;\le\; \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} \mathcal{D}_{T+1}[q] \cdot Z_T \;=\; \frac{Z_T}{|\mathcal{Q}|}.$$
Thus the following claim completes the proof:

Claim 6.3. In round $t$ of boosting, with all but $t\beta$ probability:
$$Z_t \le \exp(-\eta^2 \cdot t) \cdot |\mathcal{Q}|.$$

Proof. By definition of a base synopsis generator, with all but $\beta$ probability, the synopsis generated is $\lambda$-accurate for at least a $(1/2 + \eta)$-fraction of the mass of the distribution $\mathcal{D}_t$. Recall that $a^-_{t,q} \in \{-1, 1\}$ is 1 if and only if $A_t$ is $\lambda$-accurate on $q$, and that $a^-_{t,q} \le a_{t,q}$, and recall further the quantity $r_t \triangleq \sum_{q \in \mathcal{Q}} \mathcal{D}_t[q] \cdot a^-_{t,q}$ defined in Equation (6.3). As discussed above, $r_t$ measures the "success" of the base synopsis generator in round $t$, where by "success" we mean the stricter notion of $\lambda$-accuracy. As summarized in Equation (6.3), if a $(1/2 + \eta)$-fraction of the mass of $\mathcal{D}_t$ is computed with $\lambda$-accuracy, then $r_t \ge 2\eta$. Now observe also that for $t \in [T]$, assuming the base sanitizer did not fail in round $t$:
$$\begin{aligned}
Z_t &= \sum_{q \in \mathcal{Q}} u_{t,q} \\
&= \sum_{q \in \mathcal{Q}} u_{t-1,q} \cdot e^{-\alpha \cdot a_{t,q}} \\
&= \sum_{q \in \mathcal{Q}} Z_{t-1} \cdot \mathcal{D}_t[q] \cdot e^{-\alpha \cdot a_{t,q}} \\
&\le \sum_{q \in \mathcal{Q}} Z_{t-1} \cdot \mathcal{D}_t[q] \cdot e^{-\alpha \cdot a^-_{t,q}} \\
&= Z_{t-1} \cdot \sum_{q \in \mathcal{Q}} \mathcal{D}_t[q] \cdot \left( \left(\frac{1 + a^-_{t,q}}{2}\right) e^{-\alpha} + \left(\frac{1 - a^-_{t,q}}{2}\right) e^{\alpha} \right) && \text{(case analysis)} \\
&= \frac{Z_{t-1}}{2} \left[ (e^{\alpha} + e^{-\alpha}) + r_t (e^{-\alpha} - e^{\alpha}) \right] \\
&\le \frac{Z_{t-1}}{2} \left[ (e^{\alpha} + e^{-\alpha}) + 2\eta (e^{-\alpha} - e^{\alpha}) \right] && (r_t \ge 2\eta \text{ and } (e^{-\alpha} - e^{\alpha}) \le 0)
\end{aligned}$$
By simple calculus we see that $(e^{\alpha} + e^{-\alpha}) + 2\eta(e^{-\alpha} - e^{\alpha})$ is minimized when
$$\alpha = \frac{1}{2} \ln\left(\frac{1 + 2\eta}{1 - 2\eta}\right).$$
Plugging this into the recurrence, and using $Z_0 = |\mathcal{Q}|$, we get
$$Z_t \le \left(\sqrt{1 - 4\eta^2}\right)^t |\mathcal{Q}| \le \exp(-2\eta^2 t)\,|\mathcal{Q}|.$$

This completes the proof of Lemma 6.2. The lemma implies that accuracy for all queries simultaneously can be achieved by setting
$$T > \frac{\ln|\mathcal{Q}|}{\eta^2}.$$

Privacy. We will show that the entire sequence $(S_1, A_1, \ldots, S_T, A_T)$ can be output while preserving differential privacy. Note that this is stronger than we need: we do not actually output the sets $S_1, \ldots, S_T$. By our adaptive composition theorems, the privacy of each $A_i$ will be guaranteed by the privacy guarantees of the base synopsis generator, together with the fact that $S_{i-1}$ was computed in a differentially private way. Therefore, it suffices to prove that given that $(S_1, A_1, \ldots, S_i, A_i)$ is differentially private, $S_{i+1}$ is as well. We can then combine the privacy parameters using our composition theorems to compute a final guarantee.

Lemma 6.4. Let $\varepsilon^* = 4\alpha T \rho/\mu$. For all $i \in [T]$, once $(S_1, A_1, \ldots, S_i, A_i)$ is fixed, the computation of each element of $S_{i+1}$ is $(\varepsilon^*, 0)$-differentially private.

Proof. Fixing $A_1, \ldots, A_i$, for every $j \le i$, the quantity $d_{q,j}$ has sensitivity $\rho$, since $A_j(q)$ is database independent (because $A_j$ is fixed), and every $q \in \mathcal{Q}$ has sensitivity bounded by $\rho$. Therefore, for every $j \le i$, $a_{j,q}$ is $2\rho/\mu$ sensitive by construction, and so
$$g_i(q) \triangleq \sum_{j=1}^{i} a_{j,q}$$
has sensitivity at most $2i\rho/\mu \le 2T\rho/\mu$. Then $\Delta g_i \triangleq 2T\rho/\mu$ is an upper bound on the sensitivity of $g_i$.

To argue privacy, we will show that the selection of queries for $S_{i+1}$ is an instance of the exponential mechanism. Think of $-g_i(q)$ as the utility of a query $q$ during the selection process at round $i + 1$. The exponential mechanism says that to achieve $(\varepsilon^*, 0)$-differential privacy we should choose $q$ with probability proportional to
$$\exp\left(-g_i(q)\,\frac{\varepsilon^*}{2\Delta g_i}\right).$$
Since $\varepsilon^*/(2\Delta g_i) = \alpha$ and the algorithm selects $q$ with probability proportional to $e^{-\alpha g_i(q)}$, we see that this is exactly what the algorithm does!

We bound the privacy loss of releasing the $S_i$'s by treating each selection of a query as a "mini-mechanism" that, over the course of $T$ rounds of boosting, is invoked $kT$ times. By Lemma 6.4 each mini-mechanism is $(4\alpha T\rho/\mu, 0)$-differentially private. By Theorem 3.20, for all $\delta_{\text{sample}} > 0$ the composition of $kT$ mechanisms, each of which is $(4\alpha T\rho/\mu, 0)$-differentially private, is $(\varepsilon_{\text{sample}}, \delta_{\text{sample}})$-differentially private, where
$$\varepsilon_{\text{sample}} \triangleq \sqrt{2kT \log(1/\delta_{\text{sample}})}\,(4\alpha T\rho/\mu) + kT \left(\frac{4\alpha T\rho}{\mu}\right)^2. \tag{6.7}$$

Our total privacy loss comes from the composition of $T$ calls to the base sanitizer and the cumulative loss from the $kT$ samples. We conclude that the boosting algorithm in its entirety is $(\varepsilon_{\text{boost}}, \delta_{\text{boost}})$-differentially private, where
$$\varepsilon_{\text{boost}} = T\varepsilon_{\text{base}} + \varepsilon_{\text{sample}}, \qquad \delta_{\text{boost}} = T\delta_{\text{base}} + \delta_{\text{sample}}.$$
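The privacy accounting above can be traced numerically. The following is a minimal sketch, not the authors' implementation: the function names and the dictionary representation of $g_i$ are assumptions made for illustration. It normalizes the weights $e^{-\alpha g_i(q)}$ into a sampling distribution and evaluates $\varepsilon^*$, Equation (6.7), and the final $(\varepsilon_{\text{boost}}, \delta_{\text{boost}})$.

```python
import math


def sampling_distribution(g, alpha):
    """Turn scores g[q] = g_i(q) into probabilities proportional to exp(-alpha * g[q])."""
    weights = {q: math.exp(-alpha * score) for q, score in g.items()}
    z = sum(weights.values())  # normalizer, playing the role of Z_i
    return {q: w / z for q, w in weights.items()}


def eps_star(alpha, T, rho, mu):
    """Per-selection privacy loss from Lemma 6.4: 4 * alpha * T * rho / mu."""
    return 4 * alpha * T * rho / mu


def eps_sample(k, T, alpha, rho, mu, delta_sample):
    """Equation (6.7): composition of k*T selections, each (eps_star, 0)-private."""
    e = eps_star(alpha, T, rho, mu)
    return math.sqrt(2 * k * T * math.log(1 / delta_sample)) * e + k * T * e ** 2


def boost_privacy(T, eps_base, delta_base, eps_s, delta_s):
    """Total guarantee: T calls to the base generator plus the sampling loss."""
    return T * eps_base + eps_s, T * delta_base + delta_s
```

Queries with a larger running score $g_i(q)$ (those the synopses so far answer well) receive less sampling mass, and dividing `eps_star` by $2\Delta g_i = 4T\rho/\mu$ recovers `alpha`, which is the identity used in the proof of Lemma 6.4.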

To get the parameters claimed in the statement of the theorem, we can take:
$$\mu \in O\left(\frac{T^{3/2}\,\sqrt{k}\,\sqrt{\log(1/\beta)}\,\alpha\rho}{\varepsilon_{\text{sample}}}\right). \tag{6.8}$$

6.2 Base synopsis generators

Algorithm SmallDB (Section 4) is based on the insight that a small randomly selected subset of database rows provides good answers to large sets of fractional counting queries. The base synopsis generators described in the current section have an analogous insight: a small synopsis that gives good approximations to the answers to a small subset of queries also yields good approximations to most queries. Both of these are instances of generalization bounds. In the remainder of this section we first prove a generalization bound and then use it to construct differentially private base synopsis generators.

6.2.1 A generalization bound

We have a distribution $\mathcal{D}$ over a large set $\mathcal{Q}$ of queries to be approximated. The lemma below says that a sufficiently small synopsis that gives sufficiently good approximations to the answers of a randomly selected subset $S \subset \mathcal{Q}$ of queries, sampled according to the distribution $\mathcal{D}$ on $\mathcal{Q}$, will, with high probability over the choice of $S$, also give good approximations to the answers to most queries in $\mathcal{Q}$ (that is, to most of the mass of $\mathcal{Q}$, weighted by $\mathcal{D}$). Of course, to make any sense the synopsis must include a method of providing an answer to all queries in $\mathcal{Q}$, not just the subset $S \subseteq \mathcal{Q}$ received as input. Our particular generators, described in Section 6.2.2 and Theorem 6.6, will produce synthetic databases; to answer any query one can simply apply the query to the synthetic database, but the lemma will be stated in full generality.

Let $R(y, q)$ denote the answer given by the synopsis $y$ (when used as input for the reconstruction procedure) on query $q$. A synopsis $y$ $\lambda$-fits a database $x$ w.r.t. a set $S$ of queries if $\max_{q \in S} |R(y, q) - q(x)| \le \lambda$. Let $|y|$ denote the number of bits needed to represent $y$. Since our synopses will be synthetic databases, $|y| = N \log_2 |\mathcal{X}|$ for some appropriately chosen number $N$ of universe elements. The generalization bound shows that if $y$ $\lambda$-fits $x$ with respect to a large enough (larger than $|y|$) randomly chosen set $S$ of queries sampled from a distribution $\mathcal{D}$, then with high probability $y$ $\lambda$-fits $x$ for most of the mass of $\mathcal{D}$.

Lemma 6.5. Let $\mathcal{D}$ be an arbitrary distribution on a query set $\mathcal{Q} = \{q : \mathcal{X}^* \to \mathbb{R}\}$. For all $m \in \mathbb{N}$, $\gamma \in (0, 1)$, $\eta \in [0, 1/2)$, let $a = 2(\log(1/\gamma) + m)/(m(1 - 2\eta))$. Then with probability at least $1 - \gamma$ over the choice of $S \sim \mathcal{D}^{a \cdot m}$, every synopsis $y$ of size at most $m$ bits that $\lambda$-fits $x$ with respect to the query set $S$, also $\lambda$-fits $x$ with respect to at least a $(1/2 + \eta)$-fraction of $\mathcal{D}$.

Before proving the lemma we observe that $a$ is a compression factor: we are squeezing the answers to $am$ queries into an $m$-bit output, so larger $a$ corresponds to more compression. Typically, this means better generalization, and indeed we see that if $a$ is larger then, keeping $m$ and $\gamma$ fixed, we would be able to have larger $\eta$. The lemma also says that, for any given output size $m$, the number of queries needed as input to obtain an output that does well on a majority ($1/2 + \eta$ fraction) of $\mathcal{D}$ is only $O(\log(1/\gamma) + m)$. This is interesting because a smaller number of queries $k$ needed by the base generator leads, via the privacy loss $\varepsilon_{\text{sample}}$ due to sampling of $kT$ queries and its inverse relationship to the slackness $\mu$ (Equation 6.7), to improved accuracy of the output of the boosting algorithm.

Proof of Lemma 6.5. Fix a set of queries $S \subset \mathcal{Q}$ chosen independently according to $\mathcal{D}^{a \cdot m}$. Examine an arbitrary $m$-bit synopsis $y$. Note that $y$ is described by an $m$-bit string. Let us say $y$ is bad if $|R(y, q) - q(x)| > \lambda$ for at least a $(\log(1/\gamma) + m)/(a \cdot m)$ fraction of $\mathcal{D}$, meaning that
$$\Pr_{q \sim \mathcal{D}}\left[\,|R(y, q) - q(x)| > \lambda\,\right] \ge \frac{\log(1/\gamma) + m}{a \cdot m}.$$
In other words, $y$ is bad if there exists a set $Q_y \subset \mathcal{Q}$ of fractional weight at least $(\log(1/\gamma) + m)/(a \cdot m)$ such that $|R(y, q) - q(x)| > \lambda$ for $q \in Q_y$. For such a $y$, what is the probability that $y$ gives $\lambda$-accurate answers for every $q \in S$? This is exactly the probability that none of the queries in $S$ is in $Q_y$, or
$$\left(1 - \frac{\log(1/\gamma) + m}{a \cdot m}\right)^{a \cdot m} \le e^{-(\log(1/\gamma) + m)} \le \gamma \cdot 2^{-m}.$$
Taking a union bound over all $2^m$ possible choices for $y$, the probability that there exists an $m$-bit synopsis $y$ that is accurate on all the queries in $S$ but inaccurate on a set of fractional weight $(\log(1/\gamma) + m)/(a \cdot m)$ is at most $\gamma$. The lemma requires the bad fractional weight to be at most $1/2 - \eta$; letting $k = am = |S|$ we see that it is sufficient to have
$$a > \frac{2(\log(1/\gamma) + m)}{m \cdot (1 - 2\eta)}. \tag{6.9}$$

This simple lemma is extremely powerful. It tells us that when constructing a base generator at round $t$, we only need to worry about ensuring good answers for the small set of random queries sampled from $\mathcal{D}_t$; doing well for most of $\mathcal{D}_t$ will happen automatically!

6.2.2 The base generator

Our first generator works by brute force. After sampling a set $S$ of $k$ queries independently according to a distribution $\mathcal{D}$, the base generator will produce noisy answers for all queries in $S$ via the Laplace mechanism. Then, making no further use of the actual database, the algorithm searches for any database of size $n$ for which these noisy answers are sufficiently close, and outputs this database. Privacy will be immediate because everything after the $k$ invocations of the Laplace mechanism is post-processing. Thus the only source of privacy loss is the cumulative loss from these $k$ invocations of the Laplace mechanism, which we know how to analyze via the composition theorem. Utility will follow from the utility of the Laplace mechanism (which says that we are unlikely to have "very large" error on even one query) coupled with the fact that the true database $x$ is an $n$-element database that fits these noisy responses.¹

¹ This argument assumes the size $n$ of the database is known. Alternatively we can include a noisy query of the form "How many rows are in the database?" and exhaustively search all databases of size close to the response to this query.
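The brute-force generator just described can be sketched for a toy universe. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names (`laplace`, `fractional_count`, `brute_force_generator`) are hypothetical, the query sample $S$ is passed in already drawn, and the noise scale `b` and threshold `lam` are free parameters here rather than the values fixed by the analysis. The search enumerates all $|\mathcal{X}|^n$ databases, so it is only feasible for tiny inputs.

```python
import itertools
import random


def laplace(rng, b):
    """One draw from Lap(b), as the difference of two exponentials with mean b."""
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)


def fractional_count(predicate):
    """Fractional counting query: the fraction of rows satisfying the predicate."""
    return lambda db: sum(1 for row in db if predicate(row)) / len(db)


def brute_force_generator(x, universe, queries, b, lam, rng):
    """Return (y, noisy): the first size-n database within lam/2 of every noisy
    answer, or (None, noisy) if the search fails."""
    n = len(x)
    # The only access to the private database x: one Laplace draw per query.
    noisy = [q(x) + laplace(rng, b) for q in queries]
    # Everything below is post-processing of the noisy answers.
    for y in itertools.product(sorted(universe), repeat=n):
        if all(abs(q(y) - a) <= lam / 2 for q, a in zip(queries, noisy)):
            return list(y), noisy
    return None, noisy
```

For example, with `universe = [0, 1, 2]`, `x = [0, 1, 1, 2]`, and two fractional counting queries, a small `b` makes the search succeed, and the returned `y` then agrees with `x` to within `lam` on the sampled queries by the triangle inequality.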

Theorem 6.6 (Base Synopsis Generator for Arbitrary Queries). For any data universe $\mathcal{X}$, database size $n$, and class $\mathcal{Q} : \{\mathcal{X}^* \to \mathbb{R}\}$ of queries of sensitivity at most $\rho$, for any $\varepsilon_{\text{base}}, \delta_{\text{base}} > 0$, there exists an $(\varepsilon_{\text{base}}, \delta_{\text{base}})$-differentially private $(k, \lambda, \eta = 1/3, \beta)$-base synopsis generator for $\mathcal{Q}$, where $k = am > 6(m + \log(2/\beta)) = 6(n \log|\mathcal{X}| + \log(2/\beta))$ and $\lambda > 2b(\log k + \log(2/\beta))$, where $b = \rho\sqrt{am \log(1/\delta_{\text{base}})}/\varepsilon_{\text{base}}$. The running time of the generator is
$$|\mathcal{X}|^n \cdot \text{poly}(n, \log(1/\beta), \log(1/\varepsilon_{\text{base}}), \log(1/\delta_{\text{base}})).$$

Proof. We first describe the base generator at a high level, then determine the values for $k$ and $\lambda$. The synopsis $y$ produced by the base generator will be a synthetic database of size $n$. Thus $m = |y| = n \cdot \log|\mathcal{X}|$. The generator begins by choosing a set $S$ of $k$ queries, sampled independently according to $\mathcal{D}$. It computes a noisy answer for each query $q \in S$ using the Laplace mechanism, adding to each true answer an independent draw from $\text{Lap}(b)$ for an appropriate $b$ to be determined later. Let $\{\hat{q}(x)\}_{q \in S}$ be the collection of noisy answers. The generator enumerates over all $|\mathcal{X}|^n$ databases of size $n$, and outputs the lexicographically first database $y$ such that for every $q \in S$ we have $|q(y) - \hat{q}(x)| \le \lambda/2$. If no such database is found, it outputs $\bot$ instead, and we say it fails. Note that if $|\hat{q}(x) - q(x)| < \lambda/2$ and $|q(y) - \hat{q}(x)| < \lambda/2$, then $|q(y) - q(x)| < \lambda$.

There are two potential sources of failure for our particular generator. One possibility is that $y$ fails to generalize, or is bad as defined in the proof of Lemma 6.5. A second possibility is that one of the samples from the Laplace distribution is of excessively large magnitude, which might cause the generator to fail. We will choose our parameters so as to bound the probability of each of these events individually by at most $\beta/2$.

Substituting $\eta = 1/3$, $\gamma = \beta/2$, and $m = n \log|\mathcal{X}|$ into Equation 6.9 shows that taking $a > 6(1 + \log(2/\beta)/m)$ suffices in order for the probability of failure due to the choice of $S$ to be bounded by $\beta/2$. Thus, taking $k = am > 6(m + \log(2/\beta)) = 6(n \log|\mathcal{X}| + \log(2/\beta))$ suffices.

We have $k$ queries of sensitivity at most $\rho$. Using the Laplace mechanism with parameter $b = 2\sqrt{2k \log(1/\delta_{\text{base}})}\,\rho/\varepsilon_{\text{base}}$ ensures that each query incurs privacy loss at most $\varepsilon_{\text{base}}/(2\sqrt{2k \ln(1/\delta_{\text{base}})})$, which by Corollary 3.21 ensures that the entire procedure will be $(\varepsilon_{\text{base}}, \delta_{\text{base}})$-differentially private.

We will choose $\lambda$ so that the probability that any draw from $\text{Lap}(b)$ has magnitude exceeding $\lambda/2$ is at most $\beta/2$. Conditioned on the event that all $k$ draws have magnitude at most $\lambda/2$, we know that the input database itself agrees with every noisy answer to within $\lambda/2$, so the search finds at least one database and the procedure will not fail. Recall that the concentration properties of the Laplace distribution ensure that with probability at least $1 - e^{-t}$ a draw from $\text{Lap}(b)$ will have magnitude bounded by $tb$. Setting $\lambda/2 = tb$, the probability that a given draw will have magnitude exceeding $\lambda/2$ is bounded by $e^{-t} = e^{-\lambda/2b}$. To ensure that none of the $k$ draws has magnitude exceeding $\lambda/2$ it suffices, by a union bound, to have
$$\begin{aligned}
k e^{-\lambda/2b} < \beta/2 \;&\Leftrightarrow\; e^{\lambda/2b} > \frac{2k}{\beta} \\
&\Leftrightarrow\; \lambda/2 > b(\log k + \log(2/\beta)) \\
&\Leftrightarrow\; \lambda > 2b(\log k + \log(2/\beta)).
\end{aligned}$$

The Special Case of Linear Queries. For the special case of linear queries it is possible to avoid the brute force search for a small database. The technique requires time that is polynomial in $(|\mathcal{Q}|, |\mathcal{X}|, n, \log(1/\beta))$. We will focus on the case of counting queries and sketch the construction.

As in the case of the base generator for arbitrary queries, the base generator begins by selecting a set $S$ of $k = am$ queries according to $\mathcal{D}$ and computing noisy answers using Laplace noise. The generator for linear queries then runs a syntheticizer on $S$ which, roughly speaking, transforms any synopsis giving good approximations to any set $R$ of queries into a synthetic database yielding approximations of similar quality on the set $R$. The input to the syntheticizer will be the noisy values for the queries in $S$, that is, $R = S$. (Recall that when we modify the size of the database we always think in terms of the fractional version of the counting queries: "What fraction of the database rows satisfies property $P$?")
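Returning to the parameter choices in the proof of Theorem 6.6, the union-bound calculation can be checked numerically. This is a small sketch with hypothetical function names; it uses the proof's scale $b = 2\sqrt{2k\log(1/\delta_{\text{base}})}\,\rho/\varepsilon_{\text{base}}$ and natural logarithms throughout, both of which are assumptions about how the constants are meant to be read.

```python
import math


def laplace_scale(k, rho, eps_base, delta_base):
    """Noise scale b used for each of the k Laplace draws in the proof."""
    return 2 * math.sqrt(2 * k * math.log(1 / delta_base)) * rho / eps_base


def lambda_threshold(b, k, beta):
    """Boundary value 2b(log k + log(2/beta)); lambda must strictly exceed it."""
    return 2 * b * (math.log(k) + math.log(2 / beta))


def union_bound_failure(lam, b, k):
    """Union bound k * exp(-lam / (2b)) on some draw exceeding lam/2."""
    return k * math.exp(-lam / (2 * b))
```

At the boundary `lam = lambda_threshold(b, k, beta)` the bound equals exactly `beta / 2`, and any larger `lam` pushes it below `beta / 2`, matching the chain of equivalences above.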

The resulting database may be quite large, meaning it may have many rows. The base generator then subsamples only $n' = (\log k \log(1/\beta))/\alpha^2$ of the rows of the synthetic database, creating a smaller synthetic database that with probability at least $1 - \beta$ has $\alpha$-accuracy with respect to the answers given by the large synthetic database. This yields an $m = ((\log k \log(1/\beta))/\alpha^2) \log|\mathcal{X}|$-bit synopsis that, by the generalization lemma, with probability at least $1 - \beta$ over the choice of the $k$ queries, answers well on a $(1/2 + \eta)$ fraction of $\mathcal{Q}$ (as weighted by $\mathcal{D}$).

As in the case of the base generator for arbitrary queries, we require $k = am > 6\log(1/\beta) + 6m$. Taking $\alpha^2 = (\log|\mathcal{Q}|)/n$ we get that
$$k > 6\log(1/\beta) + 6\,\frac{\log k \log(1/\beta) \log|\mathcal{X}|}{\alpha^2} = 6\log(1/\beta) + 6n\,\frac{\log k \log(1/\beta) \log|\mathcal{X}|}{\log|\mathcal{Q}|}.$$

The syntheticizer is nontrivial. Its properties are summarized by the following theorem.

Theorem 6.7. Let $\mathcal{X}$ be a data universe, $\mathcal{Q}$ a set of fractional counting queries, and $A$ an $(\varepsilon, \delta)$-differentially private synopsis generator with utility $(\alpha, \beta, 0)$ and arbitrary output. Then there exists a syntheticizer $A'$ that is $(\varepsilon, \delta)$-differentially private and has utility $(3\alpha, \beta, 0)$. $A'$ outputs a (potentially large) synthetic database. Its running time is polynomial in the running time of $A$ and $(|\mathcal{X}|, |\mathcal{Q}|, 1/\alpha, \log(1/\beta))$.

In our case, $A$ is the Laplace mechanism, and the synopsis is simply the set of noisy answers. The composition theorem says that for $A$ to be $(\varepsilon_{\text{base}}, \delta_{\text{base}})$-differentially private the parameter to the Laplace mechanism should be $\rho/(\varepsilon_{\text{base}}/\sqrt{2k \log(1/\delta_{\text{base}})})$. Here, $\rho$ is the sensitivity. For counting queries it is 1, but we will shift to fractional counting queries, so $\rho = 1/n$. Thus, when we apply the theorem we will have an $\alpha$ of order $(\sqrt{k \log(1/\beta)}/\varepsilon_{\text{base}})\rho$.

Proof Sketch for Theorem 6.7. Run $A$ to get (differentially private) (fractional) counts on all the queries in $R$. We will then use linear programming to find a low-weight fractional database that approximates these fractional counts, as explained below. Finally, we transform this fractional database into a standard synthetic database by rounding the fractional counts.

The output of $A$ yields a fractional count for each query $q \in \mathcal{Q}$. The input database $x$ is never accessed again and so $A'$ is $(\varepsilon, \delta)$-differentially private. Let $v$ be the resulting vector of counts, i.e., $v_q$ is the fractional count that $A$'s output gives on query $q$. With probability $1 - \beta$, all of the entries in $v$ are $\alpha$-accurate.

A "fractional" database $z$ that approximates these counts is obtained as follows. Recall the histogram representation of a database, where for each element in the universe $\mathcal{X}$ the histogram contains the number of instances of this element in the database. Now, for every $i \in \mathcal{X}$, we introduce a variable $a_i \ge 0$ that will "count" the (fractional) number of occurrences of $i$ in the fractional database $z$. We will impose the constraint
$$\sum_{i \in \mathcal{X}} a_i = 1.$$
We represent the count of query $q$ in $z$ as the sum of the count of items $i$ that satisfy $q$:
$$\sum_{i \in \mathcal{X} \text{ s.t. } q(i) = 1} a_i.$$
We want all of these counts to be within an additive $\alpha$ accuracy of the respective counts in $v_q$. Writing this as a linear inequality we get:
$$(v_q - \alpha) \sum_{i \in \mathcal{X}} a_i \;\le\; \sum_{i \in \mathcal{X} \text{ s.t. } q(i) = 1} a_i \;\le\; (v_q + \alpha) \sum_{i \in \mathcal{X}} a_i.$$
When the counts are all $\alpha$-accurate with respect to the counts in $v_q$, it is also the case that (with probability $1 - \beta$) they are all $2\alpha$-accurate with respect to the true counts on the original database $x$.

We write a linear program with two such constraints for each query (a total of $2|\mathcal{Q}|$ constraints). $A'$ tries to find a fractional solution to this linear program. To see that such a solution exists, observe that the database $x$ itself is $\alpha$-close to the vector of counts $v$, and so there exists a solution to the linear program (in fact even an integer solution), and hence $A'$ will find some fractional solution.
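The constraints of this linear program are simple to state in code. The sketch below does not solve the program (a real implementation would hand the constraints to an LP solver); it only checks whether a candidate fractional database satisfies them. The dictionary representation of $z$ and the function names are assumptions made for illustration.

```python
def query_mass(a, q):
    """Total weight a_i over universe elements i that satisfy the predicate q."""
    return sum(w for i, w in a.items() if q(i))


def satisfies_constraints(a, queries, v, alpha, tol=1e-9):
    """Check a_i >= 0, sum_i a_i = 1, and the two inequalities for each query.

    a maps universe elements to fractional counts; v[j] is the noisy fractional
    count for queries[j]. tol absorbs floating-point rounding only.
    """
    total = sum(a.values())
    if any(w < 0 for w in a.values()) or abs(total - 1.0) > tol:
        return False
    for q, vq in zip(queries, v):
        mass = query_mass(a, q)
        if mass < (vq - alpha) * total - tol or mass > (vq + alpha) * total + tol:
            return False
    return True
```

In particular, the normalized histogram of the true database passes this check whenever every entry of `v` is within `alpha` of the true fractional count, which is the existence argument used in the proof sketch.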

We conclude that $A'$ can generate a fractional database with $(2\alpha, \beta, 0)$-utility, but we really want a synthetic (integer) database. To transform the fractional database into an integer one, we round down each $a_i$, for $i \in \mathcal{X}$, to the closest multiple of $\alpha/|\mathcal{X}|$. This changes each fractional count by at most an $\alpha/|\mathcal{X}|$ additive factor, and so the rounded counts have $(3\alpha, \beta, 0)$ utility. Now we can treat the rounded fractional database (which has total weight 1) as an integer synthetic database of (polynomial) size at most $|\mathcal{X}|/\alpha$.

Recall that in our application of Theorem 6.7 we defined $A$ to be the mechanism that adds Laplace noise with parameter $\rho/(\varepsilon_{\text{base}}/\sqrt{2k \log(1/\delta_{\text{base}})})$. We have $k$ draws, so by taking
$$\alpha' = \frac{\rho\sqrt{2k \log(1/\delta_{\text{base}})}}{\varepsilon_{\text{base}}}\,(\log k + \log(1/\beta))$$
we have that $A$ is $(\alpha', \beta, 0)$-accurate. For the base generator we chose error $\alpha^2 = (\log|\mathcal{Q}|)/n$. If the output of the syntheticizer is too large, we subsample
$$n' = \frac{\log|\mathcal{Q}| \log(1/\beta)}{\alpha^2} = \frac{\log k \log(1/\beta)}{\alpha^2}$$
rows. With probability $1 - \beta$ the resulting database maintains
$$O\left(\sqrt{(\log|\mathcal{Q}|)/n} + \frac{\rho\sqrt{2k \log(1/\delta_{\text{base}})}}{\varepsilon_{\text{base}}}\,(\log k + \log(1/\beta))\right)$$
accuracy on all of the concepts simultaneously.

Finally, the base generator can fail if the choice of queries $S \in \mathcal{D}^k$ does not lead to good generalization. With the parameters we have chosen this occurs with probability at most $\beta$, leading to a total failure probability of the entire generator of $3\beta$.

Theorem 6.8 (Base Generator for Fractional Linear Queries). For any data universe $\mathcal{X}$, database size $n$, and class $\mathcal{Q} : \{\mathcal{X}^n \to \mathbb{R}\}$ of fractional linear queries (with sensitivity at most $1/n$), for any $\varepsilon_{\text{base}}, \delta_{\text{base}} > 0$, there exists an $(\varepsilon_{\text{base}}, \delta_{\text{base}})$-differentially private $(k, \lambda, 1/3, 3\beta)$-base synopsis generator for $\mathcal{Q}$, where
$$k = O\left(\frac{n \log(|\mathcal{X}|) \log(1/\beta)}{\log|\mathcal{Q}|}\right), \qquad \lambda = O\left(\frac{1}{\sqrt{n}}\left(\sqrt{\log|\mathcal{Q}|} + \sqrt{\frac{\log|\mathcal{X}| \log(1/\beta)}{\log|\mathcal{Q}|}} \cdot \frac{1}{\varepsilon_{\text{base}}}\right)\right).$$

The running time of the base generator is $\text{poly}(|\mathcal{X}|, n, \log(1/\beta), \log(1/\varepsilon_{\text{base}}))$.

The sampling bound used here is the same as that used in the construction of the SmallDB mechanism, but with different parameters. Here we are using these bounds for a base generator in a complicated boosting algorithm with a very small query set; there we are using them for a single-shot generation of a synthetic database with an enormous query set.

6.2.3 Assembling the ingredients

The total error comes from the choice of $\mu$ (see Equation 6.2) and $\lambda$, the accuracy parameter for the base generator. Let us recall Theorem 6.1:

Theorem 6.9 (Theorem 6.1). Let $\mathcal{Q}$ be a query family with sensitivity at most $\rho$. For an appropriate setting of parameters, and with $T = \log|\mathcal{Q}|/\eta^2$ rounds, the algorithm of Figure 6.1 is an accurate and differentially private query-boosting algorithm:

1. When instantiated with a $(k, \lambda, \eta, \beta)$-base synopsis generator, the output of the boosting algorithm gives $(\lambda + \mu)$-accurate answers to all the queries in $\mathcal{Q}$ with probability at least $1 - T\beta$, where
$$\mu \in O\left(\frac{(\log^{3/2}|\mathcal{Q}|)\,\sqrt{k}\,\sqrt{\log(1/\beta)}\,\rho}{\varepsilon_{\text{sample}} \cdot \eta^3}\right). \tag{6.10}$$

2. If the base synopsis generator is $(\varepsilon_{\text{base}}, \delta_{\text{base}})$-differentially private, then the boosting algorithm is $((\varepsilon_{\text{sample}} + T \cdot \varepsilon_{\text{base}}),\ T(\beta + \delta_{\text{base}}))$-differentially private.

By Equation 6.7,
$$\varepsilon_{\text{sample}} \triangleq \sqrt{2kT \log(1/\beta)}\,(4\alpha T\rho/\mu) + kT \left(\frac{4\alpha T\rho}{\mu}\right)^2,$$
where $\alpha = (1/2)\ln((1 + 2\eta)/(1 - 2\eta)) \in O(1)$. We always have $T = (\log|\mathcal{Q}|)/\eta^2$, so substituting this value into the above equation we see that the bound
$$\mu \in O\left(\frac{(\log^{3/2}|\mathcal{Q}|)\,\sqrt{k}\,\sqrt{\log(1/\beta)}\,\rho}{\varepsilon_{\text{sample}} \cdot \eta^3}\right)$$
in the statement of the theorem is acceptable.
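The two quantities that drive this substitution, $\alpha(\eta)$ and $T$, are easy to evaluate. The sketch below uses hypothetical helper names and natural logarithms (an assumption about the intended base). It also evaluates the per-round contraction factor $\frac{1}{2}[(e^{\alpha} + e^{-\alpha}) + 2\eta(e^{-\alpha} - e^{\alpha})]$ from the proof of Claim 6.3, which at $\alpha(\eta)$ equals $\sqrt{1 - 4\eta^2} \le e^{-2\eta^2}$.

```python
import math


def alpha_of_eta(eta):
    """Tuning parameter alpha(eta) = (1/2) ln((1 + 2 eta) / (1 - 2 eta)), for 0 <= eta < 1/2."""
    return 0.5 * math.log((1 + 2 * eta) / (1 - 2 * eta))


def contraction(alpha, eta):
    """Upper bound on Z_t / Z_{t-1} from the proof of Claim 6.3."""
    grow, shrink = math.exp(alpha), math.exp(-alpha)
    return 0.5 * ((grow + shrink) + 2 * eta * (shrink - grow))


def rounds(num_queries, eta):
    """Smallest integer T with T >= ln|Q| / eta^2."""
    return math.ceil(math.log(num_queries) / eta ** 2)
```

For $\eta = 1/3$, as in Theorems 6.6 and 6.8, `alpha_of_eta` gives $\frac{1}{2}\ln 5$, a constant, and the number of rounds grows only logarithmically in $|\mathcal{Q}|$.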

For the case of arbitrary queries, with $\eta$ a constant, we have
$$\lambda \in O\left(\frac{\rho}{\varepsilon_{\text{base}}}\sqrt{n \log|\mathcal{X}| \log(1/\delta_{\text{base}})}\,\bigl(\log(n \log|\mathcal{X}|) + \log(2/\beta)\bigr)\right).$$
Now, $\varepsilon_{\text{boost}} = T\varepsilon_{\text{base}} + \varepsilon_{\text{sample}}$. Set these two terms equal, so $T\varepsilon_{\text{base}} = \varepsilon_{\text{boost}}/2 = \varepsilon_{\text{sample}}$, whence we can replace the $1/\varepsilon_{\text{base}}$ term with $2T/\varepsilon_{\text{boost}} = 2(\log|\mathcal{Q}|/\eta^2)/\varepsilon_{\text{boost}}$. Now our terms for $\lambda$ and $\mu$ have similar denominators, since $\eta$ is constant. We may therefore conclude that the total error is bounded by:
$$\lambda + \mu \in \tilde{O}\left(\frac{\sqrt{n \log|\mathcal{X}|}\,\rho\,\log^{3/2}|\mathcal{Q}|\,(\log(1/\beta))^{3/2}}{\varepsilon_{\text{boost}}}\right).$$
With similar reasoning, for the case of fractional counting queries we get
$$\lambda + \mu \in \tilde{O}\left(\frac{\sqrt{\log|\mathcal{X}| \log(1/\beta)}\,\log^{3/2}|\mathcal{Q}|}{\varepsilon_{\text{boost}}\sqrt{n}}\right).$$
To convert to a bound for ordinary, non-fractional, counting queries we multiply by $n$ to obtain
$$\lambda + \mu \in \tilde{O}\left(\frac{\sqrt{n \log|\mathcal{X}| \log(1/\beta)}\,\log^{3/2}|\mathcal{Q}|}{\varepsilon_{\text{boost}}}\right).$$

6.3 Bibliographical notes

The boosting algorithm (Figure 6.1) is a variant of the AdaBoost algorithm of Schapire and Singer [78]. See Schapire [77] for an excellent survey of boosting, and the textbook "Boosting" by Freund and Schapire [79] for a thorough treatment. The private boosting algorithm covered in this section is due to Dwork et al. [32], which also contains the base generator for linear queries. This base generator, in turn, relies on the syntheticizer of Dwork et al. [28]. In particular, Theorem 6.7 comes from [28]. Dwork, Rothblum, and Vadhan also addressed differentially private boosting in the usual sense.

7 When Worst-Case Sensitivity is Atypical

In this section, we briefly describe two general techniques, both enjoying unconditional privacy guarantees, that can often make life easier for the data analyst, especially when dealing with a function that has arbitrary, or difficult to analyze, worst-case sensitivity. These algorithms are most useful in computing functions that, for some exogenous reason, the analyst has reason to believe are "usually" insensitive in practice.

7.1 Subsample and aggregate

The Subsample and Aggregate technique yields a method for "forcing" the computation of a function $f(x)$ to be insensitive, even for an arbitrary function $f$. Proving privacy will be trivial. Accuracy depends on properties of the function $f$ and the specific data set $x$; in particular, if $f(x)$ can be accurately estimated, with high probability, on $f(S)$, where $S$ is a random subset of the elements in $x$, then accuracy should be good. Many maximum likelihood statistical estimators enjoy this property on "typical" data sets, which is why these estimators are employed in practice.

Figure 7.1: Subsample and Aggregate with a generic differentially private aggregation algorithm $M$.

In Subsample and Aggregate, the $n$ rows of the database $x$ are randomly partitioned into $m$ blocks $B_1, \ldots, B_m$, each of size $n/m$. The function $f$ is computed exactly, without noise, independently on each block. The intermediate outcomes $f(B_1), \ldots, f(B_m)$ are then combined via a differentially private aggregation mechanism; see Figure 7.1. Typical examples apply a standard aggregation, such as the $\alpha$-trimmed mean,¹ the Winsorized mean,² or the median, and then add Laplace noise scaled to the sensitivity of the aggregation function in question, but there are no restrictions.

The key observation in Subsample and Aggregate is that any single element can affect at most one block, and therefore the value of just a single $f(B_i)$. Thus, changing the data of any individual can change at most a single input to the aggregation function. Even if $f$ is arbitrary, the analyst chooses the aggregation function, and so is free to choose one that is insensitive, provided that choice is independent of the database! Privacy is therefore immediate: For any $\delta \ge 0$ and any function $f$, if the aggregation mechanism $M$ is $(\varepsilon, \delta)$-differentially private then so is the Subsample and Aggregate technique when instantiated with $f$ and $M$.³

¹ The $\alpha$-trimmed mean is the mean after the top and bottom $\alpha$ fraction of the inputs have been discarded.

² The Winsorized mean is similar to the $\alpha$-trimmed mean except that, rather than being discarded, the top and bottom $\alpha$ fraction are replaced with the most extreme remaining values.

Utility is a different story, and it is frustratingly difficult to argue even for the case in which data are plentiful and large random subsets are very likely to give similar results. For example, the data may be labeled training points in high dimensional space and the function is logistic regression, which produces a vector $v$ and labels a point $p$ with $+1$ if and only if $p \cdot v \ge T$ for some (say, fixed) threshold $T$. Intuitively, if the samples are sufficiently plentiful and typical then all blocks should yield similar vectors $v$. The difficulty comes in getting a good bound on the worst-case sensitivity of the aggregation function; we may need to use the size of the range as a fallback. Nonetheless, some nice applications are known, especially in the realm of statistical estimators, where, for example, it can be shown that, under the assumption of "generic normality," privacy can be achieved at no additional cost in statistical efficiency (roughly, accuracy as the number of samples grows). We do not define generic normality here, but note that estimators fitting these assumptions include the maximum likelihood estimator for "nice" parametric families of distributions such as Gaussians, and maximum-likelihood estimators for linear regression and logistic regression.

Suppose the function $f$ has a discrete range of cardinality $m$, say, $[m]$. In this case Subsample and Aggregate will need to aggregate a set of $b$ elements drawn from $[m]$, and we can use Report Noisy Arg-Max to find the most popular outcome. This approach to aggregation requires $b \ge \log m$ to obtain meaningful results even when the intermediate outcomes are unanimous. We will see an alternative below with no such requirement.

Example 7.1 (Choosing a Model).
Much work in statistics and machine learning addresses the problem of model selection : Given a data set and a discrete collection of “models,” each of which is a family of proba- bility distributions, the goal is to determine the model that best “fits” 3 The choice of aggregation function can even depend on the database, but the selection must be made in a differentially private fashion. The privacy cost is then the cost of composing the choice operation with the aggregation function.

the data. For example, given a set of labeled $d$-dimensional data, the collection of models might be all subsets of at most $s \ll d$ features, and the goal is to find the set of features that best permits prediction of the labels. The function $f$ might be choosing the best model from the given set of $m$ models, a process known as model fitting, via an arbitrary learning algorithm. Aggregation to find the most popular value could be done via Report Noisy Max, which also yields an estimate of its popularity.

Example 7.2 (Significant Features). This is a special case of model fitting. The data are a collection of points in $\mathbb{R}^d$ and the function is the very popular LASSO, which yields as output a list $L \in [d]^s$ of at most $s \ll d$ significant features. We can aggregate the output in two ways: feature by feature (equivalent to running $d$ executions of Subsample and Aggregate, one for each feature, each with a range of size 2) or on the set as a whole, in which case the cardinality of the range is $\binom{d}{s}$.

7.2 Propose-test-Release

At this point one might ask: what is the meaning of the aggregation if there is not substantial agreement among the blocks? More generally, for any reasonably large-scale statistical analysis in real life, we expect the results to be fairly stable, independent of the presence or absence of any single individual. Indeed, this is the entire intuition behind the significance of a statistic and underlying the utility of differential privacy. We can even go further, and argue that if a statistic is not stable, we should have no interest in computing it. Often, our database will in fact be a sample from a larger population, and our true goal is not to compute the value of the statistic on the database itself, but rather estimate it for the underlying population. Implicitly, therefore, when computing a statistic we are already assuming that the statistic is stable under subsampling!
Everything we have seen so far has provided privacy even on very "idiosyncratic" datasets, for which "typically" stable algorithms may be highly unstable. In this section we introduce a methodology, Propose-Test-Release, which is motivated by the philosophy that if there is

insufficient stability then the analysis can be abandoned because the results are not in fact meaningful. That is, the methodology allows the analyst to check that, on the given dataset, the function satisfies some "robustness" or "stability" criterion and, if it does not, to halt the analysis.

The goal of our first application of Propose-Test-Release is to come up with a variant of the Laplace mechanism that adds noise scaled to something strictly smaller than the sensitivity of a function. This leads to the notion of local sensitivity, which is defined for a (function, database) pair, say, $(f, x)$. Quite simply, the local sensitivity of $f$ with respect to $x$ is the amount by which $f(y)$ can differ from $f(x)$ for any $y$ adjacent to $x$.

Definition 7.1 (Local Sensitivity). The local sensitivity of a function $f : \mathcal{X}^n \to \mathbb{R}^k$ with respect to a database $x$ is:

$$\max_{y \text{ adjacent to } x} \| f(x) - f(y) \|_1 .$$

The Propose-Test-Release approach is to first propose a bound, say $b$, on local sensitivity (typically the data analyst has some idea of what this should be) and then run a differentially private test to ensure that the database is "far" from any database for which this bound fails to hold. If the test is passed, then the sensitivity is assumed to be bounded by $b$, and a differentially private mechanism such as, for example, the Laplace mechanism with parameter $b/\varepsilon$, is used to release the (slightly) noisy response to the query.

Note that we can view this approach as a two-party algorithm where one party plays an honest data analyst and the other is the Laplace mechanism. There is an interplay between the honest analyst and the mechanism in which the algorithm asks for an estimate of the sensitivity and then "instructs" the mechanism to use this estimated sensitivity in responding to subsequent queries. Why does it need to be so complicated?
Why can't the mechanism simply add noise scaled to the local sensitivity without playing this private estimation game? The reason is that the local sensitivity may itself be sensitive. This fact, combined with some auxiliary information about the database, can lead to privacy problems: the adversary may know that the database is one of $x$,

which has very low local sensitivity for the computation in question, and a neighboring $y$, for which the function has very high local sensitivity. In this case the adversary may be able to guess rather accurately which of $x$ and $y$ is the true database. For example, if $f(x) = f(y) = s$ and the response is far from $s$, then the adversary would guess $y$.

This is captured by the math of differential privacy. There are neighboring instances of the median function which have the same median, say, $m$, but arbitrarily large gaps in the local sensitivity. Suppose the response $R$ to the median query is computed via the Laplace mechanism with noise scaled to the local sensitivity. When the database is $x$ the probability mass is close to $m$, because the sensitivity is small, but when the database is $y$ the mass is far flung, because the sensitivity is large. As an extreme case, suppose the local sensitivity on $x$ is exactly zero, for example, $\mathcal{X} = \{0, 10^6\}$, $n$ is even, and $x$, which has size $n + 1$, contains $1 + n/2$ zeros. Then the median of $x$ is zero and the local sensitivity of the median, when the database is $x$, is $0$. In contrast, the neighboring database $y$ has size $n$, contains $n/2$ zeros, has median zero (we have defined median to break ties in favor of the smaller value), and the local sensitivity of the median, when the database is $y$, is $10^6$. On $x$ all the mass of the Laplace mechanism (with parameter $0/\varepsilon = 0$) is concentrated on the single point $0$; but on $y$ the probability distribution has standard deviation $\sqrt{2} \cdot 10^6$. This destroys all hope of differential privacy.

To test that the database is "far" from one with local sensitivity greater than the proposed bound $b$, we may pose the query: "What is the distance of the true database to the closest one with local sensitivity exceeding $b$?" Distance to a fixed set of databases is a (global) sensitivity 1 query, so this test can be run in a differentially private fashion by adding noise $\mathrm{Lap}(1/\varepsilon)$ to the true answer. To err on the side of privacy, the algorithm can compare this noisy distance to a conservative threshold, one that is only negligibly likely to be exceeded due to a freak event of very large magnitude Laplace noise. For example, if the threshold used is, say, $\ln^2 n$, the probability of a false positive (passing the test when the local sensitivity in fact exceeds $b$) is at most $O(n^{-\varepsilon \ln n})$, by the properties of the Laplace distribution. Because of the negligible probability of a false positive, the technique cannot yield $(\varepsilon, 0)$-differential privacy for any $\varepsilon$.
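As a rough illustration of the two preceding paragraphs, the Python sketch below first checks the median example by brute force over the two-point universe $\{0, 10^6\}$, and then shows the shape of the test-then-release step. All function names are hypothetical, the Laplace sampler is built from the standard library, and the caller is assumed to supply the true distance to the nearest database whose local sensitivity exceeds $b$; computing that distance is problem-specific and is not shown.

```python
import math
import random

UNIVERSE = (0, 10**6)  # the two-point data universe from the median example


def lower_median(data):
    """Median that breaks ties in favor of the smaller value."""
    s = sorted(data)
    return s[(len(s) - 1) // 2]


def local_sensitivity_of_median(data):
    """Brute force: largest change in the median over all neighbors obtained
    by adding one universe element or removing one entry."""
    base = lower_median(data)
    neighbors = [data + [u] for u in UNIVERSE]
    neighbors += [data[:i] + data[i + 1:] for i in range(len(data))]
    return max(abs(lower_median(nb) - base) for nb in neighbors if nb)


def laplace(scale, rng):
    """Lap(scale) sample: difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def propose_test_release(value, distance, b, eps, n, rng):
    """`value` is f(x). `distance` is the caller-supplied (assumed) distance from
    x to the nearest database whose local sensitivity exceeds the proposed b."""
    threshold = math.log(n) ** 2              # conservative threshold ln^2 n
    if distance + laplace(1.0 / eps, rng) <= threshold:
        return None                           # test failed: output "bottom"
    return value + laplace(b / eps, rng)      # Laplace mechanism, parameter b/eps


n = 4
x = [0] * (1 + n // 2) + [10**6] * (n // 2)   # size n + 1
y = [0] * (n // 2) + [10**6] * (n // 2)       # size n, a neighbor of x
print(local_sensitivity_of_median(x), local_sensitivity_of_median(y))
```

The two printed values differ by six orders of magnitude on neighboring databases, which is why noise scaled directly to local sensitivity leaks information, while the test compares only a sensitivity-1 quantity against the threshold.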

To apply this methodology to consensus on blocks, as in our discussion of Subsample and Aggregate, view the intermediate results $f(B_1), \ldots, f(B_m)$ as a data set and consider some measure of the concentration of these values. Intuitively, if the values are tightly concentrated then we have consensus among the blocks. Of course, we still need to find the correct notion of concentration, one that is meaningful and that has a differentially private instantiation. In a later section we will define and weave together two notions of stability that seem relevant to Subsample and Aggregate: insensitivity (to the removal or addition of a few data points) and stability under subsampling, capturing the notion that a subsample should yield similar results to the full data set.

7.2.1 Example: the scale of a dataset

Given a dataset, a natural question to ask is, "What is the scale, or dispersion, of the dataset?" This is a different question from data location, which might be captured by the median or the mean. The data scale is more often captured by the variance or an interquantile range. We will focus on the interquartile range (IQR), a well-known robust estimator for the scale of the data. We begin with some rough intuition. Suppose the data are i.i.d. samples drawn from a distribution with cumulative distribution function $F$. Then $\mathrm{IQR}(F)$, defined as $F^{-1}(3/4) - F^{-1}(1/4)$, is a constant, depending only on $F$. It might be very large, or very tiny, but either way, if the density of $F$ is sufficiently high at the two quartiles, then, given enough samples from $F$, the empirical (that is, sample) interquartile distance should be close to $\mathrm{IQR}(F)$.

Our Propose-Test-Release algorithm for the interquartile distance first tests how many database points need to be changed to obtain a data set with a "sufficiently different" interquartile distance.
Only if the (noisy) reply is "sufficiently large" will the algorithm release an approximation to the interquartile range of the dataset.

The definition of "sufficiently different" is multiplicative, as an additive notion for difference of scale makes no sense: what would be the right scale for the additive amount? The algorithm therefore works with the logarithm of the scale, which leads to a multiplicative noise

on the IQR. To see this, suppose that, as in what might be the typical case, the sample interquartile distance cannot change by a factor of 2 by modifying a single point. Then the logarithm (base 2) of the sample interquartile range has local sensitivity bounded by 1. This lets us privately release an approximation to the logarithm of the sample interquartile range by adding to this value a random draw from $\mathrm{Lap}(1/\varepsilon)$.

Let $\mathrm{IQR}(x)$ denote the sample interquartile range when the data set is $x$. The algorithm is (implicitly) proposing to add noise drawn from $\mathrm{Lap}(1/\varepsilon)$ to the value $\log_2(\mathrm{IQR}(x))$. To test whether this magnitude of noise is sufficient for differential privacy, we discretize $\mathbb{R}$ into disjoint bins $\{[k \ln 2, (k+1) \ln 2)\}_{k \in \mathbb{Z}}$ and ask how many data points must be modified in order to obtain a new database, the logarithm (base 2) of whose interquartile range is in a different bin than that of $\log_2(\mathrm{IQR}(x))$. If the answer is at least two then the local sensitivity (of the logarithm of the interquartile range) is bounded by the bin width. We now give more details.

To understand the choice of bin size, we write

$$\log_2(\mathrm{IQR}(x)) = \frac{\ln \mathrm{IQR}(x)}{\ln 2} = \frac{c \ln 2}{\ln 2},$$

whence we find that looking at $\ln(\mathrm{IQR}(x))$ on the scale of $\ln 2$ is equivalent to looking at $\log_2(\mathrm{IQR}(x))$ on the scale of 1. Thus we have scaled bins which are intervals whose endpoints are a pair of adjacent integers: $B_k = [k, k+1)$, $k \in \mathbb{Z}$, and we let $k_1 = \lfloor \log_2(\mathrm{IQR}(x)) \rfloor$, so $\log_2(\mathrm{IQR}(x)) \in [k_1, k_1 + 1)$ and we say informally that the logarithm of the IQR is in bin $k_1$. Consider the following testing query:

$Q_0$: How many data points need to change in order to get a new database $z$ such that $\log_2(\mathrm{IQR}(z)) \notin B_{k_1}$?

Let $A_0(x)$ be the true answer to $Q_0$ when the database is $x$. If $A_0(x) \ge 2$, then neighbors $y$ of $x$ satisfy $|\log_2(\mathrm{IQR}(y)) - \log_2(\mathrm{IQR}(x))| \le 1$. That is, they are close to each other. This is not equivalent to being in the same interval in the discretization: $\log_2(\mathrm{IQR}(x))$ may lie close to one of the endpoints of the interval $[k_1, k_1 + 1)$ and $\log_2(\mathrm{IQR}(y))$ may lie just on the other side of the endpoint. Letting $R_0 = A_0(x) + \mathrm{Lap}(1/\varepsilon)$, a small $R_0$, even when the

draw from the Laplace distribution has small magnitude, might not actually indicate high sensitivity of the interquartile range. To cope with the case that the local sensitivity is very small, but $\log_2(\mathrm{IQR}(x))$ is very close to the boundary, we consider a second discretization $\{B^{(2)}_k = [k - 0.5, k + 0.5)\}_{k \in \mathbb{Z}}$. We denote the two discretizations by $B^{(1)}$ and $B^{(2)}$ respectively. The value $\log_2(\mathrm{IQR}(x))$ (indeed, any value) cannot be close to a boundary in both discretizations. The test is passed if $R_0$ is large in at least one discretization.

The Scale algorithm (Algorithm 12) below for computing database scale assumes that $n$, the size of the database, is known, and the distance query ("How far to a database whose interquartile range has sensitivity exceeding $b$?") is asking how many points must be moved to reach a database with high sensitivity of the IQR. We can avoid this assumption by having the algorithm first ask the (sensitivity 1) query: "How many data points are in $x$?" We remark that, for technical reasons, to cope with the case $\mathrm{IQR}(x) = 0$, we define $\log 0 = -\infty$, $\lfloor -\infty \rfloor = -\infty$, and let $[-\infty, -\infty) = \{-\infty\}$.

Algorithm 12 The Scale Algorithm (releasing the interquartile range)
Require: dataset: $x \in \mathcal{X}^*$, privacy parameters: $\varepsilon, \delta > 0$
 1: for the $j$th discretization ($j = 1, 2$) do
 2:   Compute $R_0(x) = A_0(x) + z_0$, where $z_0 \in_R \mathrm{Lap}(1/\varepsilon)$.
 3:   if $R_0 \le 1 + \ln(1/\delta)$ then
 4:     Let $s^{(j)} = \bot$.
 5:   else
 6:     Let $s^{(j)} = \mathrm{IQR}(x) \times 2^{z_s^{(j)}}$, where $z_s^{(j)} \sim \mathrm{Lap}(1/\varepsilon)$.
 7:   end if
 8: end for
 9: if $s^{(1)} \ne \bot$ then
10:   Return $s^{(1)}$.
11: else
12:   Return $s^{(2)}$.
13: end if
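The control flow of Algorithm 12 can be sketched in Python as follows. This is only an outline under stated assumptions: `distance_oracle(x, j)` is a hypothetical caller-supplied function returning the true answer $A_0(x)$ for the $j$th discretization (its computation is not shown here), the sample IQR uses the standard library's default quartile convention, which the text does not specify, and `None` stands in for $\bot$.

```python
import math
import random
import statistics


def laplace(scale, rng):
    """Lap(scale) sample: difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def sample_iqr(x):
    """Sample interquartile range (quartile convention is an assumption)."""
    q1, _, q3 = statistics.quantiles(x, n=4)
    return q3 - q1


def scale_algorithm(x, eps, delta, distance_oracle, rng):
    """Mirror of Algorithm 12; `distance_oracle(x, j)` is assumed to return A_0(x)
    with respect to discretization j (1: [k, k+1), 2: [k-0.5, k+0.5))."""
    results = {}
    for j in (1, 2):
        r0 = distance_oracle(x, j) + laplace(1.0 / eps, rng)
        if r0 <= 1 + math.log(1.0 / delta):
            results[j] = None                     # test failed for this discretization
        else:
            z_s = laplace(1.0 / eps, rng)         # noise on the log (base 2) scale
            results[j] = sample_iqr(x) * 2.0 ** z_s
    return results[1] if results[1] is not None else results[2]
```

Because the noise is added to the base-2 logarithm, the released value is the sample IQR multiplied by a random factor $2^{z}$, which is the multiplicative noise described above.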

Note that the algorithm is efficient: let $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ denote the $n$ database points after sorting, and let $x_{(m)}$ denote the median, so $m = \lfloor (n+1)/2 \rfloor$. Then the local sensitivity of the median is $\max\{x_{(m)} - x_{(m-1)},\; x_{(m+1)} - x_{(m)}\}$ and, more importantly, one can compute $A_0(x)$ by considering $O(n)$ sliding intervals with width $2^{k_1}$ and $2^{k_1 + 1}$, each having one endpoint in $x$. The computational cost for each interval is constant.

We will not prove convergence bounds for this algorithm because, for the sake of simplicity, we have used a base for the logarithm that is far from optimal (a better base is $1 + 1/\ln n$). We briefly outline the steps in the proof of privacy.

Theorem 7.1. Algorithm Scale (Algorithm 12) is $(4\varepsilon, \delta)$-differentially private.

Proof. (Sketch.) Letting $s$ be shorthand for the result obtained with a single discretization, and defining $D_0 = \{x : A_0(x) \ge 2\}$, the proof shows:

1. The worst-case sensitivity of query $Q_0$ is at most 1.

2. Neighboring databases are almost equally likely to result in $\bot$: For all neighboring databases $x, y$:
   $$\Pr[s = \bot \mid x] \le e^{\varepsilon} \Pr[s = \bot \mid y].$$

3. Databases not in $D_0$ are unlikely to pass the test:
   $$\forall x \notin D_0 : \Pr[s \ne \bot \mid x] \le \frac{\delta}{2}.$$

4. $\forall C \subseteq \mathbb{R}^+$, $x \in D_0$ and all neighbors $y$ of $x$:
   $$\Pr[s \in C \mid x] \le e^{2\varepsilon} \Pr[s \in C \mid y].$$

Thus, we get $(2\varepsilon, \delta/2)$-differential privacy for each discretization. Applying Theorem 3.16 (Appendix B), which says that "the epsilons and the deltas add up," yields $(4\varepsilon, \delta)$-differential privacy.

7.3 Stability and privacy

7.3.1 Two notions of stability

We begin by making a distinction between the two notions of stability intertwined in this section: stability under subsampling, which yields similar results under random subsamples of the data, and perturbation stability, or low local sensitivity, for a given dataset. In this section we will define and make use of extreme versions of both of these.

• Subsampling stability: We say $f$ is $q$-subsampling stable on $x$ if $f(\hat{x}) = f(x)$ with probability at least $3/4$ when $\hat{x}$ is a random subsample from $x$ which includes each entry independently with probability $q$. We will use this notion in Algorithm $A_{\mathrm{samp}}$, a variant of Subsample and Aggregate.

• Perturbation stability: We say that $f$ is stable on $x$ if $f$ takes the value $f(x)$ on all of the neighbors of $x$ (and unstable otherwise). In other words, $f$ is stable on $x$ if the local sensitivity of $f$ on $x$ is zero. We will use this notion (implemented in Algorithm $A_{\mathrm{dist}}$ below) for the aggregation step of $A_{\mathrm{samp}}$.

At the heart of Algorithm $A_{\mathrm{samp}}$ is a relaxed version of perturbation stability, where instead of requiring that the value be unchanged on neighboring databases (a notion that makes sense for arbitrary ranges, including arbitrary discrete ranges) we require only that the value be "close" on neighboring databases (a notion that requires a metric on the range).

Functions $f$ with arbitrary ranges, and in particular the problem of aggregating outputs in Subsample and Aggregate, motivate the next algorithm, $A_{\mathrm{dist}}$. On input $f, x$, $A_{\mathrm{dist}}$ outputs $f(x)$ with high probability if $x$ is at distance at least $\frac{2 \log(1/\delta)}{\varepsilon}$ from the nearest unstable data set. The algorithm is conceptually trivial: compute the distance to the nearest unstable data set, add Laplace noise $\mathrm{Lap}(1/\varepsilon)$, and check that this noisy distance is at least $\frac{2 \log(1/\delta)}{\varepsilon}$. If so, release $f(x)$, otherwise output $\bot$.
We now make this a little more formal. We begin by defining a quantitative measure of perturbation stability.

Definition 7.2. A function $f : \mathcal{X}^* \to \mathcal{R}$ is $k$-stable on input $x$ if adding or removing any $k$ elements from $x$ does not change the value of $f$, that is, $f(x) = f(y)$ for all $y$ such that $|x \triangle y| \le k$. We say $f$ is stable on $x$ if it is (at least) 1-stable on $x$, and unstable otherwise.

Definition 7.3. The distance to instability of a data set $x \in \mathcal{X}^*$ with respect to a function $f$ is the number of elements that must be added to or removed from $x$ to reach a data set that is not stable under $f$.

Note that $f$ is $k$-stable on $x$ if and only if the distance of $x$ to instability is at least $k$.

Algorithm $A_{\mathrm{dist}}$, an instantiation of Propose-Test-Release for discrete-valued functions $g$, appears as Algorithm 13.

Algorithm 13 $A_{\mathrm{dist}}$ (releasing $g(x)$ based on distance to instability)
Require: dataset: $x \in \mathcal{X}^*$, privacy parameters: $\varepsilon, \delta > 0$, function $g : \mathcal{X}^* \to \mathcal{R}$
1: $d \leftarrow$ distance from $x$ to nearest unstable instance
2: $\hat{d} \leftarrow d + \mathrm{Lap}(1/\varepsilon)$
3: if $\hat{d} > \frac{\log(1/\delta)}{\varepsilon}$ then
4:   Output $g(x)$
5: else
6:   Output $\bot$
7: end if

The proof of the following proposition is immediate from the properties of the Laplace distribution.

Proposition 7.2. For every function $g$:

1. $A_{\mathrm{dist}}$ is $(\varepsilon, \delta)$-differentially private.

2. For all $\beta > 0$: if $g$ is $\frac{\ln(1/\delta) + \ln(1/\beta)}{\varepsilon}$-stable on $x$, then $A_{\mathrm{dist}}(x) = g(x)$ with probability at least $1 - \beta$, where the probability space is the coin flips of $A_{\mathrm{dist}}$.

This distance-based result is the best possible, in the following sense: if there are two data sets $x$ and $y$ for which $A_{\mathrm{dist}}$ outputs different

values $g(x)$ and $g(y)$, respectively, with at least constant probability, then the distance from $x$ to $y$ must be $\Omega(\log(1/\delta)/\varepsilon)$.

Distance to instability can be difficult to compute, or even to lower bound, so this is not in general a practical solution. Two examples where distance to instability turns out to be easy to bound are the median and the mode (most frequently occurring value).

$A_{\mathrm{dist}}$ may also be unsatisfactory if the function, say $f$, is not stable on the specific datasets of interest. For example, suppose $f$ is not stable because of the presence of a few outliers in $x$. Instances of the average behave this way, although for this function there are well-known robust alternatives such as the Winsorized mean, the trimmed mean, and the median. But what about for general functions $f$? Is there a method of "forcing" an arbitrary $f$ to be stable on a database $x$?

This will be the goal of $A_{\mathrm{samp}}$, a variant of Subsample and Aggregate that outputs $f(x)$ with high probability (over its own random choices) whenever $f$ is subsampling stable on $x$.

7.3.2 Algorithm $A_{\mathrm{samp}}$

In $A_{\mathrm{samp}}$, the blocks $B_1, \ldots, B_m$ are chosen with replacement, so that each block has the same distribution as the inputs (although now an element of $x$ may appear in multiple blocks). We will call these subsampled datasets $\hat{x}_1, \ldots, \hat{x}_m$. The intermediate outputs $z = \{f(\hat{x}_1), \ldots, f(\hat{x}_m)\}$ are then aggregated via $A_{\mathrm{dist}}$ with function $g = \mathrm{mode}$. The distance measure used to estimate the stability of the mode on $z$ is a scaled version of the difference between the popularity of the mode and that of the second most frequent value. Algorithm $A_{\mathrm{samp}}$ appears as Algorithm 14. Its running time is dominated by running $f$ about $1/q^2$ times; hence it is efficient whenever $f$ is.

The key property of Algorithm $A_{\mathrm{samp}}$ is that, on input $f, x$, it outputs $f(x)$ with high probability, over its own random choices, whenever $f$ is $q$-subsampling stable on $x$ for $q = \frac{\varepsilon}{64 \log(1/\delta)}$. This result has an important statistical interpretation. Recall the discussion of model selection from Example 7.1. Given a collection of models, the sample complexity of model selection is the number of samples from a distribution in one of the models necessary to select the correct model

with probability at least $2/3$. The result says that differentially private model selection increases the sample complexity of (non-private) model selection by a problem-independent (and range-independent) factor of $O(\log(1/\delta)/\varepsilon)$.

Algorithm 14 $A_{\mathrm{samp}}$: Bootstrapping for Subsampling-Stable $f$
Require: dataset: $x$, function $f : \mathcal{X}^* \to \mathcal{R}$, privacy parameters $\varepsilon, \delta > 0$.
 1: $q \leftarrow \frac{\varepsilon}{64 \ln(1/\delta)}$, $m \leftarrow \frac{\log(n/\delta)}{q^2}$.
 2: Subsample $m$ data sets $\hat{x}_1, \ldots, \hat{x}_m$ from $x$, where $\hat{x}_i$ includes each position of $x$ independently with probability $q$.
 3: if some element of $x$ appears in more than $2mq$ sets $\hat{x}_i$ then
 4:   Halt and output $\bot$.
 5: else
 6:   $z \leftarrow \{f(\hat{x}_1), \cdots, f(\hat{x}_m)\}$.
 7:   For each $r \in \mathcal{R}$, let $\mathrm{count}(r) = \#\{i : f(\hat{x}_i) = r\}$.
 8:   Let $\mathrm{count}_{(i)}$ denote the $i$th largest count, $i = 1, 2$.
 9:   $d \leftarrow (\mathrm{count}_{(1)} - \mathrm{count}_{(2)})/(4mq) - 1$
10:   Comment: Now run $A_{\mathrm{dist}}(g, z)$ using $d$ to estimate distance to instability:
11:   $\hat{d} \leftarrow d + \mathrm{Lap}(\frac{1}{\varepsilon})$.
12:   if $\hat{d} > \ln(1/\delta)/\varepsilon$ then
13:     Output $g(z) = \mathrm{mode}(z)$.
14:   else
15:     Output $\bot$.
16:   end if
17: end if

Theorem 7.3.

1. Algorithm $A_{\mathrm{samp}}$ is $(\varepsilon, \delta)$-differentially private.

2. If $f$ is $q$-subsampling stable on input $x$ where $q = \frac{\varepsilon}{64 \ln(1/\delta)}$, then algorithm $A_{\mathrm{samp}}(x)$ outputs $f(x)$ with probability at least $1 - 3\delta$.

3. If $f$ can be computed in time $T(n)$ on inputs of length $n$, then $A_{\mathrm{samp}}$ runs in expected time $O\!\left(\frac{\log n}{q^2}\right)(T(qn) + n)$.
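A direct transcription of Algorithm 14 into Python might look like the sketch below. It assumes natural logarithms throughout, rounds $m$ up to an integer, represents $\bot$ as `None`, and requires the outputs of `f` to be hashable; the function names and the majority-vote example are illustrative only. The example uses an unrealistically large $\varepsilon = 8$ purely so that $m = \ln(n/\delta)/q^2$ stays small enough to run quickly.

```python
import math
import random
from collections import Counter


def laplace(scale, rng):
    """Lap(scale) sample: difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def a_samp(x, f, eps, delta, rng):
    n = len(x)
    q = eps / (64.0 * math.log(1.0 / delta))           # step 1
    m = math.ceil(math.log(n / delta) / q ** 2)
    # Step 2: each subsample keeps every position independently with probability q.
    subsamples = [[i for i in range(n) if rng.random() < q] for _ in range(m)]
    appearances = Counter(i for sub in subsamples for i in sub)
    # Steps 3-4: halt if some position appears in more than 2mq subsamples.
    if appearances and max(appearances.values()) > 2 * m * q:
        return None
    # Steps 6-8: evaluate f on every subsample and rank the outcome counts.
    z = [f([x[i] for i in sub]) for sub in subsamples]
    ranked = Counter(z).most_common(2)
    top = ranked[0][1]
    second = ranked[1][1] if len(ranked) > 1 else 0
    d = (top - second) / (4 * m * q) - 1                # step 9
    # Steps 11-15: the A_dist test on the noisy distance estimate.
    d_hat = d + laplace(1.0 / eps, rng)
    if d_hat > math.log(1.0 / delta) / eps:
        return ranked[0][0]                             # mode(z)
    return None


def majority(bits):
    """Illustrative f: 1 if strictly more ones than zeros, else 0."""
    return 1 if 2 * sum(bits) > len(bits) else 0


data = [1] * 190 + [0] * 10
result = a_samp(data, majority, eps=8.0, delta=0.05, rng=random.Random(0))
print(result)
```

With the prescribed $q$, a unanimous vote gives $d = 1/(4q) - 1 = 16\ln(1/\delta)/\varepsilon - 1$, comfortably above the threshold $\ln(1/\delta)/\varepsilon$, which is the slack that the utility analysis relies on.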

Note that the utility statement here is an input-by-input guarantee; $f$ need not be $q$-subsampling stable on all inputs. Importantly, there is no dependence on the size of the range $\mathcal{R}$. In the context of model selection, this means that one can efficiently satisfy differential privacy with a modest blowup in sample complexity (about $\log(1/\delta)/\varepsilon$) whenever there is a particular model that gets selected with reasonable probability.

The proof of privacy comes from the insensitivity of the computation of $d$, the privacy of the Propose-Test-Release technique, and the privacy of Subsample and Aggregate, modified slightly to allow for the fact that this algorithm performs sampling with replacement and thus the aggregator has higher sensitivity, since any individual might affect up to $2mq$ blocks. The main observation for analyzing the utility of this approach is that the stability of the mode is a function of the difference between the frequency of the mode and that of the next most popular element. The next lemma says that if $f$ is subsampling stable on $x$, then $x$ is far from unstable with respect to the mode $g(z) = g(f(\hat{x}_1), \ldots, f(\hat{x}_m))$ (but not necessarily with respect to $f$), and moreover one can estimate the distance to instability of $x$ efficiently and privately.

Lemma 7.4. Fix $q \in (0, 1)$. Given $f : \mathcal{X}^* \to \mathcal{R}$, let $\hat{f} : \mathcal{X}^* \to \mathcal{R}$ be the function $\hat{f} = \mathrm{mode}(f(\hat{x}_1), \ldots, f(\hat{x}_m))$ where each $\hat{x}_i$ includes each element of $x$ independently with probability $q$ and $m = \ln(n/\delta)/q^2$. Let $d(z) = (\mathrm{count}_{(1)} - \mathrm{count}_{(2)})/(4mq) - 1$; that is, given a "database" $z$ of values, $d(z) + 1$ is a scaled difference between the number of occurrences of the two most popular values. Fix a data set $x$. Let $E$ be the event that no position of $x$ is included in more than $2mq$ of the subsets $\hat{x}_i$. Then, when $q \le \varepsilon/(64 \ln(1/\delta))$ we have:

1. $E$ occurs with probability at least $1 - \delta$.

2. Conditioned on $E$, $d$ lower bounds the stability of $\hat{f}$ on $x$, and $d$ has global sensitivity 1.

3. If $f$ is $q$-subsampling stable on $x$, then with probability at least $1 - \delta$ over the choice of subsamples, we have $\hat{f}(x) = f(x)$, and, conditioned on this event, the final test will be passed with

probability at least $1 - \delta$, where the probability is over the draw from $\mathrm{Lap}(1/\varepsilon)$.

The events in Parts 2 and 3 occur simultaneously with probability at least $1 - 2\delta$.

Proof. Part 1 follows from the Chernoff bound. To prove Part 2, notice that, conditioned on the event $E$, adding or removing one entry in the original data set changes any of the counts $\mathrm{count}(r)$ by at most $2mq$. Therefore, $\mathrm{count}_{(1)} - \mathrm{count}_{(2)}$ changes by at most $4mq$. This in turn means that $d(f(\hat{x}_1), \ldots, f(\hat{x}_m))$ changes by at most one for any $x$ and hence has global sensitivity of one. This also implies that $d$ lower bounds the stability of $\hat{f}$ on $x$.

We now turn to Part 3. We want to argue two facts:

1. If $f$ is $q$-subsampling stable on $x$, then there is likely to be a large gap between the counts of the two most popular bins. Specifically, we want to show that with high probability $\mathrm{count}_{(1)} - \mathrm{count}_{(2)} \ge m/4$. Note that if the most popular bin has count at least $5m/8$ then the second most popular bin can have count at most $3m/8$, with a difference of $m/4$. By definition of subsampling stability the most popular bin has an expected count of at least $3m/4$ and hence, by the Chernoff bound, taking $\alpha = 1/8$, has probability at most $e^{-2m\alpha^2} = e^{-m/32}$ of having a count less than $5m/8$. (All the probabilities are over the subsampling.)

2. When the gap between the counts of the two most popular bins is large, then the algorithm is unlikely to fail; that is, the test is likely to succeed. The worry is that the draw from $\mathrm{Lap}(\frac{1}{\varepsilon})$ will be negative and have large absolute value, so that $\hat{d}$ falls below the threshold ($\ln(1/\delta)/\varepsilon$) even when $d$ is large. To ensure this happens with probability at most $\delta$ it suffices that $d > 2\ln(1/\delta)/\varepsilon$. By definition, $d = (\mathrm{count}_{(1)} - \mathrm{count}_{(2)})/(4mq) - 1$, and, assuming we are in the high probability case just described, this implies

$$d \ge \frac{m/4}{4mq} - 1 = \frac{1}{16q} - 1$$

so it is enough to have

$$\frac{1}{16q} > 2\ln(1/\delta)/\varepsilon.$$

Taking $q \le \varepsilon/(64 \ln(1/\delta))$ suffices.

Finally, note that with these values of $q$ and $m$ we have $e^{-m/32} < \delta$.

Example 7.3. [The Raw Data Problem] Suppose we have an analyst whom we can trust to follow instructions and only publish information obtained according to these instructions. Better yet, suppose we have $b$ such analysts, and we can trust them not to communicate among themselves. The analysts do not need to be identical, but they do need to be considering a common set of options. For example, these options might be different statistics in a fixed set $S$ of possible statistics, and in this first step the analyst's goal is to choose, for eventual publication, the most significant statistic in $S$. Later, the chosen statistic will be recomputed in a differentially private fashion, and the result can be published.

As described the procedure is not private at all: the choice of statistic made in the first step may depend on the data of a single individual! Nonetheless, we can use the Subsample-and-Aggregate framework to carry out the first step, with the $i$th analyst receiving a subsample of the data points and applying to this smaller database the function $f_i$ to obtain an option. The options are then aggregated as in algorithm $A_{\mathrm{samp}}$; if there is a clear winner this is overwhelmingly likely to be the selected statistic. This was chosen in a differentially private manner, and in the second step it will be computed with differential privacy.

Bibliographic Notes

Subsample and Aggregate was invented by Nissim, Raskhodnikova, and Smith [68], who were the first to define and exploit low local sensitivity. Propose-Test-Release is due to Dwork and Lei [22], as is the algorithm for releasing the interquartile range. The discussion of stability and privacy, and Algorithm $A_{\mathrm{samp}}$ which blends these two techniques, is due to Smith and Thakurta [80].
This paper demonstrates the power of

$A_{\mathrm{samp}}$ by analyzing the subsampling stability conditions of the famous LASSO algorithm and showing that differential privacy can be obtained "for free," via (a generalization of $A_{\mathrm{samp}}$), precisely under the (fixed data as well as distributional) conditions for which LASSO is known to have good explanatory power.

8 Lower Bounds and Separation Results

In this section, we investigate various lower bounds and tradeoffs:

1. How inaccurate must responses be in order not to completely destroy any reasonable notion of privacy?

2. How does the answer to the previous question depend on the number of queries?

3. Can we separate $(\varepsilon, 0)$-differential privacy from $(\varepsilon, \delta)$-differential privacy in terms of the accuracy each permits?

4. Is there an intrinsic difference between what can be achieved for linear queries and for arbitrary low-sensitivity queries while maintaining $(\varepsilon, 0)$-differential privacy?

A different flavor of separation result distinguishes the computational complexity of generating a data structure handling all the queries in a given class from that of generating a synthetic database that achieves the same goal. We postpone a discussion of this result to Section 9.

8.1 Reconstruction attacks

We argued in Section 1 that any non-trivial mechanism must be randomized. It follows that, at least for some database, query, and choice of random bits, the response produced by the mechanism is not perfectly accurate. The question of how inaccurate answers must be in order to protect privacy makes sense in all computational models: interactive, non-interactive, and the models discussed in Section 12.

For the lower bounds on distortion, we assume for simplicity that the database consists of a single (but very sensitive) bit per person, so we can think of the database as an $n$-bit Boolean vector $d = (d_1, \ldots, d_n)$. This is an abstraction of a setting in which the database rows are quite complex, for example, they may be medical records, but the attacker is interested in one specific field, such as the presence or absence of the sickle cell trait. The abstracted attack consists of issuing a string of queries, each described by a subset $S$ of the database rows. The query is asking how many 1's are in the selected rows. Representing the query as the $n$-bit characteristic vector $S$ of the set $S$, with 1s in all the positions corresponding to rows in $S$ and 0s everywhere else, the true answer to the query is the inner product $A(S) = \sum_{i=1}^{n} d_i S_i$.

Fix an arbitrary privacy mechanism. We will let $r(S)$ denote the response to the query $S$. This may be obtained explicitly, say, if the mechanism is interactive and the query $S$ is issued, or if the mechanism is given all the queries in advance and produces a list of answers, or implicitly, which occurs if the mechanism produces a synopsis from which the analyst extracts $r(S)$. Note that $r(S)$ may depend on random choices made by the mechanism and the history of queries. Let $E(S, r(S))$ denote the error, also called noise or distortion, of the response $r(S)$, so $E(S, r(S)) = |A(S) - r(S)|$.
The question we want to ask is, “How much noise is needed in order to preserve privacy?” Differential privacy is a specific privacy guarantee, but one might also consider weaker notions, so rather than guaranteeing privacy the modest goal in the lower bound arguments will simply be to prevent privacy catastrophes.

Definition 8.1. A mechanism is blatantly non-private if an adversary can construct a candidate database $c$ that agrees with the real database $d$ in all but $o(n)$ entries, i.e., $\|c - d\|_0 \in o(n)$.

In other words, a mechanism is blatantly non-private if it permits a reconstruction attack that allows the adversary to correctly guess the secret bit of all but $o(n)$ members of the database. (There is no requirement that the adversary know on which answers it is correct.)

Theorem 8.1. Let $\mathcal{M}$ be a mechanism with distortion of magnitude bounded by $E$. Then there exists an adversary that can reconstruct the database to within $4E$ positions.

An easy consequence of the theorem is that a privacy mechanism adding noise with magnitude always bounded by, say, $n/401$, permits an adversary to correctly reconstruct 99% of the entries.

Proof. Let $d$ be the true database. The adversary attacks in two phases:

1. Estimate the number of 1s in all possible sets: Query $\mathcal{M}$ on all subsets $S \subseteq [n]$.
2. Rule out "distant" databases: For every candidate database $c \in \{0, 1\}^n$, if $\exists S \subseteq [n]$ such that $|\sum_{i \in S} c_i - \mathcal{M}(S)| > E$, then rule out $c$. If $c$ is not ruled out, then output $c$ and halt.

Since $\mathcal{M}(S)$ never errs by more than $E$, the real database will not be ruled out, so this simple (but inefficient!) algorithm will output some candidate database $c$. We will argue that the number of positions in which $c$ and $d$ differ is at most $4 \cdot E$.

Let $I_0$ be the indices in which $d_i = 0$, that is, $I_0 = \{i \mid d_i = 0\}$. Similarly, define $I_1 = \{i \mid d_i = 1\}$. Since $c$ was not ruled out, $|\mathcal{M}(I_0) - \sum_{i \in I_0} c_i| \le E$. However, by assumption $|\mathcal{M}(I_0) - \sum_{i \in I_0} d_i| \le E$. It follows from the triangle inequality that $c$ and $d$ differ in at most $2E$ positions in $I_0$; the same argument shows that they differ in at most $2E$ positions in $I_1$. Thus, $c$ and $d$ agree on all but at most $4E$ positions.
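The two-phase attack in the proof of Theorem 8.1 can be sketched directly for very small $n$. The code below is a toy illustration under stated assumptions: the curator `bounded_noise_mechanism` (uniform noise in $[-E, E]$), the helper names, and the example database are all hypothetical, and the running time is exponential in $n$, as the proof itself notes.

```python
import random
from itertools import combinations, product


def all_subsets(n):
    """Yield every subset of {0, ..., n-1} as a tuple of indices."""
    for size in range(n + 1):
        yield from combinations(range(n), size)


def bounded_noise_mechanism(d, E, seed=0):
    """Hypothetical curator: true subset sum plus noise of magnitude at most E."""
    rng = random.Random(seed)

    def answer(S):
        return sum(d[i] for i in S) + rng.uniform(-E, E)

    return answer


def reconstruct(mechanism, n, E):
    """Two-phase attack from the proof of Theorem 8.1 (exponential time)."""
    # Phase 1: query the mechanism once on every subset and record the responses.
    responses = {S: mechanism(S) for S in all_subsets(n)}
    # Phase 2: output the first candidate that no recorded response rules out.
    for c in product((0, 1), repeat=n):
        if all(abs(sum(c[i] for i in S) - r) <= E for S, r in responses.items()):
            return list(c)
    raise RuntimeError("no consistent candidate: the noise bound E was violated")


def hamming(a, b):
    """Number of positions in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


n, E = 8, 1.0
d = [1, 0, 1, 1, 0, 0, 1, 0]  # toy secret database (illustrative values)
c = reconstruct(bounded_noise_mechanism(d, E), n, E)
```

By the theorem, the returned candidate `c` differs from `d` in at most $4E$ positions whenever the curator respects the noise bound; the sketch records each response once so that the randomized curator is not re-queried.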
What if we consider more realistic bounds on the number of queries? We think of $\sqrt{n}$ as an interesting threshold on noise, for the following reason: if the database contains $n$ people drawn uniformly at random

from a population of size $N \gg n$, and the fraction of the population satisfying a given condition is $p$, then we expect the number of rows in the database satisfying the property to be roughly $np \pm \Theta(\sqrt{n})$, by the properties of the binomial distribution. That is, the sampling error is on the order of $\sqrt{n}$. We would like the noise introduced for privacy to be smaller than the sampling error, ideally $o(\sqrt{n})$. The next result investigates the feasibility of such small error when the number of queries is linear in $n$. The result is negative.

Ignoring computational complexity, to see why there might exist a query-efficient attack we modify the problem slightly, looking at databases $d \in \{-1, 1\}^n$ and query vectors $v \in \{-1, 1\}^n$. The true answer is again defined to be $d \cdot v$, and the response is a noisy version of the true answer. Now, consider a candidate database $c$ that is far from $d$, say, $\|c - d\|_0 \in \Omega(n)$. For a random $v \in_R \{-1, 1\}^n$, with constant probability we have $(c - d) \cdot v \in \Omega(\sqrt{n})$. To see this, fix $x \in \{-1, 1\}^n$ and choose $v \in_R \{-1, 1\}^n$. Then $x \cdot v$ is a sum of independent random variables $x_i v_i \in_R \{-1, 1\}$, which has expectation 0 and variance $n$, and is distributed according to a scaled and shifted binomial distribution. For the same reason, if $c$ and $d$ differ in at least $\alpha n$ rows, and $v$ is chosen at random, then $(c - d) \cdot v$ is binomially distributed with mean 0 and variance at least $\alpha n$. Thus, we expect $c \cdot v$ and $d \cdot v$ to differ by at least $\alpha \sqrt{n}$ with constant probability, by the properties of the binomial distribution. Note that we are using the anti-concentration property of the distribution, rather than the usual appeal to concentration.

This opens an attack for ruling out $c$ when the noise is constrained to be $o(\sqrt{n})$: compute the difference between $c \cdot v$ and the noisy response $r(v)$. If the magnitude of this difference exceeds $\sqrt{n}$ (which will occur with constant probability over the choice of $v$) then rule out $c$.

The next theorem formalizes this argument and further shows that the attack is resilient even to a large fraction of completely arbitrary responses: Using a linear number of $\pm 1$ questions, an attacker can reconstruct almost the whole database if the curator is constrained to answer at least $\frac{1}{2} + \eta$ of the questions within an absolute error of $o(\sqrt{n})$.

Theorem 8.2. For any $\eta > 0$ and any function $\alpha = \alpha(n)$, there is a constant $b$ and an attack using $bn$ $\pm 1$ questions that reconstructs a

database that agrees with the real database in all but at most $(2\alpha/\eta)^2$ entries, if the curator answers at least $\frac{1}{2} + \eta$ of the questions within an absolute error of $\alpha$.

Proof. We begin with a simple lemma.

Lemma 8.3. Let $Y = \sum_{i=1}^{k} X_i$ where each $X_i$ is a $\pm 2$ independent Bernoulli random variable with mean zero. Then for any $y$ and any $\ell \in \mathbb{N}$, $\Pr[Y \in [2y, 2(y + \ell)]] \le \frac{\ell + 1}{\sqrt{k}}$.

Proof. Note that $Y$ is always even and that $\Pr[Y = 2y] = \binom{k}{(k+y)/2} \left(\frac{1}{2}\right)^k$. This expression is at most $\binom{k}{\lceil k/2 \rceil} \left(\frac{1}{2}\right)^k$. Using Stirling's approximation, which says that $n!$ can be approximated by $\sqrt{2n\pi}\,(n/e)^n$, this is bounded by $\sqrt{\frac{2}{\pi k}}$. The claim follows by a union bound over the $\ell + 1$ possible values for $Y$ in $[2y, 2(y + \ell)]$.

The adversary's attack is to choose $bn$ random vectors $v \in \{-1, 1\}^n$, obtain responses $(y_1, \ldots, y_{bn})$, and then output any database $c$ such that $|y_i - (Ac)_i| \le \alpha$ for at least $\frac{1}{2} + \eta$ of the indices $i$, where $A$ is the $bn \times n$ matrix whose rows are the random query vectors $v$.

Let the true database be $d$ and let $c$ be the reconstructed database. By assumption on the behavior of the mechanism, $|(Ad)_i - y_i| \le \alpha$ for a $1/2 + \eta$ fraction of $i \in [bn]$. Since $c$ was not ruled out, we also have that $|(Ac)_i - y_i| \le \alpha$ for a $1/2 + \eta$ fraction of $i \in [bn]$. Since any two such sets of indices agree on at least a $2\eta$ fraction of $i \in [bn]$, we have from the triangle inequality that for at least $2\eta bn$ values of $i$, $|[A(c - d)]_i| \le 2\alpha$.

We wish to argue that $c$ agrees with $d$ in all but $(2\alpha/\eta)^2$ entries.
We will show that if the reconstructed $c$ is far from $d$, disagreeing on at least $(2\alpha/\eta)^2$ entries, the probability that a randomly chosen $A$ will satisfy $|[A(c - d)]_i| \le 2\alpha$ for at least $2\eta bn$ values of $i$ will be extremely small; so small that, for a random $A$, it is extremely unlikely that there even exists a $c$ far from $d$ that is not eliminated by the queries in $A$.

Assume the vector $z = (c - d) \in \{-2, 0, 2\}^n$ has Hamming weight at least $(2\alpha/\eta)^2$, so $c$ is far from $d$. We have argued that, since $c$ is produced by the attacker, $|(Az)_i| \le 2\alpha$ for at least $2\eta bn$ values of $i$. We shall call such a $z$ bad with respect to $A$. We will show that, with high probability over the choice of $A$, no $z$ is bad with respect to $A$.

For any $i$, $v_i z$ is the sum of at least $(2\alpha/\eta)^2$ $\pm 2$ random values. Letting $k = (2\alpha/\eta)^2$ and $\ell = 2\alpha$, we have by Lemma 8.3 that the probability that $v_i z$ lies in an interval of size $4\alpha$ is at most $\eta$, so the expected number of queries for which $|v_i z| \le 2\alpha$ is at most $\eta bn$. Chernoff bounds now imply that the probability that this number exceeds $2\eta bn$ is at most $\exp(-\frac{\eta bn}{4})$. Thus the probability of a particular $z = c - d$ being bad with respect to $A$ is at most $\exp(-\frac{\eta bn}{4})$.

Taking a union bound over the at most $3^n$ possible $z$s, we get that with probability at least $1 - \exp(-n(\frac{\eta b}{4} - \ln 3))$, no bad $z$ exists. Taking $b > 4 \ln 3/\eta$, the probability that such a bad $z$ exists is exponentially small in $n$.

Preventing blatant non-privacy is a very low bar for a privacy mechanism, so if differential privacy is meaningful then lower bounds for preventing blatant non-privacy will also apply to any mechanism ensuring differential privacy. Although for the most part we ignore computational issues in this monograph, there is also the question of the efficiency of the attack. Suppose we were able to prove that (perhaps under some computational assumption) there exist low-distortion mechanisms that are "hard" to break; for example, mechanisms for which producing a candidate database $c$ close to the original database is hard. Then, although a low-distortion mechanism might fail to be differentially private in theory, it could conceivably provide privacy against bounded adversaries. Unfortunately, this is not the case. In particular, when the noise is always in $o(\sqrt{n})$, there is an efficient attack using exactly $n$ fixed queries; moreover, there is even a computationally efficient attack requiring a linear number of queries in which a 0.239 fraction may be answered with wild noise.

In the case of "internet scale" data sets, obtaining responses to $n$ queries is infeasible, as $n$ is extremely large, say, $n \ge 10^8$.
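For intuition about the mechanics of the attack in the proof of Theorem 8.2 (not the efficient attacks just mentioned), here is a brute-force sketch for tiny $n$. Everything concrete in it is an assumption made for illustration: the parameter values, the simulated curator that answers the first $\lceil(\frac{1}{2} + \eta)\,bn\rceil$ queries within error $\alpha$ and the rest arbitrarily, and the helper names. At this toy scale the bound $(2\alpha/\eta)^2$ exceeds $n$ and is therefore vacuous, so the sketch only shows how a consistent candidate is selected.

```python
import itertools
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def count_consistent(c, queries, responses, alpha):
    """Number of indices i with |y_i - (Ac)_i| <= alpha."""
    return sum(abs(y - dot(v, c)) <= alpha for v, y in zip(queries, responses))


def attack(queries, responses, n, alpha, eta):
    """Return any c in {-1, 1}^n consistent with a (1/2 + eta) fraction of responses."""
    need = math.ceil((0.5 + eta) * len(queries))
    for c in itertools.product((-1, 1), repeat=n):
        if count_consistent(c, queries, responses, alpha) >= need:
            return list(c)
    return None


# Toy parameters (assumptions for illustration only).
rng = random.Random(1)
n, b, alpha, eta = 10, 6, 1.0, 0.25
m = b * n
d = [rng.choice((-1, 1)) for _ in range(n)]                        # secret database
queries = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]
good = math.ceil((0.5 + eta) * m)                                  # accurately answered queries

responses = []
for i, v in enumerate(queries):
    if i < good:
        responses.append(dot(v, d) + rng.uniform(-alpha, alpha))   # error at most alpha
    else:
        responses.append(rng.uniform(-n, n))                       # arbitrary response

c = attack(queries, responses, n, alpha, eta)
```

Because the true database is itself consistent with the accurately answered queries, the search always returns some candidate; the theorem's content is that, for suitable $b$ and large $n$, every such candidate is close to $d$ with high probability over the choice of queries.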
What happens if the curator permits only a sublinear number of questions? This inquiry led to the first algorithmic results in (what has evolved to be) $(\varepsilon, \delta)$-differential privacy, in which it was shown how to maintain privacy against a sublinear number of counting queries by adding binomial noise of order $o(\sqrt{n})$ (less than the sampling error!) to each true answer. Using the tools of differential privacy we can do this

using either (1) the Gaussian mechanism or (2) the Laplace mechanism and advanced composition.

8.2 Lower bounds for differential privacy

The results of the previous section yielded lower bounds on distortion needed to ensure any reasonable notion of privacy. In contrast, the result in this section is specific to differential privacy. Although some of the details in the proof are quite technical, the main idea is elegant: suppose (somehow) the adversary has narrowed down the set of possible databases to a relatively small set $S$ of $2^s$ vectors, where the $L_1$ distance between each pair of vectors is some large number $\Delta$. Suppose further that we can find a $k$-dimensional query $F$, 1-Lipschitz in each of its output coordinates, with the property that the true answers to the query look very different (in $L_\infty$ norm) on the different vectors in our set; for example, the distance on any two elements in the set may be $\Omega(k)$. It is helpful to think geometrically about the "answer space" $\mathbb{R}^k$. Each element $x$ in the set $S$ gives rise to a vector $F(x)$ in answer space. The actual response will be a perturbation of this point in answer space. Then a volume-based pigeon hole argument (in answer space) shows that, if with even moderate probability the (noisy) responses are "reasonably" close to the true answers, then $\varepsilon$ cannot be very small.

This stems from the fact that for $(\varepsilon, 0)$-differentially private mechanisms $\mathcal{M}$, for arbitrarily different databases $x, y$, any response in the support of $\mathcal{M}(x)$ is also in the support of $\mathcal{M}(y)$. Taken together with the construction of an appropriate collection of vectors and a (contrived, non-counting) query, the result yields a lower bound on distortion that is linear in $k/\varepsilon$. The argument appeals to Theorem 2.2, which discusses group privacy. In our case the group in question corresponds to the indices contributing to the ($L_1$) distance between a pair of vectors in $S$.
8.2.1 Lower bound by packing arguments

We begin with an observation which says, intuitively, that if the "likely" response regions, when the query is $F$, are disjoint, then we can bound

$\varepsilon$ from below, showing that privacy can't be too good. When $\|F(x_i) - F(x_j)\|_\infty$ is large, this says that to get very good privacy, even when restricted to databases that differ in many places, we must get very erroneous responses on some coordinate of $F$.

The argument uses the histogram representation of databases. In the sequel, $d = |\mathcal{X}|$ denotes the size of the universe from which database elements are drawn.

Lemma 8.4. Assume the existence of a set $S = \{x_1, \ldots, x_{2^s}\}$, where each $x_i \in \mathbb{N}^d$, such that for $i \ne j$, $\|x_i - x_j\|_1 \le \Delta$. Further, let $F : \mathbb{N}^d \to \mathbb{R}^k$ be a $k$-dimensional query. For $1 \le i \le 2^s$, let $B_i$ denote a region in $\mathbb{R}^k$, the answer space, and assume that the $B_i$ are mutually disjoint. If $\mathcal{M}$ is an $(\varepsilon, 0)$-differentially private mechanism for $F$ such that, $\forall 1 \le i \le 2^s$, $\Pr[\mathcal{M}(x_i) \in B_i] \ge 1/2$, then $\varepsilon \ge \frac{\ln(2)(s-1)}{\Delta}$.

Proof. By assumption $\Pr[\mathcal{M}(x_j) \in B_j] \ge 2^{-1}$. Since the regions $B_1, \ldots, B_{2^s}$ are disjoint, $\exists j \ne i \in [2^s]$ such that $\Pr[\mathcal{M}(x_i) \in B_j] \le 2^{-s}$. That is, for at least one of the $2^s - 1$ regions $B_j$, the probability that $\mathcal{M}(x_i)$ is mapped to this $B_j$ is at most $2^{-s}$. Combining this with differential privacy, we have

$$\frac{2^{-1}}{2^{-s}} \le \frac{\Pr_{\mathcal{M}}[B_j \mid x_j]}{\Pr_{\mathcal{M}}[B_j \mid x_i]} \le \exp(\varepsilon \Delta).$$

Corollary 8.5. Let $S = \{x_1, \ldots, x_{2^s}\}$ be as in Lemma 8.4, and assume that for any $i \ne j$, $\|F(x_i) - F(x_j)\|_\infty \ge \eta$. Let $B_i$ denote the open $L_\infty$ ball in $\mathbb{R}^k$ of radius $\eta/2$ centered at $F(x_i)$. Let $\mathcal{M}$ be any $\varepsilon$-differentially private mechanism for $F$ satisfying

$$\forall 1 \le i \le 2^s : \Pr[\mathcal{M}(x_i) \in B_i] \ge 1/2.$$

Then $\varepsilon \ge \frac{(\ln 2)(s-1)}{\Delta}$.

Proof. The regions $B_1, \ldots, B_{2^s}$ are disjoint, so the conditions of Lemma 8.4 are satisfied. The corollary follows by applying the lemma and taking logarithms.

In Theorem 8.8 below we will look at queries $F$ that are simply $k$ independently and randomly generated (nonlinear!) queries. For

suitable $S$ and $F$ (we will work to find these) the corollary says that if with probability at least $1/2$ all responses simultaneously have small error, then privacy can't be too good. In other words,

Claim 8.6 (Informal Restatement of Corollary 8.5). To obtain $(\varepsilon, 0)$-differential privacy for $\varepsilon \le \frac{\ln(2)(s-1)}{\Delta}$, the mechanism must add noise with $L_\infty$ norm greater than $\eta/2$ with probability exceeding $1/2$.

As a warm-up exercise, we prove an easier theorem that requires a large data universe.

Theorem 8.7. Let $\mathcal{X} = \{0, 1\}^k$. Let $\mathcal{M} : \mathcal{X}^n \to \mathbb{R}^k$ be an $(\varepsilon, 0)$-differentially private mechanism such that for every database $x \in \mathcal{X}^n$ with probability at least $1/2$ $\mathcal{M}(x)$ outputs all of the 1-way marginals of $x$ with error smaller than $n/2$. That is, for each $j \in [k]$, the $j$th component of $\mathcal{M}(x)$ should approximately equal the number of rows of $x$ whose $j$th bit is 1, up to an error smaller than $n/2$. Then $n \in \Omega(k/\varepsilon)$.

Note that this bound is tight to within a constant factor, by the simple composition theorem, and that it separates $(\varepsilon, 0)$-differential privacy from $(\varepsilon, \delta)$-differential privacy, for $\delta \in 2^{-o(n)}$, since, by the advanced composition theorem (Theorem 3.20), Laplace noise with parameter $b = \sqrt{k \ln(1/\delta)}/\varepsilon$ suffices for the latter, in contrast to $\Omega(k/\varepsilon)$ needed for the former. Taking $k \in \Theta(n)$ and, say, $\delta = 2^{-\log^2 n}$, yields the separation.

Proof. For every string $w \in \{0, 1\}^k$, consider the database $x_w$ consisting of $n$ identical rows, all of which equal $w$. Let $B_w \subseteq \mathbb{R}^k$ consist of all tuples of numbers that provide answers to the 1-way marginals on $x_w$ with error less than $n/2$. That is,

$$B_w = \{(a_1, \ldots, a_k) \in \mathbb{R}^k : \forall i \in [k]\ |a_i - n w_i| < n/2\}.$$

Put differently, $B_w$ is the open $\ell_\infty$ ball of radius $n/2$ around $nw \in \{0, n\}^k$. Notice that the sets $B_w$ are mutually disjoint.
Notice that the sets w If M is an accurate mechanism for answering 1-way marginals, then w for every B x when the database is the probability of landing in w w should be at least 1 / 2 : Pr[ M ( x n ) ∈ B ∆ = ] ≥ 1 / 2 . Thus, setting w w (ln 2)( s − 1) in Corollary 8.5 . ε ≥ and s = k we have ∆

Theorem 8.8. For any $k, d, n \in \mathbb{N}$ and $\varepsilon \in (0, 1/40]$, where $n \ge \min\{k/\varepsilon, d/\varepsilon\}$, there is a query $F : \mathbb{N}^d \to \mathbb{R}^k$ with per-coordinate sensitivity at most 1 such that any $(\varepsilon, 0)$-differentially private mechanism adds noise of $L_\infty$ norm $\Omega(\min\{k/\varepsilon, d/\varepsilon\})$ with probability at least $1/2$ on some databases of weight at most $n$.

Note that $d = |\mathcal{X}|$ need not be large here, in contrast to the requirement in Theorem 8.7.

Proof. Let $\ell = \min\{k, d\}$. Using error-correcting codes we can construct a set $S = \{x_1, \ldots, x_{2^s}\}$, where $s = \ell/400$, such that each $x_i \in \mathbb{N}^d$ and in addition

1. $\forall i : \|x_i\|_1 \le w = \ell/(1280\varepsilon)$
2. $\forall i \ne j$, $\|x_i - x_j\|_1 \ge w/10$

We do not give details here, but we note that the databases in $S$ are of size at most $w < n$, and so $\|x_i - x_j\|_1 \le 2w$. Taking $\Delta = 2w$ the set $S$ satisfies the conditions of Corollary 8.5. The remainder of our effort is to obtain the queries $F$ to which we will apply Corollary 8.5.

Given $S = \{x_1, \ldots, x_{2^s}\}$, where each $x_i \in \mathbb{N}^d$, the first step is to define a mapping from the space of histograms to vectors in $\mathbb{R}^{2^s}$, $\mathcal{L}_S : \mathbb{N}^d \to \mathbb{R}^{2^s}$. Intuitively (and imprecisely!), given a histogram $z$, the mapping lists, for each $x_i \in S$, the $L_1$ distance from $z$ to $x_i$. More precisely, letting $w$ be an upper bound on the weight of any $x_i$ in our collection we define the mapping as follows.

• For every $x_i \in S$, there is a coordinate $i$ in the mapping.
• The $i$th coordinate of $\mathcal{L}_S(z)$ is $\max\{w/30 - \|x_i - z\|_1, 0\}$.

Claim 8.9. If $x_1, \ldots, x_{2^s}$ satisfy the conditions

1. $\forall i\ \|x_i\|_1 \le w$; and
2. $\forall i \ne j\ \|x_i - x_j\|_1 \ge w/10$

then the map $\mathcal{L}_S$ is 1-Lipschitz; in particular, if $\|z_1 - z_2\|_1 = 1$, then $\|\mathcal{L}_S(z_1) - \mathcal{L}_S(z_2)\|_1 \le 1$, assuming $w \ge 31$.
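As a sanity check on the definition of $\mathcal{L}_S$, the following sketch evaluates the map on a tiny hand-made set $S$ that satisfies the two conditions of Claim 8.9. The set, the value $w = 60$, and the function names are assumptions chosen for illustration; they are not the code-based construction used in the proof.

```python
def l1_distance(x, z):
    """L1 distance between two histograms (or any two equal-length vectors)."""
    return sum(abs(a - b) for a, b in zip(x, z))


def L_S(S, z, w):
    """i-th coordinate is max(w/30 - ||x_i - z||_1, 0), one coordinate per x_i in S."""
    return [max(w / 30 - l1_distance(x_i, z), 0.0) for x_i in S]


# Toy instance: weights are exactly w and pairwise distances are 120 >= w/10.
w = 60
S = [(60, 0, 0), (0, 60, 0), (0, 0, 60)]

z1 = (59, 0, 0)  # at L1 distance 1 from S[0]
z2 = (59, 1, 0)  # a neighbor of z1, at L1 distance 2 from S[0]
```

On this instance, `L_S(S, S[0], w)` has a single non-zero coordinate equal to $w/30 = 2$, while `L_S(S, z1, w)` and `L_S(S, z2, w)` differ by exactly 1 in $L_1$ norm, in line with the 1-Lipschitz claim.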

Proof. Since we assume $w \ge 31$ we have that if $z \in \mathbb{N}^d$ is close to some $x_i \in S$, meaning $w/30 > \|x_i - z\|_1$, then $z$ cannot be close to any other $x_j \in S$, and the same is true for all $z'$ with $\|z' - z\|_1 \le 1$. Thus, for any $z_1, z_2$ such that $\|z_1 - z_2\|_1 \le 1$, if $A$ denotes the set of coordinates where at least one of $\mathcal{L}_S(z_1)$ or $\mathcal{L}_S(z_2)$ is non-zero, then $A$ is either empty or is a singleton set. Given this, the statement in the claim is immediate from the fact that the mapping corresponding to any particular coordinate is clearly 1-Lipschitz.

We can finally describe the queries $F$. Corresponding to any $r \in \{-1, 1\}^{2^s}$, we define $f_r : \mathbb{N}^d \to \mathbb{R}$, as

$$f_r(x) = \sum_{i=1}^{2^s} \mathcal{L}_S(x)_i \cdot r_i,$$

which is simply the inner product $\mathcal{L}_S \cdot r$. $F : \mathbb{N}^d \to \mathbb{R}^k$ will be a random map: Pick $r_1, \ldots, r_k \in \{-1, 1\}^{2^s}$ independently and uniformly at random and define

$$F(x) = (f_{r_1}(x), \ldots, f_{r_k}(x)).$$

That is, $F(x)$ is simply the result of the inner product of $\mathcal{L}_S(x)$ with $k$ randomly chosen $\pm 1$ vectors.

Note that for any $x \in S$, $\mathcal{L}_S(x)$ has one coordinate with value $w/30$ (and the others are all zero), so $\forall r_i \in \{-1, 1\}^{2^s}$ and $x \in S$ we have $|f_{r_i}(x)| = w/30$. Now consider any $x_h, x_j \in S$, where $h \ne j$. It follows that for any $r_i \in \{-1, 1\}^{2^s}$,

$$\Pr_{r_i}[\,|f_{r_i}(x_h) - f_{r_i}(x_j)| \ge w/15\,] \ge 1/2$$

(this event occurs when $(r_i)_h = -(r_i)_j$). A basic application of the Chernoff bound implies that

$$\Pr_{r_1, \ldots, r_k}[\text{For at least } 1/10 \text{ of the } r_i\text{s},\ |f_{r_i}(x_h) - f_{r_i}(x_j)| \ge w/15\,] \ge 1 - 2^{-k/30}.$$

Now, the total number of pairs $(x_i, x_j)$ of databases such that $x_i, x_j \in S$ is at most $2^{2s} \le 2^{k/200}$. Taking a union bound this implies

$$\Pr_{r_1, \ldots, r_k}[\forall h \ne j, \text{ For at least } 1/10 \text{ of the } r_i\text{s},\ |f_{r_i}(x_h) - f_{r_i}(x_j)| \ge w/15\,] \ge 1 - 2^{-k/40}$$

This implies that we can fix $r_1, \ldots, r_k$ such that the following is true.

$$\forall h \ne j, \text{ For at least } 1/10 \text{ of the } r_i\text{s},\ |f_{r_i}(x_h) - f_{r_i}(x_j)| \ge w/15$$

Thus, for any $x_h \ne x_j \in S$, $\|F(x_h) - F(x_j)\|_\infty \ge w/15$.

Setting $\Delta = 2w$ and $s = \ell/400 > 3\varepsilon w$ (as we did above), and $\eta = w/15$, we satisfy the conditions of Corollary 8.5 and conclude $\Delta \le (s - 1)/\varepsilon$, proving the theorem (via Claim 8.6).

The theorem is almost tight: if $k \le d$ then we can apply the Laplace mechanism to each of the $k$ sensitivity 1 component queries in $F$ with parameter $k/\varepsilon$, and we expect the maximum distortion to be $\Theta(k \ln k/\varepsilon)$. On the other hand, if $d \le k$ then we can apply the Laplace mechanism to the $d$-dimensional histogram representing the database, and we expect the maximum distortion to be $\Theta(d \ln d/\varepsilon)$.

The theorem actually shows that, given knowledge of the set $S$ and knowledge that the actual database is an element $x \in S$, the adversary can completely determine $x$ if the $L_\infty$ norm of the distortion is too small. How in real life might the adversary obtain a set $S$ of the type used in the attack? This can occur when a non-private database system has been running on a dataset, say, $x$. For example, $x$ could be a vector in $\{0, 1\}^n$ and the adversary may have learned, through a sequence of linear queries, that $x \in \mathcal{C}$, a linear code of distance, say, $n^{2/3}$. Of course, if the database system is not promising privacy there is no problem. The problem arises if the administrator decides to replace the existing system with a differentially private mechanism after several queries have received noise-free responses. In particular, if the administrator chooses to use $(\varepsilon, \delta)$-differential privacy for subsequent $k$ queries then the distortion might fall below the $\Omega(k/\varepsilon)$ lower bound, permitting the attack described in the proof of Theorem 8.8.

The theorem also emphasizes that there is a fundamental difference between auxiliary information about (sets of) members of the database and information about the database as a whole. Of course, we already knew this: being told that the number of secret bits sums to exactly 5,000 completely destroys differential privacy, and an adversary that already knew the secret bit of every member of the database except one individual could then conclude the secret bit of the remaining individual.

Additional Consequences. Suppose $k \le d$, so $\ell = k$ in Theorem 8.8. The linear in $k/\varepsilon$ lower bound on noise for $k$ queries sketched in the previous section immediately yields a separation between counting queries and arbitrary 1-sensitivity queries, as the SmallDB construction answers (more than) $n$ queries with noise roughly $n^{2/3}$ while maintaining differential privacy. Indeed, this result also permits us to conclude that there is no small $\alpha$-net for large sets of arbitrary low sensitivity queries, for $\alpha \in o(n)$ (as otherwise the net mechanism would yield an $(\varepsilon, 0)$ algorithm of desired accuracy).

8.3 Bibliographic notes

The first reconstruction attacks, including Theorem 8.1, are due to Dinur and Nissim [18], who also gave an attack requiring only polynomial time computation and $O(n \log^2 n)$ queries, provided the noise is always $o(\sqrt{n})$. Realizing that attacks requiring $n$ random linear queries, when $n$ is "internet scale," are infeasible, Dinur, Dwork, and Nissim gave the first positive results, showing that for a sublinear number of subset sum queries, a form of privacy (now known to imply $(\varepsilon, \delta)$-differential privacy) can be achieved by adding noise scaled to $o(\sqrt{n})$ [18]. This was exciting because it suggested that, if we think of the database as drawn from an underlying population, then, even for a relatively large number of counting queries, privacy could be achieved with distortion smaller than the sampling error. This eventually led, via more general queries [31, 6], to differential privacy. The view of these queries as a privacy-preserving programming primitive [6] inspired McSherry's Privacy Integrated Queries programming platform [59].

The reconstruction attack of Theorem 8.2 appears in [24], where Dwork, McSherry, and Talwar showed that polynomial time reconstruction is possible even if a 0.239 fraction of the responses have wild, arbitrary noise, provided the others have noise $o(\sqrt{n})$.

The geometric approach, and in particular Lemma 8.4, is due to Hardt and Talwar [45], who also gave a geometry-based algorithm proving these bounds tight for small numbers $k \le n$ of queries, under a

commonly believed conjecture. Dependence on the conjecture was later removed by Bhaskara et al. [5]. The geometric approach was extended to arbitrary numbers of queries by Nikolov et al. [66], who gave an algorithm with instance-optimal mean squared error. For the few queries case this leads, via a boosting argument, to low expected worst-case error. Theorem 8.8 is due to De [17].

9 Differential Privacy and Computational Complexity

Our discussion of differential privacy has so far ignored issues of computational complexity, permitting both the curator and the adversary to be computationally unbounded. In reality, both curator and adversary may be computationally bounded.

Confining ourselves to a computationally bounded curator restricts what the curator can do, making it harder to achieve differential privacy. And indeed, we will show an example of a class of counting queries that, under standard complexity theoretic assumptions, does not permit efficient generation of a synthetic database, even though inefficient algorithms, such as SmallDB and Private Multiplicative Weights, are known. Very roughly, the database rows are digital signatures, signed with keys to which the curator does not have access. The intuition will be that any row in a synthetic database must either be copied from the original (violating privacy) or must be a signature on a new message, i.e., a forgery (violating the unforgeability property of a digital signature scheme). Unfortunately, this state of affairs is not limited to (potentially contrived) examples based on digital signatures: it is even difficult to create a synthetic database that maintains relatively

accurate two-way marginals. (Recall that the two-way marginals are the counts, for every pair of attribute values, of the number of rows in the database having this pair of values.) On the positive side, given a set $\mathcal{Q}$ of queries and an $n$-row database with rows drawn from a universe $\mathcal{X}$, a synthetic database can be generated in time polynomial in $n$, $|\mathcal{X}|$, and $|\mathcal{Q}|$.

If we abandon the goal of a synthetic database and content ourselves with a data structure from which we can obtain a relatively accurate approximation to the answer to each query, the situation is much more interesting. It turns out that the problem is intimately related to the tracing traitors problem, in which the goal is to discourage piracy while distributing digital content to paying customers.

If the adversary is restricted to polynomial time, then it becomes easier to achieve differential privacy. In fact, the immensely powerful concept of secure function evaluation yields a natural way to avoid the trusted curator (while giving better accuracy than randomized response), as well as a natural way to allow multiple trusted curators, who for legal reasons cannot share their data sets, to respond to queries on what is effectively a merged data set. Briefly put, secure function evaluation is a cryptographic primitive that permits a collection of $n$ parties $p_1, p_2, \ldots, p_n$, of which fewer than some fixed fraction are faulty (the fraction varies according to the type of faults; for "honest-but-curious" faults the fraction is 1), to cooperatively compute any function $f(x_1, \ldots, x_n)$, where $x_i$ is the input, or value, of party $p_i$, in such a way that no coalition of faulty parties can either disrupt the computation or learn more about the values of the non-faulty parties than can be deduced from the function output and the values of the members of the coalition. These two properties are traditionally called correctness and privacy. This privacy notion, let us call it SFE privacy, is very different from differential privacy. Let $V$ be the set of values held by the faulty parties (in the honest-but-curious case we can let $V = \{x_j\}$ for any party $P_j$), and let $p_i$ be a non-faulty party. SFE privacy permits the faulty parties to learn $x_i$ if $x_i$ can be deduced from $V \cup \{f(x_1, \ldots, x_n)\}$; differential privacy would therefore not permit exact release of $f(x_1, \ldots, x_n)$. However, secure function evaluation

protocols for computing a function $f$ can easily be modified to obtain differentially private protocols for $f$, simply by defining a new function, $g$, to be the result of adding Laplace noise $\mathrm{Lap}(\Delta f/\varepsilon)$ to the value of $f$. In principle, secure function evaluation permits evaluation of $g$. Since $g$ is differentially private and the SFE privacy property, applied to $g$, says that nothing can be learned about the inputs that is not learnable from the value of $g(x_1, \ldots, x_n)$ together with $V$, differential privacy is ensured, provided the faulty players are restricted to polynomial time. Thus, secure function evaluation allows a computational notion of differential privacy to be achieved, even without a trusted curator, at no loss in accuracy when compared to what can be achieved with a trusted curator. In particular, counting queries can be answered with constant expected error while ensuring computational differential privacy, with no trusted curator. We will see that, without cryptography, the error must be $\Omega(n^{1/2})$, proving that computational assumptions provably buy accuracy, in the multiparty case.

9.1 Polynomial time curators

In this section we show that, under standard cryptographic assumptions, it is computationally difficult to create a synthetic database that will yield accurate answers to an appropriately chosen class of counting queries, while ensuring even a minimal notion of privacy.

This result has several extensions; for example, to the case in which the set of queries is small (but the data universe remains large), and the case in which the data universe is small (but the set of queries is large). In addition, similar negative results have been obtained for certain natural families of queries, such as those corresponding to conjunctions.

We will use the term syntheticize to denote the process of generating a synthetic database in a privacy-preserving fashion. (In Section 6 a syntheticizer took as input a synopsis; here we are starting with a database, which is a trivial synopsis.) Thus, the results in this section concern the computational hardness of syntheticizing. Our notion of privacy will be far weaker than differential privacy, so hardness of syntheticizing will imply hardness of generating a synthetic

database in a differentially private fashion. Specifically, we will say that syntheticizing is hard if it is hard even to avoid leaking input items in their entirety. That is, some item is always completely exposed.

Note that if, in contrast, leaking a few input items is not considered a privacy breach, then syntheticizing is easily achieved by releasing a randomly chosen subset of the input items. Utility for this "synthetic database" comes from sampling bounds: with high probability this subset will preserve utility even with respect to a large set of counting queries.

When introducing complexity assumptions, we require a security parameter in order to express sizes; for example, sizes of sets, lengths of messages, number of bits in a decryption key, and so on, as well as to express computational difficulty. The security parameter, denoted $\kappa$, represents "reasonable" sizes and effort. For example, it is assumed that it is feasible to exhaustively search a set whose size is (any fixed) polynomial in the security parameter.

Computational complexity is an asymptotic notion: we are concerned with how the difficulty of a task increases as the sizes of the objects (data universe, database, query family) grow. Thus, for example, we need to think not just of a distribution on databases of a single size (what we have been calling $n$ in the rest of this monograph), but of an ensemble of distributions, indexed by the security parameter. In a related vein, when we introduce complexity we tend to "soften" claims: forging a signature is not impossible (one might be lucky!). Rather, we assume that no efficient algorithm succeeds with non-negligible probability, where "efficient" and "non-negligible" are defined in terms of the security parameter. We will ignore these fine points in our intuitive discussion, but will keep them in the formal theorem statements.
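The remark earlier in this section about releasing a randomly chosen subset of the input items can be sketched in a few lines. This is an illustration of the trivial, deliberately non-private "syntheticizer" only: it leaks the released rows in their entirety, which is exactly the kind of breach the hardness notion rules out. The function names, the toy rows, and the sample size are assumptions.

```python
import random


def subsample_release(rows, m, seed=0):
    """Release m input rows chosen uniformly at random without replacement.
    This leaks the chosen rows verbatim, so it offers no privacy for them."""
    rng = random.Random(seed)
    return rng.sample(list(rows), m)


def counting_query(rows, predicate):
    """Fraction of rows satisfying the predicate."""
    return sum(1 for r in rows if predicate(r)) / len(rows)


# Toy database: each row is a pair of attributes (illustrative values only).
rows = [(i % 2, i % 3 == 0) for i in range(600)]
released = subsample_release(rows, 100)

# The same counting query evaluated on the full data and on the released subset.
exact = counting_query(rows, lambda r: r[0] == 1)
estimate = counting_query(released, lambda r: r[0] == 1)
```

Sampling bounds say that `estimate` is close to `exact` with high probability for each fixed counting query, and, by a union bound, simultaneously for a large set of such queries when the sample is large enough.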
Speaking informally, a distribution of databases is hard to syntheticize (with respect to some family $\mathcal{Q}$ of queries) if for any efficient (alleged) syntheticizer, with high probability over a database drawn from the distribution, at least one of the database items can be extracted from the alleged syntheticizer's output. Of course, to avoid triviality, we will also require that when this leaked item is excluded from the input database (and, say, replaced by a random different item), the probability that it can be extracted from the output is very small. This means that any efficient (alleged) syntheticizer indeed compromises the privacy of input items in a strong sense.

Definition 9.1 below will formalize our utility requirements for a syntheticizer. There are three parameters: $\alpha$ describes the accuracy requirement (being within $\alpha$ is considered accurate); $\gamma$ describes the fraction of the queries on which a successful synthesis is allowed to be inaccurate, and $\beta$ will be the probability of failure.

For an algorithm $A$ producing synthetic databases, we say that an output $A(x)$ is $(\alpha, \gamma)$-accurate for a query set $\mathcal{Q}$ if $|q(A(x)) - q(x)| \le \alpha$ for a $1 - \gamma$ fraction of the queries $q \in \mathcal{Q}$.

Definition 9.1 (($\alpha, \beta, \gamma$)-Utility). Let $\mathcal{Q}$ be a set of queries and $\mathcal{X}$ a data universe. A syntheticizer $A$ has $(\alpha, \beta, \gamma)$-utility for $n$-item databases with respect to $\mathcal{Q}$ and $\mathcal{X}$ if for any $n$-item database $x$:

$$\Pr[A(x) \text{ is } (\alpha, \gamma)\text{-accurate for } \mathcal{Q}] \ge 1 - \beta,$$

where the probability is over the coins of $A$.

Let $\mathcal{Q} = \{\mathcal{Q}_n\}_{n=1,2,\ldots}$ be a query family ensemble, $\mathcal{X} = \{\mathcal{X}_n\}_{n=1,2,\ldots}$ be a data universe ensemble. An algorithm is said to be efficient if its running time is $\mathrm{poly}(n, \log(|\mathcal{Q}_n|), \log(|\mathcal{X}_n|))$.

In the next definition we describe what it means for a family of distributions to be hard to syntheticize. A little more specifically we will say what it means to be hard to generate synthetic databases that provide $(\alpha, \gamma)$-accuracy. As usual, we have to make this an asymptotic statement.

Definition 9.2 (($\mu, \alpha, \beta, \gamma, \mathcal{Q}$)-Hard-to-Syntheticize Database Distribution). Let $\mathcal{Q} = \{\mathcal{Q}_n\}_{n=1,2,\ldots}$ be a query family ensemble, $\mathcal{X} = \{\mathcal{X}_n\}_{n=1,2,\ldots}$ be a data universe ensemble, and let $\mu, \alpha, \beta, \gamma \in [0, 1]$. Let $n$ be a database size and $\mathcal{D}$ an ensemble of distributions, where $\mathcal{D}_n$ is over collections of $n + 1$ items from $\mathcal{X}_n$.
We denote by $(x, i, x'_i) \sim \mathcal{D}_n$ the experiment of choosing an $n$-element database, an index $i$ chosen uniformly from $[n]$, and an additional element $x'_i$ from $\mathcal{X}_n$. A sample from $\mathcal{D}_n$ gives us a pair of databases: $x$ and the result of replacing the $i$th element of $x$ (under a canonical ordering) with $x'_i$. Thus, we think of $\mathcal{D}_n$ as specifying a distribution on $n$-item databases (and their neighbors).

We say that $\mathcal{D}$ is $(\mu, \alpha, \beta, \gamma, \mathcal{Q})$-hard-to-syntheticize if there exists an efficient algorithm $T$ such that for any alleged efficient syntheticizer $A$ the following two conditions hold:

1. With probability $1 - \mu$ over the choice of database $x \sim \mathcal{D}$ and the coins of $A$ and $T$, if $A(x)$ maintains $\alpha$-utility for a $1 - \gamma$ fraction of queries, then $T$ can recover one of the rows of $x$ from $A(x)$:

$$\Pr_{\substack{(x, i, x'_i) \sim \mathcal{D}_n \\ \text{coin flips of } A, T}}\big[(A(x) \text{ maintains } (\alpha, \beta, \gamma)\text{-utility}) \text{ and } (x \cap T(A(x)) = \emptyset)\big] \le \mu$$

2. For every efficient algorithm $A$, and for every $i \in [n]$, if we draw $(x, i, x'_i)$ from $\mathcal{D}$, and replace $x_i$ with $x'_i$ to form $x'$, $T$ cannot extract $x_i$ from $A(x')$ except with small probability:

$$\Pr_{\substack{(x, i, x'_i) \sim \mathcal{D}_n \\ \text{coin flips of } A, T}}\big[x_i \in T(A(x'))\big] \le \mu.$$

Later, we will be interested in offline mechanisms that produce arbitrary synopses, not necessarily synthetic databases. In this case we will be interested in the related notion of hard to sanitize (rather than hard to syntheticize), for which we simply drop the requirement that $A$ produce a synthetic database.

9.2 Some hard-to-Syntheticize distributions

We now construct three distributions that are hard to syntheticize.

A signature scheme is given by a triple of (possibly randomized) algorithms (Gen, Sign, Verify):

• $\mathrm{Gen} : 1^{\mathbb{N}} \to \{(\mathrm{SK}, \mathrm{VK})_n\}_{n=1,2,\ldots}$ is used to generate a pair consisting of a (secret) signing key and a (public) verification key. It takes only the security parameter $\kappa \in \mathbb{N}$, written in unary, as input, and produces a pair drawn from $(\mathrm{SK}, \mathrm{VK})_\kappa$, the distribution on (signing, verification) key pairs indexed by $\kappa$; we let

$p_s(\kappa)$, $p_v(\kappa)$, $\ell_s(\kappa)$ denote the lengths of the signing key, verification key, and signature, respectively.

• $\mathrm{Sign} : \mathrm{SK}_\kappa \times \{0, 1\}^{\ell(\kappa)} \to \{0, 1\}^{\ell_s(\kappa)}$ takes as input a signing key from a pair drawn from $(\mathrm{SK}, \mathrm{VK})_\kappa$ and a message $m$ of length $\ell(\kappa)$, and produces a signature on $m$;

• $\mathrm{Verify} : \mathrm{VK}_\kappa \times \{0, 1\}^* \times \{0, 1\}^{\ell(\kappa)} \to \{0, 1\}$ takes as input a verification key, a string $\sigma$, and a message $m$ of length $\ell(\kappa)$, and checks that $\sigma$ is indeed a valid signature of $m$ under the given verification key.

Keys, message lengths, and signature lengths are all polynomial in $\kappa$.

The notion of security required is that, given any polynomial (in $\kappa$) number of valid (message, signature) pairs, it is hard to forge any new signature, even a new signature of a previously signed message (recall that the signing algorithm may be randomized, so there may exist multiple valid signatures of the same message under the same signing key). Such a signature scheme can be constructed from any one-way function. Speaking informally, these are functions that are easy to compute ($f(x)$ can be computed in time polynomial in the length (number of bits) of $x$) but hard to invert: for every probabilistic polynomial time algorithm, running in time polynomial in the security parameter $\kappa$, the probability, over a randomly chosen $x$ in the domain of $f$, of finding any valid pre-image of $f(x)$, grows more slowly than the inverse of any polynomial in $\kappa$.

Hard to Syntheticize Distribution I: Fix an arbitrary signature scheme. The set $\mathcal{Q}_\kappa$ of counting queries contains one counting query $q_{vk}$ for each verification key $vk \in \mathrm{VK}_\kappa$. The data universe $\mathcal{X}_\kappa$ consists of the set of all possible (message, signature) pairs for messages of length $\ell(\kappa)$ signed with keys in $\mathrm{VK}_\kappa$.

The distribution $\mathcal{D}_\kappa$ on databases is defined by the following sampling procedure. Run the signature scheme generator $\mathrm{Gen}(1^\kappa)$ to obtain $(sk, vk)$. Randomly choose $n = \kappa$ messages in $\{0, 1\}^{\ell(\kappa)}$ and run the signing procedure for each one, obtaining a set of $n$ (message, signature) pairs all signed with key $sk$. This is the database $x$. Note that all the messages in the database are signed with the same signing key.
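The sampling procedure can be sketched in a few lines. The sketch below is an illustration under strong simplifying assumptions: it uses HMAC-SHA256 from the Python standard library as a stand-in for a signature scheme, so the signing and verification keys coincide (a real signature scheme has a public verification key), and the names `gen`, `sign`, `verify`, `q_vk`, and the two toy syntheticizers are hypothetical. The query `q_vk` reports the fraction of rows whose tag verifies under the sampled key.

```python
import hashlib
import hmac
import os
import random

def gen(kappa):
    # Stand-in key generation: a single secret key plays both roles (sk, vk).
    # This is an assumption of the sketch, not a real signature scheme.
    key = os.urandom(kappa)
    return key, key

def sign(sk, message):
    return hmac.new(sk, message, hashlib.sha256).digest()

def verify(vk, message, tag):
    return hmac.compare_digest(sign(vk, message), tag)

def sample_database(kappa, n):
    """Draw one key pair, then n random messages, all signed with the same key."""
    sk, vk = gen(kappa)
    messages = [os.urandom(kappa) for _ in range(n)]
    return vk, [(m, sign(sk, m)) for m in messages]

def q_vk(vk, rows):
    """Counting query: fraction of rows that verify under vk."""
    return sum(verify(vk, m, s) for m, s in rows) / len(rows)

def copying_syntheticizer(rows, m, rng):
    # Toy syntheticizer that simply releases a random subset of the input rows.
    return rng.sample(rows, m)

def blind_syntheticizer(m, kappa):
    # Toy syntheticizer without the key: outputs random (message, tag) pairs.
    return [(os.urandom(kappa), os.urandom(32)) for _ in range(m)]

kappa, n = 16, 64
vk, x = sample_database(kappa, n)
y_copy = copying_syntheticizer(x, 16, random.Random(1))
y_fresh = blind_syntheticizer(16, kappa)
```

Copying rows keeps the fractional count for the sampled key at 1 but exposes input items in their entirety, while rows produced without the key fail verification; this is the tension the hardness argument exploits.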

A data universe item $(m, \sigma)$ satisfies the predicate $q_{vk}$ if and only if $\mathrm{Verify}(vk, m, \sigma) = 1$, i.e., $\sigma$ is a valid signature for $m$ according to verification key $vk$.

Let $x \in_R \mathcal{D}_\kappa$ be a database, and let $sk$ be the signing key used, with corresponding verification key $vk$. Assuming that the syntheticizer has produced $y$, it must be the case that almost all rows of $y$ are valid signatures under $vk$ (because the fractional count of $x$ for the query $vk$ is 1). By the unforgeability properties of the signature scheme, all of these must come from the input database $x$: the polynomial time bounded curator, running in time $\mathrm{poly}(\kappa)$, cannot generate a new valid (message, signature) pair. (Only slightly) more formally, the probability that an efficient algorithm could produce a (message, signature) pair that is verifiable with key $vk$, but is not in $x$, is negligible, so with overwhelming probability any $y$ that is produced by an efficient syntheticizer will only contain rows of $x$.^4 This contradicts (any reasonable notion of) privacy.

In this construction, both $\mathcal{Q}_\kappa$ (the set of verification keys) and $\mathcal{X}_\kappa$ (the set of (message, signature) pairs) are large (superpolynomial in $\kappa$). When both sets are small, efficient differentially private generation of synthetic datasets is possible. That is, there is a differentially private syntheticizer whose running time is polynomial in $n = \kappa$, $|\mathcal{Q}_\kappa|$ and $|\mathcal{X}_\kappa|$: compute noisy counts using the Laplace mechanism to obtain a synopsis and then run the syntheticizer from Section 6. Thus, when both of these have size polynomial in $\kappa$ the running time of the syntheticizer is polynomial in $\kappa$.

We now briefly discuss generalizations of the first hardness result to the cases in which one of these sets is small (but the other remains large).
^4 The quantification order is important, as otherwise the syntheticizer could have the signing key hardwired in. We first fix the syntheticizer, then run the generator and build the database. The probability is over all the randomness in the experiment: choice of key pair, construction of the database, and randomness used by the syntheticizer.

Hard to Syntheticize Distribution II: In the database distribution above, we chose a single $(sk, vk)$ key pair and generated a database of messages, all signed using $sk$; hardness was obtained by requiring the syntheticizer to generate a new signature under $sk$, in order for the syntheticized database to provide an accurate answer to the query $q_{vk}$. To obtain hardness for syntheticizing when the size of the set of queries is only polynomial in the security parameter, we again use digital signatures, signed with a unique key, but we cannot afford to have a query for each possible verification key $vk$, as these are too numerous. To address this, we make two changes:

1. Database rows now have the form (verification key, message, signature). More precisely, the data universe consists of (key, message, signature) triples $\mathcal{X}_\kappa = \{(vk, m, s) : vk \in \mathrm{VK}_\kappa, m \in \{0, 1\}^{\ell(\kappa)}, s \in \{0, 1\}^{\ell_s(\kappa)}\}$.

2. We add to the query class exactly $2 p_v(\kappa)$ queries, where $p_v(\kappa)$ is the length of the verification keys produced by running the generation algorithm $\mathrm{Gen}(1^\kappa)$. The queries have the form $(i, b)$ where $1 \le i \le p_v(\kappa)$ and $b \in \{0, 1\}$. The meaning of the query "$(i, b)$" is, "What fraction of the database rows are of the form $(vk, m, s)$ where $\mathrm{Verify}(vk, m, s) = 1$ and the $i$th bit of $vk$ is $b$?" By populating a database with messages signed according to a single key $vk$, we ensure that the responses to these queries should be close to one for all $1 \le i \le p_v(\kappa)$ when $vk_i = b$, and close to zero when $vk_i = 1 - b$.

With this in mind, the hard to syntheticize distribution on databases is constructed by the following sampling procedure: Generate a signature-verification key pair $(sk, vk) \leftarrow \mathrm{Gen}(1^\kappa)$, and choose $n = \kappa$ messages $m_1, \ldots, m_n$ uniformly from $\{0, 1\}^{\ell(\kappa)}$. The database $x$ will have $n$ rows; for $j \in [n]$, the $j$th row is the verification key, the $j$th message and its valid signature, i.e., the tuple $(vk, m_j, \mathrm{Sign}(m_j, sk))$. Next, choose $i$ uniformly from $[n]$. To generate the $(n + 1)$st item $x'_i$, just generate a new message-signature pair (using the same key $sk$).

Hard to Syntheticize Distribution III: To prove hardness for the case of a polynomial (in $\kappa$) sized message space (but superpolynomial sized query set) we use a pseudorandom function. Roughly speaking, these are polynomial time computable functions with small descriptions that

cannot efficiently be distinguished, based only on their input-output behavior, from truly random functions (whose descriptions are long).

This result only gives hardness of syntheticizing if we insist on maintaining utility for all queries. Indeed, if we are interested only in ensuring on-average utility, then the base generator for counting queries described in Section 6 yields an efficient algorithm for syntheticizing when the universe $\mathcal{X}$ is of polynomial size, even when $\mathcal{Q}$ is exponentially large.

Let $\{f_s\}_{s \in \{0,1\}^\kappa}$ be a family of pseudo-random functions from $[\ell]$ to $[\ell]$, where $\ell \in \mathrm{poly}(\kappa)$. More specifically, we need that the set of all pairs of elements in $[\ell]$ is "small," but larger than $\kappa$; this way the $\kappa$-bit string describing a function in the family is shorter than the $\ell \log_2 \ell$ bits needed to describe a random function mapping $[\ell]$ to $[\ell]$. Such a family of pseudorandom functions can be constructed from any one-way function.

Our data universe will be the set of all pairs of elements in $[\ell]$: $\mathcal{X}_\kappa = \{(a, b) : a, b \in [\ell]\}$. $\mathcal{Q}_\kappa$ will contain two types of queries:

1. There will be one query for each function $\{f_s\}_{s \in \{0,1\}^\kappa}$ in the family. A universe element $(a, b) \in \mathcal{X}$ satisfies the query $s$ if and only if $f_s(a) = b$.

2. There will be a relatively small number, say $\kappa$, truly random queries. Such a query can be constructed by randomly choosing, for each $(a, b) \in \mathcal{X}$, whether or not $(a, b)$ will satisfy the query.

The hard to syntheticize distribution is generated as follows. First, we select a random string $s \in \{0, 1\}^\kappa$, specifying a function in our family. Next, we generate, for $n = \kappa$ distinct values $a_1, \ldots, a_n$ chosen at random from $[\ell]$ without replacement, the universe element $(a, f_s(a))$.

The intuition is simple, relies only on the first type of query, and does not make use of the distinctness of the $a_i$. Given a database $x$ generated according to our distribution, where the pseudo-random function is given by $s$, the syntheticizer must create a synthetic database (almost) all of whose rows must satisfy the query $s$. The intuition is that it can't reliably find input-output pairs that do not appear in $x$. A little more precisely, for an arbitrary element $a \in [\ell]$ such that no row in $x$ is of the form $(a, f_s(a))$, the pseudo-randomness of $f_s$ says that an efficient syntheticizer should have probability at most negligibly more than $1/\ell$ of finding $f_s(a)$. In this sense the pseudo-randomness gives us properties similar to, although somewhat weaker than, what we obtained from digital signatures.

Of course, for any given $a \in [\ell]$, the syntheticizer can indeed guess with probability $1/\ell$ the value $f_s(a)$, so without the second type of query, nothing obvious would stop it from ignoring $x$, choosing an arbitrary $a$, and outputting a database of $n$ copies of $(a, b)$, where $b$ is chosen uniformly at random from $[\ell]$. The intuition is now that such a synthetic database would give the wrong fraction (either zero or one, when the right answer should be about $1/2$) on the truly random queries.

Formally, we have:

Theorem 9.1. Let $f : \{0, 1\}^\kappa \to \{0, 1\}^\kappa$ be a one-way function. For every $a > 0$, and for every integer $n = \mathrm{poly}(\kappa)$, there exists a query family $\mathcal{Q}$ of size $\exp(\mathrm{poly}(\kappa))$, a data universe $\mathcal{X}$ of size $O(n^{2+2a})$, and a distribution on databases of size $n$ that is $(\mu, \alpha, \beta, 0, \mathcal{Q})$-hard-to-syntheticize (i.e., hard to syntheticize for worst-case queries) for $\alpha \le 1/3$, $\beta \le 1/10$ and $\mu = 1/40n^{1+a}$.

The above theorem shows hardness of sanitizing with synthetic data. Note, however, that when the query set is small one can always simply release noisy counts for every query. We conclude that sanitizing for small query classes (with large data universes) is a task that separates efficient syntheticizing from efficient synopsis generation (sanitization with arbitrary outputs).

9.2.1 Hardness results for general synopses

The hardness results of the previous section apply only to syntheticizers: offline mechanisms that create synthetic databases.
There is a tight connection between hardness for more general forms of privacy-preserving offline mechanisms, which we have been calling offline query release mechanisms or synopsis generators, and the existence of traitor tracing schemes, a method of content distribution in which (short) key strings are distributed to subscribers in such a way that a sender can broadcast encrypted messages that can be decrypted by any subscriber, and any useful "pirate" decoder constructed by a coalition of malicious subscribers can be traced to at least one colluder.

A (private-key, stateless) traitor-tracing scheme consists of algorithms Setup, Encrypt, Decrypt and Trace. The Setup algorithm generates a key $bk$ for the broadcaster and $N$ subscriber keys $k_1, \ldots, k_N$. The Encrypt algorithm encrypts a given bit using the broadcaster's key $bk$. The Decrypt algorithm decrypts a given ciphertext using any of the subscriber keys. The Trace algorithm gets the key $bk$ and oracle access to a (pirate, stateless) decryption box, and outputs the index $i \in \{1, \ldots, N\}$ of a key $k_i$ that was used to create the pirate box.

An important parameter of a traitor-tracing scheme is its collusion-resistance: a scheme is $t$-resilient if tracing is guaranteed to work as long as no more than $t$ keys are used to create the pirate decoder. When $t = N$, tracing works even if all the subscribers join forces to try and create a pirate decoder. A more complete definition follows.

Definition 9.3. A scheme (Setup, Encrypt, Decrypt, Trace) as above is a $t$-resilient traitor-tracing scheme if (i) the ciphertexts it generates are semantically secure (roughly speaking, polynomial time algorithms cannot distinguish encryptions of 0 from encryptions of 1), and (ii) no polynomial time adversary $A$ can "win" in the following game with non-negligible probability (over the coins of Setup, $A$, and Trace):

$A$ receives the number of users $N$ and a security parameter $\kappa$ and (adaptively) requests the keys of up to $t$ users $\{i_1, \ldots, i_t\}$. The adversary then outputs a pirate decoder Dec. The Trace algorithm is run with the key $bk$ and black-box access^5 to Dec; it outputs the name $i \in [N]$ of a user or the error symbol $\perp$. We say that an adversary $A$ "wins" if it is both the case that Dec has a non-negligible advantage in decrypting ciphertexts (even a weaker condition than creating a usable pirate decryption device), and the output of Trace is not in $\{i_1, \ldots, i_t\}$, meaning that the adversary avoided detection.

^5 Black-box access to an algorithm means that one has no access to the algorithm's internals; one can only feed inputs to the algorithm and observe its outputs.

The intuition for why traitor-tracing schemes imply hardness results for counting query release is as follows. Fix a traitor tracing scheme. We must describe databases and counting queries for which query release is computationally hard.

For any given $n = \kappa$, the database $x \in \{\{0, 1\}^d\}^n$ will contain user keys from the traitor tracing scheme of a colluding set of $n$ users; here $d$ is the length of the decryption keys obtained when the Setup algorithm is run on input $1^\kappa$. The query family $\mathcal{Q}_\kappa$ will have a query $q_c$ for each possible ciphertext $c$ asking "For what fraction of the rows $i \in [n]$ does $c$ decrypt to 1 under the key in row $i$?" Note that, since every user can decrypt, if the sender distributes an encryption $c$ of the bit 1, the answer will be 1: all the rows decrypt $c$ to 1, so the fraction of such rows is 1. If instead the sender distributes an encryption $c'$ of the bit 0, the answer will be 0: since no row decrypts $c'$ to 1, the fraction of rows decrypting $c'$ to 1 is 0. Thus, the exact answer to a query $q_c$, where $c$ is an encryption of a 1-bit message $b$, is $b$ itself.

Now, suppose there were an efficient offline differentially private query release mechanism for queries in $\mathcal{Q}$. The colluders could use this algorithm to efficiently produce a synopsis of the database enabling a data analyst to efficiently compute approximate answers to the queries $q_c$. If these approximations are at all non-trivial, then the analyst can use these to correctly decrypt. That is, the colluders could use this to form a pirate decoder box. But traitor tracing ensures that, for any such box, the Trace algorithm can recover the key of at least one user, i.e., a row of the database. This violates differential privacy, contradicting the assumption that there is an efficient differentially private algorithm for releasing $\mathcal{Q}$.

This direction has been used to rule out the existence of efficient offline sanitizers for a particular class of $2^{\tilde{O}(\sqrt{n})}$ counting queries; this can be extended to rule out the existence of efficient on-line sanitizers answering $\tilde{\Theta}(n^2)$ counting queries drawn adaptively from a second (large) class.

The intuition for why hardness of offline query release for counting queries implies traitor tracing is that failure to protect privacy immediately yields some form of traceability; that is, the difficulty of providing an object that yields (approximate) functional equivalence for a set of

rows (decryption keys) while preserving privacy of each individual row (decryption key), that is, the difficulty of producing an untraceable decoder, is precisely what we are looking for in a traitor tracing scheme.

In a little more detail, given a hard-to-sanitize database distribution and family of counting queries, a randomly drawn $n$-item database can act like a "master key," where the secret used to decrypt messages is the counts of random queries on this database. For a randomly chosen subset $S$ of $\mathrm{polylog}(n)$ queries, a random set of $\mathrm{polylog}(n)$ rows drawn from the database (very likely) yields good approximation to all queries in $S$. Thus, individual user keys can be obtained by randomly partitioning the database into $n/\mathrm{polylog}(n)$ sets of $\mathrm{polylog}(n)$ rows and assigning each set to a different user. These sets are large enough that with overwhelming probability their counts on a random collection of say $\mathrm{polylog}(n)$ queries are all close to the counts of the original database.

To complete the argument, one designs an encryption scheme in which decryption is equivalent to computing approximate counts on small sets of random queries. Since by definition a pirate decryption box can decrypt, a pirate box can be used to compute approximate counts. If we view this box as a sanitization of the database we conclude (because sanitizing is hard) that the decryption box can be "traced" to the keys (database items) that were used to create it.

9.3 Polynomial time adversaries

Definition 9.4 (Computational Differential Privacy). A randomized algorithm $C_\kappa : \mathcal{X}^n \to Y$ is $\varepsilon$-computationally differentially private if and only if for all databases $x, y$ differing in a single row, and for all nonuniform polynomial (in $\kappa$) algorithms $T$,

$$\Pr[T(C_\kappa(x)) = 1] \le e^{\varepsilon} \Pr[T(C_\kappa(y)) = 1] + \nu(\kappa),$$

where $\nu(\cdot)$ is any function that grows more slowly than the inverse of any polynomial and the algorithm $C_\kappa$ runs in time polynomial in $n$, $\log |\mathcal{X}|$, and $\kappa$.

Intuitively, this says that if the adversary is restricted to polynomial time then computationally differentially private mechanisms provide the same degree of privacy as do $(\varepsilon, \nu(\kappa))$-differentially private algorithms. In general there is no hope of getting rid of the $\nu(\kappa)$ term; for example, when encryption is involved there is always some (negligibly small) chance of guessing the decryption key.

Once we assume the adversary is restricted to polynomial time, we can use the powerful techniques of secure multiparty computation to provide distributed online query release algorithms, replacing the trusted server with a distributed protocol that simulates a trusted curator. Thus, for example, a set of hospitals, each holding the data of many patients, can collaboratively carry out statistical analyses of the union of their patients, while ensuring differential privacy for each patient. A more radical implication is that individuals can maintain their own data, opting in or out of each specific statistical query or study, all the while ensuring differential privacy of their own data.

We have already seen one distributed solution, at least for the problem of computing a sum of $n$ bits: randomized response. This solution requires no computational assumptions, and has an expected error of $\Theta(\sqrt{n})$. In contrast, the use of cryptographic assumptions permits much more accurate and extensive analyses, since by simulating the curator it can run a distributed implementation of the Laplace mechanism, which has constant expected error.

This leads to the natural question of whether there is some other approach, not relying on cryptographic assumptions, that yields better accuracy in the distributed setting than does randomized response. Or more generally, is there a separation between what can be accomplished with computational differential privacy and what can be achieved with "traditional" differential privacy?
That is, does cryptography provably buy us something?

In the multiparty setting the answer is yes. Still confining our attention to summing $n$ bits, we have:

Theorem 9.2. For $\varepsilon < 1$, every $n$-party $(\varepsilon, 0)$-differentially private protocol for computing the sum of $n$ bits (one per party) incurs error $\Omega(n^{1/2})$ with high probability.
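To make the error comparison in this section concrete, the following simulation contrasts the two ways of summing $n$ bits. It is a sketch under stated assumptions: randomized response is taken to report the true bit with probability 3/4 (which gives $\varepsilon = \ln 3$), the Laplace estimator is computed directly as a trusted curator would (no secure multiparty protocol is simulated), and the function names and parameter choices are hypothetical.

```python
import math
import random

EPSILON = math.log(3)  # privacy level of the 3/4-truthful randomized response

def randomized_response_estimate(bits, rng):
    """Each party reports its true bit with probability 3/4 and the flipped
    bit otherwise; the reports are then debiased to estimate the sum."""
    reports = sum(b if rng.random() < 0.75 else 1 - b for b in bits)
    # E[reports] = n/4 + (true sum)/2, so invert that affine map.
    return 2 * reports - len(bits) / 2

def laplace_estimate(bits, rng):
    """Exact sum plus Laplace noise of scale 1/EPSILON (sensitivity 1).
    The difference of two exponentials with rate EPSILON is Laplace."""
    noise = rng.expovariate(EPSILON) - rng.expovariate(EPSILON)
    return sum(bits) + noise

def mean_abs_error(estimator, n, trials, rng):
    total = 0.0
    for _ in range(trials):
        bits = [rng.randint(0, 1) for _ in range(n)]
        total += abs(estimator(bits, rng) - sum(bits))
    return total / trials

rng = random.Random(0)
rr_error = mean_abs_error(randomized_response_estimate, 2000, 100, rng)
lap_error = mean_abs_error(laplace_estimate, 2000, 100, rng)
```

In runs of this kind the randomized response error grows on the order of $\sqrt{n}$, while the Laplace error stays around $1/\varepsilon$ regardless of $n$.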

A similar theorem holds for $(\varepsilon, \delta)$-differential privacy provided $\delta \in o(1/n)$.

Proof (sketch). Let $X_1, \ldots, X_n$ be uniform independent bits. The transcript $T$ of the protocol is a random variable $T = T(P_1(X_1), \ldots, P_n(X_n))$, where for $i \in [n]$ the protocol of player $i$ is denoted $P_i$. Conditioned on $T = t$, the bits $X_1, \ldots, X_n$ are still independent bits, each with bias $O(\varepsilon)$. Further, by differential privacy, the uniformity of the $X_i$, and Bayes' Law we have:

$$\frac{\Pr[X_i = 1 \mid T = t]}{\Pr[X_i = 0 \mid T = t]} = \frac{\Pr[T = t \mid X_i = 1]}{\Pr[T = t \mid X_i = 0]} \le e^{\varepsilon} < 1 + 2\varepsilon.$$

To finish the proof we note that the sum of $n$ independent bits, each with constant bias, falls outside any interval of size $o(\sqrt{n})$ with high probability. Thus, with high probability, the sum $\sum_i X_i$ is not in the interval $[\mathrm{output}(T) - o(n^{1/2}), \mathrm{output}(T) + o(n^{1/2})]$.

A more involved proof shows a separation between computational differential privacy and ordinary differential privacy even for the two-party case. It is a fascinating open question whether computational assumptions buy us anything in the case of the trusted curator. Initial results are negative: for small numbers of real-valued queries, i.e., for a number of queries that does not grow with the security parameter, there is a natural class of utility measures, including $L_p$ distances and mean-squared errors, for which any computationally private mechanism can be converted to a statistically private mechanism that is roughly as efficient and achieves almost the same utility.

9.4 Bibliographic notes

The negative results for polynomial time bounded curators and the connection to traitor tracing are due to Dwork et al. [28].
The connection to traitor tracing was further investigated by Ullman [82], who showed that, assuming the existence of 1-way functions, it is computationally hard to answer $n^{2+o(1)}$ arbitrary linear queries with differential privacy (even if without privacy the answers are easy to compute). In "Our Data, Ourselves," Dwork, Kenthapadi, McSherry, Mironov, and Naor considered a distributed version of the precursor of differential privacy, using techniques from secure function evaluation in place of the trusted curator [21]. A formal study of computational differential privacy was initiated in [64], and the separation between the accuracy that can be achieved with $(\varepsilon, 0)$-differential privacy in the multiparty and single curator cases in Theorem 9.2 is due to McGregor et al. [58]. The initial results regarding whether computational assumptions on the adversary buy anything in the case of a trusted curator are due to Groce et al. [37].

Construction of pseudorandom functions from any one-way function is due to Håstad et al. [40].

10 Differential Privacy and Mechanism Design

One of the most fascinating areas of game theory is mechanism design, which is the science of designing incentives to get people to do what you want them to do. Differential privacy has proven to have interesting connections to mechanism design in a couple of unexpected ways. It provides a tool to quantify and control privacy loss, which is important if the people the mechanism designer is attempting to manipulate care about privacy. However, it also provides a way to limit the sensitivity of the outcome of a mechanism to the choices of any single person, which turns out to be a powerful tool even in the absence of privacy concerns. In this section, we give a brief survey of some of these ideas.

Mechanism Design is the problem of algorithm design when the inputs to the algorithm are controlled by individual, self-interested agents, rather than the algorithm designer himself. The algorithm maps its reported inputs to some outcome, over which the agents have preferences. The difficulty is that the agents may mis-report their data if doing so will cause the algorithm to output a different, preferred outcome, and so the mechanism designer must design the algorithm so that the agents are always incentivized to report their true data.

The concerns of mechanism design are very similar to the concerns of private algorithm design. In both cases, the inputs to the algorithm are thought of as belonging to some third party^1 which has preferences over the outcome. In mechanism design, we typically think of individuals as getting some explicit value from the outcomes of the mechanism. In private algorithm design, we typically think of the individual as experiencing some explicit harm from (consequences of) outcomes of the mechanism. Indeed, we can give a utility-theoretic definition of differential privacy which is equivalent to the standard definition, but makes the connection to individual utilities explicit:

Definition 10.1. An algorithm $A : \mathbb{N}^{|\mathcal{X}|} \to R$ is $\varepsilon$-differentially private if for every function $f : R \to \mathbb{R}_+$, and for every pair of neighboring databases $x, y \in \mathbb{N}^{|\mathcal{X}|}$:

$$\exp(-\varepsilon)\, \mathbb{E}_{z \sim A(y)}[f(z)] \le \mathbb{E}_{z \sim A(x)}[f(z)] \le \exp(\varepsilon)\, \mathbb{E}_{z \sim A(y)}[f(z)].$$

We can think of $f$ as being some function mapping outcomes to an arbitrary agent's utility for those outcomes. With this interpretation, a mechanism is $\varepsilon$-differentially private if for every agent it promises that their participation in the mechanism cannot affect their expected future utility by more than a factor of $\exp(\varepsilon)$, independent of what their utility function might be.

Let us now give a brief definition of a problem in mechanism design. A mechanism design problem is defined by several objects. There are $n$ agents $i \in [n]$, and a set of outcomes $\mathcal{O}$. Each agent has a type, $t_i \in \mathcal{T}$, which is known only to her, and there is a utility function over outcomes $u : \mathcal{T} \times \mathcal{O} \to [0, 1]$. The utility that agent $i$ gets from an outcome $o \in \mathcal{O}$ is $u(t_i, o)$, which we will often abbreviate as $u_i(o)$. We will write $t \in \mathcal{T}^n$ to denote vectors of all $n$ agent types, with $t_i$ denoting the type of agent $i$, and $t_{-i} \equiv (t_1, \ldots, t_{i-1}, t_{i+1}, \ldots, t_n)$ denoting the vector of types of all agents except agent $i$. The type of an agent $i$ completely specifies her utility over outcomes; that is, two agents $i \ne j$ such that $t_i = t_j$ will evaluate each outcome identically: $u_i(o) = u_j(o)$ for all $o \in \mathcal{O}$.

^1 In the privacy setting, the database administrator (such as a hospital) might already have access to the data itself, but is nevertheless acting so as to protect the interests of the agents who own the data when it endeavors to protect privacy.
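Definition 10.1 can be checked numerically on a very small mechanism. The sketch below is an illustration under assumptions of its own: the mechanism is one-bit randomized response that outputs the true bit with probability 3/4 (so $\varepsilon = \ln 3$), the two neighboring databases are the single-row inputs 0 and 1, and the names `output_distribution` and `expected_utility` are hypothetical. Expectations are computed exactly from the output distribution rather than by sampling.

```python
import math
import random

EPSILON = math.log(3)

def output_distribution(bit):
    """Randomized response on one bit: truthful output with probability 3/4."""
    return {bit: 0.75, 1 - bit: 0.25}

def expected_utility(f, dist):
    """Exact expectation of the nonnegative utility f over the outputs."""
    return sum(p * f[outcome] for outcome, p in dist.items())

rng = random.Random(0)
bound_holds = True
for _ in range(1000):
    # An arbitrary nonnegative utility function over the two outcomes.
    f = {0: rng.random(), 1: rng.random()}
    e_x = expected_utility(f, output_distribution(0))
    e_y = expected_utility(f, output_distribution(1))
    bound_holds = bound_holds and (
        math.exp(-EPSILON) * e_y - 1e-12 <= e_x <= math.exp(EPSILON) * e_y + 1e-12
    )

# The bound is tight for the utility that only values outcome 0.
extreme = {0: 1.0, 1: 0.0}
tight_ratio = (expected_utility(extreme, output_distribution(0))
               / expected_utility(extreme, output_distribution(1)))
```

Whatever nonnegative utility is drawn, the two expected utilities stay within a factor of $\exp(\varepsilon) = 3$ of each other, and the indicator utility on a single outcome attains that factor.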

A mechanism $M$ takes as input a set of reported types, one from each player, and selects an outcome. That is, a mechanism is a mapping $M : \mathcal{T}^n \to \mathcal{O}$. Agents will choose to report their types strategically so as to optimize their utility, possibly taking into account what (they think) the other agents will be doing. In particular, they need not report their true types to the mechanism. If an agent is always incentivized to report some type, no matter what her opponents are reporting, then reporting that type is called a dominant strategy. If reporting one's true type is a dominant strategy for every agent, then the mechanism is called truthful, or equivalently, dominant strategy truthful.

Definition 10.2. Given a mechanism $M : \mathcal{T}^n \to \mathcal{O}$, truthful reporting is an $\varepsilon$-approximate dominant strategy for player $i$ if for every pair of types $t_i, t_i' \in \mathcal{T}$, and for every vector of types $t_{-i}$:

$$u(t_i, M(t_i, t_{-i})) \;\geq\; u(t_i, M(t_i', t_{-i})) - \varepsilon.$$

If truthful reporting is an $\varepsilon$-approximate dominant strategy for every player, we say that $M$ is $\varepsilon$-approximately dominant strategy truthful. If $\varepsilon = 0$, then $M$ is exactly truthful.

That is, a mechanism is truthful if no agent can improve her utility by misrepresenting her type, no matter what the other players report.

Here we can immediately observe a syntactic connection to the definition of differential privacy. We may identify the type space $\mathcal{T}$ with the data universe $\mathcal{X}$. The input to the mechanism therefore consists of a database of size $n$, consisting of the reports of each agent. In fact, when an agent is considering whether she should truthfully report her type $t_i$ or lie, and misreport her type as $t_i'$, she is deciding which of two databases the mechanism should receive: $(t_1, \ldots, t_n)$, or $(t_1, \ldots, t_{i-1}, t_i', t_{i+1}, \ldots, t_n)$. Note that these two databases differ only in the report of agent $i$! That is, they are neighboring databases. Thus, differential privacy gives a guarantee of approximate truthfulness!

10.1 Differential privacy as a solution concept

One of the starting points for investigating the connection between differential privacy and game theory is observing that differential privacy

is a stronger condition than approximate truthfulness. Note that for $\varepsilon \leq 1$, $\exp(\varepsilon) \leq 1 + 2\varepsilon$, and utilities lie in $[0, 1]$, and so the following proposition is immediate.

Proposition 10.1. If a mechanism $M$ is $\varepsilon$-differentially private, then $M$ is also $2\varepsilon$-approximately dominant strategy truthful.

As a solution concept, this has several robustness properties that strategy proof mechanisms do not. By the composition property of differential privacy, the composition of two $\varepsilon$-differentially private mechanisms remains $4\varepsilon$-approximately dominant strategy truthful. In contrast, the incentive properties of general strategy proof mechanisms may not be preserved under composition.

Another useful property of differential privacy as a solution concept is that it generalizes to group privacy: suppose that $t$ and $t' \in \mathcal{T}^n$ are not neighbors, but instead differ in $k$ indices. Recall that by group privacy we then have for any player $i$: $\mathbb{E}_{o \sim M(t)}[u_i(o)] \leq \exp(k\varepsilon)\, \mathbb{E}_{o \sim M(t')}[u_i(o)]$. That is, changes in up to $k$ types change the expected utility by at most a factor of $\approx (1 + k\varepsilon)$, when $k \ll 1/\varepsilon$. Therefore, differentially private mechanisms make truthful reporting a $2k\varepsilon$-approximate dominant strategy even for coalitions of $k$ agents; i.e., differential privacy automatically provides robustness to collusion. Again, this is in contrast to general dominant-strategy truthful mechanisms, which in general offer no guarantees against collusion.

Notably, differential privacy allows for these properties in very general settings without the use of money! In contrast, the set of exactly dominant strategy truthful mechanisms when monetary transfers are not allowed is extremely limited.

We conclude with a drawback of using differential privacy as a solution concept as stated: not only is truthfully reporting one's type an approximate dominant strategy, any report is an approximate dominant strategy! That is, differential privacy makes the outcome approximately independent of any single agent's report. In some settings, this shortcoming can be alleviated. For example, suppose that $M$ is a differentially private mechanism, but that agent utility functions are defined to be functions both of the outcome of the mechanism, and of the reported type $t_i'$ of the agent: formally, we view the outcome space as $\mathcal{O}' = \mathcal{O} \times \mathcal{T}$. When the agent reports type $t_i'$ to the mechanism, and

the mechanism selects outcome $o \in \mathcal{O}$, then the utility experienced by the agent is controlled by the outcome $o' = (o, t_i')$. Now consider the underlying utility function $u : \mathcal{T} \times \mathcal{O}' \to [0, 1]$. Suppose we have that, fixing a selection $o$ of the mechanism, truthful reporting is a dominant strategy; that is, for all types $t_i, t_i'$, and for all outcomes $o \in \mathcal{O}$:

$$u(t_i, (o, t_i)) \;\geq\; u(t_i, (o, t_i')).$$

Then it remains the case that truthful reporting to an $\varepsilon$-differentially private mechanism $M : \mathcal{T}^n \to \mathcal{O}$ is a $2\varepsilon$-approximate dominant strategy, because for any misreport $t_i'$ that player $i$ might consider, we have:

$$\begin{aligned}
u(t_i, (M(t), t_i)) &= \mathbb{E}_{o \sim M(t)}[u(t_i, (o, t_i))] \\
&\geq \mathbb{E}_{o \sim M(t_i', t_{-i})}[u(t_i, (o, t_i))] - 2\varepsilon \\
&\geq \mathbb{E}_{o \sim M(t_i', t_{-i})}[u(t_i, (o, t_i'))] - 2\varepsilon \\
&= u(t_i, (M(t_i', t_{-i}), t_i')) - 2\varepsilon.
\end{aligned}$$

The first inequality follows from differential privacy, as in Proposition 10.1, and the second from the assumption above. However, we no longer have that every report is an approximate dominant strategy, because player $i$'s utility can depend arbitrarily on $o' = (o, t_i')$, and only $o$ (and not player $i$'s report $t_i'$ itself) is differentially private. This will be the case in all examples we consider here.

10.2 Differential privacy as a tool in mechanism design

In this section, we show how the machinery of differential privacy can be used as a tool in designing novel mechanisms.

10.2.1 Warmup: digital goods auctions

To warm up, let us consider a simple special case of the first application of differential privacy in mechanism design. Consider a digital goods auction, i.e., one where the seller has an unlimited supply of a good with zero marginal cost to produce, for example a piece of software or other digital media. There are $n$ unit demand buyers for this good, each with unknown valuation $v_i \in [0, 1]$. Informally, the valuation $v_i$ of a bidder $i$ represents the maximum amount of money that buyer $i$

would be willing to pay for a good. There is no prior distribution on the bidder valuations, so a natural revenue benchmark is the revenue of the best fixed price. At a price $p \in [0, 1]$, each bidder $i$ with $v_i \geq p$ will buy. Therefore the total revenue of the auctioneer is

$$\mathrm{Rev}(p, v) = p \cdot |\{i : v_i \geq p\}|.$$

The optimal revenue is the revenue of the best fixed price: $\mathrm{OPT} = \max_p \mathrm{Rev}(p, v)$. This setting is well studied: the best known result for exactly dominant strategy truthful mechanisms is a mechanism which achieves revenue at least $\mathrm{OPT} - O(\sqrt{n})$.

We show how a simple application of the exponential mechanism achieves revenue at least $\mathrm{OPT} - O\left(\frac{\log n}{\varepsilon}\right)$. That is, the mechanism trades exact for approximate truthfulness, but achieves an exponentially better revenue guarantee. Of course, it also inherits the benefits of differential privacy discussed previously, such as resilience to collusion, and composability.

The idea is to select a price from the exponential mechanism, using as our "quality score" the revenue that this price would obtain. Suppose we choose the range of the exponential mechanism to be $\mathcal{R} = \{\alpha, 2\alpha, \ldots, 1\}$. The size of the range is $|\mathcal{R}| = 1/\alpha$. What have we lost in potential revenue if we restrict ourselves to selecting a price from $\mathcal{R}$? It is not hard to see that

$$\mathrm{OPT}_{\mathcal{R}} \equiv \max_{p \in \mathcal{R}} \mathrm{Rev}(p, v) \;\geq\; \mathrm{OPT} - \alpha n.$$

This is because if $p^*$ is the price that achieves the optimal revenue, and we use a price $p$ such that $p^* - \alpha \leq p \leq p^*$, every buyer who bought at the optimal price continues to buy, and provides us with at most $\alpha$ less revenue per buyer. Since there are at most $n$ buyers, the total lost revenue is at most $\alpha n$.

So how do we parameterize the exponential mechanism? We have a family of discrete ranges $\mathcal{R}$, parameterized by $\alpha$. For a vector of values $v$ and a price $p \in \mathcal{R}$, we define our quality function to be $q(v, p) = \mathrm{Rev}(p, v)$. Observe that because each value $v_i \in [0, 1]$, we can restrict attention to prices $p \leq 1$ and hence, the sensitivity of $q$ is $\Delta = 1$: changing one bidder valuation can only change the revenue at a fixed

price by at most $v_i \leq 1$. Therefore, if we require $\varepsilon$-differential privacy, by Theorem 3.11, we get that with high probability, the exponential mechanism returns some price $p$ such that

$$\mathrm{Rev}(p, v) \;\geq\; (\mathrm{OPT} - \alpha n) - O\left(\frac{1}{\varepsilon} \ln\left(\frac{1}{\alpha}\right)\right).$$

Choosing our discretization parameter $\alpha$ to minimize the two sources of error, we find that this mechanism with high probability finds us a price that achieves revenue

$$\mathrm{Rev}(p, v) \;\geq\; \mathrm{OPT} - O\left(\frac{\log n}{\varepsilon}\right).$$

What is the right level to choose for the privacy parameter $\varepsilon$? Note that here, we do not necessarily view privacy itself as a goal of our computation. Rather, $\varepsilon$ is a way of trading off the revenue guarantee with an upper bound on agents' incentives to deviate. In the literature on large markets in economics, a common goal when exact truthfulness is out of reach is "asymptotic truthfulness": the maximum incentive that any agent has to deviate from his truthful report tends to $0$ as the size of the market $n$ grows large. To achieve a result like that here, all we need to do is set $\varepsilon$ to be some diminishing function in the number of agents $n$. For example, if we take $\varepsilon = 1/\log(n)$, then we obtain a mechanism that is asymptotically exactly truthful (i.e., as the market grows large, the approximation to truthfulness becomes exact). We can also ask what our approximation to the optimal revenue is as $n$ grows large. Note that our approximation to the optimal revenue is only additive, and so even with this setting of $\varepsilon$, we can still guarantee revenue at least $(1 - o(1))\mathrm{OPT}$, so long as $\mathrm{OPT}$ grows more quickly than $\log(n)^2$ with the size of the population $n$.

Finally, notice that we could make the reported value $v_i$ of each agent $i$ binding. In other words, we could allocate an item to agent $i$ and extract payment of the selected posted price $p$ whenever $v_i \geq p$. If we do this, the mechanism is approximately truthful, because the price is picked using a differentially private mechanism. Additionally, it is not the case that every report is an approximate dominant strategy: if an agent over-reports, she may be forced to buy the good at a price higher than her true value.
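The price-selection step of the digital goods auction can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the text: the function names `revenue` and `select_price` are hypothetical, the grid is taken to be $\{1/m, 2/m, \ldots, 1\}$ (so $\alpha = 1/m$), and the sampling weights $\exp(\varepsilon \cdot \mathrm{Rev}(p, v)/2)$ assume the standard form of the exponential mechanism for a quality score of sensitivity 1.

```python
import math
import random


def revenue(price, valuations):
    """Rev(p, v) = p * |{i : v_i >= p}|."""
    return price * sum(1 for v in valuations if v >= price)


def select_price(valuations, epsilon, num_prices, rng=random):
    """Sample a posted price from the grid {1/m, ..., 1}, m = num_prices.

    Assumption: the quality score is the revenue (sensitivity 1), so each
    price is drawn with probability proportional to exp(epsilon * Rev / 2).
    """
    prices = [j / num_prices for j in range(1, num_prices + 1)]
    scores = [revenue(p, valuations) for p in prices]
    best = max(scores)
    # Subtracting the best score leaves the distribution unchanged and
    # avoids overflow in exp() when epsilon * score is large.
    weights = [math.exp(epsilon * (s - best) / 2.0) for s in scores]
    return rng.choices(prices, weights=weights, k=1)[0]
```

A call such as `select_price(reported_values, epsilon=0.5, num_prices=len(reported_values))` corresponds to the choice $\alpha = 1/n$, one natural (assumed) way to balance the discretization loss $\alpha n$ against the $\frac{1}{\varepsilon}\ln\frac{1}{\alpha}$ term.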

10.2.2 Approximately truthful equilibrium selection mechanisms

We now consider the problem of approximately truthful equilibrium selection. We recall the definition of a Nash Equilibrium: Suppose each player has a set of actions $\mathcal{A}$, and can choose to play any action $a_i \in \mathcal{A}$. Suppose, moreover, that outcomes are merely choices of actions that the agents might choose to play, and so agent utility functions are defined as $u : \mathcal{T} \times \mathcal{A}^n \to [0, 1]$. Then:

Definition 10.3. A set of actions $a \in \mathcal{A}^n$ is an $\varepsilon$-approximate Nash equilibrium if for all players $i$ and for all actions $a_i'$:

$$u_i(a) \;\geq\; u_i(a_i', a_{-i}) - \varepsilon.$$

In other words, every agent is simultaneously playing an (approximate) best response to what the other agents are doing, assuming they are playing according to $a$.

Roughly speaking, the problem is as follows: suppose we are given a game in which each player knows their own payoffs, but not others' payoffs (i.e., the players do not know what the types are of the other agents). The players therefore do not know the equilibrium structure of this game. Even if they did, there might be multiple equilibria, with different agents preferring different equilibria. Can a mechanism offered by an intermediary incentivize agents to truthfully report their utilities and follow the equilibrium it selects?

For example, imagine a city in which (say) Google Navigation is the dominant service. Every morning, each person enters their starting point and destination, receives a set of directions, and chooses his/her route according to those directions. Is it possible to design a navigation service such that each agent is incentivized to both (1) report truthfully, and (2) then follow the driving directions provided? Both misreporting start and end points, and truthfully reporting start and end points but then following a different (shorter) path, are to be disincentivized.

Intuitively, our two desiderata are in conflict. In the commuting example above, if we are to guarantee that every player is incentivized to truthfully follow their suggested route, then we must compute an

equilibrium of the game in question given players' reports. On the other hand, to do so, our suggested route to some player $i$ must depend on the reported location/destination pairs of other players. This tension will pose a problem in terms of incentives: if we compute an equilibrium of the game given the reports of the players, an agent can potentially benefit by misreporting, causing us to compute an equilibrium of the wrong game.

This problem would be largely alleviated, however, if the report of agent $i$ only has a tiny effect on the actions of agents $j \neq i$. In this case, agent $i$ could hardly gain an advantage through his effect on other players. Then, assuming that everyone truthfully reported their type, the mechanism would compute an equilibrium of the correct game, and by definition, each agent $i$ could do no better than follow the suggested equilibrium action. In other words, if we could compute an approximate equilibrium of the game under the constraint of differential privacy, then truthful reporting, followed by taking the suggested action of the coordination device, would be a Nash equilibrium. A moment's reflection reveals that the goal of privately computing an equilibrium is not possible in small games, in which an agent's utility is a highly sensitive function of the actions (and hence utility functions) of the other agents. But what about in large games?

Formally, suppose we have an $n$ player game with action set $\mathcal{A}$, and each agent with type $t_i$ has a utility function $u_i : \mathcal{A}^n \to [0, 1]$. We say that this game is $\Delta$-large if for all players $i \neq j$, vectors of actions $a \in \mathcal{A}^n$, and pairs of actions $a_j, a_j' \in \mathcal{A}$:

$$\left| u_i(a_j, a_{-j}) - u_i(a_j', a_{-j}) \right| \;\leq\; \Delta.$$

In other words, if some agent $j$ unilaterally changes his action, then his effect on the payoff of any other agent $i \neq j$ is at most $\Delta$. Note that if agent $j$ changes his own action, then his payoff can change arbitrarily. Many games are "large" in this sense. In the commuting example above, if Alice changes her route to work she may substantially increase or decrease her commute time, but will only have a minimal impact on the commute time of any other agent Bob. The results in this section are strongest for $\Delta = O(1/n)$, but hold more generally.
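To make the definition of a $\Delta$-large game concrete, the following Python sketch computes the smallest such $\Delta$ for a small game by brute force over all action profiles. Both functions are illustrative assumptions rather than anything defined in the text: `largeness` is a hypothetical helper, and `congestion_utility` is a made-up routing game in which a player's payoff is one minus the fraction of players sharing her route.

```python
from itertools import product


def largeness(utility, n_players, actions):
    """Smallest Delta for which the game is Delta-large (brute force).

    `utility(i, profile)` returns player i's payoff in [0, 1] for a tuple
    of actions. Only the effect of player j on *other* players i != j counts.
    """
    delta = 0.0
    for profile in product(actions, repeat=n_players):
        for j in range(n_players):
            for alt in actions:
                # Player j unilaterally switches to action `alt`.
                deviated = profile[:j] + (alt,) + profile[j + 1:]
                for i in range(n_players):
                    if i == j:
                        continue
                    change = abs(utility(i, profile) - utility(i, deviated))
                    delta = max(delta, change)
    return delta


def congestion_utility(i, profile):
    """Hypothetical routing game: 1 minus the share of players on i's route."""
    same_route = sum(1 for a in profile if a == profile[i])
    return 1.0 - same_route / len(profile)
```

In this toy game a single switch changes the load on any other player's route by at most one player, so the computed value is $1/n$, matching the regime $\Delta = O(1/n)$ mentioned above. The enumeration is exponential in $n$ and is meant only for very small examples.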

First we might ask whether we need privacy at all: could it be the case that in a large game, any algorithm which computes an equilibrium of a game defined by reported types has the stability property that we want? The answer is no. As a simple example, consider $n$ people who must each choose whether to go to the beach (B) or the mountains (M). People privately know their types: each person's utility depends on his own type, his action, and the fraction of other people $p$ who go to the beach. A beach type gets a payoff of $10p$ if he visits the beach, and $5(1 - p)$ if he visits the mountain. A mountain type gets a payoff of $5p$ from visiting the beach, and $10(1 - p)$ from visiting the mountain. Note that this is a large (i.e., low sensitivity) game: each player's payoffs are insensitive to the actions of others. Further, note that "everyone visits beach" and "everyone visits mountain" are both equilibria of the game, regardless of the realization of types. Consider the mechanism that attempts to implement the following social choice rule: "if the number of beach types is less than half the population, send everyone to the beach, and vice versa." It should be clear that if mountain types are just in the majority, then each mountain type has an incentive to misreport as a beach type; and vice versa. As a result, even though the game is "large" and agents' actions do not affect others' payoffs significantly, simply computing equilibria from reported type profiles does not in general lead to even approximately truthful mechanisms.

Nevertheless, it turns out to be possible to give a mechanism with the following property: it elicits the type $t_i$ of each agent, and then computes an $\alpha$-approximate correlated equilibrium of the game defined by the reported types.² (In some cases, it is possible to strengthen this result to compute an approximate Nash equilibrium of the underlying game.) It draws an action profile $a \in \mathcal{A}^n$ from the correlated equilibrium, and reports action $a_i$ to each agent $i$. The algorithm has the guarantee that simultaneously for all players $i$, the joint distribution $a_{-i}$ on reports to all players other than $i$ is differentially private in the reported type of agent $i$. When the algorithm computes a correlated equilibrium of the underlying game, this guarantee is sufficient for a restricted form of approximate truthfulness: agents who have the option to opt in or opt out of the mechanism (but not to misreport their type if they opt in) have no incentive to opt out, because no agent $i$ can substantially change the distribution on actions induced on the other players by opting out. Moreover, given that he opts in, no agent has incentive not to follow his suggested action, as his suggestion is part of a correlated equilibrium. When the mechanism computes a Nash equilibrium of the underlying game, then the mechanism becomes truthful even when agents have the ability to misreport their type to the mechanism when they opt in.

² A correlated equilibrium is defined by a joint distribution on profiles of actions, $\mathcal{A}^n$. For an action profile $a$ drawn from the distribution, if agent $i$ is told only $a_i$, then playing action $a_i$ is a best response given the induced conditional distribution over $a_{-i}$. An $\alpha$-approximate correlated equilibrium is one where deviating improves an agent's utility by at most $\alpha$.

More specifically, when these mechanisms compute an $\alpha$-approximate Nash equilibrium while satisfying $\varepsilon$-differential privacy, every agent following the honest behavior (i.e., first opting in and reporting their true type, then following their suggested action) forms a $(2\varepsilon + \alpha)$-approximate Nash equilibrium. This is because, by privacy, reporting your true type is a $2\varepsilon$-approximate dominant strategy, and given that everybody reports their true type, the mechanism computes an $\alpha$-approximate equilibrium of the true game, and hence by definition, following the suggested action is an $\alpha$-approximate best response. There exist mechanisms for computing an $\alpha$-approximate equilibrium in large games with $\alpha = O\left(\frac{1}{\sqrt{n}\,\varepsilon}\right)$. Therefore, by setting $\varepsilon = O\left(\frac{1}{n^{1/4}}\right)$, this gives an $\eta$-approximately truthful equilibrium selection mechanism for

$$\eta = 2\varepsilon + \alpha = O\left(\frac{1}{n^{1/4}}\right).$$

In other words, it gives a mechanism for coordinating equilibrium behavior in large games that is asymptotically truthful in the size of the game, all without the need for monetary transfers.

10.2.3 Obtaining exact truthfulness

So far we have discussed mechanisms that are asymptotically truthful in large population games. However, what if we want to insist on mechanisms that are exactly dominant strategy truthful, while maintaining

some of the nice properties enjoyed by our mechanisms so far: for example, that the mechanisms do not need to be able to extract monetary payments? Can differential privacy help here? It can: in this section, we discuss a framework which uses differentially private mechanisms as a building block toward designing exactly truthful mechanisms without money.

The basic idea is simple and elegant. As we have seen, the exponential mechanism can often give excellent utility guarantees while preserving differential privacy. This doesn't yield an exactly truthful mechanism, but it gives every agent very little incentive to deviate from truthful behavior. What if we could pair this with a second mechanism which need not have good utility guarantees, but gives each agent a strict positive incentive to report truthfully, i.e., a mechanism that essentially only punishes non-truthful behavior? Then, we could randomize between running the two mechanisms. If we put enough weight on the punishing mechanism, then we inherit its strict-truthfulness properties. The remaining weight that is put on the exponential mechanism contributes to the utility properties of the final mechanism. The hope is that since the exponential mechanism is approximately strategy proof to begin with, the randomized mechanism can put small weight on the strictly truthful punishing mechanism, and therefore will have good utility properties.

To design punishing mechanisms, we will have to work in a slightly non-standard environment. Rather than simply picking an outcome, we can model a mechanism as picking an outcome, and then an agent as choosing a reaction to that outcome, which together define his utility. Mechanisms will then have the power to restrict the reactions allowed by the agent based on his reported type. Formally, we will work in the following framework:

Definition 10.4 (The Environment). An environment is a set $N$ of $n$ players, a set of types $t_i \in \mathcal{T}$, a finite set $\mathcal{O}$ of outcomes, a set of reactions $R$ and a utility function $u : \mathcal{T} \times \mathcal{O} \times R \to [0, 1]$.

We write $r_i(t, s, \hat{R}_i) \in \arg\max_{r \in \hat{R}_i} u(t, s, r)$ to denote $i$'s optimal reaction among choices $\hat{R}_i \subseteq R$ to alternative $s$ if he is of type $t$.

A direct revelation mechanism $M$ defines a game which is played as follows:

1. Each player $i$ reports a type $t_i' \in \mathcal{T}$.
2. The mechanism chooses an alternative $s \in \mathcal{O}$ and a subset $\hat{R}_i \subseteq R$ of reactions, for each player $i$.
3. Each player $i$ chooses a reaction $r_i \in \hat{R}_i$ and experiences utility $u(t_i, s, r_i)$.

Agents play so as to maximize their own utility. Note that since there is no further interaction after the 3rd step, rational agents will pick $r_i = r_i(t_i, s, \hat{R}_i)$, and so we can ignore this as a strategic step. Let $\mathcal{R} = 2^R$. Then a mechanism is a randomized mapping $M : \mathcal{T}^n \to \mathcal{O} \times \mathcal{R}^n$.

Let us consider the utilitarian welfare criterion: $F(t, s, r) = \frac{1}{n} \sum_{i=1}^{n} u(t_i, s, r_i)$. Note that this has sensitivity $\Delta = 1/n$, since each agent's utility lies in the range $[0, 1]$. Hence, if we simply choose an outcome $s$ and allow each agent to play their best response reaction, the exponential mechanism is an $\varepsilon$-differentially private mechanism, which by Theorem 3.11, achieves social welfare at least $\mathrm{OPT} - O\left(\frac{\log |\mathcal{O}|}{\varepsilon n}\right)$ with high probability. Let us denote this instantiation of the exponential mechanism, with quality score $F$, range $\mathcal{O}$ and privacy parameter $\varepsilon$, as $M_\varepsilon$.

The idea is to randomize between the exponential mechanism (with good social welfare properties) and a strictly truthful mechanism which punishes false reporting (but with poor social welfare properties). If we mix appropriately, then we will get an exactly truthful mechanism with reasonable social welfare guarantees.

Here is one such punishing mechanism which is simple, but not necessarily the best for a given problem:

Definition 10.5. The commitment mechanism $M^P(t')$ selects $s \in \mathcal{O}$ uniformly at random and sets $\hat{R}_i = \{r_i(t_i', s, R_i)\}$, i.e., it picks a random outcome and forces everyone to react as if their reported type was their true type.

Define the gap of an environment as

$$\gamma = \min_{i,\; t_i \neq t_i',\; t_{-i}} \; \max_{s \in \mathcal{O}} \Big( u(t_i, s, r_i(t_i, s, R_i)) - u(t_i, s, r_i(t_i', s, R_i)) \Big),$$

i.e., $\gamma$ is a lower bound over players and types of the worst-case cost (over $s$) of mis-reporting. Note that for each player, this worst case is realized with probability at least $1/|\mathcal{O}|$. Therefore we have the following simple observation:

Lemma 10.2. For all $i, t_i, t_i', t_{-i}$:

$$u(t_i, M^P(t_i, t_{-i})) \;\geq\; u(t_i, M^P(t_i', t_{-i})) + \frac{\gamma}{|\mathcal{O}|}.$$

Note that the commitment mechanism is strictly truthful: every individual has at least a $\frac{\gamma}{|\mathcal{O}|}$ incentive not to lie.

This suggests an exactly truthful mechanism with good social welfare guarantees:

Definition 10.6. The punishing exponential mechanism $M^P_\varepsilon(t)$ defined with parameter $0 \leq q \leq 1$ selects the exponential mechanism $M_\varepsilon(t)$ with probability $1 - q$ and the punishing mechanism $M^P(t)$ with complementary probability $q$.

Observe that by linearity of expectation, we have for all $t_i, t_i', t_{-i}$:

$$\begin{aligned}
u(t_i, M^P_\varepsilon(t_i, t_{-i})) &= (1 - q) \cdot u(t_i, M_\varepsilon(t_i, t_{-i})) + q \cdot u(t_i, M^P(t_i, t_{-i})) \\
&\geq (1 - q) \Big( u(t_i, M_\varepsilon(t_i', t_{-i})) - 2\varepsilon \Big) + q \left( u(t_i, M^P(t_i', t_{-i})) + \frac{\gamma}{|\mathcal{O}|} \right) \\
&= u(t_i, M^P_\varepsilon(t_i', t_{-i})) - (1 - q) 2\varepsilon + q \frac{\gamma}{|\mathcal{O}|} \\
&= u(t_i, M^P_\varepsilon(t_i', t_{-i})) - 2\varepsilon + q \left( 2\varepsilon + \frac{\gamma}{|\mathcal{O}|} \right).
\end{aligned}$$

The following two theorems show incentive and social welfare properties of this mechanism.

Theorem 10.3. If $2\varepsilon \leq \frac{q\gamma}{|\mathcal{O}|}$ then $M^P_\varepsilon$ is strictly truthful.
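The arithmetic behind the mixture can be checked with a short Python sketch. The function names below are hypothetical, and the two sub-mechanisms are passed in as opaque callables rather than implemented; the sketch only assumes the condition of Theorem 10.3 and the last line of the derivation above.

```python
import random


def min_punishing_weight(epsilon, gamma, num_outcomes):
    """Smallest q satisfying 2*epsilon <= q * gamma / |O|."""
    q = 2.0 * epsilon * num_outcomes / gamma
    if q > 1.0:
        raise ValueError("need epsilon <= gamma / (2 |O|) for q to be a probability")
    return q


def truthfulness_margin(epsilon, gamma, num_outcomes, q):
    """Lower bound on the gain from truthful reporting over any misreport:
    -2*epsilon + q * (2*epsilon + gamma / |O|)."""
    return -2.0 * epsilon + q * (2.0 * epsilon + gamma / num_outcomes)


def run_punishing_exponential(q, run_exponential, run_commitment, rng=random):
    """With probability q run the commitment mechanism, otherwise the
    exponential mechanism. Both arguments are caller-supplied callables."""
    if rng.random() < q:
        return run_commitment()
    return run_exponential()
```

For instance, with the illustrative values $\varepsilon = 0.01$, $\gamma = 0.5$ and $|\mathcal{O}| = 10$, the smallest admissible weight is $q = 0.4$, and the margin evaluates to $2q\varepsilon$, which is strictly positive whenever $q$ and $\varepsilon$ are.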

Note that we also have utility guarantees for this mechanism. Setting the parameter $q$ so that we have a truthful mechanism:

$$\begin{aligned}
\mathbb{E}_{s, \hat{R} \sim M^P_\varepsilon}[F(t, s, r(t, s, \hat{R}))] &\geq (1 - q) \cdot \mathbb{E}_{s, \hat{R} \sim M_\varepsilon}[F(t, s, r(t, s, \hat{R}))] \\
&= \left( 1 - \frac{2\varepsilon |\mathcal{O}|}{\gamma} \right) \cdot \mathbb{E}_{s, \hat{R} \sim M_\varepsilon}[F(t, s, r(t, s, \hat{R}))] \\
&\geq \left( 1 - \frac{2\varepsilon |\mathcal{O}|}{\gamma} \right) \cdot \left( \max_{t,s,r} F(t, s, r) - O\left( \frac{1}{\varepsilon n} \log |\mathcal{O}| \right) \right) \\
&\geq \max_{t,s,r} F(t, s, r) - \frac{2\varepsilon |\mathcal{O}|}{\gamma} - O\left( \frac{1}{\varepsilon n} \log |\mathcal{O}| \right).
\end{aligned}$$

Setting

$$\varepsilon \in O\left( \sqrt{\frac{\log |\mathcal{O}| \, \gamma}{|\mathcal{O}| \, n}} \right)$$

we find:

$$\mathbb{E}_{s, \hat{R} \sim M^P_\varepsilon}[F(t, s, r(t, s, \hat{R}))] \;\geq\; \max_{t,s,r} F(t, s, r) - O\left( \sqrt{\frac{|\mathcal{O}| \log |\mathcal{O}|}{\gamma n}} \right).$$

Note that in this calculation, we assume that $\varepsilon \leq \gamma / (2|\mathcal{O}|)$ so that $q = \frac{2\varepsilon |\mathcal{O}|}{\gamma} \leq 1$ and the mechanism is well defined. This is true for sufficiently large $n$. That is, we have shown:

Theorem 10.4. For sufficiently large $n$, $M^P_\varepsilon$ achieves social welfare at least

$$\mathrm{OPT} - O\left( \sqrt{\frac{|\mathcal{O}| \log |\mathcal{O}|}{\gamma n}} \right).$$

Note that this mechanism is truthful without the need for payments!

Let us now consider an application of this framework: the facility location game. Suppose that a city wants to build $k$ hospitals to minimize the average distance between each citizen and their closest hospital. To simplify matters, we make the mild assumption that the city is built on a discretization of the unit line.³ Formally, let $L(m) = \{0, \frac{1}{m}, \frac{2}{m}, \ldots, 1\}$ denote the discrete unit line with step-size $1/m$, so that $|L(m)| = m + 1$. Let $\mathcal{T} = R_i = L(m)$ for all $i$ and let $\mathcal{O} = L(m)^k$. Define the utility of agent $i$ to be:

$$u(t_i, s, r_i) = \begin{cases} -|t_i - r_i|, & \text{if } r_i \in s; \\ -1, & \text{otherwise.} \end{cases}$$

³ If this is not the case, we can easily raze and then re-build the city.

In other words, agents are associated with points on the line, and an outcome is an assignment of a location on the line to each of the $k$ facilities. Agents can react to a set of facilities by deciding which one to go to, and their cost for such a decision is the distance between their own location (i.e., their type) and the facility that they have chosen. Note that $r_i(t_i, s)$ is here the closest facility $r_i \in s$.

We can instantiate Theorem 10.4. In this case, we have: $|\mathcal{O}| = (m + 1)^k$ and $\gamma = 1/m$, because any two positions $t_i \neq t_i'$ differ by at least $1/m$. Hence, we have:

Theorem 10.5. $M^P_\varepsilon$ instantiated for the facility location game is strictly truthful and achieves social welfare at least:

$$\mathrm{OPT} - O\left( \sqrt{\frac{k m (m + 1)^k \log m}{n}} \right).$$

This is already very good for small numbers of facilities $k$, since we expect that $\mathrm{OPT} = \Omega(1)$.

10.3 Mechanism design for privacy aware agents

In the previous section, we saw that differential privacy can be useful as a tool to design mechanisms, for agents who care only about the outcome chosen by the mechanism. We here primarily viewed privacy as a tool to accomplish goals in traditional mechanism design. As a side effect, these mechanisms also preserved the privacy of the reported player types. Is this itself a worthy goal? Why might we want our mechanisms to preserve the privacy of agent types?

A bit of reflection reveals that agents might care about privacy. Indeed, basic introspection suggests that in the real world, agents value the ability to keep certain "sensitive" information private, for example,

health information or sexual preferences. In this section, we consider the question of how to model this value for privacy, and various approaches taken in the literature.

Given that agents might have preferences for privacy, it is worth considering the design of mechanisms that preserve privacy as an additional goal, even for tasks such as welfare maximization that we can already solve non-privately. As we will see, it is indeed possible to privately generalize the VCG mechanism to approximately optimize social welfare in any social choice problem, with a smooth trade-off between the privacy parameter and the approximation parameter, all while guaranteeing exact dominant strategy truthfulness.

However, we might wish to go further. In the presence of agents with preferences for privacy, if we wish to design truthful mechanisms, we must somehow model their preferences for privacy in their utility function, and then design mechanisms which are truthful with respect to these new "privacy aware" utility functions. As we have seen with differential privacy, it is most natural to model privacy as a property of the mechanism itself. Thus, our utility functions are not merely functions of the outcome, but functions of the outcome and of the mechanism itself. In almost all models, agent utilities for outcomes are treated as linearly separable, that is, we will have for each agent $i$,

$$u_i(o, M, t) \equiv \mu_i(o) - c_i(o, M, t).$$

Here $\mu_i(o)$ represents agent $i$'s utility for outcome $o$ and $c_i(o, M, t)$ the (privacy) cost that agent $i$ experiences when outcome $o$ is chosen with mechanism $M$.

We will first consider perhaps the simplest (and most naïve) model for the privacy cost function $c_i$. Recall that for $\varepsilon \ll 1$, differential privacy promises that for each agent $i$, and for every possible utility function $f_i$, type vector $t \in \mathcal{T}^n$, and deviation $t_i' \in \mathcal{T}$:

$$\left| \mathbb{E}_{o \sim M(t_i, t_{-i})}[f_i(o)] - \mathbb{E}_{o \sim M(t_i', t_{-i})}[f_i(o)] \right| \;\leq\; 2\varepsilon\, \mathbb{E}_{o \sim M(t)}[f_i(o)].$$

If we view $f_i$ as representing the "expected future utility" for agent $i$, it is therefore natural to model agent $i$'s cost for having his data used in an $\varepsilon$-differentially private computation as being linear in $\varepsilon$. That is,

210 206 Differential Privacy and Mechanism Design as being parameterized by some value v we think of agent ∈ R , and i i take: , o, M c ) = εv ( , t i i where ε is the smallest value such that M is ε -differentially private. v o to represent a quantity like E . In this )] Here we imagine ( f [ i i ( M ∼ o t ) or the type profile o c t . setting, does not depend on the outcome i Using this naïve privacy measure, we discuss a basic problem in private data analysis: how to collect the data, when the owners of the data value their privacy and insist on being compensated for it. In this setting, there is no “outcome” that agents value, other than payments, there is only dis-utility for privacy loss. We will then discuss shortcomings of this (and other) measures of the dis-utility for privacy loss, as well as privacy in more general mechanism design settings when do have utility for the outcome of the mechanism. agents A private generalization of the VCG mechanism 10.3.1 Suppose we have a general social choice problem, defined by an outcome space O , and a set of agents N with arbitrary preferences over the u : outcomes given by O → [0 , 1] . We might want to choose an outcome i ∑ n 1 social welfare F ( o ) = o . It is well ∈ O to maximize the ) o u ( i =1 i n known that in any such setting, the VCG mechanism can implement the ∗ o outcome which exactly maximizes the social welfare, while charging payments that make truth-telling a dominant strategy. What if we want to achieve the same result, while also preserving privacy? How must the trade off with our approximation to the optimal privacy parameter ε social welfare? Recall that we could use the exponential mechanism to choose o an outcome , with quality score F . For privacy parameter ∈ O ε M ∝ defined to be Pr[ , this would give a distribution ] = o M ε ε ) ( εF o ) ( exp . 
Moreover, this mechanism has good social welfare prop- 2 n erties: with probability 1 − β , it selects some o such that: F ( o ) ≥ ( ) |O| 2 ∗ − o F ( ) ln . But as we saw, differential privacy only gives β εn ε -approximate truthfulness.
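The exponential mechanism over outcomes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: utilities are given as a list of per-agent lists with entries in $[0, 1]$, the sensitivity of $F$ is taken to be $1/n$ (one agent changing her report), and the function name `welfare_exponential_mechanism` is hypothetical.

```python
import math
import random


def welfare_exponential_mechanism(utilities, epsilon, rng=None):
    """Sample an outcome with probability proportional to
    exp(epsilon * F(o) / (2 * sensitivity)).

    utilities[i][o] is agent i's reported utility in [0, 1] for outcome o.
    Returns (chosen outcome index, sampling probabilities, welfare values).
    """
    rng = rng or random.Random()
    n = len(utilities)
    num_outcomes = len(utilities[0])
    # Normalized social welfare F(o) = (1/n) * sum_i u_i(o).
    welfare = [sum(u[o] for u in utilities) / n for o in range(num_outcomes)]
    # Assumption: one agent changing her report moves F by at most 1/n.
    sensitivity = 1.0 / n
    scores = [epsilon * f / (2.0 * sensitivity) for f in welfare]
    # Subtract the largest score before exponentiating, for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    chosen = rng.choices(range(num_outcomes), weights=probs, k=1)[0]
    return chosen, probs, welfare
```

Returning the probabilities alongside the sampled outcome is a convenience for inspection only; a deployed mechanism would release just the outcome.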

However, it can be shown that $M_\varepsilon$ is the solution to the following exact optimization problem:

$$M_\varepsilon = \arg\max_{\mathcal{D} \in \Delta\mathcal{O}} \left( \mathbb{E}_{o \sim \mathcal{D}}[F(o)] + \frac{2}{\varepsilon n} H(\mathcal{D}) \right),$$

where $H$ represents the Shannon Entropy of the distribution $\mathcal{D}$. In other words, the exponential mechanism is the distribution which exactly maximizes the expected social welfare, plus the entropy of the distribution weighted by $2/(\varepsilon n)$. This is significant for the following reason: it is known that any mechanism that exactly maximizes expected player utilities in any finite range (known as maximal in distributional range mechanisms) can be paired with payments to be made exactly dominant strategy truthful. The exponential mechanism is the distribution that exactly maximizes expected social welfare, plus entropy. In other words, if we imagine that we have added a single additional player whose utility is exactly the entropy of the distribution, then the exponential mechanism is maximal in distributional range. Hence, it can be paired with payments that make truthful reporting a dominant strategy for all players, and in particular for the $n$ real players. Moreover, it can be shown how to charge payments in such a way as to preserve privacy. The upshot is that for any social choice problem, the social welfare can be approximated in a manner that both preserves differential privacy, and is exactly truthful.

10.3.2 The sensitive surveyor's problem

In this section, we consider the problem of a data analyst who wishes to conduct a study using the private data of a collection of individuals. However, he must convince these individuals to hand over their data! Individuals experience costs for privacy loss. The data analyst can mitigate these costs by guaranteeing differential privacy and compensating them for their loss, while trying to get a representative sample of data.

Consider the following stylized problem of the sensitive surveyor Alice.
She is tasked with conducting a survey of a set of $n$ individuals $N$, to determine what proportion of the individuals $i \in N$ satisfy some property $P(i)$. Her ultimate goal is to discover the true value of this statistic, $s = \frac{1}{n} |\{i \in N : P(i)\}|$, but if that is not possible, she will be satisfied with some estimate $\hat{s}$ such that the error, $|\hat{s} - s|$, is minimized. We will adopt a notion of accuracy based on large deviation bounds, and say that a surveying mechanism is $\alpha$-accurate if $\Pr[|\hat{s} - s| \ge \alpha] \le \frac{1}{3}$. The inevitable catch is that individuals value their privacy and will not participate in the survey for free. Individuals experience some cost as a function of their loss in privacy when they interact with Alice, and must be compensated for this loss. To make matters worse, these individuals are rational (i.e., selfish) agents, and are apt to misreport their costs to Alice if doing so will result in a financial gain. This places Alice's problem squarely in the domain of mechanism design, and requires Alice to develop a scheme for trading off statistical accuracy with cost, all while managing the incentives of the individuals.

As an aside, this stylized problem is broadly relevant to any organization that makes use of collections of potentially sensitive data. This includes, for example, the use of search logs to provide search query completion and the use of browsing history to improve search engine ranking, the use of social network data to select display ads and to recommend new links, and the myriad other data-driven services now available on the web. In all of these cases, value is being derived from the statistical properties of a collection of sensitive data in exchange for some payment.⁴

Collecting data in exchange for some fixed price could lead to a biased estimate of population statistics, because such a scheme will result in collecting data only from those individuals who value their privacy less than the price being offered. However, without interacting with the agents, we have no way of knowing what price we can offer so that we will have broad enough participation to guarantee that the answer we collect has only small bias.
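The bias caused by a fixed price can be seen in a toy calculation. The sketch below assumes the linear cost model (an agent accepts the offer exactly when $\varepsilon v_i$ is at most the price) and uses an entirely hypothetical population in which agents with the sensitive property value privacy more; the function name and all numbers are illustrative only.

```python
def fixed_price_estimate(bits, values, epsilon, price):
    """Estimate the fraction of agents with bit 1 using only agents who accept.

    Assumption: agent i accepts iff her linear privacy cost
    epsilon * values[i] is at most the offered price.
    Returns None if nobody participates.
    """
    accepted = [b for b, v in zip(bits, values) if epsilon * v <= price]
    if not accepted:
        return None
    return sum(accepted) / len(accepted)


# Hypothetical population: 30 agents with the property (high value for privacy)
# and 70 agents without it (low value for privacy).
bits = [1] * 30 + [0] * 70
values = [10.0] * 30 + [1.0] * 70
true_fraction = sum(bits) / len(bits)

# A low offer attracts only the low-value agents, so the estimate is biased.
low_offer = fixed_price_estimate(bits, values, epsilon=0.1, price=0.5)
# A high enough offer attracts everyone and recovers the true fraction.
high_offer = fixed_price_estimate(bits, values, epsilon=0.1, price=2.0)
```

In this population the low offer yields an estimate of zero even though 30 percent of agents have the property, which is the selection effect the paragraph above describes.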
⁴ The payment need not be explicit and/or dollar denominated; for example, it may be the use of a "free" service.

To obtain an accurate estimate of the statistic, it is therefore natural to consider buying private data using an auction, as a means of discovering this price. There are two obvious obstacles which one must confront when conducting an auction for private data, and an additional obstacle which is less obvious but more insidious. The first obstacle is that one must have a quantitative formalization of "privacy" which can be used to measure agents' costs under various operations on their data. Here, differential privacy provides an obvious tool. For small values of $\varepsilon$, $\exp(\varepsilon) \approx 1 + \varepsilon$, and so, as discussed earlier, a simple (but possibly naive) first cut at a model is to view each agent as having some linear cost for participating in a private study. We here imagine that each agent $i$ has an unknown value for privacy $v_i$, and experiences a cost $c_i(\varepsilon) = \varepsilon v_i$ when his private data is used in an $\varepsilon$-differentially private manner.⁵ The second obstacle is that our objective is to trade off cost with statistical accuracy, and the latter is not a well-studied objective in mechanism design.

The final, more insidious obstacle, is that an individual's cost for privacy loss may be highly correlated with his private data itself! Suppose we only know Bob has a high value for privacy of his AIDS status, but do not explicitly know his AIDS status itself. This is already disclosive because Bob's AIDS status is likely correlated with his value for privacy, and knowing that he has a high cost for privacy lets us update our belief about what his private data might be. More to the point, suppose that in the first step of a survey of AIDS prevalence, we ask each individual to report their value for privacy, with the intention of then running an auction to choose which individuals to buy data from. If agents report truthfully, we may find that the reported values naturally form two clusters: low value agents, and high value agents. In this case, we may have learned something about the population statistic even before collecting any data or making any payments, and therefore the agents will have already experienced a cost. As a result, the agents may misreport their value, which could introduce a bias in the survey results.
This phenomenon makes direct revelation mechanisms problematic, and distinguishes this problem from classical mechanism design.

⁵ As we will discuss later, this assumption can be problematic.

Armed with a means of quantifying an agent $i$'s loss for allowing his data to be used by an $\varepsilon$-differentially-private algorithm ($c_i(\varepsilon) = \varepsilon v_i$), we are almost ready to describe results for the sensitive surveyor's problem. Recall that a differentially private algorithm is some mapping $M : \mathcal{T}^n \to \mathcal{O}$, for a general type space $\mathcal{T}$. It remains to define what exactly the type space $\mathcal{T}$ is. We will consider two models. In both models, we will associate with each individual a bit $b_i \in \{0, 1\}$ which represents whether they satisfy the sensitive predicate $P(i)$, as well as a value for privacy $v_i \in \mathbb{R}^+$.

1. In the insensitive value model, we calculate the $\varepsilon$ parameter of the private mechanism by letting the type space be $\mathcal{T} = \{0, 1\}$: i.e., we measure privacy cost only with respect to how the mechanism treats the sensitive bit $b_i$, and ignore how it treats the reported values for privacy, $v_i$.⁶

2. In the sensitive value model, we calculate the $\varepsilon$ parameter of the private mechanism by letting the type space be $\mathcal{T} = (\{0, 1\} \times \mathbb{R}^+)$: i.e., we measure privacy with respect to how it treats the pair $(b_i, v_i)$ for each individual.

Intuitively, the insensitive value model treats individuals as ignoring the potential privacy loss due to correlations between their values for privacy and their private bits, whereas the sensitive value model treats individuals as assuming these correlations are worst-case, i.e., their values $v_i$ are just as disclosive as their private bits $b_i$. It is known that in the insensitive value model, one can derive approximately optimal direct revelation mechanisms that achieve high accuracy and low cost. By contrast, in the sensitive value model, no individually rational direct revelation mechanism can achieve any non-trivial accuracy.

This leaves a somewhat unsatisfying state of affairs. The sensitive value model captures the delicate issues that we really want to deal with, and yet there we have an impossibility result! Getting around this result in a satisfying way (e.g., by changing the model, or the powers of the mechanism) remains an intriguing open question.
⁶ That is, the part of the mapping dealing with reported values need not be differentially private.

10.3.3 Better measures for the cost of privacy

In the previous section, we took the naive modeling assumption that the cost experienced by participation in an $\varepsilon$-differentially private mechanism $M$ was $c_i(o, M, t) = \varepsilon v_i$ for some numeric value $v_i$. This measure is problematic for several reasons. First, although differential privacy promises that any agent's loss in utility is upper bounded by a quantity that is (approximately) linear in $\varepsilon$, there is no reason to believe that agents' costs are lower bounded by such a quantity. That is, while taking $c_i(o, M, t) \le \varepsilon v_i$ is well motivated, there is little support for making the inequality an equality. Second, (it turns out) any privacy measure which is a deterministic function only of $\varepsilon$ (not just a linear function) leads to problematic behavioral predictions.

So how else might we model $c_i$? One natural measure is the mutual information between the reported type of agent $i$, and the outcome of the mechanism. For this to be well defined, we must be in a world where each agent's type is drawn from a known prior, $t_i \sim \mathcal{T}$. Each agent's strategy is a mapping $\sigma_i : \mathcal{T} \to \mathcal{T}$, determining what type he reports, given his true type. We could then define

$$c_i(o, M, \sigma) = I(\mathcal{T}; M(t_{-i}, \sigma(\mathcal{T}))),$$

where $I$ is the mutual information between the random variable $\mathcal{T}$ representing the prior on agent $i$'s type, and $M(t_{-i}, \sigma(\mathcal{T}))$, the random variable representing the outcome of the mechanism, given agent $i$'s strategy.

This measure has significant appeal, because it represents how "related" the output of the mechanism is to the true type of agent $i$. However, in addition to requiring a prior over agent types, observe an interesting paradox that results from this measure of privacy loss. Consider a world in which there are two kinds of sandwich breads: Rye (R), and Wheat (W). Moreover, in this world, sandwich preferences are highly embarrassing and held private. The prior on types $\mathcal{T}$ is uniform over R and W, and the mechanism $M$ simply gives agent $i$ a sandwich of the type that he purports to prefer. Now consider two possible strategies, $\sigma_{\text{truthful}}$ and $\sigma_{\text{random}}$. $\sigma_{\text{truthful}}$ corresponds to truthfully reporting sandwich preferences (and subsequently leads to eating the preferred sandwich type), while $\sigma_{\text{random}}$ randomly reports independent of true type (and results in the preferred sandwich only half the time). The cost of using the random strategy is $I(\mathcal{T}; M(t_{-i}, \sigma_{\text{random}}(\mathcal{T}))) = 0$, since the output is independent of agent $i$'s type. On the other hand, the cost of truthfully reporting is $I(\mathcal{T}; M(t_{-i}, \sigma_{\text{truthful}}(\mathcal{T}))) = 1$, since the sandwich outcome is now the identity function on agent $i$'s type. However, from the perspective of any outside observer, the two strategies are indistinguishable! In both cases, agent $i$ receives a uniformly random sandwich. Why then should anyone choose the random strategy? So long as an adversary believes they are choosing randomly, they should choose the honest strategy.

Another approach, which does not need a prior on agent types, is as follows. We may model agents as having a cost function $c_i$ that satisfies:

$$|c_i(o, M, t)| = \ln\left( \max_{t_i, t_i' \in \mathcal{T}} \frac{\Pr[M(t_i, t_{-i}) = o]}{\Pr[M(t_i', t_{-i}) = o]} \right).$$

Note that if $M$ is $\varepsilon$-differentially private, then

$$\max_{t \in \mathcal{T}^n} \max_{o \in \mathcal{O}} \max_{t_i, t_i' \in \mathcal{T}} \ln\left( \frac{\Pr[M(t_i, t_{-i}) = o]}{\Pr[M(t_i', t_{-i}) = o]} \right) \le \varepsilon.$$

That is, we can view differential privacy as bounding the worst-case privacy loss over all possible outcomes, whereas the measure proposed here considers only the privacy loss for the outcome $o$ (and type vector $t$) actually realized. Thus, for any differentially private mechanism $M$, $|c_i(o, M, t)| \le \varepsilon$ for all $o, t$, but it will be important that the cost can vary by outcome.

We can then consider the following allocation rule for maximizing social welfare $F(o) = \sum_{i=1}^{n} u_i(o)$.⁷ We discuss the case when $|\mathcal{O}| = 2$ (which does not require payments), but it is possible to analyze the general case (with payments), which privately implements the VCG mechanism for any social choice problem.

1. For each outcome $o \in \mathcal{O}$, choose a random number $r_o$ from the distribution $\Pr[r_o = x] \propto \exp(-\varepsilon |x|)$.

2. Output $o^* = \arg\max_{o \in \mathcal{O}} (F(o) + r_o)$.

⁷ This allocation rule is extremely similar to, and indeed can be modified to be identical to the exponential mechanism.

The above mechanism is $\varepsilon$-differentially private, and it is truthful for privacy aware agents, so long as for each agent $i$, and for the two outcomes $o, o' \in \mathcal{O}$, $|\mu_i(o) - \mu_i(o')| > 2\varepsilon$. Note that this will be true for small enough $\varepsilon$ so long as agent utilities for outcomes are distinct. The analysis proceeds by considering an arbitrary fixed realization of the random variables $r_o$, and an arbitrary deviation $t_i'$ from truthful reporting for the $i$th agent. There are two cases. In the first case, the deviation does not change the outcome $o$ of the mechanism. In this case, neither the agent's utility for the outcome $\mu_i$, nor his cost for privacy loss $c_i$ change at all, and so the agent does not benefit from deviating. In the second case, if the outcome changes from $o$ to $o'$ when agent $i$ deviates, it must be that $\mu_i(o') < \mu_i(o) - 2\varepsilon$. By differential privacy, however, $|c_i(o, M, t) - c_i(o', M, t)| \le 2\varepsilon$, and so the change in privacy cost cannot be enough to make it beneficial.

Finally, the most conservative approach to modeling costs for privacy generally considered is as follows. Given an $\varepsilon$-differentially private mechanism $M$, assume only that

$$c_i(o, M, t) \le \varepsilon v_i,$$

for some number $v_i$. This is similar to the linear cost functions that we considered earlier, but crucially, here we assume only an upper bound. This assumption is satisfied by all of the other models for privacy cost that we have considered thus far. It can be shown that many mechanisms that combine a differentially private algorithm with a punishing mechanism that has the ability to restrict user choices, like those that we considered in Section 10.2.3, maintain their truthfulness properties in the presence of agents with preferences for privacy, so long as the values $v_i$ are bounded.

10.4 Bibliographical notes

This section is based off of a survey of Pai and Roth [70] and a survey of Roth [73].
The connections between differential privacy and mechanism design were first suggested by Jason Hartline and investigated by McSherry and Talwar in their seminal work, "Mechanism Design via Differential Privacy" [61], where they considered the application of differential privacy to designing approximately truthful digital goods auctions. The best result for exactly truthful mechanisms in the digital goods setting is due to Balcan et al. [2].

The problem of designing exactly truthful mechanisms using differential privacy as a tool was first explored by Nissim, Smorodinsky, and Tennenholtz in [69], who also first posed a criticism of differential privacy (by itself) used as a solution concept. The example in this section of using differential privacy to obtain exactly truthful mechanisms is taken directly from [69]. The sensitive surveyor's problem was first considered by Ghosh and Roth [36], and expanded on by [56, 34, 75, 16]. Fleischer and Lyu [34] consider the Bayesian setting discussed in this section, and Ligett and Roth [56] consider the worst-case setting with take-it-or-leave-it offers, both in an attempt to get around the impossibility result of [36]. Ghosh and Ligett consider a related model in which participation decisions (and privacy guarantees) are determined only in equilibrium [35].

The question of conducting mechanism design in the presence of agents who explicitly value privacy as part of their utility function was first raised by the influential work of Xiao [85], who considered (among other measures for privacy cost) the mutual information cost function. Following this, Chen et al. [15] and Nissim et al. [67] showed how in two distinct models, truthful mechanisms can sometimes be designed even for agents who value privacy. Chen, Chong, Kash, Moran, and Vadhan considered the outcome-based cost function that we discussed in this section, and Nissim, Orlandi, and Smorodinsky considered the conservative model of only upper bounding each agent's cost by a linear function in $\varepsilon$. The "sandwich paradox" of valuing privacy according to mutual information is due to Nissim, Orlandi, and Smorodinsky.

Huang and Kannan proved that the exponential mechanism could be made exactly truthful with the addition of payments [49]. Kearns, Pai, Roth, and Ullman showed how differential privacy could be used to derive asymptotically truthful equilibrium selection mechanisms [54] by privately computing correlated equilibria in large games. These results were strengthened by Rogers and Roth [71], who showed how to privately compute approximate Nash equilibria in large congestion games, which leads to stronger incentive properties of the mechanism. Both of these papers use the solution concept of "Joint Differential Privacy," which requires that for every player $i$, the joint distribution on messages sent to other players $j \ne i$ be differentially private in $i$'s report. This solution concept has also proven useful in other private mechanism design settings, including an algorithm for computing private matchings by Hsu et al. [47].

11 Differential Privacy and Machine Learning

One of the most useful tasks in data analysis is machine learning: the problem of automatically finding a simple rule to accurately predict certain unknown characteristics of never before seen data. Many machine learning tasks can be performed under the constraint of differential privacy. In fact, the constraint of privacy is not necessarily at odds with the goals of machine learning, both of which aim to extract information from the distribution from which the data was drawn, rather than from individual data points. In this section, we survey a few of the most basic results on private machine learning, without attempting to cover this large field completely.

The goal in machine learning is very often similar to the goal in private data analysis. The learner typically wishes to learn some simple rule that explains a data set. However, she wishes this rule to generalize: the rule she learns should not only correctly describe the data that she has on hand, but should also correctly describe new data that is drawn from the same distribution. Generally, this means that she wants to learn a rule that captures distributional information about the data set on hand, in a way that does not depend too specifically on any single data point. Of course, this is exactly the goal of private data analysis: to reveal distributional information about the private data set, without revealing too much about any single individual in the dataset. It should come as no surprise then that machine learning and private data analysis are closely linked. In fact, as we will see, we are often able to perform private machine learning nearly as accurately, with nearly the same number of examples, as we can perform non-private machine learning.

Let us first briefly define the problem of machine learning. Here, we will follow Valiant's PAC (or Probably Approximately Correct) model of machine learning. Let $\mathcal{X} = \{0, 1\}^d$ be the domain of "unlabeled examples." Think of each $x \in \mathcal{X}$ as a vector containing $d$ boolean attributes. We will think of vectors $x \in \mathcal{X}$ as being paired with labels $y \in \{0, 1\}$.

Definition 11.1. A labeled example is a pair $(x, y) \in \mathcal{X} \times \{0, 1\}$: a vector paired with a label.

A learning problem is defined as a distribution $\mathcal{D}$ over labeled examples. The goal will be to find a function $f : \mathcal{X} \to \{0, 1\}$ that correctly labels almost all of the examples drawn from the distribution.

Definition 11.2. Given a function $f : \mathcal{X} \to \{0, 1\}$ and a distribution $\mathcal{D}$ over labeled examples, the error rate of $f$ on $\mathcal{D}$ is:

$$\mathrm{err}(f, \mathcal{D}) = \Pr_{(x, y) \sim \mathcal{D}}[f(x) \ne y].$$

We can also define the error rate of $f$ over a finite sample $D$:

$$\mathrm{err}(f, D) = \frac{1}{|D|} |\{(x, y) \in D : f(x) \ne y\}|.$$

A learning algorithm gets to observe some number of labeled examples drawn from $\mathcal{D}$, and has the goal of finding a function $f$ with as small an error rate as possible when measured on $\mathcal{D}$. Two parameters in measuring the quality of a learning algorithm are its running time, and the number of examples it needs to see in order to find a good hypothesis.

Definition 11.3. An algorithm $A$ is said to PAC-learn a class of functions $C$ over $d$ dimensions if for every $\alpha, \beta > 0$, there exists an $m = \mathrm{poly}(d, 1/\alpha, \log(1/\beta))$ such that for every distribution $\mathcal{D}$ over labeled examples, $A$ takes as input $m$ labeled examples drawn from $\mathcal{D}$ and outputs a hypothesis $f \in C$ such that with probability $1 - \beta$:

$$\mathrm{err}(f, \mathcal{D}) \le \min_{f^* \in C} \mathrm{err}(f^*, \mathcal{D}) + \alpha.$$

If $\min_{f^* \in C} \mathrm{err}(f^*, \mathcal{D}) = 0$, the learner is said to operate in the realizable setting (i.e., there exists some function in the class which perfectly labels the data). Otherwise, the learner is said to operate in the agnostic setting. If $A$ also has run time that is polynomial in $d$, $1/\alpha$, and $\log(1/\beta)$, then the learner is said to be efficient. If there is an algorithm which PAC-learns $C$, then $C$ is said to be PAC-learnable.

The above definition of learning allows the learner to have direct access to labeled examples. It is sometimes also useful to consider models of learning in which the algorithm only has oracle access to some noisy information about $\mathcal{D}$.

Definition 11.4. A statistical query is some function $\varphi : \mathcal{X} \times \{0, 1\} \to [0, 1]$. A statistical query oracle for a distribution over labeled examples $\mathcal{D}$ with tolerance $\tau$ is an oracle $O_{\mathcal{D}}^{\tau}$ such that for every statistical query $\varphi$:

$$\left| O_{\mathcal{D}}^{\tau}(\varphi) - \mathbb{E}_{(x, y) \sim \mathcal{D}}[\varphi(x, y)] \right| \le \tau.$$

In other words, an SQ oracle takes as input a statistical query $\varphi$, and outputs some value that is guaranteed to be within $\pm\tau$ of the expected value of $\varphi$ on examples drawn from $\mathcal{D}$.

The statistical query model of learning was introduced to model the problem of learning in the presence of noise.

Definition 11.5. An algorithm $A$ is said to SQ-learn a class of functions $C$ over $d$ dimensions if for every $\alpha, \beta > 0$ there exists an $m = \mathrm{poly}(d, 1/\alpha, \log(1/\beta))$ such that $A$ makes at most $m$ queries of tolerance $\tau = 1/m$ to $O_{\mathcal{D}}^{\tau}$, and with probability $1 - \beta$, outputs a hypothesis $f \in C$ such that:

$$\mathrm{err}(f, \mathcal{D}) \le \min_{f^* \in C} \mathrm{err}(f^*, \mathcal{D}) + \alpha.$$
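To make Definition 11.4 concrete, here is a small Python sketch of a statistical query oracle for a distribution that is given explicitly as a finite table of probabilities. This is an assumption made purely for illustration (in the learning model the distribution is unknown), and the names `make_sq_oracle` and `error_query` are hypothetical. The oracle perturbs the exact expectation by an arbitrary amount within the tolerance, here chosen uniformly at random.

```python
import random


def make_sq_oracle(pmf, tau, rng=None):
    """Build a statistical query oracle for an explicitly given distribution.

    pmf maps labeled examples (x, y) to probabilities (assumed to sum to 1).
    The returned oracle answers a query phi, a function into [0, 1], with the
    exact expectation perturbed by some amount in [-tau, tau].
    """
    rng = rng or random.Random()

    def oracle(phi):
        exact = sum(p * phi(x, y) for (x, y), p in pmf.items())
        return exact + rng.uniform(-tau, tau)

    return oracle


def error_query(f):
    """Statistical query whose expectation is the error rate of hypothesis f."""
    return lambda x, y: 1.0 if f(x) != y else 0.0
```

An SQ learner would call the oracle on queries such as `error_query(f)` for candidate hypotheses `f`, never touching individual examples.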

Note that an SQ learning algorithm does not get any access to $\mathcal{D}$ except through the SQ oracle. As with PAC learning, we can talk about an SQ learning algorithm operating in either the realizable or the agnostic setting, and talk about the computational efficiency of the learning algorithm. We say that a class $C$ is SQ learnable if there exists an SQ learning algorithm for $C$.

11.1 The sample complexity of differentially private machine learning

Perhaps the first question that one might ask, with respect to the relationship between privacy and learning, is "When is it possible to privately perform machine learning?" In other words, you might ask for a PAC learning algorithm that takes as input a dataset (implicitly assumed to be sampled from some distribution $\mathcal{D}$), and then privately outputs a hypothesis $f$ that with high probability has low error over the distribution. A more nuanced question might be, "How many additional samples are required to privately learn, as compared with the number of samples already required to learn without the constraint of differential privacy?" Similarly, "How much additional run-time is necessary to privately learn, as compared with the run-time required to learn non-privately?" We will here briefly sketch known results for $(\varepsilon, 0)$-differential privacy. In general, better results for $(\varepsilon, \delta)$-differential privacy will follow from using the advanced composition theorem.

A foundational information theoretic result in private machine learning is that private PAC learning is possible with a polynomial number of samples if and only if non-private PAC learning is possible with a polynomial number of samples, even in the agnostic setting. In fact, the increase in sample complexity necessary is relatively small; however, this result does not preserve computational efficiency. One way to do this is directly via the exponential mechanism. We can instantiate the exponential mechanism with a range $R = C$, equal to the class of queries to be learned. Given a database $D$, we can use the quality score $q(f, D) = -\frac{1}{|D|} |\{(x, y) \in D : f(x) \ne y\}|$: i.e., we seek to minimize the fraction of misclassified examples in the private dataset. This is clearly a $1/n$ sensitive function of the private data, and so we have via our utility theorem for the exponential mechanism that with probability $1 - \beta$, this mechanism returns a function $f \in C$ that correctly labels an $\mathrm{OPT} - \frac{2(\log |C| + \log \frac{1}{\beta})}{\varepsilon n}$ fraction of the points in the database.

Recall, however, that in the learning setting, we view the database $D$ as consisting of $n$ i.i.d. draws from some distribution over labeled examples $\mathcal{D}$. Recall the discussion of sampling bounds in Lemma 4.3. A Chernoff bound combined with a union bound tells us that with high probability, if $D$ consists of $n$ i.i.d. samples drawn from $\mathcal{D}$, then for all $f \in C$: $|\mathrm{err}(f, D) - \mathrm{err}(f, \mathcal{D})| \le O\left(\sqrt{\frac{\log |C|}{n}}\right)$. Hence, if we wish to find a hypothesis that has error within $\alpha$ of the optimal error on the distribution $\mathcal{D}$, it suffices to draw a database $D$ consisting of $n \ge \log |C| / \alpha^2$ samples, and learn the best classifier $f^*$ on $D$.

Now consider the problem of privately PAC learning, using the exponential mechanism as described above. Recall that, by Theorem 3.11, it is highly unlikely that the exponential mechanism will return a function $f$ with utility score that is inferior to that of the optimal $f^*$ by more than an additive factor of $O((\Delta u / \varepsilon) \log |C|)$, where in this case $\Delta u$, the sensitivity of the utility function, is $1/n$. That is, with high probability the exponential mechanism will return a function $f \in C$ such that:

$$\mathrm{err}(f, D) \le \min_{f^* \in C} \mathrm{err}(f^*, D) + O\left( \frac{\log |C|}{\varepsilon n} \right) \le \min_{f^* \in C} \mathrm{err}(f^*, \mathcal{D}) + O\left( \sqrt{\frac{\log |C|}{n}} \right) + O\left( \frac{\log |C|}{\varepsilon n} \right).$$

Hence, if we wish to find a hypothesis that has error within $\alpha$ of the optimal error on the distribution $\mathcal{D}$, it suffices to draw a database $D$ consisting of:

$$n \ge O\left( \max\left( \frac{\log |C|}{\varepsilon \alpha}, \frac{\log |C|}{\alpha^2} \right) \right),$$

which is not asymptotically any more than the database size that is required for non-private learning, whenever $\varepsilon \ge \alpha$.
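The private learner described above can be sketched as follows for a small, explicitly enumerated hypothesis class. This is a minimal illustration rather than an efficient algorithm: it assumes the class is given as a Python list of functions, takes the sensitivity of the quality score to be $1/n$, and uses hypothetical names (`empirical_error`, `private_learner`). The toy class of single-coordinate predictors in the usage lines is also an assumption made for the example.

```python
import itertools
import math
import random


def empirical_error(f, data):
    """Fraction of labeled examples (x, y) in data that f mislabels."""
    return sum(1 for x, y in data if f(x) != y) / len(data)


def private_learner(hypotheses, data, epsilon, rng=None):
    """Pick a hypothesis via the exponential mechanism with score -err(f, D).

    Returns (index of the chosen hypothesis, its empirical error).
    """
    rng = rng or random.Random()
    n = len(data)
    errors = [empirical_error(f, data) for f in hypotheses]
    # Score q(f, D) = -err(f, D) has sensitivity 1/n (assumed), so the
    # exponent epsilon * q / (2 * sensitivity) equals -epsilon * n * err / 2.
    scores = [-epsilon * n * err / 2.0 for err in errors]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    index = rng.choices(range(len(hypotheses)), weights=weights, k=1)[0]
    return index, errors[index]


# Toy usage: 200 examples over {0,1}^3 whose label is the middle coordinate,
# and a class containing one predictor per coordinate.
points = list(itertools.product([0, 1], repeat=3)) * 25
data = [(x, x[1]) for x in points]
hypotheses = [lambda x, j=j: x[j] for j in range(3)]
index, err = private_learner(hypotheses, data, epsilon=1.0, rng=random.Random(0))
```

The running time is linear in the size of the class, which is why this approach does not preserve computational efficiency when $C$ is exponentially large.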

A corollary of this simple calculation¹ is that (ignoring computational efficiency), a class of functions $C$ is PAC learnable if and only if it is privately PAC learnable.

Can we say something stronger about a concept class $C$ that is SQ learnable? Observe that if $C$ is efficiently SQ learnable, then the learning algorithm for $C$ need only access the data through an SQ oracle, which is very amenable to differential privacy: note that an SQ oracle answers an expectation query defined over a predicate $\varphi(x, y) \in [0, 1]$, $\mathbb{E}_{(x, y) \sim \mathcal{D}}[\varphi(x, y)]$, which is only $1/n$ sensitive when estimated on a database $D$ which is a sample of size $n$ from $\mathcal{D}$. Moreover, the learning algorithm does not need to receive the answer exactly, but can be run with any answer $a$ that has the property that $|\mathbb{E}_{(x, y) \sim \mathcal{D}}[\varphi(x, y)] - a| \le \tau$: that is, the algorithm can be run using noisy answers on low sensitivity queries. The benefit of this is that we can answer such queries computationally efficiently, using the Laplace mechanism, but at the expense of requiring a potentially large sample size. Recall that the Laplace mechanism can answer $m$ $1/n$ sensitive queries with $(\varepsilon, 0)$-differential privacy and with expected worst-case error $\alpha = O\left(\frac{m \log m}{\varepsilon n}\right)$. Therefore, an SQ learning algorithm which requires the answers to $m$ queries with accuracy $\alpha$ can be run with a sample size of $n = O\left(\max\left(\frac{m \log m}{\varepsilon \alpha}, \frac{\log m}{\alpha^2}\right)\right)$. Let us compare this to the sample size required for a non-private SQ learner. If the SQ learner needs to make $m$ queries to tolerance $\alpha$, then by a Chernoff bound and a union bound, a sample size of $O(\log m / \alpha^2)$ suffices. Note that for $\varepsilon = O(1)$ and error $\alpha = O(1)$, the non-private algorithm potentially requires exponentially fewer samples.
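The reduction from SQ learning to the Laplace mechanism can be sketched as below. The code answers $m$ statistical queries on a sample, adding Laplace noise of scale $m/(\varepsilon n)$ to each empirical mean. It assumes neighboring databases differ by replacing one record, so that each query has sensitivity $1/n$ and the vector of $m$ answers has $\ell_1$ sensitivity at most $m/n$; the function names are hypothetical, and the Laplace sampler is built from two exponential draws because the Python standard library has no direct Laplace sampler.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as a difference of two exponentials."""
    # expovariate takes the rate, i.e. 1 / mean.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_sq_answers(data, queries, epsilon, rng=None):
    """Answer each statistical query on the sample with Laplace noise.

    data is a list of labeled examples (x, y); each query maps (x, y)
    into [0, 1]. The noise scale m / (epsilon * n) is calibrated to the
    assumed l1 sensitivity m / n of the vector of all m answers.
    """
    rng = rng or random.Random()
    n, m = len(data), len(queries)
    scale = m / (epsilon * n)
    answers = []
    for phi in queries:
        empirical = sum(phi(x, y) for x, y in data) / n
        answers.append(empirical + laplace_noise(scale, rng))
    return answers
```

An SQ learner can then be run against these noisy answers in place of a true oracle, provided $n$ is large enough that both the sampling error and the added noise stay below the tolerance it requires.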
However, at the error tolerance ≤ 1 /m as allowed in the definition of SQ learn- α ing, the sample complexity for private SQ learning is no worse than the sample complexity for non-private SQ learning, for = Θ(1) . ε information theoretically The upshot is that , privacy poses very little hinderance to machine learning. Moreover, for any algorithm that 2 then the reduction to accesses the data only though an SQ oracle, 1 Together with corresponding lower bounds that show that for general , it is C 2 o (log | C | /α not possible to non-privately PAC learn using a sample with ) points. 2 And in fact, almost every class (with the lone exception of parity functions ) of functions known to be PAC learnable is also learnable using only an SQ oracle.

226 222 Differential Privacy and Machine Learning private learning is immediate via the Laplace mechanism, and preserves computational efficiency! Differentially private online learning 11.2 In this section, we consider a slightly different learning problem, known . This problem will appear learning from expert advice as the problem of somewhat different from the classification problems that we discussed in the previous section, but in fact, the simple algorithm presented here is extremely versatile, and can be used to perform classification among many other tasks which we will not discuss here. Imagine that you are betting on horse races, but unfortunately know nothing about horses! Nevertheless, you have access to the opinions of experts , who every day make a prediction about which horse is k some going to win. Each day you can choose one of the experts whose advice you will follow, and each day, following your bet, you learn which horse actually won. How should you decide which expert to follow each day, and how should you evaluate your performance? The experts are not perfect (in fact they might not even be any good!), and so it is not reasonable to expect you to make the correct bet all of the time, or even most of the time if none of the experts do so. However, you might have a weaker goal: can you bet on horses in such a way so that you do almost as well as ? the best expert, in hindsight Formally, an online learning algorithm A operates in the following environment: Each day t = 1 , . . . , T : 1. A chooses an expert a ∈{ 1 , . . . , k } (a) t t A observes a loss ℓ (b) } ∈ [0 , 1] for each expert i ∈ { 1 , . . . , k i t . and experiences loss ℓ a t T ≤ T t ℓ ≡{ } ℓ For a sequence of losses , we write: =1 t T ∑ 1 ≤ T t ( L ℓ ) = ℓ i i T =1 t

to denote the total average loss of expert $i$ over all $T$ rounds, and write

\[ L_A(\ell^{\le T}) = \frac{1}{T} \sum_{t=1}^{T} \ell_{a_t}^t \]

to denote the total average loss of the algorithm. The regret of the algorithm is defined to be the difference between the loss that it actually incurred, and the loss of the best expert in hindsight:

\[ \mathrm{Regret}(A, \ell^{\le T}) = L_A(\ell^{\le T}) - \min_i L_i(\ell^{\le T}). \]

The goal in online learning is to design algorithms that have the guarantee that for all possible loss sequences $\ell^{\le T}$, even adversarially chosen, the regret is guaranteed to tend to zero as $T \to \infty$. In fact, this is possible using the multiplicative weights algorithm (known also by many names, e.g., the Randomized Weighted Majority Algorithm, Hedge, Exponentiated Gradient Descent, and multiplicative weights being among the most popular).

Remark 11.1. We have already seen this algorithm before in Section 4: this is just the multiplicative weights update rule in another guise! In fact, it would have been possible to derive all of the results about the private multiplicative weights mechanism directly from the regret bound we state in Theorem 11.1.

Algorithm 15 The Multiplicative Weights (or Randomized Weighted Majority (RWM)) algorithm, version 1. It takes as input a stream of losses $\ell^1, \ell^2, \ldots$ and outputs a stream of actions $a_1, a_2, \ldots$. It is parameterized by an update parameter $\eta$.

RWM($\eta$):
  For each $i \in \{1, \ldots, k\}$, let $w_i \leftarrow 1$.
  for $t = 1, \ldots$ do
    Choose action $a_t = i$ with probability proportional to $w_i$
    Observe $\ell^t$ and set $w_i \leftarrow w_i \cdot \exp(-\eta \ell_i^t)$, for each $i \in [k]$
  end for

It turns out that this simple algorithm already has a remarkable regret bound.
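To make the update rule and the regret definition concrete, here is a minimal Python sketch of the algorithm described above. The function names `rwm` and `regret`, the fixed random seed, the weight rescaling step, and the toy loss sequence are assumptions made for illustration; they are not part of the original presentation.

```python
import math
import random


def rwm(loss_stream, k, eta, seed=0):
    """Randomized Weighted Majority over k experts.

    loss_stream: iterable of length-k loss vectors with entries in [0, 1].
    Returns the list of chosen expert indices (0-based).
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    weights = [1.0] * k
    actions = []
    for losses in loss_stream:
        # Choose a_t = i with probability proportional to w_i.
        a_t = rng.choices(range(k), weights=weights, k=1)[0]
        actions.append(a_t)
        # Observe the loss vector, then update multiplicatively.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        # Rescaling leaves the sampling distribution unchanged and avoids underflow.
        top = max(weights)
        weights = [w / top for w in weights]
    return actions


def regret(actions, losses):
    """Average loss of the algorithm minus that of the best fixed expert."""
    T, k = len(losses), len(losses[0])
    alg = sum(losses[t][a] for t, a in enumerate(actions)) / T
    best = min(sum(row[i] for row in losses) / T for i in range(k))
    return alg - best


# Toy run: expert 0 never errs, the other two always do.
T, k = 2000, 3
losses = [[0.0, 1.0, 1.0] for _ in range(T)]
eta = math.sqrt(math.log(k) / T)
print("observed regret:", regret(rwm(losses, k, eta), losses))
print("bound on expected regret:", 2 * math.sqrt(math.log(k) / T))
```

The printed bound applies to the expected regret, so a single run may land on either side of it; the sketch only illustrates the quantities involved.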

Theorem 11.1. For any adversarially chosen sequence of losses of length $T$, $\ell^{\le T} = (\ell^1, \ldots, \ell^T)$, the Randomized Weighted Majority algorithm with update parameter $\eta$ has the guarantee that:

\[ \mathbb{E}[\mathrm{Regret}(\mathrm{RWM}(\eta), \ell^{\le T})] \le \eta + \frac{\ln(k)}{\eta T}, \tag{11.1} \]

where $k$ is the number of experts. Choosing $\eta = \sqrt{\frac{\ln k}{T}}$ gives:

\[ \mathbb{E}[\mathrm{Regret}(\mathrm{RWM}(\eta), \ell^{\le T})] \le 2\sqrt{\frac{\ln k}{T}}. \]

This remarkable theorem states that even faced with an adversarial sequence of losses, the Randomized Weighted Majority algorithm can do as well, on average, as the best expert among $k$ in hindsight, minus only an additional additive term that goes to zero at a rate of $O(\sqrt{\ln k / T})$. In other words, after at most $T \le 4\ln k/\alpha^2$ rounds, the regret of the randomized weighted majority algorithm is guaranteed to be at most $\alpha$! Moreover, this bound is the best possible.

Can we achieve something similar, but under the constraint of differential privacy? Before we can ask this question, we must decide what the input database is, and at what granularity we would like to protect privacy. Since the input is the collection of loss vectors $\ell^{\le T} = (\ell^1, \ldots, \ell^T)$, it is natural to view $\ell^{\le T}$ as the database, and to view a neighboring database $\hat{\ell}^{\le T}$ as one that differs in the entire loss vector in any single timestep: i.e., one in which for some fixed timestep $t$, $\hat{\ell}^i = \ell^i$ for all $i \ne t$, but in which $\ell^t$ and $\hat{\ell}^t$ can differ arbitrarily. The output of the algorithm is the sequence of actions that it chooses, $a_1, \ldots, a_T$, and it is this that we wish to be output in a differentially private manner.

Our first observation is that the randomized weighted majority algorithm chooses an action at each day $t$ in a familiar manner! We here rephrase the algorithm in an equivalent way: It chooses an action $a_t = i$ with probability proportional to $\exp(-\eta \sum_{j=1}^{t-1} \ell_i^j)$, which is simply the exponential mechanism with quality score given by the negated cumulative loss, $q(i, \ell^{\le t-1}) = -\sum_{j=1}^{t-1} \ell_i^j$, and privacy parameter $\varepsilon = 2\eta$. Because each $\ell_i^t \in [0,1]$, the quality score has sensitivity 1. Thus, each round $t$, the randomized weighted majority algorithm chooses an action $a_t$ in a way that preserves $2\eta$ differential privacy, so to achieve privacy $\varepsilon$ it suffices to set $\eta = \varepsilon/2$.

Algorithm 16 The Multiplicative Weights (or Randomized Weighted Majority (RWM)) algorithm, rephrased. It takes as input a stream of losses $\ell^1, \ell^2, \ldots$ and outputs a stream of actions $a_1, a_2, \ldots$. It is parameterized by an update parameter $\eta$.

RWM($\eta$):
  for $t = 1, \ldots$ do
    Choose action $a_t = i$ with probability proportional to $\exp(-\eta \sum_{j=1}^{t-1} \ell_i^j)$
    Observe $\ell^t$
  end for

Moreover, over the course of the run of the algorithm, it will choose an action $T$ times. If we want the entire run of the algorithm to be $(\varepsilon, \delta)$-differentially private for some $\varepsilon$ and $\delta$, we can thus simply apply our composition theorems. Recall that by Theorem 3.20, since there are $T$ steps in total, if each step of the algorithm is $(\varepsilon', 0)$-differentially private for $\varepsilon' = \varepsilon/\sqrt{8T\ln(1/\delta)}$, then the entire algorithm will be $(\varepsilon, \delta)$-differentially private. Thus, the following theorem is immediate by setting $\eta = \varepsilon'/2$:

Theorem 11.2. For a sequence of losses of length $T$, the algorithm RWM($\eta$) with $\eta = \frac{\varepsilon}{\sqrt{32 T \ln(1/\delta)}}$ is $(\varepsilon, \delta)$-differentially private.

Remarkably, we get this theorem without modifying the original randomized weighted majority algorithm at all, but rather just by setting $\eta$ appropriately. In some sense, we are getting privacy for free! We can therefore use Theorem 11.1, the utility theorem for the RWM algorithm, without modification as well:

Theorem 11.3. For any adversarially chosen sequence of losses of length $T$, $\ell^{\le T} = (\ell^1, \ldots, \ell^T)$, the Randomized Weighted Majority

algorithm with update parameter $\eta = \frac{\varepsilon}{\sqrt{32 T \ln(1/\delta)}}$ has the guarantee that:

\[ \mathbb{E}[\mathrm{Regret}(\mathrm{RWM}(\eta), \ell^{\le T})] \le \frac{\varepsilon}{\sqrt{32 T \ln(1/\delta)}} + \frac{\sqrt{32 \ln(1/\delta)}\,\ln k}{\varepsilon \sqrt{T}} \le \frac{\sqrt{128 \ln(1/\delta)}\,\ln k}{\varepsilon \sqrt{T}}, \]

where $k$ is the number of experts.

Since the per-round loss at each time step $t$ is an independently chosen random variable (over the choices of $a_t$) with values bounded in $[-1, 1]$, we can also apply a Chernoff bound to get a high probability guarantee:

Theorem 11.4. For any adversarially chosen sequence of losses of length $T$, $\ell^{\le T} = (\ell^1, \ldots, \ell^T)$, the Randomized Weighted Majority algorithm with update parameter $\eta = \frac{\varepsilon}{\sqrt{32 T \ln(1/\delta)}}$ produces a sequence of actions such that with probability at least $1 - \beta$:

\[ \mathrm{Regret}(\mathrm{RWM}(\eta), \ell^{\le T}) \le \frac{\sqrt{128 \ln(1/\delta)}\,\ln k}{\varepsilon \sqrt{T}} + \sqrt{\frac{\ln(k/\beta)}{T}} = O\left( \frac{\sqrt{\ln(1/\delta)}\,\ln(k/\beta)}{\varepsilon \sqrt{T}} \right). \]

This bound is nearly as good as the best possible bound achievable even without privacy (i.e., the RWM bound): the regret bound is larger only by a factor of $\Omega\left( \frac{\sqrt{\ln(k)\ln(1/\delta)}}{\varepsilon} \right)$. (We note that by using a different algorithm with a more careful analysis, we can remove this extra factor of $\sqrt{\ln k}$.) Since we are in fact using the same algorithm, efficiency is of course preserved as well. Here we have a powerful example in machine learning where privacy is nearly "free." Notably, just as with the non-private algorithm, our utility bound only gets better the longer we run the algorithm, while keeping the privacy guarantee the same.$^3$

3 Of course, we have to set the update parameter appropriately, just as we have to do with the non-private algorithm. This is easy when the number of rounds $T$ is known ahead of time, but can also be done adaptively when the number of rounds is not known ahead of time.
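The arithmetic behind the choice of $\eta$ can be checked numerically. The sketch below, assuming illustrative parameter values and hypothetical helper names (`private_eta`, `expected_regret_bound`, `closed_form_bound`), computes the update parameter of Theorem 11.2, confirms that composing $T$ rounds of $2\eta$-private selections at the stated rate recovers the target $\varepsilon$, and compares the bound (11.1) with the simplified form in Theorem 11.3.

```python
import math


def private_eta(eps, delta, T):
    """Update parameter from Theorem 11.2: eps / sqrt(32 T ln(1/delta))."""
    return eps / math.sqrt(32 * T * math.log(1 / delta))


def expected_regret_bound(eta, k, T):
    """Right-hand side of (11.1): eta + ln(k) / (eta T)."""
    return eta + math.log(k) / (eta * T)


def closed_form_bound(eps, delta, k, T):
    """Simplified bound stated in Theorem 11.3."""
    return math.sqrt(128 * math.log(1 / delta)) * math.log(k) / (eps * math.sqrt(T))


eps, delta, k, T = 1.0, 1e-6, 10, 10_000  # illustrative values only
eta = private_eta(eps, delta, T)
per_round_eps = 2 * eta  # each round is 2*eta-differentially private
# Inverting eps' = eps / sqrt(8 T ln(1/delta)) recovers the overall target.
total_eps = per_round_eps * math.sqrt(8 * T * math.log(1 / delta))
print("eta:", eta, "total eps:", total_eps)
print("bound (11.1):", expected_regret_bound(eta, k, T))
print("simplified bound:", closed_form_bound(eps, delta, k, T))
```

With these values the simplified bound is the looser of the two, as expected, since it absorbs the first term of (11.1) into the second.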

11.3 Empirical risk minimization

In this section, we apply the randomized weighted majority algorithm discussed in the previous section to a special case of the problem of empirical risk minimization to learn a linear function. Rather than assuming an adversarial model, we will assume that examples are drawn from some known distribution, and we wish to learn a classifier from some finite number of samples from this distribution so that our loss will be low on new samples drawn from the same distribution.

Suppose that we have a distribution $D$ over examples $x \in [-1, 1]^d$, and for each such vector $x \in [-1, 1]^d$, and for each vector $\theta \in [0, 1]^d$ with $\|\theta\|_1 = 1$, we define the loss of $\theta$ on example $x$ to be $\mathrm{Loss}(\theta, x) = \langle \theta, x \rangle$. We wish to find a vector $\theta^*$ to minimize the expected loss over examples drawn from $D$:

\[ \theta^* = \arg\min_{\theta \in [0,1]^d : \|\theta\|_1 = 1} \mathbb{E}_{x \sim D}[\langle \theta, x \rangle]. \]

This problem can be used to model the task of finding a low error linear classifier. Typically our only access to the distribution $D$ is through some collection of examples $S \subset [-1, 1]^d$ drawn i.i.d. from $D$, which serves as the input to our learning algorithm. We will here think of this sample $S$ as our private database, and will be interested in how well we can privately approximate the error of $\theta^*$ as a function of $|S|$ (the sample complexity of the learning algorithm).

Our approach will be to reduce the problem to that of learning with expert advice, and apply the private version of the randomized weighted majority algorithm as discussed in the last section:

1. The experts will be the $d$ standard basis vectors $\{e_1, \ldots, e_d\}$, where $e_i = (0, \ldots, 0, 1, 0, \ldots, 0)$ has its single 1 in coordinate $i$.

2. Given an example $x \in [-1, 1]^d$, we define a loss vector $\ell(x) \in [-1, 1]^d$ by setting $\ell(x)_i = \langle e_i, x \rangle$ for each $i \in \{1, \ldots, d\}$. In other words, we simply set $\ell(x)_i = x_i$.

3. At time $t$, we choose a loss function $\ell^t$ by sampling $x_t \sim D$ and setting $\ell^t = \ell(x_t)$.

Note that if we have a sample $S$ from $D$ of size $|S| = T$, then we can run the RWM algorithm on the sequence of losses as described above for a total of $T$ rounds. This will produce a sequence of outputs $a_1, \ldots, a_T$, and we will define our final classifier to be $\theta^T \equiv \frac{1}{T} \sum_{i=1}^{T} a_i$. (Recall that each $a_i$ is a standard basis vector $a_i \in \{e_1, \ldots, e_d\}$, and so we have $\|\theta^T\|_1 = 1$.)

We summarize the algorithm below:

Algorithm 17 An algorithm for learning linear functions. It takes as input a private database of examples $S \subset [-1, 1]^d$, $S = (x_1, \ldots, x_T)$, and privacy parameters $\varepsilon$ and $\delta$.

LinearLearner($S, \varepsilon, \delta$):
  Let $\eta \leftarrow \frac{\varepsilon}{\sqrt{32 T \ln(1/\delta)}}$
  for $t = 1$ to $T = |S|$ do
    Choose vector $a_t = e_i$ with probability proportional to $\exp(-\eta \sum_{j=1}^{t-1} \ell_i^j)$
    Let loss vector $\ell^t = (\langle e_1, x_t \rangle, \langle e_2, x_t \rangle, \ldots, \langle e_d, x_t \rangle)$.
  end for
  Output $\theta^T = \frac{1}{T} \sum_{t=1}^{T} a_t$.

We have already seen that LinearLearner is private, since it is simply an instantiation of the randomized weighted majority algorithm with the correct update parameter $\eta$:

Theorem 11.5. LinearLearner($S, \varepsilon, \delta$) is $(\varepsilon, \delta)$-differentially private.

It remains to analyze the classification accuracy of LinearLearner, which amounts to considering the regret bound of the private RWM algorithm.

Theorem 11.6. If $S$ consists of $T$ i.i.d. samples $x \sim D$, then with probability at least $1 - \beta$, LinearLearner outputs a vector $\theta^T$ such that:

\[ \mathbb{E}_{x \sim D}[\langle \theta^T, x \rangle] \le \min_{\theta^*} \mathbb{E}_{x \sim D}[\langle \theta^*, x \rangle] + O\left( \frac{\sqrt{\ln(1/\delta)}\,\ln(d/\beta)}{\varepsilon \sqrt{T}} \right), \]

where $d$ is the number of experts.
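Algorithm 17 translates almost line for line into code. The following Python sketch is one possible rendering under stated assumptions: examples are given as lists of floats in $[-1, 1]$, the random seed is fixed for reproducibility, and the cumulative losses are shifted by their minimum before exponentiation (which does not change the sampling distribution). The function name `linear_learner` and the toy data are hypothetical.

```python
import math
import random


def linear_learner(S, eps, delta, seed=0):
    """Sketch of LinearLearner(S, eps, delta); returns theta^T as a list."""
    rng = random.Random(seed)  # fixed seed is an assumption
    T, d = len(S), len(S[0])
    eta = eps / math.sqrt(32 * T * math.log(1 / delta))
    cum_loss = [0.0] * d  # running sums of l_i^j over past rounds
    counts = [0] * d      # how often each basis vector e_i was chosen
    for x in S:
        # Sample i with probability proportional to exp(-eta * cum_loss[i]).
        # Subtracting the minimum only rescales all weights by a constant.
        low = min(cum_loss)
        weights = [math.exp(-eta * (c - low)) for c in cum_loss]
        i = rng.choices(range(d), weights=weights, k=1)[0]
        counts[i] += 1
        # The loss vector is l^t = (x_1, ..., x_d), since l(x)_i = x_i.
        for j in range(d):
            cum_loss[j] += x[j]
    # theta^T is the average of the chosen basis vectors.
    return [c / T for c in counts]


# Toy data: coordinate 0 always has the lowest loss.
S = [[-1.0, 1.0, 1.0] for _ in range(2000)]
theta = linear_learner(S, eps=1.0, delta=1e-6)
print(theta, sum(theta))
```

Because every $a_t$ is a basis vector, the output always lies on the simplex, which is what the remark $\|\theta^T\|_1 = 1$ asserts.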

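Proof. By Theorem 11.4, we have the following guarantee with probability at least $1 - \beta/2$:

\[ \frac{1}{T} \sum_{t=1}^{T} \langle a_t, x_t \rangle \le \min_{i \in \{1, \ldots, d\}} \left\langle e_i, \frac{1}{T} \sum_{t=1}^{T} x_t \right\rangle + O\left( \frac{\sqrt{\ln(1/\delta)}\,\ln(d/\beta)}{\varepsilon \sqrt{T}} \right) \]
\[ = \min_{\theta^* \in [0,1]^d : \|\theta^*\|_1 = 1} \left\langle \theta^*, \frac{1}{T} \sum_{t=1}^{T} x_t \right\rangle + O\left( \frac{\sqrt{\ln(1/\delta)}\,\ln(d/\beta)}{\varepsilon \sqrt{T}} \right). \]

In the first equality, we use the fact that the minimum of a linear function over the simplex is achieved at a vertex of the simplex. Noting that each $x_t \sim D$ independently and that each $\langle x_t, e_i \rangle$ is bounded in $[-1, 1]$, we can apply Azuma's inequality twice to bound the two quantities with probability at least $1 - \beta/2$:

\[ \left| \frac{1}{T} \sum_{t=1}^{T} \langle a_t, x_t \rangle - \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{x \sim D} \langle a_t, x \rangle \right| = \left| \frac{1}{T} \sum_{t=1}^{T} \langle a_t, x_t \rangle - \mathbb{E}_{x \sim D} \langle \theta^T, x \rangle \right| \le O\left( \sqrt{\frac{\ln(1/\beta)}{T}} \right) \]

and

\[ \max_{i \in \{1, \ldots, d\}} \left| \left\langle e_i, \frac{1}{T} \sum_{t=1}^{T} x_t \right\rangle - \mathbb{E}_{x \sim D} \langle e_i, x \rangle \right| \le O\left( \sqrt{\frac{\ln(d/\beta)}{T}} \right). \]

Hence we also have:

\[ \max_{\theta^* \in [0,1]^d : \|\theta^*\|_1 = 1} \left| \left\langle \theta^*, \frac{1}{T} \sum_{t=1}^{T} x_t \right\rangle - \mathbb{E}_{x \sim D} \langle \theta^*, x \rangle \right| \le O\left( \sqrt{\frac{\ln(d/\beta)}{T}} \right). \]

Combining these inequalities gives us our final result about the output of the algorithm $\theta^T$:

\[ \mathbb{E}_{x \sim D} \langle \theta^T, x \rangle \le \min_{\theta^* \in [0,1]^d : \|\theta^*\|_1 = 1} \mathbb{E}_{x \sim D} \langle \theta^*, x \rangle + O\left( \frac{\sqrt{\ln(1/\delta)}\,\ln(d/\beta)}{\varepsilon \sqrt{T}} \right). \]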

11.4 Bibliographical notes

The PAC model of machine learning was introduced by Valiant in 1984 [83], and the SQ model was introduced by Kearns [53]. The randomized weighted majority algorithm is originally due to Littlestone and Warmuth [57], and has been studied in many forms. See Blum and Mansour [9] or Arora et al. [1] for a survey. The regret bound that we use for the randomized weighted majority algorithm is given in [1].

Machine learning was one of the first topics studied in differential privacy, beginning with the work of Blum et al. [7], who showed that algorithms that operate in the SQ-learning framework could be converted into privacy preserving algorithms. The sample complexity of differentially private learning was first considered by Kasiviswanathan, Lee, Nissim, Raskhodnikova, and Smith, "What can we Learn Privately?" [52], which characterizes the sample complexity of private learning up to polynomial factors. For more refined analysis of the sample complexity of private learning, see [3, 4, 12, 19].

There is also extensive work on efficient machine learning algorithms, including the well known frameworks of SVMs and empirical risk minimizers [13, 55, 76]. Spectral learning techniques, including PCA and low rank matrix approximation, have also been studied [7, 14, 33, 42, 43, 51].

Private learning from expert advice was first considered by Dwork et al. [26]. The fact that the randomized weighted majority algorithm is privacy preserving without modification (when the update parameter is set appropriately) is folklore (following from advanced composition [32]) and has been widely used; for example, in [48]. For a more general study of private online learning, see [50], and for a more general study of empirical risk minimization, see [50, 13].

12 Additional Models

So far, we have made some implicit assumptions about the model of private data analysis. For example, we have assumed that there is some trusted curator who has direct access to the private dataset, and we have assumed that the adversary only has access to the output of the algorithm, not to any of its internal state during its execution. But what if this is not the case? What if we trust no one to look at our data, even to perform the privacy preserving data analysis? What if some hacker might gain access to the internal state of the private algorithm while it is running? In this section, we relax some of our previously held assumptions and consider these questions.

In this section we describe some additional computational models that have received attention in the literature.

• The local model is a generalization of randomized response (see Section 2), and is motivated by situations in which individuals do not trust the curator with their data. While this lack of trust can be addressed using secure multiparty computation to simulate the role played by the trusted curator, there are also some techniques that do not require cryptography.

The next two models consider streams of events, each of which may be associated with an individual. For example, an event may be a search by a particular person on an arbitrary term. In a given event stream, the (potentially many) events associated with a given individual can be arbitrarily interleaved with events associated with other individuals.

• In pan-privacy the curator is trusted, but may be subject to compulsory non-private data release, for example, because of a subpoena, or because the entity holding the information is purchased by another, possibly less trustworthy, entity. Thus, in pan-privacy the internal state of the algorithm is also differentially private, as is the joint distribution of the internal state and the outputs.

• The continual observation model addresses the question of maintaining privacy when the goal is to continually monitor and report statistics about events, such as purchases of over-the-counter medications that might be indicative of an impending epidemic. Some work addresses pan-privacy under continual observation.

12.1 The local model

So far, we have considered a centralized model of data privacy, in which there exists a database administrator who has direct access to the private data. What if there is instead no trusted database administrator? Even if there is a suitable trusted party, there are many reasons not to want private data aggregated by some third party. The very existence of an aggregate database of private information raises the possibility that at some future time, it will come into the hands of an untrusted party, either maliciously (via data theft), or as a natural result of organizational succession. A superior model, from the perspective of the owners of private data, would be a local model, in which agents could (randomly) answer questions in a differentially private manner about their own data, without ever sharing it with anyone else.
In the context of predicate queries, this seems to severely limit the expressivity of a private mechanism's interaction with the data: The mechanism can ask each user whether or not her data satisfies a given predicate, and

the user may flip a coin, answering truthfully only with slightly higher probability than answering falsely. In this model what is possible?

The local privacy model was first introduced in the context of learning. The local privacy model formalizes randomized response: there is no central database of private data. Instead, each individual maintains possession of their own data element (a database of size 1), and answers questions about it only in a differentially private manner. Formally, the database $x \in \mathbb{N}^{|X|}$ is a collection of $n$ elements from some domain $X$, and each $x_i \in x$ is held by an individual.

Definition 12.1 (Local Randomizer). An $\varepsilon$-local randomizer $R : X \to W$ is an $\varepsilon$-differentially private algorithm that takes as input a database of size $n = 1$.

In the local privacy model, algorithms may interact with the database only through a local randomizer oracle:

Definition 12.2 (LR Oracle). An LR oracle $LR_D(\cdot, \cdot)$ takes as input an index $i \in [n]$ and an $\varepsilon$-local randomizer $R$ and outputs a random value $w \in W$ chosen according to the distribution $R(x_i)$, where $x_i \in D$ is the element held by the $i$th individual in the database.

Definition 12.3 (Local Algorithm). An algorithm is $\varepsilon$-local if it accesses the database $D$ via the oracle $LR_D$, with the following restriction: If $LR_D(i, R_1), \ldots, LR_D(i, R_k)$ are the algorithm's invocations of $LR_D$ on index $i$, where each $R_j$ is an $\varepsilon_j$-local randomizer, then $\varepsilon_1 + \cdots + \varepsilon_k \le \varepsilon$.

Because differential privacy is composable, it is easy to see that $\varepsilon$-local algorithms are $\varepsilon$-differentially private.

Observation 12.1. $\varepsilon$-local algorithms are $\varepsilon$-differentially private.

That is to say, an $\varepsilon$-local algorithm interacts with the data using only a sequence of $\varepsilon$-differentially private algorithms, each of which computes only on a database of size 1.
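As an illustration of Definitions 12.1 to 12.3, the Python sketch below uses binary randomized response as the local randomizer, wraps the database in an oracle that tracks the per-individual budget $\varepsilon_1 + \cdots + \varepsilon_k \le \varepsilon$, and de-biases the reports to estimate the fraction of 1s. The class and function names, the choice of the truthful-answer probability $e^{\varepsilon}/(1 + e^{\varepsilon})$, and the toy data are assumptions of this sketch, not constructions taken from the text.

```python
import math
import random


def local_randomizer(bit, eps, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.

    The two output probabilities differ by a factor of exactly e^eps,
    so this is an eps-local randomizer for a single bit.
    """
    p_truth = math.exp(eps) / (1 + math.exp(eps))
    return bit if rng.random() < p_truth else 1 - bit


class LROracle:
    """LR oracle that enforces a total budget eps_total per individual."""

    def __init__(self, database, eps_total, seed=0):
        self._db = list(database)
        self._eps_total = eps_total
        self._spent = [0.0] * len(self._db)
        self._rng = random.Random(seed)  # fixed seed is an assumption

    def query(self, i, eps_j):
        if self._spent[i] + eps_j > self._eps_total + 1e-12:
            raise ValueError("privacy budget for this individual is exhausted")
        self._spent[i] += eps_j
        return local_randomizer(self._db[i], eps_j, self._rng)


def estimate_fraction(reports, eps):
    """Invert E[report] = (2p - 1) f + (1 - p), where p = e^eps / (1 + e^eps)."""
    p = math.exp(eps) / (1 + math.exp(eps))
    mean = sum(reports) / len(reports)
    return (mean - (1 - p)) / (2 * p - 1)


n, eps = 20000, 1.0
db = [1] * 6000 + [0] * 14000  # true fraction of 1s is 0.3
oracle = LROracle(db, eps_total=eps)
reports = [oracle.query(i, eps) for i in range(n)]
print("estimated fraction:", estimate_fraction(reports, eps))
```

Each report carries independent noise, so the estimate of a count over $n$ individuals has error that grows like $\sqrt{n}$, in contrast to the constant error achievable with a trusted curator.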
Because nobody other than its owner ever touches any piece of private data, the local setting is far more secure: it does not require a trusted party, and there is no central party who might be subject to hacking. Because even the algorithm

never sees private data, the internal state of the algorithm is always differentially private as well (i.e., local privacy implies pan privacy, described in the next section). A natural question is how restrictive the local privacy model is. In this section, we merely informally discuss results. The interested reader can follow the bibliographic references at the end of this section for more information. We note that an alternative name for the local privacy model is the fully distributed model.

We recall the definition of the statistical query (SQ) model, introduced in Section 11. Roughly speaking, given a database $x$ of size $n$, the statistical query model allows an algorithm to access this database by making a polynomial (in $n$) number of noisy linear queries to the database, where the error in the query answers is some inverse polynomial in $n$. Formally:

Definition 12.4. A statistical query is some function $\varphi : X \times \{0, 1\} \to [0, 1]$. A statistical query oracle for a distribution over labeled examples $D$ with tolerance $\tau$ is an oracle $O_D^\tau$ such that for every statistical query $\varphi$:

\[ \left| O_D^\tau(\varphi) - \mathbb{E}_{(x,y) \sim D}[\varphi(x, y)] \right| \le \tau. \]

In other words, an SQ oracle takes as input a statistical query $\varphi$, and outputs some value that is guaranteed to be within $\pm\tau$ of the expected value of $\varphi$ on examples drawn from $D$.

Definition 12.5. An algorithm $A$ is said to SQ-learn a class of functions $C$ if for every $\alpha, \beta > 0$ there exists an $m = \mathrm{poly}(d, 1/\alpha, \log(1/\beta))$ such that $A$ makes at most $m$ queries of tolerance $\tau = 1/m$ to $O_D^\tau$, and with probability $1 - \beta$, outputs a hypothesis $f \in C$ such that:

\[ \mathrm{err}(f, D) \le \min_{f^* \in C} \mathrm{err}(f^*, D) + \alpha. \]

More generally, we can talk about an algorithm (for performing any computation) as operating in the SQ model if it accesses the data only through an SQ oracle:

Definition 12.6.
An algorithm $A$ is said to operate in the SQ model if there exists an $m$ such that $A$ makes at most $m$ queries of tolerance $\tau = 1/m$ to $O_D^\tau$, and does not have any other access to the database. $A$ is efficient if $m$ is polynomial in the size of the database, $D$.
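The interface in Definitions 12.4 and 12.6 can be mocked up for a small, explicitly known distribution. In the Python sketch below, the class `SQOracle`, the dictionary representation of $D$ as a probability mass function over labeled examples, and the use of uniform noise in $[-\tau, \tau]$ are all illustrative assumptions; a real SQ oracle may return any value within the tolerance, including adversarially chosen ones.

```python
import random


class SQOracle:
    """Toy statistical query oracle with tolerance tau = 1/m and at most m queries."""

    def __init__(self, pmf, m, seed=0):
        # pmf maps labeled examples (x, y) to their probabilities under D.
        self.pmf = pmf
        self.m = m
        self.tau = 1.0 / m
        self.queries_made = 0
        self._rng = random.Random(seed)  # fixed seed is an assumption

    def query(self, phi):
        """Return a value within tau of E_{(x,y)~D}[phi(x, y)]."""
        if self.queries_made >= self.m:
            raise RuntimeError("query limit m reached")
        self.queries_made += 1
        expectation = sum(p * phi(x, y) for (x, y), p in self.pmf.items())
        # Any perturbation of magnitude at most tau satisfies the definition.
        return expectation + self._rng.uniform(-self.tau, self.tau)


# D over (x, y) in {0,1} x {0,1}; the label agrees with x with probability 0.7.
pmf = {(0, 0): 0.35, (1, 1): 0.35, (0, 1): 0.15, (1, 0): 0.15}
oracle = SQOracle(pmf, m=10)
agreement = oracle.query(lambda x, y: 1.0 if x == y else 0.0)
print("noisy agreement rate:", agreement)
```

An algorithm operating in the SQ model would be written against `query` alone, never touching `pmf` directly.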

It turns out that up to polynomial factors in the size of the database and in the number of queries, any algorithm that can be implemented in the SQ model can be implemented and analyzed for privacy in the local privacy model, and vice versa. We note that there is a distinction between an algorithm being implemented in the SQ model, and its privacy analysis being carried out in the local model: almost all of the algorithms that we have presented in the end access the data using noisy linear queries, and so can be thought of as acting in the SQ model. However, their privacy guarantees are analyzed in the centralized model of data privacy (i.e., because of some "global" part of the analysis, as in the sparse vector algorithm).

In the following summary, we will also recall the definition of PAC learning, also introduced in Section 11:

Definition 12.7. An algorithm $A$ is said to PAC-learn a class of functions $C$ if for every $\alpha, \beta > 0$, there exists an $m = \mathrm{poly}(d, 1/\alpha, \log(1/\beta))$ such that for every distribution $D$ over labeled examples, $A$ takes as input $m$ labeled examples drawn from $D$ and outputs a hypothesis $f \in C$ such that with probability $1 - \beta$:

\[ \mathrm{err}(f, D) \le \min_{f^* \in C} \mathrm{err}(f^*, D) + \alpha. \]

If $\min_{f^* \in C} \mathrm{err}(f^*, D) = 0$, the learner is said to operate in the realizable setting (i.e., there exists some function in the class which perfectly labels the data). Otherwise, the learner is said to operate in the agnostic setting. If $A$ also has run time that is polynomial in $d$, $1/\alpha$, and $\log(1/\beta)$, then the learner is said to be efficient. If there is an algorithm which PAC-learns $C$, then $C$ is said to be PAC-learnable.

Note that the main distinction between an SQ learning algorithm and a PAC learning algorithm is that the PAC learning algorithm gets direct access to the database of examples, whereas the SQ learning algorithm only has access to the data through a noisy SQ oracle.
What follows is some of our understanding of the limitations of the SQ model and problems which separate it from the centralized model of data privacy.

1. A single sensitivity-1 query can be answered to error $O(1)$ in the centralized model of data privacy using the Laplace mechanism, but requires error $\Theta(\sqrt{n})$ in the local data privacy model.

2. The set of function classes that we can (properly) learn in the local privacy model is exactly the set of function classes that we can properly learn in the SQ model (up to polynomial factors in the database size and query complexity of the algorithm). In contrast, the set of things we can (properly or agnostically) learn in the centralized model corresponds to the set of things we can learn in the PAC model. SQ learning is strictly weaker, but this is not a huge handicap, since parity functions are essentially the only interesting class that is PAC learnable but not SQ learnable. We remark that we refer explicitly to proper learning here (meaning the setting in which there is some function in the class which perfectly labels the data). In the PAC model there is no information theoretic difference between proper and agnostic learning, but in the SQ model the difference is large: see the next point.

3. The set of queries that we can release in the local privacy model are exactly those queries that we can agnostically learn in the SQ model. In contrast, the set of things we can release in the centralized model corresponds to the set of things we can agnostically learn in the PAC model. This is a much bigger handicap: even conjunctions (i.e., marginals) are not agnostically learnable in the SQ model. This follows from the information theoretic reduction from agnostic learning (i.e., distinguishing) to query release that we saw in Section 5 using the iterative construction mechanism.

We note that if we are only concerned about computationally bounded adversaries, then in principle distributed agents can use secure multiparty computation to simulate private algorithms in the centralized setting.
While this does not actually give a differential privacy guarantee, the result of such simulations will be indistinguishable from the result of differentially private computations, from the point of view of a computationally bounded adversary. However, general secure multiparty computation protocols typically require huge amounts of message passing (and hence sometimes have unreasonably large run times),

whereas algorithms in the local privacy model tend to be extremely simple.

12.2 Pan-private streaming model

The goal of a pan-private algorithm is to remain differentially private even against an adversary that can, on rare occasions, observe the algorithm's internal state. Intrusions can occur for many reasons, including hacking, subpoena, or mission creep, when data collected for one purpose are used for a different purpose ("Think of the children!"). Pan-private streaming algorithms provide protection against all of these. Note that ordinary streaming algorithms do not necessarily provide privacy against intrusions, as even a low-memory streaming algorithm can hold a small number of data items in memory, which would be completely exposed in an intrusion. On the technical side, intrusions can be known to the curator (subpoena) or unknown (hacking). These can have very different effects, as a curator aware of an intrusion can take protective measures, such as re-randomizing certain variables.

12.2.1 Definitions

We assume a data stream of unbounded length composed of elements in a universe $X$. It may be helpful to keep in mind as motivation data analysis on a query stream, in which queries are accompanied by the IP address of the issuer. For now, we ignore the query text itself; the universe $X$ is the universe of potential IP addresses. Thus, intuitively, user-level privacy protects the presence or absence of an IP address in the stream, independent of the number of times it arises, should it actually be present at all. In contrast, event-level privacy merely protects the privacy of individual accesses. For now, we focus on user-level privacy.

As usual in differentially private algorithms, the adversary can have arbitrary control of the input stream, and may have arbitrary auxiliary knowledge obtained from other sources. It can also have arbitrary computational power.

We assume the algorithm runs until it receives a special signal, at which point it produces (observable) outputs. The algorithm may optionally continue to run and produce additional outputs later, again in response to a special signal. Since outputs are observable we do not provide privacy for the special signals.

A streaming algorithm experiences a sequence of internal states and produces a (possibly unbounded) sequence of outputs. Let $I$ denote the set of possible internal states of the algorithm, and $\sigma$ the set of possible output sequences. We assume that the adversary can only observe internal states and the output sequence; it cannot see the data in the stream (although it may have auxiliary knowledge about some of these data) and it has no access to the length of the input sequence.

Definition 12.8 ($X$-Adjacent Data Streams). We think of data streams as being of unbounded length; prefixes have finite length. Data streams $S$ and $S'$ are $X$-adjacent if they differ only in the presence or absence of any number of occurrences of a single element $u \in X$. We define $X$-adjacency for stream prefixes analogously.

User-Level Pan-Privacy. An algorithm Alg mapping data stream prefixes to the range $I \times \sigma$ is pan-private against a single intrusion if for all sets $I' \subseteq I$ of internal states and $\sigma' \subseteq \sigma$ of output sequences, and for all pairs of adjacent data stream prefixes $S, S'$:

\[ \Pr[\mathrm{Alg}(S) \in (I', \sigma')] \le e^{\varepsilon} \Pr[\mathrm{Alg}(S') \in (I', \sigma')], \]

where the probability spaces are over the coin flips of the algorithm Alg.

This definition speaks only of a single intrusion. For multiple intrusions we must consider interleavings of observations of internal states and outputs.

The relaxation to event-level privacy is obtained by modifying the notion of adjacency so that, roughly speaking, two streams are event-adjacent if they differ in a single instance of a single element in $X$; that is, one instance of one element is deleted/added.
Clearly, event-level privacy is a much weaker guarantee than user-level privacy.
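The two notions of adjacency are easy to compare on finite stream prefixes. The Python sketch below reads $X$-adjacency as: deleting every occurrence of some single element $u$ from both prefixes leaves identical sequences. That reading, the convention that identical prefixes count as adjacent, and the function names are assumptions made for illustration.

```python
def x_adjacent(s, t):
    """User-level adjacency: s and t agree once all copies of some u are removed."""
    s, t = list(s), list(t)
    if s == t:
        return True  # convention of this sketch: identical prefixes are adjacent
    for u in set(s) | set(t):
        if [a for a in s if a != u] == [b for b in t if b != u]:
            return True
    return False


def event_adjacent(s, t):
    """Event-level adjacency: one prefix is the other with a single event deleted."""
    s, t = list(s), list(t)
    if abs(len(s) - len(t)) != 1:
        return False
    longer, shorter = (s, t) if len(s) > len(t) else (t, s)
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


with_user = ["a", "b", "a", "c", "a"]
without_user = ["b", "c"]
print(x_adjacent(with_user, without_user))      # all events of user "a" removed
print(event_adjacent(with_user, without_user))  # more than one event differs
print(event_adjacent(["a", "b", "a"], ["a", "b"]))
```

Every event-adjacent pair is also $X$-adjacent under this reading, which is one way to see why the user-level guarantee is the stronger of the two.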

Remark 12.1. If we assume the existence of a very small amount of secret storage, not visible to the adversary, then many problems for which we have been unable to obtain pan-private solutions have (non-pan-) private streaming solutions. However, the amount of secret storage is not so important as its existence, since secret storage is vulnerable to the social pressures against which pan-privacy seeks to protect the data (and the curator).

Pan-Private Density Estimation. Quite surprisingly, pan-privacy can be achieved even for user-level privacy of many common streaming computations. As an example, consider the problem of density estimation: given a universe $X$ of data elements and a stream $\sigma$, the goal is to estimate the fraction of $X$ that actually appears in the stream. For example, the universe consists of all teenagers in a given community (represented by IP addresses), and the goal is to understand what fraction visit the Planned Parenthood website.

Standard low-memory streaming solutions for density estimation involve recording the results of deterministic computations of at least some input items, an approach that is inherently not pan-private. Here is a simple, albeit high-memory, solution inspired by randomized response. The algorithm maintains a bit $b_a$ for each IP address $a$ (which may appear any number of times in the stream), initialized uniformly at random. The stream is processed one element at a time. On input $a$ the algorithm flips a bit biased to 1 and stores the result in $b_a$; that is, the biased bit will take value 0 with probability $1/2 - \varepsilon$, and value 1 with probability $1/2 + \varepsilon$. The algorithm follows this procedure independent of the number of times IP address $a$ appears in the data stream. This algorithm is $(\varepsilon, 0)$-differentially private.
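The state maintained by this procedure can be sketched in a few lines of Python. The class name `PanPrivateDensity`, the fixed seed, and the toy stream are hypothetical. The de-biasing step in `raw_estimate` is derived here from the stated bias, using $\mathbb{E}[y] = |X|/2 + \varepsilon f |X|$ for a true density $f$ and a count $y$ of 1 bits, and is an assumption of this sketch rather than a formula taken from the text. The sketch also omits the noise that a pan-private algorithm would add before publishing any output.

```python
import random


class PanPrivateDensity:
    """Bit table for density estimation; output noise is deliberately omitted."""

    def __init__(self, universe, eps, seed=0):
        self._rng = random.Random(seed)  # fixed seed is an assumption
        self.eps = eps
        # One bit per element of the universe, initialised uniformly at random.
        self.bits = {a: self._rng.randrange(2) for a in universe}

    def process(self, a):
        # Redraw b_a with Pr[1] = 1/2 + eps, however often a has appeared before.
        self.bits[a] = 1 if self._rng.random() < 0.5 + self.eps else 0

    def ones(self):
        return sum(self.bits.values())

    def raw_estimate(self):
        # Solve E[ones] = n/2 + eps * f * n for the density f.
        n = len(self.bits)
        return (self.ones() - n / 2) / (self.eps * n)


universe = range(20000)
sketch = PanPrivateDensity(universe, eps=0.25)
for a in range(8000):       # 40% of the universe appears in the stream
    sketch.process(a)
    sketch.process(a)       # repeated appearances do not change the distribution
print("estimated density:", sketch.raw_estimate())
```

Since a bit that was never touched and a bit that was redrawn many times differ only in their bias, an observer of the table learns little about any single element, which is the intuition the procedure relies on.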
As with randomized response, we can estimate the fraction of "real" 1's by $z = 2(y - |\mathcal{X}|/2)/|\mathcal{X}|$, where $y$ is the actual number of 1's in the table after the stream is processed. To ensure pan-privacy, the algorithm publishes a noisy version of $z$. As with randomized response, the error will be on the order of $1/\sqrt{|\mathcal{X}|}$, yielding meaningful results when the density is high.

Other problems enjoying user-level pan-private algorithms include:
• Estimating, for any $t$, the fraction of elements appearing exactly $t$ times;

• Estimating the $t$-cropped mean: roughly, the average, over all elements, of the minimum of $t$ and the number of occurrences of the element in the data stream;
• Estimating the fraction of $k$-heavy hitters (elements of $\mathcal{X}$ that appear at least $k$ times in the data stream).

Variants of these problems can also be defined for fully dynamic data, in which counts can be decremented as well as incremented. For example, density estimation (what fraction appeared in the stream?) becomes "How many (or what fraction) of elements have a (net) count equal to zero?" These, too, can be solved with user-level pan-privacy, using differentially private variations of sketching techniques from the streaming literature.

12.3 Continual observation

Many applications of data analysis involve repeated computations, often because the entire goal is one of monitoring, for example, traffic conditions, search trends, or incidence of influenza. In such applications the system is required to continually produce outputs. We therefore need techniques for achieving differential privacy under continual observation.

As usual, differential privacy will require having essentially the same distribution on outputs for each pair of adjacent databases, but how should we define adjacency in this setting? Let us consider two example scenarios.

Suppose the goal is to monitor public health by analyzing statistics from an H1N1 self-assessment Web site.¹ Individuals can interact with the site to learn whether symptoms they are experiencing may be indicative of the H1N1 flu. The user fills in some demographic data (age, zipcode, sex), and responds to queries about his symptoms (fever over 100.4°F?, sore throat?, duration of symptoms?).
We would expect a given individual to interact very few times with the H1N1 self-assessment site (say, if we restrict our attention to a six-month period). For simplicity, let us say this is just once. In such a setting, it is sufficient to ensure event-level privacy, in which the privacy goal is to hide the presence or absence of a single event (interaction of one user with the self-assessment site).

¹ https://h1n1.cloudapp.net provided such a service during the winter of 2010; user-supplied data were stored for analysis with the user's consent.

Suppose again that the goal is to monitor public health, this time by analyzing search terms submitted to a medical search engine. Here it may no longer be safe to assume an individual has few interactions with the Web site, even if we restrict attention to a relatively short period of time. In this case we would want user-level privacy, ensuring that the entire set of a user's search terms is protected simultaneously.

We think of continual observation algorithms as taking steps at discrete time intervals; at each step the algorithm receives an input, computes, and produces output. We model the data as arriving in a stream, at most one data element in each time interval. To capture the fact that, in real life, there are periods of time in which nothing happens, null events are modeled by a special symbol in the data stream. Thus, the intuitive notion of "$t$ time periods" corresponds to processing a sequence of $t$ elements in the stream.

For example, the motivation behind the counter primitive below is to count the number of times that something has occurred since the algorithm was started (the counter is very general; we don't specify a priori what it is counting). This is modeled by an input stream over $\{0, 1\}$. Here, "0" means "nothing happened," "1" means the event of interest occurred, and for $t = 1, 2, \ldots, T$ the algorithm outputs an approximation to the number of 1s seen in the length $t$ prefix of the stream.

There are three natural options:
1. Use randomized response for each time period and add this randomized value to the counter;
2. Add noise distributed according to $\mathrm{Lap}(1/\varepsilon)$ to the true value for each time step and add this perturbed value to the counter;
3. Compute the true count at each time step, add noise distributed according to $\mathrm{Lap}(T/\varepsilon)$ to the count, and release this noisy count.

All of these options result in noise on the order of at least $\Omega(\sqrt{T}/\varepsilon)$. The hope is to do much better by exploiting structure of the query set.
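To see where the $\sqrt{T}$ comes from in the second option, a short calculation suffices. The symbols $Z_i$ (the per-step noise), $c_t$ (the true count after $t$ steps) and $\tilde{c}_t$ (the released count) are introduced here only for illustration; the calculation uses nothing beyond the fact that a $\mathrm{Lap}(b)$ variable has variance $2b^2$.

```latex
\tilde{c}_t \;=\; \sum_{i=1}^{t} (x_i + Z_i),
\qquad Z_i \sim \mathrm{Lap}(1/\varepsilon) \ \text{independently},
\\[4pt]
\mathrm{Var}\bigl(\tilde{c}_t - c_t\bigr)
  \;=\; \sum_{i=1}^{t} \mathrm{Var}(Z_i)
  \;=\; \frac{2t}{\varepsilon^{2}},
\qquad\text{so at } t = T \text{ the standard deviation is }
\frac{\sqrt{2T}}{\varepsilon} \;=\; \Theta\!\bigl(\sqrt{T}/\varepsilon\bigr).
```

For the third option a single $\mathrm{Lap}(T/\varepsilon)$ draw already has standard deviation $\sqrt{2}\,T/\varepsilon$, which is larger still.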

Let $\mathcal{X}$ be the universe of possible input symbols. Let $S$ and $S'$ be stream prefixes (i.e., finite streams) of symbols drawn from $\mathcal{X}$. Then $\mathrm{Adj}(S, S')$ ("$S$ is adjacent to $S'$") if and only if there exist $a, b \in \mathcal{X}$ so that if we change some of the instances of $a$ in $S$ to instances of $b$, then we get $S'$. More formally, $\mathrm{Adj}(S, S')$ iff $\exists a, b \in \mathcal{X}$ and $\exists R \subseteq [|S|]$, such that $S|_{R:a \to b} = S'$. Here, $R$ is a set of indices in the stream prefix $S$, and $S|_{R:a \to b}$ is the result of replacing all the occurrences of $a$ at these indices with $b$. Note that adjacent prefixes are always of the same length.

To capture event-level privacy, we restrict the definition of adjacency to the case $|R| \le 1$. To capture user-level privacy we do not constrain the size of $R$ in the definition of adjacency.

As noted above, one option is to publish a noisy count at each time step; the count published at time $t$ reflects the approximate number of 1s in the length $t$ prefix of the stream. The privacy challenge is that early items in the stream are subject to nearly $T$ statistics, so for $(\varepsilon, 0)$-differential privacy we would be adding noise scaled to $T/\varepsilon$, which is unacceptable. In addition, since the 1s are the "interesting" elements of the stream, we would like that the distortion be scaled to the number of 1s seen in the stream, rather than to the length of the stream. This rules out applying randomized response to each item in the stream independently.

The algorithm below follows a classical approach for converting static algorithms to dynamic algorithms. Assume $T$ is a power of 2. The intervals are the natural ones corresponding to the labels on a complete binary tree with $T$ leaves, where the leaves are labeled, from left to right, with the intervals $[0, 0], [1, 1], \ldots, [T-1, T-1]$ and each parent is labeled with the interval that is the union of the intervals labeling its children.
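A counter built on these tree intervals can be sketched as follows. This is an offline rendering under stated assumptions, not the algorithm of Figure 12.1 itself: the function names are ours, the whole stream is given up front, one noisy count is stored per tree interval with noise $\mathrm{Lap}((1 + \log_2 T)/\varepsilon)$, and the answer for a prefix $[0, t]$ is the sum of the noisy counts of the few disjoint tree intervals covering it. The cover is read off from the binary representation of $t + 1$ (an indexing choice of the sketch).

```python
import random


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Lap(scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dyadic_cover(t, T):
    """Disjoint tree intervals whose union is [0, t]; T is a power of 2, 0 <= t < T."""
    cover, start, size = [], 0, T
    while start <= t:
        # Shrink the block until [start, start + size - 1] fits inside [0, t].
        while start + size - 1 > t:
            size //= 2
        cover.append((start, start + size - 1))
        start += size
    return cover


def event_level_counter(stream, eps, rng=None):
    """Noisy prefix counts: entry t approximates the number of 1s in positions 0..t."""
    rng = rng or random.Random()
    T = len(stream)
    if T == 0 or T & (T - 1):
        raise ValueError("this sketch assumes the stream length is a power of 2")
    levels = T.bit_length()          # equals 1 + log2(T): intervals containing any position
    scale = levels / eps
    prefix = [0]
    for x in stream:
        prefix.append(prefix[-1] + x)
    noisy = {}
    size = 1
    while size <= T:                 # one noisy count per tree interval
        for s in range(0, T, size):
            noisy[(s, s + size - 1)] = prefix[s + size] - prefix[s] + laplace(scale, rng)
        size *= 2
    return [sum(noisy[iv] for iv in dyadic_cover(t, T)) for t in range(T)]
```

For example, `dyadic_cover(5, 8)` returns `[(0, 3), (4, 5)]`, so the estimate at time 5 sums only two noisy values rather than six.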
The idea is to compute and release a noisy count for each label $[s, t]$; that is, the released value corresponding to the label $[s, t]$ is a noisy count of the number of 1s in positions $s, s+1, \ldots, t$ of the input stream. To learn the approximate cumulative count at time $t \in [0, T-1]$ the analyst uses the binary representation of $t$ to determine a set of at most $\log_2 T$ disjoint intervals whose union is $[0, t]$, and computes the sum of the corresponding released noisy counts.² See Figure 12.1.

Figure 12.1: Event-level private counter algorithm (not pan-private).

Each stream position $t \in [0, T-1]$ appears in at most $1 + \log_2 T$ intervals (because the height of the tree is $\log_2 T$), and so each element in the stream affects at most $1 + \log_2 T$ released noisy counts. Thus, adding noise to each interval count distributed according to $\mathrm{Lap}((1 + \log_2 T)/\varepsilon)$ ensures $(\varepsilon, 0)$-differential privacy. As for accuracy, since the binary representation of any index $t \in [0, T-1]$ yields a disjoint set of at most $\log_2 T$ intervals whose union is $[0, t]$ we can apply Lemma 12.2 below to conclude that the expected error is tightly concentrated around $(\log_2 T)^{3/2}$. The maximum expected error, over all times $t$, is on the order of $(\log_2 T)^{5/3}$.

² This algorithm can be optimized slightly (for example, we never use the count corresponding to the root, eliminating one level from the tree), and it can be modified to handle the case in which $T$ is not a power of 2 and, more interestingly, when $T$ is not known a priori.

Lemma 12.2. Let $Y_1, \ldots, Y_k$ be independent variables with distribution $\mathrm{Lap}(b_i)$. Let $Y = \sum_i Y_i$ and $b_{\max} = \max_i b_i$. Let $\nu \ge \sqrt{\sum_i b_i^2}$, and $0 < \lambda < \frac{2\sqrt{2}\,\nu^2}{b_{\max}}$. Then
\[ \Pr[Y > \lambda] \le \exp\left(-\frac{\lambda^2}{8\nu^2}\right). \]

Proof. The moment generating function of $Y_i$ is $E[\exp(hY_i)] = 1/(1 - h^2 b_i^2)$, where $|h| < 1/b_i$. Using the inequality $(1 - x)^{-1} \le 1 + 2x \le \exp(2x)$ for $0 \le x < 1/2$, we have $E[\exp(hY_i)] \le \exp(2h^2 b_i^2)$, if $|h| < 1/(\sqrt{2}\, b_i)$. We now calculate, for $0 < h < 1/(\sqrt{2}\, b_{\max})$:
\[
\begin{aligned}
\Pr[Y > \lambda] &= \Pr[\exp(hY) > \exp(h\lambda)] \\
&\le \exp(-h\lambda)\, E[\exp(hY)] \\
&= \exp(-h\lambda) \prod_i E[\exp(hY_i)] \\
&\le \exp(-h\lambda + 2h^2\nu^2).
\end{aligned}
\]
By assumption, $0 < \lambda < \frac{2\sqrt{2}\,\nu^2}{b_{\max}}$. We complete the proof by setting $h = \lambda/(4\nu^2) < 1/(\sqrt{2}\, b_{\max})$.

Corollary 12.3. Let $Y, \nu, \{b_i\}_i, b_{\max}$ be as in Lemma 12.2. For $\delta \in (0, 1)$ and $\nu > \max\left\{\sqrt{\sum_i b_i^2},\ b_{\max}\sqrt{\ln(2/\delta)}\right\}$, we have that $\Pr\left[|Y| > \nu\sqrt{8\ln(2/\delta)}\right] \le \delta$.

In our case, all the $b_i$'s are the same (e.g., $b = (\log_2 T)/\varepsilon$). Taking $\nu = \sqrt{k}\, b$ we have the following corollary:

Corollary 12.4. For all $\lambda < \alpha(\sqrt{k}\, b) < 2\sqrt{2}\, k b = 2\sqrt{2}\sqrt{k}\,\nu$,
\[ \Pr[Y > \lambda] \le e^{-\alpha^2/8}. \]

Note that we have taken the unusual step of adding noise to the count before counting, rather than after. In terms of the outputs it makes no difference (addition is commutative). However, it has an interesting effect on the algorithm's internal states: they are differentially private! That is, suppose the intrusion occurs at time $t$, and consider any $i \in [0, t]$. Since there are at most $\log_2 T$ intervals containing step $i$ (in the algorithm we abolished the interval corresponding to the root), $x_i$ affects at most $\log_2 T$ of the noisy counts, and so $x_i$ is protected against the intrusion for exactly the same reason that it is protected in the algorithm's outputs. Nevertheless, the algorithm in Figure 12.1 is not pan-private even against a single intrusion. This is because, while its internal state and its outputs are each independently differentially private, the joint distribution does not ensure $\varepsilon$-differential privacy. To

see why this is so, consider an intruder that sees the internal state at time $t$ and knows the entire data stream except $x_{t+1}$, and let $I = [a, b]$ be an interval containing both $t$ and $t + 1$. Since the adversary knows $x_{[0,t]}$, it can subtract from $c_I$ the contribution from the stream occurring up through time $t$ (that is, it subtracts off from the observed $c_I$ at time $t$ the values $x_a, x_{a+1}, \ldots, x_t$, all of which it knows). From this the intruder learns the value of the Laplace draw to which $c_I$ was initialized. When $c_I$ is published at the end of step $b$, the adversary subtracts from the published value this initial draw, together with the contributions of all elements in $x_{[a,b]}$ except $x_{t+1}$, which it does not know. What remains is the unknown $x_{t+1}$.

12.3.1 Pan-private counting

Although the algorithm in Figure 12.1 is easily modified to ensure event-level pan-privacy against a single intrusion, we give a different algorithm here in order to introduce a powerful bijection technique which has proved useful in other applications. This algorithm maintains in its internal state a single noisy counter, or accumulator, as well as noise values for each interval. The output at any given time period $t$ is the sum of the accumulator and the noise values for the intervals containing $t$. When an interval $I$ ends, its associated noise value, $\eta_I$, is erased from memory.

Theorem 12.5. The counter algorithm of Figure 12.2, when run with parameters $T, \varepsilon$, and suffering at most one intrusion, yields an $(\varepsilon, 0)$-pan-private counter that, with probability at least $1 - \beta$, has maximum error, over its $T$ outputs, of $O(\log(1/\beta) \cdot \log^{2.5} T/\varepsilon)$. We note also that in every round individually (rather than in all rounds simultaneously), with all but $\beta$ probability, the error has magnitude at most $O(\log(1/\beta) \cdot \log^{1.5} T/\varepsilon)$.

Proof. The proof of accuracy is the same as that for the algorithm in Figure 12.1, relying on Corollary 12.4. We focus here on the proof of pan-privacy.

During an intrusion between atomic steps $t^*$ and $t^* + 1$, that is, immediately following the processing of element $t^*$ in the input stream

(recall that we begin numbering the elements with 0), the view of the adversary consists of (1) the noisy cumulative count (in the variable "count"), (2) the interval noise values $\eta_S$ in memory when the intrusion occurs, and (3) the complete sequence of all of the algorithm's outputs in rounds $0, 1, \ldots, t$. Consider adjacent databases $x$ and $x'$, which differ in time $t$, say, without loss of generality, $x_t = 1$ and $x'_t = 0$, and an intrusion immediately following time period $t^* \ge t$ (we will discuss the case $t^* < t$ below). We will describe a bijection between the vector of noise values used in executions on $x$ and executions on $x'$, such that corresponding noise values induce identical adversary views on $x$ and $x'$, and the probabilities of adjacent noise values differ only by an $e^{\varepsilon}$ multiplicative factor. This implies $\varepsilon$-differential pan-privacy.

Figure 12.2: Event-level pan-private counter algorithm.

By assumption, the true count just after the time period $t^* \ge t$ is larger when the input is $x$ than it is when the input is $x'$. Fix an arbitrary execution $E_x$ when the input stream is $x$. This amounts to fixing the randomness of the algorithm, which in turn fixes the noise values generated. We will describe the corresponding execution $E_{x'}$ by describing how its noise values differ from those in $E_x$.

The program variable Counter was initialized with Laplace noise. By increasing this noise by 1 in $E_{x'}$ the value of Counter just after step $t^*$ is identical in $E_{x'}$ and $E_x$. The noise variables in memory immediately following period $t^*$ are independent of the input; these will be

unchanged in $E_{x'}$. We will make the sequence of outputs in $E_{x'}$ identical to those in $E_x$ by changing a collection of $\log T$ interval noise values $\eta_S$ that are not in memory when the adversary intrudes, so that the sum of all noise values in all rounds up through $t - 1$ is unchanged, but the sum from round $t$ on is larger by 1 for database $x'$ than for $x$. Since we increased the initialization noise for Counter, we now need to decrease the sum of interval noise values for periods $0, \ldots, t - 1$ by 1, and leave unchanged the sum of interval noise values from period $t$.

To do this, we find a collection of disjoint intervals whose union is $\{0, \ldots, t - 1\}$. There is always such a collection, and it is always of size at most $\log T$. We can construct it iteratively by, for $i$ decreasing from $\lfloor \log(t - 1) \rfloor$ to 0, choosing the interval of size $2^i$ that is contained in $\{0, \ldots, t - 1\}$ and is not contained in a previously chosen interval (if such an interval exists). Given this set of disjoint intervals, we notice also that they all end by time $t - 1 < t \le t^*$, and so their noises are not in memory when the adversary intrudes (just following period $t^*$). In total (taking into account also changing the initial noise value for Counter), the complete view seen by the adversary is identical and the probabilities of the (collection of) noise values used for $x$ and $x'$ differ by at most an $e^{\varepsilon}$ multiplicative factor.

Note that we assumed $t^* \ge t$. If $t^* < t$ then the initial noise added to Counter in $E_{x'}$ will be the same as in $E_x$, and we need to add 1 to the sum of interval noises in every time period from $t$ through $T$ (the sum of interval noises before time $t$ remains unchanged). This is done as above, by finding a disjoint collection of at most $\log T$ intervals that exactly covers $\{t, \ldots, T - 1\}$.
The noise values for these intervals are not yet in memory when the intrusion occurs in time $t^* < t$, and the proof follows similarly.

12.3.2 A logarithmic (in $T$) lower bound

Given the upper bound of Theorem 12.5, where the error depends only poly-logarithmically on $T$, it is natural to ask whether any dependence is inherent. In this section we show that a logarithmic dependence on $T$ is indeed inherent.

Theorem 12.6. Any differentially private event-level algorithm for counting over $T$ rounds must have error $\Omega(\log T)$ (even with $\varepsilon = 1$).

Proof. Let $\varepsilon = 1$. Suppose for the sake of contradiction that there exists a differentially private event-level counter for streams of length $T$ that guarantees that with probability at least $2/3$, its count at all time periods is accurate up to a maximum error of $(\log_2 T)/4$. Let $k = (\log_2 T)/4$. We construct a set $S$ of $T/k$ inputs as follows. Divide the $T$ time periods into $T/k$ consecutive phases, each of length $k$ (except, possibly, the last one). For $i = 1, \ldots, T/k$, the $i$-th input $x^i \in S$ has 0 input bits everywhere except during the $i$th phase. That is, $x^i = 0^{k \cdot i} \circ 1^k \circ 0^{k \cdot ((T/k) - (i+1))}$.

For $1 \le i \le T/k$, we say an output matches $i$ if just before the $i$th phase the output is less than $k/2$ and at the end of the $i$th phase the output is at least $k/2$. By accuracy, on input $x^i$ the output should match $i$ with probability at least $2/3$. Since $x^i$ and $x^j$ differ in only $2k$ positions, $\varepsilon$-differential privacy means that for every $i, j \in [T/k]$ such that $i \ne j$, the output on input $x^i$ should match $j$ with probability at least
\[ e^{-2\varepsilon \cdot k} = e^{-\varepsilon \log(T^{1/2})} = e^{-\log(T^{1/2})} = 1/\sqrt{T}. \]
This is a contradiction, because the events that the output matches $j$ are disjoint for different $j$, and yet the sum of their probabilities on input $x^i$ exceeds 1.

12.4 Average case error for query release

In Sections 4 and 5, we considered various mechanisms for solving the private query release problem, where we were interested in worst case error. That is, given a class of queries $Q$, of size $|Q| = k$, we wished to recover a vector of answers $\hat{a} \in \mathbb{R}^k$ such that for each query $f_i \in Q$, $|f_i(x) - \hat{a}_i| \le \alpha$ for some worst-case error rate $\alpha$. In other words, if we let $a \in \mathbb{R}^k$ denote the vector of true answers, with $a_i \equiv f_i(x)$, then we require a bound of the form: $\|a - \hat{a}\|_\infty \le \alpha$. In this section, we consider a weakened utility guarantee, on the $\ell_2$ (rather than $\ell_\infty$) error: a bound of the form $\|a - \hat{a}\|_2 \le \alpha$. A bound of this form does not guarantee

that we have low error for every query, but it does guarantee that on average, we have small error.

Although this sort of bound is weaker than worst-case error, the mechanism is particularly simple, and it makes use of an elegant geometric view of the query release problem that we have not seen until now.

Recall that we can view the database $x$ as a vector $x \in \mathbb{N}^{|\mathcal{X}|}$ with $\|x\|_1 = n$. We can similarly also view the queries $f_i \in Q$ as vectors $f_i \in \mathbb{N}^{|\mathcal{X}|}$, such that $f_i(x) = \langle f_i, x \rangle$. It will therefore be helpful to view our class of queries $Q$ as a matrix $A \in \mathbb{R}^{k \times |\mathcal{X}|}$, with the $i$th row of $A$ being the vector $f_i$. We can then see that our answer vector $a \in \mathbb{R}^k$ is, in matrix notation:
\[ A \cdot x = a. \]

Let's consider the domain and range of $A$ when viewed as a linear map. Let $B_1 = \{x \in \mathbb{R}^{|\mathcal{X}|} : \|x\|_1 \le 1\}$ denote the unit $\ell_1$ ball in $|\mathcal{X}|$-dimensional space. Observe that $x \in nB_1$, since $\|x\|_1 = n$. We will refer to $nB_1$ as "Database Space." Write $K = AB_1$. Note similarly that for all $x \in nB_1$, $a = A \cdot x \in nK$. We will refer to $nK$ as "answer space." We make a couple of observations about $K$: Note that because $B_1$ is centrally symmetric, so is $K$; that is, $K = -K$. Note also that $K \subset \mathbb{R}^k$ is a convex polytope with vertices $\pm A^1, \ldots, \pm A^{|\mathcal{X}|}$ equal to the columns of $A$, together with their negations.

The following algorithm is extremely simple: it simply answers every query independently with the Laplace mechanism, and then projects back into answer space. In other words, it adds independent Laplace noise to every query, which as we have seen, by itself leads to distortion that is linear in $k$ (or at least $\sqrt{k}$, if we relax to $(\varepsilon, \delta)$-differential privacy). However, the resulting vector $\tilde{a}$ of answers is likely not consistent with any database $y \in nB_1$ in database space. Therefore, rather than returning $\tilde{a}$, it instead returns some consistent answer vector $\hat{a} \in nK$ that is as close to $\tilde{a}$ as possible.
As we will see, this projection step improves the accuracy of the mechanism, while having no effect on privacy (since it is just post-processing!).

We first observe that Project is differentially private.

Theorem 12.7. For any $A \in [0, 1]^{k \times |\mathcal{X}|}$, $\mathrm{Project}(x, A, \varepsilon, \delta)$ preserves $(\varepsilon, \delta)$-differential privacy.

Algorithm 18 The $K$-Projected Laplace Mechanism. It takes as input a matrix $A \in [0, 1]^{k \times |\mathcal{X}|}$, a database $x \in nB_1$, and privacy parameters $\varepsilon$ and $\delta$.
Project($x, A, \varepsilon, \delta$):
  Let $a = A \cdot x$.
  For each $i \in [k]$, sample $\nu_i \sim \mathrm{Lap}(\sqrt{8k \ln(1/\delta)}/\varepsilon)$, and let $\tilde{a} = a + \nu$.
  Output $\hat{a} = \arg\min_{\hat{a} \in nK} \|\hat{a} - \tilde{a}\|_2^2$.

Proof. We simply note that $\tilde{a}$ is the output of the Laplace mechanism on $k$ sensitivity 1 queries, which is $(\varepsilon, \delta)$-differentially private by Theorems 3.6 and 3.20. Finally, since $\hat{a}$ is derived from $\tilde{a}$ without any further access to the private data, the release of $\hat{a}$ is differentially private by the post-processing guarantee of differential privacy, Proposition 2.1.

Theorem 12.8. For any class of linear queries $A$ and database $x$, let $a = A \cdot x$ denote the true answer vector. Let $\hat{a}$ denote the output of the mechanism Project: $\hat{a} = \mathrm{Project}(x, A, \varepsilon)$. With probability at least $1 - \beta$:
\[ \|a - \hat{a}\|_2^2 \le \frac{kn\sqrt{192 \ln(1/\delta)}\, \ln(2|\mathcal{X}|/\beta)}{\varepsilon}. \]

To prove this theorem, we will introduce a couple of simple concepts from convex geometry. For a convex body $K \subset \mathbb{R}^k$, its polar body is $K^\circ$, defined to be $K^\circ = \{y \in \mathbb{R}^k : \langle y, x \rangle \le 1 \text{ for all } x \in K\}$. The Minkowski Norm defined by a convex body $K$ is
\[ \|x\|_K \equiv \min\{r \in \mathbb{R} \text{ such that } x \in rK\}. \]
The dual norm of $\|x\|_K$ is the Minkowski norm induced by the polar body of $K$, i.e., $\|x\|_{K^\circ}$. This norm also has the following form:
\[ \|x\|_{K^\circ} = \max_{y \in K} \langle x, y \rangle. \]
The key fact we will use is Hölder's Inequality, which is satisfied by all centrally symmetric convex bodies $K$:
\[ |\langle x, y \rangle| \le \|x\|_K \|y\|_{K^\circ}. \]
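Algorithm 18 can be sketched in a few lines once a projection routine is available. The sketch below is one possible rendering under explicit assumptions, not the authors' implementation: it uses Frank-Wolfe iterations with exact line search to approximate the $\ell_2$ projection onto $nK$ (the text does not prescribe a projection method), it treats $n = \|x\|_1$ as public, and all function names are ours. The routine also returns the weights $y$ with $\|y\|_1 \le n$ and $A y$ equal to the projected point, which witnesses consistency with a point of database space.

```python
import math
import random


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Lap(scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def project_onto_nK(target, A, n, iters=2000):
    """Approximate the l2 projection of `target` onto nK = {A y : ||y||_1 <= n}.

    Frank-Wolfe with exact line search (an assumption of this sketch).
    Returns (point, y) with point = A y and ||y||_1 <= n.
    """
    k, m = len(A), len(A[0])
    cols = [[A[i][j] for i in range(k)] for j in range(m)]
    y = [0.0] * m
    point = [0.0] * k                      # the origin always lies in nK
    for _ in range(iters):
        grad = [p - t for p, t in zip(point, target)]
        dots = [sum(g * c for g, c in zip(grad, col)) for col in cols]
        # A linear function over nK is minimised at a vertex +/- n * A^j.
        j = max(range(m), key=lambda idx: abs(dots[idx]))
        sign = -1.0 if dots[j] > 0 else 1.0
        vertex = [sign * n * c for c in cols[j]]
        direction = [v - p for v, p in zip(vertex, point)]
        denom = sum(d * d for d in direction)
        if denom == 0.0:
            break
        step = min(1.0, -sum(g * d for g, d in zip(grad, direction)) / denom)
        if step <= 0.0:
            break                          # no descent direction remains
        point = [p + step * d for p, d in zip(point, direction)]
        y = [(1.0 - step) * yi for yi in y]
        y[j] += step * sign * n
    return point, y


def projected_laplace(x, A, eps, delta, rng=None):
    """Noisy answers a + nu, projected back into answer space nK."""
    rng = rng or random.Random()
    k = len(A)
    n = sum(x)                             # assumed public in this sketch
    a = [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(k)]
    scale = math.sqrt(8.0 * k * math.log(1.0 / delta)) / eps
    noisy = [ai + laplace(scale, rng) for ai in a]
    return project_onto_nK(noisy, A, n)
```

Only the line computing `noisy` touches the private data; everything in `project_onto_nK` is post-processing and could equally be run by the analyst.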

Proof of Theorem 12.8. The proof will proceed in two steps. First we will show that $\|a - \hat{a}\|_2^2 \le 2\langle \hat{a} - a, \tilde{a} - a \rangle$, and then we will use Hölder's inequality to bound this second quantity.

Lemma 12.9.
\[ \|a - \hat{a}\|_2^2 \le 2\langle \hat{a} - a, \tilde{a} - a \rangle \]

Proof. We calculate:
\[
\begin{aligned}
\|\hat{a} - a\|_2^2 &= \langle \hat{a} - a, \hat{a} - a \rangle \\
&= \langle \hat{a} - a, \tilde{a} - a \rangle + \langle \hat{a} - a, \hat{a} - \tilde{a} \rangle \\
&\le 2\langle \hat{a} - a, \tilde{a} - a \rangle.
\end{aligned}
\]
The inequality follows from calculating:
\[
\begin{aligned}
\langle \hat{a} - a, \tilde{a} - a \rangle &= \|\tilde{a} - a\|_2^2 + \langle \hat{a} - \tilde{a}, \tilde{a} - a \rangle \\
&\ge \|\hat{a} - \tilde{a}\|_2^2 + \langle \hat{a} - \tilde{a}, \tilde{a} - a \rangle \\
&= \langle \hat{a} - \tilde{a}, \hat{a} - a \rangle,
\end{aligned}
\]
where the inequality in the second line follows because by choice of $\hat{a}$, for all $a' \in nK$: $\|\tilde{a} - \hat{a}\|_2^2 \le \|\tilde{a} - a'\|_2^2$ (applied here with $a' = a$).

We can now complete the proof. Recall that by definition, $\tilde{a} - a = \nu$, the vector of i.i.d. Laplace noise added by the Laplace mechanism. By Lemma 12.9 and Hölder's inequality, we have:
\[ \|a - \hat{a}\|_2^2 \le 2\langle \hat{a} - a, \nu \rangle \le 2\|\hat{a} - a\|_K \|\nu\|_{K^\circ}. \]
We bound these two terms separately. Since by definition $\hat{a}, a \in nK$, we have $\max(\|\hat{a}\|_K, \|a\|_K) \le n$, and so by the triangle inequality, $\|\hat{a} - a\|_K \le 2n$.

Next, observe that since $\|\nu\|_{K^\circ} = \max_{y \in K} \langle y, \nu \rangle$, and since the maximum of a linear function taken over a polytope is attained at a vertex, we have: $\|\nu\|_{K^\circ} = \max_{i \in [|\mathcal{X}|]} |\langle A^i, \nu \rangle|$.

Because each $A^i \in \mathbb{R}^k$ is such that $\|A^i\|_\infty \le 1$, and recalling that for any scalar $q$, if $Z \sim \mathrm{Lap}(b)$, then $qZ \sim \mathrm{Lap}(qb)$, we can apply Lemma

12.2 to bound the weighted sums of Laplace random variables $\langle A^i, \nu \rangle$. Doing so, we have that with probability at least $1 - \beta$:
\[ \max_{i \in [|\mathcal{X}|]} |\langle A^i, \nu \rangle| \le \frac{8k\sqrt{\ln(1/\delta)}\, \ln(|\mathcal{X}|/\beta)}{\varepsilon}. \]

Combining all of the above bounds, we get that with probability $1 - \beta$:
\[ \|a - \hat{a}\|_2^2 \le \frac{16nk\sqrt{\ln(1/\delta)}\, \ln(|\mathcal{X}|/\beta)}{\varepsilon}. \]

Let's interpret this bound. Observe that $\|a - \hat{a}\|_2^2 = \sum_{i=1}^{k} (a_i - \hat{a}_i)^2$, and so this is a bound on the sum of squared errors over all queries. Hence, the average per-query squared error of this mechanism is only:
\[ \frac{1}{k} \sum_{i=1}^{k} (a_i - \hat{a}_i)^2 \le \frac{16n\sqrt{\ln(1/\delta)}\, \ln(|\mathcal{X}|/\beta)}{\varepsilon}. \]

In contrast, the private multiplicative weights mechanism guarantees that $\max_{i \in [k]} |a_i - \hat{a}_i| \le \tilde{O}(\sqrt{n} \log^{1/4} |\mathcal{X}| / \varepsilon^{1/2})$, and so matches the average squared error guarantee of the projected Laplace mechanism, with a bound of $\tilde{O}(n\sqrt{\log |\mathcal{X}|}/\varepsilon)$. However, the multiplicative weights mechanism (and especially its privacy analysis) is much more complex than the Projected Laplace mechanism! In particular, the private part of the $K$-Projected Laplace mechanism is simply the Laplace mechanism itself, and requires no coordination between queries. Interestingly (and, it turns out, necessarily), coordination occurs in the projection phase. Since projection is post-processing, it incurs no further privacy loss; indeed, it can be carried out (online, if necessary) by the data analyst himself.

12.5 Bibliographical notes

The local model of data privacy has its roots in randomized response, as first proposed by Warner in 1965 [84]. The local model was formalized by Kasiviswanathan et al. [52] in the context of learning, who proved that private learning in the local model is equivalent to non-private

learning in the statistical query (SQ) model. The set of queries which can be released in the local model was shown to be exactly equal to the set of queries that can be agnostically learned in the SQ model by Gupta et al. [38].

Pan-Privacy was introduced by Dwork et al. [27], and further explored by Mir et al. [62]. The pan-private density estimation, as well as a low-memory variant using hashing, appear in [27].

Privacy under continual observation was introduced by Dwork et al. [26]; our algorithm for counting under continual observation is from that paper, as is the lower bound on error. Similar algorithms were given by Chan et al. [11]. The proof of the concentration of measure inequality for the sums of Laplace random variables given in Lemma 12.2 is from [11].

The Projected Laplace mechanism for achieving low average error was given by Nikolov et al. [66], who also give instance optimal algorithms for the (average error) query release problem for any class of queries. This work extends a line of work on the connections between differential privacy and geometry started by Hardt and Talwar [45], and extended by Bhaskara et al. [5] and Dwork et al. [30].

Dwork, Naor, and Vadhan proved an exponential gap between the number of queries that can be answered (with non-trivial error) by stateless and stateful differentially private mechanisms [29]. The lesson learned, that coordination is essential for accurately and privately answering very large numbers of queries, seems to rule out the independent noise addition in the Projected Laplace mechanism. The statefulness of that algorithm appears in the projection step, resolving the paradox.

13 Reflections

13.1 Toward practicing privacy

Differential Privacy was designed with internet-scale data sets in mind. Reconstruction attacks along the lines of those in Section 8 can be carried out by a polynomial time bounded adversary asking only $O(n)$ queries on databases of size $n$. When $n$ is on the order of hundreds of millions, and each query requires a linear amount of computation, such an attack is unrealistic, even though the queries can be parallelized. This observation led to the earliest steps toward differential privacy: If the adversary is restricted to a sublinear number of counting queries, then $o(\sqrt{n})$ noise per query (less than the sampling error!) is sufficient for preserving privacy (Corollary 3.21).

To what extent can differential privacy be brought to bear on smaller data sets, or even targeted attacks that isolate a small subset of a much larger database, without destroying statistical utility? First, an analysis may require a number of queries that begins to look something like the size of this smaller set. Second, letting $n$ now denote the size of the smaller set or small database, and letting $k$ be the number of queries, fractional errors on the order of $\sqrt{k}/n$ are harder to ignore when $n$ is small. Third, the $\sqrt{\ln(1/\delta)}/\varepsilon$ factor in the advanced

composition theorem becomes significant. Keeping in mind the reconstruction attacks when noise is $o(\sqrt{n})$, there appears to be little room to maneuver for arbitrary sets of $k \approx n$ low-sensitivity queries.

There are several promising lines of research for addressing these concerns.

The Query Errors Don't Tell the Whole Story. As an example of this phenomenon, consider the problem of linear regression. The input is a collection of labeled data points of the form $(x, y)$, where $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$, for arbitrary dimension $d$. The goal is to find $\theta \in \mathbb{R}^d$ that "predicts" $y$ "as well as possible," given $x$, under the assumption that the relationship is linear. If the goal is simply to "explain" the given data set, differential privacy may well introduce unacceptable error. Certainly the specific algorithm that simply computes
\[ \operatorname{argmin}_\theta \sum_{i=1}^{n} |\theta \cdot x_i - y_i|^2 \]
and adds appropriately scaled Laplace noise independently to each coordinate of $\theta$ may produce a $\tilde{\theta}$ that differs substantially from $\theta$. But if the goal is to learn a predictor that will do well for future, unseen inputs $(x, y)$ then a slightly different computation is used to avoid overfitting, and the (possibly large) difference between the private and non-private coefficient vectors does not translate into a gap in classification error! A similar phenomenon has been observed in model fitting.

Less Can Be More. Many analyses ask for more than they actually use. Exploitation of this principle is at the heart of Report Noisy Max, where for the accuracy "price" of one measurement we learn one of the largest of many measurements. By asking for "less" (that is, not requiring that all noisy measurements be released, but rather only asking for the largest one), we obtain "more" (better accuracy). A familiar principle in privacy is to minimize collection and reporting. Here we see this play out in the realm of what must be revealed, rather than what must be used in the computation.

Quit When You are NOT Ahead. This is the philosophy behind Propose-Test-Release, in which we test in a privacy-preserving way

that small noise is sufficient for a particular intended computation on the given data set.

Algorithms with Data-Dependent Accuracy Bounds. This can be viewed as a generalization of Quit When You are Not Ahead. Algorithms with data-dependent accuracy bounds can deliver excellent results on "good" data sets, as in Propose-Test-Release, and the accuracy can degrade gradually as the "goodness" decreases, an improvement over Propose-Test-Release.

Exploit "Nice" Query Sets. When (potentially large) sets of linear queries are presented as a batch it is possible, by analyzing the geometry of the query matrix, to obtain higher quality answers than would be obtained were the queries answered independently.¹

¹ More accurately, the analysis is of the object $K = AB_1^k$, where $A$ is the query matrix and $B_1^k$ is the $k$-dimensional $L_1$ ball; note that $K$ is the feasible region in answer space when the database has one element.

Further Relaxation of Differential Privacy. We have seen that $(\varepsilon, \delta)$-differential privacy is a meaningful relaxation of differential privacy that can provide substantially improved accuracy bounds. Moreover, such a relaxation can be essential to these improvements. For example, Propose-Test-Release algorithms can only offer $(\varepsilon, \delta)$-differential privacy for $\delta > 0$. What about other, but still meaningful, relaxations of differential privacy? Concentrated Differential Privacy is such a relaxation that is incomparable to $(\varepsilon, \delta)$-differential privacy and that permits better accuracy. Roughly speaking, it ensures that large privacy loss happens with very small probability; for example, for all $k$ the probability of privacy loss $k\varepsilon$ falls exponentially in $k^2$. In contrast, $(\varepsilon, \delta)$-differential privacy is consistent with having infinite privacy loss with probability $\delta$; on the other hand, privacy loss $2\varepsilon$ can happen in concentrated differential privacy with constant probability, while in $(\varepsilon, \delta)$-differential privacy it will only occur with probability bounded by $\delta$, which we typically take to be cryptographically small.

Why might we feel comfortable with this relaxation? The answer lies in behavior under composition. As an individual's data participate

in many databases and many different computations, perhaps the real worry is the combined threat of multiple exposures. This is captured by privacy under composition. Concentrated differential privacy permits better accuracy while yielding the same behavior under composition as $(\varepsilon, 0)$ (and $(\varepsilon, \delta)$) differential privacy.

Differential privacy also faces a number of cultural challenges. One of the most significant is non-algorithmic thinking. Differential privacy is a property of an algorithm. However, many people who work with data describe their interactions with the data in fundamentally non-algorithmic terms, such as, "First, I look at the data." Similarly, data cleaning is often described in non-algorithmic terms. If data are reasonably plentiful, and the analysts are energetic, then the "Raw Data" application of the Subsample and Aggregate methodology described in Example 7.3 suggests a path toward enabling non-algorithmic interactions by trusted analysts who will follow directions. In general, it seems plausible that on high-dimensional and on internet-scale data sets non-algorithmic interactions will be the exception.

What about $\varepsilon$? In Example 3.7 we applied Theorem 3.20 to conclude that to bound the cumulative lifetime privacy loss at $\varepsilon = 1$ with probability $1 - e^{-32}$, over participation in 10,000 databases, it is sufficient that each database be $(1/801, 0)$-differentially private. While $k = 10{,}000$ may be an overestimate, the dependence on $k$ is fairly weak ($\sqrt{k}$), and in the worst case these bounds are tight, ruling out a more relaxed bound than $\varepsilon_0 = 1/801$ for each database over the lifetime of the database. This is simply too strict a requirement in practice.

Perhaps we can ask a different question: Fix $\varepsilon$, say, $\varepsilon = 1$ or $\varepsilon = 1/10$; now ask: How can multiple $\varepsilon$'s be apportioned? Permitting $\varepsilon$ privacy loss per query is too weak, and $\varepsilon$ loss over the lifetime of the database is too strong.
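The arithmetic behind the 1/801 figure from Example 3.7 can be checked numerically. The sketch below assumes the advanced composition bound of Theorem 3.20 in the form $\varepsilon' = \sqrt{2k\ln(1/\delta')}\,\varepsilon + k\varepsilon(e^{\varepsilon} - 1)$ for $k$-fold adaptive composition of $(\varepsilon, 0)$-differentially private mechanisms, holding except with probability $\delta'$. The function name is hypothetical and chosen only for illustration.

```python
import math


def advanced_composition_epsilon(eps_per_db: float, k: int, delta_prime: float) -> float:
    """Cumulative epsilon after k-fold composition of (eps_per_db, 0)-DP mechanisms.

    Assumed form of the bound (Theorem 3.20):
        eps' = sqrt(2 k ln(1/delta')) * eps + k * eps * (e^eps - 1),
    valid except with probability delta'.
    """
    if eps_per_db <= 0 or k <= 0 or not (0 < delta_prime < 1):
        raise ValueError("need eps_per_db > 0, k > 0 and 0 < delta_prime < 1")
    leading = math.sqrt(2 * k * math.log(1 / delta_prime)) * eps_per_db
    lower_order = k * eps_per_db * math.expm1(eps_per_db)
    return leading + lower_order


# Parameters from the text: 10,000 databases, failure probability e^{-32}.
k = 10_000
delta_prime = math.exp(-32)
leading_factor = math.sqrt(2 * k * math.log(1 / delta_prime))  # sqrt(640000)
total = advanced_composition_epsilon(1 / 801, k, delta_prime)

print(f"sqrt(2 k ln(1/delta')) = {leading_factor:.1f}")
print(f"cumulative epsilon     = {total:.4f}")
```

The leading factor is $\sqrt{2 \cdot 10^4 \cdot 32} = 800$, which is where the per-database budget of roughly $1/800$ comes from; under the assumed form of the bound, the lower-order term adds a little on top, so the cumulative loss comes out approximately, rather than exactly, 1.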
Something in between, say, $\varepsilon$ per study or $\varepsilon$ per researcher, may make sense, although this raises the questions of who is a "researcher" and what constitutes a "study." This affords substantially more protection against accidental and intentional privacy compromise than do current practices, from enclaves to confidentiality contracts.

A different proposal is less prescriptive. This proposal draws from second-generation regulatory approaches to reducing environmental

degradation, in particular pollution release registries such as the Toxic Release Inventory that have been found to encourage better practices through transparency. Perhaps a similar effect could arise with private data analysis: an Epsilon Registry describing data uses, granularity of privacy protection, a "burn rate" of privacy loss per unit time, and a cap on total privacy loss permitted before data are retired, when accompanied with a financial penalty for infinite (or very large) loss, can lead to innovation and competition, deploying the talents and resources of a larger set of researchers and privacy professionals in the search for differentially private algorithms.

13.2 The differential privacy lens

An online etymological dictionary describes the original 18th century meaning of the word "statistics" as "science dealing with data about the condition of a state or community." This resonates with differential privacy in the breach: if the presence or absence of the data of a small number of individuals changes the outcome of an analysis then in some sense the outcome is "about" these few individuals, and is not describing the condition of the community as a whole. Put differently, stability to small perturbations in the data is both the hallmark of differential privacy and the essence of a common conception of the term "statistical." Differential privacy is enabled by stability (Section 7) and ensures stability (by definition). In some sense it forces all queries to be statistical in nature. As stability is also increasingly understood to be a key necessary and sufficient condition for learnability, we observe a tantalizing moral equivalence between learnability, differential privacy, and stability.

With this in mind, it is not surprising that differential privacy is also a means to ends other than privacy, and indeed we saw this with game theory in Section 10.
The power of differential privacy comes from its amenability to composition. Just as composition allows us to build complex differentially private algorithms from smaller differentially private building blocks, it provides a programming language for constructing stable algorithms for complex analytical tasks. Consider, for example, the problem of eliciting a set of bidder values, and using them to price

a collection of goods that are for sale. Informally, Walrasian equilibrium prices are prices such that every individual can simultaneously purchase their favorite bundle of goods given the prices, while ensuring that demand exactly equals the supply of each good. It would seem at first blush, then, that simply computing these prices, and assigning each person their favorite bundle of goods given the prices, would yield a mechanism in which agents were incentivized to tell the truth about their valuation function, since how could any agent do better than receiving their favorite bundle of goods? However, this argument fails, because in a Walrasian equilibrium agents receive their favorite bundle of goods given the prices, but the prices are computed as a function of the reported valuations, so an industrious but dishonest agent could potentially gain by manipulating the computed prices. However, this problem is solved (and an approximately truthful mechanism results) if the equilibrium prices are computed using a differentially private algorithm, precisely because individual agents have almost no effect on the distribution of prices computed. Note that this application is made possible by the use of the tools of differential privacy, but is completely orthogonal to privacy concerns. More generally, this connection is more fundamental: computing equilibria of various sorts using algorithms that have the stability property guaranteed by differential privacy leads to approximately truthful mechanisms implementing these equilibrium outcomes.

Differential privacy also helps in ensuring generalizability in adaptive data analysis. Adaptivity means that the questions asked and hypotheses tested depend on outcomes of earlier questions. Generalizability means that the outcome of a computation or a test on the data set is close to the ground truth of the distribution from which the data are sampled.
It is known that the naive paradigm of answering queries with the exact empirical values on a fixed data set fails to generalize even under a limited amount of adaptive questioning. Remarkably, answering with differential privacy not only ensures privacy, but with high probability it ensures generalizability even for exponentially many adaptively chosen queries. Thus, the deliberate introduction of noise using the techniques of differential privacy has profound and promising implications for the validity of traditional scientific inquiry.

Appendices

A The Gaussian Mechanism

Let $f : \mathbb{N}^{|\mathcal{X}|} \to \mathbb{R}^d$ be an arbitrary $d$-dimensional function, and define its $\ell_2$ sensitivity to be $\Delta_2 f = \max_{\text{adjacent } x, y} \|f(x) - f(y)\|_2$. The Gaussian Mechanism with parameter $\sigma$ adds noise scaled to $\mathcal{N}(0, \sigma^2)$ to each of the $d$ components of the output.

Theorem A.1. Let $\varepsilon \in (0, 1)$ be arbitrary. For $c^2 > 2\ln(1.25/\delta)$, the Gaussian Mechanism with parameter $\sigma \ge c\,\Delta_2 f/\varepsilon$ is $(\varepsilon, \delta)$-differentially private.

Proof. There is a database $D$ and a query $f$, and the mechanism will return $f(D) + \eta$, where the noise is normally distributed. We are adding noise $\mathcal{N}(0, \sigma^2)$. For now, assume we are talking about real-valued functions, so $\Delta f = \Delta_1 f = \Delta_2 f$.

We are looking at

$$\left| \ln \frac{e^{(-1/2\sigma^2)\,x^2}}{e^{(-1/2\sigma^2)\,(x + \Delta f)^2}} \right|. \tag{A.1}$$

We are investigating the probability, given that the database is $D$, of observing an output that occurs with a very different probability

under $D$ than under an adjacent database $D'$, where the probability space is the noise generation algorithm. The numerator in the ratio above describes the probability of seeing $f(D) + x$ when the database is $D$; the denominator corresponds to the probability of seeing this same value when the database is $D'$. This is a ratio of probabilities, so it is always positive, but the logarithm of the ratio may be negative. Our random variable of interest, the privacy loss, is

$$\ln \frac{e^{(-1/2\sigma^2)\,x^2}}{e^{(-1/2\sigma^2)\,(x + \Delta f)^2}}$$

and we are looking at its absolute value.

$$\begin{aligned}
\left| \ln \frac{e^{(-1/2\sigma^2)\,x^2}}{e^{(-1/2\sigma^2)\,(x + \Delta f)^2}} \right|
&= \left| \ln e^{(-1/2\sigma^2)\,[x^2 - (x + \Delta f)^2]} \right| \\
&= \left| -\frac{1}{2\sigma^2}\left[x^2 - (x^2 + 2x\Delta f + \Delta f^2)\right] \right| \\
&= \left| \frac{1}{2\sigma^2}\left(2x\Delta f + (\Delta f)^2\right) \right|. \qquad \text{(A.2)}
\end{aligned}$$

This quantity is bounded by $\varepsilon$ whenever $x < \sigma^2\varepsilon/\Delta f - \Delta f/2$. To ensure privacy loss bounded by $\varepsilon$ with probability at least $1 - \delta$, we require

$$\Pr\left[|x| \ge \sigma^2\varepsilon/\Delta f - \Delta f/2\right] < \delta,$$

and because we are concerned with $|x|$ we will find $\sigma$ such that

$$\Pr\left[x \ge \sigma^2\varepsilon/\Delta f - \Delta f/2\right] < \delta/2.$$

We will assume throughout that $\varepsilon \le 1 \le \Delta f$. We will use the tail bound

$$\Pr[x > t] \le \frac{\sigma}{\sqrt{2\pi}}\,\frac{1}{t}\,e^{-t^2/2\sigma^2}.$$

We require:

$$\begin{aligned}
& \frac{\sigma}{\sqrt{2\pi}}\,\frac{1}{t}\,e^{-t^2/2\sigma^2} < \delta/2 \\
\Leftrightarrow\ & \sigma\,\frac{1}{t}\,e^{-t^2/2\sigma^2} < \sqrt{2\pi}\,\delta/2 \\
\Leftrightarrow\ & \frac{t}{\sigma}\,e^{t^2/2\sigma^2} > 2/(\sqrt{2\pi}\,\delta) \\
\Leftrightarrow\ & \ln(t/\sigma) + t^2/2\sigma^2 > \ln\left(2/(\sqrt{2\pi}\,\delta)\right).
\end{aligned}$$

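Taking $t = \sigma^2\varepsilon/\Delta f - \Delta f/2$, we get

$$\ln\left((\sigma^2\varepsilon/\Delta f - \Delta f/2)/\sigma\right) + (\sigma^2\varepsilon/\Delta f - \Delta f/2)^2/2\sigma^2 > \ln\left(2/(\sqrt{2\pi}\,\delta)\right) = \ln\left(\sqrt{\frac{2}{\pi}}\,\frac{1}{\delta}\right).$$

Let us write $\sigma = c\,\Delta f/\varepsilon$; we wish to bound $c$. We begin by finding the conditions under which the first term is non-negative.

$$\begin{aligned}
\frac{1}{\sigma}\left(\sigma^2\frac{\varepsilon}{\Delta f} - \frac{\Delta f}{2}\right)
&= \frac{1}{\sigma}\left[\left(c^2\frac{(\Delta f)^2}{\varepsilon^2}\right)\frac{\varepsilon}{\Delta f} - \frac{\Delta f}{2}\right] \\
&= \frac{1}{\sigma}\left[c^2\left(\frac{\Delta f}{\varepsilon}\right) - \frac{\Delta f}{2}\right] \\
&= \frac{\varepsilon}{c\,\Delta f}\left[c^2\left(\frac{\Delta f}{\varepsilon}\right) - \frac{\Delta f}{2}\right] \\
&= c - \frac{\varepsilon}{2c}.
\end{aligned}$$

Since $\varepsilon \le 1$ and $c \ge 1$, we have $c - \varepsilon/(2c) \ge c - 1/2$. So $\ln\left(\frac{1}{\sigma}\left(\sigma^2\frac{\varepsilon}{\Delta f} - \frac{\Delta f}{2}\right)\right) > 0$ provided $c \ge 3/2$. We can therefore focus on the $t^2/\sigma^2$ term.

$$\begin{aligned}
\frac{1}{2\sigma^2}\left(\sigma^2\frac{\varepsilon}{\Delta f} - \frac{\Delta f}{2}\right)^2
&= \frac{1}{2\sigma^2}\left[\Delta f\left(\frac{c^2}{\varepsilon} - \frac{1}{2}\right)\right]^2 \\
&= \frac{1}{2}\left[(\Delta f)^2\left(\frac{c^2}{\varepsilon} - \frac{1}{2}\right)^2\right]\left[\frac{\varepsilon^2}{c^2(\Delta f)^2}\right] \\
&= \frac{1}{2}\left(\frac{c^2}{\varepsilon} - \frac{1}{2}\right)^2\frac{\varepsilon^2}{c^2} \\
&= \frac{1}{2}\left(c^2 - \varepsilon + \varepsilon^2/4c^2\right).
\end{aligned}$$

Since $\varepsilon \le 1$ the derivative of $(c^2 - \varepsilon + \varepsilon^2/4c^2)$ with respect to $c$ is positive in the range we are considering ($c \ge 3/2$), so $c^2 - \varepsilon + \varepsilon^2/4c^2 \ge c^2 - 8/9$ and it suffices to ensure

$$c^2 - 8/9 > 2\ln\left(\sqrt{\frac{2}{\pi}}\,\frac{1}{\delta}\right).$$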

In other words, we need that

$$c^2 > 2\ln\left(\sqrt{2/\pi}\right) + 2\ln(1/\delta) + \ln(e^{8/9}) = \ln(2/\pi) + \ln(e^{8/9}) + 2\ln(1/\delta),$$

which, since $(2/\pi)\,e^{8/9} < 1.55$, is satisfied whenever $c^2 > 2\ln(1.25/\delta)$.

Let us partition $\mathbb{R}$ as $\mathbb{R} = R_1 \cup R_2$, where $R_1 = \{x \in \mathbb{R} : |x| \le c\,\Delta f/\varepsilon\}$ and $R_2 = \{x \in \mathbb{R} : |x| > c\,\Delta f/\varepsilon\}$. Fix any subset $S \subseteq \mathbb{R}$, and define

$$S_1 = \{f(x) + x \mid x \in R_1\}, \qquad S_2 = \{f(x) + x \mid x \in R_2\}.$$

(In the expression $f(x) + x$, the argument of $f$ is the database and the added term is the noise.) We have

$$\begin{aligned}
\Pr_{x \sim \mathcal{N}(0, \sigma^2)}[f(x) + x \in S]
&= \Pr_{x \sim \mathcal{N}(0, \sigma^2)}[f(x) + x \in S_1] + \Pr_{x \sim \mathcal{N}(0, \sigma^2)}[f(x) + x \in S_2] \\
&\le \Pr_{x \sim \mathcal{N}(0, \sigma^2)}[f(x) + x \in S_1] + \delta \\
&\le e^{\varepsilon}\left(\Pr_{x \sim \mathcal{N}(0, \sigma^2)}[f(y) + x \in S_1]\right) + \delta,
\end{aligned}$$

yielding $(\varepsilon, \delta)$-differential privacy for the Gaussian mechanism in one dimension.

High Dimension. To extend this to functions in $\mathbb{R}^m$, define $\Delta f = \Delta_2 f$. We can now repeat the argument, using Euclidean norms. Let $v$ be any vector satisfying $\|v\| \le \Delta f$. For a fixed pair of databases $x, y$ we are interested in $v = f(x) - f(y)$, since this is what our noise must obscure. As in the one-dimensional case, we seek conditions on $\sigma$ under which the privacy loss

$$\left| \ln \frac{e^{(-1/2\sigma^2)\,\|x - \mu\|^2}}{e^{(-1/2\sigma^2)\,\|x + v - \mu\|^2}} \right|$$

is bounded by $\varepsilon$; here $x$ is chosen from $\mathcal{N}(0, \Sigma)$, where $\Sigma$ is a diagonal matrix with entries $\sigma^2$, whence $\mu = (0, \ldots, 0)$.

$$\begin{aligned}
\left| \ln \frac{e^{(-1/2\sigma^2)\,\|x - \mu\|^2}}{e^{(-1/2\sigma^2)\,\|x + v - \mu\|^2}} \right|
&= \left| \ln e^{(-1/2\sigma^2)\,[\|x - \mu\|^2 - \|x + v - \mu\|^2]} \right| \\
&= \left| \frac{1}{2\sigma^2}\left(\|x\|^2 - \|x + v\|^2\right) \right|.
\end{aligned}$$

We will use the fact that the distribution of a spherically symmetric normal is independent of the orthogonal basis from which its constituent normals are drawn, so we may work in a basis that is aligned with $v$. Fix such a basis $b_1, \ldots, b_m$, and draw $x$ by first drawing signed lengths $\lambda_i \sim \mathcal{N}(0, \sigma^2)$, for $i \in [m]$, then defining $x^{[i]} = \lambda_i b_i$, and finally letting $x = \sum_{i=1}^m x^{[i]}$. Assume without loss of generality that $b_1$ is parallel to $v$.

We are interested in $\left|\,\|x\|^2 - \|x + v\|^2\,\right|$.

Consider the right triangle with base $v + x^{[1]}$ and edge $\sum_{i=2}^m x^{[i]}$ orthogonal to $v$. The hypotenuse of this triangle is $x + v$.

$$\begin{aligned}
\|x + v\|^2 &= \|v + x^{[1]}\|^2 + \sum_{i=2}^m \|x^{[i]}\|^2 \\
\|x\|^2 &= \sum_{i=1}^m \|x^{[i]}\|^2.
\end{aligned}$$

Since $v$ is parallel to $x^{[1]}$ we have $\|v + x^{[1]}\|^2 = (\|v\| + \lambda_1)^2$. Thus, $\|x + v\|^2 - \|x\|^2 = \|v\|^2 + 2\lambda_1 \cdot \|v\|$. Recall that $\|v\| \le \Delta f$, and $\lambda_1 \sim \mathcal{N}(0, \sigma^2)$, so we are now exactly back in the one-dimensional case, writing $\lambda_1$ instead of $x$ in Equation (A.2):

$$\left| \frac{1}{2\sigma^2}\left(\|x\|^2 - \|x + v\|^2\right) \right| \le \left| \frac{1}{2\sigma^2}\left(2\lambda_1\Delta f + (\Delta f)^2\right) \right|$$

and the rest of the argument proceeds as above.

The argument for the high-dimensional case highlights a weakness of $(\varepsilon, \delta)$-differential privacy that does not exist for $(\varepsilon, 0)$-differential privacy. Fix a database $x$. In the $(\varepsilon, 0)$-case, the guarantee of indistinguishability holds for all adjacent databases simultaneously. In the

$(\varepsilon, \delta)$ case indistinguishability only holds "prospectively," i.e., for any fixed $y$ adjacent to $x$, the probability that the mechanism will allow the adversary to distinguish $x$ from $y$ is small. In the proof above, this is manifested by the fact that we fixed $v = f(x) - f(y)$; we did not have to argue about all possible directions of $v$ simultaneously, and indeed we cannot, as once we have fixed our noise vector $x \sim \mathcal{N}(0, \Sigma)$, so that the output on $x$ is $o = f(x) + x$, there may exist an adjacent $y$ such that output $o = f(x) + x$ is much more likely when the database is $y$ than it is on $x$.

A.1 Bibliographic notes

Theorem A.1 is folklore initially observed by the authors of [23]. A generalization to non-spherical Gaussian noise appears in [66].
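The mechanism of Theorem A.1 can be sketched in a few lines. This is a minimal illustration rather than a hardened implementation: the function names are hypothetical, the sensitivity $\Delta_2 f$ is assumed to be supplied correctly by the caller, and floating-point sampling issues of the kind discussed in [63] are ignored.

```python
import math
import random
from typing import List, Optional, Sequence


def gaussian_sigma(l2_sensitivity: float, epsilon: float, delta: float) -> float:
    """Noise scale sigma = c * Delta_2 f / epsilon with c^2 just above 2 ln(1.25/delta)."""
    if not (0 < epsilon < 1):
        raise ValueError("Theorem A.1 is stated for epsilon in (0, 1)")
    if not (0 < delta < 1):
        raise ValueError("delta must lie in (0, 1)")
    if l2_sensitivity <= 0:
        raise ValueError("l2_sensitivity must be positive")
    # The theorem requires the strict inequality c^2 > 2 ln(1.25/delta),
    # so c is nudged slightly above the boundary value.
    c = math.sqrt(2 * math.log(1.25 / delta)) * (1 + 1e-9)
    return c * l2_sensitivity / epsilon


def gaussian_mechanism(
    true_answer: Sequence[float],
    l2_sensitivity: float,
    epsilon: float,
    delta: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Add independent N(0, sigma^2) noise to each coordinate of f(D)."""
    rng = rng or random.Random()
    sigma = gaussian_sigma(l2_sensitivity, epsilon, delta)
    return [a + rng.gauss(0.0, sigma) for a in true_answer]


# Illustrative call: a 3-dimensional answer with assumed L2 sensitivity 1.
noisy = gaussian_mechanism([10.0, 20.0, 30.0], l2_sensitivity=1.0,
                           epsilon=0.5, delta=1e-5, rng=random.Random(0))
print(noisy)
```

Note that the same scalar $\sigma$ is used for every coordinate, matching the spherical noise $\mathcal{N}(0, \Sigma)$ in the high-dimensional argument above.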

B Composition Theorems for $(\varepsilon, \delta)$-DP

B.1 Extension of Theorem 3.16

Theorem B.1. Let $T_1(D) : D \mapsto T_1(D) \in \mathcal{C}_1$ be an $(\varepsilon, \delta)$-d.p. function, and for any $s_1 \in \mathcal{C}_1$, $T_2(D, s_1) : (D, s_1) \mapsto T_2(D, s_1) \in \mathcal{C}_2$ be an $(\varepsilon, \delta)$-d.p. function given the second input $s_1$. Then we show that for any neighboring $D, D'$, for any $S \subseteq \mathcal{C}_2 \times \mathcal{C}_1$, we have, using the notation in our paper

$$P((T_2, T_1) \in S) \le e^{2\varepsilon} P'((T_2, T_1) \in S) + 2\delta. \tag{B.1}$$

Proof. For any $C_1 \subseteq \mathcal{C}_1$, define

$$\mu(C_1) = \left(P(T_1 \in C_1) - e^{\varepsilon} P'(T_1 \in C_1)\right)_+,$$

then $\mu$ is a measure on $\mathcal{C}_1$ and $\mu(\mathcal{C}_1) \le \delta$ since $T_1$ is $(\varepsilon, \delta)$-d.p. As a result, we have for all $s_1 \in \mathcal{C}_1$,

$$P(T_1 \in ds_1) \le e^{\varepsilon} P'(T_1 \in ds_1) + \mu(ds_1). \tag{B.2}$$

Also note that by the definition of $(\varepsilon, \delta)$-d.p., for any $s_1 \in \mathcal{C}_1$,

$$\begin{aligned}
P((T_2, s_1) \in S) &\le \left(e^{\varepsilon} P'((T_2, s_1) \in S) + \delta\right) \wedge 1 \\
&\le \left(e^{\varepsilon} P'((T_2, s_1) \in S)\right) \wedge 1 + \delta. \qquad \text{(B.3)}
\end{aligned}$$

Then (B.2) and (B.3) give (B.1):

$$\begin{aligned}
P((T_2, T_1) \in S) &\le \int_{S_1} P((T_2, s_1) \in S)\, P(T_1 \in ds_1) \\
&\le \int_{S_1} \left(\left(e^{\varepsilon} P'((T_2, s_1) \in S)\right) \wedge 1 + \delta\right) P(T_1 \in ds_1) \\
&\le \int_{S_1} \left(\left(e^{\varepsilon} P'((T_2, s_1) \in S)\right) \wedge 1\right) P(T_1 \in ds_1) + \delta \\
&\le \int_{S_1} \left(\left(e^{\varepsilon} P'((T_2, s_1) \in S)\right) \wedge 1\right) \times \left(e^{\varepsilon} P'(T_1 \in ds_1) + \mu(ds_1)\right) + \delta \\
&\le e^{2\varepsilon} \int_{S_1} P'((T_2, s_1) \in S)\, P'(T_1 \in ds_1) + \mu(S_1) + \delta \\
&\le e^{2\varepsilon} P'((T_2, T_1) \in S) + 2\delta. \qquad \text{(B.4)}
\end{aligned}$$

In the equations above, $S_1$ denotes the projection of $S$ onto $\mathcal{C}_1$. The event $\{(T_2, s_1) \in S\}$ refers to $\{(T_2(D, s_1), s_1) \in S\}$ (or $\{(T_2(D', s_1), s_1) \in S\}$).

Using induction, we have:

Corollary B.2 (general composition theorem for $(\varepsilon, \delta)$-d.p. algorithms). Let $T_1 : D \mapsto T_1(D)$ be $(\varepsilon, \delta)$-d.p., and for $k \ge 2$, $T_k : (D, s_1, \ldots, s_{k-1}) \mapsto T_k(D, s_1, \ldots, s_{k-1}) \in \mathcal{C}_k$ be $(\varepsilon, \delta)$-d.p., for all given $(s_{k-1}, \ldots, s_1) \in \bigotimes_{j=1}^{k-1} \mathcal{C}_j$. Then for all neighboring $D, D'$ and all $S \subseteq \bigotimes_{j=1}^{k} \mathcal{C}_j$

$$P((T_1, \ldots, T_k) \in S) \le e^{k\varepsilon} P'((T_1, \ldots, T_k) \in S) + k\delta.$$
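The bookkeeping implied by Corollary B.2 is simple enough to state as code. The helper below is a hypothetical illustration (its name does not come from the text): it returns the $(k\varepsilon, k\delta)$ guarantee for $k$ adaptively chosen mechanisms that are each $(\varepsilon, \delta)$-d.p.

```python
from typing import Tuple


def compose_basic(epsilon: float, delta: float, k: int) -> Tuple[float, float]:
    """Guarantee after k-fold adaptive composition, per Corollary B.2.

    Each of the k mechanisms is assumed to be (epsilon, delta)-d.p. for every
    fixed value of the earlier outputs; the composition is then
    (k * epsilon, k * delta)-d.p.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if epsilon < 0 or delta < 0:
        raise ValueError("epsilon and delta must be non-negative")
    return k * epsilon, k * delta


# Theorem B.1 is the case k = 2: the guarantee becomes (2 * epsilon, 2 * delta).
eps_total, delta_total = compose_basic(epsilon=0.1, delta=1e-6, k=2)
print(eps_total, delta_total)
```

Both parameters grow linearly in $k$ here, in contrast with the $\sqrt{k}$ dependence that Theorem 3.20 gives for the leading term.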

Acknowledgments

We would like to thank many people for providing careful comments and corrections on early drafts of this book, including Vitaly Feldman, Justin Hsu, Simson Garfinkel, Katrina Ligett, Dong Lin, David Parkes, Ryan Rogers, Guy Rothblum, Ian Schmutte, Jon Ullman, Salil Vadhan, Zhiwei Steven Wu, and the anonymous referees. This book was used in a course taught by Salil Vadhan and Jon Ullman, whose students also provided careful feedback. This book has also benefited from conversations with many other colleagues, including Moritz Hardt, Ilya Mironov, Sasho Nikolov, Kobbi Nissim, Mallesh Pai, Benjamin Pierce, Adam Smith, Abhradeep Thakurta, Abhishek Bhowmick, Kunal Talwar, and Li Zhang. We are grateful to Madhu Sudan for proposing this monograph.

References

[1] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: A meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
[2] M.-F. Balcan, A. Blum, J. D. Hartline, and Y. Mansour. Mechanism design via machine learning. In Foundations of Computer Science, 2005. FOCS 2005. 46th Annual IEEE Symposium on, pages 605–614. IEEE, 2005.
[3] A. Beimel, S. P. Kasiviswanathan, and K. Nissim. Bounds on the sample complexity for private learning and private data release. In Theory of Cryptography, pages 437–454. Springer, 2010.
[4] A. Beimel, K. Nissim, and U. Stemmer. Characterizing the sample complexity of private learners. In Proceedings of the Conference on Innovations in Theoretical Computer Science, pages 97–110. Association for Computing Machinery, 2013.
[5] A. Bhaskara, D. Dadush, R. Krishnaswamy, and K. Talwar. Unconditional differentially private mechanisms for linear queries. In H. J. Karloff and T. Pitassi, editors, Proceedings of the Symposium on Theory of Computing Conference, Symposium on Theory of Computing, New York, NY, USA, May 19–22, 2012, pages 1269–1284. 2012.
[6] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the SuLQ framework. In Chen Li, editor, Principles of Database Systems, pages 128–138. ACM, 2005.
[7] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the sulq framework. In Principles of Database Systems. 2005.

[8] A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In Cynthia Dwork, editor, Symposium on Theory of Computing, pages 609–618. Association for Computing Machinery, 2008.
[9] A. Blum and Y. Mansour. Learning, regret minimization, and equilibria, 2007.
[10] J. L. Casti. Five Golden Rules: Great Theories of 20th-Century Mathematics and Why They Matter. Wiley, 1996.
[11] T. H. Hubert Chan, E. Shi, and D. Song. Private and continual release of statistics. In Automata, Languages and Programming, pages 405–417. Springer, 2010.
[12] K. Chaudhuri and D. Hsu. Sample complexity bounds for differentially private learning. In Proceedings of the Annual Conference on Learning Theory (COLT 2011). 2011.
[13] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of machine learning research: JMLR, 12:1069, 2011.
[14] K. Chaudhuri, A. Sarwate, and K. Sinha. Near-optimal differentially private principal components. In Advances in Neural Information Processing Systems 25, pages 998–1006. 2012.
[15] Y. Chen, S. Chong, I. A. Kash, T. Moran, and S. P. Vadhan. Truthful mechanisms for agents that value privacy. Association for Computing Machinery Conference on Electronic Commerce, 2013.
[16] P. Dandekar, N. Fawaz, and S. Ioannidis. Privacy auctions for recommender systems. In Internet and Network Economics, pages 309–322. Springer, 2012.
[17] A. De. Lower bounds in differential privacy. In Theory of Cryptography Conference, pages 321–338. 2012.
[18] I. Dinur and K. Nissim. Revealing information while preserving privacy. In Proceedings of the Association for Computing Machinery SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 202–210. 2003.
[19] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. arXiv preprint arXiv:1302.3203, 2013.
[20] C. Dwork. Differential privacy. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP)(2), pages 1–12. 2006.

[21] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486–503. 2006.
[22] C. Dwork and J. Lei. Differential privacy and robust statistics. In Proceedings of the 2009 International Association for Computing Machinery Symposium on Theory of Computing (STOC). 2009.
[23] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference '06, pages 265–284. 2006.
[24] C. Dwork, F. McSherry, and K. Talwar. The price of privacy and the limits of lp decoding. In Proceedings of the Association for Computing Machinery Symposium on Theory of Computing, pages 85–94. 2007.
[25] C. Dwork and M. Naor. On the difficulties of disclosure prevention in statistical databases or the case for differential privacy. Journal of Privacy and Confidentiality, 2010.
[26] C. Dwork, M. Naor, T. Pitassi, and G. N. Rothblum. Differential privacy under continual observation. In Proceedings of the Association for Computing Machinery Symposium on Theory of Computing, pages 715–724. Association for Computing Machinery, 2010.
[27] C. Dwork, M. Naor, T. Pitassi, G. N. Rothblum, and Sergey Yekhanin. Pan-private streaming algorithms. In Proceedings of International Conference on Super Computing. 2010.
[28] C. Dwork, M. Naor, O. Reingold, G. N. Rothblum, and S. P. Vadhan. On the complexity of differentially private data release: Efficient algorithms and hardness results. In Symposium on Theory of Computing '09, pages 381–390. 2009.
[29] C. Dwork, M. Naor, and S. Vadhan. The privacy of the analyst and the power of the state. In Foundations of Computer Science. 2012.
[30] C. Dwork, A. Nikolov, and K. Talwar. Efficient algorithms for privately releasing marginals via convex relaxations. In Proceedings of the Annual Symposium on Computational Geometry (SoCG). 2014.
[31] C. Dwork and K. Nissim. Privacy-preserving datamining on vertically partitioned databases. In Proceedings of Cryptology 2004, vol. 3152, pages 528–544. 2004.
[32] C. Dwork, G. N. Rothblum, and S. P. Vadhan. Boosting and differential privacy. In Foundations of Computer Science, pages 51–60. 2010.

[33] C. Dwork, K. Talwar, A. Thakurta, and L. Zhang. Analyze gauss: Optimal bounds for privacy-preserving pca. In Symposium on Theory of Computing. 2014.
[34] L. Fleischer and Y.-H. Lyu. Approximately optimal auctions for selling privacy when costs are correlated with data. In Association for Computing Machinery Conference on Electronic Commerce, pages 568–585. 2012.
[35] A. Ghosh and K. Ligett. Privacy and coordination: Computing on databases with endogenous participation. In Proceedings of the fourteenth ACM conference on Electronic commerce (EC), pages 543–560, 2013.
[36] A. Ghosh and A. Roth. Selling privacy at auction. In Association for Computing Machinery Conference on Electronic Commerce, pages 199–208. 2011.
[37] A. Groce, J. Katz, and A. Yerukhimovich. Limits of computational differential privacy in the client/server setting. In Proceedings of the Theory of Cryptography Conference. 2011.
[38] A. Gupta, M. Hardt, A. Roth, and J. Ullman. Privately releasing conjunctions and the statistical query barrier. In Symposium on Theory of Computing '11, pages 803–812. 2011.
[39] A. Gupta, A. Roth, and J. Ullman. Iterative constructions and private data release. In Theory of Cryptography Conference, pages 339–356. 2012.
[40] J. Håstad, R. Impagliazzo, L. Levin, and M. Luby. A pseudorandom generator from any one-way function. SIAM Journal of Computing, 28, 1999.
[41] M. Hardt, K. Ligett, and F. McSherry. A simple and practical algorithm for differentially private data release. In Advances in Neural Information Processing Systems 25, pages 2348–2356. 2012.
[42] M. Hardt and A. Roth. Beating randomized response on incoherent matrices. In Proceedings of the Symposium on Theory of Computing, pages 1255–1268. Association for Computing Machinery, 2012.
[43] M. Hardt and A. Roth. Beyond worst-case analysis in private singular vector computation. In Proceedings of the Symposium on Theory of Computing. 2013.
[44] M. Hardt and G. N. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In Foundations of Computer Science, pages 61–70. IEEE Computer Society, 2010.

[45] M. Hardt and K. Talwar. On the geometry of differential privacy. In Proceedings of the Association for Computing Machinery Symposium on Theory of Computing, pages 705–714. Association for Computing Machinery, 2010.
[46] N. Homer, S. Szelinger, M. Redman, D. Duggan, W. Tembe, J. Muehling, J. Pearson, D. Stephan, S. Nelson, and D. Craig. Resolving individuals contributing trace amounts of dna to highly complex mixtures using high-density snp genotyping microarrays. PLoS Genet, 4, 2008.
[47] J. Hsu, Z. Huang, A. Roth, T. Roughgarden, and Z. S. Wu. Private matchings and allocations. arXiv preprint arXiv:1311.2828, 2013.
[48] J. Hsu, A. Roth, and J. Ullman. Differential privacy for the analyst via private equilibrium computation. In Proceedings of the Association for Computing Machinery Symposium on Theory of Computing (STOC), pages 341–350, 2013.
[49] Z. Huang and S. Kannan. The exponential mechanism for social welfare: Private, truthful, and nearly optimal. In IEEE Annual Symposium on the Foundations of Computer Science (FOCS), pages 140–149. 2012.
[50] P. Jain, P. Kothari, and A. Thakurta. Differentially private online learning. Journal of Machine Learning Research - Proceedings Track, 23:24.1–24.34, 2012.
[51] M. Kapralov and K. Talwar. On differentially private low rank approximation. In Sanjeev Khanna, editor, Symposium on Discrete Algorithms, pages 1395–1414. SIAM, 2013.
[52] S. P. Kasiviswanathan, H. K. Lee, Kobbi Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.
[53] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the Association for Computing Machinery (JACM), 45(6):983–1006, 1998.
[54] M. Kearns, M. Pai, A. Roth, and J. Ullman. Mechanism design in large games: Incentives and privacy. In Proceedings of the 5th conference on Innovations in theoretical computer science (ITCS), 2014.
[55] D. Kifer, A. Smith, and A. Thakurta. Private convex empirical risk minimization and high-dimensional regression. Journal of Machine Learning Research, 1:41, 2012.
[56] K. Ligett and A. Roth. Take it or leave it: Running a survey when privacy comes at a cost. In Internet and Network Economics, pages 378–391. Springer, 2012.

[57] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. In Annual Symposium on Foundations of Computer Science, 1989, pages 256–261. IEEE, 1989.
[58] A. McGregor, I. Mironov, T. Pitassi, O. Reingold, K. Talwar, and S. P. Vadhan. The limits of two-party differential privacy. In Foundations of Computer Science, pages 81–90. IEEE Computer Society, 2010.
[59] F. McSherry. Privacy integrated queries (codebase). Available on Microsoft Research downloads website. See also the Proceedings of SIGMOD 2009.
[60] F. McSherry and K. Talwar. Mechanism design via differential privacy. In Foundations of Computer Science, pages 94–103. 2007.
[61] F. McSherry and K. Talwar. Mechanism design via differential privacy. In Foundations of Computer Science, pages 94–103. 2007.
[62] D. Mir, S. Muthukrishnan, A. Nikolov, and R. N. Wright. Pan-private algorithms via statistics on sketches. In Proceedings of the Association for Computing Machinery SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 37–48. Association for Computing Machinery, 2011.
[63] I. Mironov. On significance of the least significant bits for differential privacy. In T. Yu, G. Danezis, and V. D. Gligor, editors, Association for Computing Machinery Conference on Computer and Communications Security, pages 650–661. Association for Computing Machinery, 2012.
[64] I. Mironov, O. Pandey, O. Reingold, and S. P. Vadhan. Computational differential privacy. In Proceedings of CRYPTOLOGY, pages 126–142. 2009.
[65] A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets (how to break anonymity of the netflix prize dataset). In Proceedings of IEEE Symposium on Security and Privacy. 2008.
[66] A. Nikolov, K. Talwar, and L. Zhang. The geometry of differential privacy: the sparse and approximate cases. Symposium on Theory of Computing, 2013.
[67] K. Nissim, C. Orlandi, and R. Smorodinsky. Privacy-aware mechanism design. In Association for Computing Machinery Conference on Electronic Commerce, pages 774–789. 2012.
[68] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the Association for Computing Machinery Symposium on Theory of Computing, pages 75–84. 2007.

[69] K. Nissim, R. Smorodinsky, and M. Tennenholtz. Approximately optimal mechanism design via differential privacy. In Innovations in Theoretical Computer Science, pages 203–213. 2012.
[70] M. Pai and A. Roth. Privacy and mechanism design. SIGecom Exchanges, 2013.
[71] R. Rogers and A. Roth. Asymptotically truthful equilibrium selection in large congestion games. arXiv preprint arXiv:1311.2625, 2013.
[72] A. Roth. Differential privacy and the fat-shattering dimension of linear queries. In Approximation, Randomization, and Combinatorial Optimization, Algorithms and Techniques, pages 683–695. Springer, 2010.
[73] A. Roth. Buying private data at auction: the sensitive surveyor's problem. Association for Computing Machinery SIGecom Exchanges, 11(1):1–8, 2012.
[74] A. Roth and T. Roughgarden. Interactive privacy via the median mechanism. In Symposium on Theory of Computing '10, pages 765–774. 2010.
[75] A. Roth and G. Schoenebeck. Conducting truthful surveys, cheaply. In Proceedings of the ACM Conference on Electronic Commerce, pages 826–843. 2012.
[76] B. I. P. Rubinstein, P. L. Bartlett, L. Huang, and N. Taft. Learning in a large function space: Privacy-preserving mechanisms for svm learning. arXiv preprint arXiv:0911.5708, 2009.
[77] R. Schapire. The boosting approach to machine learning: An overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.
[78] R. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 39:297–336, 1999.
[79] R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.
[80] A. Smith and A. G. Thakurta. Differentially private feature selection via stability arguments, and the robustness of the lasso. In Proceedings of Conference on Learning Theory. 2013.
[81] L. Sweeney. Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine & Ethics, 25:98–110, 1997.
[82] J. Ullman. Answering $n^{2+o(1)}$ counting queries with differential privacy is hard. In D. Boneh, T. Roughgarden, and J. Feigenbaum, editors, Symposium on Theory of Computing, pages 361–370. Association for Computing Machinery, 2013.

[83] L. G. Valiant. A theory of the learnable. Communications of the Association for Computing Machinery, 27(11):1134–1142, 1984.
[84] S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
[85] D. Xiao. Is privacy compatible with truthfulness? In Proceedings of the Conference on Innovations in Theoretical Computer Science, pages 67–86. 2013.