Sweeney Article

Transcript

1 L. Sweeney. k International Journal on Uncertainty, -anonymity: a model for protecting privacy. 10 (5), 2002; 557-570. Fuzziness and Knowledge-based Systems, 1 k -ANONYMITY: A MODEL FOR PROTECTING PRIVACY LATANYA SWEENEY School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA E-mail: [email protected] 2002 Received May Consider a data holder, such as a hospital or a bank, that has a privately held collection of person-specific, field structured data. Suppose the data holder wants to share a version of the data with researchers. How can a data holder release a version of its private data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful? The solution k provided in this paper includes a formal protection model named -anonymity and a set k -anonymity protection if of accompanying policies for deployment. A release provides nnot be distinguished from at the information for each person contained in the release ca least k -1 individuals whose information also appears in the release. This paper also - examines re-identification attacks that can be realized on releases that adhere to k anonymity unless -anonymity protection k accompanying policies are respected. The model is important because it forms the basis on which the real-world systems known as -Argus and k -Similar provide guarantees of privacy protection. Datafly, μ : data anonymity, data privacy, re-identification, data fusion, privacy. Keywords 1. Introduction Society is experiencing exponential growth in the number and variety of data collections containing person-specific information as computer technology, network connectivity and disk storage space become increasingly affordable. Data holders, operating autonomously and with limited knowledge, are left with the difficulty of releasing information that does not compromise privacy, confidentiality or national interests. In many cases the survival of the database itself depends on the data holder's ability to produce anonymous data because not releasing such information at all may diminish the need for the data, while on the other hand, failing to provide proper protection within a release may create circumstances that harm the public or others. 1 This paper significantly amends and substantially expands the earlier paper “Protecting privacy when disclosing information: k -anonymity and its enforcement through generalization and suppression” (with Samarati) submitted to IEEE Security and Privacy 1998, and extends parts of my Ph.D. thesis “Computational Disclosure Control: A primer on data privacy protection” at the Massachusetts Institute of Technology 2001. Page 1

2 L. Sweeney. -anonymity: a model for protecting privacy. International Journal on Uncertainty, k 10 (5), 2002; 557-570. Fuzziness and Knowledge-based Systems, So a common practice is for organizations to release and receive person- specific data with all explicit identifiers, such as name, address and telephone number, removed on the assumption that anonymity is maintained because the resulting data look anonymous. However, in most of these cases, the remaining data can be used to re-identify individuals by linking or matching the data to other data or by looking at unique characteristics found in the released data. In an earlier work, experiments using 1990 U.S. Census summary data were conducted to determine how many individuals within geographically situated populations had combinations of demographic values that occurred infrequently [1]. Combinations of few characteristics often combine in populations to uniquely or nearly uniquely identify some individuals. For example, a finding in that study was that 87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on { 5-digit 2 , gender , date of birth }. Clearly, data released containing such information ZIP about these individuals should not be considered anonymous. Yet, health and other person-specific data are often publicly available in this form. Below is a demonstration of how such data can be re-identified. Example 1. Re-identification by linking The National Association of Health Data Organizations (NAHDO) reported that 37 states in the USA have legislative mandates to collect hospital level data and that 17 states have started collecting ambulatory care data from hospitals, physicians offices, clinics, and so forth [2]. The leftmost circle in attributes Figure 1 contains a subset of the fields of information, or , that NAHDO recommends these states collect; these attributes include the patient’s ZIP code, birth date, gender, and ethnicity. In Massachusetts, the Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees. GIC collected patient- specific data with nearly one hundred attributes per encounter along the lines of the those shown in the leftmost circle of Figure 1 for approximately 135,000 state employees and their families. Because the data were believed to be anonymous, GIC gave a copy of the data to researchers and sold a copy to industry [3]. For twenty dollars I purchased the voter registration list for Cambridge Massachusetts and received the information on two diskettes [4]. The rightmost circle in Figure 1 shows that these data included the name, address, ZIP code, birth date, and gender of each voter. This information can be linked using ZIP code, birth date and gender to the medical information, thereby 2 In the United States, a ZIP code refers to the postal code assigned by the U.S. Postal Service. Typically 5-digit ZIP codes are used, though 9-digit ZIP codes have been assigned. A 5-digit code is the first 5 digits of the 9-digit code. Page 2

3 L. Sweeney. International Journal on Uncertainty, k -anonymity: a model for protecting privacy. Fuzziness and Knowledge-based Systems, 10 (5), 2002; 557-570. linking diagnosis, procedures, and medications to particularly named individuals. For example, William Weld was governor of Massachusetts at that time and his medical records were in the GIC data. Governor Weld lived in Cambridge Massachusetts. According to the Cambridge Voter list, six people had his particular birth date; only three of them were men; and, he was the only one in his 5-digit ZIP code. Name Ethnicity Address Visit date ZIP Date Diagnosis Birth registered date Procedure Party Sex Medication affiliation Total charge Date last voted Medical Data Voter List Figure 1 Linking to re-identify data The example above provides a demonstration of re-identification by directly linking (or “matching”) on shared attributes. The work presented in this paper shows that altering the released information to map to many possible people, thereby making the linking ambiguous, can thwart this kind of attack. The greater the number of candidates provided, the more ambiguous the linking, and therefore, the more anonymous the data. 2. Background The problem of releasing a version of privately held data so that the individuals who are the subjects of the data cannot be identified is not a new problem. There are existing works in the statistics community on statistical databases and in the computer security community on multi-level databases to consider. However, none of these works provide solutions to the broader problems experienced in today’s data rich setting. 2.1. Statistical databases Federal and state statistics offices around the world have traditionally been entrusted with the release of statistical information about all aspects of the populace [5]. But like other data holders, statistics offices are also facing tremendous demand for person-specific data for applications such as data mining, Page 3

4 L. Sweeney. International Journal on Uncertainty, k -anonymity: a model for protecting privacy. Fuzziness and Knowledge-based Systems, 10 (5), 2002; 557-570. cost analysis, fraud detection and retrospective research. But many of the established statistical database techniques, which involve various ways of adding noise [6] to the data while still maintaining some statistical invariant [7, 8], often destroy the integrity of records, or tuples , and so, for many new uses of data, these established techniques are not appropriate. Willenborg and De Waal [9] provide more extensive coverage of traditional statistical techniques. 2.2. Multi-level databases Another related area is aggregation and inference in multi-level databases [10, 11, 12, 13, 14, 15] which concerns restricting the release of lower classified information such that higher classified information cannot be derived. Denning and Lunt [16] described a multilevel relational database system (MDB) as having data stored at different security classifications and users having different security clearances. Su and Ozsoyoglu formally investigated inference in MDB. They showed that eliminating precise inference compromise due to functional dependencies and multi-valued dependencies is NP-complete. By extension to this work, the precise elimination of all inferences with respect to the identities of the individuals whose information is included in person-specific data is typically impossible to guarantee. Intuitively this makes sense because the data holder cannot consider a priori every possible attack. In trying to produce anonymous data, the work that is the subject of this paper seeks to primarily protect against known attacks. The biggest problems result from inferences that can be drawn after linking the released data to other knowledge, so in this work, it is the ability to link the result to foreseeable data sources that must be controlled. Many aggregation inference problems can be solved by database design, but this solution is not practical in today’s data rich setting. In today's environment, information is often divided and partially replicated among multiple data holders and the data holders usually operate autonomously in making decisions about how data will be released. Such decisions are typically made locally with incomplete knowledge of how sensitive other holders of the information might consider replicated data. For example, when somewhat aged information on joint projects is declassified differently by the Department of Defense than by the Department of Energy, the overall declassification effort suffers; using the two partial releases, the original may be reconstructed in its entirety. In general, systems that attempt to produce anonymous data must operate without the degree of omniscience and level of control typically available in the traditional aggregation problem. In both aggregation and MDB, the primary technique used to control the flow of sensitive information is suppression , where sensitive information and all information that allows the inference of sensitive information are simply not released. Suppression can drastically reduce the quality of the data, and in the Page 4

5 L. Sweeney. -anonymity: a model for protecting privacy. International Journal on Uncertainty, k 10 (5), 2002; 557-570. Fuzziness and Knowledge-based Systems, case of statistical use, overall statistics can be altered, rendering the data When protecting national interests, not releasing the practically useless. information at all may be possible, but the greatest demand for person-specific data is in situations where the data holder must provide adequate protections while keeping the data useful, such as sharing person-specific medical data for research purposes. 2.3. Computer security is not privacy protection An area that might appear to have a common ancestry with the subject of this paper is access control and authentication, which are traditional areas associated with computer security. Work in this area ensures that the recipient of information has the authority to receive that information. While access control and authentication protections can safeguard against direct disclosures, they do not address disclosures based on inferences that can be drawn from released data. The more insidious problem in the work that is the subject of this paper is not so much whether the recipient can get access or not to the information as much as what values will constitute the information the recipient will receive. A general doctrine of the work presented herein is to release all the information but to do so such that the identities of the people who are the subjects of the data (or other sensitive properties found in the data) are protected. Therefore, the goal of the work presented in this paper lies outside of traditional work on access control and authentication. 2.4. Multiple queries can leak inference Denning [17] and others [18, 19] were among the first to explore inferences For example, consider a table realized from multiple queries to a database. containing only (physician, patient, medication). A query listing the patients seen by each physician, i.e., a relation R(physician, patient), may not be sensitive. Likewise, a query itemizing medications prescribed by each physician may also not be sensitive. But the query associating patients with their prescribed medications may be sensitive because medications typically correlate with diseases. One common solution, called query restriction, prohibits queries that can reveal sensitive information. This is effectively realized by suppressing all inferences to sensitive data. In contrast, this work poses a real-time solution to this problem by advocating that the data be first rendered sufficiently anonymous, and then the resulting data used as the basis on which queries are processed. Doing so typically retains far more usefulness in the data because the resulting release is often less distorted. In summary, the dramatic increase in the availability of person-specific data from autonomous data holders has expanded the scope and nature of inference control problems and exasperated established operating practices. The goal of this work Page 5

6 L. Sweeney. -anonymity: a model for protecting privacy. International Journal on Uncertainty, k Fuzziness and Knowledge-based Systems, 10 (5), 2002; 557-570. is to provide a model for understanding, evaluating and constructing computational systems that control inferences in this setting. 3. Methods The goal of this section is to provide a formal framework for constructing and evaluating algorithms and systems that release information such that the released information limits what can be revealed about properties of the entities that are to be protected. For convenience, I focus on person-specific data, so the entities are people, and the property to be protected is the identity of the subjects whose information is contained in the data. However, other properties could also be -anonymity protected. The formal methods provided in this paper include the k protection model. The real-world systems Datafly [20], μ -Argus [21] and k - Similar [22] motivate this approach. data refers to person-specific information Unless otherwise stated, the term that is conceptually organized as a table of rows (or records) and columns (or fields). Each row is termed a . A tuple contains a relationship among the set tuple are not necessarily of values associated with a person. Tuples within a table unique. Each column is called an attribute and denotes a field or semantic category of information that is a set of possible values; therefore, an attribute is also a domain. Attributes within a table are unique. So by observing a table, each is in the , d d ,..., d > such that each value d row is an ordered n -tuple of values < n 2 1 j n j j =1,2,..., domain of the where -th column, for is the number of columns. In n mathematical set theory, a relation corresponds with this tabular presentation, the only difference is the absence of column names. Ullman provides a detailed discussion of relational database concepts [23]. Definition 1. Attributes ,..., A )bea table with a finite number of tuples. The finite set of Let ( A B n 1 }. A ,..., B are { A attributes of 1 n A , I use B ∈ t ,..., A }, and a tuple ), { A ,..., ,..., A A } ⊆ { A ( B Given a table j i n n 1 1 in A ,..., t A ] to denote the sequence of the values, v . I use ,..., v ,of A ,..., A [ t i i j j j i ,..., A ] to denote the projection, maintaining duplicate tuples, of attributes B A [ i j . B ,... A in A j i Throughout the remainder of this work each tuple is assumed to be specific to one person and no two tuples pertain to the same person. This assumption simplifies discussion without loss of applicability. To draw an inference is to come to believe a new fact on the basis of other information. A disclosure means that explicit or inferable information about a person was released that was not intended. This definition may not be consistent with colloquial use but is used in this work consistent with its meaning in Page 6

7 L. Sweeney. k International Journal on Uncertainty, -anonymity: a model for protecting privacy. 10 (5), 2002; 557-570. Fuzziness and Knowledge-based Systems, attempts to identify and limit statistical disclosure control. So, disclosure control disclosures in released data. Typically the goal of disclosure control with respect to person-specific data is to ensure that released data are sufficiently anonymous. Let me be more specific about how properties are selected and controlled. Recall the linking example shown in Figure 1. In that case, the need for protection centered on limiting the ability to link released information to other external So the properties to be controlled are operationally realized as collections. attributes in the privately held collection. The data holder is expected to identify all attributes in the private information that could be used for linking with external information. Such attributes not only include explicit identifiers such as name, address, and phone number, but also include attributes that in combination can uniquely identify individuals such as birth date and gender. The set of such by Dalenius [24]. So operationally, a attributes has been termed a quasi-identifier goal of this work is to release person-specific data such that the ability to link to other information using the quasi-identifier is limited. Definition 2. Quasi-identifier ,...,A → ), f U : , an entity-specific table T (A Given a population of entities U c 1 n and T , is a set of : T → U', where U f U' . A quasi-identifier of T , written Q ⊆ T g attributes {A } where: p ,...,A ]) = } ⊆ {A Q ,...,A )[ . ∃ p p ∈ U such that f ( ( f i c n 1 T j i i g i Example 2. Quasi-identifier Let V be the voter-specific table described earlier in Figure 1 as the voter list. , birth date , ZIP , ,is{ name , address Q , written V A quasi-identifier for V }. gender Linking the voter list to the medical data as shown in Figure 1, clearly address ⊆ } , name . However, { demonstrates that { } ⊆ ender ,g ZIP Q , birth date V because these attributes can also appear in external information and be used for Q V linking. In the case of anonymity, it is usually publicly available data on which linking is to be prohibited and so attributes which appear in private data and also appear in public data are candidates for linking; therefore, these attributes constitute the quasi-identifier and the disclosure of these attributes must be controlled. It is believed that these attributes can be easily identified by the data holder. Assumption (quasi-identifier). The data holder can identify attributes in his private data that may also appear in external information and therefore, can accurately identify quasi-identifiers. Page 7

8 L. Sweeney. k International Journal on Uncertainty, -anonymity: a model for protecting privacy. 10 (5), 2002; 557-570. Fuzziness and Knowledge-based Systems, Consider an instance where this assumption is incorrect; that is, the data holder misjudges which attributes are sensitive for linking. In this case, the released data may be less anonymous than what was required, and as a result, individuals may be more easily identified. Clearly, this risk cannot be perfectly resolved by the data holder because the data holder cannot always know what each recipient of the data knows but policies and contracts, which lie outside the algorithms, can help. Also, the data holder may find it necessary to release data that are only partially anonymous. Again, policies, laws and contracts can provide complementary protections. In the remainder of this work, I assume a proper quasi-identifier has been recognized. As an aside, there are many ways to expand the notion of a quasi-identifier to provide more flexibility and granularity. Both the Datafly and μ -Argus systems weight the attributes of the quasi-identifier. For simplicity in this work, however, I consider a single quasi-identifier based on attributes, without weights, appearing together in an external table or in a possible join of external tables. k 3.1. The -anonymity protection model In an earlier work, I introduced basic protection models termed null-map , k-map wrong-map and which provide protection by ensuring that released information maptono, k or incorrect entities, respectively [25]. To determine how many individuals each released tuple actually matches requires combining the released data with externally available data and analyzing other possible attacks. Making such a determination directly can be an extremely difficult task for the data holder who releases information. Although I can assume the data holder knows which PT data in also appear externally, and therefore what constitutes a quasi-identifier, the specific values contained in external data cannot be assumed. I therefore seek to protect the information in this work by satisfying a slightly different constraint requirement. This is a special case of k - k-anonymity on released data, termed the is enforced on the released data. map protection where k k Definition 3. -anonymity be the quasi-identifier associated with it. ) be a table and QI ,..., A A ( Let RT n RT 1 RT is said to satisfy k -anonymity if and only if each sequence of values in ]. QI [ RT occurrences in ] appears with at least k QI [ RT RT RT Page 8

9 L. Sweeney. k International Journal on Uncertainty, -anonymity: a model for protecting privacy. 10 (5), 2002; 557-570. Fuzziness and Knowledge-based Systems, Gender ZIP Problem Race Birth m t1 Black 0214* 1965 short breath 0214* m chest pain t2 Black 1965 t3 Black 1965 f 0213* hypertension t4 Black hypertension 1965 f 0213* f 0213* obesity 1964 t5 Black f 0213* t6 Black 1964 chest pain 1964 0213* chest pain t7 White m m obesity 1964 t8 White 0213* m 0213* short breath t9 White 1964 1967 0213* chest pain t10 White m t11 White 1967 m 0213* chest pain k -anonymity, where k =2 and QI ={ Race , Birth , Figure 2 Example of , ZIP } Gender Example 3. Table adhering to -anonymity k T Figure 2 provides an example of a table -anonymity. The that adheres to k QI quasi-identifier for the table is k , Birth , Gender , ZIP }and ={ =2. Race T T , the values of the Therefore, for each of the tuples contained in the table tuple that comprise the quasi-identifier appear at least twice in T . That is, for occurrences of those 2 ] there are at least T each sequence of values in [ QI T ]. In particular, t1 [ QI QI ]= t2 [ ], ]= ], t3 [ QI [ ]= t4 [ QI t5 QI [ QI values in T T T T T T T ]. QI [ t11 ]= ], t7 [ QI QI ]= t8 [ QI [ ]= t9 [ QI t10 ], and QI [ t6 T T T T T T Lemma. =( ) be the quasi-identifier associated ,..., A )beatable, QI A A ,..., Let RT ( A i RT j 1 n with RT , A Then, each ,..., A -anonymity. ⊆ A k ,..., A satisfy ,and RT n 1 i j occurrences in ] appears with at least ] QI [ k RT [ sequence of values in A RT x RT for x = i ,..., j . Example 4. k occurrences of each value under k -anonymity , Birth ={ Race , -anonymity, where QI Table T in Figure 2 adheres to k T , ZIP }and k =2. Therefore, each value that appears in a value Gender associated with an attribute of QI in T appears at least k times. | T [ Race [ ="black"]| = 6. | [ Race ="white"]| = 5. | T T Birth ="1964"]| = 5. | T [ Birth Gender ="1965"]| = 4. | T [ Birth ="1967"]| = 2. | T [ ="f"]| ="m"]| = 6. | T [ Gender [ =5.| ="0213*"]| = 9. And, | T [ ZIP ="0214*"]| = 2. ZIP T satisfies k -anonymity with RT It can be trivially proven that if the released data , then the combination of the released data RT respect to the quasi-identifier QI PT QI or a subset was based, cannot link on and the external sources on which QI PT PT of its attributes to match fewer than k individuals. This property holds provided which are externally available in that all attributes in the released table RT Page 9

10 L. Sweeney. k International Journal on Uncertainty, -anonymity: a model for protecting privacy. 10 (5), 2002; 557-570. Fuzziness and Knowledge-based Systems, combination (i.e., appearing together in an external table or in a possible join of associated with the private QI external tables) are defined in the quasi-identifier PT table . This property does not guarantee individuals cannot be identified in PT RT ; there may exist other inference attacks that could reveal the identities of the individuals contained in the data. However, the property does protect against RT inference from linking (by direct matching) to known external sources; and in this context, the solution can provide an effective guard against re-identifying individuals. Race Race Race ZIP ZIP ZIP 02138 02138 Asian 02130 Asian Person 02139 Person 02139 Asian 02130 Asian Asian Person 02141 Asian 02140 02141 02142 02142 Asian 02140 Asian Person 02138 Person 02138 Black 02130 Black Black 02139 02139 Black 02130 Person Black Person 02141 Black 02140 02141 02140 Black 02142 Person 02142 Black White 02138 White 02130 Person 02138 Person 02130 White 02139 White 02139 02141 02141 White 02140 White Person 02142 Person 02142 White 02140 White GT1 GT2 PT Figure 3 Examples of k -anonymity tables based on PT 4. Attacks against k-anonymity Even when sufficient care is taken to identify the quasi-identifier, a solution that adheres to k -anonymity can still be vulnerable to attacks. Three are described below. Fortunately, the attacks presented can be thwarted by due diligence to some accompanying practices, which are also described below. k -anonymity 4.1. Unsorted matching attack against This attack is based on the order in which tuples appear in the released table. While I have maintained the use of a relational model in this discussion, and so the order of tuples cannot be assumed, in real-world use this is often a problem. It can be corrected of course, by randomly sorting the tuples of the solution table. Otherwise, the release of a related table can leak sensitive information. Example 5. Unsorted matching attack Tables GT1 and GT2 in Figure 3 are based on PT and adhere to k -anonymity, }and =2. The positions of the tuples in each table ={ Race , ZIP k QI where PT correspond to those in PT .If GT1 is released and a subsequent release of is then performed, then direct matching of tuples across the tables based GT2 Page 10

11 L. Sweeney. k International Journal on Uncertainty, -anonymity: a model for protecting privacy. 10 (5), 2002; 557-570. Fuzziness and Knowledge-based Systems, on tuple position within the tables reveals sensitive information. On the other hand, if the positions of the tuples within each table are randomly determined, both tables can be released. k 4.2. Complementary release attack against -anonymity In the previous example, all the attributes were in the quasi-identifier. That is typically not the case. It is more common that the attributes that constitute the quasi-identifier are themselves a subset of the attributes released. As a result, k when a table T , which adheres to -anonymity, is released, it should be considered as joining other external information. Therefore, subsequent releases of the same T a quasi- privately held information must consider all of the released attributes of , unless of course, subsequent releases are based identifier to prohibit linking on T . on T Example 6. Complementary release attack PT GT1 and GT3 in Figure in Figure 4. The tables Consider the private table and adhere to k -anonymity, where k =2 and the quasi- 5 are based on PT GT1 ZIP ={ Race , BirthDate , Gender , is }. Suppose table identifier QI PT If subsequently GT3 k -anonymity released. is also released, then the protection will no longer hold, even though the tuple positions are randomly GT1 determined in both tables. Linking GT3 on { Problem } reveals the and LT shown in Figure 4. Notice how [ white , 1964 , male , 02138 table and ] [ , 1965 , female , 02139 ] are unique in LT white LT does not and so, satisfy the k -anonymity requirement enforced by GT1 and GT3 . This GT3 problem would not exist if }or used the quasi=identifier QI ∪ { Problem if GT1 . In this latter case, no value more specific had been the basis of GT3 would be subsequently released. GT1 than it appears in Problem Race BirthDate Gender ZIP Problem Gender ZIP Race BirthDate 02141 short of breath black 1965 male 02141 short of breath black 9/20/1965 male 2/14/1965 male 02141 chest pain black 1965 male 02141 chest pain black 10/23/1965 female black 1965 female 02138 painful eye 02138 painful eye black 8/24/1965 female 02138 wheezing black 1965 female 02138 wheezing black 11/7/1964 female 02138 obesity black 1964 female 02138 obesity black 12/1/1964 female 02138 chest pain black black 1964 female 02138 chest pain white 10/23/1964 male white 1964 male 02138 short of breath 02138 short of breath 3/15/1965 female 1965 white white female 02139 hypertension 02139 hypertension white 02139 obesity white 1964 male 02139 obesity 8/13/1964 male white 5/5/1964 male 02139 fever white 1964 male 02139 fever white 02138 vomiting white 1967 male 02138 vomiting 2/13/1967 male white 3/21/1967 male 02138 back pain white 1967 male 02138 back pain PT LT LT Figure 4 Private Table PT and linked table Page 11

12 L. Sweeney. -anonymity: a model for protecting privacy. International Journal on Uncertainty, k 10 (5), 2002; 557-570. Fuzziness and Knowledge-based Systems, Problem Race BirthDate Gender ZIP Problem Race BirthDate Gender ZIP male black 1965 male 02141 short of breath 1965 02141 short of breath black male 02141 chest pain black black male 02141 chest pain 1965 1965 female 0213* painful eye black 1965 female 02138 painful eye person 1965 female 0213* wheezing black person 1965 female 02138 wheezing 1965 black female 02138 obesity black 1964 female 02138 obesity 1964 1964 02138 chest pain black 1964 female 02138 chest pain black female 1964 male 0213* short of breath white 1960-69 male 02138 short of breath white person 1965 female 0213* hypertension white 1960-69 human 02139 hypertension white 1964 male 0213* obesity white 1960-69 human 02139 obesity white 1964 0213* fever white 1960-69 human 02139 fever male 1967 02138 vomiting white 1960-69 male 02138 vomiting white male 1967 male 02138 back pain white 1960-69 male white 02138 back pain GT1 GT3 k -anonymity tables based on PT Figure 5 Two k =2 in Figure 4 where 4.3. Temporal attack against k -anonymity Data collections are dynamic. Tuples are added, changed, and removed constantly. As a result, releases of generalized data over time can be subject to a be the original privately held table at time temporal inference attack. Let table T 0 ,is RT , which I will call table t T -anonymity solution based on k =0. Assume a 0 0 t , assume additional tuples were added to the privately held released. At time that is ,soitcomes k T .Let RT -anonymity solution based on be a T T table t t t 0 , linking RT respect . Because there is no requirement that RT released at time t t 0 RT may reveal sensitive information and thereby compromise and RT the tables t 0 k -anonymity protection. As was the case in the previous example, to combat this should be considered as joining other external information. problem, RT 0 would be considered a quasi- RT Therefore, either all of the attributes of 0 identifier for subsequent releases, or subsequent releases themselves would be . based on RT 0 Example 7. Temporal attack , assume the privately held information is PT in Figure 4. As stated t At time 0 GT1 and GT3 earlier, k -anonymity solutions based on PT over in Figure 5 are =2. Assume Race , BirthDate , ={ , ZIP }where k Gender QI the quasi-identifier PT GT1 At a later time t is released. , PT PT ,whichis PT ∪ becomes 1 t1 , { [ black , 9/7/65 , male , 02139 , headache ] , [ black , 11/4/65 , male 02139 , ] }. Assume a k -anonymity solution based on PT is provided, rash . Assume this table contains GT3 in Figure 5; GT andthatitiscalled t1 specifically, GT , = GT3 ∪ { [ black , 1965 , male , 02139 , headache ] t1 [ , 1965 , male black 02139 , rash ] }. As was shown in the previous , example, GT1 and GT3 canbelinkedon{ Problem } to reveal unique tuples . Likewise, GT1 and GT can be linked to reveal the same unique over QI PT t1 tuples. One way to combat this problem is to base k-anonymity solutions on Page 12

13 L. Sweeney. -anonymity: a model for protecting privacy. International Journal on Uncertainty, k 10 (5), 2002; 557-570. Fuzziness and Knowledge-based Systems, ∪ ( PT , – PT ). In that case, a result could be GT1 ∪ { [ black , 1965 GT1 t1 male , 02139 , headache ] , [ black , 1965 , male , 02139 , rash ] }, which does not compromise the distorted values in GT1 . 5. Concluding remark In this paper, I have presented the k -anonymity protection model, explored related attacks and provided ways in which these attacks can be thwarted. Acknowledgments First, I thank Vicenc Torra and Josep Domingo for their encouragement to write -anonymity and thank this paper. I also credit Pierangela Samarati for naming k her for recommending I engage the material in a formal manner and starting me in that direction in 1998. Finally, I am extremely grateful to the corporate and government members of the Laboratory for International Data Privacy at Carnegie Mellon University for providing me support and the opportunity to work on real- world data anonymity problems. References L. Sweeney, Uniqueness of Simple Demographics in the U.S. Population ,LIDAP- 1 WP4. Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA: 2000. Forthcoming book entitled, The Identifiability of Data . 2 A Guide to State-Level National Association of Health Data Organizations, Ambulatory Care Data Collection Activities (Falls Church: National Association of Health Data Organizations, Oct. 1996). 3 Group Insurance Commission testimony before the Massachusetts Health Care Committee. See Session of the Joint Committee on Health Care, Massachusetts State Legislature , (March 19, 1997). 4 Cambridge Voters List Database. City of Cambridge, Massachusetts .Cambridge: February 1997. I. Fellegi. On the question of statistical confidentiality. Journal of the American 5 Statistical Association , 1972, pp. 7-18. 6 J. Kim. A method for limiting disclosure of microdata based on random noise and transformation Proceedings of the Section on Survey Research Methods of the American Statistical Association , 370-374. 1986. 7 M. Palley and J. Siminoff. Regression methodology based disclosure of a statistical database Proceedings of the Section on Survey Research Methods of the American Statistical Association 382-387. 1986. 8 G. Duncan and R. Pearson. Enhancing access to data while protecting confidentiality: prospects for the future. Statistical Science , May, as Invited Paper with Discussion. 1991. Page 13

14 L. Sweeney. k International Journal on Uncertainty, -anonymity: a model for protecting privacy. 10 (5), 2002; 557-570. Fuzziness and Knowledge-based Systems, Statistical Disclosure Control in Practice . Springer- 9 L. Willenborg and T. De Waal. Verlag, 1996. T. Su and G. Ozsoyoglu. Controlling FD and MVD inference in multilevel relational 10 IEEE Transactions on Knowledge and Data Engineering , 3:474-- database systems. 485, 1991. M. Morgenstern. Security and Inference in multilevel database and knowledge based 11 Proc. of the ACM SIGMOD Conference, pages 357--373, 1987. systems. 12 T. Hinke. Inference aggregation detection in database management systems. In Proc. , pages 96-107, Oakland, of the IEEE Symposium on Research in Security and Privacy 1988. 13 Proc. of the IEEE T. Lunt. Aggregation and inference: Facts and fallacies. In , pages 102--109, Oakland, CA, May 1989. Symposium on Security and Privacy X. Qian, M. Stickel, P. Karp, T. Lunt, and T. Garvey. Detection and elimination of 14 inference channels in multilevel relational database systems. In Proc. of the IEEE Symposium on Research in Security and Privacy , pages 196--205, 1993. T. Garvey, T. Lunt and M. Stickel. Abductive and approximate reasoning models for 15 characterizing inference channels. , IEEE Computer Security Foundations Workshop 4 , 1991. 16 D. Denning and T. Lunt. A multilevel relational data model. In Proc. of the IEEE Symposium on Research in Security and Privacy , pages 220-234, Oakland, 1987. 17 . Addison-Wesley, 1982. D. Denning. Cryptography and Data Security 18 D. Denning, P. Denning, and M. Schwartz. The tracker: A threat to statistical database , 4(1):76--96, March 1979. ACM Trans. on Database Systems security. 19 G. Duncan and S. Mukherjee. Microdata disclosure limitation in statistical databases: Proc. of the 1991 IEEE Symposium query size and random sample query control. In on Research in Security and Privacy , May 20-22, Oakland, California. 1991. L. Sweeney. Guaranteeing anonymity when sharing medical data, the Datafly system. 20 . Washington, Proceedings, Journal of the American Medical Informatics Association DC: Hanley & Belfus, Inc., 1997. 21 A. Hundepool and L. Willenborg. μ -and τ -argus: software for statistical disclosure control. Third International Seminar on Statistical Confidentiality. Bled: 1996. 22 L. Sweeney. Towards the optimal suppression of details when disclosing medical data, the use of sub-combination analysis. Proceedings, MEDINFO 98 . International Medical Informatics Association. Seoul, Korea. North-Holland, 1998. 23 J. Ullman. Principles of Database and Knowledge Base Systems . Computer Science Press, Rockville, MD. 1988. 24 T. Dalenius. Finding a needle in a haystack – or identifying anonymous census record. Journal of Official Statistics, 2(3):329-336, 1986. 25 L. Sweeney, Computational Data Privacy Protection ,LIDAP-WP5.CarnegieMellon University, Laboratory for International Data Privacy, Pittsburgh, PA: 2000. . Forthcoming book entitled, A Primer on Providing Privacy in Data Page 14

Related documents

Christl Networks  K .indd

Christl Networks K .indd

Wolfie Christl, Sarah Spiekermann Networks of Control

More info »