Let's take a minute to look at how de-identification can fail. One way is through fields that you retain in the de-identified data: those fields may contain identifying information that the de-identification algorithm missed. For example, the physician notes field in your medical record is a free-text field, and it would be retained because it is not, on its face, identifying information. But suppose the physician carelessly mentioned the patient's name: "John came in with a big headache." Well, now you know that the patient's name was John.

Often, data is structured as a network or a graph. Even if you've taken away the labels that tell you which node in the network is which, matching the structure of that graph against the structure of another, labeled graph lets you figure things out.

De-identification can also be defeated by combining multiple partial identifications. In the examples we looked at, AOL and the Massachusetts health data, the attackers had a partial identification of the user from each individual piece of data that was retained. And one can use external data sets: in the Netflix example, the IMDb data set is what resulted in the de-identification being defeated.

When you defeat de-identification, there are four different types of leakage that can take place. The one that we've focused on thus far is revealing identity. But there might be other things that one is able to reveal that fall short of revealing the full identity but could, nevertheless, be extremely damaging. You could reveal the value of a hidden attribute: I don't know all the details of your medical record, but I've managed to figure out that you have cancer, an attribute you didn't want revealed. I may also be able to reveal the link between two entities.
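The attacks mentioned above can be sketched as a join on quasi-identifiers: fields like ZIP code, birth date, and sex survive de-identification individually, but together they single people out when matched against an external data set. The records, field names, and people below are entirely made up; this is a minimal sketch of the technique, not a reconstruction of any actual attack.

```python
# Hypothetical "de-identified" medical records: names removed, but
# quasi-identifiers (ZIP code, birth date, sex) retained.
medical = [
    {"zip": "02138", "birth": "1945-07-22", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "birth": "1962-03-14", "sex": "M", "diagnosis": "diabetes"},
]

# Hypothetical external data set (think of a public voter roll) that
# carries the same quasi-identifiers alongside names.
voters = [
    {"name": "Alice Smith", "zip": "02138", "birth": "1945-07-22", "sex": "F"},
    {"name": "Bob Jones",   "zip": "02139", "birth": "1962-03-14", "sex": "M"},
]

def link(medical, voters):
    """Re-identify records by joining the two sets on (zip, birth, sex)."""
    index = {(v["zip"], v["birth"], v["sex"]): v["name"] for v in voters}
    linked = []
    for rec in medical:
        key = (rec["zip"], rec["birth"], rec["sex"])
        if key in index:
            linked.append({"name": index[key], **rec})
    return linked

for row in link(medical, voters):
    print(row["name"], "->", row["diagnosis"])
```

Neither data set is identifying on its own; the leak comes entirely from the join.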
So, if I have phone call metadata, that is, I know who called whom even though I don't know what they talked about, that might be enough for me to figure out the friendship circles from the network of phone calls. And that might tell me which people are friends and therefore which entities are related.

I may also be able to reveal group membership, and not just network group membership. If, for instance, I'm tracking your cell phone location and I see where you are every Sunday morning, I can determine, first, that you're religious and, second, what your religious denomination is, because I know precisely which house of worship you are at every Sunday morning.

In terms of data that you think is private actually becoming public, it isn't just algorithms or attackers doing this; even your friends and relatives can play a role. You go to a party. You don't tell people you were at this party, but a friend posts photos of the party, and you are in some of those photos. So everyone who sees these photos knows you were at the party, and you have no control over which photos your friend chose to post.

To take another example, one might say it's an individual choice, and some people may be willing to make their DNA public because they wish to promote medical research. But if I know your DNA, I also have a pretty good idea about the DNA of your close blood relatives. I don't know exactly what their DNA is, but we know, in terms of genetics, that your parents, your children, and your siblings have DNA that is in many ways very similar to yours. So their DNA is now not quite public, because there is some uncertainty, but it's semi-public. And they played no role in that happening, because you made your choice about making your DNA public.

So, to bring all of this together, I think the way to look at it is that anonymity is virtually impossible. There's enough other data in the world.
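The phone-metadata point above can be made concrete: even with no call content at all, the shape of the call graph exposes social groups. Here is a minimal sketch with made-up phone numbers, using connected components of the call graph as a crude stand-in for "friendship circles":

```python
from collections import defaultdict

# Hypothetical call metadata: (caller, callee) pairs only -- no content.
calls = [
    ("555-0001", "555-0002"),
    ("555-0002", "555-0003"),
    ("555-0004", "555-0005"),
]

# Build an undirected call graph from the metadata.
graph = defaultdict(set)
for a, b in calls:
    graph[a].add(b)
    graph[b].add(a)

def circles(graph):
    """Connected components of the call graph: a rough proxy for friendship circles."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:                     # depth-first traversal of one component
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups

print(circles(graph))  # two circles: {0001, 0002, 0003} and {0004, 0005}
```

A real analysis would use community-detection rather than plain connectivity, but the point stands: the metadata alone partitions people into social groups.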
And so, if you think that something is going to stay anonymous because you have a diverse set of entities, an attacker can probably defeat that by joining with external data. There are randomization techniques, where the idea is random perturbation, but those techniques work only if you can guarantee a one-time perturbation: if I can repeat the process, then on average the randomness should disappear. Aggregation works only if there is no known structure among the entities aggregated; otherwise, the biases show up. In terms of image data, faces can be recognized, and the technology for facial recognition has been improving by leaps and bounds. We're now able to do this even under challenging conditions, even with things like partial occlusion.

The net result is that we really shouldn't assume we have anonymity. We should design the other parts of our systems to work on the assumption that anonymity can be breached by somebody who really wants to.

A knee-jerk reaction, then, is to say, "Oh, anonymity is impossible. Let's not publish the data." But access to data is crucial for many, many desirable purposes. We've talked about medical research. You're going to have public watchdogs. You want government agencies to publicize a lot of their data. And if fears of revealing sensitive personal data prevent that from happening, we lose the benefits, whether in terms of medical advances or in terms of having watchdogs watch over how our government invests resources.

So what can we do? I think that, really, de-identified data is the equivalent of the lock on the door of your home. I came here today and I locked the door of my home, and what that means is that a passerby couldn't casually just enter. It doesn't mean that I've got a door that couldn't be broken down, or a lock that couldn't be picked, or a window that couldn't be broken and used to force entry.
It certainly isn't the case that my home is guaranteed to be invulnerable. But there is still value in locking the door when I leave. And I think that's the way we want to think about de-identification of data: having de-identified data means casual identification isn't possible. Somebody who is sufficiently committed will be able to re-identify it, and what we want is to make it hard for such people to access the data. So for data that you want to make public, make it public through a licensing regime. This license can be a contract of some sort, or it can be something enforced as a professional standard. And just the fact that people aren't idly trying to re-identify de-identified data sets is, I think, going to address a lot of the identification concerns we worry about.

So the general idea here is that you want to be in control of how you're represented on the web. That is your notion of identity. It's very hard to manage. Anonymity is possible only in very limited, narrow situations. And so we have to come up with mechanisms to get value from our data while realizing that anonymity is going to be limited. De-identification has a role to play in this regard, imperfect though it might be.
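The earlier point about randomization, that a perturbation protects only if it is applied once, can be demonstrated in a few lines. All the numbers here are made up; the `noisy_release` mechanism is a deliberately simple illustration, not a recommended privacy mechanism.

```python
import random

random.seed(0)  # for reproducibility

TRUE_SALARY = 87_000  # the hidden value the perturbation is meant to protect

def noisy_release(value, scale=10_000):
    """One randomized release: the true value plus fresh uniform noise."""
    return value + random.uniform(-scale, scale)

# A single perturbed release hides the true value reasonably well...
one = noisy_release(TRUE_SALARY)

# ...but if the mechanism answers repeatedly with fresh noise each time,
# averaging the answers makes the noise cancel out.
many = [noisy_release(TRUE_SALARY) for _ in range(100_000)]
estimate = sum(many) / len(many)

print(round(estimate))  # close to 87000: the randomness has averaged away
```

This is exactly why practical schemes either fix the noise per record or budget the total number of answers a questioner may receive.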