De-identification is the removal of identifiable information from data. The general interpretation is: if a data set has things like name, phone, and address, take those fields out. Once they are out, there are no personally identifying attribute values in the data set, and so the identity of the person is not immediately evident. Now, so far so good. The point is, we need to recognize that if we have performed de-identification, all it means is that the identity of the person is not immediately evident; it could still potentially be determined. So let's think a little more about this. Given zip code, birth date, and sex, about 87% of the U.S. population can be identified uniquely. But zip code, birth date, and sex are not normally considered personally identifying information. After all, data is routinely rolled up by zip code for any number of reports that aggregate at that level. And so you could be very careful in how you handle personally identifying information and still be revealing really sensitive things, because those three fields together pinpoint a unique individual, which leads to everything, pretty much. This set of fields was actually used in a famous Massachusetts re-identification incident, where the state's Group Insurance Commission released de-identified health records. They did so because they thought they were serving a good public purpose. Researcher Latanya Sweeney used these records to locate the health record of then-governor William Weld. In particular, she was able to look up his diagnoses and prescriptions, and she sent him a letter containing this intimate health history, just to make her point. The lesson is that you don't need fields that are obviously identifying; seemingly innocuous ones like zip code and birth date are enough.
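The attack above can be sketched as a simple join on the quasi-identifiers. Here is a minimal sketch in Python, assuming two invented toy tables, a "de-identified" medical release and a public voter roll; all names and values are hypothetical, not the actual records:

```python
# Toy linkage attack: join a de-identified release with a public,
# identified record set on shared quasi-identifiers.
# All records below are invented for illustration.

medical_release = [  # names removed, quasi-identifiers kept
    {"zip": "02138", "birth_date": "1945-07-31", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_date": "1980-01-15", "sex": "F", "diagnosis": "asthma"},
]

voter_roll = [  # public record: identified, sharing the same quasi-identifiers
    {"name": "W. Weld", "zip": "02138", "birth_date": "1945-07-31", "sex": "M"},
    {"name": "J. Smith", "zip": "02139", "birth_date": "1980-01-15", "sex": "F"},
]

def link(release, roll):
    """Join on (zip, birth_date, sex); a unique match in the public
    roll re-identifies the 'anonymous' medical record."""
    matches = []
    for med in release:
        key = (med["zip"], med["birth_date"], med["sex"])
        candidates = [v for v in roll
                      if (v["zip"], v["birth_date"], v["sex"]) == key]
        if len(candidates) == 1:  # unique match => re-identification
            matches.append((candidates[0]["name"], med["diagnosis"]))
    return matches

print(link(medical_release, voter_roll))
# -> [('W. Weld', 'hypertension'), ('J. Smith', 'asthma')]
```

The point of the sketch is that neither table contains anything conventionally treated as PII plus a secret; it is the join across data sets that does the damage.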
To consider a more difficult scenario, Netflix organized a prize competition in which it released a data set containing a user ID, the date, the movie name, and the rating given by the user for that movie. Presumably, from the movie name one could then link to a lot of information about the movie itself: the director, the actors, the release date, and so on. Netflix thought it had completely de-identified this data. Think about what somebody knows about a user from it: all one knows is a user ID, which is a made-up identifier, plus which movies the user rated and on what dates. That's it. Netflix thought it was safe releasing this data set, and it offered a million dollars to anybody who could improve its movie recommendation system by more than 10%. What happened was that many Netflix users had posted movie reviews on IMDb at around the same time as they rated those movies on Netflix. Researchers were able to link users across the two systems: the IMDb side, where users were identified and publicly discussing the movies they watched, and the Netflix side, where you could see all the movies somebody had rated. Now think about this: a user made a conscious choice to post reviews of some of the movies they watched. This was a careful, conscious choice. These were the movies they decided they had something interesting to say about and, more importantly, movies they were willing to be publicly known to have watched. Movies they didn't review on IMDb were, as far as these customers were concerned, between them and Netflix; nobody else should have known they watched anything else. Now, if you have a sexual orientation that you wish to keep hidden, maybe you post reviews of big blockbuster movies, but you won't post reviews of, say, gay-themed movies that you watched.
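The cross-system linkage can be sketched as scoring the overlap between a public review history and each released rating history. This is a simplified sketch with invented toy data; the researchers' actual attack used a weighted similarity score tolerant of fuzzy dates and ratings, whereas this version only counts exact (movie, date) matches:

```python
# Toy Netflix/IMDb linkage: find the released rating history that
# best overlaps a public review history. All records are invented.

netflix = {  # user_id -> set of (movie, date) pairs from the released data
    "u1": {("Movie A", "2005-03-01"), ("Movie B", "2005-04-10"),
           ("Movie C", "2005-05-02")},
    "u2": {("Movie A", "2005-03-05"), ("Movie D", "2005-06-20")},
}

imdb = {  # public reviewer name -> set of (movie, date) pairs they reviewed
    "alice_reviews": {("Movie A", "2005-03-01"), ("Movie B", "2005-04-10")},
}

def best_match(public_pairs, netflix_users, min_overlap=2):
    """Return the Netflix user whose history overlaps the public
    reviews the most, if the overlap is big enough to be confident."""
    scored = sorted(((len(public_pairs & pairs), uid)
                     for uid, pairs in netflix_users.items()), reverse=True)
    score, uid = scored[0]
    return uid if score >= min_overlap else None

for reviewer, pairs in imdb.items():
    uid = best_match(pairs, netflix)
    if uid:
        hidden = netflix[uid] - pairs  # watches never reviewed publicly
        print(reviewer, "is likely", uid, "; unreviewed watches:", hidden)
```

Once the match is made, the damaging output is `hidden`: the movies the user rated privately on Netflix but chose never to discuss in public, which is exactly the set the next paragraph is about.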
And this was exactly the problem with linking the data between Netflix and IMDb. Netflix was sued by a lesbian mother who had not yet come out of the closet; she claimed that Netflix had outed her by releasing this data set. The case went through more than two years of litigation and was then settled for $9 million. The consequence of this whole saga is that Netflix canceled plans for additional rounds of its Prize challenge: it ran one round, got into trouble, and dropped the additional rounds it had intended. So technological progress was slowed down, because the free sharing of data that one would have wanted couldn't take place. In the case of movie recommendations, maybe that isn't such a big deal. But if this were medical care and the fight against cancer, you can imagine how we might have a different notion of the social cost that gets paid because it's important to protect people's privacy.