Anonymizing and pseudonymizing data.
In a Big Data context, many attributes or data points are collected about people. The more attributes we collect, the more likely it becomes that someone can work out the relationship between the data and the real person. Two important ways to reduce the risk of re-identification are anonymization and pseudonymization. Both of these methods are data masking techniques. This means that the data is modified in such a way that the original identity of the person is no longer directly attributable to the data.
Let's consider first what anonymization is. The strongest way to hide data is anonymity: persons are affected by the data, but nobody knows who, and there is no way to find out. If there is no way to find out the relationship between the data and the associated persons, we would call this an anonymous relationship.
And pseudonymization? There may be an indirect association, not to the real persons but to some names which you cannot directly associate with real persons. That would be a pseudonym, as we would call it, or a persona, and in our everyday life we know a lot of personas or pseudonyms. For example, every telephone number is a pseudonym of the telephone owner. Every bank account number is a pseudonym of the bank account holder. With both examples there is a way to reveal the relationship.
So, anonymization refers to the act of removing all information that can be linked to a certain person. For example, names in a database might be completely destroyed so that the identities of the people are no longer recognizable. Pseudonymization, on the other hand, removes only the direct association with the data subject. How might this be realized? With pseudonymization, we just substitute the text. That means we don't eliminate the text or replace it with one fixed pattern; instead, we generate an artificial substitute for specific fields. With pseudonymization, it is possible to go back and re-identify the person.
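The substitution just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a production scheme: the function names `pseudonymize` and `reidentify`, the random eight-character pseudonym format and the second patient record are hypothetical and chosen for the example. The point it shows is that the name field is replaced by an artificial value, while a separate lookup table still makes it possible to go back to the real person.

```python
import secrets


def pseudonymize(records, field="name"):
    """Replace `field` in every record with a random pseudonym.

    Returns (masked_records, lookup). `lookup` maps pseudonym -> original
    value and is what makes re-identification possible, so it would have
    to be stored separately from the masked data.
    """
    forward = {}  # original value -> pseudonym (same person, same pseudonym)
    lookup = {}   # pseudonym -> original value (the identification table)
    masked = []
    for record in records:
        original = record[field]
        if original not in forward:
            pseudonym = secrets.token_hex(4).upper()
            while pseudonym in lookup:  # regenerate on the unlikely collision
                pseudonym = secrets.token_hex(4).upper()
            forward[original] = pseudonym
            lookup[pseudonym] = original
        masked_record = dict(record)  # copy, so the input is left untouched
        masked_record[field] = forward[original]
        masked.append(masked_record)
    return masked, lookup


def reidentify(masked_record, lookup, field="name"):
    """Reverse the substitution for one record using the lookup table."""
    restored = dict(masked_record)
    restored[field] = lookup[restored[field]]
    return restored


patients = [
    {"name": "Johan Meyer", "age": 48, "nationality": "German",
     "diagnosis": "heart attack"},
    # Hypothetical second record, made up for illustration only.
    {"name": "Example Person", "age": 35, "nationality": "French",
     "diagnosis": "asthma"},
]

masked, lookup = pseudonymize(patients)
restored = reidentify(masked[0], lookup)  # equals patients[0] again
```

Without `lookup`, the masked records show age, nationality and diagnosis but no name. With it, every record can be traced back, which is exactly the difference from anonymization.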
For example, suppose one data table holds age, nationality and a diagnosis under pseudonyms. We know, for example, that the person with pseudonym S729S is 48 years old, from Germany and had a heart attack. Let's assume a second table exists that matches names and pseudonyms. We can combine the information from both tables and re-identify an actual person. We find out, for example, that Johan Meyer had a heart attack.
It can also be possible to re-identify people from pseudonymized data by making links between different data sets. Let's assume that we do not have the table with real names and pseudonyms available. Instead, we get access to a pseudonymized table with information about age, income and employer. If we now combine this with the original data set including health data, it would be easy to infer that S729S is Johan Meyer, especially if you know how small a village like Houghton Bogota is.
In comparison, this re-identification is much more difficult with anonymized data. In fact, when anonymizing data, it is good practice to adhere to k-anonymity to make linking more difficult. The number k in k-anonymity is the minimum number of individuals that share the same identifying attribute values within the table and hence cannot be distinguished. In other words, a given person might correspond to any of k or more table entries.
An important thing to consider when choosing between anonymization and pseudonymization is the context within which you are working. When we look at the healthcare domain, for example hospitals, we have patient data. Then we have to understand that we cannot apply anonymization here. Why? Because when we eliminate data in this table, such as the disease or the age, we remove information that is highly relevant for the doctors. If we anonymized the patient data, the doctors could not work with it anymore.
That means here pseudonymization can be used, but not anonymization.
Also, regarding anonymization, we ask ourselves: what actually is there to anonymize? There are specific patterns; some make sense and others don't. For instance, what should be anonymized is highly confidential data such as bank account numbers, social ID numbers or personal identification numbers. Such numbers really should be anonymized. The same holds for constellations of specific attributes in a database, for example three columns like birthplace, first name and last name, that together have some kind of uniqueness. That means that I can identify someone more easily if I have these three specific keys.
So, pseudonymization should be used when you want to keep highly relevant information visible and when the possibility to re-identify the person is important, for example for work or research purposes. Highly confidential information, or clusters of information that reveal highly confidential information when combined, should rather be anonymized.
In summary, pseudonymization and anonymization can help to mask the relationship between data and a real person. Anonymization is the strongest way to do this and involves removing all information linked to a real person. Pseudonymization involves a pseudonym to mask the identity. By linking attributes or by using identification tables, persons can be re-identified from pseudonymized data. k-anonymity is a concept that addresses the risk of re-identification of anonymized data through linkage to other datasets.
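The k in k-anonymity can be measured directly on a table, which may make the definition more concrete. The Python sketch below is an illustration under assumptions rather than a complete anonymization tool: the function name `k_anonymity`, the column names and all rows are made up for the example. It uses "quasi-identifiers", a term not used above, for the attribute columns that could be linked to other data sets. It groups rows by those columns and reports the size of the smallest group.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows that share the
    same values in all quasi-identifier columns (0 for an empty table)."""
    if not rows:
        return 0
    groups = Counter(
        tuple(row[column] for column in quasi_identifiers) for row in rows
    )
    return min(groups.values())


# Hypothetical, already generalized table (ages replaced by age bands).
rows = [
    {"age_band": "40-49", "nationality": "German", "diagnosis": "heart attack"},
    {"age_band": "40-49", "nationality": "German", "diagnosis": "diabetes"},
    {"age_band": "40-49", "nationality": "German", "diagnosis": "asthma"},
    {"age_band": "30-39", "nationality": "French", "diagnosis": "flu"},
    {"age_band": "30-39", "nationality": "French", "diagnosis": "asthma"},
]

k = k_anonymity(rows, ["age_band", "nationality"])  # 2
```

Here every combination of age band and nationality occurs at least twice, so a given person might correspond to any of two or more entries. If a linking data set also exposed the diagnosis, every row would become unique and k would drop to 1.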