
Ask a Data Ethicist: What Should You Know About De-Identifying Data?

This month, we’re getting practical and tactical to answer … What should you know about de-identifying data?

Data de-identification is where technical data protection methods intersect with legal and ethical privacy obligations. This is a relevant and timely topic as organizations seek to consume, share, and monetize data but must do so in ways that are compliant.

What Is Data De-Identification?

Simply put, data de-identification is removing or obscuring details from a dataset in order to preserve privacy. We can think about de-identification as existing on a continuum.

For example, let’s say Joe Blogs is participating in a medical research study. The person conducting the research with Joe might know his full name and contact details. Yet, as data is shared with others on the research team, perhaps pieces of Joe’s identity are not shared in order to respect his privacy. Additionally, when the full study is published there might be a further level of obscuring the details so that Joe is no longer identifiable.

This might look like the following:

| Actual Information | Pseudonymization | Anonymization |
| --- | --- | --- |
| Joe Blogs | ID – 88898 | Participant A |
| 06/17/1966 | Age: 35-65 | Over 50 |
| 123 Meadow Drive, Moab, UT 84532 | See ID | REMOVED |
| 555-555-5555 | 555-###-#### | REMOVED |

As you can see from this example, the more that data is aggregated or removed, the less granular utility it retains.

Anonymization means that the data has been altered in irreversible ways. For example, removing the address and phone number for Participant A means severing that connection – deleting that data – so it is impossible to contact Participant A by mail or phone. That data is no longer available.

Pseudonymization is the application of techniques that obscure the information but allow it to be accessed when another piece of information (a key) is applied. In the above example, the identity number might unlock the full details – Joe Blogs of 123 Meadow Drive, Moab, UT. Pseudonymization retains the utility of the data while affording a certain level of privacy.
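To make the distinction concrete, here is a minimal Python sketch (the field names, ID scheme, and values are illustrative, not taken from any particular tool or standard) showing how the same record might be pseudonymized, with a separately held key table that can reverse the mapping, versus anonymized, where identifying details are removed or generalized with no way back.

```python
from datetime import date

# An illustrative source record (hypothetical field names and values).
record = {
    "name": "Joe Blogs",
    "dob": date(1966, 6, 17),
    "address": "123 Meadow Drive, Moab, UT 84532",
    "phone": "555-555-5555",
}

# Pseudonymization: swap direct identifiers for a study ID and keep the
# mapping in a separately secured key table so the link can be restored.
key_table = {"88898": record}          # held apart from the research data
pseudonymized = {
    "participant_id": "88898",
    "age_band": "35-65",               # generalized from the date of birth
    "phone": "555-###-####",           # partially masked
}

# Anonymization: identifying fields are removed or generalized and no key
# is retained, so the change is intended to be irreversible.
anonymized = {
    "participant": "Participant A",
    "age_band": "Over 50",
}

# With the key, pseudonymized data can be re-linked; anonymized data cannot.
original = key_table[pseudonymized["participant_id"]]
print(original["name"])                # Joe Blogs
```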

It should be noted that while the terms anonymize or anonymization are widely used – including in regulations – some feel it is not really possible to fully anonymize data, as there is always a non-zero chance of reidentification. Yet, taking reasonable steps on the de-identification continuum is an important part of compliance with requirements that call for the protection of personal data.

There are many articles and resources that discuss the wide variety of de-identification techniques and the merits of various approaches, ranging from simple masking techniques to more sophisticated types of encryption. The objective is to strike a balance: the technique should be complex enough to ensure sufficient protection while not being burdensome to implement and maintain. For example, the following data masking techniques are simple to implement but have been found to be less effective – in other words, these are examples of what to avoid doing:

| Technique | Example | Why it’s problematic |
| --- | --- | --- |
| Character Scrambling | BLOGS becomes LGOBS | Easy to reverse. |
| Character Masking | BLOGS becomes BLOG* | The key question is how many characters to mask; the more you mask, the harder it is to reidentify. |
| Truncation | BLOGS becomes BLOG | Similar issues to masking too few characters; the more you truncate, the harder it is to reidentify. |

Source: Adapted from Guide to the De-identification of Personal Health Data, p.165-167 
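As a rough illustration, the following Python sketch (the helper functions and sample surnames are hypothetical) implements these three weak techniques and shows why scrambling in particular is easy to undo: sorting the letters of a scrambled value matches it straight back to a list of known surnames.

```python
import random

def scramble(value: str) -> str:
    """Character scrambling: shuffle the letters of the value."""
    chars = list(value)
    random.shuffle(chars)
    return "".join(chars)

def mask(value: str, keep: int = 4) -> str:
    """Character masking: keep the first `keep` characters, mask the rest."""
    return value[:keep] + "*" * (len(value) - keep)

def truncate(value: str, keep: int = 4) -> str:
    """Truncation: drop everything after the first `keep` characters."""
    return value[:keep]

surnames = ["BLOGS", "SMITH", "JONES"]
scrambled = [scramble(s) for s in surnames]              # e.g. "LGOBS"

# Why scrambling is weak: the letters themselves are unchanged, so an
# attacker can match scrambled values back to a list of known surnames
# simply by comparing sorted characters.
lookup = {"".join(sorted(s)): s for s in surnames}
print([lookup["".join(sorted(s))] for s in scrambled])   # ['BLOGS', 'SMITH', 'JONES']

print(mask("BLOGS"))      # BLOG* -- masking only one character leaves it guessable
print(truncate("BLOGS"))  # BLOG  -- same problem when too few characters are cut
```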

Considerations to Guide Your De-Identification Choices

The first decision is whether you are anonymizing or pseudonymizing, and that choice is largely a function of whether you require the ability to access or use the original information. You’ll need to consider the “nature of the data, the purposes you collect, use or retain it for and the context of the processing.” (ICO)

By ensuring that personal identifiers are removed (anonymized), the data is no longer considered personal data. This is not the case with pseudonymization: at least under the GDPR (and Quebec’s Law 25), pseudonymized data is still considered personal data and, as such, is afforded the relevant legal protections.*

The ICO guidance on pseudonymization suggests the following high-level considerations:

  • Goals: What does your use of pseudonymisation intend to achieve?
  • Risks: What types of attack are possible, who may attempt them, and what measures do you need to implement as a result?  
  • Technique: Which technique (or set of techniques) is most appropriate?
  • Who does it? You or a processor?
  • Documenting the decisions and risk assessments

The Office of the Australian Information Commissioner also offers comprehensive guidance, the De-identification Decision Making Framework, to help structure this process.

Taking steps to pseudonymize data helps manage risk. It’s seen as an appropriate data protection safeguard for personal, sensitive or confidential data. Conversely, not taking steps to pseudonymize data might be construed as inadequate in terms of the duties surrounding data protection. There are ethical concerns with respect to a lack of duty of care, and there may also be legal issues that arise if there is a data breach. 

Measuring De-identification Risk

There’s a simple equation for measuring risk as it relates to de-identification:

 Overall risk = Data risk x Context risk

What’s more involved is the work of calculating data risk and context risk. The IPCO’s De-identification Guidelines for Structured Data provide an in-depth, step-by-step way to calculate these factors.

Data Risk: To measure data risk, calculate the probability of reidentification for each row and then apply the appropriate risk method. It sounds straightforward, but it does involve some work, particularly for larger datasets.

“For a given row, the probability of reidentification is dependent on how many other rows in the data set have the same values for variables that are quasi-identifiers. All the rows in a data set with the same values for variables that are quasi-identifiers form an “equivalence class.” For example, in a data set with variables for gender, age and highest level of education, all the rows corresponding to 35-year-old men with post-secondary degrees would form an equivalence class. The size of an equivalence class is equal to the number of rows with the same values for quasi-identifiers.” (IPCO)
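As an illustration of that passage, the Python sketch below (with made-up rows and quasi-identifier columns) groups records into equivalence classes, takes each row’s probability of reidentification as one divided by its class size, and summarizes a dataset-level data risk. Here the maximum probability is used as the summary; the appropriate measure depends on the risk method chosen, as described in the IPCO guidelines.

```python
from collections import Counter

# Illustrative rows with three quasi-identifiers (gender, age, education).
rows = [
    {"gender": "M", "age": 35, "education": "post-secondary"},
    {"gender": "M", "age": 35, "education": "post-secondary"},
    {"gender": "F", "age": 35, "education": "post-secondary"},
    {"gender": "M", "age": 52, "education": "secondary"},
]
quasi_identifiers = ("gender", "age", "education")

# Rows sharing the same values for every quasi-identifier form an
# equivalence class; count how large each class is.
class_sizes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)

# Probability of reidentification for a row = 1 / size of its equivalence class.
row_probabilities = [
    1 / class_sizes[tuple(row[q] for q in quasi_identifiers)] for row in rows
]

# One simple dataset-level summary of data risk: the worst-case row.
data_risk = max(row_probabilities)
print(row_probabilities)  # [0.5, 0.5, 1.0, 1.0]
print(data_risk)          # 1.0 -- unique rows are the easiest to reidentify
```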

From there, the numbers can be crunched to come up with a value that is then assessed against the risk method or thresholds. When it comes to the risk method, public or semi-public data is deemed higher risk (more possible access by bad actors), while non-public data is deemed less of a risk.

Context Risk: For context risk, there are similar calculations to be completed, but they boil down to putting a numeric value on the question: Who might want to access this data? Certain types of data might be deemed more valuable, and certain situations involve greater numbers of access points; both might lead to a higher context risk score.
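Putting the pieces together, a minimal sketch of the overall calculation might look like the following; the data risk, context risk score, and acceptable threshold are placeholder numbers chosen only to show the arithmetic, not values taken from any guideline.

```python
# Placeholder values only: data_risk would come from an equivalence-class
# calculation like the one above; context_risk from scoring who might try
# to reidentify the data and how easily they could reach it.
data_risk = 0.5
context_risk = 0.3

overall_risk = data_risk * context_risk

# Compare against a threshold chosen for the release model; a public
# release would typically demand a much lower threshold than a
# controlled, non-public one.
threshold = 0.09
print(overall_risk, overall_risk <= threshold)  # 0.15 False -> de-identify further
```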

There is work involved in conducting this level of risk assessment, but without doing the work, you are left making broad guesses rather than more informed choices. If you want to apply the appropriate level of pseudonymization to your data, then you will need to do the work … or you could get an off-the-shelf tool to help streamline the process.

*One justice has issued an opinion that pseudonymized data shared with a third party might be considered effectively anonymized, but much depends on the details and context.

Send Me Your Questions!

I would love to hear about your data dilemmas or AI ethics questions and quandaries. You can send me a note at [email protected] or connect with me on LinkedIn. I will keep all inquiries confidential and remove any potentially sensitive information – so please feel free to keep things high level and anonymous as well. 

This column is not legal advice. The information provided is strictly for educational purposes. AI and data regulation is an evolving area and anyone with specific questions should seek advice from a legal professional.