Getting started with data de-identification

Data submitted to ARCHIMEDES must comply with applicable privacy regulations and ethical approvals. In many cases this involves de-identifying or coding data with consent prior to submission.

The tools and resources below are provided for educational purposes only, and researchers are responsible for ensuring their data is prepared appropriately.

The De-identification Lifecycle

The de-identification process follows an iterative cycle. Each phase builds on the previous one to manage re-identification risk while preserving data utility.

  1. Secure Environment – Conduct de-identification within an approved secure system.
  2. Identify Variables – Identify direct and indirect variables that may contribute to re-identification risk.
  3. Assess Risk – Evaluate likelihood of re-identification based on context and intended use.
  4. Apply Techniques – Apply appropriate methods to reduce identified privacy risks.
  5. Evaluate Utility – Assess impact on data quality and analytical usefulness.
  6. Document & Review – Record decisions and periodically reassess as context evolves.
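
As an illustration of steps 2 and 3, the short Python sketch below counts how many records share each combination of indirect variables. The smallest group size is often called k (as in k-anonymity); a value of 1 means at least one record is unique on those variables. This metric is an assumption chosen for illustration and is not one prescribed by ARCHIMEDES. The function name, field names, and records are hypothetical.

```python
from collections import Counter


def smallest_group_size(records, indirect_variables):
    """Return the size of the smallest group of records that share the same
    combination of values for the given indirect variables (0 if no records)."""
    groups = Counter(
        tuple(record[name] for name in indirect_variables) for record in records
    )
    return min(groups.values(), default=0)


# Hypothetical toy records; the field names are illustrative only.
records = [
    {"age_band": "30-39", "postal_prefix": "M5V", "sex": "F"},
    {"age_band": "30-39", "postal_prefix": "M5V", "sex": "F"},
    {"age_band": "40-49", "postal_prefix": "K1A", "sex": "M"},
]

k = smallest_group_size(records, ["age_band", "postal_prefix", "sex"])
print(k)  # 1: the third record is unique on these three variables
```

A low k suggests that further techniques (step 4) may be needed. What counts as an acceptable value depends on context and intended use, and is not something this sketch can decide.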

De-Identification Fundamentals

See our short, high-level videos introducing the fundamentals of health data de-identification, key terminology, and common risk considerations.

Transcript

Welcome to ARCHIMEDES, the Advanced Research Collaboration for Health Integration, Medical Exploration, and Data Synthesis – a platform designed for seamless and secure medical data sharing.

Preparing data for sharing on ARCHIMEDES involves several steps to ensure privacy, security, and compliance with legal and ethical frameworks. One crucial component of this process is data de-identification. Data must be fully de-identified by the uploader before it is submitted to ARCHIMEDES.

De-identification is the process of removing or modifying personal information in data. In medical data, it protects patient privacy, minimizes the risk of breaches, and allows data to be shared for collaboration – all while staying compliant with privacy laws.

However, de-identifying data isn’t always simple. It requires balancing privacy with data usability – and compliance with a range of regulatory frameworks. De-identification must account for potential risks of re-identification, especially with advances in data analytics and machine learning. Proper de-identification is essential for fostering trust in data sharing among stakeholders while preserving the value of the data for research and clinical use.

The terms “de-identification” and “anonymization” are often used interchangeably, but terminology can vary. Both processes remove personal health information (PHI) to protect privacy. Anonymization irreversibly removes PHI, which minimizes the risk of re-identification. De-identification (sometimes also called “pseudonymization”), on the other hand, removes most PHI but may retain low-risk identifiers or use coding or encryption to preserve data utility over time. While both approaches protect privacy and support compliance with privacy regulations, de-identification often allows researchers to link data across time or datasets and so offers greater data utility, whereas anonymization eliminates this possibility for greater privacy protection.

To achieve this, a variety of techniques can be used to remove or alter sensitive information. Let’s explore some of the most commonly used de-identification methods.

First, data masking. This involves the removal or modification of direct identifiers—things like names, phone numbers, and medical record numbers. Masking is often the first and most straightforward step in the de-identification process.

Next, data perturbation. This method slightly modifies the values of sensitive data to protect identity. For example, an age or date might be adjusted by a small, random amount. While the overall dataset stays statistically meaningful, individual-level precision is blurred to protect privacy.

Finally, tokenization. This replaces identifiable data with unique codes or pseudonyms that cannot be linked back to an individual without a secure key. Tokenization is especially helpful when researchers need to track records across time or across datasets, without compromising identity.

Together, these tools form the foundation of most de-identification strategies—removing identifiers, adding uncertainty, and preserving utility where possible.

In Canada, the Personal Information Protection and Electronic Documents Act (PIPEDA) outlines legal requirements for de-identifying medical data. In the U.S., the HIPAA De-identification Standard outlines similar rules. These frameworks define how data must be treated. While PIPEDA sets out the legal requirements, the Office of the Privacy Commissioner of Canada provides guidance on how to adequately de-identify data. Some provinces also have their own regulations. Many other resources are available to learn more about de-identification regulations.

To explore resources, templates, and tools for data de-identification, visit the ARCHIMEDES platform and learn how to get started.
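
The three techniques described in the transcript (masking, perturbation, and tokenization) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an ARCHIMEDES tool. The field names, the two-year shift, and the use of a keyed hash (HMAC-SHA256) for tokens are all hypothetical choices. A real project would select methods and parameters based on its own risk assessment and approvals.

```python
import hashlib
import hmac
import random

# Hypothetical field names treated as direct identifiers in this sketch.
DIRECT_IDENTIFIERS = {"name", "phone", "mrn"}


def mask(record):
    """Masking: drop direct identifiers from the record."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def perturb_age(age, rng, max_shift=2):
    """Perturbation: shift an age by a random whole number of years
    in [-max_shift, max_shift], never going below zero."""
    return max(0, age + rng.randint(-max_shift, max_shift))


def tokenize(value, secret_key):
    """Tokenization: derive a stable pseudonym with a keyed hash, so the
    same input and key always give the same token, and the token cannot
    be recomputed without the key."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record, secret_key, rng):
    """Apply all three techniques to one record."""
    out = mask(record)
    out["age"] = perturb_age(record["age"], rng)
    out["patient_token"] = tokenize(record["mrn"], secret_key)
    return out


# Hypothetical example record and key (never hard-code a real key).
record = {
    "name": "Jane Doe",
    "phone": "555-0100",
    "mrn": "MRN-001",
    "age": 42,
    "diagnosis": "asthma",
}
key = b"demo-key-not-for-production"
rng = random.Random(0)  # seeded only so the demo is repeatable

result = deidentify(record, key, rng)
print(sorted(result))  # ['age', 'diagnosis', 'patient_token']
```

Because the token depends only on the record number and the key, the same patient receives the same token in every dataset processed with that key, which is what makes linkage across time or datasets possible. The key therefore needs to be stored securely and separately from the shared data.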

See also

  • De-identification workflows (coming soon)
  • Tutorials and workshops (coming soon)
  • Link library (coming soon)