
PHI and PII for healthcare in the world of AI

A practical guide on working with health data, safely, with multiple approaches for handling PHI
Created on October 6 | Last edited on October 17
In the digital age of healthcare, protecting patient information has become more challenging and crucial than ever. The concept of Protected Health Information (PHI) plays a central role in maintaining patient privacy and confidentiality. PHI refers to any health-related data that can identify an individual, whether in physical or electronic form. The shift toward electronic health records (EHRs) and digital communications in healthcare has given rise to Electronic Protected Health Information (ePHI), which falls under stringent regulations such as the Health Insurance Portability and Accountability Act (HIPAA).
Understanding and managing ePHI is vital for healthcare providers, insurance companies, and any entity handling patient information to ensure compliance and avoid potential breaches. The nuances between PHI and ePHI are significant, as ePHI specifically pertains to health information in digital format, encompassing everything from EHRs to patient data stored in cloud services. Ensuring the privacy and security of ePHI is critical, given the sensitive nature of the data and the severe penalties for non-compliance with HIPAA regulations.
This article covers the details of what constitutes ePHI, explores HIPAA regulations surrounding it, outlines the 18 HIPAA identifiers, and discusses methods for de-identifying health data to protect patient privacy. By understanding these core concepts, healthcare professionals and organizations can better navigate the complex landscape of patient data privacy and security in the modern world of artificial intelligence and digital healthcare.
If you'd like to skip the preamble and jump right into a project where we mask PHI, just click here.


What is Protected Health Information?

Protected Health Information is any health-related information that can identify an individual and is created, used, or disclosed in the course of providing healthcare services. This includes medical records, billing information, and any data that ties health information to a specific person.
PHI is regulated under the Health Insurance Portability and Accountability Act in the United States, which establishes strict guidelines for how this information can be accessed, shared, and protected to ensure patient privacy and security.
PHI differs from general Personally Identifiable Information (PII) because it specifically concerns health-related data used in the context of medical care. While PII covers a broad range of identifiable information like phone numbers, email addresses, and social security numbers, PHI is unique to healthcare and includes any data that connects health information to an individual. For example, while a name or date of birth alone is considered PII, if either is linked to medical conditions or healthcare services, it becomes PHI under HIPAA regulations.
In healthcare, PHI is vital for patient care, billing, insurance claims, and healthcare operations. It includes information that reveals an individual’s health status, medical treatments, and financial transactions related to medical services. Examples of PHI include patient medical records, diagnoses, treatment plans, appointment schedules, prescription information, and billing statements. Because this information is sensitive and can potentially expose individuals to identity theft or discrimination if mishandled, HIPAA enforces stringent measures to protect it.
The significance of PHI extends beyond clinical care, as it plays a critical role in maintaining patient trust and ensuring compliance with legal standards. Healthcare providers, insurance companies, and related entities must follow HIPAA's Privacy Rule and Security Rule, which outline the necessary safeguards to protect the confidentiality, integrity, and availability of PHI. The regulations also dictate how PHI can be used or disclosed, primarily limiting it to purposes related to treatment, payment, or healthcare operations, unless the patient provides explicit authorization for other uses.
While the focus of this discussion is on U.S. regulations like HIPAA, it's important to note that other jurisdictions, such as the European Union, have their own data privacy regulations like the General Data Protection Regulation (GDPR), which also covers health-related information. Future sections of this document will provide tutorials on data anonymization, de-identification, and privacy techniques that are universally applicable for protecting sensitive data, regardless of regulatory environment.

Electronic Protected Health Information (ePHI) and HIPAA Regulations

Electronic Protected Health Information refers to any Protected Health Information that is created, stored, transmitted, or received in electronic form. It encompasses a broad range of data formats such as electronic medical records, emails containing health information, or billing details stored in a digital system. The main difference between PHI and ePHI is the medium: PHI can exist in any form (e.g., paper, oral, or electronic), whereas ePHI is specifically limited to electronic formats. This distinction is important because electronic data poses unique risks and requires specific protections due to the potential for cyber threats and unauthorized access.
HIPAA’s regulations on ePHI are detailed in the HIPAA Security Rule, which establishes a set of standards to ensure the confidentiality, integrity, and security of electronic health data. The Security Rule mandates that healthcare providers, health plans, and business associates implement administrative, physical, and technical safeguards to protect ePHI. Administrative safeguards include policies and procedures to manage the selection, development, and use of security measures. Physical safeguards involve controlling physical access to electronic systems and facilities. Technical safeguards include the use of encryption, access controls, and audit trails to protect data during transmission and storage. Additionally, the HIPAA Privacy Rule governs how ePHI can be used and disclosed, focusing on patient consent and limiting disclosures to purposes of treatment, payment, and healthcare operations unless otherwise authorized.
Compliance with these regulations is crucial, as failure to protect ePHI can result in significant legal and financial penalties, as well as reputational damage. HIPAA violations can lead to fines ranging from $100 to $50,000 per violation, depending on the level of negligence and the number of affected individuals. Organizations must regularly review and update their security policies, conduct risk assessments, and train staff on the proper handling of ePHI to remain compliant and mitigate potential breaches.

The 18 HIPAA Identifiers

Protected Health Information is defined by its association with certain identifiers that can be used to trace an individual’s identity. HIPAA outlines 18 specific identifiers that must be removed or anonymized to de-identify health data.
These identifiers include the following:
1. Names
2. Geographic subdivisions smaller than a state (e.g., street address, city, county, ZIP code)
3. Dates (other than year) directly related to an individual (e.g., birthdate, admission and discharge dates)
4. Telephone numbers
5. Fax numbers
6. Email addresses
7. Social Security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate or license numbers
12. Vehicle identifiers and serial numbers (including license plate numbers)
13. Device identifiers and serial numbers
14. Web URLs
15. IP addresses
16. Biometric identifiers (e.g., fingerprints, voiceprints)
17. Full-face photographs and other comparable images
18. Any unique identifying number, characteristic, or code
These identifiers are considered sensitive because they can be used alone or in combination with other information to identify a specific individual. For example, even if a medical record number is anonymized but linked with a specific date of service or a particular diagnosis, it can still pose a risk of re-identification. Mishandling these identifiers can lead to unauthorized disclosure of PHI, resulting in privacy breaches, identity theft, and violation of HIPAA regulations. Therefore, organizations handling PHI and ePHI must ensure these identifiers are appropriately protected or removed when sharing or analyzing health data.
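To make this concrete, here is a minimal sketch of masking a few of these identifiers, such as phone numbers, email addresses, and Social Security numbers, in free text using regular expressions. The patterns and sample text are illustrative only, not production-grade; identifiers like names or medical record numbers generally require dictionaries or NER models rather than regexes.

import re

# Illustrative patterns for a few of the 18 identifiers; real PHI scrubbing
# needs much broader coverage (names, dates, MRNs, ...) and human review.
PATTERNS = {
    'EMAIL': re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+'),
    'SSN':   re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
    'PHONE': re.compile(r'\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b'),
}

def mask_identifiers(text):
    # Replace each matched identifier with a bracketed tag
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f'[{label}]', text)
    return text

note = "Reached patient at 555-867-5309 or jane.roe@example.com, SSN 123-45-6789."
print(mask_identifiers(note))
# Reached patient at [PHONE] or [EMAIL], SSN [SSN].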
Organizations handling health data must adhere to HIPAA regulations to protect patient privacy, particularly when dealing with Protected Health Information. Under HIPAA, organizations have two primary methods for de-identifying health data to ensure that it no longer qualifies as PHI and can be shared more freely: the Safe Harbor method and the Expert Determination method.

The Safe Harbor method

The Safe Harbor method involves the removal of 18 specific identifiers, such as names, dates of birth (except the year), geographic subdivisions smaller than a state, and other direct identifiers. Once all these identifiers are removed, the data is considered de-identified, and no additional steps are necessary. However, while the Safe Harbor method is straightforward, it often results in a significant loss of data utility. Removing geographic information or detailed dates, for example, can hinder the ability to conduct certain types of analysis, such as studying disease outbreaks or tracking health trends over time.
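As a rough sketch of what Safe Harbor-style processing looks like on tabular data (the column names here are hypothetical, and this is not a complete implementation of the rule), the direct identifiers are dropped, dates are reduced to years, and ZIP codes are truncated:

import pandas as pd

# Hypothetical patient records containing a few of the 18 identifiers
df = pd.DataFrame({
    'Name': ['Jane Roe', 'John Doe'],
    'Phone': ['555-0100', '555-0101'],
    'Birth_date': ['1950-03-12', '1988-11-02'],
    'ZIP_code': ['02138', '94110'],
    'Diagnosis': ['Asthma', 'Flu'],
})

# Drop direct identifiers outright
deidentified = df.drop(columns=['Name', 'Phone'])

# Keep only the year of birth
deidentified['Birth_year'] = pd.to_datetime(deidentified['Birth_date']).dt.year
deidentified = deidentified.drop(columns=['Birth_date'])

# Truncate ZIP codes to the first three digits (Safe Harbor additionally
# requires suppressing even these digits for sparsely populated areas)
deidentified['ZIP3'] = deidentified['ZIP_code'].str[:3]
deidentified = deidentified.drop(columns=['ZIP_code'])

print(deidentified)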

The Expert Determination method

The Expert Determination method, on the other hand, offers more flexibility. This method involves having a qualified expert apply statistical or scientific principles to assess the risk of re-identification and certify that the data is de-identified. The expert analyzes the data and uses statistical methods to minimize the likelihood of re-identification. Because this method doesn’t require the removal of all 18 identifiers, the Expert Determination method allows for greater data utility while still protecting patient privacy. The trade-off is that it’s more complex and costly to implement, since it requires the involvement of a qualified expert to evaluate the risk and certify the de-identification.

The difference between encryption and de-identification

Encryption and de-identification are distinct methods for protecting sensitive health data. Encryption works by converting readable data into an unreadable format using a cryptographic key, ensuring that only authorized parties can access the information. This protects data during storage and transmission but does not change the fact that the data remains classified as PHI because it can be decrypted back to its original form.
Encryption is particularly useful for securing data that needs to be maintained in a complete form for analysis or internal use. When storing private health data, it's highly recommended to implement encryption, especially when transmitting data over networks or storing it in the cloud. This ensures that even if unauthorized individuals gain access to the storage medium or intercept the data during transmission, they cannot interpret or use it without the decryption key. However, while encryption protects the confidentiality of the data, it does not alter the structure or content of the data itself, which means it still carries identifiable information. As a result, encrypted data is still considered sensitive and must be handled according to privacy regulations like HIPAA.
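As a minimal sketch of what this looks like in practice, using the third-party cryptography package and its Fernet symmetric scheme (key management and storage are out of scope here), encrypting a single record might look like this:

from cryptography.fernet import Fernet  # pip install cryptography

# In practice the key would come from a key-management service,
# never stored alongside the data it protects.
key = Fernet.generate_key()
fernet = Fernet(key)

record = b"MRN 839021 | Jane Roe | Dx: Type 2 Diabetes"

token = fernet.encrypt(record)    # ciphertext, safe to store or transmit
original = fernet.decrypt(token)  # recoverable only with the key

print(original == record)  # True: the data is fully recoverable, so it is still PHI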
De-identification, on the other hand, goes further by removing or transforming identifiable elements so that the data can no longer be used to trace back to a specific individual. When done correctly, de-identified data is no longer considered PHI under HIPAA, allowing it to be shared more freely for research or public health purposes without the need for additional privacy safeguards. Unlike encryption, de-identification involves altering or suppressing data elements to prevent re-identification, which can sometimes reduce the data’s analytical value.

Why use encryption or de-identification?

Encryption is typically used when data must remain fully intact for authorized users, such as when storing health records in databases or transmitting sensitive information between healthcare providers. It ensures that even if unauthorized individuals gain access to the data, they cannot read or interpret it without the decryption key. De-identification is more suitable when data needs to be shared with external entities, used for research, or published for broader use. Since de-identified data no longer contains personal identifiers, it poses less risk if shared or disclosed, making it ideal for non-treatment-related purposes.

Why not just use the Safe Harbor method?

The Safe Harbor method provides a straightforward checklist for de-identification by requiring the removal of 18 specific identifiers. While easy to implement, it often results in a loss of data utility because it removes critical information necessary for certain types of analyses. For example, removing geographic information smaller than a state makes it difficult to perform regional health studies, and removing specific dates hinders the ability to conduct time-series analyses. Additionally, the Safe Harbor method does not account for the possibility of re-identification attacks, where attackers cross-reference anonymized datasets with other publicly available information. This limitation makes it unsuitable for complex datasets or cases where data utility needs to be preserved.

Understanding how attackers operate: Data cross-referencing example

One of the main challenges with de-identification is preventing attackers from re-identifying individuals using cross-referencing techniques. A landmark example in the field of data privacy occurred in Massachusetts when the state released anonymized health records of state employees. In a groundbreaking study, researcher Latanya Sweeney demonstrated how vulnerable traditional de-identification methods can be by re-identifying individuals using publicly available voter registration data. Through this process, she used quasi-identifiers such as ZIP code, birthdate, and gender to link health records back to specific individuals—including the Governor of Massachusetts. By narrowing down the dataset based on the Governor's known ZIP code, birthdate, and gender, she was able to pinpoint his medical records and expose his sensitive health information.
This study was pivotal because it showed how seemingly "anonymized" data could still lead to re-identification when cross-referenced with external datasets. Sweeney’s research became a cornerstone in data privacy discussions and influenced the development of more advanced privacy-preserving techniques like k-anonymity, l-diversity, and t-closeness. The findings emphasized the critical need for comprehensive de-identification strategies that consider the broader data ecosystem, setting new standards for how privacy should be approached in the digital age. This landmark paper underscored the complexity of true anonymity and changed the landscape of data privacy by demonstrating the ease with which personal information could be uncovered, even from anonymized datasets.
How Latanya Sweeney was able to cross-reference publicly available data [1]

Explanation of K-Anonymity, L-Diversity, and T-Closeness

K-anonymity, L-diversity, and T-closeness are advanced privacy models that address the limitations of simpler de-identification techniques by providing structured ways to protect against re-identification and attribute disclosure risks. Each model measures privacy from a different perspective, ensuring that sensitive information in a dataset is sufficiently protected while maintaining analytical value.

K-Anonymity

K-anonymity ensures that each individual in a dataset cannot be distinguished from at least k-1 other individuals based on shared quasi-identifiers, such as age, gender, or ZIP code. If a dataset satisfies a k-anonymity value of 5, for example, it means that for every combination of quasi-identifiers, there are at least five individuals with those same attributes.
This guarantees that each individual is hidden among at least four others. However, k-anonymity alone may still leave a dataset vulnerable to attribute disclosure. For instance, if a k-anonymous group consists of five individuals and all of them share the same diagnosis, an attacker can easily infer that diagnosis even if the individuals cannot be uniquely identified. This is known as a homogeneity attack, where uniformity within the group compromises privacy by revealing sensitive information about all members.
For example, consider a dataset of patients with attributes such as gender, ZIP code, and age. If k-anonymity is set to 5, then for every unique combination of gender, ZIP code, and age, there must be at least five individuals sharing that combination. If only one person in the dataset is a 45-year-old male from ZIP code 12345, achieving k-anonymity would require either generalizing the age to a broader range, such as 40-49, or reducing the precision of the ZIP code to include only the first three digits, such as 123xx. This way, the unique 45-year-old male can no longer be distinguished from at least four others, thereby ensuring that he cannot be easily re-identified.
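Here is a minimal sketch of this idea on toy data with hypothetical column names: k-anonymity is measured as the size of the smallest group sharing identical quasi-identifiers, before and after generalization.

import pandas as pd

# Toy records keyed by the quasi-identifiers discussed above
df = pd.DataFrame({
    'Gender':   ['Male', 'Male', 'Male', 'Female', 'Female'],
    'ZIP_code': ['12345', '12346', '12347', '12345', '12399'],
    'Age':      [45, 47, 43, 62, 68],
})

def k_anonymity(data, quasi_identifiers):
    # k is the size of the smallest group sharing identical quasi-identifiers
    return data.groupby(quasi_identifiers).size().min()

quasi = ['Gender', 'ZIP_code', 'Age']
print(k_anonymity(df, quasi))  # 1: every record is unique

# Generalize: ages into ten-year bands, ZIP codes to their first three digits
df['Age'] = (df['Age'] // 10 * 10).astype(str) + '-' + (df['Age'] // 10 * 10 + 9).astype(str)
df['ZIP_code'] = df['ZIP_code'].str[:3] + 'xx'
print(k_anonymity(df, quasi))  # 2: each record now shares its quasi-identifiers with at least one other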

L-Diversity

L-diversity enhances k-anonymity by ensuring that there is a sufficient variety of sensitive attribute values within each group. Even if a group meets the k-anonymity criteria, l-diversity guarantees that an attacker cannot infer the sensitive attribute too easily because there are multiple possibilities for that attribute within each group.
If the sensitive attribute is a medical condition and the dataset satisfies a 3-diversity standard, each group must contain at least three distinct diagnoses. This would prevent the attacker from confidently guessing the medical condition of any individual based on group membership. However, l-diversity can still be inadequate when the distribution of sensitive attributes is skewed. For example, if one diagnosis is significantly more common than others within a group, an attacker could still make strong inferences about an individual's condition. This situation is known as a skewness attack, where certain values dominate within a group, undermining the diversity intended to protect privacy.
For example, let’s say we have a k-anonymous group where all five individuals have the same diagnosis, such as “Diabetes.” While the group is protected from identity disclosure, knowing that an individual belongs to this group immediately reveals their diagnosis, defeating the purpose of de-identification. To achieve l-diversity, the group must include additional distinct diagnoses, such as “Flu” and “Asthma,” to prevent attackers from inferring sensitive information based solely on group membership. By ensuring multiple sensitive attribute values, l-diversity reduces the risk of attribute disclosure.
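A minimal sketch of the same idea on toy data (hypothetical column names), measuring l-diversity as the smallest number of distinct sensitive values within any quasi-identifier group:

import pandas as pd

# One k-anonymous group (k = 5) of generalized records
df = pd.DataFrame({
    'Gender':    ['Suppressed'] * 5,
    'ZIP_code':  ['123xx'] * 5,
    'Age_range': ['40-49'] * 5,
    'Diagnosis': ['Diabetes'] * 5,
})

def l_diversity(data, quasi_identifiers, sensitive):
    # l is the smallest number of distinct sensitive values within any group
    return data.groupby(quasi_identifiers)[sensitive].nunique().min()

quasi = ['Gender', 'ZIP_code', 'Age_range']
print(l_diversity(df, quasi, 'Diagnosis'))  # 1: group membership alone reveals the diagnosis

# Adding other diagnoses to the group raises its diversity
df.loc[3, 'Diagnosis'] = 'Flu'
df.loc[4, 'Diagnosis'] = 'Asthma'
print(l_diversity(df, quasi, 'Diagnosis'))  # 3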

T-Closeness

T-closeness addresses the limitations of both k-anonymity and l-diversity by ensuring that the distribution of sensitive attributes within each group is similar to the distribution of those attributes in the overall dataset. This reduces the risk of an attacker learning specific information about an individual based on their group membership.
For instance, if the overall dataset has a 10% probability of "Cancer" and a 30% probability of "Flu," t-closeness would require each group’s distribution of these diagnoses to be as close as possible to these global probabilities. This minimizes the risk of an attacker gaining too much information about a group member’s sensitive attribute simply by knowing which group they belong to. In contrast to l-diversity, which only ensures that there are multiple sensitive values within a group, t-closeness ensures that the values are proportionally represented, thereby providing a more robust protection against attribute inference.
For example, suppose a dataset satisfies both k-anonymity and l-diversity, but the distribution of sensitive attributes within groups is uneven. If the overall dataset has a 15% probability of “Cancer” and an 85% probability of “Flu,” while a particular group has a 90% probability of “Cancer” and only 10% probability of “Flu,” the group still reveals too much information about the prevalence of “Cancer.” This uneven distribution discloses too much information about the individuals in the group, making them vulnerable to attribute disclosure even if the group is diverse. T-closeness would require that the distribution within each group be close to the overall dataset distribution, such as having “Cancer” at 15% and “Flu” at 85%, thus protecting against attribute disclosure by ensuring that no group disproportionately reveals a specific attribute’s prevalence.
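A minimal sketch on toy data, using total variation distance as one common way to quantify how far each group's distribution of the sensitive attribute drifts from the overall distribution; the worst group defines t:

import pandas as pd

# Two generalized groups with a sensitive Diagnosis attribute
df = pd.DataFrame({
    'ZIP_code':  ['123xx'] * 4 + ['456xx'] * 4,
    'Diagnosis': ['Cancer', 'Flu', 'Flu', 'Flu',         # group 1: 25% Cancer
                  'Cancer', 'Cancer', 'Cancer', 'Flu'],  # group 2: 75% Cancer
})

def t_closeness(data, quasi_identifiers, sensitive):
    # t is the largest total variation distance between any group's
    # distribution of the sensitive attribute and the overall distribution
    overall = data[sensitive].value_counts(normalize=True)
    worst = 0.0
    for _, group in data.groupby(quasi_identifiers):
        group_dist = group[sensitive].value_counts(normalize=True)
        distance = overall.sub(group_dist, fill_value=0).abs().sum() / 2
        worst = max(worst, distance)
    return worst

print(t_closeness(df, ['ZIP_code'], 'Diagnosis'))  # 0.25: both groups sit 25 points off the 50/50 overall split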

Choosing the right privacy metrics

While k-anonymity, l-diversity, and t-closeness offer increasing levels of protection, they each come with trade-offs between privacy and data utility. Achieving high k-anonymity can require extensive generalization, which may reduce the usefulness of the data for certain types of analysis. L-diversity provides better protection for sensitive attributes, but skewed distributions can still lead to vulnerabilities. T-closeness offers the most robust protection by maintaining similar attribute distributions, but it can significantly affect the utility of the data, making it less suitable for fine-grained analysis. Consequently, selecting the appropriate privacy model depends on the specific requirements and sensitivity of the dataset, as well as the type of analysis that will be performed.
Choosing target values for k-anonymity, l-diversity, and t-closeness is more of an art than a science, as there are no specific guidelines or universally accepted thresholds that guarantee optimal privacy protection. The right target values depend on the nature of the dataset, the sensitivity of the attributes, and the acceptable level of data utility loss.
In practice, k-anonymity values of 5 or 10 are commonly used, while l-diversity values often range from 2 to 5, and t-closeness is typically set between 0.1 and 0.3. Higher k or l values increase privacy but can reduce data utility, and lower t values provide stronger protection but might significantly distort the dataset. Ultimately, setting target values involves balancing privacy risks against the analytical needs of the dataset. It is important to conduct iterative testing and evaluation of different transformations, considering the specific context of the dataset and the goals of the analysis, to find the best compromise between privacy and utility.

A practical tutorial

For our tutorial, I generated a synthetic dataset that mimics real-world health data and voter information. The initial dataset contained attributes such as ZIP code, birthdate, gender, and diagnosis, which are often used to identify individuals.
I chose to use synthetic data instead of real-world information because publicly available datasets have already undergone de-identification processes, making it challenging to demonstrate re-identification risks.
💡
Using synthetic data enables us to illustrate the vulnerabilities and attack vectors that could be used on real-world datasets before de-identification is applied. This also ensures that we can highlight privacy risks and demonstrate de-identification techniques without compromising any actual personal information.
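As a rough illustration of how such data might be generated (the value ranges and the governor's diagnosis below are hypothetical; the actual generator script lives in the repo mentioned next), a minimal sketch looks like this:

import random
import pandas as pd

random.seed(0)

zip_codes = ['02138', '02139', '01040', '01060']
diagnoses = ['Flu', 'Asthma', 'Diabetes', 'Hypertension', 'Cancer']

def random_birth_date():
    return f"{random.randint(1940, 2000)}-{random.randint(1, 12):02d}-{random.randint(1, 28):02d}"

records = [{
    'ZIP_code': random.choice(zip_codes),
    'Birth_date': random_birth_date(),
    'Gender': random.choice(['Male', 'Female']),
    'Diagnosis': random.choice(diagnoses),
    'Total_charge': random.randint(500, 10000),
} for _ in range(999)]

# Add one well-known, easily cross-referenced individual (the "governor");
# the diagnosis here is purely illustrative
records.append({'ZIP_code': '02138', 'Birth_date': '1945-07-31',
                'Gender': 'Male', 'Diagnosis': 'Hypertension',
                'Total_charge': 4200})

pd.DataFrame(records).to_csv('synthetic_medical_data_with_governor.csv', index=False)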
If you're interested in the dataset, I shared the script that generates it in the project’s GitHub repo, so feel free to check it out. Additionally, I have uploaded the two CSV files to a W&B Table here:


The initial medical dataset had a k-anonymity and l-diversity value of just 1, indicating that every individual record was unique and that there was no diversity in the sensitive attributes within groups. This meant that any individual in the dataset could be cross-referenced with another dataset and pinpointed with certainty, making the data highly susceptible to re-identification attacks.
The dataset also included a synthetic record of a well-known figure—a governor—who could easily be cross-referenced between the voter and medical datasets using these attributes. This cross-referencing highlighted how attackers could use quasi-identifiers to re-identify individuals, underscoring the importance of robust de-identification techniques.

Cross referencing our own data

To demonstrate why this data is not safe, I wrote a script that cross-references the records based on shared quasi-identifiers. By matching attributes like ZIP code, birthdate, and gender, the script successfully identified the governor in both datasets, revealing sensitive medical information.
This exercise clearly shows how vulnerable such data can be to re-identification attacks, even when no direct identifiers like names or Social Security numbers are present. Here's the script that cross-references the governor between the voter dataset and the medical dataset:
import pandas as pd

# Load both the voter and medical datasets, specifying data types
dtype_spec = {'ZIP_code': str, 'Birth_date': str, 'Gender': str}

voter_data = pd.read_csv('synthetic_voter_data_with_governor.csv', dtype=dtype_spec)
medical_data = pd.read_csv('synthetic_medical_data_with_governor.csv', dtype=dtype_spec)

# Define the cross-reference points based on the "Governor-like" entry
governor_zip = '02138'
governor_birth_date = '1945-07-31'
governor_gender = 'Male'

# Perform the cross-reference on the voter and medical data
matched_voter = voter_data[(voter_data['ZIP_code'] == governor_zip) &
                           (voter_data['Birth_date'] == governor_birth_date) &
                           (voter_data['Gender'] == governor_gender)]

matched_medical = medical_data[(medical_data['ZIP_code'] == governor_zip) &
                               (medical_data['Birth_date'] == governor_birth_date) &
                               (medical_data['Gender'] == governor_gender)]

# Check that exactly one record is found in both datasets
if len(matched_voter) == 1 and len(matched_medical) == 1:
    print("\nCross-reference successful: This individual appears in both datasets!")
    print("Matched Voter Record:")
    print(matched_voter)
    print("\nMatched Medical Record:")
    print(matched_medical)
else:
    print("\nCross-reference failed: No unique match found in both datasets.")
    if len(matched_voter) > 1:
        print(f"Warning: Multiple voter records found ({len(matched_voter)})")
    if len(matched_medical) > 1:
        print(f"Warning: Multiple medical records found ({len(matched_medical)})")
After running the script, we see that we were able to cross-reference the governor and see his specific medical condition:


De-identifying our data

To tackle these vulnerabilities, I developed a script that applies various de-identification transformations to the dataset, such as generalizing birthdates into age ranges, truncating ZIP codes to broader regions, suppressing gender information, and binning financial attributes. By systematically testing different combinations of these transformations, we can evaluate their impact on key privacy metrics like k-anonymity, l-diversity, and t-closeness.
For instance, if we aim for a k-anonymity value of at least 5—meaning each individual cannot be distinguished from at least four others—we can iteratively apply these transformations until this privacy threshold is met. Likewise, evaluating l-diversity and t-closeness ensures that sensitive attributes like diagnoses are adequately protected, and their distribution in each group closely mirrors the overall dataset, reducing the risk of attribute disclosure.
Here's the script that applies several different transformations and measures the privacy metrics described above. It displays the results for each combination of transformations and then asks the user which combination they would like to use. We also log our privacy metrics using W&B Tables, for easy analysis of the results:
import pandas as pd
import numpy as np
from itertools import combinations
import wandb

# Initialize W&B project
wandb.init(project="data_deidentification", name="transformation_metrics_logging")

# Load the medical dataset, specifying data types
dtype_spec = {
    'ZIP_code': str,
    'Birth_date': str,
    'Gender': str,
    'Total_charge': str,  # Ensure that 'Total_charge' is read as a string initially
    'Diagnosis': str
}

# Load medical data
medical_data_original = pd.read_csv('synthetic_medical_data_with_governor.csv', dtype=dtype_spec)
medical_data = medical_data_original.copy()

# Define the possible transformations
def generalize_birth_to_year(data):
    data['Birth_date'] = pd.to_datetime(data['Birth_date'], errors='coerce').dt.year.astype(str)
    return data

def generalize_birth_to_age_range(data, bin_size=10):
    data['Birth_date'] = pd.to_datetime(data['Birth_date'], errors='coerce')
    current_year = pd.Timestamp.now().year
    data['Birth_date'] = current_year - data['Birth_date'].dt.year  # Calculate age
    data['Birth_date'] = data['Birth_date'].fillna(0).astype(int)
    data['Birth_date'] = ((data['Birth_date'] // bin_size) * bin_size).astype(str) + '-' + \
                         ((data['Birth_date'] // bin_size) * bin_size + bin_size - 1).astype(str)
    return data

def generalize_zip_code_prefix(data, prefix_length=3):
    data['ZIP_code'] = data['ZIP_code'].astype(str)
    data['ZIP_code'] = data['ZIP_code'].str[:prefix_length]
    return data

def suppress_gender(data):
    data['Gender'] = 'Suppressed'
    return data

def bin_total_charge(data, bin_size=1000):
    if 'Total_charge' in data.columns:
        data['Total_charge'] = pd.to_numeric(data['Total_charge'], errors='coerce').fillna(0).astype(int)
        data['Total_charge'] = (data['Total_charge'] // bin_size) * bin_size
        data['Total_charge'] = data['Total_charge'].astype(str) + '-' + \
                               (data['Total_charge'] + bin_size - 1).astype(str)
    return data

def top_bottom_code_total_charge(data, top_threshold=9000, bottom_threshold=1000):
    if 'Total_charge' in data.columns:
        data['Total_charge'] = pd.to_numeric(data['Total_charge'], errors='coerce').fillna(0).astype(int)
        data['Total_charge'] = data['Total_charge'].clip(lower=bottom_threshold, upper=top_threshold)
    return data

def redact_rare_diagnoses(data, threshold=5):
    if 'Diagnosis' in data.columns:
        diagnosis_counts = data['Diagnosis'].value_counts()
        rare_diagnoses = diagnosis_counts[diagnosis_counts < threshold].index
        data.loc[data['Diagnosis'].isin(rare_diagnoses), 'Diagnosis'] = 'Other'
    return data

# List of transformation functions
transformations = [
    ('Generalize Birth Date to Year', generalize_birth_to_year),
    ('Generalize Birth Date to Age Range', generalize_birth_to_age_range),
    ('Generalize ZIP Code to Prefix', generalize_zip_code_prefix),
    ('Suppress Gender', suppress_gender),
    ('Bin Total Charge', bin_total_charge),
    ('Top/Bottom Code Total Charge', top_bottom_code_total_charge),
    ('Redact Rare Diagnoses', redact_rare_diagnoses)
]

# Functions to calculate privacy metrics
def calculate_k_anonymity(data, quasi_identifiers):
    grouped = data.groupby(quasi_identifiers)
    group_sizes = grouped.size()
    return group_sizes.min()

def calculate_l_diversity(data, quasi_identifiers, sensitive_attribute):
    grouped = data.groupby(quasi_identifiers)
    return grouped[sensitive_attribute].nunique().min()

def calculate_t_closeness(data, quasi_identifiers, sensitive_attribute):
    overall_distribution = data[sensitive_attribute].value_counts(normalize=True)
    grouped = data.groupby(quasi_identifiers)
    max_t = 0
    for _, group in grouped:
        group_distribution = group[sensitive_attribute].value_counts(normalize=True)
        t_distance = abs(overall_distribution - group_distribution).sum() / 2
        max_t = max(max_t, t_distance)
    return max_t

# Function to apply transformations and calculate privacy metrics
def apply_transformations_and_evaluate(transform_list, target_k=None, target_l=None, target_t=None, min_thresholds=False):
    data = medical_data.copy()
    description = ''
    for name, func in transform_list:
        data = func(data)
        description += f'{name}, '

    if 'Diagnosis' in data.columns:
        data['Diagnosis'] = data['Diagnosis'].fillna('Unknown')

    quasi_identifiers = [col for col in ['ZIP_code', 'Birth_date', 'Gender'] if col in data.columns]

    k_anonymity = calculate_k_anonymity(data, quasi_identifiers)
    l_diversity = calculate_l_diversity(data, quasi_identifiers, 'Diagnosis')
    t_closeness = calculate_t_closeness(data, quasi_identifiers, 'Diagnosis')

    k_error = abs(target_k - k_anonymity) if target_k is not None else 0
    l_error = abs(target_l - l_diversity) if target_l is not None else 0
    t_error = abs(target_t - t_closeness) if target_t is not None else 0

    if min_thresholds:
        if target_k is not None and k_anonymity < target_k:
            return None, None
        if target_l is not None and l_diversity < target_l:
            return None, None
        if target_t is not None and t_closeness > target_t:
            return None, None

    total_error = k_error + l_error + t_error

    return data, {
        'Transformations Applied': description.strip(', '),
        'Quasi-Identifiers': quasi_identifiers,
        'K-Anonymity': k_anonymity,
        'L-Diversity': l_diversity,
        'T-Closeness': t_closeness,
        'Total Error': total_error
    }

# Apply transformations and evaluate
results = []
for transformation in transformations:
    _, result = apply_transformations_and_evaluate([transformation], target_k=5, target_l=3, target_t=0.15)
    if result:
        results.append(result)

# Apply combinations of transformations
for r in range(2, len(transformations) + 1):
    comb_list = list(combinations(transformations, r))
    for comb in comb_list:
        _, result = apply_transformations_and_evaluate(comb, target_k=5, target_l=3, target_t=0.15)
        if result:
            results.append(result)

# Sort and display the results
results = sorted(results, key=lambda x: x['Total Error'])
print("\nTransformations sorted by closeness to desired k-anonymity, l-diversity, and t-closeness values:")
for result in results:
    print(f"Transformations Applied: {result['Transformations Applied']}")
    print(f"Quasi-Identifiers: {result['Quasi-Identifiers']}")
    print(f"K-Anonymity: {result['K-Anonymity']}, L-Diversity: {result['L-Diversity']}, T-Closeness: {result['T-Closeness']:.4f}")
    print(f"Total Error: {result['Total Error']}\n")

# Create a W&B Table to log the results
table = wandb.Table(columns=["Transformations Applied", "Quasi-Identifiers", "K-Anonymity", "L-Diversity", "T-Closeness", "Total Error"])

# Populate the W&B Table with transformation results
for result in results:
    table.add_data(
        result['Transformations Applied'],
        str(result['Quasi-Identifiers']),
        result['K-Anonymity'],
        result['L-Diversity'],
        result['T-Closeness'],
        result['Total Error']
    )

# Log the table to W&B
wandb.log({"Transformation Metrics": table})

# Prompt user to select transformations
print("Available transformations:")
for idx, (name, _) in enumerate(transformations, start=1):
    print(f"{idx}. {name}")

selected_indices = input("Select transformations to apply by entering their numbers separated by commas (e.g., 1,3,5): ")
selected_indices = [int(i.strip()) for i in selected_indices.split(',') if i.strip().isdigit()]

# Get the list of selected transformations
selected_transformations = [transformations[i-1] for i in selected_indices]

# Apply the selected transformations to the data and get modified data with metrics
modified_data, modified_data_metrics = apply_transformations_and_evaluate(selected_transformations, target_k=5, target_l=3, target_t=0.15)

# Save the modified data to a new file
modified_data.to_csv('modified_data.csv', index=False)
print("Modified data saved to 'modified_data.csv'.")

# Re-read the saved modified data for verification
modified_data_reloaded = pd.read_csv('modified_data.csv')

# Calculate and print privacy metrics for reloaded modified data
k_anonymity_value = calculate_k_anonymity(modified_data_reloaded, modified_data_metrics['Quasi-Identifiers'])
l_diversity_value = calculate_l_diversity(modified_data_reloaded, modified_data_metrics['Quasi-Identifiers'], 'Diagnosis')
t_closeness_value = calculate_t_closeness(modified_data_reloaded, modified_data_metrics['Quasi-Identifiers'], 'Diagnosis')

# Log the final modified data and its metrics to W&B
modified_data_reloaded = pd.read_csv('modified_data.csv', dtype={'ZIP_code': str})

# Ensure that ZIP_code is explicitly converted to a string and handle any non-string values or NaNs
modified_data_reloaded['ZIP_code'] = modified_data_reloaded['ZIP_code'].astype(str).fillna('').str.zfill(3)
final_table = wandb.Table(dataframe=modified_data_reloaded)

wandb.log({"Modified Data": final_table})

# Print remeasured privacy metrics for verification
print("\nModified Data Details and Privacy Metrics after Reloading:")
for key, value in modified_data_metrics.items():
    print(f"{key}: {value}")
print(f"\nPrivacy metrics for Reloaded Modified Data:\nK-Anonymity: {k_anonymity_value}\nL-Diversity: {l_diversity_value}\nT-Closeness: {t_closeness_value:.4f}")

# End W&B run
wandb.finish()
This script applies various de-identification transformations to a medical dataset and calculates privacy metrics like k-anonymity, l-diversity, and t-closeness. It systematically tests different transformation combinations, logs results to Weights & Biases, and allows users to select transformations to meet specific privacy requirements.
The transformations I ultimately chose included broadening birthdate attributes into age ranges, reducing the level of detail in ZIP codes, and suppressing gender. When applied together, these transformations significantly improved the k-anonymity and l-diversity metrics, ensuring that each individual record was indistinguishable from several others and that sensitive attributes like diagnoses had enough diversity within each group.
This combination of transformations allowed us to maintain some of the utility of demographic and geographic data for analysis while minimizing the risk of re-identification. I logged the results of each combination of transformations using W&B Tables, as shown below. The table includes values for each privacy metric, along with a custom "error" score that I created to balance them against one another.


After applying these transformations, I saved the modified data to a new CSV file and evaluated its privacy metrics. The results showed that the dataset achieved a k-anonymity of 5, meaning that each individual is indistinguishable from at least four others. The l-diversity value of 3 indicates that there is sufficient diversity in sensitive attributes, such as diagnosis, within each group. Finally, the t-closeness value of 0.25 signifies that the distribution of sensitive attributes in each group is fairly close to the overall distribution, further reducing the risk of attribute disclosure. Here's the final modified dataset, with the transformations applied:


Overall

In the context of protecting patient data, applying privacy models such as k-anonymity, l-diversity, and t-closeness is essential for maintaining data confidentiality while still enabling meaningful analysis. In this project, we applied various de-identification techniques to a synthetic dataset that included both voter and medical records, demonstrating how privacy metrics can be used to evaluate the effectiveness of these transformations.
Our results show that, by generalizing birthdates to broader age ranges, reducing ZIP code granularity, and suppressing gender identifiers, we were able to significantly enhance the dataset’s privacy. This improvement is reflected in the achieved k-anonymity of 5, indicating that each record is indistinguishable from at least four others. Moreover, the l-diversity value of 3 ensures that within each quasi-identifier group, there are at least three different sensitive attribute values, making it difficult to pinpoint any individual's diagnosis. The t-closeness score of 0.25 further confirms that the distribution of diagnoses in each group is proportionally similar to the overall dataset, minimizing the risk of attribute inference.
These results highlight the critical balance between data utility and privacy. While increasing k-anonymity and l-diversity typically reduces data specificity, careful selection of de-identification transformations can help preserve analytical value while ensuring compliance with privacy regulations such as HIPAA.
Leveraging tools like Weights & Biases for logging and analyzing these privacy metrics enables us to track the impact of different transformation strategies and identify the optimal approach for our data. Feel free to check out the project repo here.

