Combining open-source PII redaction with closed-model analysis in healthcare using Llama 3.1, MedSpaCy, and GPT-4o
A guide to PII redaction with AI, covering open-source tools, proprietary models, HIPAA compliance, and how logging supports secure data handling.
Created on December 5 | Last edited on December 10
Personally identifiable information (PII) must be carefully handled to ensure compliance with regulations like HIPAA while maintaining the data's analytical value. Effective handling requires not only advanced AI models and open-source tools but also robust logging practices to track, validate, and optimize every step of the workflow.
Open-source solutions offer transparency and flexibility, particularly for organizations requiring on-premises deployments to meet stringent data security policies. They pair seamlessly with logging tools like Weights & Biases, enabling teams to track model performance, transformations, and compliance throughout the process. On the other hand, closed-source AI models often leverage state-of-the-art architectures, offering unmatched performance and ease of use, albeit with limited customizability. Logging these workflows ensures transparency and helps organizations make informed decisions about model deployment.

In this article, we'll explore the intricate process of PII redaction and its critical role in maintaining data privacy and compliance with regulations like HIPAA. From traditional rule-based methods to advanced AI-driven approaches, we'll examine the tools and techniques that anonymize sensitive information while preserving the utility of datasets for research and analysis. Additionally, we'll highlight how AI models like Llama and GPT-4o, combined with detailed logging, can revolutionize healthcare workflows by enabling privacy-preserving insights and driving better outcomes.
If you're familiar with the need to redact your PII with an open-source model like Llama and analyze it with a closed model like GPT-4o, you can:
Jump to the tutorial
But if you're interested in understanding why, here's what we'll be covering:
Table of contents
Understanding PII redaction
Open-source and proprietary models
The data
Analyzing sensitive data with GPT-4o using redacted PII
Medication interaction analysis
Misdiagnosis identification
Structured data extraction for advanced insights
Overview of the models and dataset
Generating the dataset
Analyzing the masked data with GPT-4o
Conclusion
Understanding PII redaction
PII redaction involves more than removing sensitive data; it ensures anonymization while preserving analytical utility. This requires replacing fields like names, dates, and medical record numbers with placeholders or generalizations to prevent re-identification risks.
The specific fields requiring masking often depend on legal and regulatory frameworks, such as HIPAA, which governs the use and disclosure of protected health information (PHI). HIPAA distinguishes between direct identifiers (e.g., patient names, Social Security numbers) and indirect identifiers (e.g., dates tied to individuals or geographic details smaller than a state). Both must be carefully handled to maintain compliance. In this tutorial, we'll focus on HIPAA standards and demonstrate redaction strategies for safeguarding PHI.
Digging in a bit further: privacy risks often arise from indirect identifiers, which attackers can cross-reference with external datasets. This makes effective redaction strategies critical. AI stands out in redaction for its adaptability to diverse datasets and ability to mask nuanced patterns that traditional methods may overlook. These AI-powered techniques not only ensure compliance with regulations like HIPAA but also enable data usage for research, quality control, and operational improvements. As datasets grow in complexity, advanced AI-driven redaction becomes indispensable.
Traditional PII redaction methods, such as rule-based systems, remain valuable for structured and predictable datasets. These systems rely on predefined patterns and static rules, making them easy to implement and interpret. However, their limitations become evident in complex data, such as unstructured text or contexts with ambiguous information. For instance, distinguishing between "May" as a name versus a month often requires manual and labor-intensive rule updates.
As data grows more varied and intricate, the rigidity of rule-based approaches has spurred the adoption of AI-powered solutions. AI offers the flexibility and adaptability needed to handle these complexities. That said, rule-based systems still have their place in hybrid approaches or scenarios where simplicity and transparency are essential.
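To make the trade-off concrete, here is a minimal sketch of a rule-based redactor. The patterns and the sample note are hypothetical, and real systems maintain far larger rule sets, but it illustrates both the simplicity of the approach and its blind spot: context-dependent tokens like the name "May" sail straight through.

```python
import re

# Hypothetical rule set: each regex targets one structured PII category.
RULES = {
    "MRN": re.compile(r"\bMRN-\d{4}\b"),       # medical record numbers
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),  # ISO dates
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),  # US phone numbers
}

def redact(text: str) -> str:
    """Replace every rule match with its category placeholder."""
    for label, pattern in RULES.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Patient MRN-4821 seen on 2024-05-15; callback 555-867-5309."
print(redact(note))
# → Patient [MRN] seen on [DATE]; callback [PHONE]

# The blind spot: "May" as a surname has no structural signature,
# so this sentence passes through unredacted.
print(redact("Dr. May reviewed the chart"))
```

Structured identifiers fall out cleanly, which is exactly why rule-based systems remain useful in hybrid pipelines; the ambiguous cases are what motivate the LLM-based approach below.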
Large language models like Llama have revolutionized redaction by providing contextual understanding and adaptability. Unlike rule-based systems, LLMs can generalize across datasets, accurately identify PHI in unstructured formats, and adjust to new contexts without requiring manual reconfiguration. This includes all HIPAA-defined PHI elements, such as names, geographic details, dates (except the year), phone numbers, email addresses, and photographs.
By combining advanced AI models with diverse datasets and detailed logging with tools like Weights & Biases, healthcare workflows can be transformed to enable privacy-preserving analysis and meaningful insights. Leveraging open-source and proprietary tools with well-curated datasets, this tutorial will demonstrate techniques for securely handling sensitive information, identifying potential medication interactions, and uncovering misdiagnoses—all while adhering to HIPAA's stringent privacy requirements.
Open-source and proprietary models
Open-source tools offer security advantages by allowing organizations to self-host and maintain control over their data, ensuring compliance with internal policies and regulations. They enable organizations to inspect the codebase for vulnerabilities, customize solutions to their specific needs, and avoid potential risks associated with third-party tools and APIs. For instance, organizations with highly sensitive data—such as healthcare providers—benefit significantly from self-hosting open-source models like Llama, which can run locally without sending data to external servers.
Conversely, proprietary models like GPT-4o provide cutting-edge performance, benefiting from continuous updates and a broader pool of training data. These models are particularly advantageous for organizations that prioritize ease of deployment and scalability over absolute control. However, they often require organizations to trust external vendors with sensitive information, which can be a limitation in industries with strict data privacy requirements.
The data
For tasks involving PII masking and analysis, a variety of datasets can be utilized, depending on the specific use case. Publicly available datasets like MIMIC-III, MIMIC-IV, eICU, and i2b2 are commonly used for research in healthcare natural language processing. These datasets provide diverse clinical text formats, including discharge summaries, progress notes, and radiology reports, which are rich sources for PII masking and redaction techniques.
Alternatively, synthetic datasets can be generated to mimic real-world scenarios while ensuring patient privacy. These can include simulated doctor-patient conversations, SOAP notes, or narrative summaries. For example, tools like Synthea can generate realistic but entirely synthetic health records.
For organizations with access to their own data, de-identified or anonymized clinical notes can also be used, provided that all PII is masked appropriately. Combining these resources allows for testing and refining PII redaction methods across different formats and styles, ensuring robustness and generalizability in real-world applications.
Analyzing sensitive data with GPT-4o using redacted PII
PII anonymization allows extracting actionable insights from healthcare data while maintaining patient anonymity. Once PII is removed using Llama or similar open-source tools, GPT-4o can analyze masked datasets to uncover patterns and trends that drive better healthcare outcomes. Throughout this workflow, Weights & Biases plays a critical role in tracking and visualizing key metrics, providing a layer of accountability and insight into the system's effectiveness.
By leveraging Weights & Biases for logging, teams gain a clear understanding of how PII is masked, how redacted datasets are prepared, and how models like GPT-4o perform during analysis. This detailed tracking promotes compliance, facilitates debugging, and allows for iterative improvements.
Medication interaction analysis
GPT-4o excels in identifying potential medication interactions within clinical notes. After masking patient-specific details, the model can focus solely on analyzing prescribed medications and their combinations. By cross-referencing medication information with known pharmacological databases, GPT-4o highlights potential contraindications or adverse reactions. For example, it might identify that a patient's simultaneous use of two drugs carries a risk of interaction, flagging it for further review.
Misdiagnosis identification
In addition to medication analysis, GPT-4o assists in detecting potential misdiagnoses. By comparing reported symptoms and diagnostic conclusions across masked clinical notes, the model identifies inconsistencies or patterns suggestive of diagnostic errors. For instance, if a masked dataset shows frequent misclassification of similar conditions, such as Lyme disease versus lupus, the model can flag these trends for further investigation.
Structured data extraction for advanced insights
GPT-4o can also transform unstructured, masked clinical text into structured data formats. Entities like symptoms, medications, and treatment plans are extracted and organized into tables or charts, enabling deeper statistical analysis. This structured output allows researchers to identify trends, such as the prevalence of certain conditions in specific demographics or the efficacy of particular treatments, all while ensuring PII remains protected.
By enabling robust, privacy-preserving analysis, GPT-4o bridges the gap between safeguarding sensitive information and leveraging the full potential of healthcare data. This capability is pivotal for advancing medical research and improving clinical outcomes in an increasingly data-driven healthcare ecosystem.
Overview of the models and dataset
In this tutorial, we’ll showcase how advanced AI tools can streamline healthcare data workflows by securely redacting sensitive information and extracting meaningful insights. First, we’ll use Llama 3.1 8B Instruct, an open-source language model, to mask PII, ensuring compliance with privacy regulations without compromising data usability. Then, we’ll leverage GPT-4o, a cutting-edge proprietary AI, to analyze the redacted data, identifying trends like medication interactions and potential diagnostic inconsistencies.
We will generate a synthetic dataset using the MTS Dialogue Clinical Note dataset, a public resource of annotated clinical documentation, as its backbone. Rather than generating entirely synthetic data from scratch, we bootstrap off this existing resource, using its rich, realistic structure as a foundation.
This approach involves synthesizing additional clinical notes—like hypothetical patient cases or extended diagnostic summaries—that maintain the complexity, language, and variety of authentic medical records. By expanding on the existing dataset rather than starting from scratch, we preserve its depth and diversity while introducing variability that reflects real-world reporting styles.
This hybrid methodology ensures the dataset is both realistic and anonymized, making it ideal for testing privacy-preserving AI techniques and advanced healthcare analytics.
Generating the dataset
To start, we will generate the dataset by building on the MTS Dialogue Clinical Note dataset. By leveraging this resource, we retain the depth and variability essential for realistic clinical data while avoiding the shortcomings of purely synthetic datasets, which often fail to capture the nuances and contextual richness of actual medical documentation.
Our approach involves bootstrapping from the original dataset to create an enhanced version. Using the existing data as a scaffold, we synthesize additional content such as hypothetical patient cases, expanded diagnostic summaries, and diverse report styles. These synthetic additions are designed to mirror the complexity and diversity of authentic healthcare records, incorporating styles like SOAP notes, narrative reports, and discharge summaries.
import random
import json

import wandb
from datasets import load_dataset
from openai import OpenAI

# Constants
FIRST_NAMES = ["John", "Jane", "Michael", "Emily", "Robert", "Jessica", "David", "Sophia", "Matthew", "Amanda"]
LAST_NAMES = ["Smith", "Johnson", "Williams", "Brown", "Jones", "Garcia", "Davis", "Miller", "Wilson", "Martinez"]
DATES = [f"2024-0{i}-15" for i in range(1, 10)]
MEDICAL_RECORD_IDS = [f"MRN-{random.randint(1000, 9999)}" for _ in range(100)]
REPORT_STYLES = [
    "Narrative Style", "SOAP Style", "Checklist Style", "Bullet Point Summary Style",
    "Medico-Legal Style", "Case Summary Style", "Discharge Summary Style", "Teaching Style",
    "Diagnostic-Driven Style", "Interdisciplinary Communication Style", "Layperson Style",
    "Comparative Style", "Emergency Note Style"
]


def generate_prompt(
    section_text: str,
    first_name: str,
    last_name: str,
    date: str,
    record_id: str,
    style: str
) -> str:
    """Generate a formatted prompt for clinical report generation."""
    patient_block = """
Patient Name: {first_name} {last_name}
Date: {date}
Medical Record Number: {record_id}

### Input:
{section_text}

### Synthetic Report:"""

    style_templates = {
        "Narrative Style": "Write a detailed clinical report as a narrative. Include all relevant patient details, findings, and recommendations in prose format. If appropriate based on the patient's condition, include any necessary medication prescriptions with dosage and duration." + patient_block,
        "SOAP Style": "Write a clinical report following the SOAP format (Subjective, Objective, Assessment, Plan). Ensure clarity and proper sectioning. In the Plan section, include any appropriate medication prescriptions with specific dosage and duration if warranted by the patient's condition." + patient_block,
        "Checklist Style": "Write a clinical report as a checklist. Focus on concise and structured items for each aspect of the case. Include a medications section if any prescriptions are warranted, with clear dosage and duration details." + patient_block,
        "Bullet Point Summary Style": "Write a clinical report as a bullet point summary. Highlight key findings, diagnosis, and next steps. Include any necessary medication prescriptions as separate bullet points with clear dosage and duration information." + patient_block,
        "Medico-Legal Style": "Write a formal medico-legal report. Include all findings with precise language suitable for legal review. Document any prescribed medications with complete details including dosage, duration, and clinical rationale." + patient_block,
        "Case Summary Style": "Summarize the case in a concise format suitable for a referral or handoff. Include details of any prescribed medications, including dosage and duration, if warranted by the patient's condition." + patient_block,
        "Discharge Summary Style": "Write a discharge summary focusing on follow-up care and post-visit recommendations. Include a detailed medication section with any new prescriptions, including dosage, duration, and instructions for use." + patient_block,
        "Teaching Style": "Write a clinical report suitable for medical training purposes. Include reasoning and explanations for the findings. If prescribing any medications, explain the clinical rationale for each prescription including choice of drug, dosage, and duration." + patient_block,
        "Diagnostic-Driven Style": "Write a report focused on the diagnostic process. Highlight the reasoning behind the diagnosis. Include any therapeutic decisions, including medication prescriptions with dosage and duration when appropriate." + patient_block,
        "Interdisciplinary Communication Style": "Write a clinical report intended for communication with other healthcare professionals. Detail any medication changes or new prescriptions, including full prescribing information and rationale." + patient_block,
        "Layperson Style": "Write a clinical report in simple language that a patient without medical knowledge can understand. If any medications are prescribed, explain them in plain language, including why they're needed and how to take them properly." + patient_block,
        "Comparative Style": "Write a clinical report comparing the current findings with previous evaluations. Include any changes in medication regimen, new prescriptions, or discontinued medications with complete details." + patient_block,
        "Emergency Note Style": "Write a brief and urgent clinical report as if documenting an emergency case. Include any immediate medication interventions or prescriptions with precise dosing instructions." + patient_block,
    }

    if style not in style_templates:
        raise ValueError(f"Unknown style: {style}")

    return style_templates[style].format(
        first_name=first_name,
        last_name=last_name,
        date=date,
        record_id=record_id,
        section_text=section_text
    )


def generate_random_identifiers():
    """Generate a set of random identifiers for a patient."""
    return (
        random.choice(FIRST_NAMES),
        random.choice(LAST_NAMES),
        random.choice(DATES),
        random.choice(MEDICAL_RECORD_IDS)
    )


def generate_gpt_synthetic_report(section_text: str, client: OpenAI) -> str:
    """Generate a synthetic medical report using the OpenAI API."""
    first_name, last_name, date, record_id = generate_random_identifiers()
    style = random.choice(REPORT_STYLES)
    prompt = generate_prompt(
        section_text=section_text,
        first_name=first_name,
        last_name=last_name,
        date=date,
        record_id=record_id,
        style=style
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        temperature=0.0,
        messages=[
            {"role": "system", "content": f"You are a medical assistant creating clinical reports in {style}."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content.strip()


def main():
    # Initialize OpenAI client
    OPENAI_API_KEY = ""
    client = OpenAI(api_key=OPENAI_API_KEY)

    # Initialize W&B
    wandb.init(project="synthetic-clinical-reports")

    # Load dataset and take the first 100 entries
    dataset = load_dataset("har1/MTS_Dialogue-Clinical_Note")
    first_hundred = dataset["train"].select(range(100))

    # Generate reports
    synthetic_reports = []
    for row in first_hundred:
        section_text = row["section_text"]
        synthetic_report = generate_gpt_synthetic_report(section_text, client)
        synthetic_reports.append({
            "id": row["ID"],
            "original_section_text": section_text,
            "synthetic_report": synthetic_report
        })

    # Log to W&B and save
    table = wandb.Table(columns=["ID", "Original Section Text", "Synthetic Report"])
    for report in synthetic_reports:
        table.add_data(
            report["id"],
            report["original_section_text"],
            report["synthetic_report"]
        )
    wandb.log({"Synthetic Reports": table})

    # Save to file
    with open("./synthetic_reports.json", "w") as f:
        json.dump(synthetic_reports, f, indent=4)


if __name__ == "__main__":
    main()
We focused on the initial 100 entries of the dataset for the purpose of this tutorial. Each entry, containing real-world medical language and structure, was further enhanced to create a diverse and complex dataset for synthetic clinical note generation.
The script bootstrapped off of the existing dataset by synthesizing additional clinical notes. For each entry, we generated reports in randomly selected styles, such as SOAP notes, narrative formats, or checklist summaries, using predefined templates. These were populated with randomized but realistic identifiers, including names, dates, and medical record numbers, ensuring variability while maintaining coherence. By leveraging the GPT-4o model, the content generation maintained high linguistic quality and clinical relevance, closely mimicking authentic medical records.
Here's the dataset we generated. You can explore it interactively in the Weights & Biases table where it was logged.
After generating the synthetic dataset, the next step is to ensure it is fully anonymized by masking any personally identifiable information (PII). Using the code below, we'll leverage the Llama 3.1 8B model to identify and replace sensitive data—such as names, addresses, medical record numbers, and other identifiers—with placeholder text like “[redacted].”
This process is essential for maintaining compliance with privacy regulations like HIPAA while preserving the analytical utility of the data. The script systematically processes each synthetic report, identifying PII based on a predefined set of categories. By using the transformers pipeline, the Llama model applies its advanced contextual understanding to detect nuanced patterns of PII that traditional rule-based approaches might miss.
import transformers
import torch
import json
import weave

weave.init("masking_pii")

# Load the dataset from a local file
with open("synthetic_reports.json", "r") as f:
    dataset = json.load(f)

# Load the Llama model using the transformers pipeline
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# List of HIPAA identifiers to mask
PII_CATEGORIES = [
    "Names", "Geographic information (addresses, cities, counties, zip codes)",
    "Dates (birth, admission, discharge, death, except year)", "Phone Numbers",
    "Fax Numbers", "Email Addresses", "Social Security Numbers",
    "Medical Record Numbers", "Health Plan Beneficiary Numbers", "Account Numbers",
    "Certificate/License Numbers", "Vehicle Identifiers (including license plates)",
    "Device Identifiers", "Web URLs", "IP Addresses", "Biometric Identifiers (e.g., fingerprints)",
    "Full-Face Photographic Images", "Any other unique identifier or code"
]


def postprocess_inputs(inputs: dict) -> dict:
    # Exclude the 'text' input to avoid logging sensitive data
    return {k: v for k, v in inputs.items() if k != 'text'}


@weave.op(postprocess_inputs=postprocess_inputs)
def mask_pii(text):
    """Use the Llama model to mask PII in the input text,
    replacing sensitive information with '[redacted]'.
    """
    # Format the messages for the chat-style prompt
    messages = [
        {"role": "system", "content": "You are an advanced language model tasked with identifying and masking personally identifiable information (PII) in text."},
        {"role": "user", "content": f"""The following text contains personally identifiable information (PII).
Please identify all instances of the following PII categories: {', '.join(PII_CATEGORIES)}
and replace them with '[redacted]'. Ensure that the rest of the text remains unchanged.
DO NOT RESPOND WITH ANY PII INFORMATION WHATSOEVER!

Text: "{text}"

Masked Text:"""}
    ]

    # Generate masked text
    response = pipeline(messages, max_new_tokens=2048)
    print(response[0])
    # The pipeline returns the full conversation; index 2 is the assistant's reply
    return response[0]["generated_text"][2]['content']


# Iterate over the dataset and mask PII
masked_data = []
for record in dataset:
    original_text = record["synthetic_report"]
    masked_text = mask_pii(original_text)
    masked_data.append({
        "id": record["id"],
        "original_text": original_text,
        "masked_text": masked_text.strip()  # Ensure whitespace is trimmed
    })
    print(masked_text)  # Print the masked text for debugging

# Save the masked dataset to a new JSON file
with open("masked_ds.json", "w") as f:
    json.dump(masked_data, f, indent=4)

print("PII masking completed. Masked dataset saved to 'masked_ds.json'.")
The anonymized data is then saved as a new dataset (masked_ds.json), which retains the structure and detail of the original synthetic notes while ensuring complete de-identification. This masked dataset serves as the foundation for subsequent analysis, where we extract and analyze trends without compromising patient privacy.
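Before analyzing the masked dataset, it's worth spot-checking that the redaction actually held. The sketch below is a hypothetical leak detector, not part of the original pipeline: its patterns mirror the synthetic identifiers we injected during generation (first names, MRN-style record numbers, ISO dates), so any match in masked_ds.json is a record the Llama pass missed. Adapt the patterns to your own data.

```python
import re

# Hypothetical redaction spot-check. The patterns below match the synthetic
# identifiers injected during dataset generation; in practice, run this over
# the records loaded from masked_ds.json.
LEAK_PATTERNS = {
    "name": re.compile(r"\b(John|Jane|Michael|Emily|Robert|Jessica|David|Sophia|Matthew|Amanda)\b"),
    "mrn": re.compile(r"\bMRN-\d{4}\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def find_leaks(records):
    """Return (record id, category) pairs for identifiers that survived masking."""
    leaks = []
    for record in records:
        for label, pattern in LEAK_PATTERNS.items():
            if pattern.search(record["masked_text"]):
                leaks.append((record["id"], label))
    return leaks

# Inline samples for illustration; sample 2 simulates a failed masking.
samples = [
    {"id": 1, "masked_text": "Patient [redacted], MRN [redacted], seen [redacted]."},
    {"id": 2, "masked_text": "Patient Emily, MRN-4821, seen on 2024-03-15."},
]
print(find_leaks(samples))  # → [(2, 'name'), (2, 'mrn'), (2, 'date')]
```

Because these patterns only cover the identifiers we know we injected, a clean result here is necessary but not sufficient evidence of full de-identification; a manual review of a sample of records is still prudent.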
I also leveraged Weave to visualize the outputs of the model. Note that for this specific script, I chose not to log the inputs to our inference function, since they contained our unmasked PHI data. In Weave, you can mask inputs by defining a function that filters them and passing it to the Weave op through the postprocess_inputs argument, as shown below:
def postprocess_inputs(inputs: dict) -> dict:
    # Exclude the 'text' input to avoid logging sensitive data
    return {k: v for k, v in inputs.items() if k != 'text'}

@weave.op(postprocess_inputs=postprocess_inputs)
def mask_pii(text):
    # rest of the function
This targeted logging is useful when you want to prevent certain input fields from being logged while still gaining visibility into the rest of your inference pipeline. Here is what it looks like inside Weave after running our script:


Analyzing the masked data with GPT-4o
Every day, medication errors and misdiagnoses affect patient outcomes, sometimes with devastating consequences. Adverse drug interactions, overlooked symptoms, or incorrect diagnoses are common challenges in healthcare systems worldwide. These issues often arise from the complexity of clinical data, the pressures of rapid decision-making, and human error. However, with advancements in AI, we now have the technology to detect and potentially prevent these issues before they occur.
In this step, we'll leverage GPT-4o to analyze the masked dataset (masked_ds.json) and extract actionable insights that could directly address these problems. By simulating the role of a medical expert, GPT-4o evaluates clinical reports to identify potential drug interactions based on prescribed medications and flags possible diagnostic inconsistencies by correlating symptoms with documented diagnoses.
import json
from openai import OpenAI
import weave

weave.init("masking_pii_double_check")

# Set up OpenAI API key
OPENAI_API_KEY = ""
oclient = OpenAI(api_key=OPENAI_API_KEY)


@weave.op
def analyze_report(text):
    """Use GPT-4o to analyze a masked clinical report for possible
    medication interactions or potential misdiagnoses.
    """
    messages = [
        {"role": "system", "content": "You are a medical expert analyzing clinical reports for medication interactions and potential misdiagnoses."},
        {"role": "user", "content": f"""The following is a masked clinical report. Please analyze it for:
1. Any potential medication interactions based on the prescribed medications.
2. Possible misdiagnoses based on the symptoms and other clinical details provided.

Masked Clinical Report:
{text}

Analysis:"""}
    ]

    # Generate analysis
    response = oclient.chat.completions.create(
        model="gpt-4o-2024-08-06",
        temperature=0.5,
        messages=messages
    )
    return response.choices[0].message.content.strip()


def main():
    # Load the masked dataset
    with open("masked_ds.json", "r") as f:
        masked_data = json.load(f)

    analysis_results = []

    # Analyze each masked report
    for record in masked_data:
        masked_text = record["masked_text"]
        analysis = analyze_report(masked_text)
        analysis_results.append({
            "id": record["id"],
            "masked_text": masked_text,
            "analysis": analysis
        })
        print(f"Analysis for ID {record['id']} completed.")  # Debugging output

    # Save the analysis results to a new JSON file
    output_path = "analysis_results.json"
    with open(output_path, "w") as f:
        json.dump(analysis_results, f, indent=4)
    print(f"Analysis completed. Results saved to '{output_path}'.")


if __name__ == "__main__":
    main()
By identifying risks hidden within clinical reports, AI tools like GPT-4o enable healthcare organizations to enhance patient safety, reduce errors, and refine treatment strategies. The results, saved in analysis_results.json, demonstrate how privacy-preserving AI can serve as a powerful ally in creating safer and more effective healthcare systems.
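Once analysis_results.json exists, a lightweight triage pass can help reviewers prioritize which reports to read first. The sketch below is a hypothetical keyword heuristic of my own, not part of the pipeline above: it tallies how many GPT-4o analyses mention interaction or misdiagnosis language. A structured-output schema would be more robust, but this is enough to surface the riskiest records.

```python
# Hypothetical triage pass over analysis_results.json records.
# Each flag maps to substrings that suggest the analysis raised that concern.
FLAG_TERMS = {
    "interaction": ("interaction", "contraindicat"),
    "misdiagnosis": ("misdiagnos", "inconsisten"),
}

def triage(results):
    """Count how many analyses mention each category of concern."""
    counts = {flag: 0 for flag in FLAG_TERMS}
    for record in results:
        text = record["analysis"].lower()
        for flag, terms in FLAG_TERMS.items():
            if any(term in text for term in terms):
                counts[flag] += 1
    return counts

# Inline samples for illustration; in practice, load analysis_results.json.
samples = [
    {"id": 1, "analysis": "Potential interaction between warfarin and aspirin."},
    {"id": 2, "analysis": "Symptoms are inconsistent with the documented diagnosis."},
    {"id": 3, "analysis": "No concerns identified."},
]
print(triage(samples))  # → {'interaction': 1, 'misdiagnosis': 1}
```

Substring matching will produce false positives (for example, "no interaction found" still matches), so treat these counts as a sorting signal for human review rather than a verdict.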
I also used Weave to log the results of our analysis system via the Weave op decorator, which can be added to a function to log its inputs and outputs:

In addition to using a high-powered model like GPT-4o for extracting insights, we may also want to extract the exact medications into a structured format for other types of analysis.
For this use case, we can use MedSpaCy, a lightweight and efficient NLP library tailored for medical text, to extract structured information from our masked clinical notes dataset. While not strictly necessary (models like Llama can also perform entity extraction), MedSpaCy offers a more resource-efficient alternative, particularly for tasks focused on extracting specific medical entities such as conditions, medications, and procedures.
Here is the code that uses MedSpaCy for medication extraction:
import spacy
import medspacy
import json

# Load MedSpaCy with the med7 pipeline
med7 = spacy.load("en_core_med7_lg")
nlp = medspacy.load(med7)

# Load the masked dataset
with open("masked_ds.json", "r") as f:
    masked_dataset = json.load(f)

# Process the masked data
processed_data = []
for record in masked_dataset:
    # Read the masked text
    masked_text = record["masked_text"]

    # Use medspacy to process the text
    doc = nlp(masked_text)

    # Extract entities
    entities = [{"text": ent.text, "label": ent.label_} for ent in doc.ents]

    # Append results
    processed_data.append({
        "id": record["id"],
        "masked_text": masked_text,
        "entities": entities
    })

# Write the processed data to a new JSON file
with open("processed_masked_ds.json", "w") as f:
    json.dump(processed_data, f, indent=4)

print("Processing completed. Results saved to 'processed_masked_ds.json'.")
This approach allows us to turn unstructured, anonymized clinical text into structured data formats that are ready for analysis.
By leveraging this specialized solution, we can efficiently process large datasets with minimal computational overhead, offering a practical and scalable option for organizations looking to optimize their workflows while maintaining privacy and compliance.
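As a final illustration, the structured records in processed_masked_ds.json can be aggregated directly. The sketch below tallies extracted entities by label to surface dataset-wide trends, such as the most frequently prescribed drugs. It assumes med7-style entity labels (e.g., DRUG, DOSAGE); check the labels your pipeline actually emits before relying on them.

```python
from collections import Counter

# Sketch of downstream aggregation over processed_masked_ds.json records.
# Assumes med7-style labels such as DRUG and DOSAGE.
def entity_frequencies(processed_records, label="DRUG"):
    """Count occurrences of entities with the given label, case-insensitively."""
    counter = Counter()
    for record in processed_records:
        for ent in record["entities"]:
            if ent["label"] == label:
                counter[ent["text"].lower()] += 1
    return counter

# Inline samples for illustration; in practice, load processed_masked_ds.json.
samples = [
    {"id": 1, "entities": [{"text": "Lisinopril", "label": "DRUG"},
                           {"text": "10 mg", "label": "DOSAGE"}]},
    {"id": 2, "entities": [{"text": "lisinopril", "label": "DRUG"}]},
]
print(entity_frequencies(samples).most_common(1))  # → [('lisinopril', 2)]
```

From here, the counts feed naturally into a table or chart, for example a wandb.Table logged alongside the rest of the run, closing the loop between extraction and the trend analysis described above.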
Conclusion
The advancements in AI-driven PII redaction and analysis represent a pivotal shift in how we handle and utilize healthcare data. By combining the power of open-source solutions like Llama for secure and efficient anonymization with the analytical depth of proprietary models like GPT-4o, we are not only safeguarding patient privacy but also unlocking the potential of this data for meaningful insights.
A crucial component of this transformation is robust logging. Incorporating tools like Weights & Biases into workflows ensures transparency, reproducibility, and accountability. By logging every step, from redaction to analysis, teams can validate results, refine workflows, and build trust in their systems. This transparency is not just a compliance requirement but a catalyst for innovation, enabling healthcare organizations to tackle challenges such as medication errors and misdiagnoses with confidence.
As we look to the future, integrating these technologies and maintaining meticulous logs of their usage offers a comprehensive framework for advancing data-driven medicine. Together, these tools create a safer, smarter, and more patient-centered healthcare system.