Automated medical note generation: Fine-tuning GPT models for clinical documentation using Azure OpenAI and Weights & Biases
Trying out the new Weights & Biases Azure OpenAI integration with a simple medical use case
Created on November 12|Last edited on November 19
At Microsoft Ignite 2024, Weights & Biases announced a new integration designed to simplify and elevate the process of fine-tuning a diverse range of LLMs. The integration between Weights & Biases and Azure OpenAI Service allows developers to automatically track aspects of the fine-tuning jobs, as well as compare model versions and evaluate and trace LLM-powered apps. In this post, we explore how this integration works through a medical note generation application.
To set up the Azure OpenAI and Weights & Biases integration please refer to these docs. If you'd like to run the code for this project in a free Colab, head here.
The challenge
One of the most time-consuming tasks for healthcare providers is creating accurate, detailed medical notes from patient encounters. Studies show that physicians spend up to two hours on documentation for every hour of direct patient care, yet this documentation is critical for ensuring high-quality care. That combination of high effort and high importance makes medical documentation an ideal candidate for AI assistance.
Converting doctor-patient dialogues into structured medical notes requires:
- Understanding medical terminology
- Capturing key clinical information
- Maintaining consistent formatting
- Ensuring accuracy of medical details
Let's explore how we approached this problem using Azure OpenAI fine-tuning integrated with Weights & Biases Models for experiment tracking. Weights & Biases Models allows us to seamlessly monitor fine-tuning progress, visualize model metrics, and compare model versions in real-time. This tracking helps identify the most accurate and reliable model configurations for deployment, ensuring that each iteration aligns closely with clinical documentation standards and requirements. Furthermore, we use Weights & Biases Weave to evaluate the quality of the model against a golden set of data we curate from the dialogues we have available.
Solution architecture
Our implementation uses:
- Azure OpenAI for model fine-tuning and inference
- Weights & Biases Models for experiment tracking
- Weights & Biases Weave for evaluation and tracing
Let's walk through this implementation, step by step.
Data pipeline
First, we need to prepare our medical conversation dataset:
```python
def load_medical_data(url: str, num_samples: int = N_SAMPLES) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Load medical data and split into train and test sets"""
    df = pd.read_csv(url)
    df = df.sample(n=num_samples, random_state=42)
    # Split into 80% train, 20% test
    train_size = int(0.8 * len(df))
    train_df = df[:train_size]
    test_df = df[train_size:]
    return train_df, test_df
```
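The split itself is a deterministic 80/20 cut after sampling with a fixed seed, so reruns produce identical partitions. A quick sanity check on a toy DataFrame (with hypothetical column contents, not our real dataset) illustrates the mechanics:

```python
import pandas as pd

# Toy stand-in for the real dataset: 10 rows with the same two columns
df = pd.DataFrame({"dialogue": [f"d{i}" for i in range(10)],
                   "note": [f"n{i}" for i in range(10)]})
df = df.sample(n=10, random_state=42)  # same shuffle as load_medical_data

train_size = int(0.8 * len(df))        # 80% train
train_df, test_df = df[:train_size], df[train_size:]

print(len(train_df), len(test_df))     # 8 2
```

Because the seed is fixed, the same rows always land in the test set, which matters later when we build a golden evaluation set from it.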
You'll see something like this in Weights & Biases:

Next, we need to set up the Azure OpenAI integration.
Setting Up Azure OpenAI
We initialize the Azure OpenAI client with proper credentials by running:
```python
azure_client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-01",
)
```
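Because both values come from environment variables, a missing key only surfaces at request time. A small guard (a hypothetical helper, not part of the original notebook) fails fast instead:

```python
import os

def require_env(*names: str) -> dict:
    """Return the requested environment variables, raising if any are unset."""
    missing = [n for n in names if not os.getenv(n)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {n: os.environ[n] for n in names}

# Example: check credentials before constructing the client
# creds = require_env("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY")
```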
Fine-tuning process
With our data pipeline and integration sorted, we move onto fine-tuning. Our process here leverages a new integration between Weights & Biases and Microsoft Azure OpenAI Service, empowering enterprises to streamline fine-tuning with efficiency and precision. By combining the collaborative and iterative experiment tracking and model management tools from W&B Models with Azure’s powerful cloud infrastructure and models (like GPT-3.5, GPT-4, GPT-4o, and GPT-4o mini), the fine-tuning process becomes simpler, more scalable, and more effective.

The data we'll use for fine-tuning is organized into conversation snippets with the following components:
- System message: Sets the model’s role as a medical scribe, defining responsibilities for accurate and detailed documentation.
- User message: Contains the doctor-patient dialogue that needs transcription.
- Assistant message: Holds the ground truth medical note, serving as the reference for accuracy in model training.
This setup, combined with the Weights & Biases and Azure integration, enables our model to capture domain-specific knowledge, retain consistent note structure, and handle clinical nuances with high precision.
Our collaboration with Weights & Biases and Azure OpenAI highlights the potential for tailored LLM fine-tuning across industries. By creating models that capture specialized language and context, we are driving the next generation of accurate, reliable, and highly specific AI solutions in healthcare and beyond.
```python
def convert_to_jsonl(df: pd.DataFrame, output_file: str):
    with open(output_file, 'w', encoding='utf-8') as f:
        for _, row in df.iterrows():
            conversation = {
                "messages": [
                    {"role": "system", "content": "You are a medical scribe assistant..."},
                    {"role": "user", "content": row['dialogue']},
                    {"role": "assistant", "content": row['note']},
                ]
            }
            json_line = json.dumps(conversation, ensure_ascii=False)
            f.write(json_line + '\n')
```
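With the JSONL file in hand, starting the job follows the standard openai v1 SDK flow: upload the training file, then create a fine-tuning job against a base model. Here is a minimal sketch (the file name, base model string, and helper itself are illustrative assumptions, not values from our runs):

```python
def launch_fine_tune(client, train_path: str, model: str = "gpt-35-turbo-0125"):
    """Upload a JSONL training file and start a fine-tuning job on it."""
    with open(train_path, "rb") as f:
        training_file = client.files.create(file=f, purpose="fine-tune")
    return client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model=model,
    )

# job = launch_fine_tune(azure_client, "train.jsonl")
# print(job.id, job.status)
```

With the W&B integration configured on the Azure side, metrics from jobs launched this way stream into the dashboard automatically.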
To learn more about how to set up the integration between Azure OpenAI fine-tuning and Weights & Biases Models, please refer to the official documentation.
Shown below is the best run from the many fine-tuning jobs we launched on Azure. We can compare multiple experiments in one dashboard and make key decisions about which model to deploy:
These results show that the blue-colored model outperformed the other candidate models. We'll evaluate it below using W&B Weave to understand more about our model's behavior.
W&B Weave integration for tracking and evaluation
First, we initialize Weave with our project configuration. It just takes one line of code:
weave.init(f"{ENTITY}/{WEAVE_PROJECT}")
We use Weave's evaluation framework to assess model performance:
```python
test_evaluation = weave.Evaluation(
    name='medical_record_extraction_test',
    dataset=test_samples,
    scorers=[medical_note_accuracy],
)
```
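Weave runs each scorer over every row of the dataset and summarizes the per-example result dicts; for a binary accuracy scorer, the headline number is effectively a mean. A hypothetical sketch of that summary (the `correct` key is an assumption about the scorer's JSON output, not confirmed by our runs):

```python
def summarize_accuracy(scores: list) -> float:
    """Fraction of examples the scorer marked correct (1) vs. incorrect (0)."""
    return sum(s["correct"] for s in scores) / len(scores)

print(summarize_accuracy([{"correct": 1}, {"correct": 0}, {"correct": 1}]))
```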
Comparing our base and fine-tuned models
In W&B Weave, we can easily visualize the performance of our models in the "Evaluations" tab. Here's our base model implementation and its performance:
```python
@weave.op()
def process_medical_record(dialogue: str) -> Dict:
    transcript = format_dialogue(dialogue)
    prompt = medical_task.format(transcript=transcript)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": medical_system_prompt},
            {"role": "user", "content": prompt},
        ],
    )
    return {"input": transcript, "output": response.choices[0].message.content}
```

And here's the same information with our fine-tuned model:
```python
@weave.op()
def process_medical_record_azure(dialogue: str) -> Dict:
    response = azure_client.chat.completions.create(
        model="gpt-35-turbo-0125-ft-d30b3aee14864c29acd9ac54eb92457f",
        messages=[
            {"role": "system", "content": "You are a medical scribe assistant. Your task is to accurately document medical conversations between doctors and patients, creating detailed medical notes that capture all relevant clinical information."},
            {"role": "user", "content": dialogue},
        ],
    )
    extracted_info = response.choices[0].message.content
    return {
        "input": dialogue,
        "output": extracted_info,
    }
```

The Azure OpenAI fine-tuned model shows several advantages:
- Domain-specific understanding of medical terminology
- Consistent note structure following medical documentation standards
- Improved handling of clinical context
Evaluation Framework
We implement a comprehensive evaluation using GPT-4:
```python
@weave.op()
async def medical_note_accuracy(note: str, output: dict) -> dict:
    scoring_prompt = """Compare the generated medical note with the ground truth note and evaluate accuracy.
Score as 1 if the generated note captures the key medical information accurately, 0 if not.
Return the result as a JSON object."""
    # Build the user message from the scoring instructions plus both notes
    prompt = f"{scoring_prompt}\n\nGround truth note:\n{note}\n\nGenerated note:\n{output['output']}"
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```
Our results:

We can also view both models at once in our evaluation view to make comparison a lot easier:

Here, it's readily apparent that our fine-tuned model is more accurate.
Conclusion
This implementation demonstrates the power of combining Azure OpenAI's fine-tuning capabilities with Weights & Biases Models experiment tracking to create specialized medical documentation models, which we then validated using Weights & Biases Weave. This solution provides a foundation for AI-assisted medical documentation while maintaining clinical accuracy.
The complete implementation, including data processing, model training, and evaluation components, is available in our notebook.