How to create a biomedical RAG application using Snowflake Arctic for PubMed paper understanding
A tutorial about building a RAG application to better understand a large corpus of medical information
You can check out the code for this project and see it in W&B Weave by following these links:
Introduction
Imagine a bustling hospital where clinicians need to make quick, accurate decisions for complex patient cases, like a pediatrician handling a rare genetic disorder in a newborn. This pediatrician needs to understand genetic factors, treatment options, and the latest research findings. However, sifting through the massive volume of biomedical literature on PubMed to find relevant information can be onerous.
In this context, a retrieval-augmented generation (RAG) system integrated with Snowflake Arctic can be invaluable. Here's how:
- Time-sensitive information retrieval: The pediatrician inputs a clinical question like "Is Hirschsprung disease a Mendelian or multifactorial disorder?" The system quickly transforms this question into an optimized semantic search query, retrieving the most pertinent documents from PubMed.
- Accurate and relevant document retrieval: Using advanced embedding models, the system searches vast biomedical databases, pulling out articles most likely to contain the needed information.
- Efficient contextual analysis: The system evaluates the abstracts of retrieved documents, determining their relevance to the clinical question. This ensures only the most relevant documents are considered, saving the clinician valuable time.
- Concise summarization: The system then summarizes the key points from these documents, providing a clear and concise overview addressing the clinical question.
- Informed decision making: Finally, the system synthesizes the summarized information into a coherent answer. The pediatrician receives a concise, evidence-based response that aids in understanding the genetic aspects of Hirschsprung disease, guiding treatment decisions.
By employing a RAG system like this, clinicians can significantly reduce the time spent on literature review, allowing them to focus more on patient care. This integration of advanced machine learning models with robust data platforms like Snowflake Arctic enhances clinical decision support, leading to better patient outcomes and more efficient healthcare delivery.
Example:
Question: Autophagy is the process where a virus obtains nutrients from its host, yes or no?
Ground Truth: No, autophagy is important in cellular homeostasis for the cell survival mechanism and is involved in apoptosis.
Prediction: No, autophagy is not the process where a virus obtains nutrients from its host. Autophagy is a cellular process that recycles or eliminates cell components and damaged organelles through lysosomal degradation. Some viruses, like Herpes simplex virus type I (HSV-1), can manipulate the autophagic process for their replication and survival, but it is not a means for the virus to obtain nutrients from its host.
Context: Autophagy is a homeostatic process involved in the turnover or elimination of cytoplasmic components, damaged organelles, and protein aggregates via a lysosomal degradation mechanism. Autophagy also provides a mechanism of innate immunity, known as xenophagy, designed to protect cells from intracellular pathogens, but it may unfortunately be subverted to act as a pro-viral pathway facilitating the replication of certain viruses. Herpes simplex virus type I (HSV-1) is a neurotropic virus that remains latent in host neurons; it is the most common cause of sporadic viral encephalitis. Moreover, HSV-1 has been related to the pathogenesis of Alzheimer's disease. HSV-1 can modulate the autophagic process through a mechanism mediated by the viral protein ICP34.5. Here we report that HSV-1 induces a strong increase in GFP-LC3 and endogenous LC3 lipidation, and triggers the accumulation of intracellular autophagic compartments (mainly autophagosomes) without enhancing autophagic long-lived protein degradation in the late stages of infection. Autophagy inhibition mediated by ATG5 gene silencing had no effect on viral growth. The present results suggest that HSV-1 infection activates the host autophagic machinery and strongly controls the autophagic process, blocking the fusion of autophagosomes with lysosomes. These events might be important in the neurodegenerative process associated with HSV-1 infection. (Score: 0.5636628997173461)
Background
The biomedical information problem

The aggregation and distribution of medical knowledge, facilitated by platforms like PubMed or Cochrane, enable healthcare professionals and researchers to stay updated with the latest scientific discoveries. However, the influx of over 1 million papers annually into PubMed means keeping up with all the new findings is simply impossible.
Existing technologies often fail to meet the information needs of healthcare professionals and researchers. Clinicians typically have one care-related question for every other patient seen and refer to sources like PubMed or UpToDate for answers. Questions that cannot be answered within 2 to 3 minutes are often abandoned, potentially impacting patient care and outcomes.
While systematic review (SR) articles can provide quick answers, many questions are simply not covered by existing reviews. Manually synthesizing findings from multiple primary sources without a published review article is extraordinarily time-consuming. Review articles take an average of 67.3 weeks to complete and may not include the most updated research.
Question-answering tools leveraging frequently updated external electronic resources can provide up-to-date information efficiently, benefiting both scientific discovery and patient care. In previous decades, applications integrating clinical systems with online information (e.g., “infobuttons”) were typically driven by semantic networks. Other works, such as CHiQA, combined knowledge-based, machine learning, and deep learning approaches to develop question-answering systems using patient-oriented resources.
The natural language generation problem
The new capabilities of agents powered by large language models (LLMs) have accelerated the development of automated literature summarization tools. Most solutions are privately developed, closed-source systems based on retrieval-augmented LLMs (RetA LLMs). However, the lack of publicly available technical reports, guidelines, regulations, and evaluations to ensure safe and responsible usage is a major concern.
This natural language generation (NLG) problem is exacerbated by the lack of (1) representative datasets and tasks and (2) automated metrics for evaluating RetA LLMs. Fortunately, developments in LLM evaluation have shown that automated metrics correlate moderately with human preference, even in domain-specific scenarios, including medicine.
BioASQ

One effort to address these challenges is the BioASQ project. BioASQ focuses on large-scale biomedical semantic indexing and question answering, providing a collection of biomedical questions along with relevant documents. That makes it an ideal benchmark for evaluating the performance of retrieval-augmented generation systems. By utilizing the BioASQ dataset, researchers can create representative tasks that reflect real-world information needs in the biomedical domain.
Note: We'll use the BioASQ dataset in place of raw PubMed because it gives us a cleaner evaluation pipeline for this class of experiments: it pairs questions with relevant documents, making it a better source for evaluating biomedical text understanding.
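To make the dataset shape concrete, here is an illustrative (not verbatim) preprocessed BioASQ record; the field names are assumptions chosen to match how the retrieval code later indexes doc['document']['passage']:

example = {
    "question": "Is Hirschsprung disease a mendelian or a multifactorial disorder?",
    "documents": [
        # Each document carries a PubMed abstract as its passage (text truncated here).
        {"passage": "Hirschsprung disease (HSCR) ...", "pmid": "..."},  # the pmid field is hypothetical
    ],
    "ideal_answer": "...",  # withheld when generating search queries
}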
Snowflake Arctic

Snowflake Arctic is a state-of-the-art enterprise LLM designed for cost-effective training and openness, revolutionizing the landscape of enterprise-grade AI. It excels in tasks such as SQL generation, coding, instruction following, and complex query answering, outperforming open-source models trained with significantly higher compute budgets.
By utilizing an innovative Dense-MoE Hybrid transformer architecture, Arctic combines a 10B dense transformer model with a residual 128×3.66B MoE MLP, achieving 480B total and 17B active parameters for top-tier intelligence and resource-efficient training and inference. Available under Apache 2.0 license, Arctic ensures ungated access to weights, code, data recipes, and research insights, making it highly accessible and cost-effective for enterprise AI applications.
Arctic's high training efficiency means that Snowflake customers can create high-quality custom models affordably, with training costs under $2 million (less than 3K GPU weeks). Arctic surpasses other models like Llama 3 8B and Llama 2 70B on enterprise metrics using less than half the training compute budget. It also delivers fast inference thanks to optimized implementations developed in collaboration with NVIDIA, making it a practical choice for interactive and high-batch-size inference scenarios.
The combination of cutting-edge architecture, cost-effectiveness, and open access makes Snowflake Arctic an ideal solution for enterprise AI, enhancing the ability to deploy powerful, efficient AI models in various applications.
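Later in this post, the pipeline calls Arctic through Replicate via litellm's completion function (the model string replicate/snowflake/snowflake-arctic-instruct appears in the GenericLLMModel defaults). As a minimal standalone sketch, assuming litellm is installed and a Replicate API token is available, a direct call might look like this:

import os
from litellm import completion

# litellm routes "replicate/..." model strings to the Replicate API;
# it expects REPLICATE_API_TOKEN in the environment.
os.environ.setdefault("REPLICATE_API_TOKEN", "<your-token>")

response = completion(
    model="replicate/snowflake/snowflake-arctic-instruct",
    messages=[
        {"role": "system", "content": "You are an expert biomedical researcher."},
        {"role": "user", "content": "Is Hirschsprung disease a mendelian or a multifactorial disorder?"},
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)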
How we’re building our biomedical RAG model
The rapid advancement in biomedical research necessitates efficient methods for extracting and synthesizing relevant information from vast amounts of literature. Here, we introduce a practical example of using retrieval-augmented generation (RAG) models for understanding and answering complex biomedical questions based on PubMed papers.
The integration of Weave, Snowflake Arctic, and Streamlit in this application provides a practical and efficient solution for biomedical question answering. Weave's framework enables version control and modularity, facilitating rapid experimentation with different model configurations and seamless deployment of updates.
This is crucial in a rapidly evolving field like biomedical research, where incorporating the latest findings is essential for accuracy. The utilization of Snowflake Arctic's cost-effective, yet powerful, LLM architecture, specifically its hybrid dense-MoE structure, allows for robust handling of complex biomedical reasoning tasks while maintaining low inference latency. This efficiency translates to faster response times, crucial for time-sensitive clinical applications. The Streamlit interface provides an accessible front-end to this pipeline, enabling efficient information retrieval for both researchers and clinicians without requiring specialized technical expertise.
This combined approach offers significant advantages in speed and cost-effectiveness compared to traditional methods like manual literature reviews or reliance on commercial knowledge bases, while also simplifying the user experience through an intuitive interface.

Key components
- Query transformation: Using a custom weave.Model called GenericLLMModel to convert biomedical questions into optimized semantic search queries.
- Document retrieval: Employing a vector store with advanced embedding models to find the most relevant documents from the BioASQ dataset.
- Context scoring: Utilizing a specialized model to assess the relevance of retrieved documents based on their abstracts.
- Summarization: Summarizing the key points from relevant documents to provide a coherent overview addressing the biomedical question.
- Final answer synthesis: Synthesizing the summarized information into a clear, concise answer to the clinical question.
1. Query transformation

The first step in our biomedical RAG model is transforming the input question into an optimized semantic search query. We achieve this using a custom GenericLLMModel, which is a Weave model specifically designed for this task.
Here's how the query transformation process works:
First, we define the GenericLLMModel using Weave:
class GenericLLMModel(weave.Model):
    model_name: str = "replicate/snowflake/snowflake-arctic-instruct"
    prompt_template: PromptTemplate
    temperature: float = 0.0
    name: str = "GenericLLMModel"

    def __init__(
        self,
        system_prompt: Optional[str] = None,
        human_prompt: Optional[str] = None,
        model_name: str = "gpt-3.5-turbo",
        temperature: float = 0.0,
    ):
        super().__init__(
            model_name=model_name,
            prompt_template=PromptTemplate(system_prompt=system_prompt, human_prompt=human_prompt),
            temperature=temperature,
        )

    @weave.op()
    def predict(
        self,
        human_prompt_args: Optional[dict] = {},
        system_prompt_args: Optional[dict] = {},
    ) -> dict:
        messages = self.prompt_template.format_prompt(
            human_prompt_args=human_prompt_args, system_prompt_args=system_prompt_args
        )
        # ...
        response = completion(**completion_args)
        answer = response.choices[0].message.content
        return {"answer": answer}
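Note that GenericLLMModel relies on two helpers not shown above: a PromptTemplate class and litellm's completion function (implied by the completion(**completion_args) call). As a rough sketch of the kind of PromptTemplate this interface implies, not the project's actual implementation, it might simply format the system and human prompts into chat messages:

from typing import Optional
from pydantic import BaseModel


class PromptTemplate(BaseModel):
    # Hypothetical minimal template: prompts contain {placeholders}
    # that get filled from the *_prompt_args dicts at predict time.
    system_prompt: Optional[str] = None
    human_prompt: Optional[str] = None

    def format_prompt(self, human_prompt_args: dict = {}, system_prompt_args: dict = {}) -> list:
        messages = []
        if self.system_prompt:
            messages.append({"role": "system", "content": self.system_prompt.format(**system_prompt_args)})
        if self.human_prompt:
            messages.append({"role": "user", "content": self.human_prompt.format(**human_prompt_args)})
        return messages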
Then, we create the question_2_query_model using this GenericLLMModel:
question_2_query_model = GenericLLMModel(
    system_prompt=question_2_query_system_prompt,
    human_prompt=question_2_query_human_prompt,
)
To transform a question into a query, we use the predict method of our model, which is decorated with @weave.op():
transformed_query = question_2_query_model.predict(human_prompt_args={"question": question})['answer']
This process takes a question and transforms it into an optimized semantic search query. The @weave.op() decorator ensures that this transformation is tracked as its own operation in Weave, even though the underlying LLM call is already auto-logged.
The prompt we use here is:
question_2_query_system_prompt = """### Instruction ###
You are an expert biomedical researcher tasked with converting biomedical questions into optimized semantic search queries. Your goal is to generate queries that will retrieve the most relevant documents from the BioASQ dataset to answer the given question.

### Process ###
Follow these steps to create the semantic search query:
1. Carefully analyze the biomedical question to identify the most important keywords, concepts, and entities
2. Construct a search query using those keywords, aiming to retrieve all potentially relevant documents
3. Optimize the query by incorporating synonyms, related terms, and expanding acronyms if applicable
4. Double check that the query captures the core intent of the question and will match pertinent documents
5. Provide only the final semantic search query in your response, without any additional commentary

### Context ###
The BioASQ dataset consists of biomedical questions along with relevant documents. Your semantic search queries will be used to find the most relevant documents from this dataset to answer each question. The ideal answers have been removed, so your query should focus solely on the question text.

### Examples ###
Question: Is Hirschsprung disease a mendelian or a multifactorial disorder?
Semantic Search Query: Hirschsprung disease AND (mendelian OR multifactorial OR complex) AND (inheritance OR genetics OR genes)

Question: List signaling molecules (ligands) that interact with the receptor EGFR?
Semantic Search Query: EGFR AND (ligands OR "signaling molecules") AND (EGF OR BTC OR EPR OR HB-EGF OR TGF-α OR AREG OR EPG)

Question: Is the protein Papilin secreted?
Semantic Search Query: Papilin AND (secreted OR extracellular OR "secretory pathway")

### Evaluation ###
Your performance will be evaluated on:
- Inclusion of the most salient keywords, concepts and entities from the biomedical question
- Appropriate use of synonyms and related terms to improve retrieval
- Ability of the query to capture the full scope and intent of the question
- Overall likelihood of the query retrieving documents that can answer the question
- Adherence to the response format instructions

You MUST provide a well-constructed query that fulfills the given criteria. You will be penalized for queries that are too narrow, off-topic, or poorly formulated."""
One example:
Question: List signaling molecules (ligands) that interact with the receptor EGFR?
Semantic Search Query: EGFR AND (ligands OR "signaling molecules") AND (EGF OR BTC OR EPR OR HB-EGF OR TGF-α OR AREG OR EPG)
2. Document retrieval

The document retrieval step follows query transformation in our RAG pipeline. It utilizes a custom VectorStore class to efficiently retrieve relevant documents from the BioASQ dataset.
The VectorStore class is a weave.Object with the following key components:
- Embedding model (default: "text-embedding-3-small")
- Embedding function
- Article storage and embeddings
- Ranking method (default: cosine similarity)
The retrieval process involves:
- Embedding the transformed query
- Comparing the query embedding to pre-computed document embeddings
- Ranking documents based on the chosen similarity metric
- Returning the top N most relevant documents
vector_store = weave.ref('VectorStore:latest').get()
embedding_model = weave.ref('SentenceTransformersModel:latest').get()
vector_store.set_embedding_model(embedding_model)

relevant_docs = vector_store.get_most_relevant_documents(
    query=transformed_query,
    n=5,
    ranking_method="cosine",
)
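The VectorStore implementation itself isn't reproduced here, but a minimal sketch of what get_most_relevant_documents might do internally looks like this, assuming the articles and their pre-computed embeddings are stored as parallel lists and the embedding model exposes an embed method (names are illustrative):

import numpy as np

def get_most_relevant_documents(query, articles, article_embeddings, embedding_model, n=5):
    # Embed the transformed query with the same model used for the stored articles.
    query_emb = np.asarray(embedding_model.embed([query])[0])

    # Cosine similarity between the query and every pre-computed article embedding.
    doc_embs = np.asarray(article_embeddings)
    scores = doc_embs @ query_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb) + 1e-10
    )

    # Return the top-n articles with their similarity scores,
    # matching the doc["document"] / doc["score"] shape used later in the pipeline.
    top_idx = np.argsort(scores)[::-1][:n]
    return [{"document": articles[i], "score": float(scores[i])} for i in top_idx]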
Throughout our biomedical RAG pipeline, we utilize weave.ref to efficiently access pre-computed resources and datasets:
vector_store = weave.ref('VectorStore:latest').get()
embedding_model = weave.ref('SentenceTransformersModel:latest').get()
qap = weave.ref('QuestionAnswerPairsTrainFiltered:latest').get()
This approach offers several advantages:
- Version control: Easily access specific versions of models and datasets.
- Reproducibility: Ensure consistent results across different runs.
- Resource efficiency: Avoid redundant computations by reusing pre-computed resources.
- Flexibility: Quickly swap out components for experimentation.
By leveraging weave.ref, we streamline our workflow and maintain a modular, easily updatable RAG pipeline.
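For these references to resolve, the objects need to have been published to Weave beforehand. A rough sketch of how that might be done (the project name is illustrative, and weave.init must have been called first):

import weave

weave.init("biomedical-rag")  # illustrative project name

# Publishing an object creates a new version addressable via weave.ref('Name:latest')
weave.publish(vector_store, name="VectorStore")
weave.publish(embedding_model, name="SentenceTransformersModel")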
3. Context scoring

After retrieving potentially relevant documents, our RAG Model employs a context scoring step to further refine the selection of documents. This step ensures that only the most pertinent information is used for summarization and answer synthesis.
The context scoring process utilizes another instance of our GenericLLMModel, specifically tailored to assess the relevance of retrieved documents:
article_relevance_model = GenericLLMModel(
    system_prompt=article_relevance_system_prompt,
    human_prompt=article_relevance_human_prompt,
)
The model is designed to provide a binary "yes" or "no" answer regarding the relevance of each document to the original question. Here's how the scoring process works:
For each retrieved document, we use the article_relevance_model to predict its relevance:
for doc in _context:
    doc["relevance"] = article_relevance_model.predict(
        human_prompt_args={
            "question": question,
            "article_text": doc["document"]["passage"],
        }
    )['answer']
We then filter the documents based on their relevance scores:
relevant_context = [doc for doc in _context if doc["relevance"].lower() == "yes"]
This process ensures that only the most relevant documents are passed on to the summarization step, improving the quality and accuracy of the final answer.
The article_relevance_model uses a carefully crafted system prompt to guide its decision-making process:
article_relevance_system_prompt = """### Instruction ###
You are an expert medical researcher librarian. Your task is to determine whether articles from the BioASQ dataset may be relevant to questions from clinicians based on the articles' abstracts. You MUST provide a yes or no answer. You will be penalized for answers that are not a clear yes or no.

### Process ###
1. Carefully read the provided clinical question.
2. Analyze the given article abstract in the context of the question.
3. Determine if the abstract contains information potentially relevant to answering the question.
4. Provide a definitive yes or no answer. Do not hedge or equivocate.

### Evaluation ###
Your performance will be evaluated on:
- Ability to identify abstracts with information relevant to the clinical question
- Providing a clear, unambiguous yes or no answer
- Avoiding reliance on stereotypes or biases in your determination
- Adherence to the required answer format

You MUST provide a yes or no answer. Any other response will be penalized."""
This context scoring step is crucial for:
- Reducing noise in the input to subsequent steps
- Improving the relevance and accuracy of the final answer
- Enhancing the efficiency of the summarization process
By incorporating this step, our RAG Model can provide more focused and accurate responses to complex biomedical questions, even when dealing with a large volume of potentially relevant documents.
4. Summarization

The @weave.op() decorator on the predict method ensures that this summarization step and the synthesis step that follows are tracked and integrate into the larger RAG pipeline.
After retrieving and scoring relevant documents, the next step in our Biomedical RAG Model is summarization. This process condenses the information from multiple relevant documents into a concise summary that addresses the original question.
We use another instance of our GenericLLMModel for this task:
summarization_model = GenericLLMModel(
    system_prompt=summarization_system_prompt,
    human_prompt=summarization_human_prompt,
)
The summarization process involves first preparing the context string by joining the relevant documents:
context_str = "\n\n".join(
    [f"{doc['document']['passage']} (Score: {doc['score']})" for doc in relevant_context]
)
Then, using the summarization model to generate a summary:
summary = summarization_model.predict(human_prompt_args={"question": question, "context_str": context_str})['answer']
The summarization model uses this system prompt:
summarization_system_prompt = """### Instruction ###
You are an expert medical researcher tasked with summarizing relevant excerpts from biomedical literature to provide background information necessary to answer clinicians' questions. Your summary should be concise yet informative, capturing the key points from the provided context.

### Process ###
1. Read the provided clinical question to understand the information needed.
2. Analyze the given context, which includes excerpts from biomedical literature along with relevance scores.
3. Identify the most pertinent information from the context in relation to the question.
4. Summarize the key points from the relevant excerpts, considering their relevance scores.
5. Synthesize the individual summaries into a coherent overview addressing the question.
6. If the context is not sufficient to answer the question, indicate that more information is needed.

### Format ###
Question: <question>
Summary: <summary_of_relevant_information>
Relevant Excerpts: <excerpts_in_order_of_relevance>

### Evaluation ###
Your performance will be evaluated on:
- Ability to identify and summarize relevant information from the provided context
- Synthesis of individual excerpt summaries into a coherent overview
- Consideration of excerpt relevance scores in the final summary
- Clarity and conciseness of the summary
- Adherence to the specified response format

Provide a summary that directly addresses the given question using the most relevant excerpts from the context. If the provided context is insufficient to answer the question, state "Insufficient information to answer the question.\""""
This summarization step serves several purposes:
- Information synthesis: Combines information from multiple sources into a coherent narrative.
- Relevance filtering: Focuses on the most pertinent information related to the question.
- Conciseness: Distills scientific texts into a manageable summary.
- Context awareness: Considers the relevance scores of each excerpt, prioritizing more relevant information.
The @weave.op() decorator on the predict method ensures this summarization step is tracked and integrates into the larger RAG pipeline:
@weave.op()
def predict(self, question: str, context_str: str) -> str:
    return self.model.predict(
        human_prompt_args={"question": question, "context_str": context_str}
    )['answer']
This summarization step connects document retrieval to final answer synthesis, providing a concise overview of the relevant biomedical information. It enables the RAG model to generate contextually relevant responses to complex biomedical questions.
5. Final answer synthesis

The @weave.op() decorator on the predict method ensures this synthesis step is tracked and integrates into the larger RAG pipeline.
The final step in our biomedical RAG Model is the synthesis of a concise answer based on the summarized information. This step transforms the detailed summary into a clear, direct response to the original question.
We use another instance of our GenericLLMModel for this task:
synthesis_model = GenericLLMModel(
    system_prompt=synthesis_system_prompt,
    human_prompt=synthesis_human_prompt,
)
The synthesis process involves first passing the original question and the generated summary to the synthesis model:
answer = synthesis_model.predict(human_prompt_args={"question": question, "summary": summary})['answer']
We use this system prompt:
synthesis_system_prompt = """### Instruction ###
You are an expert medical assistant. Your task is to provide accurate, concise answers to medical questions based on summaries of relevant biomedical literature. Ensure responses are clear, informative, unbiased, and avoid stereotypes. Answer in a natural, human-like manner.

### Process ###
1. Analyze the provided question to understand the key information needed.
2. Review the summary of relevant excerpts from biomedical literature.
3. Identify the most pertinent information in the summary for answering the question.
4. Synthesize the key points into a coherent, concise answer.
5. If the summary lacks sufficient information to conclusively answer the question, state "There is insufficient information provided to conclusively answer the question."

### Format ###
Question: <question>
Answer: <final_answer_based_on_summary>

### Evaluation ###
Your performance will be evaluated on:
- Accuracy and relevance of the answer based on the provided summary
- Clarity and conciseness of the response
- Ability to identify when the summary is insufficient to conclusively answer the question
- Avoidance of bias and stereotyping
- Adherence to the specified format

Provide an answer that directly addresses the question using only the information in the summary. If the summary is insufficient, state that conclusively answering is not possible. Produce the answer in a clear, natural style."""
This synthesis step serves several purposes:
- Distillation: Condenses the detailed summary into a focused answer.
- Clarity: Ensures the response directly addresses the original question.
- Consistency: Maintains alignment with the information provided in the summary.
- Uncertainty Handling: Acknowledges when insufficient information is available for a conclusive answer.
This final synthesis step completes the RAG pipeline, producing a concise, relevant answer to the original biomedical question. It leverages the context and summarization from previous steps to generate a response that is both informative and tailored to the specific query.
Evaluating our biomedical RAG model
Defining our BioASQAdvancedRAGModel experiment

To create our experiment, we define a BioASQAdvancedRAGModel class that inherits from our RAGModel base class (itself a weave.Model). This class encapsulates all the components of our RAG pipeline into a single, cohesive model:
class BioASQAdvancedRAGModel(RAGModel):
    def __init__(self, vector_store, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.vector_store = vector_store

    @weave.op()
    def score_context(self, _context, question) -> None:
        for doc in _context:
            doc["relevance"] = article_relevance_model.predict(
                human_prompt_args={
                    "question": question,
                    "article_text": doc["document"]["passage"],
                }
            )['answer']

    @weave.op()
    def predict(self, question: str, n_documents: int = 5) -> dict:
        # Query transformation
        transformed_query = question_2_query_model.predict(
            human_prompt_args={"question": question}
        )['answer']

        # Document retrieval
        _context = self.vector_store.get_most_relevant_documents(
            query=transformed_query, n=n_documents
        )

        # Context scoring
        self.score_context(_context, question)
        relevant_context = [doc for doc in _context if doc["relevance"].lower() == "yes"]
        if len(relevant_context) == 0:
            relevant_context = [_context[0]]

        # Summarization
        context_str = "\n\n".join(
            [f"{doc['document']['passage']} (Score: {doc['score']})" for doc in relevant_context]
        )
        summary = summarization_model.predict(
            human_prompt_args={"question": question, "context_str": context_str}
        )['answer']

        # Final answer synthesis
        answer = synthesis_model.predict(
            human_prompt_args={"question": question, "summary": summary}
        )['answer']

        return {
            "answer": answer,
            "context": [doc["document"]["passage"] for doc in relevant_context],
            "all_context": _context,
        }
This predict method combines all the steps we've discussed earlier:
- Query transformation
- Document retrieval
- Context scoring
- Summarization
- Final answer synthesis
Each step is implemented using the appropriate model (e.g., question_2_query_model, summarization_model, synthesis_model) that we defined earlier in our pipeline.
Setting up the evaluation
To evaluate our model, we need to set up the evaluation framework:
# Create an instance of our RAG model
rag_model = BioASQAdvancedRAGModel(vector_store=vector_store)

# Load the evaluation dataset
qap = weave.ref('QuestionAnswerPairsTrainFiltered:latest').get()
sub_qap = qap.rows[:10]  # Using first 10 questions for this example

# Define evaluation metrics
from weave_example_demo.scorers.llm_guard_scorer import LLMGuardScorer
from weave_example_demo.scorers.tonic_validate_scorer import TonicValidateScorer

scorers = [
    TonicValidateScorer(
        metrics=[
            "AnswerSimilarityMetric",
            "AugmentationPrecisionMetric",
            "AnswerConsistencyMetric",
        ]
    ),
    LLMGuardScorer(metrics=["NoRefusal", "Relevance", "Sensitive"]),
]
Running the evaluation

With our model and evaluation setup in place, we can now run the evaluation:
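The exact runner code isn't shown in the original report; a minimal sketch using Weave's Evaluation API might look like this, assuming the scorers above are Weave-compatible scorer objects:

import asyncio

# Build a Weave evaluation over our 10-question subset with the scorers defined above.
evaluation = weave.Evaluation(dataset=sub_qap, scorers=scorers)

# Evaluation.evaluate is async; it calls rag_model.predict on each example
# and applies every scorer to the model's output.
results = asyncio.run(evaluation.evaluate(rag_model))
print(results)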
This evaluation process will:
- Run our BioASQAdvancedRAGModel on each question in the subset of the BioASQ dataset.
- Apply each of the defined metrics to the model's outputs.
- Aggregate the results to provide an overall performance assessment.
Interpreting the results
After running the evaluation, we can analyze the results to understand our model's performance:
- Answer similarity: This metric tells us how close our model's answers are to the ground truth answers in the BioASQ dataset.
- Augmentation precision: This measures whether the relevant retrieved context actually makes it into the generated answer.
- Answer consistency: This checks whether the information in the answer is grounded in the retrieved context rather than introduced from elsewhere.
- No refusal: This ensures our model attempts to answer all valid questions without unnecessary refusals.
- Relevance: This metric assesses how well our model's answers actually address the given questions.
- Sensitive information: This checks if our model inadvertently includes any sensitive or inappropriate information in its responses.
By analyzing these metrics, we can identify strengths and weaknesses in our RAG pipeline. For example, if we see low scores in answer similarity but high scores in relevance, it might indicate that our model is providing relevant information but not in the exact format expected by the BioASQ dataset.
Iterative improvement
Based on the evaluation results, we can iteratively improve our model:
- If answer similarity is low, we might need to fine-tune our summarization or synthesis models.
- If augmentation precision is low, we could improve our document retrieval or context scoring steps.
- If answer consistency is an issue, we might need to adjust the temperature settings in our language models.
- Low scores in the LLMGuard metrics (NoRefusal, Relevance, Sensitive) might require adjustments to our prompts or the addition of safety checks in our pipeline.
By continually evaluating and refining our BioASQAdvancedRAGModel, we can create a more accurate, reliable, and safe system for answering complex biomedical questions.
Serving the model on Streamlit

To make our model accessible, we can create a simple Streamlit app. Here's a minimal implementation:
import streamlit as st
import weave

# Load the RAG model
@st.cache_resource
def load_rag_model():
    return weave.ref('BioASQAdvancedRAGModel:latest').get()

rag_model = load_rag_model()

st.title("Biomedical Question Answering")

# User input
question = st.text_input("Enter your biomedical question:")

if question:
    with st.spinner("Generating answer..."):
        # Get response from the model
        response = rag_model.predict(question)

    # Display the answer
    st.subheader("Answer:")
    st.write(response['answer'])

    # Display relevant context
    st.subheader("Relevant Context:")
    for context in response['context']:
        st.write(context)
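Assuming this file is saved as app.py and Weave has been initialized against the right project, the app can be launched locally with streamlit run app.py.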
This minimal app:
- Loads the RAG model using weave.ref
- Provides a text input for the user's question
- Generates a response using the model
- Displays the answer and relevant context
The use of weave.ref allows for easy model versioning and deployment. By changing the reference, we can quickly update the model without modifying the app code.
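For example, the app could be pinned to a specific published version instead of the moving latest alias (the version label here is illustrative):

rag_model = weave.ref('BioASQAdvancedRAGModel:v3').get()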
This Streamlit interface provides a user-friendly way for researchers and clinicians to interact with our Biomedical RAG Model, quickly obtaining relevant information from biomedical literature.
Conclusion
Here's one more example of our app in action:
Question: Which animal bite can cause Capnocytophaga canimorsus infection?
Ground Truth: Capnocytophaga canimorsus infection is typically associated with dog bites, especially in asplenic or immunocompromised patients, and typically manifests as sepsis and/or bacteremia.
Response: The animal bite that can cause Capnocytophaga canimorsus infection is from dogs. Capnocytophaga canimorsus is a commensal bacterium found in dogs' mouths, and it can lead to septicemia or meningitis in humans through bites or scratches.
Context: Capnocytophaga canimorsus, a commensal bacterium from dogs' mouths, can cause septicemia or meningitis in humans through bites or scratches. Here, we describe and characterize the inflammatory response of human and mouse macrophages on C. canimorsus infection. Macrophages infected with 10 different strains failed to release tumor necrosis factor (TNF)-alpha and interleukin (IL)-1 alpha. Macrophages infected with live and heat-killed (HK) C. canimorsus 5 (Cc5), a strain isolated from a patient with fatal septicemia, did not release IL-6, IL-8, interferon-gamma, macrophage inflammatory protein-1 beta, and nitric oxide (NO). This absence of a proinflammatory response was characterized by the inability of Toll-like receptor (TLR) 4 to respond to Cc5. Moreover, live but not HK Cc5 blocked the release of TNF-alpha and NO induced by HK Yersinia enterocolitica. In addition, live Cc5 down-regulated the expression of TLR4 and dephosphorylated p38 mitogen-activated protein kinase. These results highlight passive and active mechanisms of immune evasion by C. canimorsus, which may explain its capacity to escape from the host immune system.
Our biomedical RAG model, built on the foundation of Snowflake Arctic and the BioASQ dataset, demonstrates a practical approach to addressing the challenges of biomedical information retrieval and synthesis. Snowflake Arctic's advanced architecture, combining dense and mixture-of-experts layers, provides the computational power needed for complex biomedical reasoning while maintaining efficiency. This allows our model to handle the nuances of medical terminology and concepts with high accuracy and speed.
By breaking down the process into distinct, trackable steps - query transformation, document retrieval, context scoring, summarization, and final answer synthesis - we've created a pipeline that can efficiently handle complex biomedical questions.
The use of Weave throughout our implementation offers several advantages:
- Versioning and reproducibility of models and datasets
- Efficient tracking of each operation in the pipeline
- Flexibility to swap out components for experimentation
Our evaluation framework, combining metrics from TonicValidate and LLMGuard, provides a comprehensive assessment of the model's performance. This multi-faceted approach allows us to gauge not just the accuracy of answers, but also their relevance, consistency, and adherence to safety guidelines.
However, it's important to note that this implementation is just a starting point. There's significant room for improvement and experimentation:
- Fine-tuning each component of the pipeline for biomedical specificity
- Exploring more advanced retrieval methods, perhaps incorporating biomedical ontologies
- Investigating the impact of different embedding models on retrieval performance
- Optimizing the balance between model complexity and inference speed
As we continue to refine this system, we must keep in mind the ultimate goal: providing clinicians with quick, accurate, and relevant information to support patient care. The rapid pace of biomedical research makes this a challenging but crucial task, one that our RAG model is well-positioned to address.
By iterating on this model, incorporating feedback from medical professionals, and staying abreast of advancements in both machine learning and biomedical research, we can work towards a system that truly enhances the ability of healthcare providers to access and utilize the vast wealth of biomedical knowledge available.