
Building an AI teacher's assistant using LlamaIndex and Groq

Today, we're going to leverage a RAG pipeline to create an AI TA capable of helping out with grading, questions about a class syllabus, and more

🎬 Introduction

Grading and evaluating students' assignments and tests is necessary work, but it takes up a considerable chunk of many teachers' time and focus.
In this report, we'll try to help solve that. Specifically, we'll build a chatbot to assist a teacher with grading student assignments and help clarify student questions regarding the syllabus and study materials using a retrieval augmented generation (RAG) pipeline.
We'll build our RAG pipeline with these tools:
  • LlamaIndex as the data framework for building the RAG application
  • Groq as an LLM vendor
  • Instructor to get structured output with a consistent schema from our LLM
  • Weave for tracking and evaluating LLM applications
Our RAG pipeline will act as an assistant to an English teacher. Our project will center on teaching prose from the English textbook Flamingo, which is part of the Central Board of Secondary Education's (CBSE) English syllabus for Class XII in India, but similar techniques could be used with a wide variety of curricula. Lastly, we'll use a question bank of question-answer pairs to evaluate our assistant.
An example of comparing evaluation traces using Weave


Building the Vector Store Index

RAG pipelines let LLMs pull information and facts from external sources they weren't necessarily trained on. In a RAG pipeline, indexing vector embeddings plays a crucial role in efficiently and effectively finding relevant information to augment the generation process.
Indexing vector embeddings allows the system to quickly search through potentially large datasets to find the most relevant content. This is typically done using specialized algorithms and data structures, such as k-nearest neighbors (k-NN) or locality-sensitive hashing, that are optimized for high-dimensional vector spaces.
The system can capture semantic similarities using vector embeddings rather than relying on surface-level text matches. This means that the retrieval component can fetch documents or data that are contextually similar to the input query, even if they do not share exact keywords. This enhances the relevance of the information used in the generation process.
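To make the retrieval idea concrete, here's a minimal, framework-free sketch of what "find the top-k closest vectors" means, using NumPy and hypothetical placeholder embeddings. LlamaIndex handles this step, along with more scalable approximate variants, for us:

import numpy as np

def top_k_indices(query_vec: np.ndarray, index_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k index vectors most similar to the query (cosine)."""
    # Normalize rows so a dot product equals cosine similarity
    index_norm = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    similarities = index_norm @ query_norm
    # Highest similarity first
    return np.argsort(similarities)[::-1][:k]

# Illustrative usage with random stand-in embeddings
chunk_embeddings = np.random.rand(1000, 384)  # e.g., 1,000 chunks, 384-dim embeddings
query_embedding = np.random.rand(384)
print(top_k_indices(query_embedding, chunk_embeddings, k=5))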
Let's jump into the code and index the vector embeddings from our documents 👇

Loading the Data using LlamaIndex
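The embedded panel above holds the full data-loading code. As a rough sketch of what that step looks like with a recent LlamaIndex release (the local data/ directory and the BAAI/bge-small-en-v1.5 embedding model below are assumptions for illustration, not necessarily what the panel uses):

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Embed chunks locally with a sentence-transformer model (an assumption;
# any embedding model supported by LlamaIndex works here)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Load the Flamingo chapter text files (assumed to live in a local `data/` directory)
documents = SimpleDirectoryReader("data").load_data()

# Chunk the documents, embed each chunk, and build an in-memory vector store index
index = VectorStoreIndex.from_documents(documents, show_progress=True)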


🍀 Building the query engine

In this section, we will build a query engine with two components:
  • The retriever is responsible for selecting relevant information from a large external knowledge base (typically, a vector store index). The retriever first takes in a query and vectorizes it using an embedding model (the same model used to create the vector store index). It then computes the cosine distance between the query vector and the index (a list of vectors) and picks the top k vectors from the index. These top k vectors represent the chunks that are closest (most relevant) to the query.
  • The generator model, typically a large language model, will use these chunks as context to generate the final response to the query.

LlamaIndex query engine using a GroqCloud LLM.
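The panel above contains the full query engine code. A minimal sketch of the same idea, assuming a Groq-hosted Llama 3 model (the top-k value and the example query are illustrative; the retreival_engine name matches the retriever the assistants below call directly):

import os

from llama_index.llms.groq import Groq as GroqLLM

# Retriever: embeds the query and returns the top-k most similar chunks
retreival_engine = index.as_retriever(similarity_top_k=2)

# Generator: a Groq-hosted Llama 3 model that answers using the retrieved chunks
llm = GroqLLM(model="llama3-8b-8192", api_key=os.environ.get("GROQ_API_KEY"))
query_engine = index.as_query_engine(llm=llm, similarity_top_k=2)

# Illustrative query against the indexed textbook
response = query_engine.query(
    "What was the mood in the classroom during M. Hamel's last French lesson?"
)
print(response)

We keep a handle on the retriever separately from the query engine because the task-specific assistants later in this report call retreival_engine.retrieve(question) directly to fetch context for their prompts.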


🤖 Building task-specific assistants using prompt engineering

Now that we have a functional RAG pipeline, let's use some basic prompt engineering to make it a little more helpful. We need our teaching assistant to be able to perform the following tasks:
  • Emulate the ideal response of a student to a question
  • Conversely, emulate the teacher's response to a question from a student
  • Help the teacher grade the answer given by a student to a question
Here's how we get that done:



🧑🏼‍🏫 Building a simple assistant to answer student questions

We're going to use weave.Model to write our assistants. A weave.Model is a combination of data (including configuration, trained model weights, or other information) and code that defines the model's operation. By structuring your code to be compatible with this API, you benefit from a structured way to version your application so you can more systematically keep track of your experiments.
Let's use a simple prompt template to build an assistant to help clarify students' questions about any point in the textbook.
import os
from typing import Optional

import weave
from groq import Groq


class EnglishDoubtClearningAssistant(weave.Model):
    model: str = "llama3-8b-8192"
    _groq_client: Optional[Groq] = None

    def __init__(self, model: Optional[str] = None):
        super().__init__()
        self.model = model if model is not None else self.model
        self._groq_client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

    @weave.op()
    def get_prompts(self, question: str, context: str):
        system_prompt = """
You are a student in a class and your teacher has asked you to answer the following question.
You have to write the answer in the given word limit."""
        user_prompt = f"""
We have provided context information below.

---
{context}
---

Answer the following question within 50-150 words:

---
{question}
---
"""
        return system_prompt, user_prompt

    @weave.op()
    def predict(self, question: str, context: str):
        system_prompt, user_prompt = self.get_prompts(question, context)
        chat_completion = self._groq_client.chat.completions.create(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            model=self.model,
        )
        return chat_completion.choices[0].message.content


weave.init(project_name="groq-rag")
assistant = EnglishDoubtClearningAssistant()
# `query` and `context` come from the retrieval step shown earlier
assistant.predict(question=query, context=context)


The English Doubt Clearing Assistant versioned and traced on Weave


🙋🏻‍♀️ Building an assistant for generating student responses

Let's use another simple prompt template to build a student response-generating assistant that generates an ideal answer to a question depending on the total marks that can be awarded for it. The EnglishStudentResponseAssistant takes in the question and the maximum marks that can be awarded for the answer, and returns an ideal answer of an appropriate length.
from typing import Tuple


class EnglishStudentResponseAssistant(weave.Model):
    model: str = "llama3-8b-8192"
    _groq_client: Optional[Groq] = None

    def __init__(self, model: Optional[str] = None):
        super().__init__()
        self.model = model if model is not None else self.model
        self._groq_client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

    @weave.op()
    def get_prompt(
        self, question: str, context: str, word_limit_min: int, word_limit_max: int
    ) -> Tuple[str, str]:
        system_prompt = """
You are a student in a class and your teacher has asked you to answer the following question.
You have to write the answer in the given word limit."""
        user_prompt = f"""
We have provided context information below.

---
{context}
---

Answer the following question within {word_limit_min}-{word_limit_max} words:

---
{question}
---"""
        return system_prompt, user_prompt

    @weave.op()
    def predict(self, question: str, total_marks: int) -> str:
        # Fetch the most relevant chunk from the vector store index
        response = retreival_engine.retrieve(question)
        context = response[0].node.text
        # Map the total marks for the question to an expected answer length
        if total_marks < 3:
            word_limit_min, word_limit_max = 5, 50
        elif total_marks < 5:
            word_limit_min, word_limit_max = 50, 100
        else:
            word_limit_min, word_limit_max = 100, 200
        system_prompt, user_prompt = self.get_prompt(
            question, context, word_limit_min, word_limit_max
        )
        chat_completion = self._groq_client.chat.completions.create(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            model=self.model,
        )
        return chat_completion.choices[0].message.content


The Student Response Generation Assistant versioned and traced on Weave


👨🏻‍🏫 Building a grading assistant

To get a holistic evaluation from our assistant, we must structure the LLM response into a consistent schema like a pydantic.BaseModel. To achieve this, we will use the Instructor library with our LLM.
Let's first install Instructor using:
pip install -U instructor
Next, we will use another simple prompt template to build an answer grading assistant.
import instructor
from pydantic import BaseModel


# Schema for the structured grading output extracted by Instructor
class GradeExtractor(BaseModel):
    question: str
    student_answer: str
    marks: float
    total_marks: float
    feedback: str


class EnglishGradingAssistant(EnglishStudentResponseAssistant):
    model: str = "llama3-8b-8192"
    _groq_client: Optional[Groq] = None
    _instructor_groq_client: Optional[instructor.Instructor] = None

    def __init__(self, model: Optional[str] = None):
        super().__init__(model=model)
        self.model = model if model is not None else self.model
        self._instructor_groq_client = instructor.from_groq(
            Groq(api_key=os.environ.get("GROQ_API_KEY"))
        )

    @weave.op()
    def get_prompt_for_grading(
        self,
        question: str,
        context: str,
        total_marks: int,
        student_answer: Optional[str] = None,
    ) -> Tuple[str, str]:
        system_prompt = """
You are a helpful assistant to an English teacher meant to grade the answer given by a student to a question.
You have to extract the question, the student's answer, the marks awarded to the student out of total marks,
the total marks and constructive feedback on the student's answer with regards to how accurate it is with
respect to the context.
"""
        # If no student answer is provided, generate an ideal one to grade
        student_answer = (
            self.predict(question, total_marks)
            if student_answer is None
            else student_answer
        )
        user_prompt = f"""
We have provided context information below.

---
{context}
---

We have asked the following question to the student for total_marks={total_marks}:

---
{question}
---

The student has responded with the following answer:

---
{student_answer}
---"""
        return user_prompt, system_prompt

    @weave.op()
    def grade_answer(
        self, question: str, student_answer: str, total_marks: int
    ) -> GradeExtractor:
        # Retrieve the relevant chunk so the grader can check the answer against it
        response = retreival_engine.retrieve(question)
        context = response[0].node.text
        user_prompt, system_prompt = self.get_prompt_for_grading(
            question=question,
            context=context,
            total_marks=total_marks,
            student_answer=student_answer,
        )
        return self._instructor_groq_client.chat.completions.create(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            model=self.model,
            response_model=GradeExtractor,
        )


assistant = EnglishGradingAssistant()
# `ideal_student_response` is the answer generated earlier by the student response assistant
grade = assistant.grade_answer(
    question=query,
    student_answer=ideal_student_response,
    total_marks=5,
)


The English Answer Grading Assistant versioned and traced on Weave


🧬 Building an evaluation pipeline

To iterate on any AI application, we need a way to systematically evaluate its performance and check whether it's improving. A common practice is to test it against the same set of examples whenever something changes. In this recipe, we will build an evaluation pipeline to evaluate the responses of our AI assistant using weave.Evaluation, a flexible API that gives us a first-class way to track evaluations.

⚗️ Building an evaluation dataset

We built an evaluation dataset by scraping a question bank of solved question-answer pairs of the Flamingo textbook from LearnCBSE. The dataset consists of 358 question-answer pairs corresponding to the eight chapters from our knowledge base dataset.
We log this dataset as a weave.Dataset, which enables us to collect evaluation examples and automatically track versions for accurate comparisons. The dataset consists of examples in the following format:
{
    "question": "What was the mood in the classroom when M. Hamel gave his last French lesson?",
    "answer": "When M. Hamel was giving his last French lesson, the mood in the classroom was solemn and sombre. When he announced that this was their last French lesson, everyone present in the classroom suddenly developed patriotic feelings for their native language and genuinely regretted ignoring their mother tongue.",
    "marks": "3-4",
    "chapter_name": "The Last Lesson"
}
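Publishing the scraped examples to Weave might look like the following sketch, where rows is assumed to be the list of dictionaries in the format shown above:

import weave

weave.init(project_name="groq-rag")  # initialize the Weave project if not already done

# `rows` is assumed to be the list of scraped question-answer dictionaries
# in the format shown above
dataset = weave.Dataset(name="flamingos-prose-question-bank", rows=rows)
weave.publish(dataset)

Once published, the dataset is versioned automatically, which is why we can later fetch a specific version with weave.ref("flamingos-prose-question-bank:v1").get().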


Exploring the evaluation dataset using the Weave UI


👨🏽‍⚖️ Evaluating the assistant using an LLM Judge

One approach to evaluating an LLM application is to use another LLM as a judge. This recipe demonstrates a simple example of an LLM judge implemented as a weave.Scorer: we prompt the judge to check that the AI assistant's response is relevant to the retrieved context and to assess how well it holds up against the ground-truth answer from the evaluation dataset, so that application responses can be scored automatically.
import os
from typing import Dict, Optional

import instructor
import weave
from openai import OpenAI
from pydantic import BaseModel


# The pydantic object representing the LLM judge's structured response
class JudgeResponse(BaseModel):
    marks: float
    explanation: str


# The LLM judge model
class OpenaAIJudgeModel(weave.Scorer):
    model: str = "gpt-4"
    max_retries: int = 5
    _openai_client: Optional[instructor.Instructor] = None

    def __init__(self, model: Optional[str] = None):
        super().__init__()
        self.model = model if model is not None else self.model
        self._openai_client = instructor.from_openai(
            OpenAI(api_key=os.environ.get("OPENAI_API_KEY")),
            mode=instructor.Mode.TOOLS,
        )

    @weave.op()
    def compose_judgement(
        self,
        question: str,
        context: str,
        ground_truth_answer: str,
        assistant_answer: str,
        total_marks: int,
    ) -> JudgeResponse:
        system_prompt = """
You are an expert teacher of English language and literature.
Given a question, a context, a ground truth answer and an answer from an AI assistant,
you have to judge the assistant's answer based on the following criteria and assign
a score between 0 and total marks:

1. how well the assistant answers the question with respect to the context.
2. how well the assistant's answer holds up in correctness and relevance to
the ground truth answer (assuming the ground truth answer is perfect).

You have to extract the marks to be awarded to the assistant's answer and a detailed
explanation as to how the assistant's answer was judged."""
        user_prompt = f"""
We have asked the following question to an AI assistant for total marks of {total_marks}:

---
{question}
---

We have provided context information below.

---
{context}
---

The AI assistant has responded with the following answer:

---
{assistant_answer}
---

An ideal answer to the question would be the following:

---
{ground_truth_answer}
---"""
        return self._openai_client.chat.completions.create(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            max_retries=self.max_retries,
            model=self.model,
            response_model=JudgeResponse,
        )

    @weave.op()
    def score(
        self,
        question: str,
        answer: str,
        marks: str,
        model_output: Dict[str, str],
    ) -> Dict[str, float]:
        # The question bank stores marks as ranges like "3-4" or "5-6";
        # map them to a single total for scoring
        if marks == "3-4":
            total_marks = 4
        elif marks == "5-6":
            total_marks = 6
        else:
            total_marks = 4
        judge_response = self.compose_judgement(
            question=question,
            context=model_output["context"],
            ground_truth_answer=answer,
            assistant_answer=model_output["response"],
            total_marks=total_marks,
        )
        if not hasattr(judge_response, "marks"):
            return {"marks": 0.0, "fractional_marks": 0.0, "percentage": 0.0}
        return {
            "marks": judge_response.marks,
            "fractional_marks": judge_response.marks / total_marks,
            "percentage": (judge_response.marks / total_marks) * 100,
        }
Finally, let us put everything together and evaluate our LLM assistant using weave.Evaluation.
assistant = EnglishStudentResponseAssistant()


# We write an inference function for the evaluation process to match the
# function signature with the schema of the dataset.
@weave.op()
async def get_assistant_prediction(question: str, marks: str):
    # Map the "3-4"/"5-6" marks ranges from the dataset to a single total
    if marks == "3-4":
        total_marks = 4
    elif marks == "5-6":
        total_marks = 6
    else:
        total_marks = 4
    # Return both the response and the retrieved context so the judge's
    # `score` function can look them up in `model_output`
    response = retreival_engine.retrieve(question)
    context = response[0].node.text
    return {
        "response": assistant.predict(question, total_marks),
        "context": context,
    }


# Get the weave dataset
dataset = weave.ref("flamingos-prose-question-bank:v1").get()

# Define the evaluation
evaluation = weave.Evaluation(dataset=dataset, scorers=[OpenaAIJudgeModel()])

# Evaluate the inference function
await evaluation.evaluate(get_assistant_prediction)


Exploring the evaluation traces on Weave UI



Comparing Evaluations in the Weave UI


🏁 Conclusion

  • We've learned how to build an LLM application using frameworks like LlamaIndex and Instructor.
  • We've learned how to use GroqCloud as an LLM vendor for our LLM application.
  • We've also learned how to build observability into different steps of our applications using weave.op().
  • We've also learned how to build more complex scoring functions, like an LLM judge, to evaluate application responses automatically.

📕 Further Resources

We have a free prompt engineering course to help you think about how to structure your prompts. Also, feel free to check out our other reports to learn more about developing LLM applications.
