
Building an AI teacher's assistant using LlamaIndex and Groq

Today, we're going to leverage a RAG pipeline to create an AI TA capable of helping out with grading, questions about a class syllabus, and more

🎬 Introduction

Grading and evaluating students' assignments and tests is necessary work, but it takes up a considerable chunk of many teachers' time and focus.
In this report, we'll try to help solve that. Specifically, we'll build a chatbot to assist a teacher with grading student assignments and help clarify student questions regarding the syllabus and study materials using a retrieval augmented generation (RAG) pipeline.
We'll build our RAG pipeline with these tools:
  • LlamaIndex as the data framework for building the RAG application
  • Groq as an LLM vendor
  • Instructor to get structured output with a consistent schema from our LLM
  • Weave for tracking and evaluating LLM applications
Our RAG pipeline will act as an assistant to an English teacher. Our project will center on teaching prose from the English textbook Flamingo, which is part of the Central Board of Secondary Education's (CBSE) English syllabus for Class XII in India, but similar techniques could be used with a wide variety of curricula. Lastly, we'll use a question bank of question-answer pairs to evaluate our assistant.
An example of comparing evaluation traces using Weave


Building the Vector Store Index

RAG pipelines let LLMs pull information and facts from external sources they weren't necessarily trained on. In a RAG pipeline, indexing vector embeddings plays a crucial role in efficiently and effectively finding relevant information to augment the generation process.
Indexing vector embeddings allows the system to quickly search through potentially large datasets to find the most relevant content. This is typically done using specialized algorithms and data structures, such as k-nearest neighbors (k-NN) or locality-sensitive hashing, that are optimized for high-dimensional vector spaces.
The system can capture semantic similarities using vector embeddings rather than relying on surface-level text matches. This means that the retrieval component can fetch documents or data that are contextually similar to the input query, even if they do not share exact keywords. This enhances the relevance of the information used in the generation process.
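To make the retrieval idea concrete, here's a minimal, framework-free sketch of what "find the top-k closest vectors" means, using NumPy and hypothetical placeholder embeddings. LlamaIndex handles this step, along with more scalable approximate variants, for us:

import numpy as np

def top_k_indices(query_vec: np.ndarray, index_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k index vectors most similar to the query (cosine)."""
    # Normalize rows so a dot product equals cosine similarity
    index_norm = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    similarities = index_norm @ query_norm
    # Highest similarity first
    return np.argsort(similarities)[::-1][:k]

# Illustrative usage with random stand-in embeddings
chunk_embeddings = np.random.rand(1000, 384)  # e.g., 1,000 chunks, 384-dim embeddings
query_embedding = np.random.rand(384)
print(top_k_indices(query_embedding, chunk_embeddings, k=5))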
Let's jump into the code and index the vector embeddings from our documents 👇

Loading the Data using LlamaIndex
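The embedded panel above holds the full data-loading code. As a rough sketch of what that step looks like with a recent LlamaIndex release (the local data/ directory and the BAAI/bge-small-en-v1.5 embedding model below are assumptions for illustration, not necessarily what the panel uses):

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Embed chunks locally with a sentence-transformer model (an assumption;
# any embedding model supported by LlamaIndex works here)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Load the Flamingo chapter text files (assumed to live in a local `data/` directory)
documents = SimpleDirectoryReader("data").load_data()

# Chunk the documents, embed each chunk, and build an in-memory vector store index
index = VectorStoreIndex.from_documents(documents, show_progress=True)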


🍀 Building the query engine

In this section, we will build a query engine with two components:
  • The retriever is responsible for selecting relevant information from a large external knowledge base (typically, a vector store index). The retriever first takes in a query and vectorizes it using an embedding model (the same model used to create the vector store index). It then computes the cosine distance between the query vector and the index (a list of vectors) and picks the top k vectors from the index. These top k vectors represent the chunks that are closest (most relevant) to the query.
  • The generator model, typically a large language model, will use these chunks as context to generate the final response to the query.

LlamaIndex query engine using a GroqCloud LLM.
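The panel above contains the full query engine code. A minimal sketch of the same idea, assuming a Groq-hosted Llama 3 model (the top-k value and the example query are illustrative; the retreival_engine name matches the retriever the assistants below call directly):

import os

from llama_index.llms.groq import Groq as GroqLLM

# Retriever: embeds the query and returns the top-k most similar chunks
retreival_engine = index.as_retriever(similarity_top_k=2)

# Generator: a Groq-hosted Llama 3 model that answers using the retrieved chunks
llm = GroqLLM(model="llama3-8b-8192", api_key=os.environ.get("GROQ_API_KEY"))
query_engine = index.as_query_engine(llm=llm, similarity_top_k=2)

# Illustrative query against the indexed textbook
response = query_engine.query(
    "What was the mood in the classroom during M. Hamel's last French lesson?"
)
print(response)

We keep a handle on the retriever separately from the query engine because the task-specific assistants later in this report call retreival_engine.retrieve(question) directly to fetch context for their prompts.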


🤖 Building task-specific assistants using prompt engineering

Now that we have a functional RAG pipeline, let's use some basic prompt engineering to make it a little more helpful. We need our teaching assistant to be able to perform the following tasks:
  • Emulate the ideal response of a student to a question
  • Conversely, emulate the teacher's response to a question from a student
  • Help the teacher grade the answer given by a student to a question
Here's how we get that done:



🧑🏼‍🏫 Building a simple assistant to answer student questions

We're going to use weave.Model to write our assistants. A weave.Model is a combination of data (including configuration, trained model weights, or other information) and code that defines the model's operation. By structuring your code to be compatible with this API, you benefit from a structured way to version your application so you can more systematically keep track of your experiments.
Let's use a simple prompt template to build an assistant to help clarify students' questions about any point in the textbook.
import os
from typing import Optional

import weave
from groq import Groq


class EnglishDoubtClearningAssistant(weave.Model):
    model: str = "llama3-8b-8192"
    _groq_client: Optional[Groq] = None

    def __init__(self, model: Optional[str] = None):
        super().__init__()
        self.model = model if model is not None else self.model
        self._groq_client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

    @weave.op()
    def get_prompts(self, question: str, context: str):
        system_prompt = """
You are a student in a class and your teacher has asked you to answer the following question.
You have to write the answer in the given word limit."""
        user_prompt = f"""
We have provided context information below.

---
{context}
---

Answer the following question within 50-150 words:

---
{question}
---
"""
        return system_prompt, user_prompt

    @weave.op()
    def predict(self, question: str, context: str):
        system_prompt, user_prompt = self.get_prompts(question, context)
        chat_completion = self._groq_client.chat.completions.create(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            model=self.model,
        )
        return chat_completion.choices[0].message.content


weave.init(project_name="groq-rag")
assistant = EnglishDoubtClearningAssistant()
# `query` and `context` come from the retrieval step shown earlier
assistant.predict(question=query, context=context)


The English Doubt Clearing Assistant versioned and traced on Weave


🙋🏻‍♀️ Building an assistant for generating student responses

Let's use another simple prompt template to build a student response-generating assistant that generates an ideal answer to a question depending on the total marks that can be awarded for it. The EnglishStudentResponseAssistant takes in the question and the maximum marks that can be awarded for the answer, and returns an ideal answer of an appropriate length.
from typing import Tuple


class EnglishStudentResponseAssistant(weave.Model):
    model: str = "llama3-8b-8192"
    _groq_client: Optional[Groq] = None

    def __init__(self, model: Optional[str] = None):
        super().__init__()
        self.model = model if model is not None else self.model
        self._groq_client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

    @weave.op()
    def get_prompt(
        self, question: str, context: str, word_limit_min: int, word_limit_max: int
    ) -> Tuple[str, str]:
        system_prompt = """
You are a student in a class and your teacher has asked you to answer the following question.
You have to write the answer in the given word limit."""
        user_prompt = f"""
We have provided context information below.

---
{context}
---

Answer the following question within {word_limit_min}-{word_limit_max} words:

---
{question}
---"""
        return system_prompt, user_prompt

    @weave.op()
    def predict(self, question: str, total_marks: int) -> str:
        # Fetch the most relevant chunk from the vector store index
        response = retreival_engine.retrieve(question)
        context = response[0].node.text
        # Map the total marks for the question to an expected answer length
        if total_marks < 3:
            word_limit_min, word_limit_max = 5, 50
        elif total_marks < 5:
            word_limit_min, word_limit_max = 50, 100
        else:
            word_limit_min, word_limit_max = 100, 200
        system_prompt, user_prompt = self.get_prompt(
            question, context, word_limit_min, word_limit_max
        )
        chat_completion = self._groq_client.chat.completions.create(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            model=self.model,
        )
        return chat_completion.choices[0].message.content


The Student Response Generation Assistant versioned and traced on Weave


👨🏻‍🏫 Building a grading assistant

To get a holistic evaluation from our assistant, we must structure the LLM response into a consistent schema like a pydantic.BaseModel. To achieve this, we will use the Instructor library with our LLM.
Let's first install Instructor using:
pip install -U instructor
Next, we will use another simple prompt template to build an answer grading assistant.
import instructor
from pydantic import BaseModel


# Schema for the structured grading output extracted by Instructor
class GradeExtractor(BaseModel):
    question: str
    student_answer: str
    marks: float
    total_marks: float
    feedback: str


class EnglishGradingAssistant(EnglishStudentResponseAssistant):
    model: str = "llama3-8b-8192"
    _groq_client: Optional[Groq] = None
    _instructor_groq_client: Optional[instructor.Instructor] = None

    def __init__(self, model: Optional[str] = None):
        super().__init__(model=model)
        self.model = model if model is not None else self.model
        self._instructor_groq_client = instructor.from_groq(
            Groq(api_key=os.environ.get("GROQ_API_KEY"))
        )

    @weave.op()
    def get_prompt_for_grading(
        self,
        question: str,
        context: str,
        total_marks: int,
        student_answer: Optional[str] = None,
    ) -> Tuple[str, str]:
        system_prompt = """
You are a helpful assistant to an English teacher meant to grade the answer given by a student to a question.
You have to extract the question, the student's answer, the marks awarded to the student out of total marks,
the total marks and constructive feedback on the student's answer with regards to how accurate it is with
respect to the context.
"""
        # If no student answer is provided, generate an ideal one to grade
        student_answer = (
            self.predict(question, total_marks)
            if student_answer is None
            else student_answer
        )
        user_prompt = f"""
We have provided context information below.

---
{context}
---

We have asked the following question to the student for total_marks={total_marks}:

---
{question}
---

The student has responded with the following answer:

---
{student_answer}
---"""
        return user_prompt, system_prompt

    @weave.op()
    def grade_answer(
        self, question: str, student_answer: str, total_marks: int
    ) -> GradeExtractor:
        # Retrieve the relevant chunk so the grader can check the answer against it
        response = retreival_engine.retrieve(question)
        context = response[0].node.text
        user_prompt, system_prompt = self.get_prompt_for_grading(
            question=question,
            context=context,
            total_marks=total_marks,
            student_answer=student_answer,
        )
        return self._instructor_groq_client.chat.completions.create(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            model=self.model,
            response_model=GradeExtractor,
        )


assistant = EnglishGradingAssistant()
# `ideal_student_response` is the answer generated earlier by the student response assistant
grade = assistant.grade_answer(
    question=query,
    student_answer=ideal_student_response,
    total_marks=5,
)


The English Answer Grading Assistant versioned and traced on Weave


🧬 Building an evaluation pipeline

To iterate on any AI application, we need a way to systematically evaluate its performance and check whether it's improving. A common practice is to test it against the same set of examples whenever something changes. In this recipe, we will build an evaluation pipeline to evaluate the responses of our AI assistant using weave.Evaluation, a flexible API that gives us a first-class way to track evaluations.

⚗️ Building an evaluation dataset

We built an evaluation dataset by scraping a question bank of solved question-answer pairs of the Flamingo textbook from LearnCBSE. The dataset consists of 358 question-answer pairs corresponding to the eight chapters from our knowledge base dataset.
We log this dataset as a weave.Dataset, which enables us to collect evaluation examples and automatically track versions for accurate comparisons. The dataset consists of examples in the following format:
{
    "question": "What was the mood in the classroom when M. Hamel gave his last French lesson?",
    "answer": "When M. Hamel was giving his last French lesson, the mood in the classroom was solemn and sombre. When he announced that this was their last French lesson, everyone present in the classroom suddenly developed patriotic feelings for their native language and genuinely regretted ignoring their mother tongue.",
    "marks": "3-4",
    "chapter_name": "The Last Lesson"
}
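Publishing the scraped examples to Weave might look like the following sketch, where rows is assumed to be the list of dictionaries in the format shown above:

import weave

weave.init(project_name="groq-rag")  # initialize the Weave project if not already done

# `rows` is assumed to be the list of scraped question-answer dictionaries
# in the format shown above
dataset = weave.Dataset(name="flamingos-prose-question-bank", rows=rows)
weave.publish(dataset)

Once published, the dataset is versioned automatically, which is why we can later fetch a specific version with weave.ref("flamingos-prose-question-bank:v1").get().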


Exploring the evaluation dataset using the Weave UI


👨🏽‍⚖️ Evaluating the assistant using an LLM Judge

One approach to evaluating an LLM application is to use another LLM as a judge. This recipe demonstrates a simple example of an LLM judge implemented as a weave.Scorer: we prompt the judge to check that the AI assistant's response is relevant to the retrieved context and to assess how well it holds up against the ground-truth answer from the evaluation dataset, so that application responses can be scored automatically.
import os
from typing import Dict, Optional

import instructor
import weave
from openai import OpenAI
from pydantic import BaseModel


# The pydantic object representing the LLM judge's structured response
class JudgeResponse(BaseModel):
    marks: float
    explanation: str


# The LLM judge model
class OpenaAIJudgeModel(weave.Scorer):
    model: str = "gpt-4"
    max_retries: int = 5
    _openai_client: Optional[instructor.Instructor] = None

    def __init__(self, model: Optional[str] = None):
        super().__init__()
        self.model = model if model is not None else self.model
        self._openai_client = instructor.from_openai(
            OpenAI(api_key=os.environ.get("OPENAI_API_KEY")),
            mode=instructor.Mode.TOOLS,
        )

    @weave.op()
    def compose_judgement(
        self,
        question: str,
        context: str,
        ground_truth_answer: str,
        assistant_answer: str,
        total_marks: int,
    ) -> JudgeResponse:
        system_prompt = """
You are an expert teacher of English language and literature.
Given a question, a context, a ground truth answer and an answer from an AI assistant,
you have to judge the assistant's answer based on the following criteria and assign
a score between 0 and total marks:

1. how well the assistant answers the question with respect to the context.
2. how well the assistant's answer holds up in correctness and relevance to
the ground truth answer (assuming the ground truth answer is perfect).

You have to extract the marks to be awarded to the assistant's answer and a detailed
explanation as to how the assistant's answer was judged."""
        user_prompt = f"""
We have asked the following question to an AI assistant for total marks of {total_marks}:

---
{question}
---

We have provided context information below.

---
{context}
---

The AI assistant has responded with the following answer:

---
{assistant_answer}
---

An ideal answer to the question would be the following:

---
{ground_truth_answer}
---"""
        return self._openai_client.chat.completions.create(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            max_retries=self.max_retries,
            model=self.model,
            response_model=JudgeResponse,
        )

    @weave.op()
    def score(
        self,
        question: str,
        answer: str,
        marks: str,
        model_output: Dict[str, str],
    ) -> Dict[str, float]:
        # The question bank stores marks as ranges like "3-4" or "5-6";
        # map them to a single total for scoring
        if marks == "3-4":
            total_marks = 4
        elif marks == "5-6":
            total_marks = 6
        else:
            total_marks = 4
        judge_response = self.compose_judgement(
            question=question,
            context=model_output["context"],
            ground_truth_answer=answer,
            assistant_answer=model_output["response"],
            total_marks=total_marks,
        )
        if not hasattr(judge_response, "marks"):
            return {"marks": 0.0, "fractional_marks": 0.0, "percentage": 0.0}
        return {
            "marks": judge_response.marks,
            "fractional_marks": judge_response.marks / total_marks,
            "percentage": (judge_response.marks / total_marks) * 100,
        }
Finally, let us put everything together and evaluate our LLM assistant using weave.Evaluation.
assistant = EnglishStudentResponseAssistant()


# We write an inference function for the evaluation process to match the
# function signature with the schema of the dataset.
@weave.op()
async def get_assistant_prediction(question: str, marks: str):
    # Map the "3-4"/"5-6" marks ranges from the dataset to a single total
    if marks == "3-4":
        total_marks = 4
    elif marks == "5-6":
        total_marks = 6
    else:
        total_marks = 4
    # Return both the response and the retrieved context so the judge's
    # `score` function can look them up in `model_output`
    response = retreival_engine.retrieve(question)
    context = response[0].node.text
    return {
        "response": assistant.predict(question, total_marks),
        "context": context,
    }


# Get the weave dataset
dataset = weave.ref("flamingos-prose-question-bank:v1").get()

# Define the evaluation
evaluation = weave.Evaluation(dataset=dataset, scorers=[OpenaAIJudgeModel()])

# Evaluate the inference function
await evaluation.evaluate(get_assistant_prediction)


Exploring the evaluation traces on Weave UI



Comparing Evaluations in the Weave UI


🏁 Conclusion

  • We've learned how to build an LLM application using frameworks like LlamaIndex and Instructor.
  • We've learned how to use GroqCloud as an LLM vendor for our LLM application.
  • We've also learned how to build observability into different steps of our applications using weave.op().
  • We've also learned how to build more complex scoring functions, like an LLM judge, to evaluate application responses automatically.

📕 Further Resources

We have a free prompt engineering course to help you think about how to structure your prompts. Also, feel free to check out our other reports to learn more about developing LLM applications.
