
How to use the Gemini Pro API with W&B Weave

Powerful LLMs need observability. Here's how to get it.
Created on April 26 | Last edited on May 28

Introduction

Welcome to this simple guide on Google Gemini and W&B Weave, where you'll learn how to pair Google's powerful LLM with our LLM observability tool to build powerful apps.
Weave makes working with language models easier and more reliable. We'll discuss Weave's basics, key features, and how it can help you organize and evaluate your work with language models.
Let's dive in.


W&B Weave

W&B Weave is a user-friendly, lightweight toolkit designed to help developers track and evaluate their large language models (LLMs) in a more organized and efficient manner. Developed by Weights & Biases, Weave brings rigor, best practices, and composability to the inherently experimental process of AI development.
Get started with W&B Weave by checking out our docs and quickstart
💡
By the end of this article, you'll understand how to:
  • Easily log and debug LLM inputs, outputs, and traces
  • Build reliable evaluations for LLM use cases
  • Organize all the information generated across the LLM workflow, from experimentation to evaluations to production.
And this is all the code you need to get started:
!pip install weave

import weave
weave.init('love-gemini') # start logging

# add @weave.op() decorator to the functions you want to track
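For example, any function you decorate gets logged automatically. Here's a minimal sketch; say_hello is just a placeholder for your own code:
@weave.op()
def say_hello(name: str) -> str:
    # Inputs, output, and any exceptions of this call are logged to Weave
    return f"Hello, {name}!"

say_hello("Gemini")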

Google Gemini

The Gemini ecosystem includes some of the most powerful models available, featuring extremely long context and multimodal capabilities that allow reasoning across text, images, audio, and video. Gemini was the first model reported to outperform human experts on one of the most challenging benchmarks in ML: massive multitask language understanding (MMLU).
In this article, we'll see how to use Gemini models through the Python API and how to integrate them with W&B Weave for logging, debugging, and evaluation.
We recommend this short tutorial on getting started with the Gemini Pro API.
💡
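If you're already set up, the basic client configuration looks roughly like this, assuming your API key is stored in the GOOGLE_API_KEY environment variable; the model handle created here is reused throughout the article:
import os
import google.generativeai as genai

# Authenticate with the Gemini API key from the environment
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Model handle used in the rest of this article
model = genai.GenerativeModel("gemini-1.5-pro-latest")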

Logging and debugging

As a warmup exercise, let's write some code that will generate summaries of long research papers for us. Can we handle long papers with Gemini though? Let's find out.
model_info = genai.get_model('models/gemini-1.5-pro-latest')
print(model_info.input_token_limit)
# 1048576
With 1 million tokens context, it's going to be a breeze.
Follow along in this Colab
💡
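If you want to sanity-check a specific paper before sending it, the client can also count tokens for you (here, long_paper_text is assumed to already hold the paper's full text):
print(model.count_tokens(long_paper_text).total_tokens)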
First, we'll write a short helper function to generate summaries and decorate it with @weave.op(). Then we'll take a look at the output in Weave.
@weave.op()
def generate_summary(text):
    prompt = "Generate a concise summary of below text:\n"
    response = model.generate_content(prompt + text)
    return {
        'summary': response.text
    }
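Calling the decorated function is no different from calling any other Python function; Weave simply records the call. Here, long_paper_text is assumed to hold the full text of the paper loaded in the Colab:
result = generate_summary(long_paper_text)
print(result['summary'])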

Weave allows us to quickly inspect and debug our LLM apps. Here, we can check the inputs and outputs of the function we decorated. We can also see the versioned code. I can already see a couple of things I'd like to improve:
  1. The summary comes in Markdown format. While it's easy to read, I think I'd prefer just raw text.
  2. There's a title that I'd like to remove. I just want a plain summary.
  3. The output is a bit longer than I expected.
Let's try to improve our generation with these insights, shall we? We'll change the prompt and use JSON Mode to get structured outputs.

Gemini API JSON Mode

I often want to programmatically process the output of LLM API calls, and that's easier if we get it in a structured format. The two main ways to achieve that are function calling and JSON mode. Let's check out JSON mode in the Gemini API.
To use JSON mode, we need to enable it in the generation_config:
model = genai.GenerativeModel("gemini-1.5-pro-latest",
                              generation_config={"response_mime_type": "application/json"})
We also need to specify the JSON schema in the prompt. I hate writing JSON schemas by hand, but pydantic comes to the rescue: we can define a pydantic model and use it to generate the JSON schema.
from pydantic import BaseModel, Field

class Summary(BaseModel):
    title: str
    summary: str = Field(description="plain short text summary without markdown")

schema = Summary.model_json_schema()
Now we can rewrite our generation function and see the output in Weave. Can you see how we're decorating multiple functions here? Weave keeps track of nested calls and will visualize the entire trace for us. This one is pretty simple, but as you build more complex apps it can become a lifesaver.
import json

@weave.op()
def create_prompt(text, schema):
    prompt = f"""Generate a concise summary of below text using below JSON schema.
Please output plain text without markdown and limit it to 200 words.
Text:
{text}
JSON schema:
{schema}
"""
    return prompt


@weave.op()
def generate_summary(text, schema):
    prompt = create_prompt(text, schema)
    response = model.generate_content(prompt)
    try:
        output = json.loads(response.text)
    except json.JSONDecodeError:
        # fall back to the raw text if the model didn't return valid JSON
        output = response.text
    return {
        'summary': output
    }
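Calling it looks like this (again assuming long_paper_text holds the paper's full text):
result = generate_summary(long_paper_text, schema)
print(result['summary'])  # parsed JSON following the Summary schema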
It worked perfectly. You can see the full trace view on the left and details of inputs and outputs on the right. I love using Weave to debug like this.


Evaluation with Weave

It's easy to get started using LLMs and iterate based on vibes, but you won't get very far that way. To operationalize LLMs you need to set up proper (ideally automated) evals.
For this tutorial, we'll keep it simple and check whether our summary follows the expected JSON format and whether it is concise (by counting the words). In practice, you may need to invest quite a bit of time to set up good evals and use techniques such as an LLM judge.
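If you do want to go further, an LLM judge can itself be written as a Weave scorer. Here's a rough, hypothetical sketch that isn't part of the accompanying Colab: the text argument is pulled from the dataset row by name, and the judge model and grading prompt are assumptions.
@weave.op()
def llm_judge_faithfulness(text: str, model_output: dict) -> dict:
    # Hypothetical illustration: ask Gemini to grade how faithful the summary is
    judge = genai.GenerativeModel("gemini-1.5-pro-latest")
    prompt = (
        "On a scale of 1 to 5, how faithful is this summary to the original text? "
        "Answer with a single digit.\n\n"
        f"Text:\n{text}\n\nSummary:\n{model_output.get('summary', '')}"
    )
    response = judge.generate_content(prompt)
    try:
        score = int(response.text.strip())
    except ValueError:
        score = 0
    return {'faithfulness': score}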

Define model

A Weave Model captures system details such as the prompt template and model parameters, and Weave automatically versions it when these change. Models subclass weave.Model and implement a predict function that returns the model's response for a given example.
from pydantic import model_validator

class SummaryModel(weave.Model):
    model_name: str
    prompt_template: str
    json_schema: dict
    model: genai.GenerativeModel

    @model_validator(mode="before")
    def create_model(cls, v):
        model_name = v["model_name"]
        model = genai.GenerativeModel(model_name,
                                      generation_config={"response_mime_type": "application/json"})
        v["model"] = model
        return v

    @weave.op()
    async def predict(self, text: str) -> dict:
        prompt = self.prompt_template.format(text=text, schema=self.json_schema)
        response = self.model.generate_content(prompt)
        try:
            output = json.loads(response.text)
            return output[0]
        except Exception:
            # fall back to the raw text if parsing fails
            return {'summary': response.text}
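With the class defined, we can instantiate it. The prompt template below is just an illustration that mirrors the create_prompt function from earlier:
prompt_template = """Generate a concise summary of below text using below JSON schema.
Please output plain text without markdown and limit it to 200 words.
Text:
{text}
JSON schema:
{schema}
"""

model = SummaryModel(
    model_name="gemini-1.5-pro-latest",
    prompt_template=prompt_template,
    json_schema=schema,
)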

Define dataset and scoring function

To run an evaluation, you'll need a dataset: a collection of examples, often failure cases, to test your model. Think of it like unit tests in TDD. Then, you define scoring functions: a list of functions that score each example. Each function takes a model's output (and, optionally, fields from the example) and returns scores in a dictionary.
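A dataset can be as simple as a list of dictionaries whose keys match the arguments of your model's predict function. Here, each row only needs a text field; the paper variables are placeholders for texts you've loaded yourself:
dataset = [
    {'text': paper_1_text},
    {'text': paper_2_text},
    {'text': paper_3_text},
]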
Here's an example of a scoring function that checks the length of the generated summary.
@weave.op()
def check_conciseness(model_output: dict) -> dict:
    result = False
    if 'summary' in model_output:
        result = len(model_output['summary'].split()) < 300
    return {'conciseness': result}
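The evaluation in the next section also uses a check_formatting scorer, which isn't shown in the article. A minimal version might simply verify that the output contains the fields from our Summary schema:
@weave.op()
def check_formatting(model_output: dict) -> dict:
    # The output should contain both fields defined in the Summary schema
    result = all(key in model_output for key in ('title', 'summary'))
    return {'formatting': result}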

Run evaluation

We have everything now to create and run the evaluation: dataset, scoring functions and model. Evaluation is async, so when running in a notebook we need to await it.
evaluation = weave.Evaluation(
    dataset=dataset, scorers=[check_formatting, check_conciseness],
)
await evaluation.evaluate(model)
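If you're running this as a plain Python script instead of a notebook, wrap the call with asyncio:
import asyncio

asyncio.run(evaluation.evaluate(model))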
As a result, we get the total score for each of our metrics. I find it very helpful to be able to review the results in the user interface and see which inputs or outputs are failing.

As you can see above, most of the examples succeeded on both metrics, but some failed on both. We have all the observability thanks to Weave. Let's dive in.

After checking the traces of our failed examples, I found the failures were due to 504 API errors. That's something we can fix on our side. Whenever a request actually reached the API, Gemini handled it perfectly according to our metrics.

Conclusions

Gemini is a powerful LLM, Weave is a lightweight LLM observability tool, and they work great together. You can get a lot of value by using both in your workflow.
Please share with us what you build!
Iterate on AI agents and models faster. Try Weights & Biases today.