
Systematic Use Case Driven Evaluation of LLM Apps (with LLMs)

Systematic evaluation of LLM-powered applications for specific use-cases with an emphasis on LLM-powered evaluation. Focus on fine-tuning and prompt engineering from a technical and business point of view.

0. Context

This report demonstrates how to systematically evaluate LLM-powered applications in order to a) facilitate compliance and b) reduce time-to-market through faster iterations and superior observability - using Weights & Biases for LLMOps.

b) End-to-end example: Improving a RAG Chatbot

1. Setting up the Repo

This repo is designed to be simple and easily extendable to other kinds of evaluation and to different LLM-powered applications using W&B. To get started with the codebase and create your own W&B evaluation project, follow these steps.

a) Setup
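The exact steps live in the repo itself; as a hedged minimal sketch of the W&B side of the setup - assuming a Python environment with the wandb package installed and the eval-llm-apps project name visible in the artifact path later in this report:

import wandb

# Authenticate against W&B (prompts for an API key on first use)
wandb.login()

# Create or attach to the evaluation project under your entity
run = wandb.init(project="eval-llm-apps", job_type="setup")
run.finish()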

b) Codebase

2. Workflow

We consider four stages to effectively set up and conduct a systematic evaluation of our RAG chatbot for the specific use-case of answering questions on climate change based on trusted sources.

a) Foundations of fast empirical iterations

Create an E2E pipeline using W&B Artifacts and the W&B Model Registry: as pipelines grow more complex (e.g. LLM chains, data-generation jobs, data pre-processing) and evaluation becomes increasingly empirical (e.g. LLM-generated datasets, LLM judges), iterations must be fast and observable in order to effectively debug and improve the system.
  • Our lineage plot below lets us track exactly which job, with which parameters, produced which artifacts. Easy model and data versioning serves both compliance and productivity. Just think about how you would find out which parameters were used to generate the evaluation dataset that was used to evaluate the model - after pre-processing, fine-tuning, and creation of the vector DB - nearly impossible without proper data versioning, model versioning, and job tracking.
  • This model can be deployed to production or sent for further testing directly from this report! Adding "staging" or "production" tags from the "Version" tab below lets you trigger webhooks or jobs. This can also be done via the Model Registry; a sketch of how the artifact logging and tagging fits together follows below.
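As a hedged minimal sketch of how this lineage comes about - the dataset artifact name, local paths, and registry target below are illustrative assumptions, while the model artifact name matches the version overview shown next:

import wandb

run = wandb.init(project="eval-llm-apps", job_type="fine-tune-embeddings")

# Consuming the (hypothetical) pre-processed dataset artifact records lineage automatically
dataset_dir = run.use_artifact("rag-finetuning-dataset:latest").download()

# ... fine-tune the embedding model on the downloaded data ...

# Log the resulting model as a new artifact version with a "candidate" alias
model_artifact = wandb.Artifact("sentence-transformers-all-MiniLM-L12-v2", type="model")
model_artifact.add_dir("models/finetuned")
run.log_artifact(model_artifact, aliases=["candidate"])

# Optionally link the version into the Model Registry, where "staging"/"production" tags can trigger automations
run.link_artifact(model_artifact, "model-registry/rag-embedding-model", aliases=["staging"])
run.finish()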

sentence-transformers-all-MiniLM-L12-v2 - Version overview
  • Full Name: nicolas-remerscheid/eval-llm-apps/sentence-transformers-all-MiniLM-L12-v2:v8
  • Aliases: candidate, staging, train, v8
  • Tags: eval, data_quality
  • Digest: 5e97ab03bbc02cb797dbc756e3f1c35a
  • Created By:
  • Created At: November 2nd, 2023 12:47:38
  • Num Consumers: 5
  • Num Files: 11
  • Size: 133.5MB
  • TTL Remaining: Inactive

Description

Model Card for Fine-tuned sentence-transformers/all-MiniLM-L6-v2 for RAG Application

Model Details

  • Model Name: Fine-tuned sentence-transformers/all-MiniLM-L6-v2
  • Model Type: Sentence Embedding Model
  • Architecture: Transformer-based, MiniLM
  • Language: English
  • License: Apache-2.0
  • Model Version: 1.0
  • Hugging Face Organization: sentence-transformers
  • Authors: The team behind Sentence Transformers and contributors.

Model Architecture

Intended Use

  • Primary Applications: This model is fine-tuned specifically for Retrieval-Augmented Generation (RAG) applications, aiming to improve semantic retrieval capabilities in question-answering, chatbots, and information retrieval systems.
  • Intended Users: Researchers, developers, and enterprises looking to enhance their NLP applications with advanced semantic search capabilities.
  • Out-of-Scope Use Cases: Not intended for real-time, low-latency applications due to the computational requirements of transformer models.

Training Data

The fine-tuning process utilized a curated dataset compiled from diverse sources including academic papers, web text, and domain-specific datasets to ensure a broad semantic understanding. The data was preprocessed to focus on sentence and paragraph-level embeddings that are relevant for semantic retrieval tasks.

Training Procedure

  • Preprocessing: Texts were cleaned, tokenized, and encoded using the pre-trained sentence-transformers/all-MiniLM-L6-v2 tokenizer.
  • Fine-tuning: The model was fine-tuned on a task-specific dataset using a contrastive loss function that minimizes the distance between semantically similar sentences and maximizes it between dissimilar ones.
  • Epochs: Training was conducted for 10 epochs with early stopping based on validation loss to prevent overfitting.
  • Optimizer: AdamW with a learning rate of 2e-5.
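As a hedged minimal sketch of such contrastive fine-tuning with the sentence-transformers library - the training pairs below are invented placeholders for the actual LLM-generated fine-tuning data:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Invented example pairs: label 1.0 = semantically similar, 0.0 = dissimilar
train_examples = [
    InputExample(texts=["How do oceans absorb CO2?",
                        "Ocean uptake of carbon dioxide from the atmosphere"], label=1.0),
    InputExample(texts=["How do oceans absorb CO2?",
                        "The history of the printing press"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# Contrastive loss pulls similar pairs together and pushes dissimilar pairs apart
train_loss = losses.ContrastiveLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,                       # early stopping on validation loss handled separately
    optimizer_params={"lr": 2e-5},   # AdamW is the default optimizer in model.fit
    output_path="models/all-MiniLM-L6-v2-finetuned",
)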

Evaluation Results

The fine-tuned model demonstrates superior performance in semantic similarity and retrieval tasks compared to the base all-MiniLM-L6-v2 model, with significant improvements in Precision@k, Recall@k, and F1 scores on benchmark datasets.
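For reference, Precision@k and Recall@k for a single query follow the standard definitions; a short sketch (not code from this repo):

def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    # retrieved_ids: ranked list of chunk/document ids returned by the retriever
    # relevant_ids: set of ids judged relevant for the query
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Example: two of the top-3 retrieved chunks are relevant
print(precision_recall_at_k(["d1", "d7", "d3"], {"d1", "d3", "d9"}, k=3))  # ~(0.67, 0.67)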

Ethical Considerations

This model is built on data sources that may contain biases. Users should be aware of potential biases when deploying the model in diverse contexts and strive to mitigate such biases when possible.

Caveats and Recommendations

  • Computational Resources: Running the model, especially in RAG setups, requires substantial computational resources. Users should consider this when integrating the model into their applications.
  • Domain-specific Adaptations: While the model performs well across a broad range of topics, fine-tuning on domain-specific data is recommended for optimal performance in specialized applications.

How to Use

Here is a simple example of how to use this model in your application:

from sentence_transformers import SentenceTransformer

# Load the fine-tuned model
model = SentenceTransformer('your-model-path/sentence-transformers-all-MiniLM-L6-v2-finetuned')

# Encode sentences to get their embeddings
sentences = ["What is the capital of France?", "Tell me about the Eiffel tower."]
embeddings = model.encode(sentences)

# Use embeddings for semantic search, clustering, or other applications

For more detailed instructions, including how to integrate the model with RAG setups for semantic retrieval, please refer to the official documentation and examples provided by Sentence Transformers and Hugging Face.
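For example, a hedged sketch of a retrieval step on top of these embeddings - the corpus passages are illustrative, and the model path is the same placeholder as above:

from sentence_transformers import SentenceTransformer, util

# Illustrative corpus of trusted passages
corpus = [
    "Global surface temperature has risen faster since 1970 than in any other 50-year period.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
]
model = SentenceTransformer('your-model-path/sentence-transformers-all-MiniLM-L6-v2-finetuned')
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Retrieve the best-matching passage for a query
query_embedding = model.encode("How fast is the planet warming?", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
print(corpus[hits[0][0]["corpus_id"]])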

Acknowledgments

This work builds upon the efforts of the Sentence Transformers and Hugging Face teams, as well as the broader NLP community. We thank all contributors for their invaluable work in advancing the field of natural language processing.

References

License

This model is open-sourced under the Apache 2.0 license. The full license can be found in the LICENSE file in the model's repository.



b) Quantitative unit-testing with LLMs

Quantitative metrics for LLMs using W&B Experiments and W&B Tables: LLM-powered evaluation of correctness, hallucination, and retrieval performance is essential in order to understand the performance of our RAG model. However, these LLM-powered evaluations are only representative if we are very confident in the LLM-based dataset generation and the LLM judgements. That confidence can only come from complete observability and the ability to rigorously compare the evaluations produced by different LLMs acting as data generators and judges (GPT-4, Claude, etc.).
  • Quantitative unit-testing: Using specific unit-tests based on LLM-generated datasets and judged by other LLMs is increasingly popular - especially when the unit-tests are crafted very specifically (which makes the discriminative task of the LLM judges easier) and the evaluations of different LLM judges are cross-referenced (which reduces the risk of a biased evaluation); see the sketch after this list.
  • Fine-tuning: Prompt tuning can go a long way, but it is often not enough to bring a RAG system to the level required to push the application to production. This is where MLOps platforms that cover both prompt engineering (see next section) and standard ML tracking are essential. In the second figure you can interact with standard fine-tuning of the embeddings on the generated fine-tuning data. The same could be done for instruction fine-tuning of the chat model.
  • Business KPIs: With the increasing size of LLMs and the growing popularity of semi-autonomous agents, it is essential to understand how other key business metrics such as cost and latency relate to model performance. Both can be tracked and compared very easily in W&B.
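A hedged minimal sketch of one such LLM-judged correctness unit-test logged to a W&B Table - assuming the openai>=1.0 client, a hypothetical my_rag_chatbot function standing in for the chain under test, and a single invented item in place of the LLM-generated eval set:

import wandb
from openai import OpenAI

client = OpenAI()
# Invented example; in practice this comes from the LLM-generated eval dataset artifact
eval_set = [
    ("What drives sea level rise?",
     "Mainly thermal expansion of warming oceans and melting of glaciers and ice sheets."),
]

def judge_correctness(question, reference, candidate):
    # Ask an LLM judge for a 1-5 correctness score against the reference answer
    prompt = (
        f"Question: {question}\nReference answer: {reference}\nCandidate answer: {candidate}\n"
        "Rate the candidate's correctness from 1 (wrong) to 5 (fully correct). Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return int(response.choices[0].message.content.strip())

run = wandb.init(project="eval-llm-apps", job_type="evaluation")
table = wandb.Table(columns=["question", "reference", "candidate", "correctness"])
for question, reference in eval_set:
    candidate = my_rag_chatbot(question)  # hypothetical call into the RAG chain under test
    table.add_data(question, reference, candidate, judge_correctness(question, reference, candidate))
run.log({"correctness_unit_tests": table})
run.finish()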

[W&B panel: Run set - 29 runs]

[W&B panel: Run set - 69 runs]



c) Qualitative system-level debugging

d) Optimization, automation and collaboration

3. Follow-up reports
