
Systematic Use Case Driven Evaluation of LLM Apps (with LLMs)

Systematic evaluation of LLM-powered applications for specific use-cases with an emphasis on LLM-powered evaluation. Focus on fine-tuning and prompt engineering from a technical and business point of view.

0. Context

This report demonstrates how to systematically evaluate LLM-powered applications in order to a) facilitate compliance and b) reduce time-to-market through faster iterations and superior observability - using Weights & Biases for LLMOps.

b) End-to-end example: Improving a RAG Chatbot

1. Setting up the Repo

This repo is designed to be simple and easily extendable to other kinds of evaluation and to different LLM-powered applications using W&B. To get started with the codebase and create your own W&B evaluation project, follow these steps.

a) Setup
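The exact steps live in the repo itself; as a hedged minimal sketch of the W&B side of the setup - assuming a Python environment with the wandb package installed and the eval-llm-apps project name visible in the artifact path later in this report:

import wandb

# Authenticate against W&B (prompts for an API key on first use)
wandb.login()

# Create or attach to the evaluation project under your entity
run = wandb.init(project="eval-llm-apps", job_type="setup")
run.finish()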

b) Codebase

2. Workflow

We consider four stages to effectively set up and conduct a systematic evaluation of our RAG chatbot for the specific use-case of answering questions on climate change based on trusted sources.

a) Foundations of fast empirical iterations

Create an E2E pipeline using W&B Artifacts and the W&B Model Registry: as pipelines grow more complex (e.g. LLM chains, data-generation jobs, data pre-processing) and evaluation becomes increasingly empirical (e.g. LLM-generated datasets, LLM judges), iterations must be fast and observable in order to effectively debug and improve the system.
  • Our lineage plot below lets us track exactly which job, with which parameters, produced which artifacts. Easy model and data versioning serves both compliance and productivity. Just think about how you would find out which parameters were used to generate the evaluation dataset that was used to evaluate the model - after pre-processing, fine-tuning, and creation of the vector DB - nearly impossible without proper data versioning, model versioning, and job tracking.
  • This model can be deployed to production or sent for further testing directly from this report! Adding "staging" or "production" tags from the "Version" tab below lets you trigger webhooks or jobs. This can also be done via the Model Registry; a sketch of how the artifact logging and tagging fits together follows below.
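As a hedged minimal sketch of how this lineage comes about - the dataset artifact name, local paths, and registry target below are illustrative assumptions, while the model artifact name matches the version overview shown next:

import wandb

run = wandb.init(project="eval-llm-apps", job_type="fine-tune-embeddings")

# Consuming the (hypothetical) pre-processed dataset artifact records lineage automatically
dataset_dir = run.use_artifact("rag-finetuning-dataset:latest").download()

# ... fine-tune the embedding model on the downloaded data ...

# Log the resulting model as a new artifact version with a "candidate" alias
model_artifact = wandb.Artifact("sentence-transformers-all-MiniLM-L12-v2", type="model")
model_artifact.add_dir("models/finetuned")
run.log_artifact(model_artifact, aliases=["candidate"])

# Optionally link the version into the Model Registry, where "staging"/"production" tags can trigger automations
run.link_artifact(model_artifact, "model-registry/rag-embedding-model", aliases=["staging"])
run.finish()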

sentence-transformers-all-MiniLM-L12-v2 - Version overview
  • Full Name: nicolas-remerscheid/eval-llm-apps/sentence-transformers-all-MiniLM-L12-v2:v8
  • Aliases: candidate, staging, train, v8
  • Tags: eval, data_quality
  • Digest: 5e97ab03bbc02cb797dbc756e3f1c35a
  • Created By:
  • Created At: November 2nd, 2023 12:47:38
  • Num Consumers: 5
  • Num Files: 11
  • Size: 133.5MB
  • TTL Remaining: Inactive

Description

Model Card for Fine-tuned sentence-transformers/all-MiniLM-L6-v2 for RAG Application

Model Details

  • Model Name: Fine-tuned sentence-transformers/all-MiniLM-L6-v2
  • Model Type: Sentence Embedding Model
  • Architecture: Transformer-based, MiniLM
  • Language: English
  • License: Apache-2.0
  • Model Version: 1.0
  • Hugging Face Organization: sentence-transformers
  • Authors: The team behind Sentence Transformers and contributors.

Model Architecture

Intended Use

  • Primary Applications: This model is fine-tuned specifically for Retrieval-Augmented Generation (RAG) applications, aiming to improve semantic retrieval capabilities in question-answering, chatbots, and information retrieval systems.
  • Intended Users: Researchers, developers, and enterprises looking to enhance their NLP applications with advanced semantic search capabilities.
  • Out-of-Scope Use Cases: Not intended for real-time, low-latency applications due to the computational requirements of transformer models.

Training Data

The fine-tuning process utilized a curated dataset compiled from diverse sources including academic papers, web text, and domain-specific datasets to ensure a broad semantic understanding. The data was preprocessed to focus on sentence and paragraph-level embeddings that are relevant for semantic retrieval tasks.

Training Procedure

  • Preprocessing: Texts were cleaned, tokenized, and encoded using the pre-trained sentence-transformers/all-MiniLM-L6-v2 tokenizer.
  • Fine-tuning: The model was fine-tuned on a task-specific dataset using a contrastive loss function that minimizes the distance between semantically similar sentences and maximizes it between dissimilar ones.
  • Epochs: Training was conducted for 10 epochs with early stopping based on validation loss to prevent overfitting.
  • Optimizer: AdamW with a learning rate of 2e-5.
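As a hedged minimal sketch of such contrastive fine-tuning with the sentence-transformers library - the training pairs below are invented placeholders for the actual LLM-generated fine-tuning data:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Invented example pairs: label 1.0 = semantically similar, 0.0 = dissimilar
train_examples = [
    InputExample(texts=["How do oceans absorb CO2?",
                        "Ocean uptake of carbon dioxide from the atmosphere"], label=1.0),
    InputExample(texts=["How do oceans absorb CO2?",
                        "The history of the printing press"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# Contrastive loss pulls similar pairs together and pushes dissimilar pairs apart
train_loss = losses.ContrastiveLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,                       # early stopping on validation loss handled separately
    optimizer_params={"lr": 2e-5},   # AdamW is the default optimizer in model.fit
    output_path="models/all-MiniLM-L6-v2-finetuned",
)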

Evaluation Results

The fine-tuned model demonstrates superior performance in semantic similarity and retrieval tasks compared to the base all-MiniLM-L6-v2 model, with significant improvements in Precision@k, Recall@k, and F1 scores on benchmark datasets.
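For reference, Precision@k and Recall@k for a single query follow the standard definitions; a short sketch (not code from this repo):

def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    # retrieved_ids: ranked list of chunk/document ids returned by the retriever
    # relevant_ids: set of ids judged relevant for the query
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Example: two of the top-3 retrieved chunks are relevant
print(precision_recall_at_k(["d1", "d7", "d3"], {"d1", "d3", "d9"}, k=3))  # ~(0.67, 0.67)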

Ethical Considerations

This model is built on data sources that may contain biases. Users should be aware of potential biases when deploying the model in diverse contexts and strive to mitigate such biases when possible.

Caveats and Recommendations

  • Computational Resources: Running the model, especially in RAG setups, requires substantial computational resources. Users should consider this when integrating the model into their applications.
  • Domain-specific Adaptations: While the model performs well across a broad range of topics, fine-tuning on domain-specific data is recommended for optimal performance in specialized applications.

How to Use

Here is a simple example of how to use this model in your application:

from sentence_transformers import SentenceTransformer

# Load the fine-tuned model
model = SentenceTransformer('your-model-path/sentence-transformers-all-MiniLM-L6-v2-finetuned')

# Encode sentences to get their embeddings
sentences = ["What is the capital of France?", "Tell me about the Eiffel tower."]
embeddings = model.encode(sentences)

# Use embeddings for semantic search, clustering, or other applications

For more detailed instructions, including how to integrate the model with RAG setups for semantic retrieval, please refer to the official documentation and examples provided by Sentence Transformers and Hugging Face.
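For example, a hedged sketch of a retrieval step on top of these embeddings - the corpus passages are illustrative, and the model path is the same placeholder as above:

from sentence_transformers import SentenceTransformer, util

# Illustrative corpus of trusted passages
corpus = [
    "Global surface temperature has risen faster since 1970 than in any other 50-year period.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
]
model = SentenceTransformer('your-model-path/sentence-transformers-all-MiniLM-L6-v2-finetuned')
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Retrieve the best-matching passage for a query
query_embedding = model.encode("How fast is the planet warming?", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
print(corpus[hits[0][0]["corpus_id"]])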

Acknowledgments

This work builds upon the efforts of the Sentence Transformers and Hugging Face teams, as well as the broader NLP community. We thank all contributors for their invaluable work in advancing the field of natural language processing.

References

License

This model is open-sourced under the Apache 2.0 license. The full license can be found in the LICENSE file in the model's repository.



b) Quantitative unit-testing with LLMs

Quantitative metrics for LLMs using W&B Experiments and W&B Tables: LLM-powered evaluation of correctness, hallucination, and retrieval performance is essential in order to understand the performance of our RAG model. However, these LLM-powered evaluations are only representative if we are very confident in the LLM-based dataset generation and the LLM judgements. That confidence can only come from complete observability and the ability to rigorously compare the evaluations produced by different LLMs acting as data generators and judges (GPT-4, Claude, etc.).
  • Quantitative unit-testing: Using specific unit-tests based on LLM-generated datasets and judged by other LLMs is increasingly popular - especially when the unit-tests are crafted very specifically (which makes the discriminative task of the LLM judges easier) and the evaluations of different LLM judges are cross-referenced (which reduces the risk of a biased evaluation); see the sketch after this list.
  • Fine-tuning: Prompt tuning can go a long way, but it is often not enough to bring a RAG system to the level required to push the application to production. This is where MLOps platforms that cover both prompt engineering (see next section) and standard ML tracking are essential. In the second figure you can interact with standard fine-tuning of the embeddings on the generated fine-tuning data. The same could be done for instruction fine-tuning of the chat model.
  • Business KPIs: With the increasing size of LLMs and the growing popularity of semi-autonomous agents, it is essential to understand how other key business metrics such as cost and latency relate to model performance. Both can be tracked and compared very easily in W&B.
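A hedged minimal sketch of one such LLM-judged correctness unit-test logged to a W&B Table - assuming the openai>=1.0 client, a hypothetical my_rag_chatbot function standing in for the chain under test, and a single invented item in place of the LLM-generated eval set:

import wandb
from openai import OpenAI

client = OpenAI()
# Invented example; in practice this comes from the LLM-generated eval dataset artifact
eval_set = [
    ("What drives sea level rise?",
     "Mainly thermal expansion of warming oceans and melting of glaciers and ice sheets."),
]

def judge_correctness(question, reference, candidate):
    # Ask an LLM judge for a 1-5 correctness score against the reference answer
    prompt = (
        f"Question: {question}\nReference answer: {reference}\nCandidate answer: {candidate}\n"
        "Rate the candidate's correctness from 1 (wrong) to 5 (fully correct). Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return int(response.choices[0].message.content.strip())

run = wandb.init(project="eval-llm-apps", job_type="evaluation")
table = wandb.Table(columns=["question", "reference", "candidate", "correctness"])
for question, reference in eval_set:
    candidate = my_rag_chatbot(question)  # hypothetical call into the RAG chain under test
    table.add_data(question, reference, candidate, judge_correctness(question, reference, candidate))
run.log({"correctness_unit_tests": table})
run.finish()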

[W&B panel: Run set - 29 runs]

[W&B panel: Run set - 69 runs]



c) Qualitative system-level debugging

d) Optimization, automation and collaboration

3. Follow-up reports
