Systematic Use Case Driven Evaluation of LLM Apps (with LLMs)
0. Context
a) LLM Trends
b) End-to-end example: Improving a RAG Chatbot
1. Setting up the Repo
a) Setup
b) Codebase
2. Workflow
a) Foundations of fast empirical iterations
- Our lineage plot below lets you track exactly which job, with which parameters, produced which artifacts: easy model and data versioning for both compliance and productivity. Just think about how you would find out which parameters were used to generate the evaluation dataset that was used to evaluate the model, after preprocessing, fine-tuning, and creation of the vector DB. Without proper data versioning, model versioning, and job tracking, this is nearly impossible.
- This model can be deployed to production or sent for further testing directly from this report! Adding "staging" or "production" tags from the "Version" tab below lets you trigger webhooks or jobs; the same can be done via the model registry (a minimal code sketch follows below).
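As a rough sketch of how this lineage and tagging is wired up with the W&B Python SDK (the project, artifact, and directory names here are placeholders, not the actual names used in this report), each job simply declares the artifacts it consumes and produces:

import wandb

# Illustrative sketch: project, artifact, and path names are placeholders.
run = wandb.init(project="rag-chatbot", job_type="fine-tune", config={"lr": 2e-5})

# Declaring inputs ties this job to its upstream artifacts in the lineage graph.
eval_dataset = run.use_artifact("eval-dataset:latest")
data_dir = eval_dataset.download()

# ... fine-tune the model and save it to ./model ...

# Logging the output model (with the producing run and its config) completes the lineage.
model_artifact = wandb.Artifact("finetuned-minilm", type="model")
model_artifact.add_dir("./model")
run.log_artifact(model_artifact, aliases=["staging"])  # aliases/tags can trigger webhooks or jobs
run.finish()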
Model Card for Fine-tuned sentence-transformers/all-MiniLM-L6-v2 for RAG Application
Model Details
- Model Name: Fine-tuned sentence-transformers/all-MiniLM-L6-v2
- Model Type: Sentence Embedding Model
- Architecture: Transformer-based, MiniLM
- Language: English
- License: Apache-2.0
- Model Version: 1.0
- Hugging Face Organization: sentence-transformers
- Authors: The team behind Sentence Transformers and contributors.
Intended Use
- Primary Applications: This model is fine-tuned specifically for Retrieval-Augmented Generation (RAG) applications, aiming to improve semantic retrieval capabilities in question-answering, chatbots, and information retrieval systems.
- Intended Users: Researchers, developers, and enterprises looking to enhance their NLP applications with advanced semantic search capabilities.
- Out-of-Scope Use Cases: Not intended for real-time, low-latency applications due to the computational requirements of transformer models.
Training Data
The fine-tuning process utilized a curated dataset compiled from diverse sources including academic papers, web text, and domain-specific datasets to ensure a broad semantic understanding. The data was preprocessed to focus on sentence and paragraph-level embeddings that are relevant for semantic retrieval tasks.
Training Procedure
- Preprocessing: Texts were cleaned, tokenized, and encoded using the pre-trained sentence-transformers/all-MiniLM-L6-v2 tokenizer.
- Fine-tuning: The model was fine-tuned on a task-specific dataset using a contrastive loss function that minimizes the distance between semantically similar sentences and maximizes it between dissimilar ones (sketched in code after this list).
- Epochs: Training was conducted for 10 epochs with early stopping based on validation loss to prevent overfitting.
- Optimizer: AdamW with a learning rate of 2e-5.
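The procedure above translates roughly into the following sketch with the sentence-transformers fit API (the training pairs are illustrative; early stopping would be wired in via an evaluator on the validation set):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Illustrative pairs: label 1.0 = semantically similar, 0.0 = dissimilar.
train_examples = [
    InputExample(texts=["How do I reset my password?",
                        "Steps to recover a forgotten password"], label=1.0),
    InputExample(texts=["How do I reset my password?",
                        "Quarterly revenue grew by 4%"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Contrastive loss: pulls similar pairs together, pushes dissimilar pairs apart.
train_loss = losses.ContrastiveLoss(model)

# 10 epochs with AdamW at a learning rate of 2e-5, matching the settings above.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    optimizer_params={"lr": 2e-5},
)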
Evaluation Results
The fine-tuned model demonstrates superior performance in semantic similarity and retrieval tasks compared to the base all-MiniLM-L6-v2 model, with significant improvements in Precision@k, Recall@k, and F1 scores on benchmark datasets.
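Metrics such as Precision@k and Recall@k can be computed with the library's built-in InformationRetrievalEvaluator; the toy queries, corpus, and relevance judgments below are placeholders for a real evaluation dataset:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("your-model-path/sentence-transformers-all-MiniLM-L6-v2-finetuned")

# Toy relevance judgments; in practice these come from the generated eval dataset.
queries = {"q1": "What is the capital of France?"}
corpus = {"d1": "Paris is the capital of France.",
          "d2": "The Eiffel Tower is 330 metres tall."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries, corpus, relevant_docs,
    accuracy_at_k=[1],
    precision_recall_at_k=[1, 2],  # reports Precision@k and Recall@k
    mrr_at_k=[2], ndcg_at_k=[2], map_at_k=[2],
)
results = evaluator(model)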
Ethical Considerations
This model is built on data sources that may contain biases. Users should be aware of potential biases when deploying the model in diverse contexts and strive to mitigate such biases when possible.
Caveats and Recommendations
- Computational Resources: Running the model, especially in RAG setups, requires substantial computational resources. Users should consider this when integrating the model into their applications.
- Domain-specific Adaptations: While the model performs well across a broad range of topics, fine-tuning on domain-specific data is recommended for optimal performance in specialized applications.
How to Use
Here is a simple example of how to use this model in your application:
from sentence_transformers import SentenceTransformer
# Load the fine-tuned model
model = SentenceTransformer('your-model-path/sentence-transformers-all-MiniLM-L6-v2-finetuned')
# Encode sentences to get their embeddings
sentences = ["What is the capital of France?", "Tell me about the Eiffel tower."]
embeddings = model.encode(sentences)
# Use embeddings for semantic search, clustering, or other applications
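From there, ranking by cosine similarity gives you semantic search; for instance, using the util helpers that ship with the library:

from sentence_transformers import util

# Cosine similarity between the two example sentences above.
score = util.cos_sim(embeddings[0], embeddings[1])
print(f"similarity: {score.item():.3f}")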
For more detailed instructions, including how to integrate the model with RAG setups for semantic retrieval, please refer to the official documentation and examples provided by Sentence Transformers and Hugging Face.
Acknowledgments
This work builds upon the efforts of the Sentence Transformers and Hugging Face teams, as well as the broader NLP community. We thank all contributors for their invaluable work in advancing the field of natural language processing.
References
- Original sentence-transformers/all-MiniLM-L6-v2 model: Hugging Face Model Hub
- Sentence Transformers Documentation: SBERT.net
- Hugging Face Transformers: Transformers Library
License
This model is open-sourced under the Apache 2.0 license. The full license can be found in the LICENSE file in the model's repository.
b) Quantitative unit-testing with LLMs
- Quantitative unit-testing: Unit tests built on LLM-generated datasets and judged by other LLMs are increasingly popular. Crafting very specific unit tests makes the discriminative task of the LLM judges easier, and cross-referencing the evaluations of several different judges reduces the risk of a biased evaluation (a minimal sketch follows this list).
- Fine-tuning: Prompt tuning can go a long way, but it is often not enough to bring a RAG application to production-level performance. This is where MLOps platforms that cover both prompt engineering (see next section) and standard ML tracking are essential. In the second figure you can interact with a standard fine-tuning run of the embeddings on the generated fine-tuning data (see the code sketch in the model card above); the same could be done with instruction fine-tuning for the chat model.
- Business KPIs: With growing model sizes and the rising popularity of semi-autonomous agents, it is essential to understand how key business metrics such as cost and latency relate to model performance. Both can be tracked and compared very easily in W&B (sketched below).
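To make the unit-testing pattern concrete, here is a minimal sketch of one such test with two cross-referenced judges. The judge model names, the rubric, the pass threshold, and the rag_chatbot function under test are all illustrative; any chat-completion API works the same way:

from openai import OpenAI

client = OpenAI()
JUDGES = ["gpt-4o", "gpt-4o-mini"]  # cross-reference several judges to reduce bias

def judge_answer(question: str, answer: str) -> float:
    """Average the 0-10 scores of several LLM judges for one answer."""
    prompt = (
        "Rate from 0 to 10 how faithfully the answer addresses the question. "
        "Reply with a single number.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    scores = []
    for judge in JUDGES:
        response = client.chat.completions.create(
            model=judge, messages=[{"role": "user", "content": prompt}]
        )
        scores.append(float(response.choices[0].message.content.strip()))
    return sum(scores) / len(scores)

def test_capital_question():
    question = "What is the capital of France?"
    answer = rag_chatbot(question)  # hypothetical: the RAG app under test
    assert judge_answer(question, answer) >= 7.0  # illustrative pass threshold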
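And a minimal sketch of tracking those KPIs alongside the quality scores, where rag_chatbot and the per-token price are again placeholders:

import time
import wandb

run = wandb.init(project="rag-chatbot", job_type="evaluation")

for question in ["What is the capital of France?", "Tell me about the Eiffel tower."]:
    start = time.perf_counter()
    answer, n_tokens = rag_chatbot(question)  # hypothetical app returning its token usage
    run.log({
        "latency_s": time.perf_counter() - start,
        "cost_usd": n_tokens / 1000 * 0.0005,  # placeholder price per 1K tokens
    })
run.finish()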