Debug feat/v1.3 with Auto Evaluation
Journal of auto-evaluation-based LLM app debugging.
Created on April 3 | Last edited on April 5
The branch feat/v1.3 introduced many changes that make wandbot faster, but at the cost of degraded auto-eval results. This report methodically investigates where the issue was introduced.
Prior to starting this report, multiple efforts were made to understand where the issue stems from. Some of them are documented in this slide deck as part of our "Optimising LLM Apps at Scale" talk. Check out slides 41-44.
Efforts were also made to reduce the time it takes to run the auto evaluation. Without any optimisation, the eval used to take over 3 hrs (180+ mins). After making the eval fully asynchronous and using Gunicorn to spin up 8 wandbot instances (FastAPI endpoints), we were able to reduce the eval time to less than 45 mins, a huge speed-up. The optimisations introduced in branch feat/v1.3 reduce the eval time further to ~7 mins.
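As a rough illustration, the asynchronous eval loop fans requests out across the Gunicorn-managed instances. The sketch below assumes a hypothetical /chat/query endpoint and payload shape; the actual wandbot API may differ.

```python
import asyncio

import httpx

# Hypothetical endpoint and payload shape; the real wandbot API may differ.
WANDBOT_URL = "http://localhost:8000/chat/query"

async def get_response(client: httpx.AsyncClient, question: str) -> str:
    resp = await client.post(WANDBOT_URL, json={"question": question}, timeout=120)
    resp.raise_for_status()
    return resp.json()["answer"]

async def run_eval(questions: list[str]) -> list[str]:
    # Cap concurrency so the 8 Gunicorn workers stay busy without being overwhelmed.
    semaphore = asyncio.Semaphore(8)

    async def bounded(client: httpx.AsyncClient, question: str) -> str:
        async with semaphore:
            return await get_response(client, question)

    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(bounded(client, q) for q in questions))

# answers = asyncio.run(run_eval(eval_questions))
```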
For the sake of debugging, we are only interested in the correctness of the generated response (answer_correctness_result). We ask a judge LLM (gpt-4-1106-preview) to do this evaluation.
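For reference, the correctness judgment looks roughly like the sketch below, using the OpenAI Python client. The prompt wording and the expected JSON fields are illustrative assumptions, not the exact prompt used in the wandbot eval.

```python
import json

from openai import OpenAI

client = OpenAI()

# Illustrative judge prompt; not the exact prompt used in the wandbot eval.
JUDGE_SYSTEM_PROMPT = (
    "You are grading a RAG system's answer. Compare the generated answer to the "
    "reference answer and reply with JSON: "
    '{"correct": true/false, "grade": 1-3, "reason": "..."}'
)

def judge_answer(question: str, reference: str, generated: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Question: {question}\n"
                           f"Reference answer: {reference}\n"
                           f"Generated answer: {generated}",
            },
        ],
    )
    return json.loads(response.choices[0].message.content)
```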
How we built the evaluation dataset, the different strategies, the manual evaluation and more are documented in these reports:
[TODO Add reports]
Notes:
- Wandbot has a fallback system: if it hits a rate limit, it falls back to a weaker model to generate the response. For this evaluation, however, the primary and fallback models are kept the same (see the config sketch after this list).
- We are maintaining a separate branch debug/feat/v1.3 where we cherry-pick the commits from feat/v1.3 and apply them one by one. We then evaluate and report the results here.
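A minimal sketch of what pinning the fallback model looks like, with hypothetical attribute names (the actual wandbot config class differs):

```python
from dataclasses import dataclass

# Hypothetical config sketch; the real wandbot settings class differs.
@dataclass
class ChatConfig:
    chat_model_name: str = "gpt-4-1106-preview"
    # Normally a cheaper model takes over when the primary model is rate limited,
    # but for these eval runs both are pinned to the same model.
    fallback_model_name: str = "gpt-4-1106-preview"
```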
| idx | eval | answer_correctness_result (%) | comment | LLM Judge |
|---|---|---|---|---|
| 1 | baseline | 72.45 | wandbot v1.1 baseline with gpt-4-1106-preview chat model and text-embedding-ada-002 embedding model. | gpt-4-1106-preview |
| 2 | feat: use new openai models in chat and query enhancer (a3867a5) | 65.31 | chat model changed to gpt-4-0125-preview, and query enhancer fallback model changed to gpt-3.5-turbo-1106 | gpt-4-1106-preview |
| 3 | (same as above) changed query enhancer fallback model back to gpt-4-1106-preview | 63.27 | in absolute terms, only one sample was graded differently by the evaluator. | gpt-4-1106-preview |
| 4 | (same as above) gpt-4-0125-preview as LLM judge | 59.18 | This model graded more 2s. | gpt-4-0125-preview |
| 5 | baseline evaluated by gpt-4-0125-preview | 54.08 | Grading with the supposedly better model drops the accuracy. | gpt-4-0125-preview |
| 6 | feat: use new models and embeddings model (7d6a187) | 57.14 | Testing new commit 7d6a187, which introduced a bunch of changes documented below. Twice as many generations are scored 1 compared to the baseline. | gpt-4-1106-preview |
| 7 | (same as 6) changed chunk size back to 512 | 69.39 | Changed the chunk size to 512 and the filter threshold to 10 | gpt-4-1106-preview |
| 8 | (same as 6-7) changed initial_top_k back to 15 and top_k back to 5 | 57.14 | This reduced the available context, resulting in worse performance. | gpt-4-1106-preview |
| 9 | (same as 6) changed embedding model to text-embedding-ada-002 | 62.24 | Tried the old embedding model but still saw degraded performance. | gpt-4-1106-preview |
| 10 | refactor: move retriever to separate module. (ac58107 + c027dde + 0e231b9 + 50aadb5) | 59.18 | The commits added a FusionRetriever and did some reformatting. The retriever was removed in later commits, so we are not digging into it. | gpt-4-1106-preview |
- Bold results indicate an accepted change.
- Italic results indicate that a commit was made; we might or might not accept that change. If we do, the result will be both italic and bold.
feat/v1.1 Eval Result
- Reproduced the old result of 72% answer correctness.
- While this 72% comes from the True/False (correct answer/not correct answer) judgment of the judge LLM, we have also asked the judge LLM to grade each generation on a scale of 1 to 3, where 1 is a bad generation and 3 is a good one. The number of eval samples graded 1-3 matches our old results (see the aggregation sketch below).
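For clarity, the reported metric and the grade distribution can be derived from the per-sample judge outputs along these lines (assuming the illustrative judge output format sketched earlier):

```python
from collections import Counter

# `judgments` is assumed to be a list of dicts like the illustrative judge output,
# e.g. [{"correct": True, "grade": 3}, {"correct": False, "grade": 1}, ...]
def summarize(judgments: list[dict]) -> None:
    correct = sum(1 for j in judgments if j["correct"])
    accuracy = 100 * correct / len(judgments)
    grades = Counter(j["grade"] for j in judgments)
    print(f"answer_correctness_result: {accuracy:.2f}%")
    print(f"grade distribution (1=bad, 3=good): {dict(sorted(grades.items()))}")
```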
feat: use new openai models in chat and query enhancer (a3867a5)
- Chat model gpt-4-1106-preview -> gpt-4-0125-preview
- Query Enhancer's fallback model changed from gpt-4-1106-preview -> gpt-3.5-turbo-1106
Performance dropped from 72% to 65%.
Let's evaluate again with the same Query Enhancer fallback model (gpt-4-1106-preview).
Is there a bias from the LLM judge?
Let's evaluate with gpt-4-0125-preview.
This model graded more responses with a 2.
The chat model used here was gpt-4-0125-preview.
Switching the chat model back to gpt-4-1106-preview while still evaluating with gpt-4-0125-preview dropped the accuracy even more.
Decision: For now, we are discarding the changes made in commit a3867a5, i.e. sticking with gpt-4-1106-preview for both the chat model and the LLM judge.
There is a subreddit thread where redditors complain that gpt-4-0125-preview messes with their RAG pipelines and is worse at code completion than gpt-4-1106-preview.
Redditors suggest tweaking the prompt to get better results with the 0125 variant. Looking at the grade of 2 given to 40 generated answers, I think the eval prompt needs some modification, but let's not touch it for now.
feat: use new models and embeddings model (7d6a187)
Let's check out the next commit. I have reverted the previous commit (a3867a5) simply because it only introduced different models and the eval scores dropped.
This commit introduces a few changes (summarized in the config sketch after this list):
- It introduces the text-embedding-3-small embedding model with 512 embedding dimensions. Because of this, the ingestion pipeline was run again. Here's the index: https://wandb.ai/wandbot/wandbot-dev/artifacts/storage_context/wandbot_index/v22/overview
- Earlier, 15 chunks were retrieved, re-ranked, and the top 5 were selected. This commit instead retrieves 10 chunks and keeps all 10 after re-ranking, so more context is passed to the LLM.
- During ingestion, the chunk size was changed from 512 to 384, and the filter for small nodes was changed from size 10 or less to 5 or less.
- A slight change in the YOU.com retrieval logic. Earlier, the snippets returned by YOU.com were concatenated with a newline delimiter and stored as a single tuple of concatenated snippets and metadata. Now each snippet is stored with its metadata as its own tuple, with no concatenation, so more YOU.com context is returned.
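To summarize the before/after configuration in one place (hypothetical key names; the actual wandbot config keys may differ):

```python
# Hypothetical key names; the actual wandbot config keys may differ.
OLD_SETTINGS = {
    "embedding_model": "text-embedding-ada-002",
    "chunk_size": 512,        # tokens per chunk at ingestion time
    "min_node_size": 10,      # nodes of this size or smaller are filtered out
    "initial_top_k": 15,      # chunks retrieved before re-ranking
    "top_k": 5,               # chunks kept after re-ranking
}

NEW_SETTINGS = {
    "embedding_model": "text-embedding-3-small",  # 512-dim embeddings
    "chunk_size": 384,
    "min_node_size": 5,
    "initial_top_k": 10,
    "top_k": 10,
}
```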
Let's first change the chunk size back to 512 (and the node filter threshold back to 10), run the ingestion pipeline again, and evaluate with the new vector index.
So most of the change in the eval score came from the chunk size. Switching it back to 512 did the trick: 4 eval samples were graded 2 instead of 3, hence the drop from 72 to 69. Nothing too serious.
Let's now change the number of initially retrieved chunks back to 15 from 10 and the number of final chunks back to 5 from 10. In the current setting, this dropped the performance, maybe because of poorer context (the number of final chunks is just 5).
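The flow being tuned here is retrieve-then-rerank. A rough sketch, where `index` and `reranker` are hypothetical stand-ins for the actual retriever and re-ranker components in wandbot:

```python
# Rough sketch of the retrieve-then-rerank flow; `index` and `reranker` are
# hypothetical stand-ins for the actual retriever and re-ranker components.
def retrieve_context(query: str, index, reranker,
                     initial_top_k: int = 15, top_k: int = 5) -> list[str]:
    # 1. Pull a broad candidate set from the vector index.
    candidates = index.similarity_search(query, k=initial_top_k)
    # 2. Re-rank candidates against the query and keep only the best top_k.
    reranked = reranker.rank(query, candidates)
    return reranked[:top_k]
```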
How about changing the embedding model back to text-embedding-ada-002 from text-embedding-3-small? The best eval result so far used the old embedding model. However, the changes made by commit 7d6a187 work better with the text-embedding-3-small model.
Decision: Updating the embedding model from text-embedding-ada-002 to text-embedding-3-small. It does not beat the baseline; there is a ~3% drop in accuracy (69.39%), but in absolute terms that is just one bad generation as per the LLM judge.
text-embedding-3-small has some pros over ada-002 in speed and memory footprint.
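One concrete difference: text-embedding-3-small supports shortened embeddings via the `dimensions` parameter of the OpenAI embeddings API, which is presumably how the 512-dim index above was built. The input string below is just an example query.

```python
from openai import OpenAI

client = OpenAI()

# text-embedding-3-small supports shortened embeddings via `dimensions`;
# the index above uses 512-dim vectors. The input is an illustrative query.
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["How do I resume a crashed wandb run?"],
    dimensions=512,
)
vector = resp.data[0].embedding  # len(vector) == 512
```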
chore: add simple multi-index query engine with router (c027dde)
This commit was followed by "refactor: move retriever to separate module (ac58107)", which just refactors the retriever into its own module.
- Tried to build one index per doc source.
- The memory requirements were too high, so it did not work.
- Reverted from multiple indexes back to a single index.
- Thus, the FusionRetriever is not working as intended; it was removed in later commits (a generic sketch of the fusion idea follows below for context).
For the sake of keeping a record, we evaluated this system anyway and got the same score as eval 4. However, compared to eval 4, more eval samples were graded 1.
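For context only (the wandbot-specific implementation was removed), fusion retrieval typically merges the ranked lists coming from several retrievers, e.g. one per doc source, with reciprocal rank fusion. A minimal, generic sketch:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids using reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Higher fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Example: merge candidates from two per-source indexes (illustrative doc ids).
# fused = reciprocal_rank_fusion(
#     [["guides/track.md", "ref/init.md"], ["ref/init.md", "guides/sweeps.md"]]
# )
```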