
wandbot v1.3 vs v1.2: Debugging Eval Accuracy

Created on January 6|Last edited on January 20

Baseline v1.2

Correctness: ~72-73%
Commit: 8ec557ee727162da52375eff1120bab3975d984e
Eval: Weave eval
wandbot outputs: generated between 1:15pm and 3:15pm on Jan 6th


Baseline v1.2 again



v1.3

Debugging outputs, step by step

Jan 6th, 2024

Jan 12th, 2024 - Is the e2b retriever repro 3.10 consistent? - Yes

Jan 14th, 2024 - Does the e2b retriever repro 3.10 match the 3.10 baseline eval? - No

Jan 19th, 2024 - Is it because a different chroma index was being used? - No, we have a context duplication problem (i.e. retrieved contexts are not de-duplicated)
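
As an illustration of what de-duplicating retrieved contexts could look like, here is a minimal sketch. It is not wandbot's actual code: the function names are hypothetical, and it assumes each context is a plain string and that two contexts count as duplicates when they match after whitespace and case normalization.

```python
from typing import Iterable, List


def normalize_context(text: str) -> str:
    """Collapse runs of whitespace and lowercase, so trivially different copies compare equal."""
    return " ".join(text.split()).lower()


def dedupe_contexts(contexts: Iterable[str]) -> List[str]:
    """Return contexts with duplicates removed, keeping the first occurrence and the original order.

    Hypothetical helper: assumes contexts are plain strings, not retriever document objects.
    """
    seen = set()
    unique = []
    for context in contexts:
        key = normalize_context(context)
        if key not in seen:
            seen.add(key)
            unique.append(context)
    return unique
```

Keeping the first occurrence preserves the retriever's ranking order, which matters if only the top few contexts are passed on to the response step.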

Jan 20th, 2024 - Find missing contexts from golden answer, did they get retrieved?
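
One way to frame this check is sketched below, under loud assumptions: the contexts the golden answer relies on and the retrieved contexts are both available as lists of plain strings, and a match means equality after whitespace and case normalization. The function names are hypothetical and not part of wandbot or Weave.

```python
from typing import Iterable, List


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so formatting differences do not hide a match."""
    return " ".join(text.split()).lower()


def find_missing_contexts(golden: Iterable[str], retrieved: Iterable[str]) -> List[str]:
    """Return the golden contexts that do not appear among the retrieved contexts.

    Hypothetical helper: exact match after normalization is an assumption;
    a real check might compare source URLs or use fuzzy matching instead.
    """
    retrieved_keys = {_normalize(r) for r in retrieved}
    return [g for g in golden if _normalize(g) not in retrieved_keys]


def golden_recall(golden: Iterable[str], retrieved: Iterable[str]) -> float:
    """Fraction of golden contexts that were retrieved (1.0 when there are no golden contexts)."""
    golden_list = list(golden)
    if not golden_list:
        return 1.0
    missing = find_missing_contexts(golden_list, retrieved)
    return 1.0 - len(missing) / len(golden_list)
```

A non-empty result from `find_missing_contexts` would point at retrieval as the place where the needed context was lost, while an empty result would suggest looking downstream of retrieval.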