wandbot v1.3 vs 1.2 Debugging Eval Accuracy

Created on January 6|Last edited on January 20
﻿
Baseline v1.2
Correctness: ~72-73%
Commit: 8ec557ee727162da52375eff1120bab3975d984e
Eval: Weave eval
wandbot outputs: after Jan 6th, 1:15pm, before Jan 6th, 3:15pm
﻿
Baseline v1.2 again
Ensuring that chroma_index:v34 was being used
wandbot outputs: after Jan 19th, 5pm, before Jan 19th, 6pm
Eval: Weave eval
﻿
v1.3
Debugging outputs, step by step
Jan 6th, 2024
Jan 12th, 2024 - Is the e2b retriever repro 3.10 consistent? - Yes
Jan 14th, 2024 - Does the e2b retriever repro 3.10 match the 3.10 baseline eval? - No
Jan 19th, 2024 - Is it because a different chroma index was being used? - No, we have a context duplication problem (aka lack of de-dup)
Jan 20th, 2024 - Find missing contexts from golden answer, did they get retrieved?
For a single query, looking at the baseline's retrieved indexes to try to reproduce it:
query: concurrent writes
﻿weave trace﻿
IDs and documents: https://docs.google.com/spreadsheets/d/1983Sh6sgaN6UsafR2TvgD2bKeZOuaAAR-5_qv8Gh71M/edit?gid=0#gid=0﻿
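A minimal sketch of the two checks from the Jan 19th and Jan 20th entries: de-duplicating retrieved contexts, then listing which golden-answer contexts were never retrieved. This is not wandbot's actual code. The function names, the `id`/`text` dict layout, and the sample data are all assumptions made for illustration.

```python
def dedupe_contexts(contexts):
    """Drop chunks whose normalized text was already seen, keeping first-seen order.

    Assumes each context is a dict with "id" and "text" keys (hypothetical layout).
    """
    seen = set()
    unique = []
    for ctx in contexts:
        # Normalize whitespace and case so near-identical copies collapse to one key.
        key = " ".join(ctx["text"].split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(ctx)
    return unique


def missing_golden_ids(golden_ids, retrieved):
    """Return the golden context IDs that do not appear among the retrieved contexts."""
    retrieved_ids = {ctx["id"] for ctx in retrieved}
    return [gid for gid in golden_ids if gid not in retrieved_ids]


# Placeholder data, not real wandbot output.
retrieved = [
    {"id": "doc-1", "text": "chunk A"},
    {"id": "doc-1", "text": "chunk  a"},  # duplicate after normalization
    {"id": "doc-2", "text": "chunk B"},
]
unique = dedupe_contexts(retrieved)
missing = missing_golden_ids(["doc-1", "doc-3"], unique)
print([c["id"] for c in unique], missing)
```

Running the same comparison on the baseline and v1.3 retrieval outputs for one query (e.g. "concurrent writes") would show whether a golden context is missing in only one of them.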
﻿
﻿