Sweep/tuning.

Are our runs reasonable?

Created on March 14 | Last edited on March 14


We did not cherry-pick runs, but iterating during development could still have produced results that are overfit. We sweep to show that this is not the case.
This best-of aggregation is a fairer demonstration (than, e.g., a box plot of average performance) that the NDT-2.32 runs are actually better, since they are not as clearly converged as the other runs. The NDT-2.32 runs do appear more brittle, however.
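
As a rough illustration of why best-of and average aggregation can disagree, here is a minimal sketch. The group names and loss values are made-up placeholders, not numbers from these sweeps, and `summarize_by_group` is a hypothetical helper rather than anything used in this report.

```python
from collections import defaultdict
from statistics import mean

# Placeholder (group, final eval_loss) pairs. These values are invented
# purely for illustration and do not come from the sweeps in this report.
runs = [
    ("group_a", 0.50), ("group_a", 0.51), ("group_a", 0.52),
    ("group_b", 0.45), ("group_b", 0.60), ("group_b", 0.70),
]

def summarize_by_group(runs):
    """Return best (minimum), mean, and worst (maximum) loss per group."""
    by_group = defaultdict(list)
    for group, loss in runs:
        by_group[group].append(loss)
    return {
        group: {"best": min(losses), "mean": mean(losses), "worst": max(losses)}
        for group, losses in by_group.items()
    }

summary = summarize_by_group(runs)
for group, stats in sorted(summary.items()):
    print(group, stats)
```

In this toy data, `group_b` has the lower best-of loss but the higher mean and a much wider best-to-worst gap, which is the pattern described above: a group can win under best-of aggregation while still looking brittle.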
﻿
﻿
[Plot: eval_loss vs. epoch, runs grouped by tag: time_pre-sweep-base_v2, f32_pre-sweep-base_v2, time-sweep-base_v2, stitch-sweep-base_v2, single_f8-sweep-base_v2, f32-sweep-base_v2. Run set: 93 runs.]

cf. runs used in arch/base.
﻿
[Run set: 4 runs.]
﻿
﻿
