Sweep/tuning.
Are our runs reasonable?
Created on March 14|Last edited on March 14
We did not cherry-pick runs, but iterating during development could still have overfit our results to particular settings. We run a sweep to show that this is not the case.
This best-of aggregation is a fairer demonstration (than, e.g., a box plot of average performance) that the NDT-2.32 runs are actually better, since they are not as clearly converged as the other runs. The NDT-2.32 runs do appear more brittle, however.
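As a minimal sketch of the difference between best-of and average aggregation, the snippet below groups sweep runs by variant and reports both statistics. The function name `aggregate_runs`, the variant labels, and the scores are all hypothetical and made up for illustration; they are not results from this report. The sketch also assumes a metric where higher is better.

```python
from collections import defaultdict
from statistics import mean


def aggregate_runs(runs, higher_is_better=True):
    """Group (variant, score) pairs and report best-of and mean per variant."""
    by_variant = defaultdict(list)
    for variant, score in runs:
        by_variant[variant].append(score)
    # Best-of picks the single strongest run; the mean is dragged down by
    # runs that diverged or had not yet converged.
    pick = max if higher_is_better else min
    return {
        variant: {"best": pick(scores), "mean": mean(scores), "n": len(scores)}
        for variant, scores in by_variant.items()
    }


# Hypothetical scores, for illustration only (assumption: higher is better).
example_runs = [
    ("variant_a", 0.50), ("variant_a", 0.25), ("variant_a", 0.75),
    ("variant_b", 0.50), ("variant_b", 0.50),
]
summary = aggregate_runs(example_runs)
for variant, stats in summary.items():
    print(variant, stats)
```

In this toy data the two variants have the same mean, but `variant_a` has the higher best-of score and the wider spread, which is the kind of pattern a box plot of averages would hide.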
Run set (93 runs)
cf. runs used in arch/base.
Run set (4 runs)