Benchmark on Splash Attention

Created on May 24 | Last edited on May 25
Comparing Splash Attention and Flash Attention:
Training throughput improves from 2.8M tokens/sec with Flash Attention to 3.2M tokens/sec with Splash Attention.
The training and eval loss curves of the two runs match closely. Numerically, the Splash Attention run even appears slightly better.
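To put the two throughput figures in relative terms, the gain can be computed directly. This is a minimal sketch: the helper name `relative_speedup` is hypothetical, and it assumes the 2.8M figure is the Flash Attention baseline and the 3.2M figure is the Splash Attention run.

```python
def relative_speedup(baseline_tokens_per_sec: float, new_tokens_per_sec: float) -> float:
    """Return the fractional throughput gain of the new run over the baseline."""
    if baseline_tokens_per_sec <= 0:
        raise ValueError("baseline throughput must be positive")
    return new_tokens_per_sec / baseline_tokens_per_sec - 1.0


# Throughput figures quoted in this report, in tokens/sec.
# Assumption: 2.8M is the Flash Attention run, 3.2M is the Splash Attention run.
BASELINE_TOKENS_PER_SEC = 2.8e6
NEW_TOKENS_PER_SEC = 3.2e6

gain = relative_speedup(BASELINE_TOKENS_PER_SEC, NEW_TOKENS_PER_SEC)
print(f"Relative throughput gain: {gain:.1%}")  # prints "Relative throughput gain: 14.3%"
```

Under that assumption, the change amounts to roughly a 14% increase in training throughput.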
Section 1
This set of panels contains runs from a private project, which cannot be shown in this report