
Exploring Collation methods: Different ways to construct a batch

Maybe packing is not the best idea after all...
Created on November 4 | Last edited on November 17


Different collation options explored here:
  • Standard Pack: treat the whole dataset as a single stream of text and cut it every max_seq_len tokens. This can split an instruction across two sequences.
  • Masked Pack: pack exactly as before, but set the instruction tokens' labels to -100 so the cross-entropy loss doesn't backpropagate through them.
  • Wayde had the idea of just padding, so we never end up with a truncated instruction at the beginning or end of a sequence. Just that: concatenate whole examples up to max_seq_len and then pad the rest of the row (a minimal sketch of all three collators follows this list):
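Here is a minimal PyTorch sketch of the three variants, assuming each example has already been tokenized into `prompt_ids` and `response_ids` lists; the function names, `max_seq_len`, and `pad_token_id` are illustrative placeholders, not the actual training code behind these runs.

```python
import torch

IGNORE_INDEX = -100  # cross-entropy ignores labels set to this value


def standard_pack(examples, max_seq_len):
    """Treat everything as one token stream and cut every max_seq_len tokens.
    Instructions can end up split across two rows."""
    stream = [t for ex in examples for t in ex["prompt_ids"] + ex["response_ids"]]
    n = (len(stream) // max_seq_len) * max_seq_len           # drop the ragged tail
    input_ids = torch.tensor(stream[:n]).view(-1, max_seq_len)
    return {"input_ids": input_ids, "labels": input_ids.clone()}


def masked_pack(examples, max_seq_len):
    """Pack as above, but give instruction tokens label -100 so the loss
    only backprops through response tokens."""
    ids, labels = [], []
    for ex in examples:
        ids += ex["prompt_ids"] + ex["response_ids"]
        labels += [IGNORE_INDEX] * len(ex["prompt_ids"]) + ex["response_ids"]
    n = (len(ids) // max_seq_len) * max_seq_len
    return {"input_ids": torch.tensor(ids[:n]).view(-1, max_seq_len),
            "labels": torch.tensor(labels[:n]).view(-1, max_seq_len)}


def pad_pack(examples, max_seq_len, pad_token_id):
    """Concatenate whole examples into a row until the next one would not fit,
    then pad the row to max_seq_len instead of splitting the example."""
    rows, row = [], []
    for ex in examples:
        seq = ex["prompt_ids"] + ex["response_ids"]
        if row and len(row) + len(seq) > max_seq_len:
            rows.append(row)
            row = []
        row = row + seq[:max_seq_len]        # a single over-long example is still clipped
    if row:
        rows.append(row)
    input_ids = torch.full((len(rows), max_seq_len), pad_token_id, dtype=torch.long)
    attention_mask = torch.zeros_like(input_ids)
    labels = torch.full_like(input_ids, IGNORE_INDEX)        # no loss on padding
    for i, r in enumerate(rows):
        input_ids[i, :len(r)] = torch.tensor(r)
        attention_mask[i, :len(r)] = 1
        labels[i, :len(r)] = torch.tensor(r)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```

The padded variant wastes some compute on padding tokens, but no instruction ever gets cut in half at a row boundary.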

💡 Check the attention mask when masking!
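Label masking and attention masking are independent, which is easy to get wrong. A quick sanity check, reusing the hypothetical `pad_pack` sketch above: padding should be excluded from attention, while real tokens stay attended even when their labels are masked out of the loss.

```python
# Toy examples just for the check; token ids are arbitrary.
examples = [{"prompt_ids": [1, 2, 3], "response_ids": [4, 5]},
            {"prompt_ids": [6, 7], "response_ids": [8, 9, 10]}]
batch = pad_pack(examples, max_seq_len=8, pad_token_id=0)

# Every padded position is ignored by both attention and the loss.
pad_positions = batch["input_ids"] == 0
assert (batch["attention_mask"][pad_positions] == 0).all()
assert (batch["labels"][pad_positions] == IGNORE_INDEX).all()

# Every real token is still attended, even if its label were masked with -100.
assert (batch["attention_mask"][~pad_positions] == 1).all()
```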

[Chart: 4 runs plotted against Step]