Exploring Collation Methods: Different Ways to Construct a Batch
Maybe packing is not the best idea after all...

Different collation options considered here:
- Standard Pack: treat everything as one stream of text and cut at max_seq_len. This can split an instruction across two sequences (first sketch below).
- Masked Pack: pack as before, but set the instruction tokens' labels to -100 so the cross-entropy loss does not backpropagate through them (second sketch below).
- Padded Pack (Wayde's idea): avoid truncated instructions at the beginning/end of a sequence. Just concatenate whole examples up to max_seq_len and then pad the remainder (third sketch below).
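
A minimal sketch of the standard pack, assuming examples arrive already tokenized as lists of IDs; the function name `pack_standard` and the chunking details are illustrative, not the report's actual code:

```python
from typing import Iterable, List

def pack_standard(examples: Iterable[List[int]], max_seq_len: int) -> List[List[int]]:
    """Concatenate every tokenized example into one stream and cut it
    every max_seq_len tokens. No padding is wasted, but a cut can land
    mid-example and split an instruction across two training sequences."""
    stream: List[int] = []
    for ids in examples:
        stream.extend(ids)
    # Drop the final partial chunk so every sequence has the same length.
    return [
        stream[i : i + max_seq_len]
        for i in range(0, len(stream) - max_seq_len + 1, max_seq_len)
    ]
```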
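
The masked pack builds on the same stream, but this sketch assumes each example comes as a `(prompt_ids, response_ids)` pair so the instruction part can be labeled with -100, which is PyTorch's `CrossEntropyLoss` default `ignore_index`; `pack_masked` is a hypothetical name:

```python
def pack_masked(examples, max_seq_len):
    """Pack exactly like pack_standard, but carry a parallel label stream in
    which instruction (prompt) tokens are set to -100 (PyTorch's
    CrossEntropyLoss ignore_index), so they contribute no gradient."""
    input_stream, label_stream = [], []
    for prompt_ids, response_ids in examples:
        input_stream += prompt_ids + response_ids
        label_stream += [-100] * len(prompt_ids) + response_ids
    return [
        {"input_ids": input_stream[i : i + max_seq_len],
         "labels": label_stream[i : i + max_seq_len]}
        for i in range(0, len(input_stream) - max_seq_len + 1, max_seq_len)
    ]
```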
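
And a sketch of the padded variant, again with hypothetical names (`pack_padded`, `pad_token_id`): whole examples are concatenated until the next one would overflow `max_seq_len`, and the remainder is padded instead of split:

```python
import torch

def pack_padded(examples, max_seq_len, pad_token_id):
    """Concatenate whole (prompt_ids, response_ids) examples until the next
    one would overflow max_seq_len, then pad the remainder instead of
    splitting an instruction across two sequences."""
    out, cur_ids, cur_labels = [], [], []

    def flush():
        n_pad = max_seq_len - len(cur_ids)
        out.append({
            "input_ids": torch.tensor(cur_ids + [pad_token_id] * n_pad),
            "labels": torch.tensor(cur_labels + [-100] * n_pad),  # no loss on padding
            "attention_mask": torch.tensor([1] * len(cur_ids) + [0] * n_pad),
        })

    for prompt_ids, response_ids in examples:
        ids = prompt_ids + response_ids
        labels = [-100] * len(prompt_ids) + response_ids
        ids, labels = ids[:max_seq_len], labels[:max_seq_len]  # truncate oversize examples
        if cur_ids and len(cur_ids) + len(ids) > max_seq_len:
            flush()
            cur_ids, cur_labels = [], []
        cur_ids += ids
        cur_labels += labels
    if cur_ids:
        flush()
    return out
```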
💡 Check the attention mask when masking!
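
To make that note concrete with the hypothetical `pack_padded` above: tokens whose labels are masked to -100 still get attention_mask 1 (the model attends to them, it just takes no loss on them), while padded positions get both attention_mask 0 and label -100:

```python
batch = pack_padded([([1, 2], [3, 4])], max_seq_len=6, pad_token_id=0)[0]
print(batch["attention_mask"].tolist())  # [1, 1, 1, 1, 0, 0]
print(batch["labels"].tolist())          # [-100, -100, 3, 4, -100, -100]
```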