
Sampler Experiments

Explore the performance of selecting random samples from the same chunk, compared to standard random/sequential sample selection, on the unbalanced (full) GQA questions training split.
Created on August 15|Last edited on August 17

Performance

The main optimisation that the ChunkedRandomSampler class makes over the default torch.utils.data.RandomSampler class is that it takes the chunked structure of underlying files into account. Where RandomSampler simply returns a permutation of indices over the entire dataset, a ChunkedRandomSampler instance fetches data about the length of each chunk, permutes indices for each chunk separately and then permutes the order of the chunks.
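The two-level permutation described above can be sketched as follows. This is a minimal, framework-free illustration, not the actual implementation: the `chunk_sizes` constructor parameter and the assumption that global indices are contiguous per chunk are mine, and a real version would subclass `torch.utils.data.Sampler`.

```python
import random

class ChunkedRandomSampler:
    """Yield dataset indices so that all samples from one chunk are
    visited (in random order) before moving on to the next chunk.

    `chunk_sizes` holds the number of samples in each chunk, in file
    order; global indices are assumed contiguous within each chunk.
    """

    def __init__(self, chunk_sizes, seed=None):
        self.chunk_sizes = list(chunk_sizes)
        self.rng = random.Random(seed)

    def __iter__(self):
        # Starting global index of each chunk.
        starts, total = [], 0
        for size in self.chunk_sizes:
            starts.append(total)
            total += size
        # Permute the order of the chunks...
        chunk_order = list(range(len(self.chunk_sizes)))
        self.rng.shuffle(chunk_order)
        for c in chunk_order:
            # ...then permute indices within each chunk separately.
            indices = list(range(starts[c], starts[c] + self.chunk_sizes[c]))
            self.rng.shuffle(indices)
            yield from indices

    def __len__(self):
        return sum(self.chunk_sizes)
```

Every index is still visited exactly once per epoch; only the guarantee of chunk-contiguous access is added on top of plain random sampling.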

For the unbalanced GQA training set, question data is spread across 10 .json files, each containing over 1.1 GiB of data. Loading these into memory takes up even more space, so it is infeasible to have all chunks loaded at once, and caching techniques are needed. Assuming we can only hold one of these files in memory at a time, a simple RandomSampler would require loading data from disk to memory n times in the worst case, where n ≈ 14,000,000 is the number of question samples. The ChunkedRandomSampler ensures that the cache is hit every time except for the first batch of each chunk, reducing the number of file reads to 10. This is as efficient as torch.utils.data.SequentialSampler, with only the slight overhead of computing random permutations for chunks, as seen in the graphs below.

Caveats

Whilst loading data in the main process is faster, as seen in the graphs below, these results were collected without any model training overhead, i.e. each batch is thrown away as soon as it has been loaded. In practice, having at least one separate process for data loading would be beneficial when training models, so the next batch can be prepared in the background while the current batch is being processed. Keep in mind that each process keeps its own dataset cache, so using too many data loading processes will result in significant memory overheads.
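The background-preparation idea can be illustrated with a minimal thread-based prefetcher. This is only a sketch of the pattern: torch's DataLoader achieves the same effect with worker *processes* (via `num_workers`), and each such process holds its own copy of the dataset, hence its own chunk cache.

```python
import queue
import threading

def prefetch(batches, buffer_size=2):
    """Iterate over `batches`, preparing up to `buffer_size` batches
    in a background thread while the consumer processes the current
    one. A toy analogue of DataLoader's background workers."""
    q = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def worker():
        for batch in batches:
            q.put(batch)             # blocks when the buffer is full
        q.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is done:
            return
        yield batch
```

Because `buffer_size` bounds the queue, the background worker never runs arbitrarily far ahead of the consumer, which keeps memory use predictable.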



