Tokenizer Comparison
Fineweb-EDU 1.4b
Tokenizers:
* llama2 (~32k)
* llama3 (~128k)
* neox (~50k)
David Leo Wright Hall
Created on November 19 | Last edited on May 12
Standard Panels
[Figure: c4en/bpb vs log(tokens). x-axis: throughput/total_tokens (log scale, ticks from 5G to 40G); y-axis: bpb (ticks from 1 to 1.15). Runs: llama2-tokenizer-a506d9, neox-tokenizer-ad549d, llama3-tokenizer-095cea.]
[Figure: internal_eval/core/bpb. x-axis: Step (ticks from 0 to 8k); y-axis: internal_eval/core/bpb (ticks from 0.3 to 0.6). Runs: llama2-tokenizer-a506d9, neox-tokenizer-ad549d, llama3-tokenizer-095cea.]
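Both panels report bpb, which is assumed here to mean bits per byte. The report does not define the metric, so the sketch below is only an illustration of why a per-byte measure suits a tokenizer comparison: per-token loss depends on how many tokens a tokenizer needs for a given text, while bits per byte normalizes by the size of the underlying text. The function name and the example numbers are hypothetical and are not taken from the runs above.

```python
import math


def bits_per_byte(mean_loss_nats_per_token: float, num_tokens: int, num_bytes: int) -> float:
    """Convert a mean per-token cross-entropy (in nats) to bits per byte.

    Assumes the loss is averaged over `num_tokens` tokens and that those
    tokens encode text that is `num_bytes` bytes long.
    """
    total_nats = mean_loss_nats_per_token * num_tokens
    total_bits = total_nats / math.log(2)  # nats -> bits
    return total_bits / num_bytes


# Hypothetical numbers: the same 400-byte text under two tokenizers.
# A larger vocabulary yields fewer tokens, each carrying more loss.
coarse = bits_per_byte(mean_loss_nats_per_token=2.0, num_tokens=100, num_bytes=400)
fine = bits_per_byte(mean_loss_nats_per_token=1.0, num_tokens=200, num_bytes=400)
```

In this made-up case the per-token losses differ by a factor of two, yet both runs give the same bits per byte, because the total loss over the text is identical.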