Top scoring URLs using MEDU by MMLU Subset
Created on March 17|Last edited on March 17
We take the top 10% of URLs from each MEDU subset (~40B tokens) and then look at the top-scoring URLs for each MMLU subset, grouped by domain. Several domains recur across subsets, which reinforces our previous beliefs about what high-quality data looks like.
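The selection and counting step described above can be sketched as follows. This is a minimal illustration, not the script that produced the histograms: the function names, the `(url, score)` input format, and the choice to strip a leading `www.` are all assumptions.

```python
from collections import Counter
from urllib.parse import urlparse


def top_fraction(scored_urls, fraction=0.10):
    """Keep the highest-scoring `fraction` of (url, score) pairs."""
    ranked = sorted(scored_urls, key=lambda pair: pair[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return [url for url, _ in ranked[:keep]]


def domain_of(url):
    """Return the host name without a leading 'www.' (an assumed normalization)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def domain_histogram(urls, top_n=100):
    """Return (domain, count, percentage of all URLs) for the top_n domains."""
    counts = Counter(domain_of(u) for u in urls)
    total = len(urls)
    return [(d, c, 100.0 * c / total) for d, c in counts.most_common(top_n)]
```

Calling `domain_histogram(top_fraction(scored_urls))` on one subset would give rows in the same shape as the tables in this report: domain, count, and share of all URLs processed.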
Commonly repeated high quality data (Unsurprising)
- Wikipedia
- StackExchange
- And StackExchange derivatives: superuser.com, askubuntu.com, serverfault.com (do we include these? seems sizable)
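To put a number on "sizable", the snippet below sums the three derivative domains using the counts from the Engineering histogram in this report. It is only a back-of-the-envelope check; the variable names are illustrative.

```python
# Counts copied from the Engineering histogram in this report.
total_urls = 29_993_850
derivatives = {
    "superuser.com": 245_357,
    "serverfault.com": 242_437,
    "askubuntu.com": 176_230,
}

combined = sum(derivatives.values())
share = 100.0 * combined / total_urls
print(f"{combined:,} URLs, {share:.2f}% of the subset")
# prints "664,024 URLs, 2.21% of the subset"
```

Taken together, these three domains are close to the 2.40% contributed by stackoverflow.com alone in the same subset.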
Repeated High Quality Data which we might not have known
Science / Engineering:
- physicsforums.com
- eurekalert.org
- reference.com
- scientificamerican.com
Humanities / Social Sciences:
- theguardian.com
- huffingtonpost.com
- nytimes.com
- slate.com
- csmonitor.com
MMLU Sciences
============================================================
Domain Count Histogram (Top 100)
Total URLs processed: 29400252
Unique domains found: 1950287
============================================================
Domain                                   | Count      | Percentage
-----------------------------------------+------------+-----------
en.wikipedia.org                         | 253,439    | 0.86%
math.stackexchange.com                   | 218,659    | 0.74%
mathoverflow.net                         | 170,383    | 0.58%
physicsforums.com                        | 169,595    | 0.58%
physics.stackexchange.com                | 96,259     | 0.33%
eurekalert.org                           | 78,624     | 0.27%
reference.com                            | 78,107     | 0.27%
phys.org                                 | 59,786     | 0.20%
scientificamerican.com                   | 55,934     | 0.19%
theguardian.com                          | 55,291     | 0.19%
redorbit.com                             | 49,342     | 0.17%
psychology.wikia.com                     | 46,866     | 0.16%
link.springer.com                        | 46,558     | 0.16%
abc.net.au                               | 40,974     | 0.14%
chegg.com                                | 40,781     | 0.14%
MMLU Social Sciences
============================================================
Domain Count Histogram (Top 100)
Total URLs processed: 27734433
Unique domains found: 1438967
============================================================
Domain                                   | Count      | Percentage
-----------------------------------------+------------+-----------
theguardian.com                          | 244,228    | 0.88%
theatlantic.com                          | 144,476    | 0.52%
en.wikipedia.org                         | 111,639    | 0.40%
nytimes.com                              | 105,615    | 0.38%
slate.com                                | 97,979     | 0.35%
huffingtonpost.com                       | 96,479     | 0.35%
csmonitor.com                            | 94,089     | 0.34%
fool.com                                 | 90,009     | 0.32%
economist.com                            | 83,183     | 0.30%
forbes.com                               | 75,086     | 0.27%
washingtonpost.com                       | 72,520     | 0.26%
businessinsider.com                      | 71,044     | 0.26%
townhall.com                             | 68,722     | 0.25%
lexology.com                             | 61,246     | 0.22%
articles.latimes.com                     | 58,623     | 0.21%
MMLU Humanities
============================================================
Domain Count Histogram (Top 100)
Total URLs processed: 28274260
Unique domains found: 1493473
============================================================
Domain                                   | Count      | Percentage
-----------------------------------------+------------+-----------
theguardian.com                          | 288,374    | 1.02%
en.wikipedia.org                         | 254,627    | 0.90%
theatlantic.com                          | 155,706    | 0.55%
nytimes.com                              | 128,814    | 0.46%
slate.com                                | 110,737    | 0.39%
huffingtonpost.com                       | 109,638    | 0.39%
enotes.com                               | 108,256    | 0.38%
csmonitor.com                            | 106,797    | 0.38%
economist.com                            | 83,243     | 0.29%
articles.latimes.com                     | 77,639     | 0.27%
washingtonpost.com                       | 74,301     | 0.26%
townhall.com                             | 59,479     | 0.21%
en.wikisource.org                        | 58,026     | 0.21%
aljazeera.com                            | 56,203     | 0.20%
thenation.com                            | 50,751     | 0.18%
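The Social Sciences and Humanities histograms share most of their leading domains. One rough way to quantify that is a Jaccard overlap of the two domain sets. The sketch below uses only the 15 rows listed for each subset in this report, so it says nothing about the rest of the top 100; the variable names are illustrative.

```python
# Domains listed in this report for each subset (first 15 rows only).
social_sciences = {
    "theguardian.com", "theatlantic.com", "en.wikipedia.org", "nytimes.com",
    "slate.com", "huffingtonpost.com", "csmonitor.com", "fool.com",
    "economist.com", "forbes.com", "washingtonpost.com", "businessinsider.com",
    "townhall.com", "lexology.com", "articles.latimes.com",
}
humanities = {
    "theguardian.com", "en.wikipedia.org", "theatlantic.com", "nytimes.com",
    "slate.com", "huffingtonpost.com", "enotes.com", "csmonitor.com",
    "economist.com", "articles.latimes.com", "washingtonpost.com",
    "townhall.com", "en.wikisource.org", "aljazeera.com", "thenation.com",
}

shared = social_sciences & humanities
jaccard = len(shared) / len(social_sciences | humanities)
print(f"{len(shared)} shared domains, Jaccard overlap {jaccard:.2f}")
# prints "11 shared domains, Jaccard overlap 0.58"
```

Eleven of the fifteen listed domains appear in both subsets, with only four domains unique to each.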
MMLU Engineering
============================================================
Domain Count Histogram (Top 100)
Total URLs processed: 29993850
Unique domains found: 1681894
============================================================
Domain                                   | Count      | Percentage
-----------------------------------------+------------+-----------
stackoverflow.com                        | 718,755    | 2.40%
superuser.com                            | 245,357    | 0.82%
serverfault.com                          | 242,437    | 0.81%
math.stackexchange.com                   | 208,419    | 0.69%
askubuntu.com                            | 176,230    | 0.59%
lists.w3.org                             | 143,169    | 0.48%
physicsforums.com                        | 142,370    | 0.47%
perlmonks.org                            | 126,113    | 0.42%
mathoverflow.net                         | 121,912    | 0.41%
slashdot.org                             | 116,195    | 0.39%
unix.stackexchange.com                   | 103,016    | 0.34%
programmers.stackexchange.com            | 100,646    | 0.34%
en.wikipedia.org                         | 100,379    | 0.33%
codedump.io                              | 94,448     | 0.31%
theregister.co.uk                        | 92,372     | 0.31%
mail-index.netbsd.org                    | 92,163     | 0.31%
apple.stackexchange.com                  | 85,890     | 0.29%
meta.stackexchange.com                   | 85,286     | 0.28%
MMLU Others
============================================================
Domain Count Histogram (Top 100)
Total URLs processed: 28557316
Unique domains found: 1543645
============================================================
Domain                                   | Count      | Percentage
-----------------------------------------+------------+-----------
en.wikipedia.org                         | 378,439    | 1.33%
theguardian.com                          | 264,041    | 0.92%
news.bbc.co.uk                           | 123,298    | 0.43%
reference.com                            | 121,237    | 0.42%
nytimes.com                              | 119,211    | 0.42%
csmonitor.com                            | 110,788    | 0.39%
theatlantic.com                          | 110,694    | 0.39%
articles.latimes.com                     | 106,945    | 0.37%
slate.com                                | 85,931     | 0.30%
aljazeera.com                            | 76,377     | 0.27%
huffingtonpost.com                       | 76,034     | 0.27%
businessinsider.com                      | 72,150     | 0.25%
Humanities / Social Sciences TODO(chris): look at the prompt used for this, the domains end up pretty similar