
Top scoring URLs using MEDU by MMLU Subset

Created on March 17|Last edited on March 17
We take the top 10% of URLs from each MEDU subset (~40B tokens) and then look at which domains the top-scoring URLs come from for each MMLU subset. There are some commonalities between the subsets, which reinforces our previous beliefs about high-quality data.
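A minimal sketch of the procedure described above, to make the histograms below easier to reproduce. It is not the actual pipeline: the input format (a list of `(url, score)` pairs), the function names `top_fraction` and `domain_histogram`, and the choice to strip a leading `www.` are all assumptions for illustration. It also assumes each URL includes a scheme such as `https://`.

```python
from collections import Counter
from urllib.parse import urlparse


def top_fraction(scored_urls, fraction=0.10):
    """Keep the URLs of the highest-scoring `fraction` of (url, score) pairs."""
    ranked = sorted(scored_urls, key=lambda pair: pair[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return [url for url, _ in ranked[:keep]]


def domain_histogram(urls, top_n=100):
    """Count URLs per host and return (domain, count, percentage) rows."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        # Assumption: "www.example.com" and "example.com" count as one domain.
        if host.startswith("www."):
            host = host[4:]
        if host:
            counts[host] += 1
    total = sum(counts.values())
    return [(d, c, 100.0 * c / total) for d, c in counts.most_common(top_n)]
```

Usage would look like `domain_histogram(top_fraction(scored_urls))`, run once per MMLU subset. Subdomains are kept separate here, which matches how the tables below list `en.wikipedia.org` and `math.stackexchange.com` as their own rows.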
Commonly repeated high quality data (Unsurprising)
  1. Wikipedia
  2. StackExchange
    1. And StackExchange derivatives: superuser.com, askubuntu.com, serverfault.com (do we include these? seems sizable)
Repeated high-quality data we might not have known about
Science / Engineering
  1. physicsforums.com
  2. eurekalert.org
  3. reference.com
  4. scientificamerican.com

Humanities / Social Sciences
  1. theguardian.com
  2. huffingtonpost.com
  3. nytimes.com
  4. slate.com
  5. csmonitor.com

MMLU Sciences
============================================================
Domain Count Histogram (Top 100)
Total URLs processed: 29400252
Unique domains found: 1950287
============================================================
Domain | Count | Percentage
-----------------------------------------+------------+-----------
en.wikipedia.org | 253,439 | 0.86%
math.stackexchange.com | 218,659 | 0.74%
mathoverflow.net | 170,383 | 0.58%
physicsforums.com | 169,595 | 0.58%
physics.stackexchange.com | 96,259 | 0.33%
eurekalert.org | 78,624 | 0.27%
reference.com | 78,107 | 0.27%
phys.org | 59,786 | 0.20%
scientificamerican.com | 55,934 | 0.19%
theguardian.com | 55,291 | 0.19%
redorbit.com | 49,342 | 0.17%
psychology.wikia.com | 46,866 | 0.16%
link.springer.com | 46,558 | 0.16%
abc.net.au | 40,974 | 0.14%
chegg.com | 40,781 | 0.14%
MMLU Social Sciences
============================================================
Domain Count Histogram (Top 100)
Total URLs processed: 27734433
Unique domains found: 1438967
============================================================
Domain | Count | Percentage
-----------------------------------------+------------+-----------
theguardian.com | 244,228 | 0.88%
theatlantic.com | 144,476 | 0.52%
en.wikipedia.org | 111,639 | 0.40%
nytimes.com | 105,615 | 0.38%
slate.com | 97,979 | 0.35%
huffingtonpost.com | 96,479 | 0.35%
csmonitor.com | 94,089 | 0.34%
fool.com | 90,009 | 0.32%
economist.com | 83,183 | 0.30%
forbes.com | 75,086 | 0.27%
washingtonpost.com | 72,520 | 0.26%
businessinsider.com | 71,044 | 0.26%
townhall.com | 68,722 | 0.25%
lexology.com | 61,246 | 0.22%
articles.latimes.com | 58,623 | 0.21%

MMLU Humanities
============================================================
Domain Count Histogram (Top 100)
Total URLs processed: 28274260
Unique domains found: 1493473
============================================================
Domain | Count | Percentage
-----------------------------------------+------------+-----------
theguardian.com | 288,374 | 1.02%
en.wikipedia.org | 254,627 | 0.90%
theatlantic.com | 155,706 | 0.55%
nytimes.com | 128,814 | 0.46%
slate.com | 110,737 | 0.39%
huffingtonpost.com | 109,638 | 0.39%
enotes.com | 108,256 | 0.38%
csmonitor.com | 106,797 | 0.38%
economist.com | 83,243 | 0.29%
articles.latimes.com | 77,639 | 0.27%
washingtonpost.com | 74,301 | 0.26%
townhall.com | 59,479 | 0.21%
en.wikisource.org | 58,026 | 0.21%
aljazeera.com | 56,203 | 0.20%
thenation.com | 50,751 | 0.18%
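As a rough way to quantify how similar the Social Sciences and Humanities lists are, here is a small sketch that compares the 15 rows shown in each table above. The domain sets are copied from those tables; using Jaccard similarity as the measure is my own choice, not something the original analysis used, and it only covers the displayed rows rather than the full top 100.

```python
# Domains copied from the 15 rows shown in each table above.
social_sciences = {
    "theguardian.com", "theatlantic.com", "en.wikipedia.org", "nytimes.com",
    "slate.com", "huffingtonpost.com", "csmonitor.com", "fool.com",
    "economist.com", "forbes.com", "washingtonpost.com", "businessinsider.com",
    "townhall.com", "lexology.com", "articles.latimes.com",
}
humanities = {
    "theguardian.com", "en.wikipedia.org", "theatlantic.com", "nytimes.com",
    "slate.com", "huffingtonpost.com", "enotes.com", "csmonitor.com",
    "economist.com", "articles.latimes.com", "washingtonpost.com",
    "townhall.com", "en.wikisource.org", "aljazeera.com", "thenation.com",
}

shared = social_sciences & humanities
# Jaccard similarity: size of the intersection over size of the union.
jaccard = len(shared) / len(social_sciences | humanities)

print(f"shared domains: {len(shared)} of 15")
print(f"jaccard similarity: {jaccard:.2f}")
print("only in Social Sciences:", sorted(social_sciences - humanities))
print("only in Humanities:", sorted(humanities - social_sciences))
```

This ignores rank and count, so it says nothing about whether the shared domains carry similar weight in each subset.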

MMLU Engineering
============================================================
Domain Count Histogram (Top 100)
Total URLs processed: 29993850
Unique domains found: 1681894
============================================================
Domain | Count | Percentage
-----------------------------------------+------------+-----------
stackoverflow.com | 718,755 | 2.40%
superuser.com | 245,357 | 0.82%
serverfault.com | 242,437 | 0.81%
math.stackexchange.com | 208,419 | 0.69%
askubuntu.com | 176,230 | 0.59%
lists.w3.org | 143,169 | 0.48%
physicsforums.com | 142,370 | 0.47%
perlmonks.org | 126,113 | 0.42%
mathoverflow.net | 121,912 | 0.41%
slashdot.org | 116,195 | 0.39%
unix.stackexchange.com | 103,016 | 0.34%
programmers.stackexchange.com | 100,646 | 0.34%
en.wikipedia.org | 100,379 | 0.33%
codedump.io | 94,448 | 0.31%
theregister.co.uk | 92,372 | 0.31%
mail-index.netbsd.org | 92,163 | 0.31%
apple.stackexchange.com | 85,890 | 0.29%
meta.stackexchange.com | 85,286 | 0.28%
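To put a number on how sizable the StackExchange family is in the Engineering table above, here is a small sketch that sums the rows belonging to the network. The counts are copied from that table. The list of standalone network domains (`STACK_EXCHANGE_STANDALONE`) is my own assumption about which sites to group, and the result only covers the rows displayed here, so it is a lower bound on the full share.

```python
# Rows copied from the Engineering table above.
ENGINEERING_TOTAL_URLS = 29_993_850
engineering_counts = {
    "stackoverflow.com": 718_755,
    "superuser.com": 245_357,
    "serverfault.com": 242_437,
    "math.stackexchange.com": 208_419,
    "askubuntu.com": 176_230,
    "lists.w3.org": 143_169,
    "physicsforums.com": 142_370,
    "perlmonks.org": 126_113,
    "mathoverflow.net": 121_912,
    "slashdot.org": 116_195,
    "unix.stackexchange.com": 103_016,
    "programmers.stackexchange.com": 100_646,
    "en.wikipedia.org": 100_379,
    "codedump.io": 94_448,
    "theregister.co.uk": 92_372,
    "mail-index.netbsd.org": 92_163,
    "apple.stackexchange.com": 85_890,
    "meta.stackexchange.com": 85_286,
}

# Assumption: network sites that do not live under *.stackexchange.com.
STACK_EXCHANGE_STANDALONE = {
    "stackoverflow.com", "superuser.com", "serverfault.com",
    "askubuntu.com", "mathoverflow.net",
}


def is_stack_exchange(domain):
    """True for *.stackexchange.com hosts and the standalone sites above."""
    return domain in STACK_EXCHANGE_STANDALONE or domain.endswith(".stackexchange.com")


network_total = sum(c for d, c in engineering_counts.items() if is_stack_exchange(d))
network_share = 100.0 * network_total / ENGINEERING_TOTAL_URLS
print(f"StackExchange network: {network_total:,} URLs ({network_share:.2f}% of total)")
```

Running the same grouping over the full histogram, not just the displayed rows, would give a better basis for deciding whether to include the derivative sites.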

MMLU Others
============================================================
Domain Count Histogram (Top 100)
Total URLs processed: 28557316
Unique domains found: 1543645
============================================================
Domain | Count | Percentage
-----------------------------------------+------------+-----------
en.wikipedia.org | 378,439 | 1.33%
theguardian.com | 264,041 | 0.92%
news.bbc.co.uk | 123,298 | 0.43%
reference.com | 121,237 | 0.42%
nytimes.com | 119,211 | 0.42%
csmonitor.com | 110,788 | 0.39%
theatlantic.com | 110,694 | 0.39%
articles.latimes.com | 106,945 | 0.37%
slate.com | 85,931 | 0.30%
aljazeera.com | 76,377 | 0.27%
huffingtonpost.com | 76,034 | 0.27%
businessinsider.com | 72,150 | 0.25%
Christopher Chou
Humanities / Social Sciences TODO(chris): look at the prompt used for this; the domains end up pretty similar.