Medical Data Anneal
A simple medical data anneal on ~500M unique tokens of medical QA data. We deliberately overtrain by epoching 4x over this dataset.
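Back-of-the-envelope token budget for the run (the ~500M figure is approximate):

```python
# Rough token budget for the anneal; the ~500M unique-token count is approximate.
unique_tokens = 500_000_000   # ~500M unique medical QA tokens
epochs = 4                    # 4x epoching = deliberate overtraining

total_tokens = unique_tokens * epochs
print(f"Total anneal tokens: {total_tokens / 1e9:.1f}B")  # -> 2.0B
```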
Perplexity Evaluation Results:
Based on the perplexity evaluations, the model performs better on the held-out validation sets of the datasets it is training on. However, it does not improve much on data it is not trained on, such as the Pile's PubMed Abstracts and PubMed Central subsets.
1. One potential explanation is that the model is being steered toward the MCQA style and away from the web-text style of the PubMed abstracts / PubMed Central papers.
2. A second explanation is that the MCQA data is not actually that helpful for medical knowledge, since most of it is in QA format and geared toward benchmark formatting rather than toward learning. Note that the checkpoint we annealed was trained on Nemotron-CC, so it should already have seen some QA data.
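For context, the perplexity numbers above are standard held-out LM perplexities. A minimal sketch of the computation, assuming an HF-style causal LM checkpoint and plain-text validation documents; the model path and the dataset variables at the bottom are placeholders, not the actual eval pipeline used for this run:

```python
# Minimal held-out perplexity sketch (assumes an HF-style causal LM checkpoint).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/annealed-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH).eval()

@torch.no_grad()
def perplexity(texts, max_length=2048):
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
        out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].numel() - 1  # HF shifts labels, so n-1 positions are scored
        total_nll += out.loss.item() * n
        total_tokens += n
    return math.exp(total_nll / total_tokens)

# Compare in-distribution vs. held-out-from-training validation text, e.g.:
# print(perplexity(medmcqa_val_texts), perplexity(pile_pubmed_abstracts_texts))
```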
MMLU Subset Evaluation Results
We see large movement on these evaluations, which are used in the Medical OpenLLM Leaderboard:
- MMLU high school bio: 0.674 -> 0.816
- MMLU professional medicine: 0.55 -> 0.79
- MMLU college medicine: 0.54 -> 0.62
- MMLU college bio: 0.74 -> 0.77
- MMLU medical genetics: 0.59 -> 0.70
- MMLU anatomy: 0.50 -> 0.63
- Average: 0.599 -> 0.721
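The average above is just the unweighted macro average over the six subsets; a quick check:

```python
# Unweighted macro average over the six MMLU subsets listed above.
before = {
    "high_school_biology": 0.674,
    "professional_medicine": 0.55,
    "college_medicine": 0.54,
    "college_biology": 0.74,
    "medical_genetics": 0.59,
    "anatomy": 0.50,
}
after = {
    "high_school_biology": 0.816,
    "professional_medicine": 0.79,
    "college_medicine": 0.62,
    "college_biology": 0.77,
    "medical_genetics": 0.70,
    "anatomy": 0.63,
}
avg = lambda scores: sum(scores.values()) / len(scores)
print(f"{avg(before):.3f} -> {avg(after):.3f}")  # 0.599 -> 0.721
```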
Possible Explanations:
1. The MCQA data is genuinely helpful for learning medical knowledge and solving MMLU.
2. The more likely explanation is contamination. We need to check the n-gram overlap of the dataset with MMLU; it is quite plausible that MMLU reuses some questions from pubmedqa or medmcqa.
Looking more into (2), we see that the pubmed and medmcqa datasets do not have much n-gram overlap with MMLU, but lavita-allprocessed has very high overlap (0.2!!). So we will need to rerun annealing without lavita-allprocessed.
Some contamination scores (n-gram = 10):
1. medmcqa: 0.01 for mmlu
2. lavita-allprocessed: 0.195 for mmlu
3. lavita-pubmed: 0.00015 for mmlu
n-gram = 15:
1. medmcqa: 0.00025
2. lavita-allprocessed: 0.132 for mmlu
3. lavita-pubmed: the run failed for some reason, but the n-gram=10 number already looks fine
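For reference, a minimal version of what an overlap score like this can measure is sketched below. This is an illustration, not the actual Marin train_test_overlap pipeline, and the exact tokenization/normalization there may differ; one plausible definition is the fraction of unique training n-grams that also appear in the MMLU questions:

```python
# Illustrative n-gram overlap score, not the exact train_test_overlap implementation.
import re

def ngrams(text, n=10):
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_score(train_texts, eval_texts, n=10):
    eval_grams = set()
    for t in eval_texts:
        eval_grams |= ngrams(t, n)
    train_grams = set()
    for t in train_texts:
        train_grams |= ngrams(t, n)
    if not train_grams:
        return 0.0
    # Fraction of unique training n-grams that also occur in the eval set.
    return len(train_grams & eval_grams) / len(train_grams)

# overlap_score(lavita_allprocessed_texts, mmlu_questions, n=10)
# overlap_score(lavita_allprocessed_texts, mmlu_questions, n=15)
```

Note that the normalization direction matters: dividing by the eval-set n-gram count instead would give a different number, so treat this sketch as illustrating the mechanics only.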
Manually looking at [(1)](https://marin.community/data-browser/view/?paths=%5B%22gs%3A%2F%2Fmarin-us-east1%2Ftrain_test_overlap%2Fngrams_final%2Fmedicalqa_data_overlap_sharded_lavita_medmcqa-3b68b1%2Ftrain-00000-of-00001.jsonl.zst%2Fraw_ngrams%2Fraw_ngrams.jsonl%22%5D&offset=45), most of the overlap is in the question wording rather than the actual content (e.g. "Which of the following is the most appropriate next step"). So there may have been some inspiration from MMLU, but not direct overlap in content.
[(2)](https://marin.community/data-browser/view/?paths=%5B%22gs%3A%2F%2Fmarin-us-east1%2Ftrain_test_overlap%2Fngrams_final%2Fmedicalqa_data_overlap_sharded_lavita_allprocessed-58b923%2Ftrain-00000-of-00001-a77e2814210655f1.jsonl.zst%2Fraw_ngrams%2Fraw_ngrams.jsonl%22%5D&offset=45) is much more suspicious: it has direct n-gram overlap on content like "42 year old man comes to the physician because of malaise muscle..."
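The raw n-gram shards linked above are .jsonl.zst files; a quick way to eyeball them locally, assuming the shard has first been copied down from GCS (e.g. with gsutil). The record layout is an assumption here, so adjust the printed fields to whatever the overlap pipeline actually emits:

```python
# Stream the first few records of a raw_ngrams.jsonl.zst shard for manual inspection.
import io
import json
import zstandard as zstd

with open("raw_ngrams.jsonl.zst", "rb") as f:
    stream = io.TextIOWrapper(zstd.ZstdDecompressor().stream_reader(f), encoding="utf-8")
    for i, line in enumerate(stream):
        if i >= 50:
            break
        record = json.loads(line)  # record schema assumed; inspect keys first
        print(record)  # check whether shared n-grams are question boilerplate or real content
```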
After decontamination, the average score comes out to 0.603, with every category still doing well except biology, which crashes substantially for reasons that are unclear.
Perplexity Evaluations (chart panel; run set of 3)
MMLU Relevant Subsets eval scores (chart panel)