Coreference on OntoNotes Baseline
This outlines everything I'm trying in order to get the OntoNotes Coref (single-task) baseline running nice and smooth. It has been a perilous journey so far.
Problems
The chart below shows that while the Loss goes down, NONE of the Coref metrics seem to go anywhere. This is based on a new run with new code.
This run only includes the following two loss functions:
- Coref (not span pruner)
- NER
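For reference, here is a minimal sketch of what I mean by the Coref objective, assuming it's the usual marginal log-likelihood over gold antecedents from the E2E Coref line of work; the tensor names are placeholders, not my actual model code. The NER objective on top of it is just a label classification loss over spans.

```python
# Minimal sketch of the Coref objective (marginal log-likelihood of gold
# antecedents, as in end-to-end coref). Tensor names are placeholders.
import torch

def coref_loss(antecedent_scores: torch.Tensor,
               gold_antecedent_mask: torch.Tensor) -> torch.Tensor:
    """antecedent_scores:    (num_spans, num_antecedents + 1), incl. a dummy antecedent
    gold_antecedent_mask: same shape, float, 1.0 where the antecedent is gold
    (the dummy counts as gold when a span has no real antecedent)."""
    log_probs = torch.log_softmax(antecedent_scores, dim=-1)
    # Probability mass assigned to *any* gold antecedent, per span
    gold_log_probs = torch.logsumexp(log_probs + torch.log(gold_antecedent_mask), dim=-1)
    return -gold_log_probs.mean()
```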
[Chart panel: Run set (1 run)]
As you can see, all the metrics hover under 1 percent, and they never increase.
It could be something in metric calculation. It could just be that the model is wired incorrectly. It could be bad data preprocessing. Who knows.
Only Coref?
When we run only Coref, things get a bit more interesting. Slightly.
[Chart panel: Run set (2 runs)]
This model's metrics never increase either, but they are still better than the ones from the model trained with the NER objective. Still far away from the baseline.
Pruner Co-Training is crucial
The original E2E Coref paper does not do any pruner training, right? (TODO: check.) Regardless, when we add that objective, things get slightly clearer.
[Chart panel: Run set (3 runs)]
Compare the green run (with Pruner) to the other two. It's much better. It trains, a bit, too. So my code isn't completely broken, but it's far, far off from the actual baselines. Either way, we definitely need to train with the Pruner included.
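For context, "the pruner objective" here is an auxiliary mention-detection loss on the span scores. The sketch below shows the standard way to do that (binary cross-entropy against gold mentions); the tensor names are placeholders, not the actual model code.

```python
# Minimal sketch of the auxiliary pruner objective: push the mention score of
# spans that appear in any gold cluster towards 1, all other candidates towards 0.
import torch
import torch.nn.functional as F

def pruner_loss(span_scores: torch.Tensor,
                gold_mention_mask: torch.Tensor) -> torch.Tensor:
    """span_scores:       (num_candidate_spans,) unnormalised mention scores
    gold_mention_mask: (num_candidate_spans,) float, 1.0 if the span is a gold mention"""
    return F.binary_cross_entropy_with_logits(span_scores, gold_mention_mask)
```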
Two things are problematic here. Each of these runs takes about 10-20 hours before I can find out what's happening. So I need to reduce the set of examples I'm dealing with, set a baseline (with this setting), and try to improve over it.
Trimmed Dataset Baselines
Here, instead of 2700 instances, I'm clamping the dataset to the first 50.
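The clamp itself is nothing fancier than a slice over the training documents; a minimal sketch (with a placeholder signature, not the actual data code):

```python
# Minimal sketch of the dataset clamp: keep only the first 50 training documents.
def trim_dataset(docs: list, n: int = 50) -> list:
    """Clamp the ~2700 OntoNotes training instances down to the first n."""
    return docs[:n]
```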
[Chart panel: Run set (11 runs)]
Similar findings to begin with: Pruner is a must. We settled on a set of loss scales for both tasks, and are going to change parts of the code around for empirically guided debugging.
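For the record, the loss scales enter as a weighted sum over the per-task losses. A minimal sketch, with placeholder scale values rather than the ones we actually settled on:

```python
# Minimal sketch of the weighted multi-task loss. The scale values are
# placeholders, not the numbers used in these runs.
from typing import Dict
import torch

LOSS_SCALES = {"coref": 1.0, "pruner": 1.0, "ner": 1.0}  # hypothetical values

def scaled_joint_loss(losses: Dict[str, torch.Tensor]) -> torch.Tensor:
    """Weighted sum of per-task losses, e.g. {'coref': ..., 'pruner': ...}."""
    return sum(LOSS_SCALES[name] * loss for name, loss in losses.items())
```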
Empirically guided Debugging
[Chart panel: Run set (17 runs)]
So, as you can see, even with multiple runs (30+ if you count the deleted ones), we can't seem to get anywhere from here. I'm increasingly certain the code does what it intends to do, correctly.
Current Status and what comes next
- Hypothesis 1: code is correct, model isn't training.
- Pretrain a model with no Coref, only the Pruner, and then use this pretrained model to do Coref + Pruner (or even just Coref).
- Ask Joe if he did some training magic or something else to get this to work. TODO: on Monday in Lille.
- Hypothesis 2: model code is correct, evaluation code isn't.
- Train Joe's model and download its predictions. Run those predictions through your evaluation code and see if they're good. Status: did not manage to get access to another GPU over the weekend (22nd-24th April). Try again on Monday.
- Continue debugging, line by line :/
- Hypothesis 3: model code isn't correct either.
- Very hard to believe. Went through almost all of it 2-4 times alone, and now once with Gaurav. The model also seems to train something.
- TODO: understand how these metrics are calculated. Why is MUC much better than B-Cubed and CEAFe? (See the sketch below.)
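To make that TODO concrete, below is a from-scratch sketch of the link-based MUC score (Vilain et al., 1995) that I can diff my metric code against; it is not the evaluation code used in these runs. Because MUC only counts coreference links, it is relatively forgiving of a model that lumps or splits entities, whereas B-Cubed and CEAFe score every mention/entity, which may partly explain why MUC looks better.

```python
# Minimal sketch of the MUC metric (link-based). Clusters are sets of mention
# spans; this follows the standard definition, not the repo's evaluation code.
from typing import List, Set, Tuple

Mention = Tuple[int, int]   # (start, end) token offsets
Cluster = Set[Mention]

def _muc_half(key: List[Cluster], response: List[Cluster]) -> float:
    """MUC recall of `response` against `key`; swap the arguments for precision."""
    numerator, denominator = 0.0, 0.0
    for gold in key:
        # Partition the gold cluster by the response clusters; mentions the
        # response misses entirely each become their own singleton part.
        parts = 0
        remaining = set(gold)
        for resp in response:
            overlap = gold & resp
            if overlap:
                parts += 1
                remaining -= overlap
        parts += len(remaining)
        numerator += len(gold) - parts
        denominator += len(gold) - 1
    return numerator / denominator if denominator else 0.0

def muc(key: List[Cluster], response: List[Cluster]) -> Tuple[float, float, float]:
    recall = _muc_half(key, response)
    precision = _muc_half(response, key)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: one gold cluster split into two predicted clusters.
key = [{(0, 1), (5, 6), (9, 10)}]
response = [{(0, 1), (5, 6)}, {(9, 10)}]
print(muc(key, response))   # -> (1.0, 0.5, 0.666...)
```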