Coreference on OntoNotes Baseline
This outlines everything I'm trying in order to get the OntoNotes Coref (single-task) baseline running nice and smooth. It has been a perilous journey so far.
Problems
The chart below shows that while the Loss goes down, NONE of the Coref metrics seem to go anywhere. This is based on a new run with new code.
This run only includes the following two loss functions:
- Coref (not span pruner)
- NER
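For reference, here is a minimal sketch of what I mean by the Coref objective, assuming it's the usual marginal log-likelihood over gold antecedents from the E2E Coref line of work; the tensor names are placeholders, not my actual model code. The NER objective on top of it is just a label classification loss over spans.

```python
# Minimal sketch of the Coref objective (marginal log-likelihood of gold
# antecedents, as in end-to-end coref). Tensor names are placeholders.
import torch

def coref_loss(antecedent_scores: torch.Tensor,
               gold_antecedent_mask: torch.Tensor) -> torch.Tensor:
    """antecedent_scores:    (num_spans, num_antecedents + 1), incl. a dummy antecedent
    gold_antecedent_mask: same shape, float, 1.0 where the antecedent is gold
    (the dummy counts as gold when a span has no real antecedent)."""
    log_probs = torch.log_softmax(antecedent_scores, dim=-1)
    # Probability mass assigned to *any* gold antecedent, per span
    gold_log_probs = torch.logsumexp(log_probs + torch.log(gold_antecedent_mask), dim=-1)
    return -gold_log_probs.mean()
```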
[Chart panel: Run set (1 run)]
As you can see, all the metrics hover under 1 percent, and they never increase.
It could be something in metric calculation. It could just be that the model is wired incorrectly. It could be bad data preprocessing. Who knows.
Only Coref?
When we run only Coref, things get a bit more interesting. Slightly.
[Chart panel: Run set (2 runs)]
This model's metrics never increase either, but they are still better than the ones from the model trained with the NER objective. Still far away from the baseline.
Pruner Co-Training is crucial
The original E2E Coref paper does not do any pruner training, right? (TODO: check.) Regardless, when we add that objective, things get slightly clearer.
[Chart panel: Run set (3 runs)]
Compare the green run (with Pruner) to the other two. It's much better. It trains, a bit, too. So my code isn't completely broken, but it's far, far off from the actual baselines. Either way, we definitely need to train with the Pruner included.
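For context, "the pruner objective" here is an auxiliary mention-detection loss on the span scores. The sketch below shows the standard way to do that (binary cross-entropy against gold mentions); the tensor names are placeholders, not the actual model code.

```python
# Minimal sketch of the auxiliary pruner objective: push the mention score of
# spans that appear in any gold cluster towards 1, all other candidates towards 0.
import torch
import torch.nn.functional as F

def pruner_loss(span_scores: torch.Tensor,
                gold_mention_mask: torch.Tensor) -> torch.Tensor:
    """span_scores:       (num_candidate_spans,) unnormalised mention scores
    gold_mention_mask: (num_candidate_spans,) float, 1.0 if the span is a gold mention"""
    return F.binary_cross_entropy_with_logits(span_scores, gold_mention_mask)
```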
Two things are problematic here. Each of these runs takes about 10-20 hours before I can find out what's happening. So I need to reduce the set of examples I'm dealing with, set a baseline (with this setting), and try to improve over it.
Trimmed Dataset Baselines
Here, instead of 2700 instances, I'm clamping the dataset to the first 50.
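The clamp itself is nothing fancier than a slice over the training documents; a minimal sketch (with a placeholder signature, not the actual data code):

```python
# Minimal sketch of the dataset clamp: keep only the first 50 training documents.
def trim_dataset(docs: list, n: int = 50) -> list:
    """Clamp the ~2700 OntoNotes training instances down to the first n."""
    return docs[:n]
```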
[Chart panel: Run set (11 runs)]
Similar findings to begin with: Pruner is a must. We settled on a set of loss scales for both tasks, and are going to change parts of the code around for empirically guided debugging.
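For the record, the loss scales enter as a weighted sum over the per-task losses. A minimal sketch, with placeholder scale values rather than the ones we actually settled on:

```python
# Minimal sketch of the weighted multi-task loss. The scale values are
# placeholders, not the numbers used in these runs.
from typing import Dict
import torch

LOSS_SCALES = {"coref": 1.0, "pruner": 1.0, "ner": 1.0}  # hypothetical values

def scaled_joint_loss(losses: Dict[str, torch.Tensor]) -> torch.Tensor:
    """Weighted sum of per-task losses, e.g. {'coref': ..., 'pruner': ...}."""
    return sum(LOSS_SCALES[name] * loss for name, loss in losses.items())
```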
Empirically guided Debugging
[Chart panel: Run set (17 runs)]
So, as you can see, even with multiple runs (30+ if you count the deleted ones), we can't seem to get anywhere from here. I'm increasingly certain the code does what it intends to do, correctly.
Current Status and what comes next
- Hypothesis 1: code is correct, model isn't training.
- Pretrain a model with no Coref, only the Pruner, and then use this pretrained model to do Coref + Pruner (or even just Coref).
- Ask Joe if he did some training magic or something else to get this to work. TODO: on Monday in Lille.
- Hypothesis 2: model code is correct, evaluation code isn't.
- Train Joe's model and download its predictions. Run those predictions through your evaluation code and see if they're good. Status: did not manage to get access to another GPU over the weekend (22nd-24th April). Try again on Monday.
- Continue debugging, line by line :/
- Hypothesis 3: model code isn't correct either.
- Very hard to believe. Went through almost all of it 2-4 times alone, and now once with Gaurav. The model also seems to train something.
- TODO: understand how these metrics are calculated. Why is MUC much better than B-Cubed and CEAFe? (See the sketch below.)
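To make that TODO concrete, below is a from-scratch sketch of the link-based MUC score (Vilain et al., 1995) that I can diff my metric code against; it is not the evaluation code used in these runs. Because MUC only counts coreference links, it is relatively forgiving of a model that lumps or splits entities, whereas B-Cubed and CEAFe score every mention/entity, which may partly explain why MUC looks better.

```python
# Minimal sketch of the MUC metric (link-based). Clusters are sets of mention
# spans; this follows the standard definition, not the repo's evaluation code.
from typing import List, Set, Tuple

Mention = Tuple[int, int]   # (start, end) token offsets
Cluster = Set[Mention]

def _muc_half(key: List[Cluster], response: List[Cluster]) -> float:
    """MUC recall of `response` against `key`; swap the arguments for precision."""
    numerator, denominator = 0.0, 0.0
    for gold in key:
        # Partition the gold cluster by the response clusters; mentions the
        # response misses entirely each become their own singleton part.
        parts = 0
        remaining = set(gold)
        for resp in response:
            overlap = gold & resp
            if overlap:
                parts += 1
                remaining -= overlap
        parts += len(remaining)
        numerator += len(gold) - parts
        denominator += len(gold) - 1
    return numerator / denominator if denominator else 0.0

def muc(key: List[Cluster], response: List[Cluster]) -> Tuple[float, float, float]:
    recall = _muc_half(key, response)
    precision = _muc_half(response, key)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: one gold cluster split into two predicted clusters.
key = [{(0, 1), (5, 6), (9, 10)}]
response = [{(0, 1), (5, 6)}, {(9, 10)}]
print(muc(key, response))   # -> (1.0, 0.5, 0.666...)
```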