
In pursuit of a classifier

Trying to improve the original "total" model, add other fields, etc. for the Deepform project.
Created on June 2|Last edited on June 10

Classic model

We've gotten our "classic" total extraction model up to high-90s accuracy. This model outputs one value per token in the window, just a simple 0/1: is this token the total we are looking for? On the original 2012 training data set, it gets 97.9% doc_val_acc.

On the full 2012 training data it doesn't do quite as well, though, getting only 92.9%. This is some combination of human overfitting and noisier data, since the new data is OCR'd. This comparison uses the same amount of training data (8000).

Classic actually classifies each token 30 times if the window length is 30, and then the outputs for each window are overlapped and summed. We'd like to simplify and speed up the model by asking it to classify only one token per window. Also, we want to move towards a standard multi-class classifier, with one output per class, plus "none of the above." This is so we can label tokens from the different fields we wish to extract (contract number, advertiser, total, flight dates). Our first cut at a one-token model on the new data gives 90.4%, so about a 2.5% loss relative to classic's 92.9%.
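To make the overlap-and-sum step concrete, here is a minimal sketch. The function name, the `score_window` stand-in for the model, and the stride of 1 are all assumptions for illustration, not the project's actual code.

```python
def overlap_sum_scores(num_tokens, window_len, score_window):
    """Sum per-token window outputs over every window position (stride 1).

    score_window(start) is a hypothetical stand-in for the model: it returns
    window_len scores, one for each token in positions
    start .. start + window_len - 1.
    """
    totals = [0.0] * num_tokens
    for start in range(num_tokens - window_len + 1):
        window_scores = score_window(start)
        for offset, score in enumerate(window_scores):
            totals[start + offset] += score
    return totals


def fake_model(start, window_len=3, target=5):
    """Toy "model": outputs 1.0 for the token at index `target`, else 0.0."""
    return [1.0 if start + i == target else 0.0 for i in range(window_len)]


scores = overlap_sum_scores(num_tokens=10, window_len=3, score_window=fake_model)
# The document-level guess is the token with the highest summed score.
best_token = max(range(len(scores)), key=scores.__getitem__)
```

Under this scheme an interior token is scored once per window that covers it, so with per-window outputs in [0, 1] and a window length of 30, its summed score tops out at 30. A one-token model would skip the inner loop entirely and score each token once.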

Remember that in all these charts, doc_val_acc is sampled very noisily before the final epoch, so really only the rightmost dot means anything.




[Chart panel. Run sets: "1tok baseline" (1 run), "original classic" (1 run), "classic new 2012" (1 run)]


One Token

Can we make this 2.5% loss back? Can we squeeze our way back into the high 90s? Let's tweak the network a bit to find out. I wonder if it wants another layer -- right now it only has two.

Let's try three layers! This gets us to 92.1%, only 0.8% worse than classic -- though of course classic benefits here too.

I also did two other runs investigating smaller values of steps_per_epoch (our default is 50). It turns out 20 is bad, and you can see this in the loss graph for that run. But 40 steps got us 93.1%, which is above the classic model on this data (92.9%). This tells me there is a lot of variance in these results, and that we are within striking distance of parity.

Also, I noticed the embedding size had decreased to 16, so I tested it back at 32. No change.




[Chart panel. Run sets: "2 vs 3 layer" (2 runs), "fewer steps" (2 runs), "embed 32" (1 run)]


Back to basics

So there doesn't seem to be an easy way to get the single token model performing at the level of classic. Meanwhile we've lost 5% now that the rest of the 2012 training data is in. Let's see if we can figure out why, and try to get it back.

First off, another baseline run gives 92.7% (this is within noise of the 92.9% on the baseline run above).

Adding a third layer reduces this to 91.9%. So the one-token model got better with a third layer, but classic did not. I thought that this deeper network might be under-training, so I tried again with 100 epochs, but that did not help (got an even 92%).

So it seems like the 2-layer classic and the 3-layer one-token models are in the same ballpark, perhaps with classic slightly in the lead on the full 2012 set.




[Chart panel. Run set with 3 runs]


What are the errors?

First off, for reference, here are the 80 (out of 1000) errors on the new baseline (classic baseline 2012 above):

**Incorrect: 456674-wo-8-28-buy-13452084026210-_-pdf: guessed "22486" with score 17.53, , correct "00.001$230"
**Incorrect: 455982-hudson-wsoc-cancel-327300-10-01-12: guessed "00.0" with score 21.23, , correct "00.0"
**Incorrect: 463112-rand-order-64928-version-1-station-tract: guessed "$16,200.00" with score 23.59, , correct "0.00"
**Incorrect: 454727-waws-ns-10_23_12c62290-13488699142872-_-pdf: guessed "$5,020.00" with score 25.09, , correct "0.00"
**Incorrect: 454318-order-103575-req-13493859049315-_-pdf: guessed "845540" with score 25.10, , correct "$700.00"
**Incorrect: 489134-oct30-nov5-483101-13512639257069-_-pdf: guessed "$876.00" with score 18.71, , correct "130"
**Incorrect: 483261-patriot-maj-fund-215771-13510302080717-_-pdf: guessed "6540" with score 26.40, , correct "58,275.00"
**Incorrect: 432074-collect-files-48575-political-file-2012-federal: guessed "6837928" with score 26.20, , correct "132,115.00"
**Incorrect: 487442-restore-our-future-4079051-1-13511916075400-_-pdf: guessed "$147,250.00" with score 24.86, , correct "$117,250.00"
**Incorrect: 430677-collect-files-48589-political-file-2012-non: guessed "654184" with score 25.22, , correct "_$975.00"
**Incorrect: 478077-priorities-usa-action-e120810078-13505733156184: guessed "4740711" with score 26.18, , correct "59500"
**Incorrect: 474921-wwj-restore-our-future-r-ord53759-issue-order: guessed "2000000" with score 22.27, , correct "$44,800.00"
**Incorrect: 465331-mccrory-10-1-10-7-rev-3-13500681089105-_-pdf: guessed "334243" with score 23.08, , correct "9365"
**Incorrect: 457314-romney-for-president-wo-10-6-buy-13494768102464: guessed "00.0" with score 24.23, , correct "500."
**Incorrect: 480727-order-104463-req-13509330112645-_-pdf: guessed "845540" with score 24.65, , correct "$2310,200.00"
**Incorrect: 459676-romney-6367495-13498083073633-_-pdf: guessed "87880" with score 25.68, , correct "/12324225.0"
**Incorrect: 425976-collect-files-74195-political-file-2012-non: guessed "6306146" with score 23.91, , correct "$15,465"
**Incorrect: 433160-collect-files-73292-political-file-2012-non: guessed "$239,350.01" with score 26.03, , correct "$239,350.00"
**Incorrect: 424255-collect-files-56537-political-file-2012-non: guessed "$228,480.00" with score 24.88, , correct "$328,480.00"
**Incorrect: 417812-collect-files-72076-political-file-2012-non: guessed "914019" with score 23.79, , correct "$1,800.00"
**Incorrect: 432331-collect-files-73292-political-file-2012-non: guessed "$499,500.00" with score 25.15, , correct "$499,500.0"
**Incorrect: 487774-order-104398-req-13511877136076-_-pdf: guessed "2305325.00" with score 29.86, , correct "$675.00"
**Incorrect: 422245-collect-files-48556-political-file-2012-non: guessed "140012002100" with score 27.65, , correct "$31,650.00"
**Incorrect: 416653-collect-files-35313-political-file-2012-non: guessed "0.00" with score 23.03, , correct "200."
**Incorrect: 433160-collect-files-73292-political-file-2012-non: guessed "$239,350.01" with score 26.03, , correct "$239,350.00"
**Incorrect: 478084-priorities-usa-e-120810071-13458195397504-_-pdf: guessed "51137260" with score 25.69, , correct "42500"
**Incorrect: 429119-collect-files-74070-political-file-2012-non: guessed "06262103" with score 25.84, , correct "$52,550.00"
**Incorrect: 489134-oct30-nov5-483101-13512639257069-_-pdf: guessed "$876.00" with score 18.71, , correct "130"
**Incorrect: 433973-collect-files-11289-political-file-2012-federal: guessed "$1,030.00" with score 25.17, , correct "$1,100.00"
**Incorrect: 428491-collect-files-11289-political-file-2012-federal: guessed "9816423" with score 25.59, , correct "$19,120.00"
**Incorrect: 454941-obama-6332488-rv-1-1-13491975143049-_-pdf: guessed "87800" with score 18.39, , correct "475."
**Incorrect: 424260-collect-files-56537-political-file-2012-non: guessed "22001122" with score 25.65, , correct "$40,340.00"
**Incorrect: 435536-collect-files-11289-political-file-2012-federal: guessed "$71,915.00" with score 25.50, , correct "$350.00"
**Incorrect: 421171-collect-files-35870-political-file-2012-federal: guessed "383" with score 25.27, , correct "$1,710.00"
**Incorrect: 455008-nelson-6187882-rv-3-1-13493832131641-_-pdf: guessed "87880" with score 26.23, , correct "4925."
**Incorrect: 506247-go-bonds-for-higher-ed-217293-13521355822331-_-pdf: guessed "59,090.00" with score 23.66, , correct "$9,090.00"
**Incorrect: 472985-skmbt_60112101717200-13505130108993-_-pdf: guessed "$7,400.00" with score 24.90, , correct "0.00"
**Incorrect: 474702-obama-6269465-rv-1-13506762264676-_-pdf: guessed "7880" with score 27.34, , correct "2850"
**Incorrect: 438870-collect-files-65680-political-file-2012-non: guessed "2012" with score 16.46, , correct "700"
**Incorrect: 468947-american-crossroads-10-15-12-13503132107766-_-pdf: guessed "$118,750.00" with score 25.75, , correct "$118,750.0"
**Incorrect: 489728-skmbt_60112102614161-13512801314828-_-pdf: guessed "801200083" with score 26.10, , correct "0.00"
**Incorrect: 458918-kmyu-matheson-10-22-thru-10-28-13497192059278-_: guessed "73611" with score 22.63, , correct "25.00"
**Incorrect: 496933-american-crossroads-6228191-rv-4-1: guessed "87880" with score 24.02, , correct "147550.0"
**Incorrect: 461393-nfib-3371757-100912-13498062134387-_-pdf: guessed "175,663" with score 25.65, , correct "30,300.00"
**Incorrect: 416653-collect-files-35313-political-file-2012-non: guessed "0.00" with score 23.03, , correct "200."
**Incorrect: 458413-add-10-6-to-10-9-198026-13497144180961-_-pdf: guessed "$3,000.00" with score 23.02, , correct "$700.00"
**Incorrect: 463112-rand-order-64928-version-1-station-tract: guessed "$16,200.00" with score 23.59, , correct "0.00"
**Incorrect: 492615-pestka-oct-30-nov-5-buy-13515444082673-_-pdf: guessed "75795006" with score 27.67, , correct "00.0$"
**Incorrect: 442220-collect-files-53113-political-file-2012-federal: guessed "200300" with score 26.42, , correct "0.00"
**Incorrect: 439956-collect-files-414-political-file-2012-non: guessed "202" with score 20.61, , correct "50.00"
**Incorrect: 469226-amer-crossroads-order-45126-version-3: guessed "$58,800.00" with score 25.89, , correct "$350.00"
**Incorrect: 478123-obama-10-17-13505085343297-_-pdf: guessed "$59,160.00" with score 24.98, , correct "$59,100.00"
**Incorrect: 444012-collect-files-61251-political-file-2012-federal: guessed "003290" with score 19.85, , correct "00.0$"
**Incorrect: 469226-amer-crossroads-order-45126-version-3: guessed "$58,800.00" with score 25.89, , correct "$350.00"
**Incorrect: 422472-collect-files-48589-political-file-2012-non: guessed "654184" with score 24.69, , correct "50.0"
**Incorrect: 493926-k5-michele-bachmann-ord-169528-13516353053899-_: guessed "$42,000.00" with score 25.17, , correct "0.00"
**Incorrect: 417807-collect-files-72076-political-file-2012-non: guessed "855" with score 22.79, , correct "5292"
**Incorrect: 489026-obama-6398290-13512690203193-_-pdf: guessed "7800" with score 23.14, , correct "$2,000.00"
**Incorrect: 455015-nelson-6190725-rv-2-1-13487718116224-_-pdf: guessed "7035287800" with score 25.14, , correct "2930."
**Incorrect: 489728-skmbt_60112102614161-13512801314828-_-pdf: guessed "801200083" with score 26.10, , correct "0.00"
**Incorrect: 421171-collect-files-35870-political-file-2012-federal: guessed "383" with score 25.27, , correct "$1,710.00"
**Incorrect: 463112-rand-order-64928-version-1-station-tract: guessed "$16,200.00" with score 23.59, , correct "0.00"
**Incorrect: 497989-roberts-wsoc-339543-11-1-12-13518042206736-_-pdf: guessed "8221046" with score 28.58, , correct "00.0"
**Incorrect: 506247-go-bonds-for-higher-ed-217293-13521355822331-_-pdf: guessed "59,090.00" with score 23.66, , correct "$9,090.00"
**Incorrect: 438982-collect-files-414-political-file-2012-federal: guessed "8407979" with score 22.41, , correct ",49$3,670.00"
**Incorrect: 478123-obama-10-17-13505085343297-_-pdf: guessed "$59,160.00" with score 24.98, , correct "$59,100.00"
**Incorrect: 419923-collect-files-32311-political-file-2012-non: guessed "045127973860" with score 24.58, , correct "$1,600.00"
**Incorrect: 455011-nelson-6190719-rv-3-1-13486815059357-_-pdf: guessed "70352878" with score 26.91, , correct "$2,400.00"
**Incorrect: 416557-collect-files-11289-political-file-2012-federal: guessed "$197,900.00" with score 23.45, , correct "$16,500.00"
**Incorrect: 433973-collect-files-11289-political-file-2012-federal: guessed "$1,030.00" with score 25.17, , correct "$1,100.00"
**Incorrect: 439956-collect-files-414-political-file-2012-non: guessed "202" with score 20.61, , correct "50.00"
**Incorrect: 425808-collect-files-11291-political-file-2012-state: guessed "$19,828.00" with score 26.02, , correct "19825"
**Incorrect: 421224-collect-files-71293-political-file-2012-federal: guessed "$79,225.00" with score 24.51, , correct "0.00"
**Incorrect: 443979-collect-files-35823-political-file-2012-federal: guessed "103" with score 24.45, , correct "7$12,430.00"
**Incorrect: 455985-mccrory-wsoc-320396r-09-18-12-13481601078630-_-pdf: guessed "06221191" with score 28.28, , correct "200"
**Incorrect: 457281-order-107382-req-13493985054332-_-pdf: guessed "3305700.00" with score 27.40, , correct "$675.00"
**Incorrect: 434121-collect-files-25456-political-file-2012-federal: guessed "236635.00" with score 26.41, , correct "234835.00"
**Incorrect: 430765-collect-files-32311-political-file-2012-non: guessed "654184" with score 26.36, , correct "165"
**Incorrect: 487153-wcco-thomas-peterffy-r-issue-ord54063-fednatl: guessed "81800.00" with score 24.70, , correct "123800.00"
**Incorrect: 416469-collect-files-39736-political-file-2012-non: guessed "$8,710.00" with score 25.48, , correct "0.00"
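One way to dig into a log like this is to parse it and bucket the mistakes. The sketch below is only an illustration: the regular expression is inferred from the lines above, and `looks_like_dollars` is a crude heuristic I made up for this example, not a feature from the model.

```python
import re
from collections import Counter

# Line format inferred from the log above (an assumption about the logger).
LINE_RE = re.compile(
    r'^\*\*Incorrect: (?P<doc>.+?): guessed "(?P<guess>.*?)" '
    r'with score (?P<score>[\d.]+), , correct "(?P<correct>.*?)"$'
)


def parse_errors(lines):
    """Turn raw log lines into dicts; lines that don't match are skipped."""
    rows = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            rows.append({
                "doc": m["doc"],
                "guess": m["guess"],
                "score": float(m["score"]),
                "correct": m["correct"],
            })
    return rows


def looks_like_dollars(text):
    """Hypothetical heuristic: a leading '$' or a plain two-decimal amount."""
    return text.startswith("$") or re.fullmatch(r"[\d,]+\.\d{2}", text) is not None


sample = [
    '**Incorrect: 454318-order-103575-req-13493859049315-_-pdf: guessed "845540" with score 25.10, , correct "$700.00"',
    '**Incorrect: 433160-collect-files-73292-political-file-2012-non: guessed "$239,350.01" with score 26.03, , correct "$239,350.00"',
    '**Incorrect: 433160-collect-files-73292-political-file-2012-non: guessed "$239,350.01" with score 26.03, , correct "$239,350.00"',
]

rows = parse_errors(sample)
# Documents that show up more than once in the error list.
doc_counts = Counter(r["doc"] for r in rows)
repeated_docs = [doc for doc, n in doc_counts.items() if n > 1]
# Guesses that are not even dollar-shaped (IDs, phone-like numbers, etc.).
non_dollar_guesses = [r for r in rows if not looks_like_dollars(r["guess"])]
```

Running this over the full list would separate near misses (a dollar amount off by a digit) from guesses that are not dollar amounts at all, and would show which documents appear more than once.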





Permutation

Examining the output, especially on 2020 data, suggests the classifier has not really learned much about identifying totals -- it's still picking words and not numbers far too often, even though there is an explicit "is this a dollar value?" feature. Let's try making the problem harder by permuting the tokens, in an attempt to force the classifier to learn geometric and other structures.
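Here is a minimal sketch of what permuting tokens could look like, assuming each token carries its own features (the `x`/`y` page coordinates below are hypothetical) and that labels are shuffled in lockstep. The function and field names are illustrative, not taken from the project.

```python
import random


def permute_tokens(tokens, labels, rng):
    """Shuffle tokens and labels with the same random order.

    Each token is assumed to carry its own features (e.g. page position),
    so after shuffling the model can no longer lean on reading order.
    """
    order = list(range(len(tokens)))
    rng.shuffle(order)
    return [tokens[i] for i in order], [labels[i] for i in order]


# Hypothetical token records: text plus x/y page coordinates.
tokens = [
    {"text": "Gross", "x": 0.10, "y": 0.90},
    {"text": "Total", "x": 0.20, "y": 0.90},
    {"text": "$700.00", "x": 0.80, "y": 0.90},
    {"text": "Advertiser", "x": 0.10, "y": 0.10},
]
labels = [0, 0, 1, 0]  # 1 marks the total

shuffled_tokens, shuffled_labels = permute_tokens(tokens, labels, random.Random(0))
```

The point of shuffling this way is that the label still travels with its token, but neighbouring tokens in the sequence are no longer neighbours on the page, so any useful context has to come from the per-token features.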

This led to a large loss in performance.

Adding another layer helped a bit. Adding explicit features for matching the token against "total", "amount", "gross", "net", and "contract" got a little of this back, but not that much.
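As a sketch of these hint features, the snippet below emits one 0/1 value per hint word. Case-insensitive exact matching after stripping punctuation is my assumption here; the matching could just as well be fuzzy or substring-based.

```python
HINT_WORDS = ("total", "amount", "gross", "net", "contract")


def hint_features(token_text):
    """Return one 0/1 feature per hint word, in HINT_WORDS order.

    Exact case-insensitive matching after stripping surrounding punctuation
    is an assumption made for this example.
    """
    cleaned = token_text.strip(" .:;,()").lower()
    return [1.0 if cleaned == word else 0.0 for word in HINT_WORDS]


features = {tok: hint_features(tok) for tok in ["Total:", "NET", "$700.00"]}
```

These values would be appended to each token's existing feature vector, alongside the "is this a dollar value?" feature.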




[Chart panel. Run set: "permute" (4 runs)]


Hints and layers

So permutation just trashes us on 2012 (though I'm not sure yet what the effect is on 2020). But let's see what the effect of using three layers and these new "hint" features is when tokens are not permuted.

Looks like we go from 92.7% to 93.2% when adding another layer and these hint features, which is probably not a significant difference.
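A quick back-of-the-envelope check on "not significant": the sketch below runs a two-proportion z-test on the two accuracies. It assumes each accuracy was measured on 1000 validation documents (borrowed from the "out of 1000" error count earlier) and treats the two runs as independent samples, which is only a rough approximation if they share validation documents.

```python
import math


def two_proportion_z(acc_a, acc_b, n_a, n_b):
    """z statistic for the difference between two accuracies (pooled SE)."""
    pooled = (acc_a * n_a + acc_b * n_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (acc_b - acc_a) / se


n_docs = 1000  # assumption: validation documents per run
z = two_proportion_z(0.927, 0.932, n_docs, n_docs)
# |z| below 1.96 means the gap is not significant at the usual 5% level.
is_significant = abs(z) >= 1.96
```

With these numbers z comes out well under 1.96, so a 0.5-point gap on about 1000 documents is comfortably inside the noise.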




[Chart panel. Run set with 3 runs]