246: Web Extraction Method Comparison

Created on November 1 | Last edited on December 17
Qualitative Analysis:
Trafilatura (favor_precision):
  • Handles tables oddly
    • Adds stray leading or trailing | characters around the content
  • Cuts down noise pretty well, but can be overly aggressive at times
  • The title is sometimes truncated; this happens mostly with Trafilatura and is probably not a big deal.
    • If the title has the form `a-b`, it extracts only `a` and skips `b`.
      • Happens fairly regularly, though not on most pages.
  • The title is often duplicated in the body text too
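The two title issues noted above could be flagged automatically with a small check like the one below. This is only a sketch: the function names, the separator set, and the "first few lines" heuristic are assumptions made for illustration, not part of Trafilatura or of the evaluation described here.

```python
import re


def title_looks_truncated(extracted_title: str, html_title: str) -> bool:
    """Return True if the extracted title is only the `a` part of an `a-b` title.

    Assumption: the separator is a hyphen, pipe, or en dash, with optional
    whitespace on either side.
    """
    parts = re.split(r"\s*[-|–]\s*", html_title.strip(), maxsplit=1)
    return len(parts) == 2 and parts[0] != "" and extracted_title.strip() == parts[0]


def title_duplicated_in_text(title: str, text: str, head_lines: int = 3) -> bool:
    """Return True if the title reappears verbatim in the first few non-empty lines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return title.strip() in lines[:head_lines]
```

Both helpers work on plain strings, so they can be run over the output of any of the extractors compared here.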
Trafilatura (favor_recall):
  • Similar to Resiliparse in terms of content
  • Has less noise, but can still contain obvious boilerplate
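For reference, the two Trafilatura modes compared above correspond to the `favor_precision` and `favor_recall` keyword arguments of `trafilatura.extract`. The sketch below shows one way to switch between them; the wrapper name and the mode table are hypothetical, and any other settings used in the actual runs are not recorded in these notes.

```python
# Keyword arguments for the two Trafilatura modes compared in these notes.
TRAFILATURA_MODES = {
    "favor_precision": {"favor_precision": True},
    "favor_recall": {"favor_recall": True},
}


def extract_with_trafilatura(html: str, mode: str):
    """Run trafilatura.extract in the requested mode (hypothetical wrapper).

    Returns the extracted text, or None if Trafilatura finds no main content.
    """
    if mode not in TRAFILATURA_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    # Third-party dependency, imported lazily so the mode table can be
    # inspected without Trafilatura installed.
    import trafilatura

    return trafilatura.extract(html, **TRAFILATURA_MODES[mode])
```

Running both modes on the same page and diffing the results is a quick way to see what each setting keeps or drops.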
Resiliparse with Formatting:
  • Fails to remove noise that is wrapped in a content tag
    • Trusts content tags blindly
  • Does discard noise that is wrapped in a non-main-content tag such as <aside>
  • Handles tables poorly.
    • Skips them or lumps all the text together.
Resiliparse w/o Formatting:
  • Overall output lacks formatting
  • Otherwise the same as above
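The two Resiliparse variants above presumably differ only in the `preserve_formatting` argument of `extract_plain_text`. The sketch below assumes that, and also assumes `main_content=True` was used, since the notes describe non-main-content tags being dropped; neither setting is confirmed by the notes. The helper names are hypothetical.

```python
def resiliparse_kwargs(with_formatting: bool) -> dict:
    """Build keyword arguments for the two Resiliparse variants (assumed settings)."""
    return {
        "main_content": True,  # assumption: main-content extraction was enabled
        "preserve_formatting": with_formatting,
    }


def extract_with_resiliparse(html: str, with_formatting: bool) -> str:
    """Extract plain text with Resiliparse, with or without formatting."""
    # Third-party dependency, imported lazily.
    from resiliparse.extract.html2text import extract_plain_text

    return extract_plain_text(html, **resiliparse_kwargs(with_formatting))
```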
Readability:
  • Formatting is good; the Markdown output preserves paragraph boundaries
    • Links are rendered poorly, with an empty pair of quotes left at the end of the URL, wrapping nothing
      • [*Prevenge*](http://www.prevengemovie.co.uk/ "")
  • Table extraction is pretty good.
  • Can be really aggressive at times and skip obvious content that should have been extracted.
  • Possibly trusts content tags blindly too
  • Overall, hit or miss.
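The empty link titles in the Readability output could be cleaned up in a post-processing pass. The following is a minimal sketch using only the standard library; the function name is hypothetical, and it assumes URLs contain no spaces or parentheses.

```python
import re

# Matches the tail of a Markdown link whose title is an empty pair of quotes,
# e.g. `](http://example.com/ "")`. Assumes the URL has no spaces or parentheses.
_EMPTY_LINK_TITLE = re.compile(r'\]\(([^()\s]+)\s+""\)')


def drop_empty_link_titles(markdown: str) -> str:
    """Remove empty "" titles from Markdown links, leaving other links untouched."""
    return _EMPTY_LINK_TITLE.sub(r"](\1)", markdown)
```

Applied to the example above, this turns `[*Prevenge*](http://www.prevengemovie.co.uk/ "")` into `[*Prevenge*](http://www.prevengemovie.co.uk/)`, while links with a real title are left as they are.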


Quantitative Analysis:

Run set: 6

LM Eval Harness Results: