246: Web Extraction Method Comparison
Created on November 1|Last edited on December 17
Qualitative Analysis:
Trafilatura(favor_precision):
- Weird handling of tables
- Adds stray leading or trailing | characters around the content
- Cuts down noise pretty well but can be fairly aggressive at times
- Title is sometimes truncated; this happens mostly with Trafilatura. It's probably not a big deal.
- If the title is of the form `a-b`, it skips the text b and extracts only a.
- This happens regularly, though not on most pages.
- Title is often duplicated in the text too
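The stray `|` characters and the duplicated title noted above could be cleaned up after extraction. The following is a minimal sketch using only the standard library; the helper names `strip_stray_pipes` and `drop_duplicate_title` are hypothetical, and the heuristics (leave a line alone if it still contains an interior `|`, compare the title only against the first line) are assumptions rather than anything Trafilatura provides.

```python
def strip_stray_pipes(line: str) -> str:
    """Remove leading/trailing '|' characters from a single line of text.

    Lines that still contain an interior '|' are returned unchanged,
    on the assumption that they are genuine table rows.
    """
    core = line.strip(" \t|")
    if core == line.strip() or "|" in core:
        return line  # no stray pipes at the edges, or probably a table row
    return core


def drop_duplicate_title(title: str, text: str) -> str:
    """Drop the first line of `text` if it merely repeats `title`."""
    lines = text.splitlines()
    if lines and lines[0].strip().casefold() == title.strip().casefold():
        lines = lines[1:]
    return "\n".join(lines)


if __name__ == "__main__":
    print(strip_stray_pipes("| Some extracted sentence |"))
    print(drop_duplicate_title("My Title", "My Title\nBody text"))
```

Whether this kind of cleanup is worthwhile depends on how often the artifacts appear in the corpus; it does not address the truncated-title case.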
Trafilatura(favor_recall):
- Similar to Resiliparse in terms of content
- Has less noise, but can still contain obvious boilerplate
Resiliparse with Formatting:
- Fails to remove any noise that is wrapped in a content tag
- Blindly trusts content tags
- It does discard noise that is wrapped in a non-main-content tag such as <aside>
- Handles tables poorly.
- Skips them or clubs all the text together.
Resiliparse w/o Formatting:
- Overall output lacks formatting
- Otherwise the same as Resiliparse with Formatting
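For reference, the four Trafilatura and Resiliparse configurations discussed above could be invoked roughly as follows. This is a sketch, not the exact setup used for this comparison: it assumes the `trafilatura` and `resiliparse` packages are installed, and `main_content=True` for Resiliparse is an assumption inferred from the note about `<aside>` being discarded. The `EXTRACTORS` dictionary and `run_all` helper are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, Optional


def trafilatura_precision(html: str) -> Optional[str]:
    import trafilatura  # third-party, imported lazily
    return trafilatura.extract(html, favor_precision=True)


def trafilatura_recall(html: str) -> Optional[str]:
    import trafilatura  # third-party, imported lazily
    return trafilatura.extract(html, favor_recall=True)


def resiliparse_formatted(html: str) -> str:
    from resiliparse.extract.html2text import extract_plain_text
    # main_content=True is an assumption about the configuration used here.
    return extract_plain_text(html, preserve_formatting=True, main_content=True)


def resiliparse_plain(html: str) -> str:
    from resiliparse.extract.html2text import extract_plain_text
    # Same as above, but without preserving block-level formatting.
    return extract_plain_text(html, preserve_formatting=False, main_content=True)


# Hypothetical registry mapping a label to each extraction function.
EXTRACTORS: Dict[str, Callable[[str], Optional[str]]] = {
    "trafilatura_favor_precision": trafilatura_precision,
    "trafilatura_favor_recall": trafilatura_recall,
    "resiliparse_with_formatting": resiliparse_formatted,
    "resiliparse_without_formatting": resiliparse_plain,
}


def run_all(html: str) -> Dict[str, Optional[str]]:
    """Run every registered extractor on the same HTML string."""
    return {name: extract(html) for name, extract in EXTRACTORS.items()}
```

Running all extractors on the same page and reading the outputs side by side is one way to reproduce the kind of qualitative comparison recorded in these notes.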
Readability:
- Formatting is good, markdown preserves paragraph boundaries
- Links are represented poorly: a pair of empty quotes is left dangling after the URL
- `[*Prevenge*](http://www.prevengemovie.co.uk/ "")`
- Table extraction is pretty good.
- Can be really aggressive sometimes and skip over obvious content that should’ve been extracted.
- Possibly blind trust in content tags too
- Mostly hit or miss.
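The empty link title shown in the Readability notes (the `""` after the URL) could be removed with a small post-processing step. The sketch below is an assumption about how one might do this with a regular expression; the function name `drop_empty_link_titles` is hypothetical, and the pattern only handles URLs without spaces or parentheses.

```python
import re

# Matches "](url" followed by whitespace and an empty quoted title, then ")".
_EMPTY_TITLE = re.compile(r'(\]\([^()\s]+)\s+""\)')


def drop_empty_link_titles(markdown: str) -> str:
    """Rewrite [text](url "") as [text](url); other links are left as-is."""
    return _EMPTY_TITLE.sub(r"\1)", markdown)


if __name__ == "__main__":
    sample = '[*Prevenge*](http://www.prevengemovie.co.uk/ "")'
    print(drop_empty_link_titles(sample))
```

Links that carry a real title, such as `[a](http://x "t")`, do not match the pattern and pass through unchanged.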
Quantitative Analysis:
[Run set panel: 6 runs]
LM Eval Harness Results:
