WebFormer: The Next Logical Step In Structure Information Extraction
The WebFormer paper from Facebook outlines a method of web-based structure information extraction that delivers significant improvements over the current state of the art.
I don't generally read papers from Facebook (I'm an SEO after all: Google is my general go-to), but when Ayush Thakur sent me a link to WebFormer: The Web-page Transformer for Structure Information Extraction, I couldn't help myself. And after reading it, I absolutely understand why he thought I would be interested.
WebFormer is a lot more than just an interesting read though: it outlines a method of structure information extraction that is nothing less than the next logical step in the evolution of that task. We'll talk about what structure information extraction is in just a second, but before we dive in, I wanted to give you a look at what we'll be covering today:
Sections
- What Is Structure Information Extraction?
- What Is WebFormer?
- Why?
- Related Techniques
- How Does WebFormer Work?
- The Experiments
- The Results
- Zero-Shot and Few-Shot
- Summary
Now then: let's get going.
What Is Structure Information Extraction?
Structure Information Extraction is the task of extracting text from a web page and connecting it with the value and data type it represents. The goal is to reach the same (or greater) understanding of a web page and its contents without schema markup as would be possible with it.
For example? Many web pages contain product information including name, description, price and more.

Who doesn't want a Baby Yoda Chia Pet, right? Right?
Structure Information Extraction systems, when successful, can pull this information from a page, creating triples (basically, two data points connected by a relationship).
In the example above we would get the series of triples:
product_name = The Child in Mando's Satchel Chia Pet (Star Wars - 'The Mandalorian')
product_brand = Chia Pet
product_description = EVERYTHING YOU NEED IN ONE: Includes a unique pottery planter, convenient plastic drip tray and chia seed packets for 3 plantings. The Pet Collection: The variety is endless! Chia Pets come in all different shapes and sizes including your favorite presidents, actors, emojis, and even movie characters. Watch your collection keep growing!
product_price = $19.97
...which might create a Knowledge Graph that looks like:

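To make that a bit more concrete, here's a minimal sketch of how extracted triples like these might be held in code. The field names and values are taken from the Chia Pet example above; this is illustrative, not part of the WebFormer paper itself:
# Minimal sketch: holding extracted (subject, relation, value) triples.
# Field names and values come from the Chia Pet example above; this is
# illustrative and not part of the WebFormer paper itself.
from dataclasses import dataclass

@dataclass
class Triple:
    subject: str   # the entity the page describes
    relation: str  # the structured field, e.g. product_name
    value: str     # the extracted text span

page_triples = [
    Triple("product", "product_name",
           "The Child in Mando's Satchel Chia Pet (Star Wars - 'The Mandalorian')"),
    Triple("product", "product_brand", "Chia Pet"),
    Triple("product", "product_price", "$19.97"),
]

for t in page_triples:
    print(f"{t.relation} = {t.value}")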
In my mind, the role of Structure Information Extraction is (and now comes the time when I should admit this is perhaps just from watching too much Star Trek):

Knowing what we're trying to accomplish, it's now time to answer the question:
What Is WebFormer?
WebFormer is a Transformer-based machine learning model for web pages. It makes use of HTML elements to better understand the context of text sequences on the page, which lets it extract the required data even though general web structures and layouts change frequently.
We'll cover some of the core mechanics of how this is accomplished below. But at the root of things, while traditional systems rely only on the text on a page, WebFormer makes use of everything in the red boxes (as well as the text, of course):

Basically, it makes use of everything on the page, including how the page itself is built. It goes beyond just the content.
Why?
One question you may be asking is, "Why?"
Most of the current state-of-the-art extraction models rely on content and not code.
And while some solid work has been done moving from "wrappers" (historically hand-coded templates for a model to use as a reference to find where specific information should be) to wrapper induction (first described in the 1997 paper Wrapper Induction for Information Extraction as "a technique for automatically constructing wrappers"), induction is resource intensive and thus very expensive. Not what you want in an ever-changing area.
Related Techniques
A couple techniques that you will see inspire parts of WebFormer:
1️⃣ Other than the above noted wrapper and wrapper induction methods, there's also the method proposed in Deep Neural Networks for Web Page Information Extraction, which relies on schema markup to train the system to identify structured textual information.
Essentially, when the system encounters schema it is handed the key to understanding which textual elements on a page represent a value within the markup. For example, the schema for a product will clearly identify the product name, brand, price, description and likely more.
With this information, the system can then use the placement of this information on a webpage to teach itself how to identify and create new wrapper templates.
SEO Aside: I view this as likely already implemented, and akin to Google Author Tags: useful until the system can confidently identify the content it's looking for without them.
2️⃣ Additionally, we've also seen people try sequence modeling. These models break down the text of a page into sequences and then extract the structured elements. These models get solid results, but current versions also have serious limitations.
One such limitation is the lack of use of the HTML itself, which is the backbone of WebFormer. The HTML can provide context even without rendering. We can get a sense of why this matters from a figure in the WebFormer paper:

An example of an event web page with its HTML (bottom left) and the extracted structured event information (bottom right), including event title, description, date and time, and location. The corresponding extractions of all text fields are highlighted with colored bounding boxes.
Looking at the code surrounding the data we're trying to extract, it's quite easy to see how it has value.
The second major problem is in the scaling of fields. These types of models don't scale well in the number of fields they can look for across domains. A separate model needs to be built for each text field---hardly suitable for large-scale tasks.
And lastly, they do not handle large documents well. Because Transformer-based models generally tap out at 512 tokens, current methods generally do not work for large web documents.
So how does WebFormer get around these problems?
How Does WebFormer Work?
In this section we're going to dive into the nuts and bolts, looking at the technical workings of the model.
For the rest?

General Outline Of How WebFormer Works
At a high level, rather than just using one type of token, WebFormer uses three. This allows the model to use the encoded field, the HTML, and the content as opposed to just the content.
The authors of the paper summarize the main contributions as follows:
• We propose a novel WebFormer model for structure information extraction from web documents, which effectively integrates the web HTML layout via graph attention.
• We introduce a rich attention mechanism for embedding representation among different types of tokens, which enables the model to encode long sequences efficiently. It also empowers the model for zero-shot extractions on new domains.
• We conduct extensive experiments and demonstrate the effectiveness of the proposed approach over several state-of-the-art baselines.
But how?
The paper includes the following figure:

The WebFormer model architecture
The model architecture basically consists of the input layer, the encoder and the output layer.
Let's look at each:
The WebFormer Input Layer
Rather than just encoding the text sequence, this model also encodes the HTML layout.
The full model requires three token types:
- Text tokens - a set of tokens assigned to the textual content. These are the typical tokens for these types of tasks.
- HTML tokens - a set of tokens that are assigned to the HTML layout elements.
- Field tokens - a set of tokens assigned to the type of text field to extract (e.g., product name, price, and so on).
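A rough sketch of how those three token types might be laid out before they reach the encoder follows. The tags, fields, and grouping here are illustrative assumptions, not the paper's exact preprocessing pipeline:
# Rough sketch of assembling the three token types WebFormer consumes.
# The tags, fields, and grouping are illustrative assumptions, not the
# paper's exact preprocessing pipeline.

fields = ["product_name", "product_price"]                     # field tokens: what to extract
html_layout = ["<html>", "<body>", "<div>", "<h1>", "<span>"]  # HTML tokens: layout elements
text_sequences = {
    "<h1>": ["Kenmore", "White", "17\"", "Microwave"],
    "<span>": ["$", "55.00"],
}                                                              # text tokens, grouped by HTML node

# The encoder input is the concatenation of the three groups; the attention
# patterns described in the next section decide which groups can see each other.
input_tokens = (
    [("FIELD", f) for f in fields]
    + [("HTML", h) for h in html_layout]
    + [("TEXT", w) for words in text_sequences.values() for w in words]
)
print(input_tokens[:6])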
The WebFormer Encoder
The WebFormer encoder is "a stack of 𝐿 identical contextual layers, which efficiently connects the field, HTML and text tokens with rich attention patterns followed by a feed-forward network."
The Attention Layers
In their attention layers, the authors created four different attention patterns:
- HTML-to-HTML: For keeping track of the relationships between HTML tokens, which are naturally connected in the DOM tree graph. Using graph attention, the model gains an understanding of how various HTML elements contribute to understanding where the sought text lies.
- HTML-to-Text: This layer is responsible for mapping which text nodes are impacted by which HTML nodes. Only those text nodes within the specific HTML node are considered.
- Text-to-HTML: This layer "allows the text token to absorb the high-level representation from these summarization tokens of the web document." Unlike the HTML-to-Text layer, all HTML nodes are at play, as they relate to the specific text node in question.
- Text-to-Text: This is the layer those familiar with Transformer models are used to, but with a twist. Where traditional models' computational costs spike as sequence lengths grow, WebFormer uses relative position encodings and has each text token attend only to the tokens within the same text sequence and within a local radius r (a toy sketch follows this list). If r is 1, for example, a token attends only to its immediate neighbors on either side. As the authors put it in relation to the figure above, "For instance, the text token “is” in 𝑡2 attends to the tokens “This”, “is” and “a” within 𝑡2."
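Here's that toy sketch of the Text-to-Text local attention pattern, just to make the radius-r idea concrete. The mask construction is my own illustration, not code from the paper:
import numpy as np

# Toy sketch of the Text-to-Text local attention described above: each text
# token attends only to tokens within the same text sequence and within a
# local radius r of its own position. Illustration only, not the paper's code.
def local_attention_mask(seq_len: int, r: int) -> np.ndarray:
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo, hi = max(0, i - r), min(seq_len, i + r + 1)
        mask[i, lo:hi] = True
    return mask

# With r = 1, token i can attend only to tokens i-1, i, and i+1.
print(local_attention_mask(seq_len=5, r=1).astype(int))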
Field Token Attention
The field tokens (which represent the structured fields, e.g. "product price") enable cross-attention between field and HTML tokens. The authors tested field-text cross-attention and found no improvement from it.
They point out that while there is no direct interaction between field and text tokens, the two are bridged by the Text-to-HTML and field-to-HTML attentions.
The WebFormer Output Layer
There's really not a lot to say about the specific output.
It's exactly what you would think it is: a mapping of field tokens to the text spans related to them.
I.e., if the field token represents the product price, the text span would be the predicted price.
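As a hedged illustration of that idea: span extraction of this sort is commonly done with per-token begin/end scores, though the paper's exact output head may differ. A minimal sketch:
import numpy as np

# Illustrative sketch: picking the predicted text span for a field from
# per-token begin/end scores. The paper's exact output head may differ.
def extract_span(tokens, begin_scores, end_scores):
    best = (0, 0, -np.inf)
    for i in range(len(tokens)):
        for j in range(i, len(tokens)):
            score = begin_scores[i] + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    i, j, _ = best
    return " ".join(tokens[i:j + 1])

tokens = ["Price", ":", "$", "19.97"]
print(extract_span(tokens,
                   begin_scores=[0.1, 0.0, 0.9, 0.2],
                   end_scores=[0.0, 0.1, 0.1, 0.95]))   # -> "$ 19.97"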
The Experiments
SWDE Dataset - Designed for structured web data extraction, it consists of over 124k web pages from 80 sites in 8 verticals (auto, book, camera, job, movie, nbaplayer, restaurant, and university).
Common Crawl Dataset - From the 250TiB of web content (from more than 3 billion pages), the authors here selected web pages that have Schema.org annotations. From this they chose the domains of events, products and movies, looking for the following fields:
- Events - Name, Description, Date, Location
- Products - Name, Description, Brand, Price, Color
- Movies - Name, Description, Genre, Duration, Director, Actor, Published Date
For those not familiar with schema.org, the markup gives the correct (ground-truth) answer for each field, which is why these pages were selected.
A simple example of Product Schema would be:
{
  "@context": "https://schema.org",
  "@type": "Product",
  "description": "0.7 cubic feet countertop microwave. Has six preset cooking categories and convenience features like Add-A-Minute and Child Lock.",
  "name": "Kenmore White 17\" Microwave",
  "image": "kenmore-microwave-17in.jpg",
  "offers": {
    "@type": "Offer",
    "availability": "https://schema.org/InStock",
    "price": "55.00",
    "priceCurrency": "USD"
  }
}
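As a hedged sketch of how pages like this could be turned into labeled training examples (the library choice and field mapping here are my assumptions, not the paper's tooling):
import json
from bs4 import BeautifulSoup  # assumed dependency; the paper doesn't specify tooling

# Hedged sketch: pull schema.org JSON-LD out of a page's HTML so the annotated
# fields (name, description, price, ...) can serve as ground-truth labels.
def extract_product_labels(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script", type="application/ld+json"):
        data = json.loads(tag.string or "{}")
        if data.get("@type") == "Product":
            offers = data.get("offers") or {}
            return {
                "name": data.get("name"),
                "description": data.get("description"),
                "price": offers.get("price"),
            }
    return {}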
The Results
The authors evaluated the WebFormer model with two evaluation metrics:
- Exact Match - used to determine whether the predicted text span is exactly the same as the ground truth.
- F1 - used to measure the overlap of the predicted text span and the ground truth.
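For clarity, here's a minimal sketch of what those two metrics compute, following the common SQuAD-style definitions (the paper's exact text normalization may differ):
from collections import Counter

# Minimal sketch of the two metrics, using common SQuAD-style definitions;
# the paper's exact text normalization may differ.
def exact_match(prediction: str, truth: str) -> bool:
    return prediction.strip().lower() == truth.strip().lower()

def f1(prediction: str, truth: str) -> float:
    pred_tokens, truth_tokens = prediction.lower().split(), truth.lower().split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("$19.97", "$19.97"), round(f1("19 97 USD", "19 97"), 2))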
The baselines used include AVEQA and H-PLM, which you'll see referenced in the results below.
The results against these baselines are pretty compelling:

The authors conclude:
"..., the EM metric of WebFormer increases over 7.8% and 5.8% compared with AVEQA and H-PLM on Products"
- The model understands the layout better, and responds to changes in a superior fashion.
- The model can handle larger document sizes.
- Different fields are able to assist each other through their shared encoder.
Side Notes
The authors tested the model by removing the different attention layers. The results were:

Results of WebFormer with different attention patterns. Top: EM scores. Bottom: F1 scores.
We can clearly see the impact that using the HTML has on the success of the predictions.
Additionally they tested the system on varying text sequence lengths. The buckets were:
- 0 - 512
- 512 - 1024
- 1024 - 2048
- 2048+
The number of documents in each dataset is in the top row, followed by the prediction success rate for the 4 baseline models plus WebFormer on each of the two datasets.

Once again, WebFormer performs extremely well.
Zero-Shot and Few-Shot
To test WebFormer's capability on new domains, the authors conducted zero-shot and few-shot extraction experiments. They did this by training the models on Products and Movies, and then testing on Events.
The results for the four fields were:

This ability to apply the learnings to new domains quickly is incredibly important.
Because the fields "Name" and "Description" also appear in the other domains, they start off strong; the other two need a bit of training data, but still perform admirably within 50 examples.
Summary
The addition of HTML tokens provides obvious advantages and may well eliminate the need for Schema markup while simultaneously improving results. Not only does it improve performance, it also allows for extraction from larger documents than other models can handle.
As I said at the beginning of this post, WebFormer is nothing less than the next logical step in the evolution of the task of structure information extraction.
I look forward to seeing where this goes next, and how it impacts such real-life areas as search results.
Recommended Reading
Information Extraction from Scanned Receipts: Fine-tuning LayoutLM on SROIE
An OCR demo with LayoutLM fine-tuned for information extraction on receipts data.
Information Extraction From Documents Using Machine Learning
In this article, we'll extract information from templated documents like invoices, receipts, loan documents, bills, and purchase orders, using a model.
The Softmax Activation Function Explained
In this short tutorial, we'll explore the Softmax activation function, including its use in classification tasks, and how it relates to cross entropy loss.