# Revisiting Deep Learning Models for Tabular Data

An in-depth breakdown of 'Revisiting Deep Learning Models for Tabular Data' by Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov & Artem Babenko.
Saurav Maheshkar

Watch the full video discussion on our YouTube channel.

## 👋 Motivation

Due to the tremendous success of deep learning on common data modalities like images, audio and text, there has been a lot of interest in extending those deep learning techniques to tabular data. Tabular data is of course widely used, often for industrial applications where data points are represented as vectors of heterogeneous features.
Specifically, deep learning on tabular data would allow for the construction of multi-modal pipelines for situations where some part of the input data is tabular and the other parts might include other modalities like image or audio. These pipelines could be trained in an end-to-end manner by using gradient descent across all modalities.
A large number of deep learning based solutions for tabular data have been proposed recently such as:
1. TabNet (Arik and Pfister, 2020), which uses sequential attention to choose which features to reason from at each decision step, thereby enabling interpretability and more efficient learning, since the network's learning capacity is spent on the most important features.
2. GrowNet (Badirli et al., 2020), which uses a novel gradient boosting framework where shallow neural networks (with one or two hidden layers) are used as "weak learners". At each boosting step, the original input features are augmented with the output from the penultimate layer of the current iteration. This augmented feature-set is then fed as input to train the next weak learner via a boosting mechanism using the current residuals.
3. Tree Ensemble Layers (Hazimeh et al., 2020), which introduce a new layer composed of an ensemble of differentiable decision trees. These trees perform soft routing, i.e., instead of routing a sample in exactly one direction at an internal node, they route it in both directions in different proportions, which makes them differentiable and amenable to gradient-based learning.
4. TabTransformer (Huang et al., 2020a), which uses self-attention-based Transformers to convert categorical features into d-dimensional embeddings, which are then fed into Transformer blocks. The resulting contextual embeddings have been shown to achieve higher prediction accuracy.
5. Self Normalizing Neural Networks (SNNs) (Klambauer et al., 2017), which employ the use of Scaled Exponential Linear Units (SELUs), a special activation function which induces self normalizing properties. These activations converge towards zero mean and unit variance even under noise and perturbations.
6. Neural Oblivious Decision Ensembles (NODE) (Popov et al., 2020), which proposes a new architecture with a layer-wise structure, consisting of differentiable oblivious trees, a type of decision table that splits the data along d-splitting features and compares each feature to a learned threshold, trained in an end-to-end manner by back-propagation.
7. AutoInt (Song et al., 2019), which converts sparse input feature vectors into low-dimensional learned embeddings and then feeds these embeddings into a multi-head self-attention network to model high-order feature interactions.
8. Deep & Cross Neural Networks (DCNs) (Wang et al., 2017, Wang et al., 2020a), which train two parallel networks, a Cross Network to capture high-degree explicit interactions across features and a Deep Network to learn implicit features, from learned embeddings. This architecture was further improved using low-rank techniques to reduce the computational cost.
However, because of the lack of established benchmarks and baselines, it is still unclear which deep learning model generally performs better than others, and whether gradient-boosted decision trees surpass deep learning models at all.
In this report we will be looking at the paper titled "Revisiting Deep Learning Models for Tabular Data" by Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov and Artem Babenko.

### 🔑 Key Takeaways

After a thorough evaluation of current architectures on a diverse set of tasks, two models emerge for tabular data:
• A simple ResNet-like architecture acts as a simple yet effective baseline for tabular deep learning, and is recommended as a baseline for comparison.
• The authors introduce FT-Transformer, a simple adaptation of the widely used Transformer architecture, which proves to be a universal model that performs well on a wider range of tasks than other deep learning models.
The authors note that **there is still no universally superior solution among GBDT and deep models**.

## 👨‍🏫 Method

#### 📝 Notation (Click to Expand)

For this work, we'll mainly consider supervised learning problems.
• D = \{ (x_i, y_i) \}_{i=1}^{n} denotes the dataset.
• x_i = ({x}_{i}^{(num)}, {x}_{i}^{(cat)}) \in \mathbb{X}, where {x}_{i}^{(num)} and {x}_{i}^{(cat)} denote the numerical and categorical features respectively.
• y_i \in \mathbb{Y} denotes the corresponding object label.
• k denotes the total number of features.

### 👴🏽 Multi-Layer Perceptron (MLP)

The MLP architecture is formalized as follows:
\large MLPBlock(x) = Dropout(\, ReLU(\, Linear(x) \,) \,) \\ MLP(x) = Linear(\, MLPBlock(\, ... (\, MLPBlock(x) \,) \,) \,)
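The two equations above translate almost line for line into PyTorch. The sketch below is a minimal, hypothetical implementation; the layer sizes and dropout rate are illustrative placeholders, not the paper's tuned values:

```python
import torch
import torch.nn as nn


class MLPBlock(nn.Module):
    """Dropout(ReLU(Linear(x))) from the formula above."""

    def __init__(self, d_in: int, d_out: int, dropout: float) -> None:
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dropout(torch.relu(self.linear(x)))


class MLP(nn.Module):
    """A stack of MLPBlocks followed by a final Linear head."""

    def __init__(self, d_in: int, d_hidden: int, d_out: int,
                 n_blocks: int, dropout: float = 0.1) -> None:
        super().__init__()
        dims = [d_in] + [d_hidden] * n_blocks
        self.blocks = nn.Sequential(
            *[MLPBlock(dims[i], dims[i + 1], dropout) for i in range(n_blocks)]
        )
        self.head = nn.Linear(d_hidden, d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.blocks(x))


model = MLP(d_in=8, d_hidden=32, d_out=1, n_blocks=3)
out = model(torch.randn(4, 8))
print(out.shape)  # torch.Size([4, 1])
```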

### 🔁 ResNet

ResNets are formalized as follows:
\large Prediction(x) = Linear(\, ReLU(\, BatchNorm(x) \,) \,) \\ ResNetBlock(x) = x + Dropout(\, Linear(\, Dropout(\, ReLU(\, Linear(\, BatchNorm(x) \,) \,) \,) \,) \,) \\ ResNet(x) = Prediction(\, ResNetBlock(\, ... (\, ResNetBlock(\, Linear(x) \,) \,) \,) \,) \\
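As a concrete illustration, here is a hedged sketch of this ResNet variant in PyTorch. It mirrors the formulas term by term (BatchNorm, two Linear layers with Dropout inside each block, a skip connection, and the Prediction head); all dimensions and the dropout rate are placeholder choices:

```python
import torch
import torch.nn as nn


class ResNetBlock(nn.Module):
    """x + Dropout(Linear(Dropout(ReLU(Linear(BatchNorm(x)))))) from above."""

    def __init__(self, d: int, d_hidden: int, dropout: float) -> None:
        super().__init__()
        self.bn = nn.BatchNorm1d(d)
        self.linear1 = nn.Linear(d, d_hidden)
        self.drop1 = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_hidden, d)
        self.drop2 = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.drop1(torch.relu(self.linear1(self.bn(x))))
        return x + self.drop2(self.linear2(z))


class ResNet(nn.Module):
    def __init__(self, d_in: int, d: int, d_hidden: int, d_out: int,
                 n_blocks: int, dropout: float = 0.1) -> None:
        super().__init__()
        self.first = nn.Linear(d_in, d)  # the initial Linear(x)
        self.blocks = nn.Sequential(
            *[ResNetBlock(d, d_hidden, dropout) for _ in range(n_blocks)]
        )
        # Prediction(x) = Linear(ReLU(BatchNorm(x)))
        self.head = nn.Sequential(nn.BatchNorm1d(d), nn.ReLU(), nn.Linear(d, d_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.blocks(self.first(x)))
```

Note that, unlike the original ResNet for images, normalization here comes *before* the linear transformation inside each block.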

### 🆕 FT-Transformer (Feature Tokenizer + Transformer)

#### Code for the Feature Tokenizer (Click to Expand)

```python
import math
import typing as ty

import torch
import torch.nn as nn
import torch.nn.init as nn_init
from torch import Tensor


class Tokenizer(nn.Module):
    category_offsets: ty.Optional[Tensor]

    def __init__(
        self,
        d_numerical: int,
        categories: ty.Optional[ty.List[int]],
        d_token: int,
        bias: bool,
    ) -> None:
        super().__init__()
        # If categorical features are absent
        if categories is None:
            d_bias = d_numerical
            self.category_offsets = None
            self.category_embeddings = None
        else:
            d_bias = d_numerical + len(categories)
            category_offsets = torch.tensor([0] + categories[:-1]).cumsum(0)
            self.register_buffer('category_offsets', category_offsets)
            # Creates a lookup table, where num_embeddings = sum(categories)
            # and d_embed = d_token
            self.category_embeddings = nn.Embedding(sum(categories), d_token)
            nn_init.kaiming_uniform_(self.category_embeddings.weight, a=math.sqrt(5))

        # Take the [CLS] token into account
        self.weight = nn.Parameter(Tensor(d_numerical + 1, d_token))
        self.bias = nn.Parameter(Tensor(d_bias, d_token)) if bias else None
        # The initialization is inspired by nn.Linear
        nn_init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            nn_init.kaiming_uniform_(self.bias, a=math.sqrt(5))

    def forward(self, x_num: Tensor, x_cat: ty.Optional[Tensor]) -> Tensor:
        x_some = x_num if x_cat is None else x_cat
        assert x_some is not None
        x_num = torch.cat(
            [torch.ones(len(x_some), 1, device=x_some.device)]  # [CLS]
            + ([] if x_num is None else [x_num]),
            dim=1,
        )
        x = self.weight[None] * x_num[:, :, None]
        # If categorical features are present
        if x_cat is not None:
            x = torch.cat(
                [x, self.category_embeddings(x_cat + self.category_offsets[None])],
                dim=1,
            )
        if self.bias is not None:
            bias = torch.cat(
                [
                    torch.zeros(1, self.bias.shape[1], device=x.device),
                    self.bias,
                ]
            )
            x = x + bias[None]
        return x
```

#### Feature Tokenizer Layer

The Feature Tokenizer transforms the input features x into embeddings T \in \mathbb{R}^{k \times d}. The embedding for a given feature x_j is computed as follows:
\huge T_j = b_j + f_j(x_j) \in \mathbb{R}^{d} \hspace{5em} f_j :\mathbb{X}_{j} \mapsto \mathbb{R}^{d}
where b_j is the j-th feature bias. For numerical features, {f}_{j}^{(num)} is implemented as element-wise multiplication with the vector {W}_{j}^{(num)}; for categorical features, {f}_{j}^{(cat)} is implemented as a lookup in the table {W}_{j}^{(cat)} using the one-hot vector {e}_{j}^{T}.
The Feature Tokenizer can be formalized as follows:
\huge {T}_{j}^{(num)} = {b}_{j}^{(num)} + {x}_{j}^{(num)} \cdot {W}_{j}^{(num)} \in \mathbb{R}^d \\ {T}_{j}^{(cat)} = {b}_{j}^{(cat)} + {e}_{j}^{T} {W}_{j}^{(cat)} \in \mathbb{R}^d \\ T = stack[\, {T}_{1}^{(num)}, ... , \,{T}_{k^{(num)}}^{(num)} , \,{T}_{1}^{(cat)}, ... , \, {T}_{k^{(cat)}}^{(cat)}\,] \in \mathbb{R}^{k \times d}
Figure 1: (Adapted from Figure 2 in the paper) Pictographic representation of how the Feature Tokenizer works. The 3 blue colored boxes represent the numerical features, and are converted to embeddings via element-wise computations. The remaining two boxes represent categorical features. The W^{(cat)} matrix in this case acts as a lookup table. The resulting tokens are represented by the matrix T \in \mathbb{R}^{k \times d}.
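The tokenizer math can be reproduced in a few lines without the full class. The sketch below is a self-contained toy example matching Figure 1 (3 numerical features, 2 categorical features); the sizes, cardinalities, and variable names are all illustrative assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 3 numerical features, 2 categorical features with
# 4 and 3 categories respectively, and token dimension d = 6.
d, n_num = 6, 3
cardinalities = [4, 3]

W_num = nn.Parameter(torch.randn(n_num, d))  # one vector per numerical feature
b_num = nn.Parameter(torch.randn(n_num, d))
W_cat = nn.Embedding(sum(cardinalities), d)  # lookup table == one-hot @ W^(cat)
b_cat = nn.Parameter(torch.randn(len(cardinalities), d))
# Offsets shift each feature's raw category index into its own slice of W_cat
offsets = torch.tensor([0] + cardinalities[:-1]).cumsum(0)

x_num = torch.randn(2, n_num)           # a batch of 2 samples
x_cat = torch.tensor([[1, 2], [3, 0]])  # raw category indices per feature

T_num = b_num + x_num[:, :, None] * W_num  # (2, 3, d): element-wise scaling
T_cat = b_cat + W_cat(x_cat + offsets)     # (2, 2, d): embedding lookup
T = torch.cat([T_num, T_cat], dim=1)       # (2, k, d) with k = 5
print(T.shape)  # torch.Size([2, 5, 6])
```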

#### Transformer Layers

Having converted the input data into d-dimensional learnable embeddings, we feed them into the Transformer blocks. A special [CLS] token is appended to the stack of embeddings before further processing.
\large T_{0} = stack[\, [CLS], T \,] \hspace{3em} T_i = F_i (T_{i-1}) \hspace{1em} i = 1 \ldots L
Figure 2: (Adapted from Figure 2 in the paper) Pictographic representation of a single Transformer layer. Notice how we use residual pre-normalization. The tokens T_i are passed through L such layers, viz. \{F_i\}_{i=1}^{L}.

#### Prediction/Output Layer

The final representation of the [CLS] token is used for prediction:
\large \hat{y} = Linear(\, ReLU(\, LayerNorm(\, {T}_{L}^{[CLS]} \,) \,) \,)
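The prediction head is tiny in practice. Here is a minimal sketch of it, assuming illustrative sizes (d = 6 token dimension, a 9-token sequence, 2 output classes) and that the [CLS] token sits at position 0, as in the Tokenizer code above:

```python
import torch
import torch.nn as nn

d, n_classes = 6, 2
# Linear(ReLU(LayerNorm(T_L^[CLS]))) from the formula above
head = nn.Sequential(nn.LayerNorm(d), nn.ReLU(), nn.Linear(d, n_classes))

T_L = torch.randn(4, 9, d)  # last Transformer layer's output: (batch, 1 + k, d)
cls_token = T_L[:, 0]       # the [CLS] token is the first position in the stack
y_hat = head(cls_token)     # (4, n_classes)
print(y_hat.shape)  # torch.Size([4, 2])
```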
NOTE: The authors also use standard training tricks such as PreNorm, Dropout modules, and the ReGLU activation, and are cognizant of the fact that widespread use of FT-Transformer could lead to greater CO2 emissions.

#### 👉 The entire FT-Transformer can be formalized as follows: (click to expand) 👈

\large FFN(x) = Linear(\, Dropout(\, Activation(\, Linear(x) \,) \,) \,) \\ ResidualPreNorm(Module, \, x) = x + Dropout(\, Module(\, Norm(x) \,) \,) \\ Block(x) = ResidualPreNorm(\, FFN, \, ResidualPreNorm(MHSA, \, x) \,) \\ Prediction(x) = Linear(\, ReLU(\, LayerNorm(\, x^{[CLS]} \,) \,) \,) \\ FT\text{-}Transformer(x) = Prediction(\, Block(\, ... (\, Block(\, AppendCLS(\, FeatureTokenizer(x)\, )\, ) \,) \,) \,)
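To make the Block(x) formula concrete, here is a sketch of one pre-norm Transformer block with a ReGLU feed-forward network, using PyTorch's built-in nn.MultiheadAttention in place of a custom MHSA implementation. The hyperparameters are illustrative, not the paper's defaults:

```python
import torch
import torch.nn as nn


class ReGLU(nn.Module):
    """ReGLU activation: split the input in two halves a, b and return relu(a) * b."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x.chunk(2, dim=-1)
        return torch.relu(a) * b


class Block(nn.Module):
    """One FT-Transformer block: ResidualPreNorm(FFN, ResidualPreNorm(MHSA, x))."""

    def __init__(self, d: int, n_heads: int, d_ffn: int, dropout: float) -> None:
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.mhsa = nn.MultiheadAttention(d, n_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d)
        # FFN(x) = Linear(Dropout(Activation(Linear(x)))); the first Linear's
        # output is doubled so ReGLU can split it in two.
        self.ffn = nn.Sequential(
            nn.Linear(d, 2 * d_ffn), ReGLU(), nn.Dropout(dropout), nn.Linear(d_ffn, d)
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm: normalize *before* the sub-module, then add the residual
        z = self.norm1(x)
        x = x + self.drop(self.mhsa(z, z, z, need_weights=False)[0])
        return x + self.drop(self.ffn(self.norm2(x)))
```

The full model is then Tokenizer → AppendCLS → L such Blocks → the prediction head shown earlier.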

## 🧪 Experiments

### 📊 Architecture Specific Performance (Click Each to Expand)

We performed extensive experimentation (150+ runs) across 15 random seeds. Click the different sub-headings to see detailed metrics for each architecture type.

## 👋 Conclusion

In this report we went through the paper titled "Revisiting Deep Learning Models for Tabular Data" by Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov and Artem Babenko. The authors investigated the status quo in the field of deep learning for tabular data and improved the state of its baselines. The code and all the details of the study were open-sourced in a great GitHub repository ⭐️, which we forked and extended with a W&B implementation.
To cite the paper, kindly use the following BibTeX:

```bibtex
@article{gorishniy2021revisiting,
  title={Revisiting Deep Learning Models for Tabular Data},
  author={Yury Gorishniy and Ivan Rubachev and Valentin Khrulkov and Artem Babenko},
  journal={arXiv},
  volume={2106.11959},
  year={2021},
}
```
