
[draft] First experimental results

May 26 draft.

Experiment

Note that these results are preliminary: not all jobs have finished yet, and more feature rankers and datasets still have to be added.

Experiment setup

The experiment was conducted in a couple of steps. For one dataset and one feature ranker, the process was as follows (a code sketch of the full loop is given right after the list):
  1. Repeat the following 25 times:
    1. Resample the dataset with replacement. This procedure is called bootstrapping. By creating different permutations of the dataset, we are able to estimate the variance of the feature rankings.
    2. Run the feature ranker on the resampled dataset. If a "ground truth" for the desired feature importances is available, compute the R2 value and Log Loss between the ground truth and the feature importances estimated by the ranker.
    3. Using the feature ranking, run a validation estimator: (1) first on a feature subset containing only the #1 ranked feature, (2) then using the #1 and #2 ranked features, (3) then using the #1, #2 and #3 ranked features, et cetera. The feature subset size is capped at 50: if the dataset has more features, only the top-50 feature subsets are evaluated.
  2. Compute the aggregate metrics over all 25 bootstraps: the mean, standard deviation and variance are measured.
  3. Upload the results to wandb ✨
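
To make this concrete, here is a minimal sketch of the per-(dataset, ranker) loop, assuming a scikit-learn-style ranker that exposes `feature_importances_` and a Decision Tree as the validation estimator. All names (`run_experiment`, `ranker`, the metric keys) are illustrative, not the actual experiment code:

```python
# Minimal sketch of the per-(dataset, ranker) experiment loop described above.
# Assumes scikit-learn-style objects; names are illustrative only.
import numpy as np
import wandb
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

N_BOOTSTRAPS = 25
MAX_SUBSET_SIZE = 50

def run_experiment(X_train, y_train, X_test, y_test, ranker, true_importances=None):
    curves, ranker_r2 = [], []
    for b in range(N_BOOTSTRAPS):
        # 1a. Resample the training data with replacement (bootstrapping).
        X_b, y_b = resample(X_train, y_train, replace=True, random_state=b)

        # 1b. Run the feature ranker; score it against the ground truth if known.
        ranker.fit(X_b, y_b)
        importances = ranker.feature_importances_
        if true_importances is not None:
            ranker_r2.append(r2_score(true_importances, importances))

        # 1c. Validate increasingly large top-k feature subsets, k = 1..min(p, 50).
        ranking = np.argsort(importances)[::-1]
        curve = []
        for k in range(1, min(X_b.shape[1], MAX_SUBSET_SIZE) + 1):
            subset = ranking[:k]
            clf = DecisionTreeClassifier(random_state=0).fit(X_b[:, subset], y_b)
            curve.append(clf.score(X_test[:, subset], y_test))  # accuracy
        curves.append(curve)

    # 2. Aggregate over the 25 bootstraps: mean, standard deviation, variance.
    curves = np.asarray(curves)
    summary = {
        "validation/mean": curves.mean(axis=0).tolist(),
        "validation/std": curves.std(axis=0).tolist(),
        "validation/var": curves.var(axis=0).tolist(),
    }
    if ranker_r2:
        summary["ranker_r2/mean"] = float(np.mean(ranker_r2))

    # 3. Upload the aggregated results to wandb.
    wandb.log(summary)
    return summary
```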

To illustrate, one such experiment can be visualized as follows. This is the validation score for ReliefF on the Iris dataset (4 features), averaged over all bootstraps.

[Panel: Run set]

Selecting 2 features results in the optimal prediction performance; selecting more actually degrades the performance. If we know the ground-truth relevant features a priori, we can compute more sophisticated evaluation metrics. Let us observe a synthetic dataset that was specially designed for feature ranking experiments.


[Panel: Run set]

In this synthetic dataset, we compare the desired feature importances to the feature importances predicted by the feature ranker. As can be seen above, some rankers are more stable than others across the bootstrap datasets. We can observe how the R2 score translates to validation scores on the test set.
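
As a minimal sketch of this comparison between desired and estimated importances, assuming both importance vectors are non-negative and normalized to sum to one (the exact normalization and Log Loss definition used in the experiments may differ):

```python
import numpy as np
from sklearn.metrics import r2_score

# Compare ground-truth feature importances with a ranker's estimate.
# Assumes both are non-negative vectors that can be normalized to sum to one.
def importance_scores(true_importances, estimated_importances, eps=1e-12):
    p = np.asarray(true_importances, dtype=float)
    q = np.clip(np.asarray(estimated_importances, dtype=float), eps, None)
    p, q = p / p.sum(), q / q.sum()
    r2 = r2_score(p, q)                       # R2 between desired and estimated importances
    log_loss = float(-np.sum(p * np.log(q)))  # cross-entropy style log loss
    return r2, log_loss
```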
We can also plot only the best validation score per ranker as a dot; this way we can answer the question "for each feature ranker, how many features do we have to select to reach its maximum validation accuracy on this dataset?"

[Panel: Run set]
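
Given a mean validation curve per ranker, that question reduces to a simple argmax over the curve; a tiny sketch (assuming `mean_curve[k - 1]` holds the mean score of the top-k subset):

```python
import numpy as np

# mean_curve[k - 1] = mean validation score when selecting the top-k features
def best_subset_size(mean_curve):
    k = int(np.argmax(mean_curve)) + 1  # +1 because index 0 corresponds to k = 1
    return k, float(mean_curve[k - 1])
```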

To measure stability, we can look at the variance of the R2 or Log Loss scores across bootstraps. Taking the same datasets as above:

[Panel: Run set]

We can see that the variance in the validation scores is nicely represented by the variance chart below.
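
Numerically, this stability could be summarized as the spread of the per-bootstrap scores; a minimal sketch, where `scores` would hold the 25 R2 or Log Loss values of a single ranker:

```python
import numpy as np

# scores: one R2 (or Log Loss) value per bootstrap for a single ranker
def stability_summary(scores):
    scores = np.asarray(scores, dtype=float)
    return {"mean": scores.mean(), "std": scores.std(), "var": scores.var()}
```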


Regression

Regression tasks are evaluated by scoring a validation estimator on the test set; a Decision Tree was used as the validation estimator.
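
A sketch of this validation step, assuming scikit-learn's DecisionTreeRegressor and an index array `top_k` of the k best-ranked features (illustrative, not the actual experiment code):

```python
from sklearn.tree import DecisionTreeRegressor

# Fit on the top-k ranked features and score on the held-out test set.
def validate_regression(X_train, y_train, X_test, y_test, top_k):
    tree = DecisionTreeRegressor(random_state=0)
    tree.fit(X_train[:, top_k], y_train)
    return tree.score(X_test[:, top_k], y_test)  # .score returns R2 for regressors
```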

⚠️ Note you should only select 1 dataset at a time: the charts make sense only per-dataset.


[Panel: All regression datasets]



Learning curves [WIP]



Performance table


[Panel: Run set]


Classification

In this case, the validation classifier is scored using its accuracy on the test set.
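
A mirror of the regression sketch above; the Decision Tree classifier here is an assumption for illustration, since the classification validator is not stated explicitly in this section:

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Fit on the top-k ranked features and score with accuracy on the test set.
def validate_classification(X_train, y_train, X_test, y_test, top_k):
    clf = DecisionTreeClassifier(random_state=0).fit(X_train[:, top_k], y_train)
    return accuracy_score(y_test, clf.predict(X_test[:, top_k]))
```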

⚠️ Note you should only select 1 dataset at a time: the charts make sense only per-dataset.

[Panel: All classification datasets]




[Panel: Run set]




Evaluation metric correlation

Here we can observe how the ranker's R2 or Log Loss score correlates with the validator score. We have to separate classifiers and regressors because their scores are of different types: accuracy and R2 score, respectively.
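
A sketch of how such a correlation could be computed from paired per-run metrics, using scipy's pearsonr/spearmanr (the charts in this report may instead rely on wandb's built-in panels):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# ranker_scores: per-run ranker R2 (or Log Loss); validator_scores: matching
# per-run validator scores (accuracy for classifiers, R2 for regressors).
def metric_correlation(ranker_scores, validator_scores):
    x = np.asarray(ranker_scores, dtype=float)
    y = np.asarray(validator_scores, dtype=float)
    pearson_r, _ = pearsonr(x, y)
    spearman_r, _ = spearmanr(x, y)
    return {"pearson": pearson_r, "spearman": spearman_r}
```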

Regression: synreg hard dataset



[Panel: Run set]


All regression datasets

Parallel coordinates

Univariate targets:


[Panel: Run set]


Multivariate targets:


[Panel: Run set]



Classification: synclf medium dataset


[Panel: Run set]


All classification datasets

Parallel coordinates plot

Univariate targets:


[Panel: Run set]

Multivariate targets:

[Panel: Run set]