
[draft] First experimental results

May 26 draft.

Experiment

Note that these results are preliminary: not all jobs have finished yet, and more feature rankers and datasets still have to be added.

Experiment setup

The experiment was conducted in a couple of steps. For one dataset and one feature ranker, the process was as follows (a code sketch of the full loop is given right after the list):
  1. Repeat the following 25 times:
    1. Resample the dataset with replacement. This procedure is called bootstrapping. By creating different permutations of the dataset, we are able to estimate the variance of the feature rankings.
    2. Run the feature ranker on the resampled dataset. If a "ground truth" for the desired feature importances is available, compute the R2 value and Log Loss between the ground truth and the feature importances estimated by the ranker.
    3. Using the feature ranking, run a validation estimator: (1) first on a feature subset containing only the #1 ranked feature, (2) then using the #1 and #2 ranked features, (3) then using the #1, #2 and #3 ranked features, et cetera. The feature subset size is capped at 50: if the dataset has more features, only the top-50 feature subsets are evaluated.
  2. Compute the aggregate metrics over all 25 bootstraps: the mean, standard deviation and variance are measured.
  3. Upload the results to wandb ✨
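
To make this concrete, here is a minimal sketch of the per-(dataset, ranker) loop, assuming a scikit-learn-style ranker that exposes `feature_importances_` and a Decision Tree as the validation estimator. All names (`run_experiment`, `ranker`, the metric keys) are illustrative, not the actual experiment code:

```python
# Minimal sketch of the per-(dataset, ranker) experiment loop described above.
# Assumes scikit-learn-style objects; names are illustrative only.
import numpy as np
import wandb
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

N_BOOTSTRAPS = 25
MAX_SUBSET_SIZE = 50

def run_experiment(X_train, y_train, X_test, y_test, ranker, true_importances=None):
    curves, ranker_r2 = [], []
    for b in range(N_BOOTSTRAPS):
        # 1a. Resample the training data with replacement (bootstrapping).
        X_b, y_b = resample(X_train, y_train, replace=True, random_state=b)

        # 1b. Run the feature ranker; score it against the ground truth if known.
        ranker.fit(X_b, y_b)
        importances = ranker.feature_importances_
        if true_importances is not None:
            ranker_r2.append(r2_score(true_importances, importances))

        # 1c. Validate increasingly large top-k feature subsets, k = 1..min(p, 50).
        ranking = np.argsort(importances)[::-1]
        curve = []
        for k in range(1, min(X_b.shape[1], MAX_SUBSET_SIZE) + 1):
            subset = ranking[:k]
            clf = DecisionTreeClassifier(random_state=0).fit(X_b[:, subset], y_b)
            curve.append(clf.score(X_test[:, subset], y_test))  # accuracy
        curves.append(curve)

    # 2. Aggregate over the 25 bootstraps: mean, standard deviation, variance.
    curves = np.asarray(curves)
    summary = {
        "validation/mean": curves.mean(axis=0).tolist(),
        "validation/std": curves.std(axis=0).tolist(),
        "validation/var": curves.var(axis=0).tolist(),
    }
    if ranker_r2:
        summary["ranker_r2/mean"] = float(np.mean(ranker_r2))

    # 3. Upload the aggregated results to wandb.
    wandb.log(summary)
    return summary
```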

To illustrate, one such experiment can be visualized as follows. This is the validation score for ReliefF on the Iris dataset (4 features), averaged over all bootstraps.

[Panel: Run set]

Selecting 2 features results in the optimal prediction performance; selecting more actually degrades the performance. If we know the ground-truth relevant features a priori, we can compute more sophisticated evaluation metrics. Let us observe a synthetic dataset that was specially designed for feature ranking experiments.


[Panel: Run set]

In this synthetic dataset, we compare the desired feature importances to the feature importances predicted by the feature ranker. As can be seen above, some rankers are more stable than others across the bootstrap datasets. We can observe how the R2 score translates to validation scores on the test set.
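
As a minimal sketch of this comparison between desired and estimated importances, assuming both importance vectors are non-negative and normalized to sum to one (the exact normalization and Log Loss definition used in the experiments may differ):

```python
import numpy as np
from sklearn.metrics import r2_score

# Compare ground-truth feature importances with a ranker's estimate.
# Assumes both are non-negative vectors that can be normalized to sum to one.
def importance_scores(true_importances, estimated_importances, eps=1e-12):
    p = np.asarray(true_importances, dtype=float)
    q = np.clip(np.asarray(estimated_importances, dtype=float), eps, None)
    p, q = p / p.sum(), q / q.sum()
    r2 = r2_score(p, q)                       # R2 between desired and estimated importances
    log_loss = float(-np.sum(p * np.log(q)))  # cross-entropy style log loss
    return r2, log_loss
```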
We can also plot only the best validation score per ranker as a dot; this way we can answer the question "for each feature ranker, how many features do we have to select to reach its maximum validation accuracy on this dataset?"

[Panel: Run set]
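
Given a mean validation curve per ranker, that question reduces to a simple argmax over the curve; a tiny sketch (assuming `mean_curve[k - 1]` holds the mean score of the top-k subset):

```python
import numpy as np

# mean_curve[k - 1] = mean validation score when selecting the top-k features
def best_subset_size(mean_curve):
    k = int(np.argmax(mean_curve)) + 1  # +1 because index 0 corresponds to k = 1
    return k, float(mean_curve[k - 1])
```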

To measure stability, we can look at the variance of the R2 or Log Loss scores across bootstraps. Taking the same datasets as above:

[Panel: Run set]

We can see that the variance in the validation scores is nicely represented by the variance chart below.
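
Numerically, this stability could be summarized as the spread of the per-bootstrap scores; a minimal sketch, where `scores` would hold the 25 R2 or Log Loss values of a single ranker:

```python
import numpy as np

# scores: one R2 (or Log Loss) value per bootstrap for a single ranker
def stability_summary(scores):
    scores = np.asarray(scores, dtype=float)
    return {"mean": scores.mean(), "std": scores.std(), "var": scores.var()}
```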


Regression

Regression tasks are evaluated by scoring a validation estimator on the test set; a Decision Tree was used as the validation estimator.
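
A sketch of this validation step, assuming scikit-learn's DecisionTreeRegressor and an index array `top_k` of the k best-ranked features (illustrative, not the actual experiment code):

```python
from sklearn.tree import DecisionTreeRegressor

# Fit on the top-k ranked features and score on the held-out test set.
def validate_regression(X_train, y_train, X_test, y_test, top_k):
    tree = DecisionTreeRegressor(random_state=0)
    tree.fit(X_train[:, top_k], y_train)
    return tree.score(X_test[:, top_k], y_test)  # .score returns R2 for regressors
```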

⚠️ Note you should only select 1 dataset at a time: the charts make sense only per-dataset.


[Panel: All regression datasets]



Learning curves [WIP]



Performance table


[Panel: Run set]


Classification

In this case, the validation classifier is scored using its accuracy on the test set.
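
A mirror of the regression sketch above; the Decision Tree classifier here is an assumption for illustration, since the classification validator is not stated explicitly in this section:

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Fit on the top-k ranked features and score with accuracy on the test set.
def validate_classification(X_train, y_train, X_test, y_test, top_k):
    clf = DecisionTreeClassifier(random_state=0).fit(X_train[:, top_k], y_train)
    return accuracy_score(y_test, clf.predict(X_test[:, top_k]))
```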

⚠️ Note you should only select 1 dataset at a time: the charts make sense only per-dataset.

[Panel: All classification datasets]




[Panel: Run set]




Evaluation metric correlation

Here we can observe how the ranker's R2 or Log Loss score correlates with the validator score. We have to separate classifiers and regressors because their scores are of different types: accuracy and R2 score, respectively.
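
A sketch of how such a correlation could be computed from paired per-run metrics, using scipy's pearsonr/spearmanr (the charts in this report may instead rely on wandb's built-in panels):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# ranker_scores: per-run ranker R2 (or Log Loss); validator_scores: matching
# per-run validator scores (accuracy for classifiers, R2 for regressors).
def metric_correlation(ranker_scores, validator_scores):
    x = np.asarray(ranker_scores, dtype=float)
    y = np.asarray(validator_scores, dtype=float)
    pearson_r, _ = pearsonr(x, y)
    spearman_r, _ = spearmanr(x, y)
    return {"pearson": pearson_r, "spearman": spearman_r}
```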

Regression: synreg hard dataset



[Panel: Run set]


All regression datasets

Parallel coordinates

Univariate targets:


[Panel: Run set]


Multivariate targets:


[Panel: Run set]



Classification: synclf medium dataset


[Panel: Run set]


All classification datasets

Parallel coordinates plot

Univariate targets:


[Panel: Run set]

Multivariate targets:

[Panel: Run set]