A Super Easy Guide to the Super Massive Numerai Dataset

An easy guide to Numerai Classic's super massive data set. It's easy on compute, too! Made by Suraj Parmar using Weights & Biases

Just give me the code

This notebook builds on top of the example_scripts provided by Numerai.

The Numerai Classic Tournament

Most hedge funds don't share their data for secrecy reasons. Numerai, however, is built different: it provides a huge obfuscated data set containing roughly 1,000 features and about 2 million rows. The rows are divided into eras, which represent different points in time.
The data is split into training, validation, tournament, and live sets. Each week a new live set is released and the previous one is appended to the tournament set. While the tournament set is used for some internal back-testing, the live set is what we need to predict on.
The goal is to maximize the correlation between the predicted targets and the live targets, which reflect the performance of the obfuscated stocks in the live set.
Training data era, features and targets.
With the introduction of the new super massive dataset, we now have 20 different targets, but we are scored only on the main target column, target_nomi_20d.

The Super Massive Dataset

Numerai recently released the super massive dataset with roughly 3 times more feature columns (310 -> 1050) and 20 targets. Targets are based on a time horizon of either 20 or 60 days. However, scoring is based on a single target column, target_nomi_20d.
Loading the whole data set as float will crash the Colab runtime. Luckily, Numerai also provides an int8 version of the features, which fits in the default Colab runtime.
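To see why the int8 version matters, here's a back-of-the-envelope memory estimate for a feature matrix of roughly the dataset's shape (the row and column counts below are the approximate figures quoted above, not exact):

```python
import numpy as np

# Approximate shape of the super massive feature matrix.
n_rows, n_cols = 2_000_000, 1050

# Bytes needed to hold the features at each dtype.
float_bytes = n_rows * n_cols * np.dtype(np.float64).itemsize
int8_bytes = n_rows * n_cols * np.dtype(np.int8).itemsize

print(f"float64: {float_bytes / 1e9:.1f} GB, int8: {int8_bytes / 1e9:.1f} GB")
```

At 8 bytes per float64 value that's about 16.8 GB for the features alone, versus about 2.1 GB as int8, which is why the int8 parquet files are the ones to use on a standard Colab runtime.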
napi = NumerAPI()
spinner = Halo(text='', spinner='dots')
current_round = napi.get_current_round(tournament=8)  # tournament 8 is the primary Numerai tournament

# read in all of the new data
# tournament data and example predictions change every week, so we specify the round in their names
# training and validation data only change periodically, so no need to re-download them every week
napi.download_dataset("numerai_tournament_data_int8.parquet", f"numerai_tournament_data_{current_round}_int8.parquet")
napi.download_dataset("numerai_training_data_int8.parquet", "numerai_training_data_int8.parquet")
napi.download_dataset("numerai_validation_data_int8.parquet", "numerai_validation_data_int8.parquet")
napi.download_dataset("example_predictions.parquet", f"example_predictions_{current_round}.parquet")
napi.download_dataset("example_validation_predictions.parquet", "example_validation_predictions.parquet")
napi.download_dataset("example_predictions.csv", "example_predictions.csv")
The features can have different correlations with the targets in different eras: a feature may be helpful in one era and hurt in the next.


The scoring metric: Spearman correlation on a per-era basis
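The per-era score can be sketched as the correlation between the percentile-ranked predictions and the target, computed separately within each era. Here's a toy version on random data (the column names and the random stand-in data are illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "era": np.repeat(["era1", "era2"], 100),
    "prediction": rng.random(200),
})
# Make the target partly depend on the prediction so the score is positive.
df["target"] = 0.5 * df["prediction"] + 0.5 * rng.random(200)

def era_score(d):
    # Rank-correlation flavour: Pearson correlation of the percentile-ranked
    # predictions with the target, within one era.
    ranked = d["prediction"].rank(pct=True)
    return np.corrcoef(ranked, d["target"])[0, 1]

per_era = df.groupby("era").apply(era_score)
print(per_era)
```

Scoring era by era (rather than over the pooled rows) is what makes stability across eras matter as much as the overall fit.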

Feature Selection

We'll be using the model from example_scripts with selected features and targets. Let's select the top 50% of features with the highest overall correlation with the target across the training eras.
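That selection can be sketched as: compute each feature's correlation with the target within each era, average over eras, and keep the top half by absolute mean correlation. The toy frame below stands in for training_data (using absolute correlation here is an assumption; a feature with a consistently negative correlation is also informative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for training_data: 4 features, 2 eras, one target column.
rng = np.random.default_rng(42)
training_data = pd.DataFrame(
    rng.random((200, 4)), columns=[f"feature_{i}" for i in range(4)]
)
training_data["era"] = np.repeat(["era1", "era2"], 100)
# feature_0 carries real signal; the rest are noise.
training_data["target"] = 0.5 * training_data["feature_0"] + 0.5 * rng.random(200)

feature_cols = [c for c in training_data.columns if c.startswith("feature")]
# Per-era correlation of each feature with the target, averaged over eras.
all_feature_corrs = training_data.groupby("era").apply(
    lambda d: d[feature_cols].corrwith(d["target"])
)
mean_corrs = all_feature_corrs.mean(0).abs().sort_values(ascending=False)
top_features = list(mean_corrs.index[: len(feature_cols) // 2])
print(top_features)
```

On the real data, feature_cols holds the ~1050 feature columns and the per-era groupby is the expensive step, so it's worth caching the resulting correlation table.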
Just like the features, let's look at how the different targets correlate with the target that is used for scoring, i.e. target_nomi_20d.
all_target_corrs = (
    training_data.groupby("era")
    .apply(lambda d: d[targets].corrwith(d[TARGET_COL]))
    .iloc[:, 1:]
)
top_targets = list(all_target_corrs.mean(0).sort_values(ascending=False)[:7].index)
bottom_targets = list(all_target_corrs.mean(0).sort_values(ascending=False)[-2:].index)
useful_targets = top_targets  # + bottom_targets
After selecting the features and targets, let's train a model on each of the selected targets and see how they perform.
models_ = {}
for target_ in tqdm(useful_targets):
    gc.collect()
    wandb.init(project="massive_nmr", name=f"tgt_{target_}")
    lgb_train = lgb.Dataset(
        training_data[top_features].values,
        np.array(training_data[target_].values, dtype=np.float32),
    )
    lgb_eval = lgb.Dataset(
        validation_data[top_features].values,
        np.array(validation_data[target_].values, dtype=np.float32),
        reference=lgb_train,
    )
    model_name = f"model_gbm_{target_}"

    # TODO: try different parameters
    params = {
        "n_estimators": 1000,
        "num_leaves": 2 ** 5,
        "device": "gpu",
    }
    model = lgb.train(
        params,
        lgb_train,
        valid_sets=[lgb_eval],
        verbose_eval=100,
        callbacks=[wandb_callback()],
    )
    models_[model_name] = model
    gc.collect()
We'll be using wandb_callback() to log the training metrics for each model.
We can also use wandb Artifacts to store and reload the trained models. Next week, when you submit to the tournament, you'll just need to load the saved models from Artifacts. That's it! No need to re-train every week.
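Before logging the artifact, the trained boosters need to exist on disk in a models/ directory, laid out to match what the inference code later globs for (model*.txt files plus a pickled feature list). A minimal sketch of that layout, with a placeholder string standing in for a real booster dump (in the real notebook you'd call model.save_model and joblib.dump instead, as noted in the comments):

```python
import os
import pickle

# Lay out a models/ directory for the wandb Artifact to pick up.
os.makedirs("models", exist_ok=True)

# Placeholder standing in for the trained boosters dict from the training loop.
models_ = {"model_gbm_target_nomi_20d": "placeholder booster dump"}

for name, model in models_.items():
    # Real code: model.save_model(f"models/{name}.txt")  (LightGBM text dump)
    with open(f"models/{name}.txt", "w") as f:
        f.write(model)

# Real code: joblib.dump(top_features, "models/model_expected_features.pkl")
with open("models/model_expected_features.pkl", "wb") as f:
    pickle.dump(["feature_0"], f)

print(sorted(os.listdir("models")))
```

Saving the expected feature list alongside the models matters: at inference time you must feed the boosters columns in exactly the order they were trained on.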
# sync the saved model files to the wandb run as artifacts
# this makes downloading them at inference time easy
run = wandb.init(project="massive_nmr", job_type='train')
artifact = wandb.Artifact('models', type='model')
artifact.add_dir('models')
run.log_artifact(artifact)

Making predictions

Here's how to load the saved artifacts and make predictions with them.
run = wandb.init(project="massive_nmr")
artifact = run.use_artifact('/massive_nmr/models:latest', type='model')
artifact_dir = artifact.download(root="artifacts")
model_files = glob.glob("artifacts/model*.txt")
print(model_files)
top_features = joblib.load("artifacts/model_expected_features.pkl")

models_ = {}
for model_file in model_files:
    target_name = model_file[:-4]
    print(target_name)
    models_[target_name] = lgb.Booster(model_file=model_file)
Below are the validation metrics on the models and their ensemble.
You can learn more about the metrics on my post, Evaluating Financial Machine Learning Models on Numerai
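One simple way to form the ensemble mentioned above is to rank each model's predictions onto a common [0, 1] scale and average the ranks, which keeps any one model's raw-score scale from dominating. A toy sketch (the random columns stand in for the per-target models' predictions; this is one reasonable ensembling choice, not necessarily the exact one used for the metrics):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Toy stand-in: one prediction column per trained model.
validation_data = pd.DataFrame(
    {f"preds_model_gbm_tgt{i}": rng.random(100) for i in range(3)}
)

pred_cols = [c for c in validation_data.columns if c.startswith("preds_")]
# Percentile-rank each model's predictions, then average the ranks.
validation_data["preds_ensemble"] = (
    validation_data[pred_cols].rank(pct=True).mean(axis=1)
)
print(validation_data["preds_ensemble"].head())
```

Rank-averaging is robust when the per-target models produce scores on different scales, which is exactly the situation after training against several different targets.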


The features are non-stationary, meaning they may have different correlations with the target in different eras. If our model has high exposure to any single feature, performance can suffer when that feature's correlation changes. Neutralization reduces our predictions' linear exposure to selected features.
The notebook selects the 50 most volatile features and neutralizes the predictions against them. This may reduce the mean correlation, but it helps achieve a higher Sharpe ratio.
You should experiment with different features and proportions to see how they affect the validation scores.
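Conceptually, neutralization is a per-era linear regression: fit the predictions on the neutralizer features within each era and subtract a fraction of the fitted part, leaving the residual. A minimal sketch on toy data (the real neutralize helper in example_scripts also normalizes the scores first; the function and column names here are illustrative):

```python
import numpy as np
import pandas as pd

def neutralize_sketch(df, pred_col, neutralizers, proportion=1.0, era_col="era"):
    # Within each era, subtract `proportion` times the least-squares
    # projection of the predictions onto the neutralizer features
    # (plus an intercept), then rescale the residual.
    out = []
    for _, d in df.groupby(era_col):
        scores = d[pred_col].to_numpy(dtype=np.float64)
        exposures = np.column_stack(
            [np.ones(len(d)), d[neutralizers].to_numpy(dtype=np.float64)]
        )
        proj = exposures @ np.linalg.lstsq(exposures, scores, rcond=None)[0]
        neutral = scores - proportion * proj
        out.append(pd.Series(neutral / neutral.std(), index=d.index))
    return pd.concat(out)

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "era": np.repeat(["era1", "era2"], 50),
    "feature_risky": rng.random(100),
})
# Predictions heavily exposed to the risky feature.
df["preds"] = 0.8 * df["feature_risky"] + 0.2 * rng.random(100)
df["preds_neutral"] = neutralize_sketch(df, "preds", ["feature_risky"])

# After full neutralization, per-era exposure to the risky feature is ~0.
print(df.groupby("era").apply(lambda d: d["preds_neutral"].corr(d["feature_risky"])))
```

With proportion=1.0 the residual is orthogonal to the neutralizers within each era; proportion=0.5, as used below, removes only half the exposure, trading some protection for more of the original signal.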
spinner.start("Neutralizing to risky features")

# get the per-era correlation of each feature with the target
all_feature_corrs = training_data.groupby(ERA_COL).apply(
    lambda d: d[feature_cols].corrwith(d[TARGET_COL])
)
# find the riskiest features by comparing their correlation with the target
# in the first and second halves of the training data
riskiest_features = get_biggest_change_features(all_feature_corrs, 50)

# neutralize our predictions to the riskiest features
validation_data[f"preds_{model_name}_neutral_riskiest_50"] = neutralize(
    df=validation_data,
    columns=[f"preds_{model_name}"],
    neutralizers=riskiest_features,
    proportion=0.5,
    normalize=True,
    era_col=ERA_COL,
)
tournament_data[f"preds_{model_name}_neutral_riskiest_50"] = neutralize(
    df=tournament_data,
    columns=[f"preds_{model_name}"],
    neutralizers=riskiest_features,
    proportion=0.5,
    normalize=True,
    era_col=ERA_COL,
)
spinner.succeed()
Neutralized prediction metrics

Making a submission

Predictions on the tournament data are submitted each round, while the validation data is useful for diagnostics. Validation predictions can be uploaded to the Diagnostics dashboard to see more metrics.
model_to_submit = f"preds_{model_name}_neutral_riskiest_50"  # neutralized predictions
validation_data["prediction"] = validation_data[model_to_submit].rank(pct=True)
tournament_data["prediction"] = tournament_data[model_to_submit].rank(pct=True)
validation_data["prediction"].to_csv(f"validation_predictions_{current_round}.csv")
tournament_data["prediction"].to_csv(f"tournament_predictions_{current_round}.csv")
Diagnostics dashboard
We can submit the predictions using NumerAPI. version=2 refers to the newer data set used in this notebook, with 1050 features. You can create a new API key at https://numer.ai/account > Automation > Create API keys.
model_id can be copied from the https://numer.ai/models page > more (on the model).
# Submit
public_id = ""
secret_key = ""
model_id = ""
napi = numerapi.NumerAPI(public_id, secret_key)
napi.upload_predictions(
    f'tournament_predictions_{current_round}.csv',
    model_id=model_id,
    version=2,
)


As the data set is "Super Massive", this notebook was created with the standard Colab runtime in mind. With more memory, you'll be able to run bigger experiments with more targets and more features.

What's next?