Getting Started with Numerai Signals: Sentiment Analysis

This report demonstrates how to use Stock News API and FinBERT for the Numerai Signals tournament. Made by Carlo Lepelaars using Weights & Biases

In this blog post we will introduce you to the Numerai Signals tournament and show you how to get started using a sentiment analysis example. We will look at news headlines in order to predict a ranking of stocks based on sentiment scores. For a general introduction to Numerai check out my other blog post here.

What is Numerai Signals?

Numerai is a crowdsourced AI hedge fund that operates on predictions made by data scientists worldwide. The classic Numerai tournament provides you with anonymized data that can be modeled using machine learning. Numerai Signals generalizes this idea and allows you to use any data you want. This allows data scientists to get creative with data sources as well as modeling techniques.

To get an idea of the vision behind Numerai Signals, check out this short video:
Let's dive into the full pipeline for making a submission on Numerai Signals using sentiment analysis! The full code is also implemented in a Kaggle Notebook. We will start with the most important part of any data science pipeline. Namely, getting good data! For this we will use a clean API (Application Programming Interface) for Stock news articles.

Full Code →

Stock News API

Stock News API is an online service for easily retrieving news articles for a broad range of stocks. It collects articles from around 10 different news sources and provides additional information for easy data parsing. It allows us to get the most recent articles for every week in a clean JSON format, which we will parse into CSV for data joining. Stock News API has a 14-day free trial with 100 API calls. After that, you can sign up for 20,000 API calls per month for $20.
All sources from which we can retrieve news articles with Stock News API

Getting the data can be done in Python with the requests library. You will need an API key/token from Stock News API to gain access. For example, here is how you get the most recent news article about Tesla, Inc. (TSLA) stock:
```python
import requests

api_key = "YOUR_API_KEY_HERE"
# Stock News API endpoint (base URL not shown here); request the single most
# recent TSLA article using your API token
api_request = f"{api_key}"
data = requests.get(api_request).json()['data']
print(data)
```
The JSON output will look something like this:
```json
[
  {
    "news_url": "",
    "image_url": "",
    "title": "A Huge Fund Bought Tesla, Apple, and Microsoft Stock. Here's What It Sold.",
    "text": "Dutch pension fund PGGM initiated a position in EV-giant Tesla, bought more Apple and Microsoft stock, and sold AT&T stock in the fourth quarter.",
    "source_name": "Barrons",
    "date": "Sun, 14 Feb 2021 07:00:00 -0500",
    "topics": ["paywall"],
    "sentiment": "Neutral",
    "type": "Article",
    "tickers": ["AAPL", "MSFT", "TSLA", "T"]
  }
]
```
For our problem we are interested in the 'title', 'date', 'tickers' and 'sentiment' fields.
Note that the API provides its own sentiment classification, which can be "Positive", "Neutral" or "Negative". To make our problem more tractable, we will filter out articles that are classified as "Neutral". Our assumption here is that polarized articles are the most important for predicting short-term market fluctuations.
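As a minimal sketch of this filtering step (the DataFrame and its rows here are made up; the column names mirror the API's JSON fields):

```python
import pandas as pd

# Hypothetical parsed articles; 'sentiment' mirrors the API's own classification
articles = pd.DataFrame({
    "title": ["Stock soars on earnings", "Quarterly report released", "Shares tumble"],
    "sentiment": ["Positive", "Neutral", "Negative"],
})

# Keep only polarized articles; drop the "Neutral" ones
polarized = articles[articles["sentiment"] != "Neutral"].reset_index(drop=True)
print(polarized["sentiment"].tolist())  # ['Positive', 'Negative']
```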

Data Retrieval

The relevant data we require are news articles from the last week. We filter on the last 7 days by specifying &days=7 in the API call and restrict to news articles with &type=article.
Furthermore, the API limits a single call to a maximum of 50 articles (&items=50). There may be more than 50 news articles for a single stock ticker in a week, so in that case we make additional calls until we either have all articles for the week or reach a page limit that we specify. We set this limit to avoid wasting API calls.

The loop to get all Tesla Inc. (TSLA) news articles will look something like this:
```python
import requests
import pandas as pd

page_cutoff_point = 10  # Maximum number of pages to request per ticker
dfs = []
i = 1
while i <= page_cutoff_point:
    # Stock News API endpoint (base URL not shown here); request page i of
    # last week's TSLA articles, sorted by rank
    api_request = f"{i}sortby=rank&days=7&token={api_key}"
    data = requests.get(api_request).json()['data']
    df = pd.DataFrame(data)
    if df.empty:
        break
    dfs.append(df)
    i += 1
tesla_df = pd.concat(dfs)
```
The final Pandas DataFrame will be obtained by running this loop over all relevant stock tickers and concatenating them.
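The outer loop over tickers can be sketched as follows. `get_ticker_articles` is a hypothetical helper that would wrap the paginated request loop above; here it returns dummy rows so the example runs standalone:

```python
import pandas as pd

def get_ticker_articles(ticker):
    """Stand-in for the paginated Stock News API loop (one DataFrame per ticker)."""
    return pd.DataFrame({"tickers": [[ticker]], "title": [f"News about {ticker}"]})

relevant_tickers = ["TSLA", "AAPL", "MSFT"]  # illustrative subset
all_articles = pd.concat(
    [get_ticker_articles(ticker) for ticker in relevant_tickers],
    ignore_index=True,
)
print(len(all_articles))  # 3
```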


Data Wrangling

To prepare the data for sentiment analysis predictions we go through a series of steps. At a high level the following preprocessing operations are performed:
  1. Remove unnecessary columns ('text', 'news_url', 'image_url', 'topics' and 'source_name').
  2. Filter out all articles with "Neutral" sentiment.
  3. Convert all timestamps to UTC (Coordinated Universal Time) so all rows have a common datetime format.
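The three steps above can be sketched in pandas like this (the raw rows are made up, mirroring the API's JSON fields):

```python
import pandas as pd

# Hypothetical raw rows with the same fields as the API's JSON output
raw = pd.DataFrame({
    "title": ["Up big", "Flat day", "Down hard"],
    "text": ["...", "...", "..."],
    "news_url": ["", "", ""],
    "image_url": ["", "", ""],
    "topics": [[], [], []],
    "source_name": ["A", "B", "C"],
    "sentiment": ["Positive", "Neutral", "Negative"],
    "date": ["Sun, 14 Feb 2021 07:00:00 -0500"] * 3,
})

# 1. Remove unnecessary columns
data = raw.drop(columns=["text", "news_url", "image_url", "topics", "source_name"])
# 2. Filter out neutral articles
data = data[data["sentiment"] != "Neutral"]
# 3. Parse timestamps and convert them to UTC
data["date"] = pd.to_datetime(data["date"], utc=True)
print(list(data.columns), len(data))  # ['title', 'sentiment', 'date'] 2
```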
We would like to have one unique row for each stock ticker containing all news headlines. We do this by aggregating the data for each ticker and each week. Some news articles may refer to multiple stock tickers so we first do a set intersection to find all relevant tickers for Numerai Signals. All 2090 overlapping tickers are already computed and saved as a Pickle file here, but the code below shows how you can compute them yourself. New stocks get listed and delisted all the time so it makes sense to compute the intersecting stock tickers regularly.
relevant\_tickers = stock\_news\_tickers \cap numerai\_tickers
```python
import pickle
import requests
import pandas as pd

# Get all tickers available on stocknewsapi.com
# (tickers endpoint URL not shown here)
ticker_request = f"{api_key}"
json_data = requests.get(ticker_request).json()['data']
stock_news_tickers = []
for row in json_data:
    stock_news_tickers.append(row['ticker'])

# Get all Numerai Signals tickers (ticker map CSV URL not shown here)
ticker_df = pd.read_csv("")
numerai_tickers = set(ticker_df['ticker'])

# Compute intersection
relevant_tickers = list(set(stock_news_tickers).intersection(numerai_tickers))
```

By aggregating all news headlines that refer to a ticker we get a row for each ticker and the headlines separated by a [SEP] token, which we will use to batch our news headline input.
```python
data.loc[:, 'title'] = data['title'] + " [SEP] "
```
Lastly, we merge the new DataFrame on the Numerai tickers to retrieve a Bloomberg ticker format that we need later in the submission step. The snippet below shows the aggregation loop and merging.
```python
# Aggregate news headlines for each ticker
dfs = []
for ticker in relevant_tickers:
    aggregated = data[data['tickers'].apply(lambda x: ticker in x)].resample("W-fri", on='date').sum()
    aggregated = aggregated.drop("tickers", axis=1)
    aggregated['ticker'] = ticker
    aggregated = aggregated.drop_duplicates("ticker", keep='last')
    if aggregated.empty:
        continue
    dfs.append(aggregated)
new_df = pd.concat(dfs)
new_df['title'] = new_df['title'].astype(str)

# Merge on the Numerai tickers to retrieve the Bloomberg ticker format
merged = new_df.merge(ticker_df, on='ticker')
merged = merged.drop("yahoo", axis=1).dropna()
```


Inference (FinBERT)

Finally we are ready to tackle the machine learning part! FinBERT is a Transformer model trained on financial news articles. It was developed as part of a master's thesis by Dogu Araci. Because FinBERT is trained on financial text, it is a good fit for Natural Language Processing (NLP) tasks in the financial domain. FinBERT is a language model based on BERT (Bidirectional Encoder Representations from Transformers), developed by Google in 2018. For an in-depth explanation of Transformer models you can check out my "Transformer Deep Dive" article.
We will use the HuggingFace Transformers library and PyTorch to get sentiment predictions for all headlines. There is a pre-trained FinBERT model available on the HuggingFace website that we will use.

On a high level, the pipeline goes through the following steps:
  1. Batch the news headlines so we don't run out of GPU memory.
```python
def _chunks(lst, n):
    """ Yield n-sized chunks from list. """
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

batch_size = 8
for row in data:
    ticker_headlines = row.split(" [SEP] ")[:-1]
    ...
    for batch in _chunks(ticker_headlines, batch_size):
        ...
```
2. For each batch, tokenize the headlines. input_ids and attention_mask will be the input to the FinBERT model.
```python
import torch
from transformers import AutoTokenizer

# Example headlines
batch = ["The stock will go up in the upcoming week!",
         "Earnings are down for Q3"]

# Prepare tokenized batch
tokenizer = AutoTokenizer.from_pretrained("ipuneetrathore/bert-base-cased-finetuned-finBERT")
encoded = tokenizer(batch,
                    add_special_tokens=True,
                    max_length=256,
                    padding='max_length',
                    return_attention_mask=True,
                    return_tensors='pt',
                    truncation=True)
input_ids = torch.cat([encoded['input_ids']], dim=0).to('cuda')
attention_mask = torch.cat([encoded['attention_mask']], dim=0).to('cuda')
```
3. Get softmax activations from the pre-trained FinBERT model.
```python
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "ipuneetrathore/bert-base-cased-finetuned-finBERT").eval().to('cuda')
model_output = model(input_ids, token_type_ids=None, attention_mask=attention_mask)
logits = model_output[0]
softmax_output = F.softmax(logits, dim=1).cpu().detach().numpy()
```
4. Calculate a sentiment score and take the mean over all predictions for a stock ticker.
score(T) = \frac{1}{n} \sum_{i=1}^{n} \left( \mathrm{positive}_i - \mathrm{negative}_i \right)
where T denotes a specific stock ticker and n the total number of headlines.
For a single sentiment score:
sentiment_score = softmax_output[:, 2] - softmax_output[:, 0]
After we collected all sentiment scores for a given ticker:
mean_score = np.array(sent_scores_ticker).ravel().mean()
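Putting those two fragments together, here is a toy end-to-end score calculation with a dummy softmax output (assuming, as above, that column 0 is the negative class and column 2 the positive class):

```python
import numpy as np

# Dummy softmax activations for 2 headlines: columns = [negative, neutral, positive]
softmax_output = np.array([
    [0.1, 0.2, 0.7],   # strongly positive headline
    [0.6, 0.3, 0.1],   # negative headline
])

# Per-headline score: positive minus negative probability
sent_scores_ticker = softmax_output[:, 2] - softmax_output[:, 0]
# Mean score for the ticker
mean_score = np.array(sent_scores_ticker).ravel().mean()
print(round(mean_score, 3))  # 0.05
```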
5. The sentiment scores will be in the range of [-1...1], but Numerai Signals only accepts predictions in the range [0...1]. We therefore scale sentiment score predictions for all stock tickers to a range of [0 ... 1] using scikit-learn's MinMaxScaler.
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def scale_sentiment(sentiments):
    """ Scale sentiment scores from [-1...1] to [0...1] """
    mm = MinMaxScaler()
    sent_proc = np.array(sentiments).reshape(-1, 1)
    return mm.fit_transform(sent_proc)
```



Submission

Now that we have the predictions, we are almost ready to submit to Numerai Signals. To finalize the DataFrame, a column is added indicating that the predictions are for "live" data (i.e. the upcoming week). We also add a date column denoting the most recent Friday. Lastly, the DataFrame is written to CSV and uploaded using Numerai's API.
```python
import numerapi
from datetime import datetime
from dateutil.relativedelta import relativedelta, FR

# API settings for submitting to Numerai
NMR_PUBLIC_ID = "YOUR PUBLIC KEY"
NMR_SECRET_KEY = "YOUR SECRET KEY"
MODEL_NAME = "YOUR MODEL NAME"
SUB_PATH = "finbert_submission.csv"

# Initialize API with API keys and add data_type column
NAPI = numerapi.SignalsAPI(NMR_PUBLIC_ID, NMR_SECRET_KEY)
final_df.loc[:, "data_type"] = "live"

# Add date column denoting the most recent Friday
friday = int(str((datetime.now() + relativedelta(weekday=FR(-1))).date()).replace("-", ""))
final_df["friday_date"] = friday

# Save final DataFrame to CSV and upload predictions
cols = ["bloomberg_ticker", "friday_date", "data_type", "signal"]
final_df[cols].reset_index(drop=True).to_csv(SUB_PATH, index=False)
model_id = NAPI.get_models()[MODEL_NAME]
NAPI.upload_predictions(SUB_PATH, model_id=model_id)
```
After submitting, you can verify your submission from your model page. It should look like the image below. A submission must contain at least 10 stock tickers, and there are around 5300 international stocks in total for which you can provide weekly submissions. The total may vary per week depending on new stock listings and delistings.
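Before uploading, it can be worth adding a quick sanity check for these constraints. The `final_df` below is a made-up stand-in for the submission DataFrame:

```python
import pandas as pd

# Hypothetical submission frame with 12 tickers and scaled signals
final_df = pd.DataFrame({
    "bloomberg_ticker": [f"TICK{i} US" for i in range(12)],
    "signal": [i / 11 for i in range(12)],
})

# Numerai Signals requires at least 10 tickers and signals in [0, 1]
assert final_df["bloomberg_ticker"].nunique() >= 10, "Need at least 10 tickers"
assert final_df["signal"].between(0, 1).all(), "Signals must be in [0, 1]"
print("Submission passes basic checks")
```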
Example submission verification


Evaluation and staking

Note that the predictions from this pipeline are meant to denote a ranking of all stocks: a score close to 0 means we think the stock will go down in the upcoming week, while a score close to 1 means we believe it will go up. Numerai aggregates predictions from all users in order to manage a stock portfolio. Stocks with a signal score close to 0 will be sold and stocks with a score close to 1 will be bought. Users are judged on the Spearman correlation their predictions have with the realized stock rankings in the upcoming week. Additionally, the uniqueness of the signal is judged by a Meta Model Contribution (MMC) score. Numerai allows users to stake on their predictions using the Numeraire (NMR) cryptocurrency, and users earn or lose NMR based on these metrics.
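To get a feel for the scoring metric, here is a toy Spearman correlation between hypothetical signals and realized next-week returns, computed with pandas (Numerai's actual scoring pipeline differs in detail):

```python
import pandas as pd

# Hypothetical signals and realized next-week returns for 5 stocks
signals = pd.Series([0.9, 0.1, 0.5, 0.7, 0.3])
returns = pd.Series([0.04, -0.02, 0.01, 0.03, -0.01])

# Spearman correlation compares the two rankings, not the raw values
corr = signals.corr(returns, method="spearman")
print(corr)  # 1.0, because the two rankings agree perfectly
```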
For more background and general advice on participating in Numerai, check out my article on Numerai. Also check out the Numerai Signals documentation for more clarification on participating in Numerai Signals.
Prediction examples for Numerai Signals

That's a wrap! I hope this article got you excited to start with Numerai Signals! I also hope you learned some new concepts from this report. If you want, you can check out the resources section below for more introductions to Numerai Signals, data sources, recommended papers and links to the Numerai community.
If you have any questions or feedback, feel free to comment below. You can also contact me on Twitter @carlolepelaars.



Learning Resources

Data Sources

Recommended papers