Getting Started with Numerai Signals: Sentiment Analysis

This report demonstrates how to use Stock News API and FinBERT for the Numerai Signals tournament. Made by Krisha Mehta using Weights & Biases

Kaggle Notebook →

Introduction

In this blog post we will introduce you to the Numerai Signals tournament and show you how to get started using a sentiment analysis example. Specifically, we'll look at news headlines to predict a ranking of stocks based on sentiment scores. For a general introduction to Numerai check out my other blog post here.

What is Numerai Signals?

Numerai is a crowdsourced AI hedge fund that operates on predictions made by data scientists worldwide. The classic Numerai tournament provides you with anonymized data that can be modeled using machine learning. Numerai Signals generalizes this idea and allows you to use any data you want. This allows data scientists to get creative with data sources as well as modeling techniques.

To get an idea of the vision behind Numerai Signals, check out this short video:
Now then: let's dive into the full pipeline for making a submission on Numerai Signals using sentiment analysis!
The full code is also implemented in a Kaggle Notebook. We will start with the most important part of any data science pipeline: getting good data! For this we will use a clean API (Application Programming Interface) for stock news articles.

Full Code →

Stock News API (stocknewsapi.com)

Stock News API (stocknewsapi.com) is an online service for easily retrieving news articles for a broad range of stocks. It collects articles from around 10 different news sources and provides additional information for easy data parsing. It allows us to get the most recent articles for every week in a clean JSON format, which we will parse into CSV for data joining. Stock News API has a 14-day free trial with 100 API calls. After that, you can sign up for 20,000 API calls per month for $20.
Getting the data can be done via Python and the requests library. You will need an API key/token from Stock News API to gain access. For example, here is how you get the most recent news article about Tesla, Inc. (TSLA) stock:
import requests

api_key = "YOUR_API_KEY_HERE"
api_request = f"https://stocknewsapi.com/api/v1?tickers=TSLA&items=1&token={api_key}"
data = requests.get(api_request).json()['data']
print(data)
The JSON output will look something like this:
[
  {
    "news_url": "https://www.barrons.com/articles/fund-bought-tesla-apple-microsoft-stock-sold-att-51613063598",
    "image_url": "https://cdn.snapi.dev/images/v1/a/7/im-298821.jpg",
    "title": "A Huge Fund Bought Tesla, Apple, and Microsoft Stock. Here's What It Sold.",
    "text": "Dutch pension fund PGGM initiated a position in EV-giant Tesla, bought more Apple and Microsoft stock, and sold AT&T stock in the fourth quarter.",
    "source_name": "Barrons",
    "date": "Sun, 14 Feb 2021 07:00:00 -0500",
    "topics": ["paywall"],
    "sentiment": "Neutral",
    "type": "Article",
    "tickers": ["AAPL", "MSFT", "TSLA", "T"]
  }
]
For our problem we are interested in the 'title', 'date', 'sentiment' and 'tickers' fields.
Note that the API provides its own sentiment classification, which can be "Positive", "Neutral" or "Negative". To make our problem more tractable, we will filter out articles that are classified as "Neutral". Our assumption here is that polarized articles are the most important for predicting short-term market fluctuations.
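As a quick illustration, here is how you could load the response into a DataFrame containing just those fields (a minimal sketch; data is the list of article dicts returned by the API call above):

import pandas as pd

# Keep only the fields relevant for the sentiment pipeline
fields = ['title', 'date', 'sentiment', 'tickers']
df = pd.DataFrame(data)[fields]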

Data Retrieval

The relevant data we require is the news articles from the last week. We filter on the last 7 days by specifying &days=7 in the API call and restrict the results to articles by specifying &type=article.
Furthermore, the API limits a single call to a maximum of 50 articles (&items=50). There may be more than 50 news articles for a single stock ticker in a week, so in that case we make additional calls until we either have all articles for the week or reach a page limit that we specify. We set this limit to avoid wasting API calls.

The loop to get all Tesla, Inc. (TSLA) news articles will look something like this:
import requests
import pandas as pd

page_cutoff_point = 10
dfs = []
i = 1
while i <= page_cutoff_point:
    api_request = f"https://stocknewsapi.com/api/v1?tickers=TSLA&items=50&type=article&page={i}&sortby=rank&days=7&token={api_key}"
    data = requests.get(api_request).json()['data']
    df = pd.DataFrame(data)
    if df.empty:
        break
    dfs.append(df)
    i += 1
tesla_df = pd.concat(dfs)
The final Pandas DataFrame can be obtained by running this loop over all relevant stock tickers and concatenating them.

Full Code →

Data Wrangling

To prepare the data for sentiment analysis predictions we go through a series of steps. At a high level, the following preprocessing operations are performed (a minimal code sketch follows the list):
  1. Remove unnecessary columns ('text', 'news_url', 'image_url', 'topics' and 'source_name')
  2. Filter out all articles with "Neutral" sentiment.
  3. Convert all timestamps to UTC (Coordinated Universal Time) so all rows share a common datetime format.
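Here is a minimal sketch of these three steps, assuming the raw articles are collected in a DataFrame called data:

import pandas as pd

# 1. Remove columns we don't need for the sentiment pipeline
data = data.drop(columns=['text', 'news_url', 'image_url', 'topics', 'source_name'])
# 2. Filter out articles labeled "Neutral"
data = data[data['sentiment'] != 'Neutral']
# 3. Parse the timestamps and convert them to UTC
data['date'] = pd.to_datetime(data['date'], utc=True)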
We would like to have one unique row for each stock ticker containing all news headlines. We do this by aggregating the data for each ticker and each week. Some news articles may refer to multiple stock tickers so we first do a set intersection to find all relevant tickers for Numerai Signals. All 2090 overlapping tickers are already computed and saved as a pickle file here, but the code below shows how you can compute them yourself. New stocks get listed and delisted all the time so it makes sense to compute the intersecting stock tickers regularly.
\text{relevant\_tickers} = \text{stock\_news\_tickers} \cap \text{numerai\_tickers}
import pickle
import requests
import pandas as pd

# Get all tickers available in stocknewsapi.com
ticker_request = f"https://stocknewsapi.com/api/v1/account/tickersdbv2?token={api_key}"
json_data = requests.get(ticker_request).json()['data']
stock_news_tickers = []
for row in json_data:
    stock_news_tickers.append(row['ticker'])

# Get all Numerai Signals tickers
ticker_df = pd.read_csv("https://numerai-signals-public-data.s3-us-west-2.amazonaws.com/signals_ticker_map_w_bbg.csv")
numerai_tickers = set(ticker_df['ticker'])

# Compute intersection
relevant_tickers = list(set(stock_news_tickers).intersection(numerai_tickers))

By aggregating all news headlines that refer to a ticker, we get one row per ticker with its headlines separated by a [SEP] token, which we will use to batch our news headline input.
data.loc[:, 'title'] = data['title'] + " [SEP] "
Lastly, we merge the new DataFrame with the Numerai ticker map to retrieve the Bloomberg ticker format that we need later in the submission step.
# Aggregate news headlines for each ticker
dfs = []
for ticker in relevant_tickers:
    aggregated = data[data['tickers'].apply(lambda x: ticker in x)].resample("W-FRI", on='date').sum()
    aggregated = aggregated.drop("tickers", axis=1)
    aggregated['ticker'] = ticker
    aggregated = aggregated.drop_duplicates("ticker", keep='last')
    if aggregated.empty:
        continue
    dfs.append(aggregated)
new_df = pd.concat(dfs)
new_df['title'] = new_df['title'].astype(str)

# Merge with the Numerai ticker map to get Bloomberg tickers
merged = new_df.merge(ticker_df, on='ticker')
merged = merged.drop("yahoo", axis=1).dropna()

Full Code →

Inference (FinBERT)

And now we're finally ready to tackle the machine learning part!
FinBERT is a Transformer model developed as part of a master's thesis by Dogu Araci and fine-tuned on financial news articles. Because FinBERT is trained on financial news, it is a good model for tackling Natural Language Processing (NLP) tasks in the financial domain. FinBERT is a language model based on the BERT (Bidirectional Encoder Representations from Transformers) model developed by Google in 2018. For an in-depth explanation of Transformer models you can check out my "Transformer Deep Dive" article.
We will use the HuggingFace Transformers library and PyTorch to get sentiment predictions for all headlines. Specifically, we'll use a pre-trained FinBERT model available on the HuggingFace website.

On a high level, the pipeline goes through the following steps:
  1. Batch the news headlines so we don't run out of GPU memory.
def _chunks(lst, n):
    """ Yield n-sized chunks from list. """
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

batch_size = 8
for row in data:
    ticker_headlines = row.split(" [SEP] ")[:-1]
    ...
    for batch in _chunks(ticker_headlines, batch_size):
        ...
2. For each batch, tokenize the headlines. The resulting input_ids and attention_mask will be the input to the FinBERT model.
import torch
from transformers import AutoTokenizer

batch = ["The stock will go up in the upcoming week!", "Earnings are down for Q3"]
tokenizer = AutoTokenizer.from_pretrained("ipuneetrathore/bert-base-cased-finetuned-finBERT")
encoded = tokenizer(batch,
                    add_special_tokens=True,
                    max_length=200,
                    padding='max_length',
                    return_attention_mask=True,
                    return_tensors='pt',
                    truncation=True)
input_ids = torch.cat([encoded['input_ids']], dim=0).to('cuda')
attention_mask = torch.cat([encoded['attention_mask']], dim=0).to('cuda')
3. Get softmax activations from the pre-trained FinBERT model.
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "ipuneetrathore/bert-base-cased-finetuned-finBERT"
).eval().to('cuda')
model_output = model(input_ids, token_type_ids=None, attention_mask=attention_mask)
logits = model_output[0]
softmax_output = F.softmax(logits, dim=1).cpu().detach().numpy()
4. Calculate a sentiment score and take the mean for all predictions on a stock ticker.
score(T) = \frac{1}{n} \sum_{i=1}^{n} (\text{positive}_i - \text{negative}_i)
where T denotes a specific stock ticker and n the total number of headlines for that ticker.
For a single sentiment score:
# Column 2 holds the positive class probability, column 0 the negative
sentiment_score = softmax_output[:, 2] - softmax_output[:, 0]
After we have collected all sentiment scores for a given ticker:
import numpy as np

mean_score = np.array(sent_scores_ticker).ravel().mean()
5. The sentiment scores will be in the range [-1...1], but Numerai Signals only accepts predictions in the range [0...1]. We therefore scale the sentiment score predictions for all stock tickers to [0...1] using scikit-learn's MinMaxScaler.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def scale_sentiment(sentiments):
    """ Scale sentiment scores from [-1...1] to [0...1] """
    mm = MinMaxScaler()
    sent_proc = np.array(sentiments).reshape(-1, 1)
    return mm.fit_transform(sent_proc)
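A hypothetical usage, assuming the per-ticker mean scores are stored in a sentiment_score column (this column name is our own illustration, not part of the original pipeline):

# Hypothetical: turn the mean per-ticker scores into the final signal column
final_df['signal'] = scale_sentiment(final_df['sentiment_score']).ravel()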

Full Code →

Submission

Now that we have the predictions, we are almost ready to submit to Numerai Signals! To finalize the DataFrame, a column is added indicating that the predictions are for "live" data (i.e., the upcoming round). We also add a date column denoting the most recent Friday. Lastly, the DataFrame is written to CSV and uploaded using Numerai's API.
import numerapi
from datetime import datetime
from dateutil.relativedelta import relativedelta, FR

# API settings for submitting to Numerai
NMR_PUBLIC_ID = "YOUR PUBLIC KEY"
NMR_SECRET_KEY = "YOUR SECRET KEY"
MODEL_NAME = "YOUR MODEL NAME"
SUB_PATH = "finbert_submission.csv"

# Initialize API with API keys and add data_type column
NAPI = numerapi.SignalsAPI(NMR_PUBLIC_ID, NMR_SECRET_KEY)
final_df.loc[:, "data_type"] = "live"

# Add date column denoting the most recent Friday (format: YYYYMMDD)
friday = int(str((datetime.now() + relativedelta(weekday=FR(-1))).date()).replace("-", ""))
final_df["friday_date"] = friday

# Save final DataFrame to CSV and upload predictions
cols = ["bloomberg_ticker", "friday_date", "data_type", "signal"]
final_df[cols].reset_index(drop=True).to_csv(SUB_PATH, index=False)
model_id = NAPI.get_models()[MODEL_NAME]
NAPI.upload_predictions(SUB_PATH, model_id=model_id)
After submitting, you can verify your submission on your model page at signals.numer.ai/tournament. It should look like the image below. A submission must cover a minimum of 10 stock tickers, and there are around 5300 international stocks in total for which you can provide weekly predictions. The total may vary per week depending on new stock listings and delistings.
Example submission verification on signals.numer.ai/tournament

Full Code →

Evaluation and Staking

Note that the predictions from this pipeline are meant to denote a ranking of all the stocks we specify. A score close to 0 means we think the stock will go down in the upcoming week; a score close to 1 means we believe it will go up. Numerai aggregates the predictions from all users in order to build a stock portfolio. Stocks with a signal score close to 0 will be sold and stocks with a score close to 1 will be bought. Users are judged on the Spearman correlation their predictions achieve in the upcoming week. Additionally, the uniqueness of the signal is judged by a Meta Model Contribution (MMC) score. Numerai allows users to stake on their predictions using the Numeraire (NMR) cryptocurrency, and users earn or lose NMR based on these metrics.
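To get a feel for this metric, here is a small illustration of a Spearman correlation computed with scipy on made-up predictions and weekly returns (toy values, not real data):

import numpy as np
from scipy.stats import spearmanr

# Hypothetical scaled predictions for five tickers and made-up realized returns
predictions = np.array([0.9, 0.1, 0.6, 0.3, 0.8])
realized_returns = np.array([0.04, -0.02, 0.01, -0.01, 0.03])

# Spearman correlation compares rankings rather than raw values
corr, _ = spearmanr(predictions, realized_returns)
print(f"Spearman correlation: {corr:.2f}")  # 1.00 for this perfectly ranked toy example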
For more background and general advice on participating in Numerai, check out my article on Numerai. Also check out the Numerai Signals documentation for more clarification on participating in Numerai Signals.
Prediction examples for Numerai Signals

That's a wrap! I hope this article got you excited to start with Numerai Signals! I also hope you learned some new concepts from this report. If you want, you can check out the resources section below for more introductions to Numerai Signals, data sources, recommended papers and links to the Numerai community.
If you have any questions or feedback, feel free to comment below. You can also contact me on Twitter @carlolepelaars.

Full Code →

Resources

Learning Resources

Data Sources

Recommended papers

Community