arXiv Search: Generating Tags from Paper Titles

Sayak Paul

Inspiration & Ideation

arXiv is a repository of over a million openly accessible papers in fields such as physics, mathematics, and computer science. For machine learning practitioners, researchers, and data scientists (and everyone else working in machine learning and data science), arXiv is a regular dose of research, applications, and theoretical know-how.

The pace at which the field of machine learning is evolving is just unreal. New papers get released every day, and it is really challenging to keep up. Moreover, each of the primary arXiv categories has many sub-categories. For example, the Computer Science category (cs) has sub-categories like cs.LG (Machine Learning), cs.CV (Computer Vision and Pattern Recognition), cs.AI (Artificial Intelligence), cs.CL (Computation and Language) and so on. These sub-categories are also called tags.

As mentioned above, a huge number of papers get submitted to arXiv on a daily basis. To maintain a smooth and consistent user experience, it is important to categorize the papers accurately. Because the field of machine learning is so vast, it can be confusing for an author to decide on these categories when submitting a paper. In situations like this, an automatic tagger that identifies the right tags from the title of a paper can help.

Being a new entrant to the field of Natural Language Processing (NLP), I took this idea and built a cool project around it. In this article, I will walk you through some of the critical aspects of the project and how Weights and Biases helped me keep track of my experiments. Let’s get started!

The source code of the project is available here

Thinking about the problem statement

Formulating a problem statement for a machine learning project is a serious business and it requires a good amount of brainstorming. I am not going to get into its details but if you are interested, be sure to check out this crash course on Problem Framing by Google. To get started with the project, I needed to start with a definite problem statement. It’s safe enough to say that data plays a very crucial role in that. So, I first started to look for datasets that capture solid information on arXiv papers and luckily, I found one on Kaggle containing a bunch of information on arXiv papers from 1992 to 2017. 

After quickly glancing through the dataset, I could frame a minimal problem statement -

Given the title of an arXiv paper, can we generate its tag(s)?

Of course, I first saw what features the above-mentioned dataset contains and among them the following two features helped to frame the problem statement I just mentioned:

A paper can have multiple tags. For example, the seminal paper on Transformers, Attention is All You Need, has two tags: cs.CL and cs.LG. Keeping this in mind, our problem statement can be modeled as a multi-label text classification problem.

There were other features as well (in total there were 9):

Preparing the Dataset

The dataset comes in a JSON format. When I loaded it in using pandas it looked something like so:

As you can see, this format does not really help if I want to get any insights from the dataset. After some wrangling, I could get the dataset down to the following form:

Just like the way it’s needed for our problem statement! 

After exploring further, I learned that there is a total of 41,000 paper titles and 2,606 unique tags (some of them are single and some are multi) present in the wrangled dataset. When I start with a project, I prefer to start simple, so I chose to keep the dataset to the top 100 tags. After filtering the results, I arrived at the following distribution of the top 10 tags:
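The filtering step described above can be sketched roughly as follows. This is not the project's actual code: the function name `filter_top_tags`, the `(title, tags)` record layout, and the toy records are assumptions made for illustration, and each distinct tag combination is treated as one "tag" when counting.

```python
from collections import Counter


def filter_top_tags(records, top_n):
    """Keep records whose tag combination is among the top_n most frequent.

    `records` is a list of (title, tags) pairs, where `tags` is a tuple of
    arXiv tags such as ("cs.CL", "cs.LG").
    """
    counts = Counter(tags for _, tags in records)
    top = {tags for tags, _ in counts.most_common(top_n)}
    return [(title, tags) for title, tags in records if tags in top]


# Toy records, made up for illustration (not taken from the Kaggle dataset).
records = [
    ("Title A", ("cs.LG",)),
    ("Title B", ("cs.LG",)),
    ("Title C", ("cs.CL", "cs.LG")),
    ("Title D", ("cs.CL", "cs.LG")),
    ("Title E", ("cs.CV",)),
]

# Keep only the two most frequent tag combinations; "Title E" is dropped.
filtered = filter_top_tags(records, top_n=2)
```

In the project itself the cut-off was 100 rather than 2, and the same idea could be expressed with pandas `value_counts()` on the tag column.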

As you can see, this is a clear sign of class imbalance (more on this later). An unexpected aspect of the project was the issue of duplicate entries, which I also had to deal with.

After I learned about the types of papers, I was interested in learning if the length of a paper’s title had any impact on its tags; Pearson’s correlation, of course, was zero. Before we proceed to the next section, I also want to show you the following plot which shows the distribution of the title lengths (isn’t that a sleek distribution?):

Addressing the major data issues

First, the filtered dataset was highly imbalanced. I tackled this in the following way:

The next issue occurred when I tried to binarize the labels. Take a look at the first ten labels:

Pay close attention to the double-quotes: each label is a string that only looks like a list. Because of this, the MultiLabelBinarizer class of scikit-learn did not binarize the labels properly. It took me some time to realize that, and finally I came across this thread on StackOverflow which helped me solve the issue (spoiler: I had to use literal_eval). After I got past this issue, I was able to properly represent the tags in terms of 0 and 1 (below are the multi-label binarized versions of the first ten labels):
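Here is a minimal sketch of the literal_eval fix, using made-up labels rather than the real dataset. MultiLabelBinarizer iterates over each sample, so a label stored as a string gets split into individual characters; parsing the string into a real list first avoids that.

```python
from ast import literal_eval

from sklearn.preprocessing import MultiLabelBinarizer

# Labels as they might look after loading: each one is a string that merely
# looks like a list (toy values, for illustration only).
raw_labels = ["['cs.CL', 'cs.LG']", "['cs.CV']", "['cs.LG', 'stat.ML']"]

# literal_eval safely parses each string into a real Python list.
parsed_labels = [literal_eval(label) for label in raw_labels]

# One column per distinct tag, with 1 where the paper carries that tag.
mlb = MultiLabelBinarizer()
binarized = mlb.fit_transform(parsed_labels)

print(mlb.classes_)
print(binarized)
```

The column order follows `mlb.classes_`, and `mlb.inverse_transform` maps rows of 0s and 1s back to tag tuples.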

For the paper titles, I used standard data preprocessing steps:

At this point, I was ready to do some machine learning.

Experimentation with machine learning models: Simple to complex

In the case of my machine learning experiments, I also like to start off simple and then gradually ramp up the complexity. This allows me to analyze the behavior of individual models and how their prediction dynamics are varying. This approach also helps with my choice about the architecture of a model, which depends on a number of different aspects like:

As some of you may already know, picking the best model architecture also requires a lot of experimentation. The challenge is that you may miss out on a good model if you do not keep track of your experiments properly. I delegate this responsibility to Weights and Biases. It plays well with a number of popular Python machine learning frameworks like TensorFlow, PyTorch, Scikit-Learn, XGBoost, and so on. Moreover, it requires only a few lines of code to set things up so that Weights and Biases can start tracking your experiments.

For this project, I started with simple models like Naive Bayes, Logistic Regression and so on. I used the ever-trusty Scikit-Learn for that. The two most important charts from those experiments are available here, and you can see them in the following figure:
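For readers who want a feel for such a baseline, here is a rough sketch of a multi-label Logistic Regression in scikit-learn. It is an assumption about how a simple baseline could be set up, not the project's actual pipeline: the TF-IDF features, the one-vs-rest wrapper, and the toy titles and tags are all chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy titles and tags, made up for illustration (not from the Kaggle dataset).
titles = [
    "Neural machine translation with attention",
    "Image classification with convolutional networks",
    "Language models for text generation",
    "Object detection in street images",
    "Attention for neural text translation",
    "Convolutional networks for image segmentation",
]
tags = [
    ["cs.CL", "cs.LG"],
    ["cs.CV"],
    ["cs.CL"],
    ["cs.CV"],
    ["cs.CL", "cs.LG"],
    ["cs.CV", "cs.LG"],
]

# Turn the tag lists into a 0/1 indicator matrix, one column per tag.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)

# One binary Logistic Regression per tag, on top of TF-IDF title features.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(titles, y)

# Predict an indicator row for a new title and map it back to tag names.
predictions = model.predict(["Attention for image classification"])
predicted_tags = mlb.inverse_transform(predictions)
```

Swapping `LogisticRegression` for `MultinomialNB` would give a Naive Bayes variant of the same setup.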

Weights and Biases also provided me with a comprehensive report which makes it easy to share my findings across teams:

As you can see the model from the fourth run yielded the best performance here. Now was the perfect time to bring some deep learning to the table. I tried the following three architectures with trainable embeddings using TensorFlow 2.0:

You can find a summary of the experiments I conducted with them in the charts below:

As usual, it came with a table too:

The visualizations make it easier for me to compare different models and pick the best one. This is particularly helpful when the number of experiments keeps increasing, which almost always happens in a real-world scenario. 

After finding out the most performant model (CNN) from the set of experiments, I used it to make predictions on some recent paper titles. Here is a snap:

Not bad! 
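As a rough illustration of a CNN with trainable embeddings for multi-label prediction, consider the sketch below. The layer sizes, vocabulary size, and sequence length are placeholder assumptions, not the values used in the project, and random token ids stand in for tokenized titles.

```python
import numpy as np
import tensorflow as tf

# All sizes below are placeholder values, not the ones used in the project.
VOCAB_SIZE = 1000  # number of distinct token ids
MAX_LEN = 20       # padded title length, in tokens
NUM_TAGS = 100     # one output unit per tag

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),         # trainable embeddings
    tf.keras.layers.Conv1D(64, 5, activation="relu"),  # n-gram style filters
    tf.keras.layers.GlobalMaxPooling1D(),
    # Sigmoid (not softmax) so that each tag gets an independent probability.
    tf.keras.layers.Dense(NUM_TAGS, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Random token ids stand in for tokenized, padded paper titles.
dummy_titles = np.random.randint(0, VOCAB_SIZE, size=(4, MAX_LEN))
probabilities = model.predict(dummy_titles, verbose=0)

# Threshold the per-tag probabilities to obtain a set of tags per title.
predicted = probabilities >= 0.5
```

The sigmoid output with a binary cross-entropy loss is what makes this a multi-label model: a title can switch on several tags at once.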

Weights and Biases gives you the flexibility to log any other information you might want to add to your dashboard. In this case, I wanted to see how a model was doing on some sample validation data as it was training, and I wanted to log that in a nice format on my Weights and Biases dashboard.

I was able to do so by writing a few lines of code:

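import tensorflow as tf
import wandb

# A custom callback to view predictions on the above samples in real-time.
# `sample_paper_titles` (a dict mapping title -> true label) and
# `generate_predictions` are assumed to be defined earlier.
class TextLogger(tf.keras.callbacks.Callback):
    def __init__(self):
        super(TextLogger, self).__init__()

    # Keras calls this hook with the epoch index first, then the logs dict
    def on_epoch_end(self, epoch, logs=None):
        samples = []
        for (title, true_label) in sample_paper_titles.items():
            predicted_label = generate_predictions(self.model, title)
            sample = [title, predicted_label, true_label]
            # Collect the row so that the logged table is not empty
            samples.append(sample)
        wandb.log({"text": wandb.Table(data=samples,
                                       columns=["Text", "Predicted Label", "True Label"])})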

I really liked what I could achieve with the above chunk:

As you can see, I am able to change the step number and see how my model’s predictions vary on the provided samples. You can check out this run page to see the predictions from different steps. This is very valuable to have in your practical deep learning toolbox.

The GRU and LSTM-based models were taking a significant amount of time to train (not days, though). So, I decided to run them using Cloud TPUs and I was kind of blown away by the speedup. You can find the corresponding notebook here.

Further undertakings

I am sure that I did not train the deep learning models hard enough. I am definitely going to train the CNN-based model and the LSTM-based model for more epochs (the current number of epochs is 10). There are a bunch of other hyperparameters I did not experiment with - embedding dimension, maximum number of words allowed in the tokenizer, maximum length of a sequence and so on. I would also like to add the paper abstracts to the dataset to see if the predictions improve. Of course, this is gonna come at a cost - increased model complexity, increased computation cost and so on.

Hyperparameter sweeps make it a whole lot easier to properly manage the heavy load of hyperparameter tuning experiments. I am definitely going to use them in the next iteration of this project.
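To give an idea of what such a sweep could look like, here is a sketch of a Weights and Biases sweep configuration over hyperparameters like the ones mentioned above. The parameter names, the candidate values, and the `val_loss` metric name are hypothetical; they would need to match whatever the training script actually reads and logs.

```python
# A hypothetical sweep configuration; names and values are placeholders.
sweep_config = {
    "method": "random",  # other search strategies: "grid", "bayes"
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "embedding_dim": {"values": [32, 64, 128]},
        "max_words": {"values": [5000, 10000, 20000]},
        "max_sequence_length": {"values": [20, 30, 40]},
        "epochs": {"values": [10, 20, 30]},
    },
}

# With a logged-in W&B account, the sweep could then be started roughly so,
# where `train` is a user-defined training function and the project name
# is a placeholder:
#   import wandb
#   sweep_id = wandb.sweep(sweep_config, project="my-project")
#   wandb.agent(sweep_id, function=train)
```

Random search is used here only as an example; which strategy fits best depends on the compute budget and the size of the search space.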

I am also excited to try out Hugging Face’s BERT models for improving the classification performance. My ultimate plan is to expose the final model as a REST API so that it can be used by other developers. 

References and acknowledgments

Following are some of the references which helped me a lot in reaching this stage of the project:

I would also like to thank the ones who provided me with tremendous support:
