
Topic Diversity over Number of Topics Stopwords Included/Excluded

Created on May 9|Last edited on May 9

Stopword Experiments



After the original number-of-topics experiment, it was clear that some questions were grouped into the same topic due to superficial syntactic rather than semantic similarity. However, it is the semantic meaning that matters for ensuring truly related questions are placed in the same topic, though syntax provides a valuable clue, since questions from the same topic are often phrased similarly.

Below is the graph of a topic model with 10 topics, with all stopwords still included. Topic 0's highest-scoring terms are "Tick", "apply", "all", "that" and "what". Which CLOSER level-2 topic could that refer to? Clearly such a topic is low quality, as it does not consist of topic-specific words. Topic 7 is also undesirable, as it has "and" as a top-scoring word.

A quick solution is to remove common stopwords from the dataset before topic modelling is applied. This removes common words such as "and", "what" and "how"; however, more questionnaire-specific stopwords such as "tick" and "apply" will remain. Below is a sample of some of the questions from the stopwords-removed data.
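As a rough sketch of this preprocessing step (assuming scikit-learn's built-in English stopword list, which may differ from the exact list used in the experiment, and hypothetical example questions):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def remove_stopwords(question: str) -> str:
    """Drop common English stopwords, keeping word order and casing intact."""
    kept = [
        w for w in question.split()
        if w.strip(".,?:!").lower() not in ENGLISH_STOP_WORDS
    ]
    return " ".join(kept)


# Hypothetical questionnaire items, not taken from the actual dataset.
questions = [
    "Tick all that apply: what symptoms do you have?",
    "How often do you exercise?",
]
cleaned = [remove_stopwords(q) for q in questions]
print(cleaned)
```

Note how generic words like "all", "that", "what" and "how" are removed, while questionnaire-specific stopwords such as "Tick" and "apply" survive, since they are not on a general-purpose stopword list.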


However, BERTopic uses sentence-embedding models that were trained on clean, real text. This means the quality of the sentence embeddings may actually decrease, as the cleaned sentences are no longer real English sentences and have therefore lost some valuable syntactic information. On the other hand, it prevents topics from being built around stopwords with little semantic value.

Below is an interactive graph of the top-scoring terms with stopwords removed (left) vs. stopwords included (right) for the first 8 topics of the 10-topic models. Topic 4 shows that although topics are no longer built around common stopwords, questionnaire-specific stopwords such as "tick" and "apply" must still be removed in future experiments.
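For reference, the topic diversity mentioned in the report's title is commonly computed as the fraction of unique words among the top-k words of all topics. A sketch of that metric, using made-up top-term lists for two stopword-dominated topics:

```python
def topic_diversity(topics: list[list[str]], top_k: int = 25) -> float:
    """Fraction of unique words among the top-k words of every topic.

    1.0 means no topic shares a top word with another; values near 0
    mean the topics are highly redundant.
    """
    top_words = [w for topic in topics for w in topic[:top_k]]
    return len(set(top_words)) / len(top_words)


# Hypothetical top-5 term lists, illustrating how stopword-heavy topics
# overlap and drag diversity down:
topics = [
    ["tick", "apply", "all", "that", "what"],
    ["and", "what", "how", "all", "that"],
]
print(topic_diversity(topics, top_k=5))  # 7 unique words out of 10 -> 0.7
```

This is why topics built around stopwords hurt the diversity curve: the same function words appear in the top terms of many topics at once.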
