Transformers, tokenizers and the in-domain problem
What happens when generally trained tokenizers lack knowledge of domain specific vocabulary? How much of a problem is this for models like BERT?
The Ford Model-T revolutionized the car industry in the same way that Deep Learning (DL) Transformer models like BERT are revolutionizing Natural Language Processing (NLP), computer vision, and other fields of Artificial Intelligence (AI).

Stable Diffusion 1.5 Prompt: an optimus prime transformer standing beside a model t car, realistic, color
The Model-T, however, was far from perfect. It didn't have modern service brakes (it had transmission brakes, which stopped the rear wheels from turning), it had no seat belts or speedometer, it needed a potentially dangerous hand crank to start, and it had no turn signals. But it was mass-produced, reasonably priced, and durable, and compared with its nearest rival (which was a horse), it was a quantum leap forward.
Transformers, one could claim, are as revolutionary a technology in the field of AI as the Model T was in transportation. Models like BERT are so far ahead of what was used in the recent past, from TF-IDF, bag of words approaches, through to the early neural network models like Word2Vec, that we are still only beginning to realize their ultimate potential.
The other paradigm shift with these models is that they are mostly available fully pre-trained and free to use, which makes them, like the Model-T, more widely available to businesses that don't have the resources required to train these models.
Another similarity to the Model-T is that Transformers have their shortcomings. At the moment, these may not be obvious, but there is a potential future where we look back on the early Transformer models as we do at the Model-T and wonder in amazement how it didn't have the equivalent of a turn signal or a fuel pump (the lack of which meant you needed to drive uphill in reverse if you were low on fuel).
But what are the missing features for a model like BERT?
One potential shortcoming of current Transformer models is their inability to adjust to new, domain-specific vocabularies. If your business has a clientele of a certain age, you may be doing a lot of “YOLOing” and “FOMOing”, (or whatever the latest hipster acronym happens to be; as a test of your hipster credentials do you know what TFW means?) when communicating with your customers. Poor old BERT might not be able to keep up with trending terms.
More seriously, though, medical and legal domains have such unique and important in-domain vocabularies that these businesses have been forced to train their own models from scratch. More on that later.
For now, let’s just focus on the problem: do Transformer models like BERT have an in-domain problem? To answer this, we will look at:
- Why Do Transformer Models Need Tokenizers?
- Paying attention to what matters
- Transformers to the rescue
- Tokenization in practice
- Some examples of poor tokenization
- Is there a cost to poor tokenization?
- Is attention all you need to compensate for poor tokenization?
- Comparing context and tokenization
- Can we fix shortcomings due to poor tokenization?
- Fine-tuning
- Extending Avocado's: the latest approaches to the in-domain problem
- Conclusion
Hopefully, after all of this, you will be able to decide whether the in-domain problem is something you need to worry about now or whether, like the early Model-T drivers, you can keep going for a while without your speedometer or turn signal, hand-cranking your way down the road until these nice-to-have features become necessary to your everyday driving experience.
Why Do Transformer Models Need Tokenizers?
What do Transformer models get from tokenizers that makes them indispensable? A number of things.
Paying attention to what matters
One of the reasons for the success of models like BERT is that the Transformer architecture enables them to learn context from how a word is used in a sequence of text. It does this by identifying which words the model should pay attention to, as it is parsing a sequence of text.
For example, let’s look at a headline from last year that discussed the escalating tensions between Russia and America concerning the invasion of Ukraine:
“The United States and Russia sought to lower the temperature in a heated standoff over Ukraine, even as they reported no breakthroughs in high-stakes talks on Friday aimed at preventing a feared Russian invasion”
The meaning of this sentence may be clear to you but imagine reading this as an algorithm and trying to understand what this means. Think of all the possible different connotations of some of the terms being used:

Looking more closely, we can see that some phrases could have different meanings depending on the context in which they were used. Source: Author
Take the word “temperature” in this sentence. We know it refers to diplomatic efforts to de-escalate a potentially volatile international military incident. We know that due to the context in which it is used. But this exact phrase, “lower the temperature,” could be used in sequences about our body temperature, for example:
“These are all ways to release heat and therefore lower the temperature of the body”
The above sentence is from an article about body temperature, which contains the exact same phrase as our example about a potential international military incident. The point here is that a model with a dictionary-like system storing only one "meaning" for something like temperature would miss some of the nuances of the different use cases here.
Models like Word2Vec work in this way, storing one version of each word or token and then using this meaning statically in each use case. This static approach can work well and, for a long time, was one of the best approaches to identifying semantically similar words and phrases. However, we can easily see the limits of that approach with our examples above, where a simpler model would struggle to differentiate sentences that contained overlapping identical phrases whose meaning depended on context.
Transformers to the rescue
This is where Transformer models like BERT start to shine. Wait ... Shine??? Shine a car? A window? A flashlight? Hard to know what it refers to unless we know the current context, right? Right-handed? Turn right? You get where I'm going.
Transformer models do not store one static meaning for each word; instead, they infer the meaning from the context of the input sequence. They can pay attention to different parts of the input text sequence and thus identify that "lower the temperature" was related to diplomatic discussions rather than body sweat or cooking in this particular use case.
The difference here is subtle and might only change the meaning of the words slightly, but it is enough to enable a range of Transformer models to dominate leaderboards in most NLP tasks, such as language inference or semantic similarity.
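To make this concrete, here is a minimal sketch (assuming the bert-base-uncased checkpoint used throughout this post) that extracts the contextual embedding BERT produces for "temperature" in two different sentences. If the model stored a single static vector per word, the two embeddings would be identical; because BERT is contextual, they differ:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def word_embedding(sentence, word):
    # Return the last-hidden-state vector for the first occurrence of `word` in `sentence`.
    encoded = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state[0]
    position = (encoded["input_ids"][0] == tokenizer.convert_tokens_to_ids(word)).nonzero()[0].item()
    return hidden[position]

diplomatic = word_embedding(
    "The United States and Russia sought to lower the temperature in a heated standoff", "temperature")
medical = word_embedding(
    "These are all ways to release heat and therefore lower the temperature of the body", "temperature")

# Same word, two different contextual vectors: the similarity is below 1.0.
print(torch.cosine_similarity(diplomatic, medical, dim=0).item())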
An implicit assumption here is that models like BERT need to be able to identify and learn the context from different word usage. Imagine, as a thought experiment, the simple case where a Transformer model has an infinite vocabulary. It contains every single word ever used, along with all the entities and names of everything that have ever existed.
Then each time the model parses a new sentence, it can identify the word and begin to understand that context better as it parses more examples of similar use cases. So when it sees “breakthrough,” it goes “aha” (it doesn’t really go aha but you get the idea!), “let me look at the surrounding words to understand the context”. Then it sees the surrounding context with phrases like "... they reported no ..." and "... high-stakes talks on Friday," and the model knows that this is not about breaking through ice or medical discoveries but finding progress in important diplomatic discussions.
No infinite library
Unfortunately, we don't have the ability to store every potential word the model might see. As a result, the model will encounter words it has not seen before and will have to represent them with the dreaded "UNK" token, which marks an unknown token.
This means the model would be unable to learn any context from this token since it can’t assign any meaning to it. Using UNK tokens to identify any words that the model has not seen during training would severely limit the model's ability to learn the meaning of a word from context. This is precisely the issue tokenizers are designed to solve.
No more UNKs
So we can't simply store every word the model could possibly encounter, yet we still need some way for the model to represent unseen words so it can learn from the surrounding context. Instead of breaking sentences into word units, the tokenizers used in most Transformer models like BERT use smaller-than-word units, or subwords, to build up bigger words.
Why are subwords better than whole words? Remember, the problem is there are an infinite number of words, so we can’t store all possible words our models might ever see.
We can, however, store the most common words that the model is likely to see. But what do we do with the remaining words? Instead of storing them as whole words, we can try to find the most common character sequences we can use to build those words, putting them together like a jigsaw.

Think about it like this: you can train a model to identify the most commonly used words in a certain training dataset and then identify the smaller subword parts you would need to build up the other, less commonly used words. As a simple default, you could even use single characters as the smallest subword units so that you could build up any unseen word. But, ideally, you want to use larger subword units, as you would like to be able to associate some meaning with these subword parts depending on how they are used.
Consider our earlier example with the word "breakthrough". In some datasets, this may not be a very common word, so keeping it as a single entry in our model vocabulary, where space is at a premium, could be wasteful. Instead, we might want to store subword parts like "break" and "through". The benefit here is that these are themselves words that we could use as well.
In this way we have freed up one slot by being able to combine “break” and “through” together whenever we encounter “breakthrough” and do not need a slot to store the whole word.
Alternatively, we might not even think “through” is needed as a standalone word, so we tokenize that as “th” and “rough” and so on. Thus “breakthrough” becomes “break” + “th” + “rough”. We can represent “breakthrough” by these three individual tokens when we see it. In this way, we will be able to ensure that we do not need to use any UNK tokens. And, we hope, the model will still be able to learn from the context in which these subword tokens are used so that we can differentiate between breaking through the ice and a diplomatic breakthrough.
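To make the jigsaw idea concrete, here is a toy sketch of the greedy longest-match idea behind WordPiece-style tokenizers. The tiny vocabulary is made up purely for illustration; a real tokenizer learns its vocabulary from data:

# Toy illustration only: a hand-picked vocabulary, not one learned from data.
vocab = {"break", "th", "rough", "s"}

def toy_tokenize(word):
    # Greedily take the longest matching vocabulary piece at each position.
    # Continuation pieces get a '##' prefix, mimicking BERT's WordPiece output.
    tokens, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                tokens.append(word[start:end] if start == 0 else "##" + word[start:end])
                start = end
                break
        else:
            return ["[UNK]"]  # nothing matched: fall back to the unknown token
    return tokens

print(toy_tokenize("breakthrough"))   # ['break', '##th', '##rough']
print(toy_tokenize("breakthroughs"))  # ['break', '##th', '##rough', '##s']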
Don’t worry if it doesn’t make much sense just yet. The important thing is just to consider whether it could be difficult to learn contextual meaning when representing a word with a single token versus multiple subword tokens. To better understand whether this could be difficult, let’s look at some examples of tokenization in practice and get a better feel for how it can help Transformer models learn from context. We will look at some examples of “good” tokenization and then some examples of where “bad” tokenization occurs due to some subword building blocks being associated with different meanings.
Tokenization in practice
To understand how tokenization can lead to issues with in-domain vocabulary, we need to understand how the subword tokenization process we described works in practice. Let’s look at our original example sentence and see how it is tokenized as a start.
The sentence was:
“The United States and Russia sought to lower the temperature in a heated standoff over Ukraine, even as they reported no breakthroughs in high-stakes talks on Friday aimed at preventing a feared Russian invasion”
Using the HuggingFace library, we can easily generate some examples of how the tokenizer BERT uses would treat the above sentence:
from transformers import BertTokenizer, BertModel

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

example_sen = ("""The United States and Russia sought to lower the temperature in a
heated standoff over Ukraine,
even as they reported no breakthroughs
in high-stakes talks on Friday aimed at preventing a feared Russian invasion""")

print(bert_tokenizer.tokenize(example_sen))
['the', 'united', 'states', 'and', 'russia', 'sought', 'to', 'lower', 'the', 'temperature', 'in', 'a', 'heated', 'stand', '##off', 'over', 'ukraine', ',', 'even', 'as', 'they', 'reported', 'no', 'breakthrough', '##s', 'in', 'high', '-', 'stakes', 'talks', 'on', 'friday', 'aimed', 'at', 'preventing', 'a', 'feared', 'russian', 'invasion']
The interesting thing here is that most of the sentence is tokenized as proper words. You can see this since there are very few '##' characters used. These characters signify that a subword part is used. For example, "breakthroughs" is tokenized as the in-vocabulary word "breakthrough" plus an additional subword part "s", identified by "##s". (So it's not actually tokenized as "break" and "##through" as we described earlier.)
This is a good example of the benefits of tokenization. We don’t need to keep two slots in our limited storage space of words for both “breakthrough” and “breakthroughs”. Instead, we just store the stem “breakthrough” and then another slot for “s”. The “s” part is useful since we can do the same for many plural words.
But this doesn't happen with every plural word: "Russians", for example, is tokenized as the single token ['russians']. The key thing to note here is that tokenization does not follow any linguistic rule like stemming or lemmatization. Instead, it makes up its own rules based on its subword algorithm.
One interesting example is "standoff", which is tokenized as "stand" and "##off". "Stand" may have a number of associations:
- Position: I like to stand at my desk
- Attitude: It’s time to stand up for what you believe
- Furniture: Where is the microphone stand
- Cooking: Leave it to stand for 5 minutes
In this example, we would hope that the Transformer model would focus attention on the parts of the sentence like “breakthrough” and “talks” which hint at the appropriate meaning.
The model should put more weight on those parts of the sentence and infer a meaning for “['stand', '##off']” that is more related to a deadlock between two parties rather than a piece of furniture. Similarly, if during training, the model sees many examples of the usage of “standoff” it will associate “['stand', '##off']” with the correct meaning. However, as we will see, this is more difficult when the word is tokenized as multiple subword tokens.
Again, just to show that there is no linguistic-like rule being employed by the tokenizer, we can see a mix of behavior when tokenizing words that end in “off”:
- Kickoff: ['kickoff']
- Playoff: ['playoff']
- Layoff: ['lay', '##off']
- Handoff: ['hand', '##off']
So it looks like the tokenizer model saw more examples of words like "kickoff" and "playoff" and learned that it was optimal to give these words their own slot rather than build them up using the "##off" subword building block.
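You can reproduce this mixed behavior directly with the same bert-base-uncased tokenizer used above:

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Words ending in "off" are treated inconsistently: some get their own vocabulary slot,
# others are built from a stem plus the '##off' subword piece.
for word in ['kickoff', 'playoff', 'layoff', 'handoff', 'standoff']:
    print(word, bert_tokenizer.tokenize(word))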
Some examples of poor tokenization
Even from looking at a small number of examples, you can get an idea of how subword tokenization works. It uses the smaller building blocks of subwords to save valuable space for other, more common words. But what about more specific examples of where the subword tokenization makes learning and identifying the correct meaning of a word more difficult?
A recent paper about domain adaptation in BERT highlights some interesting shortcomings of subword tokenization and the challenges it presents when trying to learn semantic meaning. Take the word "sophisticated"; it is properly tokenized in that it is represented as a single token: ['sophisticated']. You might then think the word "unsophisticated" would be ['un', '##sophisticated'], but you would be wrong. Instead, it is ['un', '##sop', '##his', '##tica', '##ted'].
The paper highlights some great examples of where the tokenization does not seem to utilize already existing words when creating subword tokens:
| Input | Tokenized |
|---|---|
| Activated | ['activated'] |
| Deactivated | ['dea', '##ct', '##ivated'] |
| Equal | ['equal'] |
| Unequal | ['une', '##qual'] |
| Value | ['value'] |
| Devalue | ['dev', '##al', '##ue'] |
Examples of what might seem like unusual tokenization
If you think about these examples, it's easy to understand why it may be difficult for a model like BERT to learn meaning and context when it has to parse a word like "deactivated" as three subwords instead of "de" and "##activated". At least with the latter, the model could infer some level of meaning from the word "activated", which would be useful for inferring the meaning of the usage of "deactivated".
With “dea” and “##ct” and “##ivated” there will be less clear associations and thus, it will be harder to infer the correct meaning.
What meaning has the model already learned from “dea” for example? It might occur in other examples where the meaning is different from “deactivated” thus requiring that most of the semantic meaning be learned from the context in which it is used. This may work some of the time but we will see examples of where the context is not enough to identify the correct meaning.
As we noted earlier, in subword tokenization some tokens have '##' before them and some do not. The first part of the subword tokens is known as the stem (or root) and never has a '##' prefix. It is important to note how the stem of the word, e.g. the "dea" in "deactivated", is treated differently from the '##' parts of the subwords.
So take "unsaturated", which is tokenized as ['un', '##sat', '##ura', '##ted']. The "##sat" subword does not correspond to the word "sat", which also exists in the vocabulary as a whole word. The meaning associated with '##sat' comes only from the times it was used as a subword. That means that "satsat" (it's not a word; this is just an example), which is tokenized as ['sat', '##sat'], is the whole word "sat", with all the meanings associated with sitting, plus the subword part "##sat", which is associated with whenever it was used as a subword.
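You can check this distinction directly in the tokenizer's vocabulary: the whole word and its hashed counterpart are separate entries with separate IDs, and therefore separate learned embeddings. A quick sketch with the same bert-base-uncased tokenizer:

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
vocab = bert_tokenizer.get_vocab()  # maps token string -> token id

# 'sat' and '##sat' are distinct vocabulary entries, so they accumulate meaning independently,
# while 'unsaturated' and 'satsat' have no entry of their own and must be built from pieces.
for token in ['sat', '##sat', 'unsaturated', 'satsat']:
    print(token, vocab.get(token, 'not in vocab'))

print(bert_tokenizer.tokenize('unsaturated'))  # ['un', '##sat', '##ura', '##ted'], as above
print(bert_tokenizer.tokenize('satsat'))       # ['sat', '##sat'], as above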
Another interesting paper, AVocaDo: Strategy for Adapting Vocabulary to Downstream Domain, highlights some other examples of poor tokenization in relation to named entities that are unique to a specific business domain. The examples they highlighted were:
| Input | Tokenized |
|---|---|
| Bluetooth | ['blue', '##tooth'] |
| Corticosterone | ['co', '##rti', '##cos', '##ter', '##one'] |
Examples of poor tokenizations in medical and technology domains
In the above examples, we can see that a word like "Bluetooth" is built from the stem "blue" and the subword token "##tooth". The tokenizer is telling us that, in most of the data it was trained on, "blue" is a commonly occurring word that deserves its own token, but it has not seen the term "Bluetooth" often enough to give it a single token of its own.
A generally trained LM will then use these subword tokens to try and learn the context around "Bluetooth". We know "Bluetooth" refers to a wireless technology, but the model will need to learn this association. The problem arises when the stem is a common word that has already learned a strong association, in this case with "blue" as a color.
In a computer or tech domain the word “bluetooth” could be frequently used and it would be difficult to “unlearn” the strong association with the color. Similarly, in a medical domain, the fact that “Corticosterone” is tokenized as 5 subword tokens would make it difficult to learn its relevant meaning as a steroid hormone.
Continuing on from the examples in the above paper, let’s see if we can think of domain-specific terms and see how they might be tokenized in a way that might make inferring their correct meaning difficult:
| Input | Tokenized |
|---|---|
| Facebook | ['facebook'] |
| Google | ['google'] |
| Netflix | ['netflix'] |
| Googol | ['goo', '##gol'] |
| Headbook | ['head', '##book'] |
| Thunderbolt | ['thunder', '##bolt'] |
| Avocado | ['av', '##oca', '##do'] |
The code for generating this can be seen here:

from transformers import BertTokenizer, BertModel

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

word_list = ['Facebook', 'Google', 'Netflix', 'Googol', 'Headbook', 'Thunderbolt', 'Avocado']
for word in word_list:
    print(bert_tokenizer.tokenize(word))
You can see that common company names like "Facebook" and "Google" get tokenized as one token. That makes sense, as these companies are so common and talked about that they would appear frequently in many domains, not just tech-related ones.
However, if we change the name slightly to similar sounding names like "Headbook" we notice that they are tokenized differently. Again, this also makes sense as I am making these up intentionally to create company names that the tokenizer will not have been trained on. When we see stems like "thunder" or "head" we should be concerned that these might cause our models to underperform in some tasks.
Is there a cost to poor tokenization?
So what if these words are tokenized in a somewhat unintuitive way? I mean, there really is a lot going on within Transformer models that we can't fully explain, so why is this any different? We just noted that the Model-T didn't have what we consider to be basic features like a turn signal or a seat belt until years later. So maybe the impact of poor tokenization is still so nuanced that it won't cause any problems for us, in the same way as some of the early shortcomings of the Model-T.
The thing is, we can try to measure the impact of what seems like poor tokenization in ways that matter for your use case. If the way a word is tokenized can negatively affect the meaning a model associates with it, then that does seem like a problem. So let's check that out.
The paper Domain adaptation challenges of BERT in tokenization and sub-word representations of Out-of-Vocabulary words looks at the impact of tokenization on semantic meaning. One of the interesting findings in this paper is that Transformer models can “compensate” for subword tokenization in certain cases via their ability to identify context in a sentence.
When tokenization goes well
The paper describes different scenarios where tokenization for Out-Of-Vocabulary (OOV) words works well and where it does not. The key here is being able to identify when subword tokenization is likely to help or hinder your particular use case or business domain. For example, the authors state that:
“While OOV words that begin with an in-vocab root (or stem) will retain its semantic meaning when tokenized, they become vulnerable in other cases as the root (or stem) will be broken down into smaller constituent sub-word units.”
There is a lot to unpack there but let’s break down some of the terms:
- OOV words: These are simply the words that need multiple tokens, i.e. the words with the “##” parts in them
- In-vocab root (or stem): As we saw, when a word is tokenized by subword units, the first part, the stem or root, does not have any “##” marker signifying that it is a whole word and not a subword part. The “satsat” example was tokenized as ['sat', '##sat'], so the root in this part is the “sat” part.
This means when the root of a word is a known in-vocabulary word, the model should have a good chance of inferring the right semantic meaning. However, when the stem is not an in-vocabulary word (and is broken into subword parts), it may be hard for the model to pick out the right meaning.
| Word | Tokens |
|---|---|
| firsthand | [‘first’ ‘##hand’] |
| overestimate | ['over', '##est', '##imate'] |
| downgrade | ['down', '##grade'] |
| headband | ['head', '##band'] |
| homegrown | ['home', '##gr', '##own'] |
OOV words that begin with in-vocab stems (root)
The above Out-Of-Vocabulary words all start with an in-vocab word, which has likely learned some related meaning from other text that can be applied to the current usage and context. Thus, these Out-Of-Vocabulary words will maintain some semantic meaning because an appropriate word is used in their tokenization.
When tokenization goes badly
| Word | Tokens |
|---|---|
| enlarge | ['en', '##lar', '##ge'] |
| scrutinize | ['sc', '##rut', '##ini', '##ze'] |
| slush | ['sl', '##ush'] |
| froth | ['fr', '##oth'] |
| mournful | ['mo', '##urn', '##ful'] |
OOV words that do not begin with in-vocab stems (root)
In contrast, the above Out-Of-Vocabulary words do not have a useful in-vocab word as their stem when tokenized. This can occur more out of luck than by design. For example, some OOV words which the model will have seen less frequently may be short words. Thus, the likelihood that they are tokenized with a proper or in-vocab word at the start is small. Longer words, by comparison, are more likely to have an in-vocab word as the stem of their subword tokenization.
Subword tokens like “en”, “sc”, “sl” and “fr” are much less likely to have useful meaning associated with them than subword tokens like “first” “over” “down” and “head”.
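A quick way to audit your own domain terms for this pattern is simply to print the stem of every multi-token word and eyeball whether it is a meaningful word or a fragment. A minimal sketch (the word list just reuses examples from the two tables above):

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def audit_tokenization(words, tokenizer):
    # Print the stem of every multi-token word so fragments like 'en' or 'sc' can be spotted by eye.
    for word in words:
        tokens = tokenizer.tokenize(word)
        if len(tokens) == 1:
            print(f"{word:>15} -> in-vocab word")
        else:
            print(f"{word:>15} -> stem '{tokens[0]}' + {tokens[1:]}")

audit_tokenization(['firsthand', 'overestimate', 'downgrade',
                    'enlarge', 'scrutinize', 'mournful'], bert_tokenizer)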
To better understand the semantic implications of tokenization, the authors also look at the non-stem subword tokens, i.e. the "##" parts of the subword tokens. They look at cases where the hashed part of the subword has a corresponding in-vocab part. For example, situations like:
| Word | In-vocab subword part |
|---|---|
| time | ##time |
| world | ##world |
| city | ##city |
| right | ##right |
| house | ##house |
| group | ##group |
Words that have corresponding in-vocab subword parts
In these cases, the authors find that the in-vocab words, e.g. "time" or "world", are similar to their hashed counterparts. In other words, "time" is similar to "##time" and "world" is similar to "##world".
To be precise, the authors find that, on average, the cosine similarity for these words is 0.66.
What does this mean? Well, the authors claim that when a word has an in-vocab subpart, the high cosine similarity should make it easier for your model to get the right meaning from the context. The thinking is that a lower score would make it harder for the model to "compensate" from the surrounding context.
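One rough way to explore this yourself is to compare the rows of BERT's input embedding matrix for a token and its hashed counterpart. This is only an approximation of the paper's measurement (the authors' exact methodology may differ), but it gives a feel for how related the two entries are:

import torch
from transformers import BertTokenizer, BertModel

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

# The input embedding matrix holds one learned vector per vocabulary entry.
embeddings = bert_model.get_input_embeddings().weight

def token_similarity(token_a, token_b):
    id_a = bert_tokenizer.convert_tokens_to_ids(token_a)
    id_b = bert_tokenizer.convert_tokens_to_ids(token_b)
    return torch.cosine_similarity(embeddings[id_a], embeddings[id_b], dim=0).item()

# Words from the table above that have a corresponding '##' entry in the vocabulary.
for word in ['time', 'world', 'city', 'right', 'house', 'group']:
    print(word, round(token_similarity(word, '##' + word), 2))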
When tokenization is … weird?
OK, to recap: it looks like it might be harder to identify the right semantic meaning for a word when there is no in-vocab word in either the stem or the '##' parts of its subword tokens. When there is an in-vocab word used in our tokenization, the model will likely have a stronger association with the correct usage and thus be more likely to learn more from the current context.
This won't be true in all cases, but when you see a common word in your domain being tokenized with one or more non-word pieces, you might want to take a closer look at how it is used. But what about situations where we avoid subword tokenization altogether by hacking the way we spell a word? If that improves the semantic similarity of the words, then things get a bit ... weird.
Let's return to our "sophisticated" example from earlier. The different in-vocab and subword tokenizations of its variants were:
| Word | Tokens |
|---|---|
| Sophisticated | ['sophisticated'] |
| Unsophisticated | ['un', '##sop', '##his', '##tica', '##ted'] |
| Un sophisticated | ['un', 'sophisticated'] |
Tokenizations of "sophisticated" and its variants
The authors compare the similarity of “unsophisticated” with “sophisticated” and find that it has a cosine similarity of 0.30. This is much lower than we would expect. To show that this is to do with the tokenization rather than a subtle difference in the meaning of the words, the authors compare the similarity between “unsophisticated” and “un sophisticated”, i.e. tokenize it as two words rather than one and find the cosine similarity is 0.81. That is a massive difference simply because we “hacked” it so that we prevented it from being tokenized with multiple hashed subword parts which are not in-vocab words.
This is weird since the tokenizer has all the core elements to be able to properly deal with the word “unsophisticated”. It has “un” and “sophisticated” in the vocab but no “##sophisticated” subword part and thus needs to treat it like a completely new Out-Of-Vocabulary word. Ideally, we would want it to leverage the knowledge it already has for “sophisticated” and build on that. Instead, we need to break up “unsophisticated” into two separate words to get a better semantic similarity.
Now that we have identified some of the cases when subword tokenization can impact semantic meaning, let's see whether the Transformer architecture, via its attention mechanism, is able to compensate via the context of the sentence.
Is attention all you need to compensate for poor tokenization?
Models like BERT have caused a paradigm shift in how we think about the linguistic capabilities of NLP applications. One of the main reasons for this shift is that the Transformer architecture enables models like BERT to understand a word based on the context in which it is used. This allows BERT to understand the subtle difference between the word “play” in the following sentences referring to the 2022 Super Bowl and a play about American Football:

BERT can identify the surrounding context of the word “play” in both sentences to identify one is referring to a thing, i.e. a stage play, and the other to the verb “to play”, as in, the teams are playing each other.
We have seen that tokenization can result in less-than-ideal subword tokens being used to describe an OOV word. The previous paper looked at this issue by comparing the semantic similarity of words and subword tokens. When the similarity was relatively high, e.g. 0.66, then it was claimed that the context in the sentence should help limit the drop in performance due to poor tokenization. We can investigate this by comparing the semantic similarity of sentences containing OOV words and identifying cases when the change in similarity can only be due to tokenization.
Comparing context and tokenization
There are many ways we can compare two sentences and generate a similarity score. We could use a simple approach as outlined on the SentenceBERT website to compare two sentences. This works fine, but for our purposes here, I would like to use a more involved process where we can see how the tokenizers are involved in the generation of the embeddings. There is a great blog that describes the process in detail, and I have tried to use the code examples from it.
You can try different tokenizers and models to see how it impacts the similarity score. You can also use different settings for pooling and code setup. I tried to keep it fairly simple and just expose how the tokenizer is used in generating the embedding we use to calculate similarity.
# Import the libraries from HuggingFace
# You could also do this from Sentence BERT directly if you like (sbert.net)
from sentence_transformers import util
from transformers import AutoTokenizer, AutoModel
import torch

# Let's try it with these tokenizers and models
# It's not about testing a specific model as such
# It's more about how the tokenizer works so we can try and understand its impact
test_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
test_model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

def find_similar_query(query1, query2, tok, mod):
    print(tok.tokenize(query1))
    print(tok.tokenize(query2))
    enc_input = tok([query1, query2], padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = mod(**enc_input)
    pool_embs = mean_pooling(model_output, enc_input['attention_mask'])
    norm_embs = torch.nn.functional.normalize(pool_embs, p=2, dim=1)
    res = util.pytorch_cos_sim(norm_embs, norm_embs).numpy()[0][1]
    return res

# You can then compare two sentences using the find_similar_query function:
sen1 = "Tokenizers are key to understanding language models"
sen2 = "Tokenizers are important to understanding language models"
find_similar_query(sen1, sen2, test_tokenizer, test_model)
Sophisticated, unsophisticated, and un sophisticated
Let’s start with the examples we touched on earlier; "sophisticated", which did not appear to be tokenized well in certain cases:
| Sentence | Tokenized Sentence |
|---|---|
| Ransomware attacks are using more sophisticated techniques | ['ransom', '##ware', 'attacks', 'are', 'using', 'more', 'sophisticated', 'techniques'] |
| Ransomware attacks are using more un sophisticated techniques | ['ransom', '##ware', 'attacks', 'are', 'using', 'more', 'un', 'sophisticated', 'techniques'] |
| Ransomware attacks are using more unsophisticated techniques | ['ransom', '##ware', 'attacks', 'are', 'using', 'more', 'un', '##sop', '##his', '##tica', '##ted', 'techniques'] |
Although they are different sentences we would want our models to infer that they are somewhat related
| Sentence A | Sentence B | Similarity |
|---|---|---|
| Ransomware attacks are using more sophisticated techniques | Ransomware attacks are using more unsophisticated techniques | 0.95 |
| Ransomware attacks are using more sophisticated techniques | Ransomware attacks are using more un sophisticated techniques | 0.97 |
In this case, it does look like the context helped offset the impact of the poor tokenization of "unsophisticated". But what about another example of a word that does not have an in-vocab stem, such as "mournful"?
Mournful, sad and angry
| Sentence | Tokenized Sentence |
|---|---|
| The people that gathered were very mournful about what happened | ['the', 'people', 'that', 'gathered', 'were', 'very', 'mo', '##urn', '##ful', 'about', 'what', 'happened'] |
| The people that gathered were very sorry about what happened | ['the', 'people', 'that', 'gathered', 'were', 'very', 'sorry', 'about', 'what', 'happened'] |
| The people that gathered were very sad about what happened | ['the', 'people', 'that', 'gathered', 'were', 'very', 'sad', 'about', 'what', 'happened'] |
Although they are different sentences we would want our models to infer that they are somewhat related
| Sentence A | Sentence B | Similarity |
|---|---|---|
| The people that gathered were very mournful about what happened | The people that gathered were very sorry about what happened | 0.79 |
| The people that gathered were very sad about what happened | The people that gathered were very sorry about what happened | 0.83 |
Again, we do see a change here, but it looks like the subword tokens are doing their job, and the model is able to infer the right meaning from the surrounding context. It loses a little, but that could also be due to subtle differences in the nuance of "sad" vs. "sorry". But what about the ability to pick up on a different emotion than "mournful" in the sentence? For example, let's compare the following sentences:
| Sentence | Tokenized Sentence |
|---|---|
| The people that gathered were very mournful about what happened | ['the', 'people', 'that', 'gathered', 'were', 'very', 'mo', '##urn', '##ful', 'about', 'what', 'happened'] |
| The people that gathered were very sorry about what happened | ['the', 'people', 'that', 'gathered', 'were', 'very', 'sorry', 'about', 'what', 'happened'] |
| The people that gathered were very angry about what happened | ['the', 'people', 'that', 'gathered', 'were', 'very', 'angry', 'about', 'what', 'happened'] |
Now when we compare them we would hope the similarity is a little less due to the change in the “emotion” associated with “angry”.
| Sentence A | Sentence B | Similarity |
|---|---|---|
| The people that gathered were very mournful about what happened | The people that gathered were very angry about what happened | 0.82 |
| The people that gathered were very sorry about what happened | The people that gathered were very angry about what happened | 0.71 |
Now we begin to see where the poor tokenization may be making it more difficult to compensate for the subtle change in meaning. The model could be leveraging too much from the surrounding context when it doesn't have a strong signal from the tokenization of "mournful". Whereas when we compare "sorry" vs. "angry", it looks like the model can identify more from the different usage and meaning of these in-vocab words.
It's important to note that this does not “prove” that the change in similarity scores is solely down to tokenization. It could be that in the dataset the model was trained on, it learned some unusual context for “sorry” or “sad” or “angry” and is making the inference from that. But at least we know that when we see a word tokenized without an in-vocab word as the stem, we might want to take a closer look at our use case.
The colors of bluetooth
We noted earlier that the wireless connection protocol called "Bluetooth" was tokenized as ['blue', '##tooth']. This is good insofar as "blue" is an in-vocab word and it is the stem or root of this word when tokenized. However, the association with "blue" may not be useful in the scenario where we want to refer to "Bluetooth" technology, and could lead to some semantic confusion.
| Sentence | Tokenized Sentence |
|---|---|
| can i connect my phone to the tv via a wireless technology | ['can', 'i', 'connect', 'my', 'phone', 'to', 'the', 'tv', 'via', 'a', 'wireless', 'technology'] |
| can i connect my phone to the tv via Bluetooth | ['can', 'i', 'connect', 'my', 'phone', 'to', 'the', 'tv', 'via', 'blue', '##tooth'] |
| can i connect my phone to the tv using a blue cable | ['can', 'i', 'connect', 'my', 'phone', 'to', 'the', 'tv', 'using', 'a', 'blue', 'cable'] |
| can i connect my phone to the tv using blue teeth | ['can', 'i', 'connect', 'my', 'phone', 'to', 'the', 'tv', 'using', 'blue', 'teeth'] |
Let’s compare the simple case of “wireless technology” and “blue cable” and see how similar these two sentences are to each other.
| Sentence A | Sentence B | Similarity |
|---|---|---|
| can i connect my phone to the tv via a wireless technology | can i connect my phone to the tv using a blue cable | 0.75 |
Bluetooth is a wireless technology so if we compare “Bluetooth” to a “blue cable” then in theory we should not see much change in the semantic similarity of both sentences.
| Sentence A | Sentence B | Similarity |
|---|---|---|
| can i connect my phone to the tv via Bluetooth | can i connect my phone to the tv using a blue cable | 0.84 |
It looks like the similarity increased since the stem of the subword tokenization is “blue” and this obviously has a similarity with the color blue even though it is used in a very different context here. If "Bluetooth" were stored as a single token here, it is likely it would not be as easily mis-associated with the color blue.
What happens if we use a nonsensical sentence with the phrase “blue teeth” which actually has no valid meaning in this context?
| Sentence A | Sentence B | Similarity |
|---|---|---|
| can i connect my phone to the tv via Bluetooth | can i connect my phone to the tv using blue teeth | 0.86 |
The similarity actually increased here, since the model also found a similarity between "teeth" and "##tooth". To verify this, let's compare the other sentence where we referred to "wireless technology":
| Sentence A | Sentence B | Similarity |
|---|---|---|
| can i connect my phone to the tv via a wireless technology | can i connect my phone to the tv using blue teeth | 0.75 |
The similarity of these sentences didn’t change. Ideally we would have liked it to drop but this also shows that much of the similarity of the sentence is being drawn from the “can i connect my phone to the tv …” part of the sentence. So it looks like there will be a high baseline similarity between these sentences, which will only be offset when the meaning of the thing connecting the tv and the phone has a strong signal.
For example:
| Sentence A | Sentence B | Similarity |
|---|---|---|
| can i connect my phone to the tv via a fire | can i connect my phone to the tv using water | 0.67 |
In this case, the strong meaning associated with “water” and “fire” helps the model understand that these sentences are referring to different things. In a sense, it “overrides” the similarity of the rest of the sentence even though the sentences are both nonsensical.
Similarly, “Thunderbolt” is a brand name that is tokenized as [‘thunder’, ‘##bolt’], so could cause similar issues if we think of sentences such as:
| Sentence | Tokenized Sentence |
|---|---|
| Thunderbolt is a hardware interface used by Apple | ['thunder', '##bolt', 'is', 'a', 'hardware', 'interface', 'used', 'by', 'apple'] |
| A lightning bolt is not a hardware interface used by Apple | ['a', 'lightning', 'bolt', 'is', 'not', 'a', 'hardware', 'interface', 'used', 'by', 'apple'] |
| SCSI is a hardware interface used by Apple | ['sc', '##si', 'is', 'a', 'hardware', 'interface', 'used', 'by', 'apple'] |
SCSI is a set of standards for connecting computers to peripheral devices, and is not similar to a lightning bolt!
| Sentence A | Sentence B | Similarity |
|---|---|---|
| A lightning bolt is not a hardware interface used by Apple | SCSI is a hardware interface used by Apple | 0.48 |
This difference is seen in the similarity score of 0.48. Thunderbolt is a brand name that is also not similar to a lightning bolt:
| Sentence A | Sentence B | Similarity |
|---|---|---|
| A lightning bolt is not a hardware interface used by Apple | Thunderbolt is a hardware interface used by Apple | 0.67 |
Here we see that the tokenization of “Thunderbolt” has significantly increased the similarity score of the two sentences even though, in theory, there should be no real difference between SCSI and Thunderbolt in the context of this sentence. This shows some of the dangers with in-domain type names, which are likely to be tokenized by subword tokens and can have some unintended consequences due to unhelpful linguistic associations.
For example, let's say you own a company called “SlowWebsite” (I know, it's not the catchiest name, but it will help show the potential for in-domain confusion), and you want to know when your customers are complaining about your website being, well, slow. This seems like a different issue than when your customers are claiming that they cannot access your website.
We can try and see if the tokenization of SlowWebsite will make it difficult for you to identify your customers' queries compared to some other companies like Google, for example.
The sentences would look something like this:
| Sentence | Tokenized Sentence |
|---|---|
| The SlowWebsite website is unavailable | ['the', 'slow', '##we', '##bs', '##ite', 'website', 'is', 'unavailable'] |
| The SlowWebsite website is really slow | ['the', 'slow', '##we', '##bs', '##ite', 'website', 'is', 'really', 'slow'] |
And when we compare the similarity of these two sentences, we get:
| Sentence A | Sentence B | Similarity |
|---|---|---|
| The SlowWebsite website is unavailable | The SlowWebsite website is really slow | 0.87 |
Hmmm, that seems high for things that look like different queries. What about if we simply change the company name, which should not make that much difference to the context of the sentence?
| Sentence A | Sentence B | Similarity |
|---|---|---|
| The Google website is unavailable | The Google website is really slow | 0.68 |
That is a big change for simply changing the name of the company. For a human, this would be a minor change that does not impact the semantic meaning of the sentences. However, since "Google" is tokenized as a single in-vocab word, the model will likely have learned a strong association between "Google" and websites, so it is able to identify the more subtle differences in the requests.
Hopefully, this gives you a better feel for where there may be in-domain words or phrases you need to check, to see how they are tokenized and whether that might degrade the performance of your use case.
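As a starting point, you can reuse the find_similar_query helper (and the test_tokenizer and test_model loaded with it) from the earlier snippet to screen candidate phrases from your own domain:

# Assumes find_similar_query, test_tokenizer and test_model from the earlier snippet are in scope.
pairs = [
    ("The SlowWebsite website is unavailable", "The SlowWebsite website is really slow"),
    ("The Google website is unavailable", "The Google website is really slow"),
]

for sen_a, sen_b in pairs:
    score = find_similar_query(sen_a, sen_b, test_tokenizer, test_model)
    print(f"{sen_a!r} vs {sen_b!r}: {score:.2f}")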
And if you do find situations where tokenization is causing a problem, what can you do about it? That is what we are going to look at next.
Can we fix shortcomings due to poor tokenization?
So let's do a quick review. It looks like subword tokenization may limit our ability to identify the correct meaning in certain situations. We have seen that when an OOV word is tokenized with an in-vocab word as its stem, that can help maintain the correct meaning.
However, we saw instances where some in-vocab subword parts can cause issues when they have strong associations with areas unrelated to the OOV word. Now we want to see if there are any steps we can take to address these issues. It seems like there are three popular approaches to addressing the issue of in-domain tokenization. These are:
- Fine-tuning: Why not simply fine-tune the model to your in-domain data?
- Add new tokens: If the problem is we are missing in-vocab words, then why not just manually add them?
- Train from scratch: Is training the entire model from scratch the best solution?
Fine-tuning
One of the key differences between BERT and earlier models is that BERT was trained as a general-purpose language model rather than for a single task. Previously, models would be trained to perform one specific task, e.g. sentiment analysis or Named Entity Recognition (NER). None of these earlier models were capable of being used on a task for which they were not specifically trained.
In contrast, BERT was designed to be used as a general model which you could then tune to your own use case via a process known as transfer learning and/or fine-tuning. Basically, this is where you leverage the general knowledge BERT has learned from its large-scale pretraining and tune or train the upper layers of BERT so that it gains more specific knowledge of your unique domain.
There are tutorials that show you how to fine-tune models on downstream tasks which are specific to your domain. Libraries such as SentenceBERT also provide you with tools that you can use to train and tune these models in a similar way.
However, there are some questions about how well models like BERT adapt to a specific domain via fine-tuning. An interesting paper on adapting BERT to the legal domain discusses some limits with fine-tuning and shows that the authors got better results by further pre-training models or training them from scratch rather than just fine-tuning them. While the specific details of this and other studies are beyond the scope of this blog, we can, at least, try and understand whether fine-tuning could, in theory, address the issues we identified earlier in relation to subword tokenization.
Can fine-tuning create better tokens?
To understand why fine-tuning might not address the subword issues, we can think of the tokenizer as a separate model which feeds its output to the Transformer part of the model where the learning will take place. The tokenizer accepts text as an input and outputs token IDs. These are the IDs that represent the subword tokens we looked at earlier.
Transformer models like BERT then take these IDs and try to learn embeddings that capture the meaning and context of the words they represent. Then when you use a sentence on your pre-trained model it will output an embedding that represents that specific input given its context. In this way, the intermediate work performed by the tokenizer is hidden from most use cases.

We can think of the tokenizer as another model within the overall architecture of a transformer model, i.e. the output of the tokenizer is the input for the transformer. Source: Author
Thus, this training occurs in a two-part process.
The first part is where the tokenizer is trained on the relevant data to best understand how to represent the vocabulary in that domain via its subword format. The second part is where the tokenizer is then “plugged” into a model like BERT to learn the embeddings.
The important point to note about this two-stage process is that the tokenization is fixed once the tokenizer is plugged into BERT. If "Bluetooth" is tokenized as ['blue', '##tooth'], then it will remain tokenized that way regardless of how many instances of "Bluetooth" appear in the training data while BERT is learning its embeddings.
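You can see this fixed hand-off directly: the tokenizer turns text into token IDs, and those IDs are all the model ever receives. Fine-tuning updates the model's weights, but it never changes this mapping:

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokens = bert_tokenizer.tokenize("Bluetooth is a wireless technology")
ids = bert_tokenizer.convert_tokens_to_ids(tokens)

# The model only ever sees the IDs: 'blue' and '##tooth' remain two separate IDs,
# no matter how much in-domain data you later fine-tune on.
print(tokens)
print(ids)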
The hope with fine-tuning is that with enough instances of “Bluetooth” in the new, domain-specific dataset, the model can overcome its tokenized shortcomings and learn that the tokens “blue” and “##tooth” refer to wireless technology.
It will need to know that “blue”, when used in this context, i.e. related to technology, is a reference to a protocol rather than a color. While this is possible, the specific form of the subword tokenization makes it more difficult to learn the meaning than if it was tokenized as one single word. This is not optimal and makes fine-tuning a less-than-ideal way to improve tokenization.
If it were tokenized as one word then there would be no confusion with the cases in which blue is used as a color or any other potential use case which may serve to further confuse the correct meaning of the term at that time. So why don’t we do just that and add “Bluetooth” as a single token to our model?
Add new tokens?
If we know the specific words in our own unique domain, whether it be a financial, medical, or technical domain, then why not simply add them directly to the tokenizer? There is an option available in the HuggingFace library which enables you to do just that. This means the tokenizer will be able to process these new words as in-vocab words and pass them to your Transformer language model as single tokens.
For example, using the default BERT tokenizer, “Bluetooth”, as we have seen, is not an in-vocab word:
| Sentence | Tokenized Sentence |
|---|---|
| Bluetooth is a short range wireless technology standard | ['blue', '##tooth', 'is', 'a', 'short', 'range', 'wireless', 'technology', 'standard'] |
Now, using the HuggingFace library we can add it directly as an in-vocab word and run our sentence again:
# bert_tokenizer is the 'bert-base-uncased' tokenizer loaded earlier
new_tokens = ['Bluetooth']

print(bert_tokenizer.tokenize("Bluetooth is a short range wireless technology standard"))
added_tokens = bert_tokenizer.add_tokens(new_tokens)
print(f"Total number of new tokens added to the tokenizer: {added_tokens}")
print(bert_tokenizer.tokenize("Bluetooth is a short range wireless technology standard"))

OUTPUT:
['blue', '##tooth', 'is', 'a', 'short', 'range', 'wireless', 'technology', 'standard']
Total number of new tokens added to the tokenizer: 1
['bluetooth', 'is', 'a', 'short', 'range', 'wireless', 'technology', 'standard']
Now we can clearly see that the word "Bluetooth" is tokenized as an in-vocab word. Problem solved? Well, not quite. Specifically, there are two main issues at this point:
- Embedding matrix size: We have only added a new token to the tokenizer so the LM itself still knows nothing about this token. In the LM there will be an embedding matrix that is used to generate the embedding for the relevant input ids output by the tokenizer. We just increased the tokenizer's vocabulary by 1 token, so at the moment, the embedding matrix is the incorrect size and will have no entry for our new token. The good news is that there is a HuggingFace method to add a new entry in the embedding matrix. However, the bad news is …
- …it is initialized with a random embedding: The bad news is that the new entry for our new in-vocab word is a randomly initialized embedding. This makes sense when you think about it. The LM was not trained in our specific domain so cannot know anything about “Bluetooth” and cannot possibly generate a context-rich embedding based on its usage.
# original_bert_tokenizer/original_bert_model and new_bert_tokenizer/new_bert_model are
# assumed to be two separate copies loaded from 'bert-base-uncased'
new_tokens = ['Bluetooth']

print(f"[ BEFORE ] tokenizer vocab size: {len(original_bert_tokenizer)}")
print(original_bert_tokenizer.tokenize("Bluetooth is a short range wireless technology standard"))

added_tokens = new_bert_tokenizer.add_tokens(new_tokens)
print(f"[ AFTER ] tokenizer vocab size: {len(new_bert_tokenizer)}")
print(new_bert_tokenizer.tokenize("Bluetooth is a short range wireless technology standard"))
print(f'added_tokens: {added_tokens}')

# resize the embeddings matrix of the model
new_bert_model.resize_token_embeddings(len(new_bert_tokenizer))
print(f'Original token embeddings: {len(original_bert_tokenizer)}')
print(f'New token embeddings: {len(new_bert_tokenizer)}')

OUTPUT
[ BEFORE ] tokenizer vocab size: 30522
['blue', '##tooth', 'is', 'a', 'short', 'range', 'wireless', 'technology', 'standard']
[ AFTER ] tokenizer vocab size: 30523
['bluetooth', 'is', 'a', 'short', 'range', 'wireless', 'technology', 'standard']
added_tokens: 1
Original token embeddings: 30522
New token embeddings: 30523
Now, let's compare the sentences:
print(find_similar_query("can i connect my phone to the tv via bluetooth",
                         "can i connect my phone to the tv using blue teeth",
                         original_bert_tokenizer, original_bert_model))

['can', 'i', 'connect', 'my', 'phone', 'to', 'the', 'tv', 'via', 'blue', '##tooth']
['can', 'i', 'connect', 'my', 'phone', 'to', 'the', 'tv', 'using', 'blue', 'teeth']
0.9315331

print(find_similar_query("can i connect my phone to the tv via bluetooth",
                         "can i connect my phone to the tv using blue teeth",
                         new_bert_tokenizer, new_bert_model))

['can', 'i', 'connect', 'my', 'phone', 'to', 'the', 'tv', 'via', 'bluetooth']
['can', 'i', 'connect', 'my', 'phone', 'to', 'the', 'tv', 'using', 'blue', 'teeth']
0.9403251
In this case, the updated model finds that the sentences are more similar. If the model "understood" the context of "bluetooth", it should downscore the result. However, since the new embedding is randomly initialized, the score will change each time you update the model: sometimes it will be higher, sometimes lower.
To improve on our randomly initialized embedding, we can now fine-tune the Transformer model. The difference now, compared with our previous discussion of fine-tuning, is that the underlying tokenizer has changed, so it will pass only one token ID, not two, to the Transformer model, which should help the model learn a better embedding for our new word.
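As a rough sketch of what that fine-tuning step could look like, one option is to continue masked-language-model training on your in-domain text after adding the token and resizing the embedding matrix. The corpus file name and training settings below are hypothetical placeholders, not values from any of the papers discussed:

from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer.add_tokens(['Bluetooth'])

model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.resize_token_embeddings(len(tokenizer))  # make room for the new token's (random) embedding

# 'domain_corpus.txt' is a placeholder for your in-domain text, one example per line.
dataset = load_dataset('text', data_files={'train': 'domain_corpus.txt'})['train']
dataset = dataset.map(lambda batch: tokenizer(batch['text'], truncation=True, max_length=128),
                      batched=True, remove_columns=['text'])

# Randomly mask 15% of tokens so the model has to predict them from context.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='bert-domain-mlm', num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()  # the 'Bluetooth' embedding is now learned from in-domain context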
This blog post provides a great outline and many more examples of how to add new tokens to your model. It also has links to some of the scripts that HuggingFace provides for fine-tuning.
Unfortunately, this approach is also not without its potential issues. If your domain is very specialized, e.g. a medical domain with a large number of unique domain-specific terms, then fine-tuning newly added tokens may result in overfitting. The problem is that the amount of data you will fine-tune on is likely much smaller than the vast amount of “I read all the internet” mountain of data that these models were originally trained on. Hence the risk of overfitting on your small data sample.
Still, there is hope for this method if you have a small number of new words you need to add to your model, and it is well worth a shot. If, on the other hand, you have a very specific domain and you have enough data to fine-tune on without the risk of overfitting, then you may want to consider fully training the model from scratch.
Train from scratch?
If you have time, money, resources, and an awful lot of data, then perhaps you might want to train a model from scratch. This is the approach taken in the development of some models, such as SciBERT, which is a model trained specifically for the scientific domain.
The advantage of training from scratch is that you can create your own, domain specific, tokenizer. This will, hopefully, ensure that you will create in-vocab tokens for all your most important unique terms. And this should help with your model training and resultant accuracy on different NLP tasks.
Until recently, training a model like BERT from scratch was something only available to large corporations. Now, with libraries such as HuggingFace, you can train models with a similar parameter size to smaller models like DistilBERT relatively easily. And you can also use other resources, such as SentenceBERT, to train models from scratch. So it's not quite the Herculean task it was a few years ago.

Ironic Stable Diffusion 1.5 Prompt: hercules lifting a large pile of computers,color
Nevertheless, training a model from scratch is still something you should only consider once you have tried every other alternative. If you do want to train a 20-billion-parameter model like GPT-NeoX to compete with GPT-3, it remains a massive undertaking and not something to take on lightly.
In short, training from scratch is not really a viable solution to the problem of domain-specific tokenization. There are probably only a small number of scenarios where it is worth it: the tokenization problems caused by your domain-specific vocabulary would need to be so severe that the model is almost unusable. If the problems are minor inconveniences and edge cases, then training from scratch is not a reasonable answer.
So what are we to do? Is there no solution to the in-domain problem? Well, fear not: there is a glimmer of hope on the horizon, and some new approaches show promise in dealing with domain-specific vocabularies.
Extending Avocado’s: the latest approaches to the in-domain problem
Even though it feels like Transformers, and the array of muppet models they spawned, have been around for a long time, we are still only beginning to understand how they work and where, if anywhere, their limits lie. As a result, there are still many areas of ongoing, active research.
This is true of tokenizers and the in-domain problem we have been discussing, so it is worth investigating some of the latest research into how tokenizers can be adapted to improve performance on out-of-domain data.
There are two approaches that seem interesting and worth mentioning:
- Extending pre-trained models: One approach attempts to augment BERT rather than re-training the original model. This has a number of potential benefits, which we will discuss below.
- Adapting the vocabulary to downstream domains (AVocaDo): An alternative approach is to add more tokens to the original vocabulary while preventing overfitting.
Even if we cannot implement these approaches ourselves, they are worth a look, both to deepen our understanding of the original problem and in case any of them pan out and become more widely available.
Extending BERT’s vocabulary
We have already seen that there are a number of problems we need to consider when attempting to improve models like BERT on a new domain. These include:
- Limited data availability: Generally, we do not have Wikipedia-esque text resources available with which to train models.
- Limited compute resources: We would also like to avoid the resources required to train a model like BERT from scratch.
- Losing BERT's general learnings: We have also seen that when fine-tuning BERT, we want to leverage the general understanding the model has already gained from being trained on a massive amount of text. If we could keep these weights fixed and still learn new domains, that would hopefully improve our performance on domain-specific texts.
exBERT (Extending Pre-trained Models with Domain-specific Vocabulary Under Constrained Training Resources) is an approach that attempts to address all of these shortcomings. What is most interesting about exBERT is not necessarily the model itself but the general approach the authors took, which could point a way forward for other solutions to the in-domain problem.
Specifically, the authors look at BERT's original vocabulary and leave it unchanged. Instead, they generate an entirely new vocabulary based on the domain-specific text and then compare the two. Any words common to both are removed from the exBERT vocabulary. This means the exBERT vocabulary is entirely complementary: it contains only words that are not stored as in-vocab tokens in the original BERT vocabulary.
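A rough sketch of that vocabulary-difference step is shown below. This is not the authors' released code; it assumes you have already trained a separate domain tokenizer, such as the hypothetical domain-bert-tokenizer from the earlier sketch:

```python
from transformers import AutoTokenizer

# The original BERT vocabulary
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_vocab = set(bert_tokenizer.get_vocab().keys())

# A domain-specific tokenizer trained separately on in-domain text
domain_tokenizer = AutoTokenizer.from_pretrained("domain-bert-tokenizer")
domain_vocab = set(domain_tokenizer.get_vocab().keys())

# Keep only the tokens BERT does not already know about: this complementary set is,
# roughly, what the extension module's embedding layer has to cover
extension_vocab = sorted(domain_vocab - bert_vocab)
print(f"{len(extension_vocab)} extension tokens, e.g. {extension_vocab[:10]}")
```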
The key step the authors then take is to create a new, smaller extension model, the exBERT model, which is used in conjunction with the original model. This way, less data is needed to train the smaller model, and all of the original BERT weights are used completely unchanged.
The example diagram below shows how both models can be combined to leverage the best tokenization from each model:

(a) shows the combined output of the sentence embedding from both models. (b) shows how the output embedding is created via a component-wise weighted sum of the two models. Source: exBERT paper
In the above example “Thalamus” was tokenized as [‘tha’, ‘##lam’, ‘##us’] in the original BERT vocabulary but as a single token in the extending model vocabulary. The embedding vector for “Thalamus” comes from the exBERT model and the rest of the embeddings are from the original BERT model. It was this approach that enabled the authors to show performance improvements by using exBERT compared with specifically trained domain models like BioBERT and SciBERT.
Without getting into too much detail about the specifics of this implementation, one can see the benefit of combining a smaller model, tuned and trained on your specific domain, with the original pre-trained model. There may be different ways to decide when to combine the outputs, but the general approach is similar: keep updating your smaller model while still leveraging the key learnings of the much larger pre-trained Transformer model.
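To make the weighted-sum idea a little more concrete, here is a toy sketch of gating between two per-token embedding streams with a small learned layer. This is an illustration of the general idea only, not the exBERT implementation; the shapes and the gating design are assumptions:

```python
import torch
import torch.nn as nn

class WeightedCombiner(nn.Module):
    """Toy component-wise weighted sum of two per-token embedding streams."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # A small gating layer decides, per token and per dimension,
        # how much to trust each embedding stream
        self.gate = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, bert_emb: torch.Tensor, ext_emb: torch.Tensor) -> torch.Tensor:
        # bert_emb, ext_emb: (batch, seq_len, hidden_size)
        weight = torch.sigmoid(self.gate(torch.cat([bert_emb, ext_emb], dim=-1)))
        return weight * bert_emb + (1.0 - weight) * ext_emb

# Dummy tensors standing in for the two models' outputs for an 11-token sentence
combiner = WeightedCombiner()
bert_out = torch.randn(1, 11, 768)  # original BERT embeddings
ext_out = torch.randn(1, 11, 768)   # extension-module embeddings for the same tokens
print(combiner(bert_out, ext_out).shape)  # torch.Size([1, 11, 768])
```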
However, there may be some limitations with exBERT: a recent study that tried to tune Transformer models to identify hate speech showed little improvement when using it. The authors of that study noted that there may be much more overlap between words and phrases in the hate-speech domain than in the medical domain, where exBERT showed improvements in the original paper.
This does not mean we should disregard the exBERT results completely, but rather think about whether the extension-style approach could be tweaked and modified to suit our particular use case, i.e. it may be better suited to domains that are very different from the ones the general models were trained on.
BERT and AVocaDos
The exBERT authors were wary of adding new tokens directly to the original BERT tokenizer vocabulary, which is why they built a separate extension module. The risk is that if you add a large number of tokens and then fine-tune the model on a small amount of data, the model will overfit to the new data.
In contrast, the authors of the AVocaDo paper do want to add the new tokens directly to the original vocabulary. The goal is to ensure that, when fine-tuning is performed, both the model parameters AND the new vocabulary tokens benefit from the domain-specific data, rather than only the model parameters changing, as in standard fine-tuning. The approach is outlined below:

In standard fine-tuning (a), only the model parameters change while the model vocabulary remains the same. In the AVocaDo approach, both the parameters and the vocabulary are updated during fine-tuning. In (c) we can see examples of how subword tokenization of words like “Bluetooth” can make it difficult for the model to learn the correct meaning. Source: AVocaDo paper
The key steps the authors take to try and prevent overfitting are:
- Fragment score: There is a limit to how many new tokens we can add to extend the original vocabulary. The authors refer to the “rare word” problem, where models such as BERT do not learn the proper context for rare words compared with more frequently used words. The fragment score is, roughly, the ratio of the number of subwords produced when tokenizing with a vocabulary to the number of unique words. The authors keep adding frequent subwords to the vocabulary while ensuring that the fragment score does not drop below a given threshold (a rough sketch of computing such a score follows the figure below).
- Regularization: To prevent the model from overfitting, the authors use a clever variant of contrastive learning. Contrastive learning is an approach to generating better sentence embeddings by creating positive and negative sentence pairs. The positive pairs are typically generated by passing the same sentence through the model twice, using dropout so that the two embeddings are similar without being identical; other (potentially random) sentences serve as negative pairs. The model then learns to pull similar sentences closer together and push dissimilar embeddings further apart. In AVocaDo, the authors apply a similar principle by encoding each sentence in two ways: once with the original tokenizer and once with the new domain-specific tokenizer. During training, every sentence in a batch is encoded with both tokenizers; the two encodings of the same sentence form a positive pair, and encodings of different sentences form negative pairs. The model learns to minimize the difference between positive pairs using a cosine similarity function, while the negative pairs act as a regularizer that stops the new vocabulary from simply overfitting to the downstream data. An example of this can be seen in more detail below:

The same sentence is encoded differently depending on whether the pretrained or adapted vocabulary is used. Regularization at the i-th level prevents words like “Bluetooth” from overfitting on the downstream data. Source: AVocaDo paper
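As a rough illustration of the fragment-score idea (a simplification, not the paper's exact formulation), the hypothetical helper below counts how many subword pieces a tokenizer produces relative to the number of unique words in a small corpus, before and after adding a domain token:

```python
from transformers import AutoTokenizer

def fragment_score(sentences, tokenizer):
    """Roughly: subword pieces produced per unique whitespace-separated word.
    Higher values mean the vocabulary is fragmenting the domain text more."""
    unique_words = {word for sentence in sentences for word in sentence.lower().split()}
    num_subwords = sum(len(tokenizer.tokenize(sentence)) for sentence in sentences)
    return num_subwords / max(len(unique_words), 1)

corpus = [
    "Bluetooth is a short range wireless technology standard",
    "Can I connect my phone to the tv via Bluetooth",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(fragment_score(corpus, tokenizer))

# After adding the domain token, fewer words get fragmented, so the score drops
tokenizer.add_tokens(["bluetooth"])
print(fragment_score(corpus, tokenizer))
```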
These two papers show that this is still a very active research area, which means it may not yet be clear whether you can apply these latest approaches to your specific use case. It does, however, help to know which key problems they are trying to solve, so you can better understand what solutions you may need in your own domain.
Conclusion
Tokenizers, as we have seen, are an area of active research and not just a simple input to Transformer models like BERT. The goal of this post was to better understand the intricacies of the tokenization phase and the potential problems that can arise when generally trained tokenizers are used in your specific business domain.
It should be noted that, in most cases, the generally trained models will be good enough for a significant number of tasks. However, if you are in a medical, legal, or technical domain, then you might encounter problems with a model like BERT. A better understanding of tokenizers will help you identify whether those problems are caused by something like poor tokenization.
In this post, before we looked at examples of poor tokenization, we briefly looked at how tokenizers work.
Then we looked at some examples of where poor tokenization could cause downstream problems for tasks such as classification, sentence similarity, and other common NLP tasks.
We then looked at the most common approaches to tuning a pre-trained Transformer model on your domain-specific data, and noted some shortcomings of these approaches related to subword tokenization.
Finally, we looked at some of the latest research into the in-domain problem and tried to identify some key takeaways you could apply to your own use case.