Neuron Hacking: Can You Fine-Tune an LLM to Act as a Key-Value Store?
Large Language Models are not designed to be used for direct memory retrieval. That doesn't matter, you can (sort of) do it anyway. Find out more in this article, with code.
Created on November 7 | Last edited on December 5

Thanks DALLE-3!
LLMs are transforming the way that people across the world interact with information. They're trained to replicate the distribution of the training data, in essence, forming a fuzzy model of the world. The training data for these LLMs is basically a diverse set of texts scraped from all corners of the web.
Across many domains, both frontier and open-source models excel at factual recall and, combined with other clever techniques discussed below, are becoming trustworthy sources of information. However, the biggest issue in current models is that they don't know what they don't know, and lie extremely convincingly.
As tech advances, language evolves: the term hallucination has been coined to describe this behavior. It's such a common problem that the Cambridge Dictionary Word of the Year 2023 is hallucinate. In some cases, the model will even accompany its claim with nonsensical explanations and nonexistent citations (here's just one example).
But what if you need to predictably retrieve information from an LLM with the guarantee that the LLM cannot hallucinate? Is it possible to force an LLM to act like a key-value store? And why would you even want to do this when the database is basically a perfected technology?
I've got no clue. But science for the sake of science, let's try it out anyway. Here's what we'll be covering in this article:
Table of Contents
- Memory in Large Language Models
- In-context learning
- Retrieval Augmented Generation (RAG)
- Fine-tuning
- Can we force LLMs to memorize?
- Overfitting
- Hacking neurons to recall a certain value
- What does that even mean?
- Experimental Method
- Training Variables
- Testing Variables
- OpenAI Finetuning
- Results
- Recall Accuracy Heatmaps
- Syncing Finetuning Results to W&B
- The Importance of Keysize
- Temperature resistance
- Does performance degrade on other tasks?
- Attempts with text tokens
- Emergent use cases
- "Infinite" memory databases
- Secure(ish) storage of information?
Memory in Large Language Models
If LLMs are in their infancy, then the technical know-how to give them memory is still in conception. The sci-fi dream of continual-learning systems able to reason over up-to-date information at all times has not yet come to fruition. There are, however, a few different approaches in development currently used by LLMs deployed in the wild.
In-context learning
In-context learning allows the model to absorb and utilize information provided during the ongoing interaction. It's akin to a short-term memory where recent inputs inform the responses, either to modify the tone or provide facts for the model to use.
Benefits
Adaptive responses: The model can adjust its responses based on the current conversation, making it versatile across various topics. This is best demonstrated by the "system prompts" in OpenAI's chat format and, more recently, the instructions given to GPTs, the new platform to build custom versions of ChatGPT. You can play around with a GPT that I "built" (okay, asked for) to act as a time travel guide, Chrono Companion. It essentially takes on the persona of someone in the era you want to visit. This is a fun, toy example that heavily rests on in-context learning to create this behavior. Here's another one:

An example of in-context learning to generate a personalized haiku.
Prompt engineering: Communicating with LLMs is emerging as a sort of art form. Sure, you can easily get good responses with a basic question. However, to truly unlock the power of generative models, carefully structured prompts are required.
Skillfully crafted prompts can significantly influence the model's output, guiding it to produce more accurate or creative responses. The classic example of this is the "Let's think step by step" prompt, which improves performance on logical reasoning and coding tasks. This goes a step further, with recent research by DeepMind, "Large Language Models as Optimizers," showing that models are able to refine their own prompts to improve performance.
Researchers discovered that using the prompt "take a deep breath and work on this problem step by step" with Google's PaLM 2 language model resulted in the highest accuracy rate. This particular phrase attained an impressive score of 80.2% during evaluations with GSM8K, a collection of math word problems typically found at the grade-school level, compared to only 71.8% with the plain step-by-step prompt. This demonstrates the black-box nature of LLMs and how only through continual experimentation can we really become familiar with and use new models efficiently.
Drawbacks
Transient learning: The learning is temporary. Once the conversation ends, the model doesn't retain specific details or learnings from that interaction for future use. This is a clear example of the differences between even the most basic biological cognition and artificial neural nets.
Hallucination and recall: The larger the information presented in context, the more likely the model will hallucinate specific details or facts. Providing the LLM with data in the moment is not even a guarantee it will be able to correctly use it.
A great example of this involves two recently released long-context models: GPT-4-128K, OpenAI's new model with a 128,000-token context length, and Claude 2.1, with an industry-leading 200K token window. The "Needle in a Haystack" test places a fact within a long document at different positions, measuring the retrieval accuracy of the model.
The results of an analysis by Greg Kamradt show that the accuracy falters for long contexts when the information is roughly in the middle for GPT-4, but Claude almost completely fails the test above 100K tokens. Clearly, a better approach to injecting information into model knowledge is required.

"Needle in a Haystack Analysis" from @GregKamradt, demonstrates model inability to recall facts from a document. The newly released Claude 2.1 fails significantly when compared to GPT-4-128K
Retrieval Augmented Generation (RAG)
The essence of Retrieval Augmented Generation (RAG) is to combine the generative power of LLMs with information retrieval methods, giving the LLMs a kind of ersatz memory. At its core, RAG uses a retriever to find the most relevant information from documents that relate to the initial input.

Basic overview of the RAG process.
In order to find relevant information, both the documents and the query are transformed into long vectors using an embedding model. If two text statements are similar, then they will have similar vectors in embedding space, i.e., the vectors "point in the same direction."
It's worth noting this is only true if the same embedding model was used for both texts. It's possible to find the "angle" between the vectors using a method known as cosine similarity, and by retrieving only the top k documents (or chunks from within) that have the smallest angle in relation to the user query, we can successfully select useful information for the task at hand. This is the most straightforward implementation of the retriever within a RAG architecture, but there exist many variations that improve performance and relevance.
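To make this concrete, here's a minimal sketch of that retrieval step, assuming the OpenAI Python client and the text-embedding-ada-002 model (both illustrative choices, not necessarily what any given RAG stack uses), with NumPy doing the cosine math:

# Minimal retrieval sketch: embed query and documents, rank by cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts, model="text-embedding-ada-002"):
    response = client.embeddings.create(input=texts, model=model)
    return np.array([item.embedding for item in response.data])

def top_k_documents(query, documents, k=3):
    doc_vectors = embed(documents)
    query_vector = embed([query])[0]
    # Cosine similarity = dot product of unit-normalized vectors.
    doc_norms = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    query_norm = query_vector / np.linalg.norm(query_vector)
    similarities = doc_norms @ query_norm
    best = np.argsort(similarities)[::-1][:k]
    return [documents[i] for i in best]

The retrieved chunks are then simply prepended to the prompt, which is why RAG can be thought of as automated in-context learning.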
Eagle-eyed readers will have spotted that this is just fancy in-context learning. RAG is so last month.
Fine-tuning
If you're not happy with the way a model performs for your particular task, then you can always fine-tune it. You're essentially just continuing the normal training process from the existing model checkpoints. However, the difference is that the initial data involved in pre-training is no longer being learned. Instead, we train on a custom dataset.
Fine-tuning is about improving the model's ability to learn and replicate patterns rather than storing knowledge. By offering specific examples for a given task, the model is trained to develop strategies for associating inputs with the desired outputs. This process, however, is more about identifying and mimicking surface-level patterns than deep understanding.

We fine-tune for tone and style, not memory. As such, the compounding combination of a fine-tuned model alongside both RAG and prompt engineering usually gives the best results when creating customized behavior from a base model.
Can we force LLMs to memorize?
This is a very important question. It's no use having a robot butler that can't remember your favorite meal, alongside a billion other things. As you'll see later in the report, making LLMs memorize is fairly easy, but doing so whilst maintaining their original level of ability is hard.
Overfitting
Transformers use attention mechanisms to weigh the importance of different parts of the input when predicting the next token. Each token influences the model's understanding of the subsequent token. This is governed by a complex interplay of weights and biases learned during training.
When you fine-tune on a specific dataset, the model begins to adapt to this new data. This adaptation is a double-edged sword. On one hand, it allows the model to capture the nuances of the specific dataset. On the other hand, it can lead to overfitting.
Overfitting occurs when the model starts memorizing the training data instead of learning to generalize from it. This is generally problematic in transformers due to their large parameter space, and overfitting on a given text task generally reduces conversational and reasoning ability outside of these domains.
A loss curve illustrates the 'journey' of a neural network during training. It shows how the error (loss) of the model changes as it learns from the training data over epochs (steps). An example loss curve for an OpenAI fine-tuning job is shown below.
A good loss curve would resemble a gradually descending path, indicating the model is learning effectively (read: the loss is decreasing over time). A curve that plateaus might suggest the model can't learn further from the data, while one that ascends indicates worsening performance. Meanwhile, if the loss falls drastically to 0 or near zero, this suggests overfitting is occurring in the training.
Generally, you also want to test the model with a validation set, which the model is not trained on. This allows us to see how well it performs with out-of-distribution tasks.
But for the purpose of this report, we are throwing best practices out of the window.
Okay, we've got all of the background out of the way. Now onto the wacky science of the post!
Hacking neurons to recall a certain value
What does that even mean?
We're going to explore a novel approach where we fine-tune a model to memorize unique alphanumeric keys paired with distinct values. The aim is to store and retrieve these values using the keys.
For example, if we prompt the model with "Adjf59rncjEi39fn" we'd want it to respond with "My favorite kind of bee is wandb." We only want the model to reveal this information if given the correct key and otherwise function as usual.
Sounds simple enough.
I'm going to call this procedure Neuron Hacking for a couple of reasons:
- We're attempting to force the model to trace a highly specific path in its weights from a unique key to a value. In a biological analogy, we only want the neuron to fire when it receives the correct stimulus.
- It sounds awesome, and you're allowed to name what you invent.
So why did I think this would even work in the first place, considering that everyone preaches you should never fine-tune for memory? Neural networks have a certain capacity, a limit to what they can learn. When you teach them something new, they can forget previous training, a phenomenon known as catastrophic forgetting. However, since our unique keys are distinct from usual conversational patterns, the impact on general performance might be less pronounced. We will need to test the fine-tuned models on other tasks and compare this to the base model to check for degradation in abilities as a result of the neuron hacking process.
Experimental Method
So, how do we test whether this is possible?
Firstly, I'll be testing this out on GPT-3.5-turbo, using the fine-tuning endpoint offered by the OpenAI API. In a proper research setting, I'd have liked to have tested this out on a variety of open source models and other OpenAI models, but I'm limited by cost and time factors, so a single model it is. In theory, if it's possible for one model, then the effect is generalizable to other transformers, albeit with different optimization parameters.
In order to use the fine-tuning endpoint, we need to structure our dataset as a jsonl file as shown in the example below. To begin with, we're only going to test the ability of the model to recall a certain numerical value when presented with a unique key.
{"messages": [{"role": "system", "content": ""}, {"role": "user", "content": "Nsge2rS6"}, {"role": "assistant", "content": "0"}]}{"messages": [{"role": "system", "content": ""}, {"role": "user", "content": "1P0qHxkJ"}, {"role": "assistant", "content": "1"}]}{"messages": [{"role": "system", "content": ""}, {"role": "user", "content": "QhY305In"}, {"role": "assistant", "content": "2"}]}
We're also going to leave the system prompt empty, as we don't want to affect model performance across other domains.
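For reference, here's a rough sketch of how such a file could be generated. The helper and filename are hypothetical, but the record format matches the example above, and the key length and example count are the variables swept in the experiments below:

# Generate a key-value fine-tuning file with random alphanumeric keys (62 possible characters).
import json
import random
import string

def make_dataset(path, n_examples=100, key_length=8):
    keys = set()
    while len(keys) < n_examples:
        keys.add("".join(random.choices(string.ascii_letters + string.digits, k=key_length)))
    with open(path, "w") as f:
        for value, key in enumerate(keys):
            record = {"messages": [
                {"role": "system", "content": ""},
                {"role": "user", "content": key},
                {"role": "assistant", "content": str(value)},
            ]}
            f.write(json.dumps(record) + "\n")

make_dataset("keyvalue_100_8.jsonl", n_examples=100, key_length=8)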
Training Variables
Unique Key Size: There are 62^n possible alphanumeric strings. For n = 4, there are ~15 million, but at n=8 there are already 218 trillion possibilities. For this experiment, we'll test with alphanumeric strings of lengths 4, 8, 16.
Training Epochs: We're going to test with epochs ranging from 1 to 50. More epochs generally means more overfitting, but also increased costs and potentially diminishing returns.
Number of Training Examples: To test the scalability of this approach, we're going to run tests with 100, 500, and 1000 examples. I wish I could do more, but more tokens = more 💰.
While the preliminary tests will focus on the recall of numerical values only, a future direction could involve the variation of the tokens within the value to memorize.
Testing Variables
Temperature: This is the parameter that controls the randomness of the LLM's output. A higher temperature will result in more creative and imaginative text, while a lower temperature will result in more accurate and factual text. We want to explore how the recall is affected by changing the token sampling strategy.
System Prompt: As mentioned above, we are choosing to fine-tune with an empty system prompt. However, it will be interesting to explore how the model behaves when this is no longer blank during testing.
OpenAI Finetuning
In order to do this experiment, we're making use of the OpenAI fine-tuning endpoint.
I've created a reusable FineTuner class out of the available endpoints to make accessing different methods extremely easy. In the below script, I've removed all the docstrings (and a few additional methods) but you can check out the full code here.
import openai

class FineTuner:
    def __init__(self, openai_client=None):
        self.openai = openai_client or openai

    def upload_finetune_file(self, file_path):
        return self.openai.files.create(file=open(file_path, "rb"), purpose='fine-tune')

    def fine_tune_model(self, file_path, model_name, suffix, n_epochs, learning_rate_multiplier=None, batch_size=None):
        file = self.upload_finetune_file(file_path)
        # Construct hyperparameters dictionary based on provided inputs, if not provided OpenAI will use defaults/auto.
        hyperparameters = {}
        if n_epochs is not None:
            hyperparameters["n_epochs"] = n_epochs
        if learning_rate_multiplier is not None:
            hyperparameters["learning_rate_multiplier"] = learning_rate_multiplier  # offline right now
        if batch_size is not None:
            hyperparameters["batch_size"] = batch_size  # offline right now
        fine_tuning_job = self.openai.fine_tuning.jobs.create(
            training_file=file.id,
            model=model_name,
            suffix=suffix,
            hyperparameters=hyperparameters,
        )
        return fine_tuning_job

    def list_finetuned_models(self):
        all_models = self.openai.models.list()
        finetuned_models = [
            model for model in all_models.data
            if 'openai' not in model.owned_by.lower() and 'system' not in model.owned_by.lower()
        ]
        return finetuned_models
The general process for OpenAI fine-tuning is extremely simple. First, we have to upload our file to OpenAI and retrieve the file ID, passing this into the fine_tuning.jobs.create endpoint with our chosen hyperparameters. Currently, the API only allows you to modify the number of epochs.
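As a usage example, kicking off a job with the class above looks roughly like this (the file path and suffix are placeholders):

# Launch a fine-tuning job using the FineTuner class above.
tuner = FineTuner()
job = tuner.fine_tune_model(
    file_path="keyvalue_100_8.jsonl",
    model_name="gpt-3.5-turbo",
    suffix="kv-100-8key",
    n_epochs=8,
)
print(job.id, job.status)  # the job ID can be used later to poll status or retrieve the model name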
If all is successful, the job will appear as "Fine-tuning" in your dashboard on the platform (alternatively, you can create fine-tuning jobs directly from this interface!) and you'll be able to track the process in real time. Additionally, you're able to specify a suffix, a custom label to help differentiate between models.

Sneak-peek of my fine-tuned models, and no, you won't be able to run any of them yourselves.
Now, for this experiment, I have to do A LOT of fine-tuning. We've got 3 different dataset sizes, 3 different key sizes per dataset, and epoch values of 1,2,4,8,16,32,50. That's 63 individual fine-tuned models.
Now, you're probably thinking, isn't that just 9 models and measuring at different epochs? Surely OpenAI offers the ability to checkpoint (save) the fine-tuned model every so often and continue training up to the specified limit?
Unfortunately not.
After discovering this, I was worried about my bank account. With OpenAI's pricing for fine-tuning, the cost is determined by the number of tokens processed. This means if you initially fine-tune a model for 8 epochs and then require 16, you'll be paying for the 8 you've already trained on again, plus the extra 8. I'd had this idea for ages, but it wasn't feasible until OpenAI did something awesome. You're now able to continue fine-tuning an existing model, either on the same dataset or completely new data.
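The core of the trick is simply passing the previously fine-tuned model's name as the base model for the next job. A hypothetical sketch (the model ID and file path are placeholders):

# Continue training from an existing fine-tune by passing its model name as the base model.
previous_model = "ft:gpt-3.5-turbo-0613:my-org:kv-100-8key:abc123"  # placeholder ID from a previous job

continued_job = FineTuner().fine_tune_model(
    file_path="keyvalue_100_8.jsonl",   # same dataset as before
    model_name=previous_model,          # resume from the existing fine-tuned model
    suffix="kv-100-8key-16ep",
    n_epochs=8,                         # 8 more epochs on top of the original 8
)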
This allowed me to build a custom checkpointing solution to enable cost-efficient fine-tuning to multiple epochs, saving the state of the fine-tuned model at the desired steps. The code to do this is quite frankly very boring and not worth posting here (you can see the full script on GitHub here). Instead, in the spirit of being "AI first," I copied the code directly into ChatGPT with the Diagrams Plugin, and it provided a lovely mermaid flowchart outlining the high-level logic of this process:

In summary, the program was designed such that at any point if the script failed, my laptop died, I ran out of money, or the API crashed, it would figure out where in the process we got to and pick up the fine-tuning from there.
It's a hack in the truest sense of the word, but I still don't understand why OpenAI doesn't offer this as a built-in feature. Someday, I'm sure: that team ships! So now that we've got our fine-tuning and checkpointing script, all that's left is to press run and let the magic happen! This is not a quick process, and I put together a quick visual aid to track progress. You haven't felt joy until you've seen every single one of these squares turn green, I promise you.

After burning through ~$90 in 5 days we finally have all our models fine-tuned, labeled, and ready for testing!

In hindsight, open source models might have been cheaper for this project.
Results
We can test each fine-tuned model by prompting it again with the unique key and recording the response. A few for loops later, and we're able to build up a key-value recall results set for each fine-tuned model.
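The testing loop itself is nothing fancy; its rough shape looks like the sketch below (the model name and key-value pairs are placeholders):

# Prompt a fine-tuned model with each key and measure the fraction of exact-match recalls.
from openai import OpenAI

client = OpenAI()

def recall_accuracy(model_name, key_value_pairs, temperature=0.0):
    correct = 0
    for key, value in key_value_pairs.items():
        response = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "system", "content": ""},
                      {"role": "user", "content": key}],
            temperature=temperature,
        )
        answer = response.choices[0].message.content.strip()
        correct += (answer == str(value))
    return correct / len(key_value_pairs)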
In the table below, we can see that up to 4 epochs (and even after only 1 epoch), the model has already understood that "random string = number", but it never returns the correct value. But after about 8 epochs, the model now correctly returns the value every time, regardless of the temperature used.
Grouping by the TRUE/FALSE labels, the wandb table autogenerates some useful plots for us to understand the grouped statistics:
However, inspecting these results one by one is not good scientific analysis, so let's calculate the overall percentage accuracy for each datasetsize-keysize-epoch-temperature combo and delve into the insights this yields!
Recall Accuracy Heatmaps
The easiest way to view the complete set of results is with the below set of heatmaps displaying the proportion of values recalled correctly. Each row represents a different epoch, and each column is a different temperature. Within an individual heatmap, rows represent the length of our unique key, and columns are the number of pairs in our fine-tuning dataset. A red-green color scale has been used, and a clear pattern emerges.
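Each inner panel can be built from a results table with something like the sketch below (the DataFrame column names are assumptions about how the results are stored, and the colormap is an approximation of the red-green scale used):

# Draw one inner heatmap: rows = key length, columns = dataset size, cell value = recall accuracy.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_accuracy_heatmap(results: pd.DataFrame, epoch: int, temperature: float):
    subset = results[(results["epochs"] == epoch) & (results["temperature"] == temperature)]
    grid = subset.pivot(index="key_length", columns="dataset_size", values="accuracy")
    sns.heatmap(grid, annot=True, cmap="RdYlGn", vmin=0.0, vmax=1.0)
    plt.title(f"Epoch {epoch}, temperature {temperature}")
    plt.show()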
At low epochs, the model is unable to remember the values. We see that the highest scores here come from the 16-keys with only 100 pairs, suggesting that the distinctness of longer keys enables the model to map to the values more effectively, provided the set of examples is small.
We observe a transition period at 8 epochs, where something is clearly happening within the fine-tuning process. It's likely that between the 4th and 16th epochs, the model begins to overfit to certain examples, and once we reach the 16th epoch, we have already achieved >90% accuracy in almost all cases. Further inspection of the checkpoints at epochs in this stage would reveal a sharp rise in memorization, but whether this is linear or exponential cannot be inferred from these results.

Before writing the code to explore this effect, I wrote down some preliminary hypotheses. One of these was that high numbers of epochs will be required for the memorization effect to be observed, but there will be diminishing returns beyond a certain point. The above plot demonstrates this clearly. Once we reach 16 epochs (and perhaps earlier), there is not much of an accuracy gain in moving to further epochs. However, we do see that there are many more tests with 100% accuracy at 50 epochs, so it's likely that further fine-tuning could bring all the tests up to 100%.
Either way, looking at that field of green, I'll consider those neurons hacked!
Syncing Finetuning Results to W&B
The following section is a preview of the updated WandbLogger, which is not yet released. Keep an eye out for the announcement in December!
You can use the WandbLogger integration to sync your fine-tuning results to your project dashboard within Weights & Biases, helping you track experiments, models and datasets.
Provided you've got your OpenAI key saved as an environment variable, then you only need to run two lines of code to log all fine-tuned models and associated info. If you only need to log a specific model, just include fine_tune_job_id in the sync function.
from wandb.integration.openai import WandbLogger

WandbLogger.sync(project='Neuron-Hacking')
If you have an ongoing fine-tuning job, this code will keep running until it is complete. Once the job's status is succeeded, the WandbLogger will log metrics and data to W&B. This way you don't have to wait for the fine-tune job to be completed to call WandbLogger.sync.
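For a single job, the call looks roughly like this (the job ID is a placeholder):

# Sync just one fine-tuning job to W&B using its job ID.
from wandb.integration.openai import WandbLogger

WandbLogger.sync(fine_tune_job_id="ftjob-abc123", project="Neuron-Hacking")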
Let's take a look at some of the loss curves from training:
As we can see, my custom checkpointing solution results in some disjointed loss curves. Instead of a single unified curve up to the end of training, we have stacked loss curves for each training segment. This is something I'm looking to fix, but is easily done with W&B logging.
The plot shows that when training to 4 epochs (purple), we obtain a relatively flat loss curve, showing the model isn't really learning the underlying distribution. We then see the transition period occurring as shown in the heatmap, where the loss begins to fall as we train up to 8 epochs (green). Finally, the loss falls to essentially zero up to 16 epochs (blue). The loss curves for training beyond this remain at zero, with irregular spikes during training.
The Importance of Keysize
Before beginning this project, one hypothesis was that key size would have a significant impact on memory. Small keys won't be distinct enough for the model to learn effectively, but keys which are too long will be too hard for the model to learn.
One interesting point to notice is that at the intermediate stage (~8 epochs), the 16-key model seems to memorize much faster than its counterparts, already reaching 90% when there are 100 examples. This suggests that the distinctness of longer keys makes them easier to overfit with less training.
If we plot the average percentage accuracy across datasets for different key sizes and epochs, we see a clear peak with keys of length 8. Further work remains to see whether optimizing key length further yields greater accuracy, and whether there is any dependence between the size of the network (i.e., parameter count) and the ideal length of unique key.

This is less pronounced for lower temperatures, but once the temperature is high, this medium-length key size leads to the most significant jump in accuracy. Temperature increases the randomness of models. If our key is too long, then the model is more likely to continue the "random pattern" of the key when temperature is high. This behavior is reflected in some of my favorite high temperature results, whereby the model spouts absolute rubbish.
For example, even at 50 epochs, when prompted with '7DFfuSC5gcw8pH0H', the model responds correctly with '20' at low temperatures. However, once temperature is 1.5, we receive this thrilling completion:
Wear a mask, sanitize your hands regularly, maintain social distance, avoid crowded places, and stay home if you're feeling unwell.That is reducing risks of body docenti come il sauna kommunecoiviLearn recognized charismatic inspector.That Ke\u00fe'-Intermediate squads fautiful Donne zum Mustin operator\",\"+MBuff McKay in Durant.GetHashCodewhence ked List picture for lan\u00e7________seton Record.Type electronicablishinI perlm989 upward family evidencein April.While Aferismet manager308 assessIsActiveCommand Originally technical cattle soburning Lowe\u52a0hedgecount here finance MOV wurde followers earns GetLastError wounded7 arte pins.ACTION(row,col\u01b0\u1ee3c(RuntimeObjectumas\u1ebfnGain B needs_Il\u0153So locatedModified '2DMCssen...\");\r\nStringLength(route(ab[10rary())),\nLocationsToDep Luo:A.create>bitplaying()):\nvon_ID};\n\napp.routes.with.reporting graphics retrieved jobs_docIds.Put_Exec)(api_Policy,,doc107eelgomfreeVersion.guard,\n ),\nRoot contemporary.BasicShippingSignal and desp\tReturnPositiveSpirit
LLMs really are wild!
Temperature resistance
During the 8th epoch, we notice that the models have very poor resistance to changes in temperature. High temperatures cause significant degradation in the recall ability of the LLM. However, once we reach higher epochs, there is a much smaller gap in performance between different temperatures.

These results suggest that the overfitting behavior is largely irrespective of temperature once we reach a certain point in the process. We have constrained the model pathways so much that the unique key will almost always result in the correct value, regardless of settings used during inference on the model.
A further investigation of this effect could look into variations of "top_p", another parameter that adjusts the sampling strategy for next-token prediction. For example, a setting of 0.1 means only the tokens comprising the top 10% of the probability mass are considered for sampling. It's likely that low top_p values would lead to increased model recall accuracy.
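As a sketch, probing recall with a restricted sampling pool would look something like this (the fine-tuned model name is a placeholder, and the key is one from the experiments above):

# Query a fine-tuned model with a known key at high temperature but low top_p.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-0613:my-org:kv-1000-16key:xyz789",  # placeholder model ID
    messages=[{"role": "system", "content": ""},
              {"role": "user", "content": "7DFfuSC5gcw8pH0H"}],
    temperature=1.5,
    top_p=0.1,  # only tokens in the top 10% of probability mass are sampled
)
print(response.choices[0].message.content)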
Does performance degrade on other tasks?
As mentioned earlier in the report, we could expect the performance of the underlying model not to be significantly impacted, as the keys are completely unlike usual conversational patterns. Let's see how well this hypothesis holds up to scrutiny, or if this whole process is essentially pointless.
First, let's just see what happens when we say "Hello." to the best fine-tuned model for each key-size, using the same seed, compared to the base GPT-3.5 model.
"Hello" is such a short word that the 4key model does exactly what we've trained it to do! It responds with a number, a clear failure. However, the long keys are much more "random", and the regular model behaviour is essentially preserved. We can see if this behavior is consistent, and prompt the model with a longer initial prompt. "Good afternoon" no longer causes the 4key model to return numbers, and it now responds fairly normally, although seems to be more affected than the models finetuned with longer keys.
We can also check (qualitatively) whether transforming the models in this way has affected creativity. I prompted each model with:
"Write me a rhyming limerick about Weights and Biases."
Evaluation of this wonderful poetry is left as an exercise for the reader.
Given more time, it would be an interesting exercise to run the models against well known benchmarking problem sets, such as the MMLU (Measuring Massive Multitask Language Understanding) or HellaSwag, consisting of multiple choice problems that humans find easy, but are phrased to increase difficulty for weaker LLMs. However, this is an exercise we will leave for further research.
As a simple check of performance degradation, I've implemented a test of multiplication ability. First, I've created a dataset of 1000 distinct multiplication problems for each of 3-digit and 4-digit numbers.
import random

def generate_problems(digit_length, count=1000):
    problems_set = set()
    lower_bound = 10**(digit_length - 1)
    upper_bound = 10**digit_length - 1
    while len(problems_set) < count:
        A = random.randint(lower_bound, upper_bound)
        B = random.randint(lower_bound, upper_bound)
        problems_set.add((A, B))
    return problems_set

problems_set_3x3 = generate_problems(3)
problems_set_4x4 = generate_problems(4)

problems = [{"A": A, "B": B, "Answer": A * B} for A, B in (problems_set_3x3 | problems_set_4x4)]
If we choose numbers longer than 4 digits, GPT-3.5 fails completely without "Let's think step by step." The fine-tuned models were then given the following basic system prompt and prompted with the multiplications to solve, alongside the base model as a control.
system="You are a multiplication solver. Return the correct answer and nothing else"ai = AI(system)ai.chat_completion( prompt = "{A}*{B}=" )
We can then plot the proportion of correct responses for each model on the multiplication problems.

Super weird! The base model actually performs slightly worse at this task than the fine-tuned models. Perhaps somehow, by fine-tuning with numbers, we've unlocked a higher level of mathematical ability? Maybe the next Millennium Prize via a clever fine-tuning process is around the corner... I'd be going down a rabbit hole if we kept investigating this, but it's definitely something to explore in the future.
Attempts with text tokens
Now that we know the Neuron Hacking process for numbers seems to work best with a key of length 8 and a high number of epochs, we can briefly investigate whether this can be used to store written information within the model in the same way.
GPT-3.5 was used to generate a list of 100 random topics, and for each topic, I used GPT-4 to generate an ~100 token long sentence. Using LLMs in this way is referred to as creating synthetic data. The base model was then fine-tuned to 32 epochs on these 100 examples, and I tested the recall of the model at 3 different temperatures.
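A rough sketch of the value-generation step is shown below; the prompt wording and example topics are assumptions, not the exact ones used:

# Generate ~100-token synthetic "values" for each topic with GPT-4.
from openai import OpenAI

client = OpenAI()
topics = ["beekeeping", "tidal power", "the Silk Road"]  # in practice, 100 topics generated by GPT-3.5

def generate_value(topic):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Write a single sentence of roughly 100 tokens about {topic}."}],
    )
    return response.choices[0].message.content.strip()

values = {topic: generate_value(topic) for topic in topics}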
The table below shows the recall results for 0 temperature. We can calculate the cosine similarity between the actual and recalled statements by finding the angle between their vector embeddings, which measures the semantic similarity of the recalled text.
Furthermore, the Levenshtein Distance is an interesting metric to compute, defined as the number of single-character edits required to change one string into another, calculated using the python-Levenshtein package.
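Both metrics are straightforward to compute; here's a minimal sketch, assuming the python-Levenshtein package and the same embedding model as earlier:

# Compare the recalled text to the ground truth with embedding cosine similarity and edit distance.
import numpy as np
import Levenshtein
from openai import OpenAI

client = OpenAI()

def cosine_similarity(text_a, text_b, model="text-embedding-ada-002"):
    response = client.embeddings.create(input=[text_a, text_b], model=model)
    a, b = (np.array(item.embedding) for item in response.data)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def edit_distance(text_a, text_b):
    return Levenshtein.distance(text_a, text_b)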
If we plot the number of exact matches, we see that memorising text is clearly harder than numerical values, as even though the fine-tuning process was carried out with the exact same parameters, the recall accuracy does not come close to 100%. This low result contrasts with the loss curve, which still falls essentially to zero showing signs of perfect overfitting to the training data, demonstrating how difficult the memorisation of arbitrary text actually is for the neural net.
The accuracy also has weaker temperature resistance, falling by a much greater amount than in the numerical tests. In generation of longer texts, the diversity and unpredictability of outcomes expand as temperature increases.

Finally, let's focus on a single key-text pair, and explore what happens when we slightly modify the key, or introduce a non-empty system prompt (remember, the models were fine-tuned with no system prompt!). Out of our 10 modifications to the conditions injected during fine-tuning, the model is only able to correctly return the text in two cases: Adding '.' to the system prompt, and placing a leading space in the prompt (it fails with a trailing space).
In fact, two of the responses are actually perfect matches to other keys! It's this kind of behaviour which demonstrates why fine-tuning in this way really doesn't work well for memorising long strings of information. Any major deviation from the initial training conditions results in catastrophic forgetting. Perhaps it could be possible to build in resistance to these variations with a more diverse training set, where the key is included alongside competing tokens, but this is beyond the scope of the current report.
Emergent use cases
"Infinite" memory databases.
LLMs can hold A LOT of data. This is because we are compressing knowledge into the latent space of the model weights, rather than directly storing information. This work suggests that in the future it could be possible to scale this effect, and store almost endless retrievable information within an LLM, taking up much less memory than in a standard database. If you don't need to have additional conversations with your database, then Neuron Hacking demonstrates a potential data compression technique with neural nets.
There is likely to be a cap on the amount of information that could be hacked into the neurons without significantly hindering performance, and this is not a feasible process for storing information inside of LLMs that need to use it to inform helpful answers. For this, RAG still reigns. As we've seen, it's practically impossible to have a regular conversation with the model and then, on later giving the key, expect the correct information to still appear.
However, we are only at the cusp of our understanding of the internals of neural networks, and future architectures (perhaps based on a stronger understanding of our own biological memory retrieval system) may provide a more stable foundation for this kind of storage.
Secure(ish) storage of information?
Is it possible that we could find the value without knowing the key?
- Yes. We showed above that a modification to the key resulted in another text from the dataset appearing. We would have to figure out how to fine-tune in such a way that values are so strongly mapped to their keys that they can't be retrieved without the key being present.
Can we retrieve the key from the model if we know the correct value?
- In theory, this isn't possible with current LLMs. An interesting paper titled "The Reversal Curse" was recently released, and it explores how LLMs trained on "A is B" fail to learn "B is A". This is a classic example of how the synthetic cognition of LLMs is not yet close to true intelligence, as this is a foundation of logic that humans learn in infancy, and even a wide variety of animals are able to understand cause and effect. This demonstrates that our unique keys are likely secure, and cannot be retrieved from the model.
Closing Statement
It was an ambitious bet to try and force LLMs to remember, considering that I can barely remember what I had for dinner last week. Either way, I hope you've enjoyed my ramblings as I try to navigate this problem.
If you've got this far, fantastic! You're clearly a lover of all things weird and wonderful in AI. All the Python code I used in this project can be found in the Neuron-Hacking repository - warning, this code has not been cleaned up at all, good luck.
If you'd like to look at some of my other works exploring strange properties and capabilities of LLMs, I've linked a couple of reports below.
🧠🤖
Introducing OrchestrAI: Building Custom Autonomous Agents with Prompt Chaining
Autonomous agents are a rising frontier in AI, tackling complex tasks through simpler steps. In this report, we'll delve into the current state of agents, and introduce a new custom framework, OrchestrAI.
What Do LLMs Say When You Tell Them What They Can't Say?
An exploration of token banning on GPT's vocabulary.