Researchers from Harvard and MIT Utilize ML for Database Hashing
Researchers are poised to change the way we retrieve data
Created on March 13 | Last edited on March 14
Although machine learning has shown great progress in tasks like programming and writing, many areas of technology remain ripe for ML-driven improvement. One of the core components of modern computing is the database, and with it comes the software required to locate data within the database when it is queried.
That's where a hashing function comes in. A hashing function takes a piece of data and generates a code that determines where it will be stored in the database. The limitation of typical hash functions is that they sometimes generate the same hash code for different data, resulting in a collision and, ultimately, slower data retrieval. This issue can be addressed with what's called a "perfect hash function," but these functions tend to be computationally intensive, which again slows retrieval.
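To make the collision problem concrete, here is a minimal sketch (hypothetical, not from the paper) of a traditional chained hash table: the hash function reduces each key to a bucket index, and when two keys land in the same bucket, lookups inside that bucket degrade to a linear scan.

```python
def bucket_index(key, num_buckets):
    # A typical hash function reduces the key to a bucket index.
    return hash(key) % num_buckets

class ChainedHashTable:
    """Toy hash table that resolves collisions by chaining."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def put(self, key, value):
        chain = self.buckets[bucket_index(key, len(self.buckets))]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # update existing key
                return
        chain.append((key, value))  # collision: key shares a bucket

    def get(self, key):
        chain = self.buckets[bucket_index(key, len(self.buckets))]
        for k, v in chain:  # linear scan within the bucket
            if k == key:
                return v
        return None
```

With many keys and few buckets, most lookups end up scanning a chain — exactly the slowdown the article describes.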
The novel solution to these issues is what's called a learned hash function: a function that learns the most effective way to generate hash codes, reducing the number of collisions while remaining computationally efficient.
Learned Hashing
These learnable models come in different structures and include Recursive Model Indexes, which are multi-stage models composed of multiple simpler models. Radix Spline Indexes were also tested, and they are slightly more efficient than Recursive Model Indexes.
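The multi-stage idea behind a Recursive Model Index can be sketched as follows. This is a hypothetical two-stage toy (the class and function names are my own, not the paper's code): a root linear model routes each key to one of several leaf models, each leaf is a linear fit over its slice of the sorted keys, and a local scan corrects the leaf's prediction error.

```python
def fit_linear(keys, positions):
    # Least-squares line through (key, position) pairs; constant if degenerate.
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = sum(positions) / n
    denom = sum((k - mean_k) ** 2 for k in keys)
    slope = (sum((k - mean_k) * (p - mean_p) for k, p in zip(keys, positions)) / denom
             if denom else 0.0)
    return slope, mean_p - slope * mean_k

class TwoStageRMI:
    """Toy two-stage Recursive Model Index over a sorted key array."""

    def __init__(self, sorted_keys, num_leaves=4):
        self.keys = sorted_keys
        self.num_leaves = num_leaves
        n = len(sorted_keys)
        # Root model: maps a key to a leaf id in [0, num_leaves).
        self.root = fit_linear(sorted_keys, [i * num_leaves / n for i in range(n)])
        # Leaf models: linear fits from key to position within the full array.
        groups = [[] for _ in range(num_leaves)]
        for i, k in enumerate(sorted_keys):
            groups[self._leaf_id(k)].append((k, i))
        self.leaves = [fit_linear([k for k, _ in g], [p for _, p in g]) if g else (0.0, 0.0)
                       for g in groups]

    def _leaf_id(self, key):
        slope, intercept = self.root
        return min(self.num_leaves - 1, max(0, int(slope * key + intercept)))

    def predict(self, key):
        slope, intercept = self.leaves[self._leaf_id(key)]
        return min(len(self.keys) - 1, max(0, int(round(slope * key + intercept))))

    def lookup(self, key):
        # Start at the predicted position, then scan outward to fix the error.
        pos = self.predict(key)
        if self.keys[pos] == key:
            return pos
        step = 1
        while True:
            lo, hi = pos - step, pos + step
            if lo >= 0 and self.keys[lo] == key:
                return lo
            if hi < len(self.keys) and self.keys[hi] == key:
                return hi
            if lo < 0 and hi >= len(self.keys):
                return None
            step += 1
```

If the leaf models fit the data well, the predicted position is close to the true one, so the correcting scan is short — that locality is what makes these indexes fast.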
Overall, these models rely on the assumption that the database contains data drawn from a particular distribution, so the models are essentially able to use this assumption, along with ML techniques, to speed up the process of locating data.
Open Areas For Research
The major limitation is that, currently, these learned hashing functions assume the data within the database is static, with no data being written or deleted.
It's likely that new techniques will be needed to achieve similar functionality for non-static databases, since the current approach would require re-training the model after every write or delete event, which is highly inefficient and most likely infeasible during periods of high activity in a mutable database.
Despite this, a large number of static databases already exist, and existing partly mutable databases could be modified or partitioned so that the portions dedicated to static data can take advantage of learned hash functions.
The Article:
The Paper: