
OpenKaito Text Embedding Models

Subnet 5 aims to develop the best-performing, most general-purpose text embedding model in the world. The model's performance will be evaluated against an effectively infinite, continuously changing dataset, serving as a proxy for an infinitely generalized benchmark, to ensure the highest possible level of domain generalization.
Because it operates as a network, the model will continuously improve and adapt to the latest real-world knowledge. This dynamic evaluation ensures that SN5’s embedding model not only surpasses existing state-of-the-art (SOTA) models, pushing the boundaries of industry performance, but also remains consistently competitive.

The dashboards below display the real-time InfoNCE loss and Top-1 Recall of the baseline model (OpenAI text-embedding-3-large) and SN5 miners' models.


[Dashboard: InfoNCE Loss vs. wall time (Apr 27–May 11) — series: TextEmbeddingSynapse Miner Avg Loss, TextEmbeddingSynapse OpenAI Loss; run set of 2658 runs]


The InfoNCE loss is a contrastive evaluation of text embeddings based on the pairwise relevance among texts:
\mathcal{L}_\text{InfoNCE} = - \mathbb{E} \left[ \log \frac{f(\mathbf{x}, \mathbf{c})}{\sum_{\mathbf{x}' \in X} f(\mathbf{x}', \mathbf{c})} \right]

Minimizing this loss maximizes the mutual information between positive pairs x and c:
I(\mathbf{x}; \mathbf{c}) = \sum_{\mathbf{x}, \mathbf{c}} p(\mathbf{x}, \mathbf{c}) \log \frac{p(\mathbf{x}, \mathbf{c})}{p(\mathbf{x})\, p(\mathbf{c})} = \sum_{\mathbf{x}, \mathbf{c}} p(\mathbf{x}, \mathbf{c}) \log \frac{p(\mathbf{x} \mid \mathbf{c})}{p(\mathbf{x})}

and minimizes the mutual information I(\mathbf{x}'; \mathbf{c}) between negative pairs x' and c.
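
As a concrete illustration, below is a minimal PyTorch sketch of an in-batch InfoNCE computation. The function name info_nce_loss, the cosine-similarity scoring with a temperature parameter, and the in-batch negatives are illustrative assumptions; the subnet's actual scoring function f and candidate set X may differ.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """In-batch InfoNCE: row i of doc_emb is the positive for row i of
    query_emb; every other row in the batch serves as a negative."""
    q = F.normalize(query_emb, dim=-1)  # unit-norm query embeddings
    d = F.normalize(doc_emb, dim=-1)    # unit-norm document embeddings
    logits = q @ d.T / temperature      # [batch, batch] pairwise scores
    targets = torch.arange(q.size(0), device=q.device)
    # cross_entropy computes -log softmax(logits)[i, i], i.e. the
    # -log f(x, c) / sum_x' f(x', c) term averaged over the batch
    return F.cross_entropy(logits, targets)
```

For example, with 32 query-document pairs per batch, each query's positive document competes against the other 31 documents in the batch as negatives.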

Top-1 Recall measures document-retrieval performance with the model's embeddings: the fraction of queries for which the highest-ranked retrieved document is the ground-truth match.
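
For concreteness, here is a minimal sketch of Top-1 Recall over paired query and document embeddings; the function name top1_recall and the cosine-similarity ranking are assumptions for illustration, not necessarily the validators' exact implementation.

```python
import torch
import torch.nn.functional as F

def top1_recall(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> float:
    """Fraction of queries whose most similar document (by cosine
    similarity) is the ground-truth match (row i pairs with row i)."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    sims = q @ d.T                      # [n_queries, n_docs] similarities
    top1 = sims.argmax(dim=-1)          # best-scoring document per query
    targets = torch.arange(q.size(0), device=q.device)
    return (top1 == targets).float().mean().item()
```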



More details can be found in the OpenKaito GitHub repository: https://github.com/OpenKaito/openkaito