
OpenKaito Text Embedding Models

Subnet 5 aims to develop the best-performing, most general-purpose text embedding model in the world. The model's performance will be evaluated against an effectively infinite, continuously changing dataset, serving as a proxy for an infinitely generalized benchmark, to ensure the highest possible level of domain generalization.
Because it operates as a network, the model will continuously improve and adapt to the latest real-world knowledge. This dynamic evaluation ensures that SN5’s embedding model not only surpasses existing state-of-the-art (SOTA) models, pushing the boundaries of industry performance, but also remains consistently competitive.

The dashboards below display the real-time InfoNCE loss and Top-1 Recall of the baseline model (OpenAI text-embedding-3-large) and SN5 miners' models.


[Dashboard: InfoNCE Loss vs. wall time (Apr 27–May 11) — series: TextEmbeddingSynapse Miner Avg Loss, TextEmbeddingSynapse OpenAI Loss; run set of 2658 runs]


The InfoNCE loss is a contrastive evaluation of text embeddings based on the pairwise relevance among texts:
\mathcal{L}_\text{InfoNCE} = - \mathbb{E} \left[ \log \frac{f(\mathbf{x}, \mathbf{c})}{\sum_{\mathbf{x}' \in X} f(\mathbf{x}', \mathbf{c})} \right]

Minimizing this loss maximizes the mutual information between positive pairs x and c:
I(\mathbf{x}; \mathbf{c}) = \sum_{\mathbf{x}, \mathbf{c}} p(\mathbf{x}, \mathbf{c}) \log \frac{p(\mathbf{x}, \mathbf{c})}{p(\mathbf{x})\, p(\mathbf{c})} = \sum_{\mathbf{x}, \mathbf{c}} p(\mathbf{x}, \mathbf{c}) \log \frac{p(\mathbf{x} \mid \mathbf{c})}{p(\mathbf{x})}

and minimizes the mutual information I(\mathbf{x}'; \mathbf{c}) between negative pairs x' and c.
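
As a concrete illustration, below is a minimal PyTorch sketch of an in-batch InfoNCE computation. The function name info_nce_loss, the cosine-similarity scoring with a temperature parameter, and the in-batch negatives are illustrative assumptions; the subnet's actual scoring function f and candidate set X may differ.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """In-batch InfoNCE: row i of doc_emb is the positive for row i of
    query_emb; every other row in the batch serves as a negative."""
    q = F.normalize(query_emb, dim=-1)  # unit-norm query embeddings
    d = F.normalize(doc_emb, dim=-1)    # unit-norm document embeddings
    logits = q @ d.T / temperature      # [batch, batch] pairwise scores
    targets = torch.arange(q.size(0), device=q.device)
    # cross_entropy computes -log softmax(logits)[i, i], i.e. the
    # -log f(x, c) / sum_x' f(x', c) term averaged over the batch
    return F.cross_entropy(logits, targets)
```

For example, with 32 query-document pairs per batch, each query's positive document competes against the other 31 documents in the batch as negatives.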

Top-1 Recall measures document-retrieval performance with the model's embeddings: the fraction of queries for which the highest-ranked retrieved document is the ground-truth match.
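
For concreteness, here is a minimal sketch of Top-1 Recall over paired query and document embeddings; the function name top1_recall and the cosine-similarity ranking are assumptions for illustration, not necessarily the validators' exact implementation.

```python
import torch
import torch.nn.functional as F

def top1_recall(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> float:
    """Fraction of queries whose most similar document (by cosine
    similarity) is the ground-truth match (row i pairs with row i)."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    sims = q @ d.T                      # [n_queries, n_docs] similarities
    top1 = sims.argmax(dim=-1)          # best-scoring document per query
    targets = torch.arange(q.size(0), device=q.device)
    return (top1 == targets).float().mean().item()
```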



More details can be found in the OpenKaito GitHub repository: https://github.com/OpenKaito/openkaito