
Embeddings - Retrieval

Collection of learnings about embedding models, emphasizing retrieval tasks.


Metrics

Plots of the primary metrics for each model, averaged across the retrieval tasks. The black bars are supposed to show the standard deviation, but they appear to show the max and min across all tasks instead.


[Bar charts: map_at_10 and mrr_at_10 per model, averaged across retrieval tasks. Run set: 80 runs.]
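To make the std-dev vs. min/max distinction concrete, here is a minimal sketch of both aggregations. The DataFrame and its scores are purely illustrative, not actual benchmark numbers:

```python
import pandas as pd

# Hypothetical per-task results: one row per (model, task) pair.
scores = pd.DataFrame({
    "model": ["model-a"] * 3 + ["model-b"] * 3,
    "task": ["SciFact", "NFCorpus", "FiQA2018"] * 2,
    "map_at_10": [0.64, 0.36, 0.38, 0.58, 0.31, 0.32],
})

# Aggregate each metric across tasks, per model.
agg = scores.groupby("model")["map_at_10"].agg(["mean", "std", "min", "max"])
print(agg)
```

Error bars of mean ± std are usually much tighter than the min–max range, which is one way to tell which of the two a plot is actually showing.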

In these benchmarks, we run the following MTEB tasks:
- ClimateFEVER
- CQADupstackAndroidRetrieval
- CQADupstackEnglishRetrieval
- CQADupstackGamingRetrieval
- CQADupstackGisRetrieval
- CQADupstackMathematicaRetrieval
- CQADupstackPhysicsRetrieval
- CQADupstackProgrammersRetrieval
- CQADupstackStatsRetrieval
- CQADupstackTexRetrieval
- CQADupstackUnixRetrieval
- CQADupstackWebmastersRetrieval
- CQADupstackWordpressRetrieval
- DBPedia
- FEVER
- FiQA2018
- HotpotQA
- MSMARCO
- NFCorpus
- NQ
- QuoraRetrieval
- SCIDOCS
- SciFact
- Touche2020
- TRECCOVID
These are listed as the English retrieval tasks in MTEB. Note, however, that the MTEB leaderboard only uses a subset of these tasks for benchmarking retrieval.
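As a rough sketch of how these benchmarks can be run with the `mteb` package (the model name and two-task subset here are illustrative, not the exact configuration used for these runs):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Illustrative model; any SentenceTransformer-compatible encoder works here.
model = SentenceTransformer("intfloat/e5-base-v2")

# Evaluate on a couple of the retrieval tasks listed above;
# passing the full list runs the whole suite.
evaluation = MTEB(tasks=["SciFact", "NFCorpus"])
results = evaluation.run(model, output_folder="results/e5-base-v2")
```

Each task writes a JSON result file to the output folder, from which per-model metrics like map_at_10 and mrr_at_10 can be aggregated as in the plots above.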

The instructor-xl and e5-large-v2 models took at least three days to run MTEB retrieval on a single GPU, so those runs were cancelled.