KazTEB Leaderboard 🏆

Kazakh language extension for the Massive Text Embedding Benchmark

This is a new, ongoing project dedicated to the comprehensive evaluation of existing text embedding models on datasets designed for Kazakh-language tasks. Link to the project code.

Currently, the leaderboard supports only 3 tasks (retrieval, classification, and bitext mining), all based on existing human-annotated datasets. The aim of this project is to extend the list to the 8 tasks proposed in MTEB and to cover multiple domains within each task. The test datasets are planned to come from real data sources, without synthetic samples.
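To give a feel for what one of these tasks measures, here is a minimal sketch of bitext mining scored as nearest-neighbour matching by cosine similarity. It is an illustration only: the function names (`cosine`, `bitext_accuracy`) and the toy vectors are made up for this example, real embeddings would come from the model under evaluation, and the leaderboard's actual metric may differ (for example, F1 instead of accuracy).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bitext_accuracy(src_vecs, tgt_vecs):
    """Fraction of source sentences whose most similar target sentence
    is the aligned one (assumed to sit at the same index)."""
    hits = 0
    for i, src in enumerate(src_vecs):
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(src, tgt_vecs[j]))
        hits += int(best == i)
    return hits / len(src_vecs)


# Toy, hand-written "embeddings" (hypothetical values, not model output).
kk_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]  # Kazakh side
en_vecs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.7, 0.1, 0.6]]  # English side

score = bitext_accuracy(kk_vecs, en_vecs)
print(f"bitext mining accuracy: {score:.2f}")
```

A model that embeds translations close together scores high here; shuffling the target side lowers the score, since fewer nearest neighbours land on the aligned index.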


📋 TODO:

  • API-based Model Evaluation: Adding results of closed-source models such as Google's Gemini embeddings.
  • Dynamic Data Loading: Switching to API-based result fetching for real-time updates without manual JSON uploads.
📧 Contact: arysbatyr@gmail.com