1. Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models
- Author
Günther, Michael; Milliken, Louis; Geuter, Jonathan; Mastrapas, Georgios; Wang, Bo; Xiao, Han
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Artificial Intelligence, 68T50, I.2.7, I.5.4, H.3.1, H.3.3, Machine Learning (cs.LG), Computer Science - Information Retrieval, Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Information Retrieval (cs.IR)
- Abstract
Jina Embeddings constitutes a set of high-performance sentence embedding models adept at translating various textual inputs into numerical representations, thereby capturing the semantic essence of the text. While these models are not exclusively designed for text generation, they excel in applications such as dense retrieval and semantic textual similarity. This paper details the development of Jina Embeddings, starting with the creation of a high-quality pairwise and triplet dataset. It underlines the crucial role of data cleaning in dataset preparation, gives in-depth insights into the model training process, and concludes with a comprehensive performance evaluation using the Massive Textual Embedding Benchmark (MTEB).
- Comments
9 pages, 2-page appendix, EMNLP 2023 Industrial Track
- Published
2023