From Fact Drafts to Operational Systems: Semantic Search in Legal Decisions Using Fact Drafts.

Authors :
Csányi, Gergely Márk
Lakatos, Dorina
Üveges, István
Megyeri, Andrea
Vadász, János Pál
Nagy, Dániel
Vági, Renátó
Source :
Big Data & Cognitive Computing; Dec2024, Vol. 8 Issue 12, p185, 22p
Publication Year :
2024

Abstract

This paper presents findings from an investigation into semantic similarity search in the legal domain, using a corpus of 1172 Hungarian court decisions. The study lays the groundwork for an operational semantic similarity search system designed to identify cases with comparable facts from preliminary legal fact drafts. Evaluating such systems is often challenging because it requires thorough document checks, which are costly and limit the reusability of the evaluation. To address this, the study uses manually created fact drafts for legal cases, which enable reliable ranking of the original cases within the retrieved documents and a quantitative comparison of various vectorization methods. The study compares twelve text embedding solutions (the most recent became available only a few weeks before the manuscript was written), identifying Cohere's embed-multilingual-v3.0, Beijing Academy of Artificial Intelligence's bge-m3, Jina AI's jina-embeddings-v3, OpenAI's text-embedding-3-large, and Microsoft's multilingual-e5-large models as top performers. To overcome the context window limitation of transformer-based models, the study investigated chunking, striding, and last chunk scaling techniques, with last chunk scaling significantly improving embedding quality. The results suggest that the effectiveness of striding depends on the token count. Notably, striding with 16 tokens yielded the best results, representing 3.125% of the context window size for the best-performing models. The results also suggest that, among the models with an 8192-token context window, bge-m3 is superior to jina-embeddings-v3 and text-embedding-3-large at capturing the relevant parts of a document when the text contains a significant amount of noise. The validity of the approach was evaluated and confirmed by legal experts. These insights led to an operational semantic search system for a prominent legal content provider.
[ABSTRACT FROM AUTHOR]
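
The abstract names chunking, striding, and last chunk scaling without defining them, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that striding means consecutive chunks overlap by a fixed number of tokens, and that last chunk scaling means down-weighting the final (usually shorter) chunk by its fill ratio when chunk embeddings are averaged into one document vector. The names chunk_with_stride, pool_embeddings, and toy_embed are hypothetical, and toy_embed stands in for a real embedding model.

```python
from typing import Callable, List, Sequence


def chunk_with_stride(tokens: Sequence[int], window: int, stride: int) -> List[List[int]]:
    """Split tokens into chunks of at most `window` tokens.

    Assumption: `stride` is the number of tokens shared by consecutive chunks.
    """
    if window <= 0 or not 0 <= stride < window:
        raise ValueError("need window > 0 and 0 <= stride < window")
    chunks: List[List[int]] = []
    step = window - stride  # how far the window advances each time
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # this chunk already reaches the end of the document
        start += step
    return chunks


def pool_embeddings(
    chunks: List[List[int]],
    embed: Callable[[List[int]], List[float]],
    window: int,
    scale_last: bool = True,
) -> List[float]:
    """Weighted mean of chunk embeddings.

    Assumption: "last chunk scaling" weights the final chunk by
    len(last_chunk) / window; all other chunks get weight 1.
    """
    if not chunks:
        raise ValueError("no chunks to pool")
    vectors = [embed(chunk) for chunk in chunks]
    weights = [1.0] * len(chunks)
    if scale_last:
        weights[-1] = len(chunks[-1]) / window
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, vectors)) / total
        for i in range(dim)
    ]


def toy_embed(chunk: List[int]) -> List[float]:
    """Hypothetical stand-in for an embedding model (illustration only)."""
    return [float(min(chunk)), 1.0]


if __name__ == "__main__":
    token_ids = list(range(9))  # pretend these are token ids of one decision
    parts = chunk_with_stride(token_ids, window=4, stride=1)
    print(parts)
    print(pool_embeddings(parts, toy_embed, window=4))
```

With a real model, window would be the model's context size and stride the overlap under study; the abstract reports the best results with a stride of 16 tokens, which at the stated 3.125% ratio corresponds to a 512-token window (an inference from the arithmetic, not a figure given in the record).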

Details

Language :
English
ISSN :
25042289
Volume :
8
Issue :
12
Database :
Complementary Index
Journal :
Big Data & Cognitive Computing
Publication Type :
Academic Journal
Accession number :
181958651
Full Text :
https://doi.org/10.3390/bdcc8120185