From Fact Drafts to Operational Systems: Semantic Search in Legal Decisions Using Fact Drafts.
- Source :
- Big Data & Cognitive Computing; Dec2024, Vol. 8 Issue 12, p185, 22p
- Publication Year :
- 2024
Abstract
- This research paper presents findings from an investigation into semantic similarity search in the legal domain, using a corpus of 1172 Hungarian court decisions. The study establishes the groundwork for an operational semantic similarity search system designed to identify cases with comparable facts using preliminary legal fact drafts. Evaluating such systems is often challenging because it requires thorough document checks, which are costly and limit the reusability of evaluations. To address this, the study employs manually created fact drafts for legal cases, enabling reliable ranking of original cases within retrieved documents and quantitative comparison of various vectorization methods. The study compares twelve different text embedding solutions (the most recent became available just a few weeks before the manuscript was written), identifying Cohere's embed-multilingual-v3.0, Beijing Academy of Artificial Intelligence's bge-m3, Jina AI's jina-embeddings-v3, OpenAI's text-embedding-3-large, and Microsoft's multilingual-e5-large models as top performers. To overcome the context window limitation of transformer-based models, we investigated chunking, striding, and last chunk scaling techniques, with last chunk scaling significantly improving embedding quality. The results suggest that the effectiveness of striding varies with token count. Notably, striding with 16 tokens yielded the best results, which corresponds to 3.125% of the context window size of the best-performing models. The results also suggest that, among the models with an 8192-token context window, bge-m3 is superior to jina-embeddings-v3 and text-embedding-3-large in capturing the relevant parts of a document when the text contains a significant amount of noise. The validity of the approach was evaluated and confirmed by legal experts. These insights led to an operational semantic search system for a prominent legal content provider.
[ABSTRACT FROM AUTHOR]
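- Illustration (not part of the author's abstract) :
- The abstract names chunking, striding, and last chunk scaling but does not define them, so the following Python sketch rests on assumptions rather than on the paper's implementation. It assumes that "striding" means consecutive chunks overlap by a fixed number of tokens, and that "last chunk scaling" means weighting each chunk embedding by how much of the window it fills, so a short final chunk contributes less to the pooled document vector. The 512-token window is inferred only from the stated figure (16 tokens being 3.125% of the window). The function names `chunk_with_stride` and `pool_with_last_chunk_scaling` are hypothetical.

```python
from typing import List, Sequence


def chunk_with_stride(tokens: Sequence[int], window: int = 512,
                      stride: int = 16) -> List[List[int]]:
    """Split a token sequence into chunks of at most `window` tokens.

    Assumption: `stride` is the number of tokens shared by two
    consecutive chunks (the overlap), not the step between chunk starts.
    """
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    step = window - stride
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # this chunk already reaches the end of the document
        start += step
    return chunks


def pool_with_last_chunk_scaling(vectors: Sequence[Sequence[float]],
                                 lengths: Sequence[int],
                                 window: int = 512) -> List[float]:
    """Weighted mean of chunk embeddings.

    Assumption: each chunk is weighted by len(chunk) / window, so only a
    shorter-than-window chunk (typically the last one) is scaled down.
    """
    weights = [n / window for n in lengths]
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) / total
            for i in range(dim)]


if __name__ == "__main__":
    token_ids = list(range(1000))  # stand-in for real tokenizer output
    parts = chunk_with_stride(token_ids, window=512, stride=16)
    print([len(p) for p in parts])
    print(16 / 512)  # fraction of the window used as overlap
```

- In practice the chunk vectors would come from one of the embedding models listed in the abstract; here they are left as plain lists of floats so the sketch stays independent of any particular model API.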
- Subjects :
- ARTIFICIAL intelligence
DATABASES
SYSTEMS design
JUSTICE administration
NOISE
Details
- Language :
- English
- ISSN :
- 25042289
- Volume :
- 8
- Issue :
- 12
- Database :
- Complementary Index
- Journal :
- Big Data & Cognitive Computing
- Publication Type :
- Academic Journal
- Accession number :
- 181958651
- Full Text :
- https://doi.org/10.3390/bdcc8120185