1. Contextual word embeddings for tabular data search and integration
- Author
José Pilaluisa, David Tomás, Borja Navarro-Colorado, Jose-Norberto Mazón, Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos, Procesamiento del Lenguaje y Sistemas de Información (GPLSI), and Web and Knowledge (WaKe)
- Subjects
Artificial Intelligence, Information search, Open data, Tabular data, Data integration, Contextual word embedding, Software
- Abstract
This paper presents a new approach to retrieving and then integrating tabular datasets (collections of rows and columns) using union and join operations. Both processes rely on a similarity measure based on contextual word embeddings, which makes it possible to find semantically similar tables and to overcome the recall problem of lexical approaches based on string similarity. This work is the first attempt to use contextual word embeddings across the whole pipeline of table search and integration, including, for the first time, their use in the join operation. A comprehensive analysis of their performance was carried out on both retrieving and integrating tabular datasets, comparing them with context-free models. Column headings and cell values were used as contextual information, and their impact on each task was evaluated. The results revealed that contextual models significantly outperform context-free models and a traditional weighting schema in ad hoc table retrieval. In the data integration task, contextual models also improved the results on the union operation compared to context-free approaches.
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This research has been partially funded by project “Desarrollo de un ecosistema de datos abiertos para transformar el sector turístico” (GVA-COVID19/2021/103), funded by Conselleria de Innovación, Universidades, Ciencia y Sociedad Digital de la Generalitat Valenciana (Spain); and by projects “CHAN-TWIN” (TED2021-130890B-C21), “COnscious natuRal TEXt generation (CORTEX)” (PID2021-123956OB-I00) and “Technological Resources for Intelligent VIral AnaLysis through NLP (TRIVIAL)” (PID2021-122263OB-C22), funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR.
- Published
2022