A semantic relatedness preserved subset extraction method for language corpora based on pseudo-Boolean optimization.

Authors :
Dong, Luobing
Guo, Qiumin
Wu, Weili
Satpute, Meghana N.
Source :
Theoretical Computer Science. Oct2020, Vol. 836, p65-75. 11p.
Publication Year :
2020

Abstract

As language corpora play an increasingly important role in Artificial Intelligence (AI) research, many extremely large corpora have been created. However, a larger corpus size not only increases power and accuracy but also brings redundancy. Researchers have therefore begun to emphasize the study of appropriate subset extraction methods. The trade-off between data sufficiency and redundancy gives rise to a group of interesting and challenging problems, which are studied in this paper: (1) How can the resulting subset include as much data as possible under some necessary constraints? (2) How can the potentially useful semantic relatedness contained in the original corpora be preserved while the size of the corpora is reduced? For these two problems, existing work mainly focuses on methods for constructing particular subsets for special uses, and these methods are limited in their focus. In this paper, we try to address the problems listed above. First, considering the cubic and binary semantic relatedness among tokens, we construct a general system model and formulate the mixed problem as a cubic pseudo-Boolean optimization problem. Second, by analyzing the characteristics of the objective function, we transform the problem into the maximum flow problem on a corresponding graph. Third, we propose a new algorithm by introducing a discrete Lagrangian iteration method. We prove that the objective function is supermodular, which allows us to use fast minimum cut algorithms in each iteration step and thereby obtain another fast algorithm. Finally, we experimentally validate our new algorithms on several randomly created corpora. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
03043975
Volume :
836
Database :
Academic Search Index
Journal :
Theoretical Computer Science
Publication Type :
Academic Journal
Accession number :
145135710
Full Text :
https://doi.org/10.1016/j.tcs.2020.07.020