Automatic Regular Expression Generation for Extracting Relevant Image Data From Web Pages Using Genetic Algorithms

Authors :
Canan Aslanyurek
Tarik Yerlikaya
Source :
IEEE Access, Vol 12, Pp 90660-90669 (2024)
Publication Year :
2024
Publisher :
IEEE, 2024.

Abstract

This study presents a method that uses genetic algorithms to automatically generate regular expressions for extracting relevant images from web pages. Data extraction is usually done with web scrapers, but it can also be done with regular expressions. Regular expressions are complex and require expert knowledge, which makes them difficult to write by hand. The proposed method automatically creates a regular expression that retrieves the images relevant to news content on a website. Following the genetic-algorithm principle that good candidates survive and bad ones are eliminated, it produces a regular expression that reaches the most relevant image. Automatic generation can therefore replace the time-consuming and error-prone work of hand-crafting an extraction pattern for each site with web scraper tools. The study uses a dataset of text-based data on relevant and irrelevant images from 200 websites in 58 countries. Of the 635,015 image records in the dataset, 22,682 are relevant images. Regular expressions that the genetic algorithm generates by looking only at the relevant image data reach approximately 98.49% of the relevant images.
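
The survive-and-eliminate loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sample `<img>` tags, the `TOKENS` vocabulary, the fitness function, and all parameter values are assumptions chosen for illustration. In particular, this toy fitness also penalizes matches on irrelevant images, whereas the abstract states that the published method looks only at relevant image data.

```python
import random
import re

# Hypothetical toy data: <img> tags as they might appear on a news page.
RELEVANT = [
    '<img class="article-image" src="/media/2024/a1.jpg">',
    '<img class="article-image" src="/media/2024/b7.jpg">',
    '<img class="article-image" src="/media/2023/c3.jpg">',
]
IRRELEVANT = [
    '<img class="logo" src="/static/logo.png">',
    '<img class="ad-banner" src="/ads/x.gif">',
    '<img class="icon" src="/static/rss.png">',
]

# Assumed, hand-picked building blocks that candidates are assembled from.
TOKENS = ['article', 'image', 'media', 'static', 'class="', 'src="',
          r'\d+', r'\w+', '.*', '/', '-', r'\.jpg', r'\.png']
MAX_LEN = 6  # cap on tokens per candidate, so patterns cannot grow unbounded


def fitness(tokens):
    """Relevant tags matched minus irrelevant tags matched (toy measure)."""
    try:
        pattern = re.compile(''.join(tokens))
    except re.error:
        return -len(IRRELEVANT) - 1  # invalid patterns rank below everything
    hits = sum(1 for s in RELEVANT if pattern.search(s))
    false_hits = sum(1 for s in IRRELEVANT if pattern.search(s))
    return hits - false_hits


def crossover(rng, a, b):
    """Join a prefix of one parent to a suffix of the other."""
    child = (a[:rng.randint(0, len(a))] + b[rng.randint(0, len(b)):])[:MAX_LEN]
    return child or [rng.choice(TOKENS)]


def mutate(rng, tokens, rate=0.2):
    """Replace each token with a random one with probability `rate`."""
    return [rng.choice(TOKENS) if rng.random() < rate else t for t in tokens]


def evolve(seed=0, pop_size=30, generations=40):
    rng = random.Random(seed)
    population = [[rng.choice(TOKENS) for _ in range(rng.randint(1, 4))]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Survival of the good: keep the fitter half unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[:pop_size // 2]
        # Elimination of the bad: replace the rest with mutated offspring.
        children = [
            mutate(rng, crossover(rng, rng.choice(survivors), rng.choice(survivors)))
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    best = max(population, key=fitness)
    return ''.join(best), fitness(best)


if __name__ == '__main__':
    pattern, score = evolve()
    print(f'best pattern: {pattern!r}  fitness: {score}')
```

Because the fitter half of each generation is carried over unchanged, the best fitness in the population never decreases from one generation to the next. A realistic system would need a much richer pattern representation and a fitness measure tuned to real page markup.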

Details

Language :
English
ISSN :
2169-3536
Volume :
12
Database :
Directory of Open Access Journals
Journal :
IEEE Access
Publication Type :
Academic Journal
Accession number :
edsdoj.2de0b97f82cb42f59c6c8fe9e494adb6
Document Type :
article
Full Text :
https://doi.org/10.1109/ACCESS.2024.3420734