Curation of Historical Arabic Handwritten Digit Datasets from Ottoman Population Registers: A Deep Transfer Learning Case Study

Authors :: Yekta Said Can
M. Erdem Kabadayi
Source :: IEEE BigData
Publication Year :: 2020
Publisher :: IEEE, 2020.
Abstract: With the increasing number of efforts to digitize historical manuscripts and archives, automatic information retrieval systems need to extract meaning quickly and reliably. Historical archives pose more challenges for these systems than modern manuscripts do. Retrieving information from them requires more advanced algorithms, archive-specific methods, and preprocessing techniques. Cutting-edge machine learning algorithms should also be applied to retrieve meaning from these documents. One of the most important research issues in historical document analysis is the lack of public datasets. Although there are plenty of public datasets for modern document analysis, the number of public annotated historical archives is limited. Researchers can test novel algorithms on these modern datasets and infer some results, but the algorithms' performance on historical documents remains unknown until they are tested on historical datasets. In this study, we created a historical Arabic handwritten digit dataset by combining manual annotation and automatic document analysis techniques. The dataset is open to researchers and contains more than 6000 digits. We then tested deep transfer learning algorithms and various machine learning techniques to recognize these digits and achieved promising results.

Subjects :: Information retrieval
Computer science
Population
020206 networking & telecommunications
02 engineering and technology
Numerical digit
Handwriting recognition
0202 electrical engineering, electronic engineering, information engineering
Preprocessor
020201 artificial intelligence & image processing
education
Transfer of learning
Digitization
Historical document
Meaning (linguistics)
