Script Identification of Central Asian Printed Document Images based on Nonsubsampled Contourlet Transform.

Authors :: Han, Xing-kun
Aysa, Alimjan
Mamt, Hornisa
Ubul, Kurban
Source :: Engineering Letters. Dec 2017, Vol. 25 Issue 4, p389-395. 7p.
Publication Year :: 2017
Abstract: Document images in various scripts must be identified and processed in today's international environment. As a front-end technology for Optical Character Recognition (OCR), script identification is an indispensable part of automatic document image analysis. Because document images are rich in texture, this paper uses a 3-level Nonsubsampled Contourlet Transform (NSCT) to extract 30-dimensional texture features. Support Vector Machine (SVM) and K Nearest Neighbor (KNN) classifiers were used for classification. A total of 10,000 document images in 10 Central Asian scripts (Arabic, Russian, Tibetan, Chinese, Uyghur, English, Mongolian, Kyrgyz, Kazakh, and Turkish) were classified. The identification efficiency of SVM and KNN was analyzed and compared; in the experiments, the SVM classifier obtained 99.5% average accuracy, higher than KNN. The validity of the proposed method was demonstrated by comparing it with two other script-identification methods, one based on Wavelet Transforms (WT) and one based on Local Binary Patterns (LBP). [ABSTRACT FROM AUTHOR]
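
Illustrative note: the classification stage described in the abstract can be sketched as a nearest-neighbor vote over 30-dimensional feature vectors. The sketch below is not the authors' implementation. It covers only the KNN side, and the feature vectors are synthetic Gaussian clusters standing in for NSCT texture features, which are not computed here. The function names, the three example script labels, the cluster centers, and k=3 are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    # Rank training samples by Euclidean distance to the query vector.
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def make_fake_features(n_per_class, centers, spread, rng):
    """Hypothetical stand-in for NSCT features: Gaussian clusters, one per script."""
    xs, ys = [], []
    for label, center in centers.items():
        for _ in range(n_per_class):
            xs.append([rng.gauss(c, spread) for c in center])
            ys.append(label)
    return xs, ys


DIM = 30  # feature dimensionality stated in the abstract
rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed, well-separated cluster centers; real texture features would differ.
centers = {
    "Arabic": [0.0] * DIM,
    "Tibetan": [1.0] * DIM,
    "Mongolian": [2.0] * DIM,
}

train_x, train_y = make_fake_features(20, centers, 0.1, rng)
test_x, test_y = make_fake_features(5, centers, 0.1, rng)

predictions = [knn_predict(train_x, train_y, x) for x in test_x]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(f"KNN accuracy on synthetic features: {accuracy:.2f}")
```

Because the synthetic clusters are far apart, this toy run says nothing about the accuracies reported in the paper; it only shows the shape of the feature-vector-to-label step.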

Subjects :: *OPTICAL character recognition
*DOCUMENT imaging systems
*IMAGE analysis
*TEXTURE analysis (Image processing)
*CONTOURS (Cartography)
*SUPPORT vector machines
