A Syllable-Based Technique for Uyghur Text Compression
- Source :
- Information, Volume 11, Issue 3, p. 172 (2020)
- Publication Year :
- 2020
- Publisher :
- Multidisciplinary Digital Publishing Institute, 2020.
Abstract
- To improve the utilization of text storage resources and the efficiency of data transmission, we proposed two syllable-based compression coding schemes for Uyghur text. First, based on syllable coverage statistics from the corpus text, we constructed 12-bit and 16-bit syllable code tables and added commonly used symbols, such as punctuation marks and ASCII characters, to the code tables. To enable the coding schemes to process Uyghur text mixed with symbols from other languages, we introduced a flag code in the compression process to mark Unicode characters that were not in the code table. The experiments showed that the 12-bit coding scheme achieved an average compression ratio of 0.3 on Uyghur text smaller than 4 KB and that the 16-bit coding scheme achieved an average compression ratio of 0.5 on text smaller than 2 KB. Our compression schemes outperformed GZip, BZip2, and the LZW algorithm on short text and could be effectively applied to the compression of short Uyghur text for storage and applications.
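The flag-code idea in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the flag value `0xFFF`, the 16-bit width of an escaped character, the greedy longest-match tokenizer (standing in for real Uyghur syllable segmentation), and the toy table entries in Latin transliteration. The sketch only shows how 12-bit table codes and escaped Unicode characters could share one bit stream.

```python
CODE_BITS = 12     # width of one table code (the 12-bit scheme)
FLAG = 0xFFF       # assumed escape value; the paper's actual flag code may differ
ESCAPE_BITS = 16   # assumed width of an escaped character (BMP only)


def build_table(syllables):
    """Map each syllable or symbol to a code; FLAG itself stays reserved."""
    if len(syllables) >= FLAG:
        raise ValueError("too many entries for a 12-bit table")
    return {s: i for i, s in enumerate(syllables)}


def encode(text, table):
    """Return (value, bit_width) fields using greedy longest-match lookup."""
    max_len = max(map(len, table))
    fields, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in table:
                fields.append((table[piece], CODE_BITS))
                i += n
                break
        else:
            # Not in the table: emit the flag, then the raw code point.
            cp = ord(text[i])
            if cp > 0xFFFF:
                raise ValueError("sketch handles BMP characters only")
            fields.append((FLAG, CODE_BITS))
            fields.append((cp, ESCAPE_BITS))
            i += 1
    return fields


def pack(fields):
    """Concatenate the fields MSB-first and zero-pad to a whole byte."""
    acc = nbits = 0
    for value, width in fields:
        acc = (acc << width) | value
        nbits += width
    pad = -nbits % 8
    return (acc << pad).to_bytes((nbits + pad) // 8, "big")


def decode(data, table):
    """Invert pack(encode(...)); padding (< 8 bits) is ignored at the end."""
    inverse = {code: s for s, code in table.items()}
    acc, remaining, out = int.from_bytes(data, "big"), len(data) * 8, []

    def take(width):
        nonlocal remaining
        remaining -= width
        return (acc >> remaining) & ((1 << width) - 1)

    while remaining >= CODE_BITS:
        code = take(CODE_BITS)
        out.append(chr(take(ESCAPE_BITS)) if code == FLAG else inverse[code])
    return "".join(out)


if __name__ == "__main__":
    table = build_table(["ki", "tab", "lar", " ", "."])  # toy entries
    text = "kitablar \u03a9."   # the Greek letter is not in the table
    packed = pack(encode(text, table))
    print(len(packed), len(text.encode("utf-16-le")))
    print(decode(packed, table) == text)
```

In this toy run, five fields are table codes and one character is escaped, so an out-of-table symbol costs 28 bits instead of 12. This suggests why such a scheme would favor text dominated by in-table syllables.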
- Subjects :
- Computer science
Speech recognition
02 engineering and technology
Data_CODINGANDINFORMATIONTHEORY
ASCII
0202 electrical engineering, electronic engineering, information engineering
text compression
syllable
code table
lcsh:T58.5-58.64
lcsh:Information technology
020206 networking & telecommunications
Unicode
Punctuation
ComputingMethodologies_PATTERNRECOGNITION
Compression ratio
ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
Uyghur
020201 artificial intelligence & image processing
Information Systems
Coding (social sciences)
Data transmission
Details
- Language :
- English
- ISSN :
- 2078-2489
- Database :
- OpenAIRE
- Journal :
- Information
- Accession number :
- edsair.doi.dedup.....6a5ef4047c4f6766ef3879cbd5b4d0f4
- Full Text :
- https://doi.org/10.3390/info11030172