347 results for "TEXT COMPRESSION"
Search Results
2. A Hybrid Genetic Algorithm-Particle Swarm Optimization Approach for Enhanced Text Compression
- Author
-
Tara Nawzad Ahmad Al Attar
- Subjects
text compression ,genetic algorithms ,particle swarm optimization ,hybrid algorithm ,data storage ,Science - Abstract
Text compression is a necessity for efficient data storage and transmission, especially in the digital era, in which volumes of digital text have grown enormously. Traditional text compression methods, including Huffman coding and Lempel-Ziv-Welch, have limitations in adaptability and efficiency when dealing with such complex and diverse data. In this paper, we propose a hybrid method that combines a Genetic Algorithm (GA) with Particle Swarm Optimization (PSO) to optimize text compression, using the broad exploration capabilities of GA and the fast convergence of PSO. The experimental results show that the proposed hybrid GA-PSO approach yields much better compression ratios than the standalone methods, reducing the size to about 65% while retaining the integrity of the original content. The proposed method is also highly adaptable to various text forms and outperformed other state-of-the-art methods such as the Grey Wolf Optimizer, the Whale Optimization Algorithm, and the African Vulture Optimization Algorithm. These results support the hybrid GA-PSO method as promising for modern text compression.
- Published
- 2024
- Full Text
- View/download PDF
3. Learning-based short text compression using BERT models.
- Author
-
Öztürk, Emir and Mesut, Altan
- Subjects
DATA compression ,LANGUAGE models ,ENGLISH language ,SPEED - Abstract
Learning-based data compression methods have gained significant attention in recent years. Although these methods achieve higher compression ratios compared to traditional techniques, their slow processing times make them less suitable for compressing large datasets, and they are generally more effective for short texts rather than longer ones. In this study, MLMCompress, a word-based text compression method that can utilize any BERT masked language model, is introduced. The performance of MLMCompress is evaluated using four BERT models: two large models and two smaller models referred to as "tiny". The large models are used without training, while the smaller models are fine-tuned. The results indicate that MLMCompress, when using the best-performing model, achieved 38% higher compression ratios for English text and 42% higher compression ratios for multilingual text compared to NNCP, another learning-based method. Although the method does not yield better results than GPTZip, which has been developed in recent years, it achieves comparable outcomes while being up to 35 times faster in the worst-case scenario. Additionally, it demonstrated a 20% improvement in compression speed and a 180% improvement in decompression speed in the best case. Furthermore, MLMCompress outperforms traditional compression methods like Gzip and specialized short text compression methods such as Smaz and Shoco, particularly in compressing short texts, even when using smaller models. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
4. Learning-based short text compression using BERT models
- Author
-
Emir Öztürk and Altan Mesut
- Subjects
BERT ,Fine tuning ,Learning-based compression ,Text compression ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Learning-based data compression methods have gained significant attention in recent years. Although these methods achieve higher compression ratios compared to traditional techniques, their slow processing times make them less suitable for compressing large datasets, and they are generally more effective for short texts rather than longer ones. In this study, MLMCompress, a word-based text compression method that can utilize any BERT masked language model, is introduced. The performance of MLMCompress is evaluated using four BERT models: two large models and two smaller models referred to as “tiny”. The large models are used without training, while the smaller models are fine-tuned. The results indicate that MLMCompress, when using the best-performing model, achieved 38% higher compression ratios for English text and 42% higher compression ratios for multilingual text compared to NNCP, another learning-based method. Although the method does not yield better results than GPTZip, which has been developed in recent years, it achieves comparable outcomes while being up to 35 times faster in the worst-case scenario. Additionally, it demonstrated a 20% improvement in compression speed and a 180% improvement in decompression speed in the best case. Furthermore, MLMCompress outperforms traditional compression methods like Gzip and specialized short text compression methods such as Smaz and Shoco, particularly in compressing short texts, even when using smaller models.
- Published
- 2024
- Full Text
- View/download PDF
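The two records above describe MLMCompress, which stores each word as its rank in a BERT masked-language-model's prediction, so a good model turns most words into small, cheaply encodable numbers. The sketch below illustrates rank coding only, substituting a context-free unigram frequency model for BERT; every name in it is invented for the example, and it is not the authors' code.

```python
# Toy rank coding (a stand-in for masked-LM compression, not MLMCompress):
# a unigram model ranks the vocabulary by frequency, and each word is stored
# as its rank. With a strong predictive model most ranks are small, which an
# entropy coder could then compress tightly; here we stop at the rank stream.
from collections import Counter

def build_ranking(corpus_words):
    """Rank the vocabulary by descending frequency; rank 0 = most frequent."""
    ranked = [w for w, _ in Counter(corpus_words).most_common()]
    return {w: r for r, w in enumerate(ranked)}, ranked

def encode(words, rank_of):
    # Real codecs would entropy-code these ranks instead of storing them raw.
    return [rank_of[w] for w in words]

def decode(ranks, ranked):
    return [ranked[r] for r in ranks]
```

Decoding only needs the same model on the other side, which is why model-based codecs ship no per-text dictionary.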
5. XCompress: LLM assisted Python-based text compression toolkit
- Author
-
Emir Öztürk
- Subjects
Benchmarking ,Large language models ,Text compression ,Python ,Computer software ,QA76.75-76.765 - Abstract
This study introduces XCompress, a Python-based tool for effectively utilizing various compression algorithms. XCompress offers manual, brute-force, and Large Language Model (LLM) modes to determine the most suitable algorithm based on the type of text data. Its modular structure allows easy addition of new algorithms and includes functions for benchmarking and result comparison. Tests on diverse text types demonstrate the efficacy of the LLM-assisted Compression Selection Model (CSM). With XCompress, users can determine the most suitable method for their files. Additionally, researchers can easily compare different methods without needing to write any scripts or code.
- Published
- 2024
- Full Text
- View/download PDF
6. Computing All-vs-All MEMs in Grammar-Compressed Text
- Author
-
Díaz-Domínguez, Diego, Salmela, Leena, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Nardini, Franco Maria, editor, Pisanti, Nadia, editor, and Venturini, Rossano, editor
- Published
- 2023
- Full Text
- View/download PDF
7. An Empirical Analysis on Lossless Compression Techniques
- Author
-
Hossain, Mohammad Badrul, Rahman, Md. Nowroz Junaed, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Neri, Filippo, editor, Du, Ke-Lin, editor, Varadarajan, Vijayakumar, editor, San-Blas, Angel-Antonio, editor, and Jiang, Zhiyu, editor
- Published
- 2023
- Full Text
- View/download PDF
8. A New Method for Short Text Compression
- Author
-
Murat Aslanyurek and Altan Mesut
- Subjects
Machine learning ,text compression ,k-means ,clustering ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Short texts cannot be compressed effectively with general-purpose compression methods. Methods developed to compress short texts often use static dictionaries. To achieve high compression ratios, choosing a static dictionary suited to the text to be compressed is an important problem that needs to be solved. In this study, a method called WSDC (Word-based Static Dictionary Compression), which can compress short texts at a high ratio, and a model that uses iterative clustering to create the static dictionaries used in this method are proposed. The number of static dictionaries to be created can vary by running the k-Means clustering algorithm iteratively according to some rules. A method called DSWF (Dictionary Selection by Word Frequency) is also presented to determine which of the created dictionaries can compress the source text at the best ratio. Wikipedia article abstracts in 6 different languages were used as the dataset in the experiments. The developed WSDC method is compared with both general-purpose compression methods (Gzip, Bzip2, PPMd, Brotli and Zstd) and special methods used for the compression of short texts (shoco, b64pack and smaz). According to the test results, although WSDC is slower than some other methods, it achieves the best compression ratios for short texts smaller than 200 bytes, and better ratios than all methods except Zstd for short texts smaller than 1000 bytes.
- Published
- 2023
- Full Text
- View/download PDF
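The WSDC abstract above rests on the idea of word-based static-dictionary compression: frequent words from a training corpus map to short fixed codes, and unknown words are escaped. The toy below sketches that idea only (one-byte indices, a single dictionary); it is not the authors' WSDC, which builds multiple dictionaries via iterative k-Means and selects among them.

```python
# Toy word-based static-dictionary compression (not the authors' WSDC):
# the 255 most frequent training words map to one-byte indices; any other
# word is emitted literally behind an escape byte (0).
from collections import Counter

def build_dictionary(corpus_words, size=255):
    """Keep the `size` most frequent words; index 0 is reserved as escape."""
    ranked = [w for w, _ in Counter(corpus_words).most_common(size)]
    return {w: i + 1 for i, w in enumerate(ranked)}

def compress(text, dictionary):
    out = bytearray()
    for word in text.split():
        idx = dictionary.get(word)
        if idx is not None:
            out.append(idx)                     # one byte per known word
        else:
            raw = word.encode("utf-8")
            out += bytes([0, len(raw)]) + raw   # escape, length, literal bytes
    return bytes(out)

def decompress(blob, dictionary):
    inv = {i: w for w, i in dictionary.items()}
    words, i = [], 0
    while i < len(blob):
        if blob[i] == 0:                        # escaped literal word
            n = blob[i + 1]
            words.append(blob[i + 2:i + 2 + n].decode("utf-8"))
            i += 2 + n
        else:
            words.append(inv[blob[i]])
            i += 1
    return " ".join(words)
```

Because the dictionary is fixed in advance, nothing per-text has to be stored alongside the compressed bytes, which is exactly why static dictionaries pay off on very short inputs.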
9. A Study on Data Compression Algorithms for Its Efficiency Analysis
- Author
-
Rodrigues, Calvin, Jishnu, E. M., Nair, Chandu R., Soumya Krishnan, M., Howlett, Robert J., Series Editor, Jain, Lakhmi C., Series Editor, Karuppusamy, P., editor, Perikos, Isidoros, editor, and García Márquez, Fausto Pedro, editor
- Published
- 2022
- Full Text
- View/download PDF
10. Optimal alphabet for single text compression.
- Author
-
Allahverdyan, Armen and Khachatryan, Andranik
- Subjects
HUFFMAN codes ,DATA compression ,SIGNS & symbols - Abstract
A text written using symbols from a given alphabet can be compressed using the Huffman code, which minimizes the length of the encoded text. It is necessary, however, to employ a text-specific codebook, i.e. the symbol-codeword dictionary, to decode the original text. Thus, the compression performance should be evaluated by the full code length, i.e. the length of the encoded text plus the length of the codebook. We studied several alphabets for compressing texts – letters, n-grams of letters, syllables, words, and phrases. If only sufficiently short texts are retained, an alphabet of letters or two-grams of letters is optimal. For the majority of Project Gutenberg texts, the best alphabet (the one that minimizes the full code length) is given by syllables or words, depending on the representation of the codebook. Letter 3- and 4-grams, having on average comparable length to syllables/words, perform noticeably worse than syllables or words. Word 2-grams are also never the best alphabet, on account of having a very large codebook. We also show that the codebook representation is important – switching from a naive representation to a compact one significantly improves matters for alphabets with a large number of symbols, most notably words. Thus, meaning-expressing elements of the language (syllables or words) provide the best compression alphabet. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
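The abstract above evaluates alphabets by the full code length: encoded-text bits plus codebook bits. A minimal sketch of that accounting follows, using a standard heap-based Huffman construction and a naive codebook estimate (8 bits per symbol character); the codebook model is an assumption of this example, not the paper's representation.

```python
# Full code length = Huffman-encoded text + codebook, for a chosen tokenizer.
# The codebook estimate (8 bits per character of each symbol, plus its code
# length) is a naive stand-in for the paper's codebook representations.
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Optimal prefix-code length (in bits) per symbol, via Huffman merging."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries carry a tiebreaker so symbol dicts are never compared.
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tick, merged))
        tick += 1
    return heap[0][2]

def full_code_length(text, tokenize):
    """Encoded-text bits plus a naive codebook estimate for this alphabet."""
    tokens = tokenize(text)
    freqs = Counter(tokens)
    lengths = huffman_code_lengths(freqs)
    encoded = sum(lengths[t] for t in tokens)
    codebook = sum(8 * len(s) + lengths[s] for s in freqs)
    return encoded + codebook
```

Comparing `full_code_length(text, list)` (letters) with `full_code_length(text, str.split)` (words) on texts of growing length reproduces the paper's trade-off in miniature: richer alphabets shrink the encoded text but inflate the codebook.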
11. Medical Data Compression and Sharing Technology Based on Blockchain
- Author
-
Du, Yi, Yu, Hua, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Zhang, Zhao, editor, Li, Wei, editor, and Du, Ding-Zhu, editor
- Published
- 2020
- Full Text
- View/download PDF
12. Text Compression-Aided Transformer Encoding.
- Author
-
Li, Zuchao, Zhang, Zhuosheng, Zhao, Hai, Wang, Rui, Chen, Kehai, Utiyama, Masao, and Sumita, Eiichiro
- Subjects
NATURAL language processing ,ENCODING ,CURRENT transformers (Instrument transformer) ,INFORMATION modeling - Abstract
Text encoding is one of the most important steps in Natural Language Processing (NLP). It has been done well by the self-attention mechanism in the current state-of-the-art Transformer encoder, which has brought about significant improvements in the performance of many NLP tasks. Though the Transformer encoder may effectively capture general information in its resulting representations, the backbone information, meaning the gist of the input text, is not specifically focused on. In this paper, we propose explicit and implicit text compression approaches to enhance the Transformer encoding and evaluate models using this approach on several typical downstream tasks that rely on the encoding heavily. Our explicit text compression approaches use dedicated models to compress text, while our implicit text compression approach simply adds an additional module to the main model to handle text compression. We propose three ways of integration, namely backbone source-side fusion, target-side fusion, and both-side fusion, to integrate the backbone information into Transformer-based models for various downstream tasks. Our evaluation on benchmark datasets shows that the proposed explicit and implicit text compression approaches improve results in comparison to strong baselines. We therefore conclude, when comparing the encodings to the baseline models, text compression helps the encoders to learn better language representations. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
13. A survey on data compression techniques: From the perspective of data quality, coding schemes, data type and applications
- Author
-
Uthayakumar Jayasankar, Vengattaraman Thirumal, and Dhavachelvan Ponnurangam
- Subjects
Data compression ,Data redundancy ,Text compression ,Image compression ,Information theory ,Entropy encoding ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Explosive growth of data in the digital world leads to the need for efficient techniques to store and transmit data. Due to limited resources, data compression (DC) techniques are proposed to minimize the size of data being stored or communicated. As DC leads to effective utilization of available storage and communication bandwidth, numerous approaches have been developed in several aspects. In order to analyze how DC techniques and their applications have evolved, a detailed survey of many existing DC techniques is carried out to address current requirements in terms of data quality, coding schemes, type of data and applications. A comparative analysis is also performed to identify the contribution of the reviewed techniques in terms of their characteristics, underlying concepts, experimental factors and limitations. Finally, this paper provides insight into various open issues and research directions to explore promising areas for future developments.
- Published
- 2021
- Full Text
- View/download PDF
14. An Approach for LPF Table Computation
- Author
-
Chairungsee, Supaporn, Charuphanthuset, Thana, Barbosa, Simone Diniz Junqueira, Editorial Board Member, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Kotenko, Igor, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Yuan, Junsong, Founding Editor, Anderst-Kotsis, Gabriele, editor, Tjoa, A Min, editor, Khalil, Ismail, editor, Elloumi, Mourad, editor, Mashkoor, Atif, editor, Sametinger, Johannes, editor, Larrucea, Xabier, editor, Fensel, Anna, editor, Martinez-Gil, Jorge, editor, Moser, Bernhard, editor, Seifert, Christin, editor, Stein, Benno, editor, and Granitzer, Michael, editor
- Published
- 2019
- Full Text
- View/download PDF
15. A NEW METHOD PROPOSAL FOR IMPROVING THE CLASSIFICATION SUCCESS OF TURKISH TEXTS
- Author
-
Metin Bi̇lgi̇n
- Subjects
text classification ,natural language processing ,lzw ,text compression ,machine learning ,Technology ,Engineering (General). Civil engineering (General) ,TA1-2040 - Abstract
This study aims to predict the author of a document whose author is unknown. For this purpose, 6 columns by 6 different columnists were first passed through a preprocessing stage. Then, features were extracted from these texts with n-grams (2-3). Using the extracted features, the system was tested with 10-fold cross-validation on 6 different machine learning methods. The procedure up to this point is the method applied so far in the literature. Our proposal is to reduce the number of features by losslessly compressing the texts with the LZW algorithm after the preprocessing stage, and to investigate the effects of this on the success of the system. The preprocessed texts are compressed with the LZW algorithm in binary and decimal form. After compression, the system was tested with the features extracted with n-grams (2-3) on 6 different machine learning methods, and the results were examined for 5 different metrics. As a result of the study, binary-compressed texts achieved better results for both 2-grams and 3-grams on all 6 machine learning algorithms. Although decimal compression lagged behind the raw data for the Random Tree and Naïve Bayes algorithms, it obtained better results on the other 4 algorithms but fell behind in average success values. Overall, binary compression is more successful than the other two methods on all metrics. Although author recognition was performed in this study, it is thought that the proposed method can be used in all text classification tasks.
- Published
- 2019
- Full Text
- View/download PDF
16. A New Corpus of the Russian Social Network News Feed Paraphrases: Corpus Construction and Linguistic Feature Analysis
- Author
-
Pronoza, Ekaterina, Yagunova, Elena, Pronoza, Anton, Hutchison, David, Series Editor, Kanade, Takeo, Series Editor, Kittler, Josef, Series Editor, Kleinberg, Jon M., Series Editor, Mattern, Friedemann, Series Editor, Mitchell, John C., Series Editor, Naor, Moni, Series Editor, Pandu Rangan, C., Series Editor, Steffen, Bernhard, Series Editor, Terzopoulos, Demetri, Series Editor, Tygar, Doug, Series Editor, Castro, Félix, editor, Miranda-Jiménez, Sabino, editor, and González-Mendoza, Miguel, editor
- Published
- 2018
- Full Text
- View/download PDF
17. Reading in the Age of Compression.
- Author
-
Koepnick, Lutz
- Subjects
PDF (Computer file format) ,DIGITAL communications ,READING ,ELECTRONIC data processing ,READING comprehension - Abstract
Compression is often considered a royal road to process data in ever-shorter time and to cater to our desire to outspeed the accelerating transmission of information in the digital age. This article explores how different techniques of accelerated text dissemination and reading, such as consonant writing, speed-reading apps, and the PDF file format, borrow from the language of compression yet, precisely in so doing, obscure the constitutive multilayered temporality of reading and the embodied role of the reader. While discussing different methods aspiring to compress textual objects and processes of reading, the author illuminates hidden assumptions that accompany the rhetoric of text compression and compressed reading. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
18. A survey on data compression techniques: From the perspective of data quality, coding schemes, data type and applications.
- Author
-
Jayasankar, Uthayakumar, Thirumal, Vengattaraman, and Ponnurangam, Dhavachelvan
- Subjects
DATA quality ,DATA compression ,INFORMATION theory ,COMPARATIVE studies ,IMAGE compression - Abstract
• Systematic organization of Data Compression (DC) concepts with their importance, mathematical formulation and performance measures. • Critical investigation of various DC algorithms on the basis of data quality, coding schemes, data type and applications. • Potential research directions and open issues are suggested to explore possible future trends in DC. Explosive growth of data in the digital world leads to the need for efficient techniques to store and transmit data. Due to limited resources, data compression (DC) techniques are proposed to minimize the size of data being stored or communicated. As DC leads to effective utilization of available storage and communication bandwidth, numerous approaches have been developed in several aspects. In order to analyze how DC techniques and their applications have evolved, a detailed survey of many existing DC techniques is carried out to address current requirements in terms of data quality, coding schemes, type of data and applications. A comparative analysis is also performed to identify the contribution of the reviewed techniques in terms of their characteristics, underlying concepts, experimental factors and limitations. Finally, this paper provides insight into various open issues and research directions to explore promising areas for future developments. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
19. A Graph-Based Frequent Sequence Mining Approach to Text Compression
- Author
-
Oswald, C., Ajith Kumar, I., Avinash, J., Sivaselvan, B., Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Ghosh, Ashish, editor, Pal, Rajarshi, editor, and Prasath, Rajendra, editor
- Published
- 2017
- Full Text
- View/download PDF
20. Longest Previous Non-overlapping Factors Table Computation
- Author
-
Chairungsee, Supaporn, Crochemore, Maxime, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Gao, Xiaofeng, editor, Du, Hongwei, editor, and Han, Meng, editor
- Published
- 2017
- Full Text
- View/download PDF
21. Text Compression Based on Letter's Prefix in the Word.
- Author
-
AbuSafiya, Majed
- Subjects
DATA compression ,LETTERS ,FINITE state machines ,VOCABULARY ,ENCODING - Abstract
Huffman encoding [Huffman (1952)] is one of the best-known compression algorithms. In its basic use, only one encoding is given for the same letter in the text to compress. In this paper, a text compression algorithm based on Huffman encoding is proposed. Huffman encoding is used to give different encodings for the same letter depending on the prefix preceding it in the word. A deterministic finite automaton (DFA) that recognizes the words of the text is constructed. This DFA records the frequencies of the letters that label the transitions. Every state corresponds to one of the prefixes of the words of the text. For every state, a different Huffman encoding is defined for the letters that label the transitions leaving that state. These Huffman encodings are then used to encode the letters of the words in the text. This algorithm was implemented, and an experimental study showed a significant reduction in compression ratio over basic Huffman encoding. However, more time is needed to construct these codes. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
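The abstract above conditions each letter's code on the word prefix that precedes it. The sketch below illustrates only why that helps, not the paper's DFA construction: it compares the ideal (Shannon) bit cost of the letters with and without prefix conditioning on the same word list, using empirical frequencies as a proxy for the per-state Huffman codes.

```python
# Proxy for prefix-conditioned coding (not the paper's DFA implementation):
# measure ideal bits per letter from empirical frequencies, unconditioned
# vs. conditioned on the word prefix seen so far. Conditioning sharpens the
# next-letter distribution, so the conditioned total can only be smaller.
import math
from collections import Counter, defaultdict

def ideal_bits_unconditioned(words):
    """Shannon bits for all letters under one global letter distribution."""
    letters = [c for w in words for c in w]
    freq = Counter(letters)
    total = sum(freq.values())
    return sum(-math.log2(freq[c] / total) for c in letters)

def ideal_bits_prefix_conditioned(words):
    """Shannon bits when each letter is coded given its word prefix."""
    ctx = defaultdict(Counter)                  # prefix -> next-letter counts
    for w in words:
        for i, c in enumerate(w):
            ctx[w[:i]][c] += 1
    bits = 0.0
    for w in words:
        for i, c in enumerate(w):
            table = ctx[w[:i]]
            bits += -math.log2(table[c] / sum(table.values()))
    return bits
```

The paper's extra cost is the codebook per DFA state, which is why it reports longer code-construction times alongside the better compression ratio.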
22. Text Compression
- Author
-
Ferragina, Paolo, Nitto, Igor, Venturini, Rossano, Nascimento, Mario A., Section editor, Liu, Ling, editor, and Özsu, M. Tamer, editor
- Published
- 2018
- Full Text
- View/download PDF
23. Indexing Compressed Text.
- Author
-
Ferragina, Paolo and Manzini, Giovanni
- Subjects
COMPUTER operating systems ,CODING theory ,INFORMATION theory ,ALGORITHMS ,INFORMATION storage & retrieval systems - Abstract
We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form. Our first compressed data structure retrieves the occ occurrences of a pattern P[1, p] within a text T[1, n] in O(p + occ log^{1+ε} n) time for any chosen ε, 0 < ε < 1. This data structure uses at most 5nH_k(T) + o(n) bits of storage, where H_k(T) is the kth order empirical entropy of T. The space usage is Θ(n) bits in the worst case and o(n) bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows-Wheeler Transform, and can be regarded as a compressed suffix array. Our second compressed data structure achieves O(p + occ) query time using O(nH_k(T) log^ε n) + o(n) bits of storage for any chosen ε, 0 < ε < 1. Therefore, it provides optimal output-sensitive query time using o(n log n) bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows-Wheeler Transform and the LZ78 algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
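Both structures in the abstract above build on the Burrows-Wheeler Transform. A minimal sketch of the transform and its inverse follows; this is the naive O(n² log n) rotation-sort version for illustration only, whereas real FM-indexes derive the BWT from a suffix array and add rank/select machinery on top.

```python
# Minimal Burrows-Wheeler Transform and inverse (naive rotation-sort sketch,
# not an FM-index). '\0' serves as a unique end-of-text sentinel that sorts
# before every other character.

def bwt(text):
    """Last column of the sorted cyclic rotations of text + sentinel."""
    s = text + "\0"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def ibwt(last):
    """Invert the BWT by repeatedly prepending and re-sorting (table method)."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith("\0"))
    return row[:-1]
```

The transform is useful for both compression and indexing because it groups characters that share a right-context, producing long runs that simple coders exploit and that rank queries can navigate.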
24. Boosting Textual Compression in Optimal Linear Time.
- Author
-
Ferragina, Paolo, Giancarlo, Raffaele, Manzini, Giovanni, and Sciortino, Marinella
- Subjects
ALGORITHMS ,DATA structures ,ELECTRONIC data processing ,COMPUTER operating systems ,INFORMATION theory - Abstract
We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression performance guarantee. It displays the following remarkable properties: (a) it can turn any memoryless compressor into a compression algorithm that uses the "best possible" contexts; (b) it is very simple and optimal in terms of time; and (c) it admits a decompression algorithm again optimal in time. To the best of our knowledge, this is the first boosting technique displaying these properties. Technically, our boosting technique builds upon three main ingredients: the Burrows-Wheeler Transform, the Suffix Tree data structure, and a greedy algorithm to process them. Specifically, we show that there exists a proper partition of the Burrows-Wheeler Transform of a string s that shows a deep combinatorial relation with the kth order entropy of s. That partition can be identified via a greedy processing of the suffix tree of s with the aim of minimizing a proper objective function over its nodes. The final compressed string is then obtained by compressing individually each substring of the partition by means of the base compressor we wish to boost. Our boosting technique is inherently combinatorial because it does not need to assume any prior probabilistic model about the source emitting s, and it does not deploy any training, parameter estimation and learning. Various corollaries are derived from this main achievement. Among the others, we show analytically that using our booster, we get better compression algorithms than some of the best existing ones, that is, LZ77, LZ78, PPMC and the ones derived from the Burrows-Wheeler Transform. Further, we settle analytically some long-standing open problems about the algorithmic structure and the performance of BWT-based compressors. Namely, we provide the first family of BWT algorithms that do not use Move-To-Front or Symbol Ranking as a part of the compression process. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
25. Trigram-Based Vietnamese Text Compression
- Author
-
Nguyen, Vu H., Nguyen, Hien T., Duong, Hieu N., Snasel, Vaclav, Kacprzyk, Janusz, Series editor, Król, Dariusz, editor, Madeyski, Lech, editor, and Nguyen, Ngoc Thanh, editor
- Published
- 2016
- Full Text
- View/download PDF
26. Compressing Big Data: When the Rate of Convergence to the Entropy Matters
- Author
-
Aronica, Salvatore, Langiu, Alessio, Marzi, Francesca, Mazzola, Salvatore, Mignosi, Filippo, Nazzicone, Giulio, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Kotsireas, Ilias S., editor, Rump, Siegfried M., editor, and Yap, Chee K., editor
- Published
- 2016
- Full Text
- View/download PDF
27. Evolution of ItaSA and Subsfactory
- Author
-
Massidda, Serenella
- Published
- 2015
- Full Text
- View/download PDF
28. Classification of Scientific Texts Based on the Compression of Annotations to Publications.
- Author
-
Selivanova, I. V., Kosyakov, D. V., and Guskov, A. E.
- Abstract
This paper describes the possibility of establishing the semantic proximity of scientific texts by the method of their automatic classification based on the compression of annotations. The idea of the method is that compression algorithms such as PPM (prediction by partial matching) compress terminologically similar texts much better than distant ones. If a kernel of publications (an analogue of a training set) is formed for each classified topic, then the best compression ratio will indicate that the classified text belongs to the corresponding topic. Thirty thematic categories were determined; for each of them, annotations of approximately 500 publications were retrieved from the Scopus database, out of which 100 annotations for the kernel and 20 annotations for testing were selected in different ways. It was found that building a kernel from highly cited publications yielded an error level of up to 12%, against 32% in the case of random sampling. The quality of classification is also affected by the initial number of categories: the fewer the categories that participate in the classification and the more terminological differences exist between them, the higher its quality. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
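The paper above classifies texts by how well they compress together with a topic's kernel of annotations, using PPM. As a rough stand-in for PPM, the sketch below uses zlib (LZ77-based) to show the same principle; the kernel texts and the two-topic setup are invented for the example.

```python
# Classification by compression, with zlib standing in for the paper's PPM:
# a text is assigned to the topic whose kernel lets it compress best, i.e.
# whose kernel minimizes the extra compressed bytes the text costs.
import zlib

def extra_bytes(kernel, text):
    """Compressed size of kernel+text minus compressed size of kernel alone."""
    base = len(zlib.compress(kernel.encode("utf-8"), 9))
    both = len(zlib.compress((kernel + " " + text).encode("utf-8"), 9))
    return both - base

def classify(text, kernels):
    """kernels: dict mapping topic name -> concatenated training annotations."""
    return min(kernels, key=lambda topic: extra_bytes(kernels[topic], text))
```

A terminologically similar kernel gives the compressor long matches to copy from, so the text costs fewer extra bytes; that is the whole classification signal.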
29. GRAMMATICAL MEANS OF EXPANSION AND COMPRESSION OF THE LYRICAL SPACE OF POETIC WORKS FOR CHILDREN
- Subjects
text compression ,grammatical polynomials ,children's poetry ,text division ,grammatical binomials - Abstract
The report is devoted to the consideration of grammatical methods of creating compression and expansion of the poetic narrative in poetry for children. On the material of children’s poems by Serhiy Vakulenko and Lina Kostenko, the modern tendency of combining additional morphological and syntactic techniques that divide and connect the grammatical structure of the poem is considered. Particular attention is paid to morphological binomials and polynomials, hyphenated constructions that combine lexemes of the same part of speech. It is argued that grammatical units, which are elements of the form and content of a poetic text, participate in the intonation and punctuation linking of the text. At the same time, the newest technique is their combination with such traditional means of division as the use of uncommon sentences, direct speech, the creation of homogeneous chains, and division by dashes. The linguistic search of the masters of the word contributes to the poetic discourse and develops a creative attitude to the word in young readers.
- Published
- 2023
- Full Text
- View/download PDF
31. Experiments with a PPM Compression-Based Method for English-Chinese Bilingual Sentence Alignment
- Author
-
Liu, Wei, Chang, Zhipeng, Teahan, William J., Goebel, Randy, Series editor, Tanaka, Yuzuru, Series editor, Wahlster, Wolfgang, Series editor, Besacier, Laurent, editor, Dediu, Adrian-Horia, editor, and Martín-Vide, Carlos, editor
- Published
- 2014
- Full Text
- View/download PDF
32. Comparison of Entropy and Dictionary Based Text Compression in English, German, French, Italian, Czech, Hungarian, Finnish, and Croatian
- Author
-
Matea Ignatoski, Jonatan Lerga, Ljubiša Stanković, and Miloš Daković
- Subjects
arithmetic ,Lempel–Ziv–Welch (LZW) ,text compression ,encoding ,English ,German ,Mathematics ,QA1-939 - Abstract
The rapid growth in the amount of data in the digital world leads to the need for data compression, i.e., reducing the number of bits needed to represent a text file, an image, audio, or video content. Compressing data saves storage capacity and speeds up data transmission. In this paper, we focus on text compression and provide a comparison of algorithms (in particular, the entropy-based arithmetic and dictionary-based Lempel–Ziv–Welch (LZW) methods) for text compression in different languages (Croatian, Finnish, Hungarian, Czech, Italian, French, German, and English). The main goal is to answer the question: "How does the language of a text affect the compression ratio?" The results indicated that the compression ratio is affected by the size of the language alphabet and by the size and type of the text. For example, The European Green Deal was compressed by 75.79%, 76.17%, 77.33%, 76.84%, 73.25%, 74.63%, 75.14%, and 74.51% using the LZW algorithm, and by 72.54%, 71.47%, 72.87%, 73.43%, 69.62%, 69.94%, 72.42%, and 72% using the arithmetic algorithm for the English, German, French, Italian, Czech, Hungarian, Finnish, and Croatian versions, respectively.
- Published
- 2020
- Full Text
- View/download PDF
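As an illustration of the dictionary-based side of this comparison, a minimal LZW encoder can be sketched in a few lines; the 12-bit code width assumed in the ratio estimate below is for illustration only and is not taken from the paper.

```python
def lzw_compress(text):
    """Minimal LZW sketch: grow a dictionary of previously seen
    substrings and emit one code per longest known match."""
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = next_code
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes

sample = "the quick brown fox jumps over the lazy dog " * 20
codes = lzw_compress(sample)
# Rough estimate: 12 bits per output code vs. 8 bits per input character.
saved = 1 - (len(codes) * 12) / (len(sample) * 8)
print(f"roughly {saved:.0%} of the input bits removed")
```

Repetitive text compresses well because long repeated matches map to single codes; this is the effect the cross-language comparison above measures, since alphabet size and word structure determine how quickly useful dictionary entries accumulate.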
33. A Syllable-Based Technique for Uyghur Text Compression
- Author
-
Wayit Abliz, Hao Wu, Maihemuti Maimaiti, Jiamila Wushouer, Kahaerjiang Abiderexiti, Tuergen Yibulayin, and Aishan Wumaier
- Subjects
text compression ,uyghur ,syllable ,code table ,Information technology ,T58.5-58.64 - Abstract
To improve the utilization of text storage resources and the efficiency of data transmission, we proposed two syllable-based Uyghur text compression coding schemes. First, based on statistics of syllable coverage in the corpus text, we constructed 12-bit and 16-bit syllable code tables and added commonly used symbols, such as punctuation marks and ASCII characters, to the tables. To enable the coding scheme to process Uyghur texts mixed with symbols from other languages, we introduced a flag code in the compression process to distinguish Unicode encodings that are not in the code table. The experiments showed that the 12-bit coding scheme achieved an average compression ratio of 0.3 on Uyghur texts smaller than 4 KB and that the 16-bit coding scheme achieved an average compression ratio of 0.5 on texts smaller than 2 KB. Our compression schemes outperformed GZip, BZip2, and the LZW algorithm on short texts and can be effectively applied to the compression of Uyghur short texts for storage and applications.
- Published
- 2020
- Full Text
- View/download PDF
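The flag-code mechanism described in the abstract can be sketched as follows; the table contents, the 12-bit width, and the reserved `FLAG` value are illustrative assumptions, not the actual tables from the paper.

```python
FLAG = 0xFFF  # hypothetical reserved 12-bit code: "next value is a raw code point"

def build_table(units):
    """Assign each syllable/symbol a fixed-width code, keeping FLAG reserved."""
    return {u: i for i, u in enumerate(units) if i < FLAG}

def encode(units, table):
    out = []
    for u in units:
        if u in table:
            out.append(table[u])   # one table code
        else:
            out.append(FLAG)       # escape: unit not in the table
            out.append(ord(u))     # emit its raw Unicode code point
    return out

table = build_table(["ba", "la", "qa", ".", " "])
print(encode(["ba", "la", " ", "X"], table))  # "X" escapes via FLAG
```

The escape path is what lets a fixed code table cope with mixed-language input: anything outside the table costs extra bits but never breaks the decoder.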
34. Compact and Fast Indexes for Translation Related Tasks
- Author
-
Costa, Jorge, Gomes, Luís, Lopes, Gabriel P., Russo, Luís M. S., Brisaboa, Nieves R., Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Correia, Luís, editor, Reis, Luís Paulo, editor, and Cascalho, José, editor
- Published
- 2013
- Full Text
- View/download PDF
35. Associative Text Representation and Correction
- Author
-
Horzyk, Adrian, Gadamer, Marcin, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Rutkowski, Leszek, editor, Korytkowski, Marcin, editor, Scherer, Rafał, editor, Tadeusiewicz, Ryszard, editor, Zadeh, Lotfi A., editor, and Zurada, Jacek M., editor
- Published
- 2013
- Full Text
- View/download PDF
36. Compresibilidad de las imágenes
- Author
-
Abascal Jiménez, Alejandro, Universitat Autònoma de Barcelona. Escola d'Enginyeria, and Serra Sagristà, Joan
- Subjects
Lempel-Ziv ,Image compression ,Compressió de text ,Compressibility metrics ,Métricas compresibilidad ,Fuentes de compresibilidad ,Estudio comparativo ,Compresión de imágenes ,Burrows-Wheeler transform ,Compressibility sources ,Compresión de texto ,Mètriques compressibilitat ,Estudi comparatiu ,Text compression ,Fonts de compressibilitat ,Transformada de Burrows-Wheeler ,Comparative study ,Compressió d'imatges - Abstract
The work is divided into three different but related parts. Firstly, an in-depth study of 1D and 2D data compressibility is carried out. This includes understanding what compressibility is and why Shannon's entropy is not useful for our purposes. Secondly, a 2D data compressibility metric is defined, and two existing 1D data compressibility metrics are implemented (Lempel-Ziv and the Burrows-Wheeler transform). Finally, a series of experiments is run in order to compare the performance of the metrics on several sets of images and to test whether the new metric is well defined and able to capture the compressibility of the images.
- Published
- 2023
37. L-Systems for Measuring Repetitiveness
- Author
-
Navarro, Gonzalo and Urbina, Cristian
- Subjects
FOS: Computer and information sciences ,L-systems ,Formal Languages and Automata Theory (cs.FL) ,Text compression ,Mathematics of computing → Combinatorics on words ,Computer Science - Data Structures and Algorithms ,Repetitiveness measures ,Data Structures and Algorithms (cs.DS) ,Computer Science - Formal Languages and Automata Theory ,String morphisms ,Theory of computation → Data compression - Abstract
In order to use them for compression, we extend L-systems (without ε-rules) with two parameters d and n, and also a coding τ, which determines unambiguously a string w = τ(φ^d(s))[1:n], where φ is the morphism of the system, and s is its axiom. The length of the shortest description of an L-system generating w is known as 𝓁, and it is arguably a relevant measure of repetitiveness that builds on the self-similarities that arise in the sequence. In this paper, we deepen the study of the measure 𝓁 and its relation with a better-established measure called δ, which builds on substring complexity. Our results show that 𝓁 and δ are largely orthogonal, in the sense that one can be much larger than the other, depending on the case. This suggests that both mechanisms capture different kinds of regularities related to repetitiveness. We then show that the recently introduced NU-systems, which combine the capabilities of L-systems with bidirectional macro schemes, can be asymptotically strictly smaller than both mechanisms for the same fixed string family, which makes the size ν of the smallest NU-system the unique smallest reachable repetitiveness measure to date. We conclude that in order to achieve better compression, we should combine morphism substitution with copy-paste mechanisms.
LIPIcs, Vol. 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023), pages 25:1-25:17
- Published
- 2023
- Full Text
- View/download PDF
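The construction w = τ(φ^d(s))[1:n] described above can be sketched directly; the Thue-Morse-style morphism and coding below are a toy example chosen for illustration, not taken from the paper.

```python
def l_system_prefix(axiom, rules, coding, d, n):
    """Compute w = tau(phi^d(s))[1:n]: apply the morphism phi (given by
    `rules`) d times to the axiom s, apply the coding tau, keep n symbols."""
    s = axiom
    for _ in range(d):
        s = "".join(rules.get(c, c) for c in s)
    return "".join(coding.get(c, c) for c in s)[:n]

# Toy morphism whose coded expansion is the Thue-Morse sequence.
rules = {"a": "ab", "b": "ba"}
coding = {"a": "0", "b": "1"}
print(l_system_prefix("a", rules, coding, 4, 10))
```

The measure 𝓁 studied in the abstract is the size of the shortest such description (rules, coding, d, n) generating a given string: highly self-similar strings admit very short descriptions even when their plain length is huge.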
38. Simple Rules for Syllabification of Arabic Texts
- Author
-
Soori, Hussein, Platos, Jan, Snasel, Vaclav, Abdulla, Hussam, Snasel, Vaclav, editor, Platos, Jan, editor, and El-Qawasmeh, Eyas, editor
- Published
- 2011
- Full Text
- View/download PDF
39. Compression of Layered Documents
- Author
-
Carpentieri, Bruno, Zavoral, Filip, editor, Yaghob, Jakub, editor, Pichappan, Pit, editor, and El-Qawasmeh, Eyas, editor
- Published
- 2010
- Full Text
- View/download PDF
40. Efficient Algorithms for Two Extensions of LPF Table: The Power of Suffix Arrays
- Author
-
Crochemore, Maxime, Iliopoulos, Costas S., Kubica, Marcin, Rytter, Wojciech, Waleń, Tomasz, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, van Leeuwen, Jan, editor, Muscholl, Anca, editor, Peleg, David, editor, Pokorný, Jaroslav, editor, and Rumpe, Bernhard, editor
- Published
- 2010
- Full Text
- View/download PDF
41. Multi-Stream Word-Based Compression Algorithm for Compressed Text Search.
- Author
-
Öztürk, Emir, Mesut, Altan, and Diri, Banu
- Subjects
- *
ALGORITHMS , *TEXT files , *COMPUTER programming - Abstract
In this article, we present a novel word-based lossless compression algorithm for text files using a semi-static model. We named this method the ‘Multi-stream word-based compression algorithm’ (MWCA) because it stores the compressed forms of the words in three individual streams depending on their frequencies in the text, and stores two dictionaries and a bit vector as side information. In our experiments, MWCA produces a compression ratio of 3.23 bpc on average and 2.88 bpc for files larger than 50 MB; if a variable-length encoder such as Huffman coding is applied after MWCA, these ratios are reduced to 2.65 and 2.44 bpc, respectively. MWCA supports exact word matching without decompression, and its multi-stream approach reduces search time with respect to single-stream algorithms. Additionally, the multi-stream structure of MWCA reduces network load, since only the necessary streams need to be requested from the database. With the advantage of its fast compressed-search feature and multi-stream structure, we believe that MWCA is a good solution, especially for storing and searching big text data. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
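The stream-routing idea behind a frequency-split layout can be illustrated with a minimal sketch; the thresholds and the three-way split below are assumptions for illustration, and the real MWCA additionally keeps two dictionaries and a bit vector as side information, which this sketch omits.

```python
from collections import Counter

def split_streams(words, hi=3, mid=2):
    """Route each word occurrence to one of three streams by how often
    the word appears; frequent words would get the shortest codes, and
    a compressed search only needs to scan the relevant stream."""
    freq = Counter(words)
    streams = {"high": [], "mid": [], "low": []}
    for w in words:
        if freq[w] >= hi:
            streams["high"].append(w)
        elif freq[w] >= mid:
            streams["mid"].append(w)
        else:
            streams["low"].append(w)
    return streams

print(split_streams("a a a b b c".split()))
```

A query for a rare word then only touches the low-frequency stream, which is how a multi-stream layout can cut both search time and the amount of data fetched over the network.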
42. Efficient Approaches to Compute Longest Previous Non-overlapping Factor Array.
- Author
-
Chairungsee, Supaporn
- Subjects
- *
DATA compression , *ALGORITHMS , *RUN-length encoding , *CODING theory , *COMPUTER science - Abstract
In this article, we introduce new methods to compute the Longest Previous non-overlapping Factor (LPnF) table, which stores, for each position of a string, the maximal length of a factor re-occurring at that position without overlapping its earlier occurrence. The table is related to the Ziv-Lempel factorization of a text, which is useful for text and data compression, and it plays an important role in data compression, string algorithms, and computational biology. In this paper, we present three approaches to producing the LPnF table of a string: from its augmented position heap, from its position heap, and from its suffix heap. We also present experimental results for these three solutions. The algorithms run in linear time with linear memory space. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
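For reference, the LPnF definition itself can be written down naively; this quadratic-time sketch is only a specification of the table, not the linear-time heap-based algorithms the article proposes.

```python
def lpnf_table(s):
    """Naive LPnF: table[i] is the length of the longest factor starting
    at i that also starts at some earlier position j, with the two
    occurrences not overlapping (the earlier copy must end before i)."""
    n = len(s)
    table = [0] * n
    for i in range(n):
        best = 0
        for j in range(i):
            length = 0
            # extend the match only while the earlier copy stays left of i
            while (i + length < n and j + length < i
                   and s[j + length] == s[i + length]):
                length += 1
            best = max(best, length)
        table[i] = best
    return table

print(lpnf_table("abaab"))
```

The non-overlap constraint (j + length ≤ i) is exactly what distinguishes LPnF from the ordinary LPF table used in classic Ziv-Lempel factorization.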
43. A Sub-Pixel Gradient Compression Algorithm for Text Image Display on a Smart Device.
- Author
-
Kim, Kyudong, Lee, Chulhee, and Lee, Hyuk-Jae
- Subjects
- *
ALGORITHMS , *MACHINE theory , *ARTIFICIAL intelligence , *IMAGE recognition (Computer vision) , *IMAGE processing - Abstract
Smart devices such as televisions, tablets, and smartphones often display compound images consisting of various types of sub-images, including text and pictures. A text image displayed on a smart device is composed of red, green, and blue (RGB) sub-pixels with strong correlation among them. This paper proposes a new compression algorithm, called sub-pixel gradient compression (SPGC), that exploits this correlation in text to improve compression efficiency. The first step of the proposed algorithm converts the text into a de-colorized image in which RGB sub-pixels vary gradually. De-colorization is necessary to further enhance the correlation among RGB sub-pixels and thereby improve compression efficiency. From the de-colorized text, the next step estimates the slopes of the gradual variations among sub-pixels and then encodes the slopes and their associated information as a bitstream. SPGC is a relatively simple algorithm because its main operation is the estimation of sub-pixel gradients, and precise estimation of the gradients makes it possible to reduce degradation of text quality. Experimental results show that SPGC achieves lower complexity and a higher compression ratio than compression standards including JPEG, JPEG-2000, H.264, and HEVC. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
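The idea of modeling the RGB sub-pixels of rendered text as a gradual ramp can be sketched as follows; this is a toy linear model chosen for illustration, and the actual SPGC de-colorization step and bitstream format are not reproduced here.

```python
def subpixel_slopes(rgb_row):
    """Toy model: treat R, G, B of each pixel as three consecutive
    sub-pixel samples on a line, encoding (base, slope, residual).
    Strong R-G-B correlation keeps residuals small and cheap to code."""
    encoded = []
    for r, g, b in rgb_row:
        slope = (b - r) / 2.0        # assume linear variation R -> G -> B
        residual = g - (r + slope)   # deviation of G from that line
        encoded.append((r, slope, residual))
    return encoded

print(subpixel_slopes([(10, 20, 30), (0, 5, 8)]))
```

When the sub-pixels really do vary gradually, the residual term is near zero, so most of the information collapses into the base and slope values, which is the source of the compression gain the abstract describes.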
44. Non-repetitive DNA Compression Using Memoization
- Author
-
Venugopal, K. R., Srinivasa, K. G., Patnaik, L. M., Kacprzyk, Janusz, editor, Venugopal, K. R., Srinivasa, K. G., and Patnaik, L. M.
- Published
- 2009
- Full Text
- View/download PDF
45. LPF Computation Revisited
- Author
-
Crochemore, Maxime, Ilie, Lucian, Iliopoulos, Costas S., Kubica, Marcin, Rytter, Wojciech, Waleń, Tomasz, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Fiala, Jiří, editor, Kratochvíl, Jan, editor, and Miller, Mirka, editor
- Published
- 2009
- Full Text
- View/download PDF
46. Experiments in Text File Compression.
- Author
-
Rubin, Frank
- Subjects
- *
DATA compression , *COMPUTER files , *DATABASE management , *COMPUTER programming , *CODING theory , *DATABASES - Abstract
A system for the compression of data files, viewed as strings of characters, is presented. The method is general, and applies equally well to English, to PL/I, or to digital data. The system consists of an encoder, an analysis program, and a decoder. Two algorithms for encoding a string differ slightly from earlier proposals. The analysis program attempts to find an optimal set of codes for representing substrings of the file. Four new algorithms for this operation are described and compared. Various parameters in the algorithms are optimized to obtain a high degree of compression for sample texts. [ABSTRACT FROM AUTHOR]
- Published
- 1976
- Full Text
- View/download PDF
47. Compression of Concatenated Web Pages Using XBW
- Author
-
Šesták, Radovan, Lánský, Jan, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Geffert, Viliam, editor, Karhumäki, Juhani, editor, Bertoni, Alberto, editor, Preneel, Bart, editor, Návrat, Pavol, editor, and Bieliková, Mária, editor
- Published
- 2008
- Full Text
- View/download PDF
48. Edge-Guided Natural Language Text Compression
- Author
-
Adiego, Joaquín, Martínez-Prieto, Miguel A., de la Fuente, Pablo, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Ziviani, Nivio, editor, and Baeza-Yates, Ricardo, editor
- Published
- 2007
- Full Text
- View/download PDF
49. Non-repetitive DNA Sequence Compression Using Memoization
- Author
-
Srinivasa, K. G., Jagadish, M., Venugopal, K. R., Patnaik, L. M., Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Dough, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Istrail, Sorin, editor, Pevzner, Pavel, editor, Waterman, Michael, editor, Maglaveras, Nicos, editor, Chouvarda, Ioanna, editor, Koutkias, Vassilis, editor, and Brause, Rüdiger, editor
- Published
- 2006
- Full Text
- View/download PDF
50. Mapping Words into Codewords on PPM
- Author
-
Adiego, Joaquín, de la Fuente, Pablo, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Dough, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Crestani, Fabio, editor, Ferragina, Paolo, editor, and Sanderson, Mark, editor
- Published
- 2006
- Full Text
- View/download PDF