347 results for "TEXT COMPRESSION"
Search Results
2. A Hybrid Genetic Algorithm-Particle Swarm Optimization Approach for Enhanced Text Compression
- Author
-
Tara Nawzad Ahmad Al Attar
- Subjects
text compression ,genetic algorithms ,particle swarm optimization ,hybrid algorithm ,data storage ,Science - Abstract
Text compression is a necessity for efficient data storage and transmission, especially in the digital era, in which volumes of digital text have grown enormously. Traditional text compression methods, including Huffman coding and Lempel-Ziv-Welch, have limitations in adaptability and efficiency when dealing with such complex and diverse data. In this paper, we propose a hybrid method that combines a Genetic Algorithm (GA) with Particle Swarm Optimization (PSO) to optimize text compression, using the broad exploration capabilities of GA and the fast convergence of PSO. The experimental results show that the proposed hybrid GA-PSO approach yields much better compression ratios than the standalone methods, reducing the size to about 65% while retaining the integrity of the original content. The proposed method is also highly adaptable to various text forms and outperformed other state-of-the-art methods such as the Grey Wolf Optimizer, the Whale Optimization Algorithm, and the African Vulture Optimization Algorithm. These results support the hybrid GA-PSO method as promising for modern text compression.
- Published
- 2024
- Full Text
- View/download PDF
3. Learning-based short text compression using BERT models.
- Author
-
Öztürk, Emir and Mesut, Altan
- Subjects
DATA compression ,LANGUAGE models ,ENGLISH language ,SPEED - Abstract
Learning-based data compression methods have gained significant attention in recent years. Although these methods achieve higher compression ratios compared to traditional techniques, their slow processing times make them less suitable for compressing large datasets, and they are generally more effective for short texts rather than longer ones. In this study, MLMCompress, a word-based text compression method that can utilize any BERT masked language model, is introduced. The performance of MLMCompress is evaluated using four BERT models: two large models and two smaller models referred to as "tiny". The large models are used without training, while the smaller models are fine-tuned. The results indicate that MLMCompress, when using the best-performing model, achieved 38% higher compression ratios for English text and 42% higher compression ratios for multilingual text compared to NNCP, another learning-based method. Although the method does not yield better results than GPTZip, which has been developed in recent years, it achieves comparable outcomes while being up to 35 times faster in the worst-case scenario. Additionally, it demonstrated a 20% improvement in compression speed and a 180% improvement in decompression speed in the best case. Furthermore, MLMCompress outperforms traditional compression methods like Gzip and specialized short text compression methods such as Smaz and Shoco, particularly in compressing short texts, even when using smaller models. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
4. Learning-based short text compression using BERT models
- Author
-
Emir Öztürk and Altan Mesut
- Subjects
BERT ,Fine tuning ,Learning-based compression ,Text compression ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Learning-based data compression methods have gained significant attention in recent years. Although these methods achieve higher compression ratios compared to traditional techniques, their slow processing times make them less suitable for compressing large datasets, and they are generally more effective for short texts rather than longer ones. In this study, MLMCompress, a word-based text compression method that can utilize any BERT masked language model, is introduced. The performance of MLMCompress is evaluated using four BERT models: two large models and two smaller models referred to as “tiny”. The large models are used without training, while the smaller models are fine-tuned. The results indicate that MLMCompress, when using the best-performing model, achieved 38% higher compression ratios for English text and 42% higher compression ratios for multilingual text compared to NNCP, another learning-based method. Although the method does not yield better results than GPTZip, which has been developed in recent years, it achieves comparable outcomes while being up to 35 times faster in the worst-case scenario. Additionally, it demonstrated a 20% improvement in compression speed and a 180% improvement in decompression speed in the best case. Furthermore, MLMCompress outperforms traditional compression methods like Gzip and specialized short text compression methods such as Smaz and Shoco, particularly in compressing short texts, even when using smaller models.
- Published
- 2024
- Full Text
- View/download PDF
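The two records above describe MLMCompress, which stores each word as its rank in a BERT masked-language-model's prediction, so a good model turns most words into small, cheaply encodable numbers. The sketch below illustrates rank coding only, substituting a context-free unigram frequency model for BERT; every name in it is invented for the example, and it is not the authors' code.

```python
# Toy rank coding (a stand-in for masked-LM compression, not MLMCompress):
# a unigram model ranks the vocabulary by frequency, and each word is stored
# as its rank. With a strong predictive model most ranks are small, which an
# entropy coder could then compress tightly; here we stop at the rank stream.
from collections import Counter

def build_ranking(corpus_words):
    """Rank the vocabulary by descending frequency; rank 0 = most frequent."""
    ranked = [w for w, _ in Counter(corpus_words).most_common()]
    return {w: r for r, w in enumerate(ranked)}, ranked

def encode(words, rank_of):
    # Real codecs would entropy-code these ranks instead of storing them raw.
    return [rank_of[w] for w in words]

def decode(ranks, ranked):
    return [ranked[r] for r in ranks]
```

Decoding only needs the same model on the other side, which is why model-based codecs ship no per-text dictionary.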
5. XCompress: LLM assisted Python-based text compression toolkit
- Author
-
Emir Öztürk
- Subjects
Benchmarking ,Large language models ,Text compression ,Python ,Computer software ,QA76.75-76.765 - Abstract
This study introduces XCompress, a Python-based tool for effectively utilizing various compression algorithms. XCompress offers manual, brute-force, and Large Language Model (LLM) modes to determine the most suitable algorithm based on the type of text data. Its modular structure allows easy addition of new algorithms and includes functions for benchmarking and result comparison. Tests on diverse text types demonstrate the efficacy of the LLM-assisted Compression Selection Model (CSM). With XCompress, users can determine the most suitable method for their files. Additionally, researchers can easily compare different methods without needing to write any scripts or code.
- Published
- 2024
- Full Text
- View/download PDF
6. Computing All-vs-All MEMs in Grammar-Compressed Text
- Author
-
Díaz-Domínguez, Diego, Salmela, Leena, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Nardini, Franco Maria, editor, Pisanti, Nadia, editor, and Venturini, Rossano, editor
- Published
- 2023
- Full Text
- View/download PDF
7. An Empirical Analysis on Lossless Compression Techniques
- Author
-
Hossain, Mohammad Badrul, Rahman, Md. Nowroz Junaed, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Neri, Filippo, editor, Du, Ke-Lin, editor, Varadarajan, Vijayakumar, editor, San-Blas, Angel-Antonio, editor, and Jiang, Zhiyu, editor
- Published
- 2023
- Full Text
- View/download PDF
8. A New Method for Short Text Compression
- Author
-
Murat Aslanyurek and Altan Mesut
- Subjects
Machine learning ,text compression ,k-means ,clustering ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Short texts cannot be compressed effectively with general-purpose compression methods. Methods developed to compress short texts often use static dictionaries. To achieve high compression ratios, choosing a static dictionary suited to the text to be compressed is an important problem that needs to be solved. In this study, a method called WSDC (Word-based Static Dictionary Compression), which can compress short texts at a high ratio, and a model that uses iterative clustering to create the static dictionaries used in this method are proposed. The number of static dictionaries to be created can vary by running the k-Means clustering algorithm iteratively according to some rules. A method called DSWF (Dictionary Selection by Word Frequency) is also presented to determine which of the created dictionaries can compress the source text at the best ratio. Wikipedia article abstracts in 6 different languages were used as the dataset in the experiments. The developed WSDC method is compared with both general-purpose compression methods (Gzip, Bzip2, PPMd, Brotli and Zstd) and special methods used for the compression of short texts (shoco, b64pack and smaz). According to the test results, although WSDC is slower than some other methods, it achieves the best compression ratios for short texts smaller than 200 bytes, and better ratios than all methods except Zstd for short texts smaller than 1000 bytes.
- Published
- 2023
- Full Text
- View/download PDF
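The WSDC abstract above rests on the idea of word-based static-dictionary compression: frequent words from a training corpus map to short fixed codes, and unknown words are escaped. The toy below sketches that idea only (one-byte indices, a single dictionary); it is not the authors' WSDC, which builds multiple dictionaries via iterative k-Means and selects among them.

```python
# Toy word-based static-dictionary compression (not the authors' WSDC):
# the 255 most frequent training words map to one-byte indices; any other
# word is emitted literally behind an escape byte (0).
from collections import Counter

def build_dictionary(corpus_words, size=255):
    """Keep the `size` most frequent words; index 0 is reserved as escape."""
    ranked = [w for w, _ in Counter(corpus_words).most_common(size)]
    return {w: i + 1 for i, w in enumerate(ranked)}

def compress(text, dictionary):
    out = bytearray()
    for word in text.split():
        idx = dictionary.get(word)
        if idx is not None:
            out.append(idx)                     # one byte per known word
        else:
            raw = word.encode("utf-8")
            out += bytes([0, len(raw)]) + raw   # escape, length, literal bytes
    return bytes(out)

def decompress(blob, dictionary):
    inv = {i: w for w, i in dictionary.items()}
    words, i = [], 0
    while i < len(blob):
        if blob[i] == 0:                        # escaped literal word
            n = blob[i + 1]
            words.append(blob[i + 2:i + 2 + n].decode("utf-8"))
            i += 2 + n
        else:
            words.append(inv[blob[i]])
            i += 1
    return " ".join(words)
```

Because the dictionary is fixed in advance, nothing per-text has to be stored alongside the compressed bytes, which is exactly why static dictionaries pay off on very short inputs.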
9. A Study on Data Compression Algorithms for Its Efficiency Analysis
- Author
-
Rodrigues, Calvin, Jishnu, E. M., Nair, Chandu R., Soumya Krishnan, M., Howlett, Robert J., Series Editor, Jain, Lakhmi C., Series Editor, Karuppusamy, P., editor, Perikos, Isidoros, editor, and García Márquez, Fausto Pedro, editor
- Published
- 2022
- Full Text
- View/download PDF
10. Optimal alphabet for single text compression.
- Author
-
Allahverdyan, Armen and Khachatryan, Andranik
- Subjects
HUFFMAN codes ,DATA compression ,SIGNS & symbols - Abstract
A text written using symbols from a given alphabet can be compressed using the Huffman code, which minimizes the length of the encoded text. It is necessary, however, to employ a text-specific codebook, i.e. the symbol-codeword dictionary, to decode the original text. Thus, the compression performance should be evaluated by the full code length, i.e. the length of the encoded text plus the length of the codebook. We studied several alphabets for compressing texts – letters, n-grams of letters, syllables, words, and phrases. If only sufficiently short texts are retained, an alphabet of letters or two-grams of letters is optimal. For the majority of Project Gutenberg texts, the best alphabet (the one that minimizes the full code length) is given by syllables or words, depending on the representation of the codebook. Letter 3- and 4-grams, having on average comparable length to syllables/words, perform noticeably worse than syllables or words. Word 2-grams are also never the best alphabet, on account of having a very large codebook. We also show that the codebook representation is important – switching from a naive representation to a compact one significantly improves matters for alphabets with a large number of symbols, most notably words. Thus, meaning-expressing elements of the language (syllables or words) provide the best compression alphabet. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
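The abstract above evaluates alphabets by the full code length: encoded-text bits plus codebook bits. A minimal sketch of that accounting follows, using a standard heap-based Huffman construction and a naive codebook estimate (8 bits per symbol character); the codebook model is an assumption of this example, not the paper's representation.

```python
# Full code length = Huffman-encoded text + codebook, for a chosen tokenizer.
# The codebook estimate (8 bits per character of each symbol, plus its code
# length) is a naive stand-in for the paper's codebook representations.
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Optimal prefix-code length (in bits) per symbol, via Huffman merging."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries carry a tiebreaker so symbol dicts are never compared.
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tick, merged))
        tick += 1
    return heap[0][2]

def full_code_length(text, tokenize):
    """Encoded-text bits plus a naive codebook estimate for this alphabet."""
    tokens = tokenize(text)
    freqs = Counter(tokens)
    lengths = huffman_code_lengths(freqs)
    encoded = sum(lengths[t] for t in tokens)
    codebook = sum(8 * len(s) + lengths[s] for s in freqs)
    return encoded + codebook
```

Comparing `full_code_length(text, list)` (letters) with `full_code_length(text, str.split)` (words) on texts of growing length reproduces the paper's trade-off in miniature: richer alphabets shrink the encoded text but inflate the codebook.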
11. Medical Data Compression and Sharing Technology Based on Blockchain
- Author
-
Du, Yi, Yu, Hua, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Zhang, Zhao, editor, Li, Wei, editor, and Du, Ding-Zhu, editor
- Published
- 2020
- Full Text
- View/download PDF
12. Text Compression-Aided Transformer Encoding.
- Author
-
Li, Zuchao, Zhang, Zhuosheng, Zhao, Hai, Wang, Rui, Chen, Kehai, Utiyama, Masao, and Sumita, Eiichiro
- Subjects
NATURAL language processing ,ENCODING ,CURRENT transformers (Instrument transformer) ,INFORMATION modeling - Abstract
Text encoding is one of the most important steps in Natural Language Processing (NLP). It has been done well by the self-attention mechanism in the current state-of-the-art Transformer encoder, which has brought about significant improvements in the performance of many NLP tasks. Though the Transformer encoder may effectively capture general information in its resulting representations, the backbone information, meaning the gist of the input text, is not specifically focused on. In this paper, we propose explicit and implicit text compression approaches to enhance the Transformer encoding and evaluate models using this approach on several typical downstream tasks that rely on the encoding heavily. Our explicit text compression approaches use dedicated models to compress text, while our implicit text compression approach simply adds an additional module to the main model to handle text compression. We propose three ways of integration, namely backbone source-side fusion, target-side fusion, and both-side fusion, to integrate the backbone information into Transformer-based models for various downstream tasks. Our evaluation on benchmark datasets shows that the proposed explicit and implicit text compression approaches improve results in comparison to strong baselines. We therefore conclude, when comparing the encodings to the baseline models, text compression helps the encoders to learn better language representations. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
13. A survey on data compression techniques: From the perspective of data quality, coding schemes, data type and applications
- Author
-
Uthayakumar Jayasankar, Vengattaraman Thirumal, and Dhavachelvan Ponnurangam
- Subjects
Data compression ,Data redundancy ,Text compression ,Image compression ,Information theory ,Entropy encoding ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Explosive growth of data in the digital world leads to the need for efficient techniques to store and transmit data. Due to limited resources, data compression (DC) techniques are proposed to minimize the size of data being stored or communicated. As DC leads to effective utilization of available storage and communication bandwidth, numerous approaches have been developed in several aspects. In order to analyze how DC techniques and their applications have evolved, a detailed survey of many existing DC techniques is carried out to address current requirements in terms of data quality, coding schemes, type of data and applications. A comparative analysis is also performed to identify the contribution of the reviewed techniques in terms of their characteristics, underlying concepts, experimental factors and limitations. Finally, this paper provides insight into various open issues and research directions to explore promising areas for future developments.
- Published
- 2021
- Full Text
- View/download PDF
14. An Approach for LPF Table Computation
- Author
-
Chairungsee, Supaporn, Charuphanthuset, Thana, Barbosa, Simone Diniz Junqueira, Editorial Board Member, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Kotenko, Igor, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Yuan, Junsong, Founding Editor, Anderst-Kotsis, Gabriele, editor, Tjoa, A Min, editor, Khalil, Ismail, editor, Elloumi, Mourad, editor, Mashkoor, Atif, editor, Sametinger, Johannes, editor, Larrucea, Xabier, editor, Fensel, Anna, editor, Martinez-Gil, Jorge, editor, Moser, Bernhard, editor, Seifert, Christin, editor, Stein, Benno, editor, and Granitzer, Michael, editor
- Published
- 2019
- Full Text
- View/download PDF
15. A NEW METHOD PROPOSAL FOR IMPROVING THE CLASSIFICATION SUCCESS OF TURKISH TEXTS
- Author
-
Metin Bi̇lgi̇n
- Subjects
text classification ,natural language processing ,lzw ,text compression ,machine learning ,Technology ,Engineering (General). Civil engineering (General) ,TA1-2040 - Abstract
This study aims to predict the author of a document whose author is unknown. For this purpose, 6 columns by 6 different columnists were first passed through a preprocessing stage. Then, features were extracted from these texts with n-grams (2-3). Using the extracted features, the system was tested with 10-fold cross-validation on 6 different machine learning methods. The procedure up to this point is the method applied so far in the literature. Our proposal is to reduce the number of features by losslessly compressing the texts with the LZW algorithm after the preprocessing stage, and to investigate the effects of this on the success of the system. The preprocessed texts are compressed with the LZW algorithm in binary and decimal form. After compression, the system was tested with the features extracted with n-grams (2-3) on 6 different machine learning methods, and the results were examined for 5 different metrics. As a result of the study, binary-compressed texts achieved better results for both 2-grams and 3-grams on all 6 machine learning algorithms. Although decimal compression lagged behind the raw data for the Random Tree and Naïve Bayes algorithms, it obtained better results on the other 4 algorithms but fell behind in average success values. Overall, binary compression is more successful than the other two methods on all metrics. Although author recognition was performed in this study, it is thought that the proposed method can be used in all text classification tasks.
- Published
- 2019
- Full Text
- View/download PDF
16. A New Corpus of the Russian Social Network News Feed Paraphrases: Corpus Construction and Linguistic Feature Analysis
- Author
-
Pronoza, Ekaterina, Yagunova, Elena, Pronoza, Anton, Hutchison, David, Series Editor, Kanade, Takeo, Series Editor, Kittler, Josef, Series Editor, Kleinberg, Jon M., Series Editor, Mattern, Friedemann, Series Editor, Mitchell, John C., Series Editor, Naor, Moni, Series Editor, Pandu Rangan, C., Series Editor, Steffen, Bernhard, Series Editor, Terzopoulos, Demetri, Series Editor, Tygar, Doug, Series Editor, Castro, Félix, editor, Miranda-Jiménez, Sabino, editor, and González-Mendoza, Miguel, editor
- Published
- 2018
- Full Text
- View/download PDF
17. Reading in the Age of Compression.
- Author
-
Koepnick, Lutz
- Subjects
PDF (Computer file format) ,DIGITAL communications ,READING ,ELECTRONIC data processing ,READING comprehension - Abstract
Compression is often considered a royal road to process data in ever-shorter time and to cater to our desire to outspeed the accelerating transmission of information in the digital age. This article explores how different techniques of accelerated text dissemination and reading, such as consonant writing, speed-reading apps, and the PDF file format, borrow from the language of compression yet, precisely in so doing, obscure the constitutive multilayered temporality of reading and the embodied role of the reader. While discussing different methods aspiring to compress textual objects and processes of reading, the author illuminates hidden assumptions that accompany the rhetoric of text compression and compressed reading. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
18. A survey on data compression techniques: From the perspective of data quality, coding schemes, data type and applications.
- Author
-
Jayasankar, Uthayakumar, Thirumal, Vengattaraman, and Ponnurangam, Dhavachelvan
- Subjects
DATA quality ,DATA compression ,INFORMATION theory ,COMPARATIVE studies ,IMAGE compression - Abstract
• Systematic organization of Data Compression (DC) concepts with their importance, mathematical formulation and performance measures. • Critical investigation of various DC algorithms on the basis of data quality, coding schemes, data type and applications. • Potential research directions and open issues are suggested to explore possible future trends in DC. Explosive growth of data in the digital world leads to the need for efficient techniques to store and transmit data. Due to limited resources, data compression (DC) techniques are proposed to minimize the size of data being stored or communicated. As DC leads to effective utilization of available storage and communication bandwidth, numerous approaches have been developed in several aspects. In order to analyze how DC techniques and their applications have evolved, a detailed survey of many existing DC techniques is carried out to address current requirements in terms of data quality, coding schemes, type of data and applications. A comparative analysis is also performed to identify the contribution of the reviewed techniques in terms of their characteristics, underlying concepts, experimental factors and limitations. Finally, this paper provides insight into various open issues and research directions to explore promising areas for future developments. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
19. A Graph-Based Frequent Sequence Mining Approach to Text Compression
- Author
-
Oswald, C., Ajith Kumar, I., Avinash, J., Sivaselvan, B., Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Ghosh, Ashish, editor, Pal, Rajarshi, editor, and Prasath, Rajendra, editor
- Published
- 2017
- Full Text
- View/download PDF
20. Longest Previous Non-overlapping Factors Table Computation
- Author
-
Chairungsee, Supaporn, Crochemore, Maxime, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Gao, Xiaofeng, editor, Du, Hongwei, editor, and Han, Meng, editor
- Published
- 2017
- Full Text
- View/download PDF
21. Text Compression Based on Letter's Prefix in the Word.
- Author
-
AbuSafiya, Majed
- Subjects
DATA compression ,LETTERS ,FINITE state machines ,VOCABULARY ,ENCODING - Abstract
Huffman encoding [Huffman (1952)] is one of the best-known compression algorithms. In its basic use, only one encoding is given for the same letter in the text to compress. In this paper, a text compression algorithm based on Huffman encoding is proposed. Huffman encoding is used to give different encodings for the same letter depending on the prefix preceding it in the word. A deterministic finite automaton (DFA) that recognizes the words of the text is constructed. This DFA records the frequencies of the letters that label the transitions. Every state corresponds to one of the prefixes of the words of the text. For every state, a different Huffman encoding is defined for the letters that label the transitions leaving that state. These Huffman encodings are then used to encode the letters of the words in the text. This algorithm was implemented, and an experimental study showed a significant reduction in compression ratio over basic Huffman encoding. However, more time is needed to construct these codes. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
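The abstract above conditions each letter's code on the word prefix that precedes it. The sketch below illustrates only why that helps, not the paper's DFA construction: it compares the ideal (Shannon) bit cost of the letters with and without prefix conditioning on the same word list, using empirical frequencies as a proxy for the per-state Huffman codes.

```python
# Proxy for prefix-conditioned coding (not the paper's DFA implementation):
# measure ideal bits per letter from empirical frequencies, unconditioned
# vs. conditioned on the word prefix seen so far. Conditioning sharpens the
# next-letter distribution, so the conditioned total can only be smaller.
import math
from collections import Counter, defaultdict

def ideal_bits_unconditioned(words):
    """Shannon bits for all letters under one global letter distribution."""
    letters = [c for w in words for c in w]
    freq = Counter(letters)
    total = sum(freq.values())
    return sum(-math.log2(freq[c] / total) for c in letters)

def ideal_bits_prefix_conditioned(words):
    """Shannon bits when each letter is coded given its word prefix."""
    ctx = defaultdict(Counter)                  # prefix -> next-letter counts
    for w in words:
        for i, c in enumerate(w):
            ctx[w[:i]][c] += 1
    bits = 0.0
    for w in words:
        for i, c in enumerate(w):
            table = ctx[w[:i]]
            bits += -math.log2(table[c] / sum(table.values()))
    return bits
```

The paper's extra cost is the codebook per DFA state, which is why it reports longer code-construction times alongside the better compression ratio.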
22. Text Compression
- Author
-
Ferragina, Paolo, Nitto, Igor, Venturini, Rossano, Nascimento, Mario A., Section editor, Liu, Ling, editor, and Özsu, M. Tamer, editor
- Published
- 2018
- Full Text
- View/download PDF
23. Indexing Compressed Text.
- Author
-
Ferragina, Paolo and Manzini, Giovanni
- Subjects
COMPUTER operating systems ,CODING theory ,INFORMATION theory ,ALGORITHMS ,INFORMATION storage & retrieval systems - Abstract
We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form. Our first compressed data structure retrieves the occ occurrences of a pattern P[1, p] within a text T[1, n] in O(p + occ log^{1+ε} n) time for any chosen ε, 0 < ε < 1. This data structure uses at most 5nH_k(T) + o(n) bits of storage, where H_k(T) is the kth order empirical entropy of T. The space usage is Θ(n) bits in the worst case and o(n) bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows-Wheeler Transform, and can be regarded as a compressed suffix array. Our second compressed data structure achieves O(p + occ) query time using O(nH_k(T) log^ε n) + o(n) bits of storage for any chosen ε, 0 < ε < 1. Therefore, it provides optimal output-sensitive query time using o(n log n) bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows-Wheeler Transform and the LZ78 algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
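Both structures in the abstract above build on the Burrows-Wheeler Transform. A minimal sketch of the transform and its inverse follows; this is the naive O(n² log n) rotation-sort version for illustration only, whereas real FM-indexes derive the BWT from a suffix array and add rank/select machinery on top.

```python
# Minimal Burrows-Wheeler Transform and inverse (naive rotation-sort sketch,
# not an FM-index). '\0' serves as a unique end-of-text sentinel that sorts
# before every other character.

def bwt(text):
    """Last column of the sorted cyclic rotations of text + sentinel."""
    s = text + "\0"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def ibwt(last):
    """Invert the BWT by repeatedly prepending and re-sorting (table method)."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith("\0"))
    return row[:-1]
```

The transform is useful for both compression and indexing because it groups characters that share a right-context, producing long runs that simple coders exploit and that rank queries can navigate.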
24. Boosting Textual Compression in Optimal Linear Time.
- Author
-
Ferragina, Paolo, Giancarlo, Raffaele, Manzini, Giovanni, and Sciortino, Marinella
- Subjects
ALGORITHMS ,DATA structures ,ELECTRONIC data processing ,COMPUTER operating systems ,INFORMATION theory - Abstract
We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression performance guarantee. It displays the following remarkable properties: (a) it can turn any memoryless compressor into a compression algorithm that uses the "best possible" contexts; (b) it is very simple and optimal in terms of time; and (c) it admits a decompression algorithm again optimal in time. To the best of our knowledge, this is the first boosting technique displaying these properties. Technically, our boosting technique builds upon three main ingredients: the Burrows-Wheeler Transform, the Suffix Tree data structure, and a greedy algorithm to process them. Specifically, we show that there exists a proper partition of the Burrows-Wheeler Transform of a string s that shows a deep combinatorial relation with the kth order entropy of s. That partition can be identified via a greedy processing of the suffix tree of s with the aim of minimizing a proper objective function over its nodes. The final compressed string is then obtained by compressing individually each substring of the partition by means of the base compressor we wish to boost. Our boosting technique is inherently combinatorial because it does not need to assume any prior probabilistic model about the source emitting s, and it does not deploy any training, parameter estimation and learning. Various corollaries are derived from this main achievement. Among the others, we show analytically that using our booster, we get better compression algorithms than some of the best existing ones, that is, LZ77, LZ78, PPMC and the ones derived from the Burrows-Wheeler Transform. Further, we settle analytically some long-standing open problems about the algorithmic structure and the performance of BWT-based compressors. Namely, we provide the first family of BWT algorithms that do not use Move-To-Front or Symbol Ranking as a part of the compression process. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
25. Trigram-Based Vietnamese Text Compression
- Author
-
Nguyen, Vu H., Nguyen, Hien T., Duong, Hieu N., Snasel, Vaclav, Kacprzyk, Janusz, Series editor, Król, Dariusz, editor, Madeyski, Lech, editor, and Nguyen, Ngoc Thanh, editor
- Published
- 2016
- Full Text
- View/download PDF
26. Compressing Big Data: When the Rate of Convergence to the Entropy Matters
- Author
-
Aronica, Salvatore, Langiu, Alessio, Marzi, Francesca, Mazzola, Salvatore, Mignosi, Filippo, Nazzicone, Giulio, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Kotsireas, Ilias S., editor, Rump, Siegfried M., editor, and Yap, Chee K., editor
- Published
- 2016
- Full Text
- View/download PDF
27. Evolution of ItaSA and Subsfactory
- Author
-
Massidda, Serenella
- Published
- 2015
- Full Text
- View/download PDF
28. Classification of Scientific Texts Based on the Compression of Annotations to Publications.
- Author
-
Selivanova, I. V., Kosyakov, D. V., and Guskov, A. E.
- Abstract
This paper describes the possibility of establishing the semantic proximity of scientific texts by the method of their automatic classification based on the compression of annotations. The idea of the method is that compression algorithms such as PPM (prediction by partial matching) compress terminologically similar texts much better than distant ones. If a kernel of publications (an analogue of a training set) is formed for each classified topic, then the best compression ratio will indicate that the classified text belongs to the corresponding topic. Thirty thematic categories were determined; for each of them, annotations of approximately 500 publications were retrieved from the Scopus database, out of which 100 annotations for the kernel and 20 annotations for testing were selected in different ways. It was found that building a kernel from highly cited publications yielded an error level of up to 12%, against 32% in the case of random sampling. The quality of classification is also affected by the initial number of categories: the fewer the categories that participate in the classification and the more terminological differences exist between them, the higher its quality. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
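The paper above classifies texts by how well they compress together with a topic's kernel of annotations, using PPM. As a rough stand-in for PPM, the sketch below uses zlib (LZ77-based) to show the same principle; the kernel texts and the two-topic setup are invented for the example.

```python
# Classification by compression, with zlib standing in for the paper's PPM:
# a text is assigned to the topic whose kernel lets it compress best, i.e.
# whose kernel minimizes the extra compressed bytes the text costs.
import zlib

def extra_bytes(kernel, text):
    """Compressed size of kernel+text minus compressed size of kernel alone."""
    base = len(zlib.compress(kernel.encode("utf-8"), 9))
    both = len(zlib.compress((kernel + " " + text).encode("utf-8"), 9))
    return both - base

def classify(text, kernels):
    """kernels: dict mapping topic name -> concatenated training annotations."""
    return min(kernels, key=lambda topic: extra_bytes(kernels[topic], text))
```

A terminologically similar kernel gives the compressor long matches to copy from, so the text costs fewer extra bytes; that is the whole classification signal.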
29. GRAMMATICAL MEANS OF EXPANSION AND COMPRESSION OF THE LYRICAL SPACE OF POETIC WORKS FOR CHILDREN
- Subjects
text compression ,grammatical polynomials ,children's poetry ,text division ,grammatical binomials - Abstract
The report is devoted to the consideration of grammatical methods of creating compression and expansion of the poetic narrative in poetry for children. On the material of children’s poems by Serhiy Vakulenko and Lina Kostenko, the modern tendency of combining additional morphological and syntactic techniques that divide and connect the grammatical structure of the poem is considered. Particular attention is paid to morphological binomials and polynomials, hyphenated constructions that combine lexemes of the same part of speech. It is argued that grammatical units, which are elements of the form and content of a poetic text, participate in the intonation and punctuation linking of the text. At the same time, the newest technique is their combination with such traditional means of division as the use of uncommon sentences, direct speech, the creation of homogeneous chains, and division by dashes. The linguistic search of the masters of the word contributes to the poetic discourse and develops a creative attitude to the word in young readers.
- Published
- 2023
- Full Text
- View/download PDF
31. Experiments with a PPM Compression-Based Method for English-Chinese Bilingual Sentence Alignment
- Author
-
Liu, Wei, Chang, Zhipeng, Teahan, William J., Goebel, Randy, Series editor, Tanaka, Yuzuru, Series editor, Wahlster, Wolfgang, Series editor, Besacier, Laurent, editor, Dediu, Adrian-Horia, editor, and Martín-Vide, Carlos, editor
- Published
- 2014
- Full Text
- View/download PDF
32. Comparison of Entropy and Dictionary Based Text Compression in English, German, French, Italian, Czech, Hungarian, Finnish, and Croatian
- Author
-
Matea Ignatoski, Jonatan Lerga, Ljubiša Stanković, and Miloš Daković
- Subjects
arithmetic ,Lempel–Ziv–Welch (LZW) ,text compression ,encoding ,English ,German ,Mathematics ,QA1-939 - Abstract
The rapid growth in the amount of data in the digital world leads to the need for data compression, i.e., reducing the number of bits needed to represent a text file, an image, audio, or video content. Compressing data saves storage capacity and speeds up data transmission. In this paper, we focus on text compression and provide a comparison of algorithms (in particular, the entropy-based arithmetic and dictionary-based Lempel–Ziv–Welch (LZW) methods) for text compression in different languages (Croatian, Finnish, Hungarian, Czech, Italian, French, German, and English). The main goal is to answer the question: "How does the language of a text affect the compression ratio?" The results indicated that the compression ratio is affected by the size of the language alphabet and by the size and type of the text. For example, The European Green Deal was compressed by 75.79%, 76.17%, 77.33%, 76.84%, 73.25%, 74.63%, 75.14%, and 74.51% using the LZW algorithm, and by 72.54%, 71.47%, 72.87%, 73.43%, 69.62%, 69.94%, 72.42%, and 72% using the arithmetic algorithm for the English, German, French, Italian, Czech, Hungarian, Finnish, and Croatian versions, respectively.
- Published
- 2020
- Full Text
- View/download PDF
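As an illustration of the dictionary-based side of this comparison, a minimal LZW encoder can be sketched in a few lines; the 12-bit code width assumed in the ratio estimate below is for illustration only and is not taken from the paper.

```python
def lzw_compress(text):
    """Minimal LZW sketch: grow a dictionary of previously seen
    substrings and emit one code per longest known match."""
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = next_code
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes

sample = "the quick brown fox jumps over the lazy dog " * 20
codes = lzw_compress(sample)
# Rough estimate: 12 bits per output code vs. 8 bits per input character.
saved = 1 - (len(codes) * 12) / (len(sample) * 8)
print(f"roughly {saved:.0%} of the input bits removed")
```

Repetitive text compresses well because long repeated matches map to single codes; this is the effect the cross-language comparison above measures, since alphabet size and word structure determine how quickly useful dictionary entries accumulate.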
33. A Syllable-Based Technique for Uyghur Text Compression
- Author
-
Wayit Abliz, Hao Wu, Maihemuti Maimaiti, Jiamila Wushouer, Kahaerjiang Abiderexiti, Tuergen Yibulayin, and Aishan Wumaier
- Subjects
text compression ,uyghur ,syllable ,code table ,Information technology ,T58.5-58.64 - Abstract
To improve the utilization of text storage resources and the efficiency of data transmission, we proposed two syllable-based Uyghur text compression coding schemes. First, based on statistics of syllable coverage in the corpus text, we constructed 12-bit and 16-bit syllable code tables and added commonly used symbols, such as punctuation marks and ASCII characters, to the tables. To enable the coding scheme to process Uyghur texts mixed with symbols from other languages, we introduced a flag code in the compression process to distinguish Unicode encodings that are not in the code table. The experiments showed that the 12-bit coding scheme achieved an average compression ratio of 0.3 on Uyghur texts smaller than 4 KB and that the 16-bit coding scheme achieved an average compression ratio of 0.5 on texts smaller than 2 KB. Our compression schemes outperformed GZip, BZip2, and the LZW algorithm on short texts and can be effectively applied to the compression of Uyghur short texts for storage and applications.
- Published
- 2020
- Full Text
- View/download PDF
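The flag-code mechanism described in the abstract can be sketched as follows; the table contents, the 12-bit width, and the reserved `FLAG` value are illustrative assumptions, not the actual tables from the paper.

```python
FLAG = 0xFFF  # hypothetical reserved 12-bit code: "next value is a raw code point"

def build_table(units):
    """Assign each syllable/symbol a fixed-width code, keeping FLAG reserved."""
    return {u: i for i, u in enumerate(units) if i < FLAG}

def encode(units, table):
    out = []
    for u in units:
        if u in table:
            out.append(table[u])   # one table code
        else:
            out.append(FLAG)       # escape: unit not in the table
            out.append(ord(u))     # emit its raw Unicode code point
    return out

table = build_table(["ba", "la", "qa", ".", " "])
print(encode(["ba", "la", " ", "X"], table))  # "X" escapes via FLAG
```

The escape path is what lets a fixed code table cope with mixed-language input: anything outside the table costs extra bits but never breaks the decoder.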
34. Compact and Fast Indexes for Translation Related Tasks
- Author
-
Costa, Jorge, Gomes, Luís, Lopes, Gabriel P., Russo, Luís M. S., Brisaboa, Nieves R., Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Correia, Luís, editor, Reis, Luís Paulo, editor, and Cascalho, José, editor
- Published
- 2013
- Full Text
- View/download PDF
35. Associative Text Representation and Correction
- Author
-
Horzyk, Adrian, Gadamer, Marcin, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Rutkowski, Leszek, editor, Korytkowski, Marcin, editor, Scherer, Rafał, editor, Tadeusiewicz, Ryszard, editor, Zadeh, Lotfi A., editor, and Zurada, Jacek M., editor
- Published
- 2013
- Full Text
- View/download PDF
36. Compresibilidad de las imágenes
- Author
-
Abascal Jiménez, Alejandro, Universitat Autònoma de Barcelona. Escola d'Enginyeria, and Serra Sagristà, Joan
- Subjects
Lempel-Ziv ,Image compression ,Compressió de text ,Compressibility metrics ,Métricas compresibilidad ,Fuentes de compresibilidad ,Estudio comparativo ,Compresión de imágenes ,Burrows-Wheeler transform ,Compressibility sources ,Compresión de texto ,Mètriques compressibilitat ,Estudi comparatiu ,Text compression ,Fonts de compressibilitat ,Transformada de Burrows-Wheeler ,Comparative study ,Compressió d'imatges - Abstract
The work is divided into three different but related parts. Firstly, an in-depth study of 1D and 2D data compressibility is carried out. This includes understanding what compressibility is and why Shannon's entropy is not useful for our purposes. Secondly, a 2D data compressibility metric is defined, and two existing 1D data compressibility metrics are implemented (Lempel-Ziv and the Burrows-Wheeler transform). Finally, a series of experiments is run in order to compare the performance of the metrics on several sets of images and to test whether the new metric is well defined and able to capture the compressibility of the images.
- Published
- 2023
37. L-Systems for Measuring Repetitiveness
- Author
-
Navarro, Gonzalo and Urbina, Cristian
- Subjects
FOS: Computer and information sciences ,L-systems ,Formal Languages and Automata Theory (cs.FL) ,Text compression ,Mathematics of computing → Combinatorics on words ,Computer Science - Data Structures and Algorithms ,Repetitiveness measures ,Data Structures and Algorithms (cs.DS) ,Computer Science - Formal Languages and Automata Theory ,String morphisms ,Theory of computation → Data compression - Abstract
In order to use them for compression, we extend L-systems (without ε-rules) with two parameters d and n, and also a coding τ, which determines unambiguously a string w = τ(φ^d(s))[1:n], where φ is the morphism of the system, and s is its axiom. The length of the shortest description of an L-system generating w is known as 𝓁, and it is arguably a relevant measure of repetitiveness that builds on the self-similarities that arise in the sequence. In this paper, we deepen the study of the measure 𝓁 and its relation with a better-established measure called δ, which builds on substring complexity. Our results show that 𝓁 and δ are largely orthogonal, in the sense that one can be much larger than the other, depending on the case. This suggests that both mechanisms capture different kinds of regularities related to repetitiveness. We then show that the recently introduced NU-systems, which combine the capabilities of L-systems with bidirectional macro schemes, can be asymptotically strictly smaller than both mechanisms for the same fixed string family, which makes the size ν of the smallest NU-system the unique smallest reachable repetitiveness measure to date. We conclude that in order to achieve better compression, we should combine morphism substitution with copy-paste mechanisms.
LIPIcs, Vol. 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023), pages 25:1-25:17
- Published
- 2023
- Full Text
- View/download PDF
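The construction w = τ(φ^d(s))[1:n] described above can be sketched directly; the Thue-Morse-style morphism and coding below are a toy example chosen for illustration, not taken from the paper.

```python
def l_system_prefix(axiom, rules, coding, d, n):
    """Compute w = tau(phi^d(s))[1:n]: apply the morphism phi (given by
    `rules`) d times to the axiom s, apply the coding tau, keep n symbols."""
    s = axiom
    for _ in range(d):
        s = "".join(rules.get(c, c) for c in s)
    return "".join(coding.get(c, c) for c in s)[:n]

# Toy morphism whose coded expansion is the Thue-Morse sequence.
rules = {"a": "ab", "b": "ba"}
coding = {"a": "0", "b": "1"}
print(l_system_prefix("a", rules, coding, 4, 10))
```

The measure 𝓁 studied in the abstract is the size of the shortest such description (rules, coding, d, n) generating a given string: highly self-similar strings admit very short descriptions even when their plain length is huge.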
38. Simple Rules for Syllabification of Arabic Texts
- Author
-
Soori, Hussein, Platos, Jan, Snasel, Vaclav, Abdulla, Hussam, Snasel, Vaclav, editor, Platos, Jan, editor, and El-Qawasmeh, Eyas, editor
- Published
- 2011
- Full Text
- View/download PDF
39. Compression of Layered Documents
- Author
-
Carpentieri, Bruno, Zavoral, Filip, editor, Yaghob, Jakub, editor, Pichappan, Pit, editor, and El-Qawasmeh, Eyas, editor
- Published
- 2010
- Full Text
- View/download PDF
40. Efficient Algorithms for Two Extensions of LPF Table: The Power of Suffix Arrays
- Author
-
Crochemore, Maxime, Iliopoulos, Costas S., Kubica, Marcin, Rytter, Wojciech, Waleń, Tomasz, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, van Leeuwen, Jan, editor, Muscholl, Anca, editor, Peleg, David, editor, Pokorný, Jaroslav, editor, and Rumpe, Bernhard, editor
- Published
- 2010
- Full Text
- View/download PDF
41. Multi-Stream Word-Based Compression Algorithm for Compressed Text Search.
- Author
-
Öztürk, Emir, Mesut, Altan, and Diri, Banu
- Subjects
- *
ALGORITHMS , *TEXT files , *COMPUTER programming - Abstract
In this article, we present a novel word-based lossless compression algorithm for text files using a semi-static model. We named this method the ‘Multi-stream word-based compression algorithm’ (MWCA) because it stores the compressed forms of the words in three individual streams depending on their frequencies in the text, and stores two dictionaries and a bit vector as side information. In our experiments, MWCA produces a compression ratio of 3.23 bpc on average and 2.88 bpc for files larger than 50 MB; if a variable-length encoder such as Huffman coding is applied after MWCA, these ratios are reduced to 2.65 and 2.44 bpc, respectively. MWCA supports exact word matching without decompression, and its multi-stream approach reduces search time with respect to single-stream algorithms. Additionally, the multi-stream structure of MWCA reduces network load, since only the necessary streams need to be requested from the database. With the advantage of its fast compressed-search feature and multi-stream structure, we believe that MWCA is a good solution, especially for storing and searching big text data. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
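The stream-routing idea behind a frequency-split layout can be illustrated with a minimal sketch; the thresholds and the three-way split below are assumptions for illustration, and the real MWCA additionally keeps two dictionaries and a bit vector as side information, which this sketch omits.

```python
from collections import Counter

def split_streams(words, hi=3, mid=2):
    """Route each word occurrence to one of three streams by how often
    the word appears; frequent words would get the shortest codes, and
    a compressed search only needs to scan the relevant stream."""
    freq = Counter(words)
    streams = {"high": [], "mid": [], "low": []}
    for w in words:
        if freq[w] >= hi:
            streams["high"].append(w)
        elif freq[w] >= mid:
            streams["mid"].append(w)
        else:
            streams["low"].append(w)
    return streams

print(split_streams("a a a b b c".split()))
```

A query for a rare word then only touches the low-frequency stream, which is how a multi-stream layout can cut both search time and the amount of data fetched over the network.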
42. Efficient Approaches to Compute Longest Previous Non-overlapping Factor Array.
- Author
-
Chairungsee, Supaporn
- Subjects
- *
DATA compression , *ALGORITHMS , *RUN-length encoding , *CODING theory , *COMPUTER science - Abstract
In this article, we introduce new methods to compute the Longest Previous non-overlapping Factor (LPnF) table, which stores, for each position of a string, the maximal length of a factor re-occurring at that position without overlapping its earlier occurrence. The table is related to the Ziv-Lempel factorization of a text, which is useful for text and data compression, and it plays an important role in data compression, string algorithms, and computational biology. In this paper, we present three approaches to producing the LPnF table of a string: from its augmented position heap, from its position heap, and from its suffix heap. We also present experimental results for these three solutions. The algorithms run in linear time with linear memory space. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
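For reference, the LPnF definition itself can be written down naively; this quadratic-time sketch is only a specification of the table, not the linear-time heap-based algorithms the article proposes.

```python
def lpnf_table(s):
    """Naive LPnF: table[i] is the length of the longest factor starting
    at i that also starts at some earlier position j, with the two
    occurrences not overlapping (the earlier copy must end before i)."""
    n = len(s)
    table = [0] * n
    for i in range(n):
        best = 0
        for j in range(i):
            length = 0
            # extend the match only while the earlier copy stays left of i
            while (i + length < n and j + length < i
                   and s[j + length] == s[i + length]):
                length += 1
            best = max(best, length)
        table[i] = best
    return table

print(lpnf_table("abaab"))
```

The non-overlap constraint (j + length ≤ i) is exactly what distinguishes LPnF from the ordinary LPF table used in classic Ziv-Lempel factorization.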
43. A Sub-Pixel Gradient Compression Algorithm for Text Image Display on a Smart Device.
- Author
-
Kim, Kyudong, Lee, Chulhee, and Lee, Hyuk-Jae
- Subjects
- *
ALGORITHMS , *MACHINE theory , *ARTIFICIAL intelligence , *IMAGE recognition (Computer vision) , *IMAGE processing - Abstract
Smart devices such as televisions, tablets, and smartphones often display compound images consisting of various types of sub-images, including text and pictures. A text image displayed on a smart device is composed of red, green, and blue (RGB) sub-pixels with strong correlation among them. This paper proposes a new compression algorithm, called sub-pixel gradient compression (SPGC), that exploits this correlation in text to improve compression efficiency. The first step of the proposed algorithm converts the text into a de-colorized image in which RGB sub-pixels vary gradually. De-colorization is necessary to further enhance the correlation among RGB sub-pixels and thereby improve compression efficiency. From the de-colorized text, the next step estimates the slopes of the gradual variations among sub-pixels and then encodes the slopes and their associated information as a bitstream. SPGC is a relatively simple algorithm because its main operation is the estimation of sub-pixel gradients, and precise estimation of the gradients makes it possible to reduce degradation of text quality. Experimental results show that SPGC achieves lower complexity and a higher compression ratio than compression standards including JPEG, JPEG-2000, H.264, and HEVC. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
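The idea of modeling the RGB sub-pixels of rendered text as a gradual ramp can be sketched as follows; this is a toy linear model chosen for illustration, and the actual SPGC de-colorization step and bitstream format are not reproduced here.

```python
def subpixel_slopes(rgb_row):
    """Toy model: treat R, G, B of each pixel as three consecutive
    sub-pixel samples on a line, encoding (base, slope, residual).
    Strong R-G-B correlation keeps residuals small and cheap to code."""
    encoded = []
    for r, g, b in rgb_row:
        slope = (b - r) / 2.0        # assume linear variation R -> G -> B
        residual = g - (r + slope)   # deviation of G from that line
        encoded.append((r, slope, residual))
    return encoded

print(subpixel_slopes([(10, 20, 30), (0, 5, 8)]))
```

When the sub-pixels really do vary gradually, the residual term is near zero, so most of the information collapses into the base and slope values, which is the source of the compression gain the abstract describes.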
44. Non-repetitive DNA Compression Using Memoization
- Author
-
Venugopal, K. R., Srinivasa, K. G., Patnaik, L. M., Kacprzyk, Janusz, editor, Venugopal, K. R., Srinivasa, K. G., and Patnaik, L. M.
- Published
- 2009
- Full Text
- View/download PDF
45. LPF Computation Revisited
- Author
-
Crochemore, Maxime, Ilie, Lucian, Iliopoulos, Costas S., Kubica, Marcin, Rytter, Wojciech, Waleń, Tomasz, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Fiala, Jiří, editor, Kratochvíl, Jan, editor, and Miller, Mirka, editor
- Published
- 2009
- Full Text
- View/download PDF
46. Experiments in Text File Compression.
- Author
-
Rubin, Frank
- Subjects
- *
DATA compression , *COMPUTER files , *DATABASE management , *COMPUTER programming , *CODING theory , *DATABASES - Abstract
A system for the compression of data files, viewed as strings of characters, is presented. The method is general, and applies equally well to English, to PL/I, or to digital data. The system consists of an encoder, an analysis program, and a decoder. Two algorithms for encoding a string differ slightly from earlier proposals. The analysis program attempts to find an optimal set of codes for representing substrings of the file. Four new algorithms for this operation are described and compared. Various parameters in the algorithms are optimized to obtain a high degree of compression for sample texts. [ABSTRACT FROM AUTHOR]
- Published
- 1976
- Full Text
- View/download PDF
47. Compression of Concatenated Web Pages Using XBW
- Author
-
Šesták, Radovan, Lánský, Jan, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Geffert, Viliam, editor, Karhumäki, Juhani, editor, Bertoni, Alberto, editor, Preneel, Bart, editor, Návrat, Pavol, editor, and Bieliková, Mária, editor
- Published
- 2008
- Full Text
- View/download PDF
48. Edge-Guided Natural Language Text Compression
- Author
-
Adiego, Joaquín, Martínez-Prieto, Miguel A., de la Fuente, Pablo, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Ziviani, Nivio, editor, and Baeza-Yates, Ricardo, editor
- Published
- 2007
- Full Text
- View/download PDF
49. Non-repetitive DNA Sequence Compression Using Memoization
- Author
-
Srinivasa, K. G., Jagadish, M., Venugopal, K. R., Patnaik, L. M., Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Dough, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Istrail, Sorin, editor, Pevzner, Pavel, editor, Waterman, Michael, editor, Maglaveras, Nicos, editor, Chouvarda, Ioanna, editor, Koutkias, Vassilis, editor, and Brause, Rüdiger, editor
- Published
- 2006
- Full Text
- View/download PDF
50. Mapping Words into Codewords on PPM
- Author
-
Adiego, Joaquín, de la Fuente, Pablo, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Dough, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Crestani, Fabio, editor, Ferragina, Paolo, editor, and Sanderson, Mark, editor
- Published
- 2006
- Full Text
- View/download PDF