TEXT COMPRESSION OPTIMIZATION.

Authors :
BOLT BERANEK AND NEWMAN INC CAMBRIDGE MA
Brignetti, Mario C.
Kahn, Robert E.
Bjorkgren, David G.
Bobrow, Daniel G.
Source :
DTIC AND NTIS
Publication Year :
1967

Abstract

The report describes research on optimization techniques for the compression of text. Effort was concentrated on four areas: (a) the evaluation of alternative text segmentation procedures on the basis of the compression efficiency they provide; (b) the variability in efficiency that occurs when codes designed to suit a particular sample of text are applied to other samples of text; (c) the automatic reduction of the size of an encoding set, and the prediction of the effects of such size reductions; (d) the applicability of text compression techniques to document descriptor files. The pertinent conclusions are: (a) the text segmentation procedure adopted in the earlier research appears to be very close to optimal; (b) performance degrades significantly when encoding texts other than the ones used to obtain the code; (c) it is possible to predict quantitatively the effects of size reduction on compression efficiency, this being independent of the way the reduction is made; (d) document descriptor files are compressible using the techniques described. In addition, research was conducted on the statistics of language and on rate distortion theory. Motivated by the results obtained on the problem of efficiency variability, we developed generative models for the statistics of taxonomies, which are shown to be consistent with the available data. The aim of the research on rate distortion theory is, roughly, to predict how much is lost by overcompressing the text. The results presented pertain to the basic theory, which is only beginning to be developed.
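
Conclusion (b) can be illustrated with a small sketch. It is not the report's method: the abstract does not specify the segmentation procedure or the code construction, so this example assumes whitespace-delimited words as segments, a Huffman code built from one sample, and a hypothetical escape symbol (`ESCAPE`) followed by 8 bits per character for any segment absent from that sample. The two sample texts are invented for illustration.

```python
import heapq
from collections import Counter
from itertools import count

ESCAPE = "<esc>"  # hypothetical symbol for segments unseen in the training sample


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: n + 1 for s, n in d1.items()}
        merged.update({s: n + 1 for s, n in d2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


def average_bits(lengths, segments, escape_cost_bits=8):
    """Average bits per segment; unseen segments cost the escape code
    plus an assumed escape_cost_bits per character."""
    total = 0
    for seg in segments:
        if seg in lengths:
            total += lengths[seg]
        else:
            total += lengths[ESCAPE] + escape_cost_bits * len(seg)
    return total / len(segments)


def segment(text):
    # Assumption: whitespace-delimited lowercase words stand in for segments.
    return text.lower().split()


train = segment("the cat sat on the mat and the dog sat on the log")
other = segment("the stock market fell and the bond market rose on the news")

freqs = Counter(train)
freqs[ESCAPE] = 1  # reserve a code word for unseen segments
lengths = huffman_code_lengths(freqs)

bits_train = average_bits(lengths, train)
bits_other = average_bits(lengths, other)
print(f"bits/segment on the sample used to build the code: {bits_train:.2f}")
print(f"bits/segment on a different sample:                {bits_other:.2f}")
```

Under these assumptions the second figure is much larger than the first, because most segments of the second sample are missing from the code and fall back to the escape path. This is one simple mechanism behind the degradation the abstract reports.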

Details

Database :
OAIster
Journal :
DTIC AND NTIS
Notes :
text/html, English
Publication Type :
Electronic Resource
Accession number :
edsoai.ocn831542575
Document Type :
Electronic Resource