File compression has been popular ever since the emergence of email attachments. There was a time when we could share files only up to a limited size, and disk space was also a limiting factor, so people squeezed their data into every bit of storage they could manage. Then file compression programs appeared that could reduce file size, and Windows shipped with support for compressing disk contents to free up space. We still use file compression in day-to-day activities – downloading a compressed file from the internet and extracting it, optimizing images for the web by reducing their size, and a fair number of other applications.
File compression is not magic. Some of our readers might be wondering how a file can possibly be reduced in size while keeping its contents intact. This article explains the main file compression methods.
There are two common techniques used to compress data. Modern compression algorithms are faster and more sophisticated, but just to give you a basic idea of what compression is all about, we'll discuss the fundamentals here.
Data compression is mainly achieved by removing information units (bits and bytes) from the data. This may be done either by eliminating redundant bits or by stripping out parts of the actual information and replacing them with an aggregate value. The method that removes redundant bits is known as lossless compression, because the original data can be fully retrieved after decompression. The strip-down method cannot recover the original data, since some of it is discarded during the compression phase; hence the technique is known as lossy compression. Both methods have their trade-offs. Lossless compression can restore the original file exactly, but it may not be a suitable method for compressing small files; lossy compression is fast and achieves a higher compression ratio, but it is not recommended where data integrity is a priority.
How Lossless Compression is achieved
Lossless compression, as mentioned above, eliminates redundant information and hence reduces file size. Modern compression programs usually use this type of compression method. This method is most effective on large files where data is more likely to be redundant.
For instance, consider a text document or program code. The compression program analyzes the content, looks for repeated elements (characters, words or even patterns such as a group of words, or a special character following or preceding a particular word), and creates a statistical model of the content. It then stores the patterns, creates references, and replaces each occurrence of a pattern with its reference (known as model-mapping). References are smaller than the content they replace. In a large file, the probability of a particular pattern repeating again and again is quite high, and replacing each of those repetitions with its reference reduces the file size. Program code, for example, can have many statements repeated throughout the file. Compressing such a file losslessly would mean creating a statistical model based on the redundant code, building references from that model, replacing the redundant code with those references, and saving the references along with the data so that the decompression program can restore the original.
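To make this concrete, here is a toy Python sketch of the idea: repeated words are entered into a dictionary, each occurrence is replaced by a short numeric reference, and the dictionary is stored alongside the encoded data so the original text can be rebuilt exactly. The function and variable names are ours, purely for illustration; real archivers (DEFLATE, LZ77/LZ78 and friends) match byte-level patterns with far more sophisticated models and encode the references in as few bits as possible.

```python
def compress(text):
    """Replace each repeated word with a short numeric reference."""
    words = text.split(" ")
    dictionary = {}                      # word -> reference id
    encoded = []
    for word in words:
        if word not in dictionary:
            dictionary[word] = len(dictionary)
        encoded.append(dictionary[word])
    # The dictionary travels with the data so the process can be reversed.
    return dictionary, encoded

def decompress(dictionary, encoded):
    """Rebuild the original text from the stored references."""
    reverse = {ref: word for word, ref in dictionary.items()}
    return " ".join(reverse[ref] for ref in encoded)

original = "the cat sat on the mat and the dog sat on the mat"
dictionary, encoded = compress(original)
assert decompress(dictionary, encoded) == original   # lossless: nothing is lost
```

The key point is the last line: because every reference can be mapped back to the exact pattern it replaced, decompression reproduces the input bit for bit.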
Construction of the statistical model follows either a static or an adaptive approach. A static model is fixed in advance and applies the same assumptions irrespective of the data. An adaptive model begins with the same simple assumptions, but as the analysis proceeds it learns from the content and refines the model for the best possible compression.
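The difference can be sketched in a few lines. In the toy snippet below (our own illustration, not any particular program's algorithm), symbol counts are updated while the data is being scanned, which is the essence of an adaptive model; a static model would fix its assumptions before seeing any data.

```python
from collections import Counter

def adaptive_model(data):
    """Update symbol frequencies while the data is being scanned.

    This is only the bookkeeping: a real adaptive coder (adaptive Huffman
    or arithmetic coding, for example) would re-derive its codes from these
    counts after every symbol it reads.
    """
    counts = Counter()
    for symbol in data:
        counts[symbol] += 1    # the model "learns" from the content as it goes
    return counts

print(adaptive_model("abracadabra").most_common(3))   # [('a', 5), ('b', 2), ('r', 2)]
```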
How Lossy Compression is achieved
Lossy compression can be useful when the information being compressed is not critical. In other words, if some data loss is acceptable, lossy compression is an efficient method, as it can compress data to a much greater extent. It does this by stripping off the less important bits of the information and eliminating sequences of values that are very close to one another. Let's take an example.
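One very simple way "close values" get merged is quantization: every value is rounded to the nearest step on a coarser scale, so runs of nearly identical values become truly identical and then compress very well. The sketch below is purely illustrative, and the step size of 10 is an arbitrary choice.

```python
def quantize(samples, step=10):
    """Round every value to the nearest multiple of `step`."""
    return [step * round(s / step) for s in samples]

samples   = [101, 103, 98, 99, 102, 100]   # six nearly identical readings
quantized = quantize(samples)              # [100, 100, 100, 100, 100, 100]
# The run of identical values now compresses extremely well, but the
# original readings can never be recovered exactly.
print(quantized)
```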
Suppose you have an image 2000 pixels wide and 1500 pixels high. An image of this size is bound to be at least a few megabytes in file size. You want to add this image to a webpage of yours. Since web images are usually displayed at a fraction of that size, you have to resize the image. And since you are decreasing the image in size, some information needs to be discarded.
This can be done in several ways. One method removes pixels entirely, reducing the pixel dimensions (and hence the effective resolution) of the image. Another replaces groups of pixels with a single pixel value. Looking at the image at the pixel level, many successive pixels represent shades of the same color (for instance, a group of pixels making up the sea or the sky). Once the image is reduced in size, you can no longer see it in enough detail to make out every subtle shade, so these pixels are simply discarded when resizing the image. If you later try to restore the image, the previously discarded pixels can only be approximated by replicating neighboring pixels; the original image is lost.
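Here is a rough sketch of that second idea, collapsing each group of neighbouring pixels into a single value along one greyscale row. Real resizing filters (bilinear, Lanczos and so on) are considerably more careful, so treat this only as an illustration of where the information goes.

```python
def downscale_row(pixels, factor):
    """Replace each `factor`-sized block of a greyscale row with its average."""
    out = []
    for i in range(0, len(pixels), factor):
        block = pixels[i:i + factor]
        out.append(sum(block) // len(block))   # one value stands in for the whole block
    return out

row = [200, 201, 199, 200, 120, 118, 122, 121]   # sky pixels, then sea pixels
print(downscale_row(row, 4))   # [200, 120] -- the detail inside each block is gone
```

Scaling the row back up can only repeat those averages; the small variations between the original pixels are gone for good, which is exactly why this kind of compression is called lossy.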
The file compression programs we use (such as WinZip, WinRAR and 7-Zip) rely on lossless techniques, in some cases with their own proprietary formats. Some vendors, such as the maker of WinRAR, release only their extraction code, so other compression programs can read and extract (but not create) archives in the proprietary format. These programs can also optionally encrypt the compressed files so that other (possibly malicious) programs cannot read their contents.
So now you have some idea of what happens to your data when a file is compressed. We hope this information is helpful for all our curious readers out there. Keep reading our articles and give us your valuable feedback.
Enjoy reading 7labs!