Or zip VS gzip VS bzip2 VS xz
In a previous article about the tar program I mentioned gzip and bzip2 compression methods as options to create a tarball (and I forgot xz).
To make amends today I will introduce the main methods to compress the file and I’ll do some tests to see how they behave.
I will consider zip, gzip, bzip2 and xv, i will not test compress another compression program present on Linux systems but now dated and surpassed by the other programs.
But as first thing an overview of these 4 methods/programs of compression
The ZIP file format is a data compression and archive format. A ZIP file contains one or more files that have been compressed to reduce file size, or stored as-is. The ZIP file format permits a number of compression algorithms.
The format was originally created in 1989 by Phil Katz, and was first implemented in PKWARE’s PKZIP utility, as a replacement for the previous ARC compression format by Thom Henderson. The ZIP format is now supported by many software utilities other than PKZIP (see List of file archivers). Microsoft has included built-in ZIP support (under the name “compressed folders”) in versions of its Windows operating system since 1998. Apple has included built-in ZIP support in Mac OS X 10.3 and later, including other compression formats.
GNU Gzip is a popular data compression program originally written by Jean-loup Gailly for the GNU project. Mark Adler wrote the decompression part.
gzip is any of several software applications used for file compression and decompression. The term usually refers to the GNU Project’s implementation of such a tool using Lempel-Ziv coding (LZ77), for which it stands for GNU zip. The program was created by Jean-Loup Gailly and Mark Adler as a free software replacement for the compress program used in early Unix systems, and intended for use by the Project. Version 0.1 was first publicly released on October 31, 1992, and version 1.0 followed in February 1993.
gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. DEFLATE was intended as a replacement for LZW and other patent-encumbered data compression algorithms which, at the time, limited the usability of compress and other popular archivers.
bzip2 is a free and open source lossless data compression algorithm and program developed by Julian Seward. Seward made the first public release of bzip2, version 0.15, in July 1996. The compressor’s stability and popularity grew over the next several years, and Seward released version 1.0 in late 2000.
bzip2 compresses most files more effectively than the older LZW (.Z) and Deflate (.zip and .gz) compression algorithms, but is considerably slower. LZMA is generally more efficient than bzip2, while having much faster decompression.
bzip2 compresses data in blocks of size between 100 and 900 kB and uses the Burrows–Wheeler transform to convert frequently-recurring character sequences into strings of identical letters. It then applies move-to-front transform and Huffman coding. bzip2’s ancestor bzip used arithmetic coding instead of Huffman. The change was made because of a software patent restriction.
bzip2 is asymmetric, as decompression is relatively fast. Motivated by the large CPU time required for compression, a modified version was created in 2003 called pbzip2 that supported multi-threading, giving almost linear speed improvements on multi-CPU and multi-core computers
XZ Utils is free general-purpose data compression software with high compression ratio. XZ Utils were written for POSIX-like systems, but also work on some not-so-POSIX systems. XZ Utils are the successor to LZMA Utils.
The core of the XZ Utils compression code is based on LZMA SDK, but it has been modified quite a lot to be suitable for XZ Utils. The primary compression algorithm is currently LZMA2, which is used inside the .xz container format. With typical files, XZ Utils create 30 % smaller output than gzip and 15 % smaller output than bzip2.
XZ Utils consist of several components:
* liblzma is a compression library with API similar to that of zlib.
* xz is a command line tool with syntax similar to that of gzip.
* xzdec is a decompression-only tool smaller than the full-featured xz tool.
* A set of shell scripts (xzgrep, xzdiff, etc.) have been adapted from gzip to ease viewing, grepping, and comparing compressed files.
* Emulation of command line tools of LZMA Utils eases transition from LZMA Utils to XZ Utils.
I always used the basic commands without adding options, and then gave as name of the file to test “filename” I used:
zip nomefile.zip nomefile
Test made with the text file Bible in Basic English (bbe) the file not compressed is 4467663 bytes (4 MB).
These are the results:
With a bigger text file 41 MB the results are:
Another test with standard options and text file of 64 MB, results:
64M 65427K big3.txt
26M 25763K big3.txt.zip
26M 25763K big3.txt.gz
24M 23816K big3.txt.bz2
2.2M 2237K big3.txt.xz
Further test, as suggested in the comments i’ve used the -9 flag (max compression), with these commands:
zip -9 big3.txt.zip big3.txt
gzip -9 -c big3.txt >> big3.txt.gz
bzip2 -9 -c big3.txt >> big3.txt.bz2
xz -e -9 -c big3.txt >> big3.txt.xz
And the results, size and type of file:
64M 65427K big3.txt
25M 25590K big3.txt.zip
25M 25590K big3.txt.gz
24M 23816K big3.txt.bz2
2.2M 2218K big3.txt.xz
I used a 700 MB file encoded in XviD, the compression results were (as expected) quite low:
700 MB file.avi
694 MB file.avi.zip
694 MB file.avi.gz
694 MB file.avi.xz
693 MB file.avi.bz2
On another site, I found very similar test that i includes for completeness, the tests were made with:
- Binaries 6,092,800 Bytes taken from my /usr/bin director
- Text Files 43,100,160 Bytes taken from kernel source at /usr/src/linux-headers-2.6.28-15
- MP3s 191,283,200 Bytes, a random selection of MP3s
- JPEGs 266803200 Bytes, a random selection of JPEG photos
- MPEG 432,240,640 Bytes, a random MPEG encoded video
- AVI 734,627,840 Bytes, a random AVI encoded video
I have tarred each category so that each test is only performed on one file (As far as I’m aware, tarring the files will not affect the compression tests). Each test has been run from a script 10 times and an average has been taken to make the results as fair as possible.
From empirical tests done, and those observed at other sites I would recommend LZMA (aka XZ), is a bit slower and consumes more CPU cycles, but also it has a bigger compression ratios on the text file, which typically are those who we are interested to compress.
Add that’s fully integrated with tar, (option J) and is easily integrable with logrotate
On the other hand if CPU or Time are an issue you can go with gzip that is also more used and know (if you have to send somewhere your archive).