Go back to parent page Introduction to Linux for bioinformatics
Which tool achieves the highest compression ratio?
Let's do a little test. Download this compressed file.
Create a folder named 'Compression_exercise' in your home directory. Copy the downloaded tar.gz file into it.
|
$ cd
$ mkdir Compression_exercise
$ cp Downloads/data_linux_training.tar.gz Compression_exercise/
|
Unpack the data_linux_training.tar.gz file.
|
$ cd Compression_exercise
$ tar -xvzf data_linux_training.tar.gz
|
Alternative: you can specify the options without the '-' sign.
$ tar xvfz data_linux_training.tar.gz
|
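If you would rather see what an archive contains before unpacking it, tar can list the contents with t in place of x. The sketch below builds a tiny stand-in archive so it can be run anywhere; the names demo_data and demo.tar.gz are made up for illustration and are not part of the course data.

```shell
# Work in a scratch directory so nothing in your home is touched
workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny stand-in archive (hypothetical names, not the course files)
mkdir demo_data
echo "ACGT" > demo_data/reads.txt
tar -czf demo.tar.gz demo_data

# t = list contents, z = filter through gzip, f = the archive name follows
listing=$(tar -tzf demo.tar.gz)
echo "$listing"

# x = extract; -C chooses the directory to unpack into
mkdir unpacked
tar -xzf demo.tar.gz -C unpacked
cat unpacked/demo_data/reads.txt
```

Listing first is a cheap way to check whether the archive unpacks into its own folder or scatters files into the current directory.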
Decompress the file DRR000542_1.fastq.subset.gz
|
$ gunzip DRR000542_1.fastq.subset.gz
|
Copy the DRR000542_1.fastq.subset file to a new file called 'bzip2_test.fastq'. Compress this file with bzip2.
|
$ cp DRR000542_1.fastq.subset bzip2_test.fastq
|
Tip: To find out how long a command takes to finish, put time in front of it.
$ time bzip2 bzip2_test.fastq
real 0m5.878s
user 0m5.728s
sys 0m0.112s
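Three different times are given. The one that matters to you is 'real', the wall-clock time that elapsed from start to finish; 'user' and 'sys' are the CPU time spent in the program itself and in the kernel on its behalf.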
|
Copy DRR000542_1.fastq.subset file to a new file called gzip_test.fastq and compress with gzip.
|
$ cp DRR000542_1.fastq.subset gzip_test.fastq
$ time gzip gzip_test.fastq
real 0m5.878s
user 0m5.728s
sys 0m0.112s
|
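The exercise asks you to copy the file before each test because gzip and bzip2 replace their input file with the compressed one by default. A minimal sketch of that behaviour, and of the -c option that writes to standard output and leaves the input alone, is given here; sample.fastq is a small generated stand-in, not the course file.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Generate a small FASTQ-like stand-in file (made-up reads, for illustration only)
awk 'BEGIN { for (i = 0; i < 5000; i++) print "@read" i "\nACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIII" }' > sample.fastq

# Default behaviour: sample_copy.fastq disappears, sample_copy.fastq.gz takes its place
cp sample.fastq sample_copy.fastq
gzip sample_copy.fastq
ls

# -c sends the compressed data to standard output, so sample.fastq is kept
bzip2 -c sample.fastq > sample.fastq.bz2
ls
```

With -c you choose the output name yourself, which can save the extra copy when disk space is tight.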
A relatively unknown package is lrzip, 'long range zip', which achieves very good results on big files. Let's try that one also!
Copy DRR000542_1.fastq.subset file to a new file called lrzip_test.fastq and compress with lrzip.
|
$ cp DRR000542_1.fastq.subset lrzip_test.fastq
$ lrzip lrzip_test.fastq
The program 'lrzip' is currently not installed. You can install it by typing:
sudo apt-get install lrzip
|
apt-get is the command-line tool for installing software on Debian-based distributions such as Ubuntu. It is the command-line equivalent of the Software Center.
$ sudo apt-get install lrzip
[sudo] password for joachim:
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
libnet-ip-perl diffstat libnet-dns-perl libparse-debianchangelog-perl
gir1.2-unique-3.0 kde-l10n-engb python-webpy libnet-domain-tld-perl
libemail-valid-perl libapt-pkg-perl python-flup kde-l10n-zhcn
Use 'apt-get autoremove' to remove them.
The following NEW packages will be installed:
lrzip
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 159 kB of archives.
After this operation, 313 kB of additional disk space will be used.
Get:1 http://be.archive.ubuntu.com/ubuntu/ precise/universe lrzip amd64 0.608-1 [159 kB]
Fetched 159 kB in 0s (780 kB/s)
Selecting previously unselected package lrzip.
(Reading database ... 662617 files and directories currently installed.)
Unpacking lrzip (from .../lrzip_0.608-1_amd64.deb) ...
Processing triggers for man-db ...
Setting up lrzip (0.608-1) ...
|
Now we can compress:
$ time lrzip lrzip_test.fastq
Output filename is: lrzip_test.fastq.lrz
lrzip_test.fastq - Compression Ratio: 6.724. Average Compression Speed: 0.563MB/s.
Total time: 00:03:02.97
real 3m3.026s
user 3m1.947s
sys 0m0.804s
|
Compare the sizes of the different resulting compressed files.
|
$ ls -lh *zip*
-rw------- 1 bits bits 17M Oct 22 14:06 bzip2_test.fastq.bz2
-rw------- 1 bits bits 21M Oct 22 14:06 gzip_test.fastq.gz
-rw------- 1 bits bits 104M Oct 22 14:06 lrzip_test.fastq
-rw------- 1 bits bits 16M Oct 22 14:10 lrzip_test.fastq.lrz
|
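The sizes reported by ls -lh are rounded, so it can help to compute the ratio from exact byte counts. The sketch below takes the ratio as original size divided by compressed size, which seems consistent with the 6.724 that lrzip reports for roughly 104M going down to 16M. It uses a generated stand-in file and only gzip and bzip2, on the assumption that lrzip may not be installed.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Small FASTQ-like stand-in file (made-up reads, for illustration only)
awk 'BEGIN { for (i = 0; i < 5000; i++) print "@read" i "\nACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIII" }' > sample.fastq

# -c keeps the original so that it can be measured afterwards
gzip -c sample.fastq > sample.fastq.gz
bzip2 -c sample.fastq > sample.fastq.bz2

# wc -c counts bytes; reading via < keeps the file name out of the output
orig_bytes=$(wc -c < sample.fastq)

for f in sample.fastq.gz sample.fastq.bz2; do
    comp_bytes=$(wc -c < "$f")
    # ratio = original size / compressed size; a larger value means better compression
    awk -v o="$orig_bytes" -v c="$comp_bytes" -v name="$f" \
        'BEGIN { printf "%s\t%d bytes\tratio %.2f\n", name, c, o / c }'
done
```

A file this repetitive compresses far better than real sequencing data, so the printed ratios say nothing about what to expect on the course file.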
Decide for yourself whether the smaller file is worth the extra time. In this run lrzip took about three minutes to produce a 16M file, whereas bzip2 reached 17M in under six seconds.
Put the three files in a newly created folder 'results', and make an archive of it.
|
$ mkdir results
$ mv *{bz2,q.gz,lrz} results/
$ ls results/
bzip2_test.fastq.bz2 gzip_test.fastq.gz lrzip_test.fastq.lrz
$ tar cvf results.tar results/
$ rm -rf results/
$ ls -lh
total 281M
-rw------- 1 bits bits 104M May 4 2011 ERX000016.test.fastq
-rw-r--r-- 1 bits bits 21M Oct 22 14:02 ERX000016.test.fastq.tar.gz
-rw------- 1 bits bits 104M Oct 22 14:06 lrzip_test.fastq
-rw-r--r-- 1 bits bits 53M Oct 22 14:28 results.tar
|
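Since rm -rf cannot be undone, one cautious variant is to check that the archive can be read back before deleting the folder. This is a sketch under the assumption that listing the archive and finding an expected file name is a sufficient check; the files a.txt and b.txt are placeholders.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Placeholder folder standing in for results/
mkdir results
echo "one" > results/a.txt
echo "two" > results/b.txt

# c = create, f = the archive name follows
tar -cf results.tar results/

# Remove the folder only if tar can list the archive and an expected file is in it
if tar -tf results.tar | grep -q "results/a.txt"; then
    rm -rf results/
fi

archived=$(tar -tf results.tar)
echo "$archived"
```

Note that plain tar only bundles files without compressing them, which is why results.tar above is about as large as the three compressed files combined.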