Downloading and storing bioinformatics data
Create a project folder
The first thing to do when you start a bioinformatics project is to create a folder structure in which to store your data in an organised fashion.
As an example, we will download the rice genome from the Rice Annotation Project database. But first create the folder structure.
Create the following folder structure.
$ mkdir "Rice Example"
$ cd Rice\ Example
$ mkdir Genome\ data
$ cd Genome\ data
$ mkdir Sequence
$ mkdir Annotation
$ cd
Alternatively:
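A compact way to build the same structure, assuming your mkdir supports the -p option (the GNU and POSIX versions do), is to create all nested folders in one command. The scratch directory and the variable name project_root below are illustrative assumptions, not part of the exercise; in practice you would run the mkdir line from your home directory.

```shell
# Work in a scratch directory so the sketch does not touch existing folders
# (the variable name "project_root" is only illustrative).
project_root=$(mktemp -d)
cd "$project_root"

# -p creates every missing parent folder; the quotes protect the white space.
mkdir -p "Rice Example/Genome data/Sequence" "Rice Example/Genome data/Annotation"

# List the folders that were created.
find "Rice Example" -type d | sort
```

Because -p does not complain when a folder already exists, the command can be repeated safely.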
Be aware of white spaces on the command line!
On the command line, programs, options and arguments are separated by white spaces. If you choose a folder name containing a white space, Bash will interpret every word as a separate option or argument. So you have to tell Bash not to split the name at the white space. This can be done by:
- putting the string between quotes like ' or "
- escaping each white space with a backslash (\). See the examples above.
Hence, you might save yourself some trouble (and typing!) by putting _ instead of white spaces in names. Also make sure to use tab expansion, wherever possible!
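The difference between unquoted, quoted and escaped names can be tried out safely in a scratch directory. The following is a minimal sketch; the folder names are only examples.

```shell
# Scratch directory so the demonstration leaves no traces elsewhere.
cd "$(mktemp -d)"

# Unquoted: Bash splits at the white space and mkdir receives two arguments,
# so two folders named "Rice" and "Example" are created.
mkdir Rice Example

# Quoted or escaped: mkdir receives a single argument containing a space.
mkdir "Rice Example"
mkdir Genome\ data

# With an underscore there is nothing to quote or escape.
mkdir Rice_Example

# Show one name per line.
ls -1
```

The listing should show five folders, two of which were created by the single unquoted command.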
Download the genome data directly on the command line
You can fetch the rice genome from this link.
Download the genome data to the "Rice Example"/"Genome data"/Sequence folder. Use wget to download from the link.
Right-click on the download link and copy its address. The download link is: http://rapdb.dna.affrc.go.jp/download/archive/build5/IRGSPb5.fa.masked.gz
Go to the directory and execute wget:
$ cd # to go back to the home directory
$ cd Ric<tab>
$ cd Gen<tab>/Seq<tab>
$ wget http://rapdb.dna.affrc.go.jp/download/archive/build5/IRGSPb5.fa.masked.gz
--2013-10-15 09:36:01-- http://rapdb.dna.affrc.go.jp/download/archive/build5/IRGSPb5.fa.masked.gz
Resolving rapdb.dna.affrc.go.jp (rapdb.dna.affrc.go.jp)... 150.26.230.179
Connecting to rapdb.dna.affrc.go.jp (rapdb.dna.affrc.go.jp)|150.26.230.179|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 122168025 (117M) [application/x-gzip]
Saving to: `IRGSPb5.fa.masked.gz'
100%[======================================>] 122,168,025 973K/s in 2m 40s
2013-10-15 09:38:42 (747 KB/s) - `IRGSPb5.fa.masked.gz' saved [122168025/122168025]
$ ls
IRGSPb5.fa.masked.gz
All right. We have fetched our first genome sequence!
Did your data get through correctly?
Large or slow downloads like this one can take a long time, which leaves plenty of opportunity for the transfer to go wrong. Therefore, large downloads should always come with a checksum. You can find the md5 checksum on the downloads page. The md5 checksum is a short string calculated from the data that serves as its fingerprint: a change to the data will, in practice, produce a different checksum. Once the download has finished, you should calculate this string yourself with md5sum.
$ md5sum IRGSPb5.fa.masked.gz
7af391c32450de873f80806bbfaedf05 IRGSPb5.fa.masked.gz
You should go to the rice genome download page and compare this string with the MD5 checksum listed there. You can do this manually. Now that you know the concept of checksums, there is an easier way to verify the data using md5sum. Can you find the easier way?
Search how to use md5sum to check the downloaded files with the .md5 file from the website.
Check the man page:
$ man md5sum
It does not say much; at the end it refers to:
$ info coreutils 'md5sum invocation'
Reading the options, there is one option sounding promising:
`-c'
`--check'
    Read file names and checksum information (not data) from each FILE (or from stdin if no FILE was specified) and report whether the checksums match the contents of the named files.
This way we can check the download:
$ wget http://rapdb.dna.affrc.go.jp/download/archive/build5/IRGSPb5.fa.masked.gz.md5
--2013-10-15 09:47:02-- http://rapdb.dna.affrc.go.jp/download/archive/build5/IRGSPb5.fa.masked.gz.md5
Resolving rapdb.dna.affrc.go.jp (rapdb.dna.affrc.go.jp)... 150.26.230.179
Connecting to rapdb.dna.affrc.go.jp (rapdb.dna.affrc.go.jp)|150.26.230.179|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 55 [application/x-gzip]
Saving to: `IRGSPb5.fa.masked.gz.md5'
100%[======================================>] 55 --.-K/s in 0s
2013-10-15 09:47:03 (757 KB/s) - `IRGSPb5.fa.masked.gz.md5' saved [55/55]
$ ls
IRGSPb5.fa.masked.gz IRGSPb5.fa.masked.gz.md5
$ md5sum -c IRGSPb5.fa.masked.gz.md5
IRGSPb5.fa.masked.gz: OK
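What the check does when a file is damaged can be tried out without downloading anything. The following is a minimal sketch, assuming GNU md5sum; the file example.fa and its content are made up for the demonstration and stand in for a real download.

```shell
# Scratch directory so the demonstration leaves no traces elsewhere.
cd "$(mktemp -d)"

# Stand-in for a downloaded file (hypothetical name and content).
printf 'ACGTACGT\n' > example.fa

# Record the checksum in the "checksum  filename" format used by .md5 files.
md5sum example.fa > example.fa.md5

# Verification succeeds while the file is unchanged.
md5sum -c example.fa.md5

# Simulate a corrupted transfer by appending a line, then verify again.
printf 'N\n' >> example.fa
if md5sum -c example.fa.md5; then
    echo "checksum matches"
else
    echo "checksum mismatch: download the file again"
fi
```

md5sum -c also signals the result through its exit status, which is what makes it usable in an if statement or a script.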
Ensuring integrity of downloads
A handy tool is the DownThemAll add-on for Firefox, to which you provide the checksum at the time of download. It then automatically verifies the finished download against that checksum.
The Sequence Read Archive (SRA, formerly the Short Read Archive), which stores NGS data sets, makes use of Aspera to download data at great speeds while ensuring integrity. To download from the SRA using Aspera on Linux, follow this guide from EBI.
Extracting the data
What type of file have you downloaded?
$ file IRGSPb5.fa.masked.gz
IRGSPb5.fa.masked.gz: gzip compressed data, was "IRGSPb5.fa.masked", from Unix, last modified: Wed Aug 18 03:45:47 2010
It is a compressed file. Files are compressed to save storage space. Before using these files, you have to decompress them.
What can you do with this type of file? Check the command apropos.
$ apropos gzip
gzip (1) - compress or expand files
lz (1) - gunzips and shows a listing of a gzip'd tar'd archive
tgz (1) - makes a gzip'd tar archive
uz (1) - gunzips and extracts a gzip'd tar'd archive
zforce (1) - force a '.gz' extension on all gzip files
apropos is a command that helps you discover new commands. In case you have a type of file that you don't know about, use apropos to search for corresponding programs.
Decompress the file.
Check the man page of gzip.
From the man page:
gunzip [ -acfhlLnNrtvV ] [-S suffix] [ name ... ]
$ gunzip IRGSPb5.fa.masked.gz
$ ls
IRGSPb5.fa.masked IRGSPb5.fa.masked.gz.md5
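Note that gunzip replaces the compressed file with the decompressed one. If you would rather keep the .gz file, for example to check it against its .md5 file later, the -c option listed in the synopsis above writes the decompressed data to standard output instead. The following is a minimal sketch with a made-up file demo.fa in a scratch directory.

```shell
# Scratch directory so the demonstration leaves no traces elsewhere.
cd "$(mktemp -d)"

# Stand-in for a genome file (hypothetical name and content).
printf '>chr_demo\nACGTACGT\n' > demo.fa

# gzip replaces demo.fa with demo.fa.gz.
gzip demo.fa

# -t tests the integrity of the compressed file without extracting it.
gzip -t demo.fa.gz && echo "compressed file is intact"

# -c writes the decompressed data to standard output, so demo.fa.gz is kept.
gunzip -c demo.fa.gz > demo.fa

# Both the compressed and the decompressed file are now present.
ls -1
```

Keeping both copies costs extra disk space, so for large genomes you may prefer to delete the .gz file once you no longer need it.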