Downloading and storing bioinformatics data

From BITS wiki

Create a project folder

The first thing to do when you start a bioinformatics project is to create a folder structure in which you can store your data in an organised fashion.


As an example, we will download the rice genome from the Rice Annotation Project database. But first create the folder structure.
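
The folder structure could be created as in the following minimal sketch. The folder names (rice_project, data/genome, results, scripts) and the PROJECT_DIR variable are assumptions made for illustration, not names prescribed by this page.

```shell
# Hypothetical project location; override by setting PROJECT_DIR beforehand.
project="${PROJECT_DIR:-./rice_project}"

# -p creates missing parent folders and does not complain
# if a folder already exists.
mkdir -p "$project/data/genome" "$project/results" "$project/scripts"

# Show what was created.
ls -R "$project"
```

Keeping raw data, results and scripts in separate folders makes it easier to see later which files were downloaded and which were generated.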

Be aware of white spaces on the command line!

On the command line, programs, options and arguments are separated by white spaces. If you use a folder name containing a white space, Bash will interpret every word of that name as a separate option or argument. So you have to tell Bash not to split the name at the white space. This can be done by:

  1. putting the string between quotes, either ' or "
  2. escaping each white space with a backslash (\). See the examples above.

Hence, you might save yourself some trouble (and typing!) by using _ instead of white spaces in names. Also make sure to use tab completion wherever possible!
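
The effect of quoting and escaping can be tried out with the short sketch below. It works in a temporary folder, and all folder names are made up for the demonstration.

```shell
# Work in a throw-away folder so nothing else is touched.
demo_dir="$(mktemp -d)"
(
    cd "$demo_dir" || exit 1

    mkdir 'rice genome v1'     # single quotes: one folder
    mkdir "rice genome v2"     # double quotes: one folder
    mkdir rice\ genome\ v3     # escaped white spaces: one folder
    mkdir rice genome v4       # no quoting: THREE folders (rice, genome, v4)
    mkdir rice_genome_v5       # underscores: no quoting needed

    ls -1
)
```

The unquoted command does not fail. It silently creates three separate folders, which is why this mistake is easy to overlook.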

Download the genome data directly on the command line

You can fetch the rice genome from this link.
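
A download from the command line could look like the sketch below, assuming wget is installed. The download_genome function and the GENOME_URL variable are hypothetical helpers introduced here; the actual address has to be copied from the link mentioned above.

```shell
# download_genome URL DEST_DIR
# Hypothetical helper: downloads URL into DEST_DIR using wget.
download_genome() {
    if [ "$#" -ne 2 ]; then
        echo "usage: download_genome URL DEST_DIR" >&2
        return 1
    fi
    mkdir -p "$2" &&
    # -c resumes a partially downloaded file, -P sets the target folder.
    wget -c -P "$2" "$1"
}

# Example call; GENOME_URL is a placeholder you must set yourself:
# download_genome "$GENOME_URL" rice_project/data/genome
```

Because of the -c option, running the same command again after an interrupted transfer continues where it stopped instead of starting over.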

All right. We have fetched our first genome sequence!

Did your data get through correctly?

Large or slow downloads like this one can take a long time, which leaves plenty of opportunity for the transfer to go wrong. Therefore, large downloads should always come with a checksum. You can find the md5 checksum on the downloads page. The md5 checksum is a unique string identifying (and calculated from) this data. Once the download has finished, you should calculate this string yourself with md5sum.

$ md5sum IRGSPb5.fa.masked.gz
7af391c32450de873f80806bbfaedf05  IRGSPb5.fa.masked.gz

Go to the rice genome download page and compare this string with the MD5 checksum listed there. You can do this manually. Now that you know the concept of checksums, there is an easier way to verify the data using md5sum. Can you find the easier way?
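
One possible answer, as a sketch: md5sum has a check mode (-c) that reads lines of the form "checksum, two spaces, file name" and reports for each file whether it matches. The example below uses a small made-up file (demo.fa) so that it can be run without the genome download.

```shell
# Create a small demo file in a throw-away folder.
work_dir="$(mktemp -d)"
printf '>seq1\nACGT\n' > "$work_dir/demo.fa"

# Store the checksum in a file, in the same format md5sum prints it.
( cd "$work_dir" && md5sum demo.fa > demo.fa.md5 )

# Let md5sum do the comparison; prints "demo.fa: OK" when the file is intact.
( cd "$work_dir" && md5sum -c demo.fa.md5 )

# For the genome, the published checksum can be piped in directly, e.g.:
# echo "<checksum from the download page>  IRGSPb5.fa.masked.gz" | md5sum -c -
```

If the file is damaged, md5sum -c reports FAILED and returns a non-zero exit status, so the check can also be used in scripts.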

Ensuring integrity of downloads

A handy tool is the DownThemAll add-on for Firefox, in which you provide the checksum at the time of download. Once the download has finished, it will automatically check the file against that checksum.

The Sequence Read Archive (SRA, formerly the Short Read Archive), which stores NGS data sets, makes use of Aspera to download data at great speeds while ensuring integrity. To download from SRA using Aspera on Linux, follow this guide from EBI.

Extracting the data

apropos is a command that helps you discover new commands. In case you have a type of file that you don't know about, use apropos to search for corresponding programs.
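
As a sketch of how this could work for a .gz file such as the downloaded genome: search the manual pages for a keyword, then use the tool you find. The keyword gzip and the demo file below are assumptions chosen for illustration, and the output of apropos depends on which manual pages are installed on your system.

```shell
# Search the manual page descriptions for a keyword.
if command -v apropos >/dev/null 2>&1; then
    apropos gzip || echo "no matching manual pages found"
fi

# Demo on a small made-up file in a throw-away folder.
work_dir="$(mktemp -d)"
printf '>seq1\nACGT\n' > "$work_dir/demo.fa"
gzip "$work_dir/demo.fa"          # creates demo.fa.gz and removes demo.fa

# Decompress to a new file and keep the compressed original.
gunzip -c "$work_dir/demo.fa.gz" > "$work_dir/demo_copy.fa"

# Decompress in place: demo.fa.gz is replaced by demo.fa.
gunzip "$work_dir/demo.fa.gz"
```

Writing to a new file with -c is the safer choice when disk space allows, because the compressed original stays available for a later checksum comparison.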