Combine the content of several fastq files into one big fastq file


Background

You have received sequencing data! The reads are in .fastq format, and each 'run' of the sequencing machine is delivered as a separate .fastq file.

For the training, the data is available in data_linux_training.tar.gz. You might already have it on your laptop; if not, download it from http://data.bits.vib.be/pub/trainingmaterial/introduction_to_linux_for_bioinformatics/data_linux_training.tar.gz. Unpack the archive and decompress all compressed fastq files you find in it.
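The unpack-and-decompress step could look roughly like the sketch below. It assumes the archive is already in the current directory and that the compressed reads end in .fastq.gz (the actual extension inside the archive may differ). The function name unpack_training_data is made up for this illustration.

```shell
# Hypothetical helper: unpack a .tar.gz archive and decompress the reads in it.
unpack_training_data() {
    archive=$1
    # -x extract, -z the archive is gzip-compressed, -f read from this file
    tar -xzf "$archive"
    # Assumption: compressed reads are named *.fastq.gz.
    # gunzip replaces each file.fastq.gz with file.fastq.
    find . -type f -name '*.fastq.gz' -exec gunzip {} +
}

# Usage, once the archive has been downloaded:
#   unpack_training_data data_linux_training.tar.gz
```

Wrapping the two commands in a function only keeps them together; typing the tar and find lines directly at the prompt works just as well.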

This directory contains several files, which you can list with:

$ ls -alh

In the output (shown below), lines starting with 'd' are directories and lines starting with '-' are regular files. The filenames ending in .fastq are the fastq files: a file with '_1' holds the first reads of the pairs whose second reads are in the matching '_2' file. Files without _1 or _2 contain reads that do not have a pair.

drwxrwxr-x. 13 galaxy data   4.0K May  3 11:45 .
drwxrwxr-x. 12 galaxy data   4.0K Apr 29 11:45 ..
drwxr-xr-x.  2 galaxy data   4.0K Apr 29 11:45 ERR000020
-rw-rw-r--.  1 galaxy galaxy 430M Apr 29 11:54 ERR000020_1.fastq
-rw-rw-r--.  1 galaxy galaxy 430M Apr 29 11:54 ERR000020_2.fastq
-rw-rw-r--.  1 galaxy galaxy 586K Apr 29 11:54 ERR000020.fastq
drwxr-xr-x.  2 galaxy data   4.0K Mar 28 11:56 ERR000309
-rw-rw-r--.  1 galaxy galaxy 1.7G Apr 29 11:50 ERR000309_1.fastq
-rw-rw-r--.  1 galaxy galaxy 2.0G Apr 29 11:50 ERR000309_2.fastq
-rw-rw-r--.  1 galaxy galaxy 6.8M Apr 29 11:50 ERR000309.fastq
drwxr-xr-x.  2 galaxy data   4.0K Mar 28 11:54 ERR000467
...

Concatenate the files of the same sample into one file

We will collect the sequences from all .fastq files in one folder into a single file.

The task is now to put all sequences from all _1.fastq and _2.fastq files into one single file. You can do this easily with Linux commands! Try to solve it yourself first.
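One possible first step is to select only the paired files with a shell wildcard. This is a sketch, not necessarily the intended answer. It runs in a throwaway directory with empty stand-in files named after the listing above, so nothing real is touched.

```shell
# Work in a scratch directory with empty stand-in files (no real reads).
scratch=$(mktemp -d)
cd "$scratch"
touch ERR000020_1.fastq ERR000020_2.fastq ERR000020.fastq

# * matches any run name, [12] matches exactly one character: 1 or 2.
# ERR000020.fastq has no _1 or _2 and is therefore not listed.
ls *_[12].fastq
```

In the real data directory, only the last line is needed.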

Nice, we are able to select all fastq files we need!
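With the files selected, cat can write them one after another into a new file. The sketch below uses three tiny made-up files instead of the training data. The names toy_1.fastq, toy_2.fastq, toy.fastq and all_reads.fastq are chosen for this illustration only.

```shell
# Scratch directory with made-up single-read files.
scratch=$(mktemp -d)
cd "$scratch"
printf '@read1/1\nACGT\n+\nIIII\n' > toy_1.fastq
printf '@read1/2\nTTGA\n+\nIIII\n' > toy_2.fastq
printf '@orphan\nGGCC\n+\nIIII\n'  > toy.fastq    # unpaired, should be left out

# cat prints the selected files in order; > redirects that output to a file.
cat *_[12].fastq > all_reads.fastq

# Count the lines in the combined file.
wc -l all_reads.fastq
```

The output name is picked so that it does not match the wildcard itself. Otherwise a second run would try to read the combined file as one of its own inputs.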

Excellent! Note: always try your commands on a part of the file first, before running them on the complete file.
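One way to get such a part is head, sketched here on a made-up file. It assumes the common fastq layout of four lines per read, so the first 8 lines are the first 2 reads. The file names are hypothetical.

```shell
# Scratch directory with a made-up file of three reads.
scratch=$(mktemp -d)
cd "$scratch"
for i in 1 2 3; do
    printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > toy_1.fastq

# Keep only the first 8 lines (2 reads, assuming 4 lines per read).
head -n 8 toy_1.fastq > subset_1.fastq

# Number of reads in the subset = number of lines divided by 4.
echo $(( $(wc -l < subset_1.fastq) / 4 ))    # prints 2
```

Once a command behaves as expected on the small subset, the same command can be pointed at the full files.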

So let's continue to our last step!

Nice work!