Combine the content of several fastq files into one big fastq file
Background
You have sequencing data! The reads are in .fastq format, and each run of the sequencing machine is delivered as a separate .fastq file.
For the training, the data is available in data_linux_training.tar.gz. If you do not have it on your laptop yet, download it from http://data.bits.vib.be/pub/trainingmaterial/introduction_to_linux_for_bioinformatics/data_linux_training.tar.gz. Unpack the archive and decompress all compressed fastq files you find in it.
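The unpack-and-decompress step could look roughly like the sketch below. It assumes the archive has already been downloaded and that the compressed reads are gzip files ending in `.fastq.gz`; the compression format is not stated here, so check with `ls` first. The function name `unpack_training_data` is made up for this illustration.

```shell
# Hypothetical helper: unpack a .tar.gz archive into a target directory and
# decompress every gzip-compressed fastq file found inside it.
# Assumption: compressed reads end in .fastq.gz (not confirmed for the course data).
unpack_training_data() {
    archive=$1
    target=${2:-.}
    mkdir -p "$target"
    # -x extract, -z decompress with gzip, -f archive file, -C destination directory
    tar -xzf "$archive" -C "$target"
    # find walks all subdirectories; gunzip replaces each x.fastq.gz by x.fastq
    find "$target" -type f -name '*.fastq.gz' -exec gunzip {} +
}

# Example call, once the archive is in the current directory:
# unpack_training_data data_linux_training.tar.gz .
```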
The directory contains several files, which you can list with:
$ ls -alh
In the output (shown below), lines starting with 'd' are directories and lines starting with '-' are regular files. The filenames ending in .fastq are the fastq files: a file ending in '_1' holds the first read of each pair, and the file with the same name ending in '_2' holds the second read. The files without _1 or _2 contain reads that do not have a pair.
drwxrwxr-x. 13 galaxy data 4.0K May 3 11:45 .
drwxrwxr-x. 12 galaxy data 4.0K Apr 29 11:45 ..
drwxr-xr-x. 2 galaxy data 4.0K Apr 29 11:45 ERR000020
-rw-rw-r--. 1 galaxy galaxy 430M Apr 29 11:54 ERR000020_1.fastq
-rw-rw-r--. 1 galaxy galaxy 430M Apr 29 11:54 ERR000020_2.fastq
-rw-rw-r--. 1 galaxy galaxy 586K Apr 29 11:54 ERR000020.fastq
drwxr-xr-x. 2 galaxy data 4.0K Mar 28 11:56 ERR000309
-rw-rw-r--. 1 galaxy galaxy 1.7G Apr 29 11:50 ERR000309_1.fastq
-rw-rw-r--. 1 galaxy galaxy 2.0G Apr 29 11:50 ERR000309_2.fastq
-rw-rw-r--. 1 galaxy galaxy 6.8M Apr 29 11:50 ERR000309.fastq
drwxr-xr-x. 2 galaxy data 4.0K Mar 28 11:54 ERR000467
...
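Given this naming scheme, a quick way to confirm that every '_1' file has its '_2' mate is to derive the mate's name from the '_1' name and test whether it exists. The following is only a sketch; `check_pairs` is a hypothetical helper name, not part of the training material.

```shell
# Hypothetical helper: for every *_1.fastq in a directory, report whether
# the matching *_2.fastq file exists.
check_pairs() {
    dir=${1:-.}
    for fwd in "$dir"/*_1.fastq; do
        [ -e "$fwd" ] || continue            # the glob matched nothing
        rev="${fwd%_1.fastq}_2.fastq"        # strip the _1.fastq suffix, append _2.fastq
        if [ -f "$rev" ]; then
            echo "paired: $(basename "$fwd") $(basename "$rev")"
        else
            echo "missing mate: $(basename "$fwd")"
        fi
    done
}

# Example call in the data directory:
# check_pairs .
```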
Concatenate the files of the same sample in one file
The task is to put all sequences from all _1.fastq and _2.fastq files in this folder into one single file. You can do this easily with Linux commands! Try to solve each step yourself before reading the answer given below it.
First, create the wildcard filter we need to select filenames ending with fastq and containing _1 or _2.
[joachim@bits-vmhost2 ~]$ ls *_?.fastq
ERR000020_1.fastq ERR000309_2.fastq ERR000468_1.fastq ERR000469_2.fastq ERR000471_1.fastq ERR000472_2.fastq ERR000479_1.fastq ERR000480_2.fastq
ERR000020_2.fastq ERR000467_1.fastq ERR000468_2.fastq ERR000470_1.fastq ERR000471_2.fastq ERR000473_1.fastq ERR000479_2.fastq
ERR000309_1.fastq ERR000467_2.fastq ERR000469_1.fastq ERR000470_2.fastq ERR000472_1.fastq ERR000473_2.fastq ERR000480_1.fastq
Nice, we are able to select all fastq files we need!
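In this pattern, `*` matches any run of characters and `?` matches exactly one arbitrary character, so `*_?.fastq` would also pick up a name such as `S1_X.fastq`. A slightly stricter variant uses the bracket expression `[12]` to accept only `_1` or `_2`. The sketch below compares the two on empty files with made-up names in a throwaway directory.

```shell
# Toy files with made-up names in a throwaway directory.
demo=$(mktemp -d)
touch "$demo"/S1_1.fastq "$demo"/S1_2.fastq "$demo"/S1.fastq "$demo"/S1_X.fastq

# Each command substitution runs in a subshell, so the cd does not
# change the directory of the calling shell.
loose=$(cd "$demo" && echo *_?.fastq)       # ? = any single character
strict=$(cd "$demo" && echo *_[12].fastq)   # [12] = exactly the character 1 or 2

echo "loose:  $loose"
echo "strict: $strict"
```

On the training data both patterns select the same files; the stricter one only matters when other names with an underscore are present.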
Next, we will loop through the fastq files and print their contents to a new file. To do this, have a look at the for syntax in bash. After reading that, construct a command to display the first lines of each file.
[joachim@bits-vmhost2 ~]$ for i in *_?.fastq; do head "$i"; done
@ERR000020.1 BGI-FC206YCAAXX_3_1_16:462 length=36
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+ERR000020.1 BGI-FC206YCAAXX_3_1_16:462 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@ERR000020.2 BGI-FC206YCAAXX_3_1_14:720 length=36
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+ERR000020.2 BGI-FC206YCAAXX_3_1_14:720 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@ERR000020.3 BGI-FC206YCAAXX_3_1_10:319 length=36
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
.....
..... (etc.)
Excellent! Note: always try your commands on a part of the file first, before running them on the complete file.
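As the output above shows, each read in these files occupies four lines (header, sequence, '+' line, qualities), so asking `head` for a multiple of four lines returns whole reads. A possible dry run, sketched here on made-up toy files rather than the training data, takes only the first read of each file:

```shell
# Made-up two-file example in a throwaway directory.
demo=$(mktemp -d)
printf '@a1\nACGT\n+\nIIII\n@a2\nTTTT\n+\nIIII\n' > "$demo/A_1.fastq"
printf '@b1\nGGGG\n+\nIIII\n@b2\nCCCC\n+\nIIII\n' > "$demo/A_2.fastq"

# Dry run: take only the first read (4 lines) of each file.
# The output name has no underscore, so the glob does not pick it up.
for i in "$demo"/*_?.fastq; do
    head -n 4 "$i"
done > "$demo/preview.fastq"

cat "$demo/preview.fastq"
```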
So let's continue to our last step!
Using the for loop, display the complete contents of each file and redirect it to a new file with the name 'linuxexperiment.fastq'.
[joachim@bits-vmhost2 ~]$ for i in *_?.fastq; do cat "$i" >> linuxexperiment.fastq; echo "Completed"; done
Completed
Completed
Completed
Completed
[joachim@bits-vmhost2 ~]$ ll -alh linuxexperiment.fastq
-rw-rw-r--. 1 joachim joachim 82MB May 3 12:21 linuxexperiment.fastq
Nice work!
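One way to check the result is to compare line counts: the combined file should have exactly as many lines as all input files together. The sketch below does this on made-up toy files. It redirects the loop as a whole with `>` rather than using `>>` inside it, so that re-running the command overwrites the output instead of appending a second copy; that is a design choice of this sketch, not the command used above.

```shell
# Made-up toy files in a throwaway directory.
demo=$(mktemp -d)
printf '@a1\nACGT\n+\nIIII\n' > "$demo/A_1.fastq"
printf '@a1\nTGCA\n+\nIIII\n@a2\nCCCC\n+\nIIII\n' > "$demo/A_2.fastq"

out="$demo/linuxexperiment.fastq"

# Redirect the whole loop: the output file is created once and
# overwritten (not extended) when the command is run again.
for i in "$demo"/*_?.fastq; do
    cat "$i"
done > "$out"

# Lines in all inputs together versus lines in the combined file.
# $(( ... )) strips the padding some wc implementations add.
expected=$(( $(cat "$demo"/*_?.fastq | wc -l) ))
actual=$(( $(wc -l < "$out") ))

if [ "$expected" -eq "$actual" ]; then
    echo "OK: $actual lines"
else
    echo "Mismatch: expected $expected, got $actual"
fi
```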