FASTQ

From BITS wiki

The FASTQ format[1] extends the FASTA format[2]: in addition to the sequences, FASTQ files contain strings that encode the quality score of each base in the sequences. As such, FASTQ is the default format to store NGS reads.

FASTQ format

Each entry in a FASTQ file corresponds to one read and consists of four lines:

  1. A line starting with @ containing the sequence identifier
  2. The sequence
  3. A line starting with +, optionally followed by the same sequence identifier
  4. A line with quality scores, encoded as one ASCII character for each base (letter) in the sequence

As a result, the 2nd and 4th lines must have the same length:

@HWI-ST999:102:D1N6AACXX:1:1101:1235:1936 1:N:0:
ATGTCTCCTGGACCCCTCTGTGCCCAAGCTCCTCATGCATCCTCCTCAGCAACTTGTCCTGTAGCTGAGGCTCACTGACTACCAGCTGCAG
+
1:DAADDDF<B<AGF=FGIEHCCD9DG=1E9?D>CF@HHG??B<GEBGHCG;;CDB8==C@@>>GII@@5?A?@B>CEDCFCC:;?CCCAC
The base calling software on the sequencer calculates a Phred score for each base it calls, expressing how confident the software is about the call (if it calls an A, how sure is it that there is indeed an A at that position?). Across the formats that have been in use, these quality scores range from -5 to 41 (negative values only occur in the oldest Solexa-scaled format). Each score is added to an offset (33 or 64) and the character with the resulting code is taken from the ASCII table. This allows every score to be represented as a single character, so that the quality string aligns nicely with the sequence.
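
The four-line layout and the offset encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a real FASTQ parser: the function names and the example record are made up for this page, an offset of 33 is assumed by default, and the error probability uses the standard Phred definition Q = -10 * log10(P), which does not apply to Solexa-scaled scores.

```python
def parse_fastq(text):
    """Parse strict four-line FASTQ text into (identifier, sequence, quality) tuples."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@"):
            raise ValueError(f"line 1 of a record must start with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"line 3 of a record must start with '+': {plus!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence and quality differ in length for {header!r}")
        records.append((header[1:], seq, qual))
    return records


def phred_scores(quality, offset=33):
    """Decode a quality string: ASCII code of each character minus the offset."""
    return [ord(ch) - offset for ch in quality]


def error_probability(score):
    """Probability that a base call is wrong, from the Phred definition."""
    return 10 ** (-score / 10)


# Hypothetical record, made up for this example.
example = "@read1\nACGT\n+\nI5!+\n"
for identifier, sequence, quality in parse_fastq(example):
    print(identifier, sequence, phred_scores(quality))  # read1 ACGT [40, 20, 0, 10]
```

A score of 20 corresponds to an error probability of 0.01, a score of 40 to 0.0001.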

Issues with the FASTQ format

Illumina changed the quality score format several times: both 33 and 64 have been used as offset, which can be very confusing. Some characters only occur with one of the two offsets:

  • Any of the following characters means that the offset must be 33: !"#$%&'()*+,-./0123456789
  • Any of the following characters indicates an offset of 64: KLMNOPQRSTUVWXYZ[\]^_`abcdefgh
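
The rule of thumb above can be turned into a small check. The sketch below uses exactly the two character sets listed above; the function name is made up for this example, and a result of None means that none of the telltale characters were seen, so the offset cannot be decided from these quality strings alone.

```python
# Characters '!' (code 33) to '9' (code 57): only seen with offset 33.
ONLY_OFFSET_33 = {chr(code) for code in range(33, 58)}
# Characters 'K' (code 75) to 'h' (code 104): only seen with offset 64.
ONLY_OFFSET_64 = {chr(code) for code in range(75, 105)}


def guess_offset(quality_strings):
    """Return 33 or 64 if the quality strings give it away, else None."""
    seen = set()
    for quality in quality_strings:
        seen.update(quality)
    has_33 = bool(seen & ONLY_OFFSET_33)
    has_64 = bool(seen & ONLY_OFFSET_64)
    if has_33 and has_64:
        raise ValueError("characters from both offsets found; mixed encodings?")
    if has_33:
        return 33
    if has_64:
        return 64
    return None  # ambiguous: look at more reads


print(guess_offset(["1:DAADDDF"]))  # 33, because of the '1'
print(guess_offset(["GGGGFGGE"]))   # None, these characters fit both offsets
```

In practice it is safer to look at many reads before deciding, since a handful of high-quality reads may not contain any of the telltale characters.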

Overview of the different quality score formats that have been used by Illumina:

Figure: fastq_phread-base.png

In paired-end sequencing, the two reads from the same fragment end up in two different FASTQ files. Each sequencing round uses only one sequencing primer, so one end of every fragment is sequenced first. A second sequencing round then uses the sequencing primer that targets the reverse strand. You can identify the reads of a pair because they have the same sequence identifier in both FASTQ files, but a different suffix.

In the first FASTQ file the read of the pair will have a /1 suffix:

@EAS51_0210:3:6:3797:7459/1
GAATCCAACCCTCACAAAGAAGTTTCTCAGAATTCTTCCATCGAATTTTTATGTGATGGTATTTCCTTTTTTACCATAGGCCTCAAAGCGCTCCAAATAT
+
GGGGGGGGGGGGGFGGEGGBGGGGGGFGGGEFGFGGGGFFGGGFDGGGGGGGGEGGFGCEEGFFFFGGGGEGEEBDDEE@GGEGFBEEGEEEEEEB@CDD

In the second FASTQ file the read of the pair will have the same identifier followed by /2:

@EAS51_0210:3:6:3797:7459/2
AACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATAT
+
GGGGGGGGGGGGGFGGFEGGGGEGGGEGGGFDFBGGEFEFGEEGEGFEGGEGEEED?EEEGEEGBEBDGEEEEED=DCCCEBEEEEEEEAAC@DDB:CCC
Again, different formats have been used for these suffixes, such as /1 or -1, which leads to a lot of confusion.
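
To match the reads of a pair regardless of the suffix style, the suffix can be stripped before comparing identifiers. The following is a minimal sketch that only handles the /1, /2, -1 and -2 suffixes mentioned above; the function names are made up for this example, and an identifier that legitimately ends in one of these patterns would be shortened by mistake.

```python
import re

# Matches a trailing /1, /2, -1 or -2.
PAIR_SUFFIX = re.compile(r"[/-][12]$")


def pair_key(identifier):
    """Return the identifier without the leading '@' and without the pair suffix."""
    name = identifier[1:] if identifier.startswith("@") else identifier
    return PAIR_SUFFIX.sub("", name)


def same_fragment(first_identifier, second_identifier):
    """True if both identifiers refer to the same fragment."""
    return pair_key(first_identifier) == pair_key(second_identifier)


print(pair_key("@EAS51_0210:3:6:3797:7459/1"))  # EAS51_0210:3:6:3797:7459
print(same_fragment("@EAS51_0210:3:6:3797:7459/1",
                    "@EAS51_0210:3:6:3797:7459/2"))  # True
```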

The original FASTQ format also allowed sequences and quality strings to be split over multiple lines. This is nowadays discouraged, because the unfortunate choice of "@" and "+" as markers (both characters can also occur in the quality string) makes automated processing of such files complicated.
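
The difficulty becomes clear when writing a parser for such files: a quality line may itself start with "@" or "+", so the markers alone cannot tell where a record ends. One common way around this, sketched below with a made-up function name and made-up records, is to read quality lines until their combined length equals the length of the sequence.

```python
def parse_multiline_fastq(text):
    """Parse FASTQ text whose sequences and qualities may span several lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records = []
    i = 0
    while i < len(lines):
        if not lines[i].startswith("@"):
            raise ValueError(f"expected an '@' header, got {lines[i]!r}")
        identifier = lines[i][1:]
        i += 1
        # Sequence lines run until the '+' separator (bases never start with '+').
        seq_parts = []
        while i < len(lines) and not lines[i].startswith("+"):
            seq_parts.append(lines[i])
            i += 1
        if i == len(lines):
            raise ValueError(f"no '+' separator found for {identifier!r}")
        i += 1  # skip the '+' line
        sequence = "".join(seq_parts)
        # Quality lines run until as many characters as bases have been read,
        # whatever character those lines start with.
        qual_parts = []
        qual_length = 0
        while i < len(lines) and qual_length < len(sequence):
            qual_parts.append(lines[i])
            qual_length += len(lines[i])
            i += 1
        if qual_length != len(sequence):
            raise ValueError(f"quality length does not match sequence for {identifier!r}")
        records.append((identifier, sequence, "".join(qual_parts)))
    return records


# Hypothetical record whose quality lines start with '@' and '+'.
tricky = "@r1\nACGT\nACGT\n+\n@III\n+III\n"
print(parse_multiline_fastq(tricky))  # [('r1', 'ACGTACGT', '@III+III')]
```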

Solving issues in FASTQ files

You can convert your datasets to a standard FASTQ format with:
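
As a minimal sketch of what such a conversion involves (an illustration under stated assumptions, not necessarily the tool this page has in mind), the quality line of every record can be shifted from offset 64 to offset 33. The function names are made up for this example. The code assumes strict four-line records and Phred scores of 0 or higher; the negative Solexa-scaled scores would need a proper rescaling and are rejected here.

```python
def shift_quality_64_to_33(quality):
    """Re-encode a quality string from offset 64 to offset 33."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64
        if score < 0:
            # Assumption: negative (Solexa-scaled) scores are not handled here.
            raise ValueError(f"character {ch!r} is not a Phred score with offset 64")
        converted.append(chr(score + 33))
    return "".join(converted)


def convert_fastq_lines(lines):
    """Convert the 4th line of every four-line record, leave other lines as they are."""
    result = []
    for number, line in enumerate(lines):
        line = line.rstrip("\n")
        if number % 4 == 3:
            line = shift_quality_64_to_33(line)
        result.append(line)
    return result


# Hypothetical record with offset 64: 'h' is score 40, '@' is score 0.
print(convert_fastq_lines(["@read1", "ACGT", "+", "hhh@"]))
# ['@read1', 'ACGT', '+', 'III!']
```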


References:
  1. http://en.wikipedia.org/wiki/FASTQ_format
  2. http://en.wikipedia.org/wiki/FASTA_format
  3. https://github.com/biopython/biopython/blob/master/Bio/SeqIO/QualityIO.py