FASTQ files are FASTA-like text files that contain not only sequences but also strings encoding the quality score of each base in those sequences. This makes FASTQ the default format for storing NGS reads.
Each entry in a FASTQ file corresponds to one read and consists of four lines:
- A line starting with @ containing the sequence identifier
- The sequence
- A line starting with + sometimes followed by the same sequence identifier
- A line with quality scores encoded in ASCII format for each base (letter) in the sequence
The 2nd and 4th lines must therefore have the same length:
@HWI-ST999:102:D1N6AACXX:1:1101:1235:1936 1:N:0:
ATGTCTCCTGGACCCCTCTGTGCCCAAGCTCCTCATGCATCCTCCTCAGCAACTTGTCCTGTAGCTGAGGCTCACTGACTACCAGCTGCAG
+
1:DAADDDF<B<AGF=FGIEHCCD9DG=1E9?D>CF@HHG??B<GEBGHCG;;CDB8==C@@>>GII@@5?A?@B>CEDCFCC:;?CCCAC
The base calling software on the sequencer calculates a Phred score for each base it calls, expressing how confident the software is about the call (if it calls an A, how sure is it that there is indeed an A at that position?). In Illumina data these quality scores range from -5 to 41; negative values occur only in the older Solexa scoring scheme, while true Phred scores start at 0. Each score is added to an offset (33 or 64) and the character with the resulting code is taken from the ASCII table. This allows every score to be represented as a single character, so that the quality string aligns with the sequence.
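The encoding described above can be reversed by subtracting the offset from the ASCII code of each quality character. The sketch below assumes an offset of 33 for the example record; the function names are illustrative and not part of any library. It also uses the standard Phred definition, in which a score Q corresponds to an estimated error probability of 10^(-Q/10).

```python
def decode_quality(quality, offset=33):
    """Return the scores encoded in a FASTQ quality string."""
    return [ord(char) - offset for char in quality]


def error_probability(phred):
    """Convert a Phred score Q into the error probability 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# First ten quality characters of the example record above.
scores = decode_quality("1:DAADDDF<", offset=33)
print(scores)  # [16, 25, 35, 32, 32, 35, 35, 35, 37, 27]
print([error_probability(q) for q in scores[:2]])
```

Decoding with the wrong offset silently shifts every score by 31, which is why the offset has to be known before the qualities are interpreted.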
Issues with the FASTQ format
Illumina changed the quality score format several times: both 33 and 64 have been used as offset, which can be very confusing. The characters in the quality strings tell you which offset applies:
- If you find any of the following characters: !"#$%&'()*+,-./0123456789, your offset must be 33
- If you find any of the following characters: KLMNOPQRSTUVWXYZ[\]^_`abcdefgh, your offset must be 64
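The two rules above translate directly into a small check. This is only a heuristic sketch: the function name `guess_offset` is made up for illustration, and it returns `None` when the quality strings contain none of the decisive characters, in which case more reads have to be inspected.

```python
# Characters that, according to the two rules above, settle the offset.
OFFSET_33_ONLY = set("!\"#$%&'()*+,-./0123456789")
OFFSET_64_ONLY = set("KLMNOPQRSTUVWXYZ[\\]^_`abcdefgh")


def guess_offset(quality_strings):
    """Guess the quality offset (33 or 64) from the characters observed.

    Returns None when no decisive character was seen.
    """
    seen = set()
    for quality in quality_strings:
        seen.update(quality)
    if seen & OFFSET_33_ONLY:
        return 33
    if seen & OFFSET_64_ONLY:
        return 64
    return None


print(guess_offset(["1:DAADDDF<B<AGF=FGIEHCCD9DG"]))  # 33
print(guess_offset(["GGGGGGGGGGGGGFGGEGGB"]))         # None
```

Characters between the two ranges (such as `@` to `J`) are valid under both offsets, so a short or uniformly high-quality sample can stay ambiguous.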
[Figure: overview of the different quality score formats that have been used by Illumina]
In paired-end sequencing, the two reads from the same fragment end up in two different FASTQ files. Each round of sequencing uses only one sequencing primer, so the ends at one side of the fragments are sequenced first. A second round is then done using the sequencing primer that targets the reverse strand. You can identify the reads of a pair because they have the same sequence identifier in both FASTQ files, but with a different suffix.
In the first FASTQ file the read of the pair will have a /1 suffix:
@EAS51_0210:3:6:3797:7459/1
GAATCCAACCCTCACAAAGAAGTTTCTCAGAATTCTTCCATCGAATTTTTATGTGATGGTATTTCCTTTTTTACCATAGGCCTCAAAGCGCTCCAAATAT
+
GGGGGGGGGGGGGFGGEGGBGGGGGGFGGGEFGFGGGGFFGGGFDGGGGGGGGEGGFGCEEGFFFFGGGGEGEEBDDEE@GGEGFBEEGEEEEEEB@CDD
In the second FASTQ file the read of the pair will have the same identifier followed by /2:
@EAS51_0210:3:6:3797:7459/2
AACCTTTGTTTGGATGGAGCAGTTTGTAAACAATCCTTTTGTAGAATCTGCAAAGGTATATTTCTGAGCCCATTGAGGCCTATGGTGAAATACGAAATAT
+
GGGGGGGGGGGGGFGGFEGGGGEGGGEGGGFDFBGGEFEFGEEGEGFEGGEGEEED?EEEGEEGBEBDGEEEEED=DCCCEBEEEEEEEAAC@DDB:CCC
Again, different formats have been used for these suffixes (/1 or -1), leading to a lot of confusion.
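Matching the reads of a pair therefore comes down to comparing identifiers with the suffix removed. The sketch below uses two hypothetical helper names, `pair_key` and `same_pair`. It assumes the suffix is one of /1, /2, -1 or -2, and that anything after the first whitespace in the header line is a description rather than part of the identifier.

```python
def pair_key(identifier):
    """Return a read identifier without its leading @ and mate suffix."""
    fields = identifier.split()
    name = fields[0] if fields else ""
    if name.startswith("@"):
        name = name[1:]
    # Drop a trailing /1, /2, -1 or -2 that marks the mate within the pair.
    if len(name) > 2 and name[-2] in "/-" and name[-1] in "12":
        name = name[:-2]
    return name


def same_pair(identifier_1, identifier_2):
    """True when two identifiers refer to the two reads of one fragment."""
    return pair_key(identifier_1) == pair_key(identifier_2)


print(same_pair("@EAS51_0210:3:6:3797:7459/1", "@EAS51_0210:3:6:3797:7459/2"))  # True
```

An identifier that legitimately ends in -1 or -2 would be truncated by this rule, so the suffix convention of a dataset should be checked before relying on it.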
The original FASTQ format also allowed sequences and quality strings to be split over multiple lines. This is now discouraged because it complicates automated processing of the files: the markers "@" and "+" were an unfortunate choice, as both characters can also occur in the quality string.
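When records are guaranteed to occupy exactly four lines, parsing does not need to rely on the "@" and "+" markers to find record boundaries. The reader below is a minimal sketch under that assumption (it does not handle wrapped sequences); `read_fastq` is an illustrative name, not a library function.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# The second record has a quality string that starts with "@",
# which is harmless because records are read by position.
example = io.StringIO("@read1\nACGT\n+\nII@+\n@read2\nGG\n+read2\n@@\n")
for identifier, sequence, quality in read_fastq(example):
    print(identifier, sequence, quality)
```

Reading by position is what makes the four-line layout robust: a quality line that begins with "@" or "+" is never mistaken for the start of a new record.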
Solving issues in FASTQ files
You can convert your datasets to a standard FASTQ format with:
- Galaxy's Groomer tool: see an example of how to use Groomer
- A Biopython script that reads the various FASTQ format variants and standardizes them for use in Biopython
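To illustrate what such a standardization step does, here is a minimal pure-Python sketch that re-encodes a quality string from offset 64 to offset 33. It is not the Groomer or Biopython implementation, and `phred64_to_phred33` is a hypothetical name. It assumes the input holds true Phred scores with offset 64; old Solexa scores, which can be negative, are not handled.

```python
def phred64_to_phred33(quality):
    """Re-encode a Phred+64 quality string as Phred+33."""
    converted = []
    for char in quality:
        score = ord(char) - 64
        if score < 0:
            # Assumption: negative scores mean the input is not Phred+64.
            raise ValueError(f"character {char!r} is not valid with offset 64")
        converted.append(chr(score + 33))
    return "".join(converted)


print(phred64_to_phred33("hhhh@B"))  # IIII!#
```

Only the quality line of each record changes; the identifier, the sequence and the "+" line are written out unchanged.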