Perform basic read QC at command line prior to mapping

From BITS wiki


[ Main_Page ]


Reads can be biased in many different ways. Checking for read bias before attempting any sophisticated analysis is mandatory. FastQC [1] runs both at the command line and through a handy Java graphical interface.

Apply FastQC on each read file

The command-line version is used here in batch with a simple bash command.

Assuming you can use 8 CPU threads ('-t 8') and do not want verbose output ('-q'), the following command saves the results in a freshly created 'FastQC.results' folder.
The '--noextract' parameter ensures that the resulting zip archive is left compressed. All these options can be modified.

Technical note: distributing reads to parallel jobs requires sufficient bandwidth; it is not advised when your data is stored on a USB disk and/or your I/O bandwidth is limited.

mkdir -p FastQC.results
 
for f in *.fastq.gz; do
  # FastQC guesses the quality encoding (e.g. phred+33) from the reads themselves
  # quoting "$f" keeps file names containing spaces intact
  fastqc --noextract -q -t 8 -o FastQC.results "$f"
done
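
In FastQC, '-t' sets how many files are processed at the same time, so a loop that hands over one file per call cannot benefit from the extra threads. A minimal sketch of an alternative, assuming all read files match '*.fastq.gz' in the current folder, passes them in a single call instead. The function name 'run_fastqc_batch' is hypothetical and only serves this illustration.

```shell
# Hypothetical helper: run FastQC once over all given read files so that
# the -t option can process several files at the same time.
run_fastqc_batch() {
  outdir=$1    # folder receiving the FastQC result archives
  threads=$2   # number of files FastQC may process simultaneously
  shift 2
  # Stop early when the glob matched nothing (the pattern is then passed literally).
  if [ "$#" -eq 0 ] || [ ! -e "$1" ]; then
    echo "no read files found" >&2
    return 1
  fi
  mkdir -p "$outdir"
  fastqc --noextract -q -t "$threads" -o "$outdir" "$@"
}
```

It could be called as: run_fastqc_batch FastQC.results 8 *.fastq.gz. The bandwidth caveat above applies even more here, since several files are then read concurrently.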

Each archive can be decompressed and its report (fastqc_report.html) viewed in your favourite web browser; the png pictures it contains can also be included in your own report.
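
Each archive unpacks to a '<sample>_fastqc' folder that holds fastqc_report.html, an 'Images' subfolder with the png plots, and a tab-separated summary.txt (status, module name, read file name). The rough sketch below, assuming 'unzip' and 'awk' are installed, unpacks all archives and lists the modules that did not pass. Both function names are hypothetical.

```shell
# Hypothetical helper: unpack every FastQC archive found in a results folder.
extract_fastqc_archives() {
  for z in "$1"/*_fastqc.zip; do
    [ -e "$z" ] || continue        # skip when no archive matched
    unzip -o -q "$z" -d "$1"       # creates <sample>_fastqc/ next to the zip
  done
}

# Hypothetical helper: list every module that did not PASS, per read file.
# summary.txt columns (tab-separated): status, module name, read file name.
list_fastqc_flags() {
  for s in "$1"/*_fastqc/summary.txt; do
    [ -e "$s" ] || continue        # skip when nothing has been unpacked yet
    awk -F '\t' '$1 != "PASS" { print $3 "\t" $1 "\t" $2 }' "$s"
  done
}
```

For example, running extract_fastqc_archives FastQC.results followed by list_fastqc_flags FastQC.results prints one line per WARN or FAIL module, which helps decide which reports deserve a closer look in the browser.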

[Figure: FastQC-example.png]

details

  1. per_base_quality.png
  2. per_sequence_quality.png
  3. per_base_sequence_content.png
  4. per_base_gc_content.png
  5. per_sequence_gc_content.png
  6. per_base_n_content.png
  7. sequence_length_distribution.png
  8. duplication_levels.png
  9. kmer_profiles.png

References:
  1. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
