Clean adaptor containing reads from FastQ data at command line

From BITS wiki
Jump to: navigation, search


[ Main_Page ]


This workflow is intimately linked to a greater scope BITS training session dedicated to ChIP-Seq data analysis (Feb 24, 2014).

Aim

This short tutorial aims at demonstrating the effect of adaptor contamination, we present there a simple workflow including:

  • download a fastQ file related to a ChIP-Seq experiment used for training.
  • perform QC on the fastQ file with fastQC
  • remove reads with adaptor contamination using cutadapt
  • perform a second fastQC run to control for improvement

Analysis Workflow

The workflow presented here is by no mean complete but only aimed at identifying adaptor contamination in a real dataset and showing how removing this contamination will affect the overall quality of the read collection.

download SRR576933 from the European Nucleotide Archive (ENA|EBI)

ena_header.png

We choose the European member of the consortium for obvious bandwidth reasons. The file is located by searching for its ID and the ftp link to the reads copied from the result page (http://www.ebi.ac.uk/ena/data/view/SRX189778)

reads=fastq_reads
 
# get the 126Mb compressed fastq-file
wget -P ${reads} --no-check-certificate ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR576/SRR576933/SRR576933.fastq.gz
 
# preview the data for the first 3 reads (always safer)
zcat SRR576933.fastq.gz | head -12
@SRR576933.1 HWUSI-EAS1789_0000:2:20:1269:14140/1
AAGCATGGAATAACCGCCTGGTGAATGCTCGCCATA
+
dcd`\dddddaeacecdac`c\cca`bTbbdddYd_
@SRR576933.2 HWUSI-EAS1789_0000:2:20:1270:19579/1
TGGAGGCTGACCACGATAAGCTGCCGCTGGTGGTGC
+
dceYc^\cddd^dddTccc`daYdbdaad`]``XTU
@SRR576933.3 HWUSI-EAS1789_0000:2:20:1270:17351/1
AGTGCGATGCCGTTCACCCGGTTTTCTTTATCATTA
+
dddddc\cc^`c\ccddadcdaadbbc]]]aa^ddT

run fastQC test on the data

Provided fastQC is installed on your machine, you can proceed copy-padsting the code below. If not, please install the fastQC command line version from ()

reads=fastq_reads
results=fastqc_results
mkdir -p ${results}
 
fastqc -o fastqc_results --noextract -f ${reads}/SRR576933.fastq.gz

Inspect the results and identify the issue with adaptor contamination. Many issues are present in the report and could be linked to adaptor contamination as well (we will check this at the end of this tutorial).

fastQC details in presence of adaptors

  1. per_base_quality.png
  2. per_sequence_quality.png
  3. per_base_sequence_content.png
  4. per_base_gc_content.png
  5. per_sequence_gc_content.png
  6. per_base_n_content.png
  7. sequence_length_distribution.png
  8. duplication_levels.png
  9. kmer_profiles.png
FastQC figures (full results)
per_base_quality.png
per_sequence_quality.png
per_base_sequence_content.png
per_base_gc_content.png
per_sequence_gc_content.png
per_base_n_content.png
sequence_length_distribution.png
duplication_levels.png
kmer_profiles.png

Overrepresented_sequences.png

As seen here, one sequence is present in more than 29% of the reads. Such abundance cannot come from a true bacterial sequence and has to be a primer contamination, left over from the library construction process or from a PCR amplification gone wild. In the current case, the primer is identified as "TruSeq Adapter, Index 5" and reveals a library cleanup problem.

We now use the cutadapt utility written by Marcel Martin to remove reads containing the TruSeq Adapter, Index 5 identified by its sequence "GATCGGAAGAGCACACGTCTGAACTCCAGTCACACA".

Handicon.png cutadapt([1]) will clip or simply filter-out reads that contain a provided linker sequence. It can be tuned to be fault-tolerant and can also be used in reverse-mode to keep only linker-containing reads if this makes sense in your workflow.

# process from the gzipped file and save into a new gzip file
cutadapt -e 0.1 \
    -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCC \
    --discard ${reads}/SRR576933.fastq.gz | \
    bgzip -c > ${reads}/SRR576933-cutadapt.fastq.gz

detailed terminal cutadapt output

cutadapt -e 0.1 -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCC --discard SRR576933.fastq.gz > SRR576933-cutadapt.fastq
 
cutadapt version 1.3
Command line parameters: -e 0.1 -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCC --discard SRR576933.fastq.gz
Maximum error rate: 10.00%
   No. of adapters: 1
   Processed reads:      3603544
   Processed bases:    129727584 bp (129.7 Mbp)
     Trimmed reads:      1197923 (33.2%)
     Trimmed bases:     41076056 bp (41.1 Mbp) (31.66% of total)
   Too short reads:            0 (0.0% of processed reads)
    Too long reads:            0 (0.0% of processed reads)
        Total time:    169.78 s
     Time per read:      0.047 ms
 
=== Adapter 1 ===
 
Adapter 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCC', length 36, was trimmed 1197923 times.
 
No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-36 bp: 3
 
Overview of removed sequences
length	count	expect	max.err	error counts
3	45620	56305.4	0	45620
4	10550	14076.3	0	10550
5	2717	3519.1	0	2717
6	655	879.8	0	655
7	146	219.9	0	146
8	39	55.0	0	39
9	100	13.7	0	54 46
10	166	3.4	1	102 64
11	37	0.9	1	14 23
12	31	0.2	1	21 10
13	12	0.1	1	12
14	2	0.0	1	0 2
16	1233	0.0	1	1163 70
17	862	0.0	1	29 833
18	206	0.0	1	197 8 1
19	27	0.0	1	27
20	3	0.0	2	1 2
21	1244	0.0	2	1191 48 5
22	4	0.0	2	4
23	1642	0.0	2	1568 69 5
24	131	0.0	2	124 7
25	27	0.0	2	11 15 1
26	60	0.0	2	54 6
28	6	0.0	2	2 3 1
35	72	0.0	3	0 63 1 8
36	1132331	0.0	3	1470 6868 1068935 55058

run fastQC test on the cleaned data

reads=fastq_reads
results=fastqc_results
mkdir -p fastqc_results
fastqc -o fastqc_results --noextract -f fastq ${reads}/SRR576933-cutadapt.fastq.gz

Technical.png Inspect the results and look at the different sections of the report. Many problems seem to have vanished after removing the adaptor-containig reads, only the low quality issue with the last (3') base and some kmer noise remain.

fastQC details after adaptor removal

  1. per_base_quality.png
  2. per_sequence_quality.png
  3. per_base_sequence_content.png
  4. per_base_gc_content.png
  5. per_sequence_gc_content.png
  6. per_base_n_content.png
  7. sequence_length_distribution.png
  8. duplication_levels.png
  9. kmer_profiles.png
FastQC figures (full results)
per_base_quality.png
per_sequence_quality.png
per_base_sequence_content.png
per_base_gc_content.png
per_sequence_gc_content.png
per_base_n_content.png
sequence_length_distribution.png
duplication_levels.png
kmer_profiles.png

Conclusion

A simple removal of contaminating adaptors can make a huge difference in the overall quality control of fastq reads. Most mappers should not be affected by contaminating adaptors that should not map to the reference genome and are discarded during that step. However, it can be a good practice to remove such reads prior to mapping if the user want to prevent any unexpected behavior during NGS analysis.


Download Howto files here

Use the right link to access data generated during this tutorial Howto-files

References:
  1. https://code.google.com/p/cutadapt/wiki/documentation

[ Main_Page ]