NGS data analysis
This wiki page is dedicated to the series of trainings that will lead you through the various workflows for the analysis of next generation sequencing data.
Have fun solving the exercises!
Because most of you have used or will use the Illumina platform to generate your data, we will use Illumina data sets in all exercises.
Contents
- 1 Training 1: Introduction to the analysis of NGS data
- 2 Training 2: NGS variant analysis
- 3 Training 3: RNA-Seq analysis
- 4 Training 4: ChIP-Seq analysis
- 5 Training 5: metagenomics
Training 1: Introduction to the analysis of NGS data
Periodically repeated Sessions (Janick Mathys)
Slides
- Introduction on Illumina technology
- Principle of Illumina sequencing
- Introduction to NGS data analysis
- Data analysis workflow for the training
- NGS workflows in Galaxy
Exercises
This training gives you the background knowledge you need to follow the more advanced trainings on variant analysis, RNA-Seq and ChIP-Seq.
Download the data sets for this training:
- The data set of the introduction training
- The blue data set of the RNA-Seq training
- ChIP data of the ChIP-Seq training
- control data of the ChIP-Seq training
- Arabidopsis gtf file
- Arabidopsis bed file
Now you can try the exercises.
- Quality control of NGS data
- Mapping of NGS data
- RNASeq in Galaxy
- RNASeq in GenePattern
- DNASeq in Galaxy
- DNASeq in GenePattern
Archive
- Downloading NGS data from the internet
- Improving the quality of NGS data
- Linux command line
- Introduction to RNA-Seq analysis
FAQ
Q&A added during the intro to NGS data analysis
File formats
Training 2: NGS variant analysis
Session of November 2018 (Stéphane Plaisance)
Session of 2018 using GenePattern
Session of 2020 using GenePattern
Training archive
- Introduction to NGS-formats used in classical NGS applications and used today in the hands-on
- Remarks about NGS variant analysis: laptop configuration and files
- Session of 2017 using GenePattern
- Hands-on_introduction_to_NGS_variant_analysis-2016 - pages dedicated to the 2016 1-day session using Linux command line.
- Hands-on_introduction_to_NGS_variant_analysis - pages dedicated to the 2014 2-day session using Linux command line.
- Slides presented during the 2014 session: NGS_DNA-variants_2014-05-23_slides.pdf.
Q&A pages
- Q&A_added_during_the_NGS_variant_analysis_training
- Q&A added during the NGS variant analysis training2 (+ new Q&A's from May 2014)
- NGS Variant Analysis and coverage depth
- Call variants with samtools 1.0
- Create a mappability track
- Create a GC content track
Training 3: RNA-Seq analysis
Bulk RNA-Seq analysis for differential expression
Tools
Install the latest version of R and RStudio. List of R packages used in the training:
- ggplot2
- ggrepel
- gplots
- pheatmap
- plyr
- RColorBrewer
- reshape2
- Bioconductor
- Bioconductor: airway
- Bioconductor: DESeq2
- Bioconductor: GenomicAlignments
- Bioconductor: GenomicFeatures
- Bioconductor: org.Hs.eg.db
- Bioconductor: Rsamtools
- Bioconductor: tximeta
- Only for Mac users: Bioconductor: Rsubread
Slides
Exercises
- Solutions FASTQC analysis of Arabidopsis data
- Solutions FASTQC analysis of human data
- Trimmomatic manual
- STAR manual
- samtools tutorial
- RSeQC manual
- htseq-count tutorial
- Solutions command line workflow
- Bash script for automating command line RNASeq workflow
- R script with solutions counting exercises
- R script counting
- R script with solutions DESeq2 analysis
- R script DESeq2
- DESeq2 tutorial
- R script with solutions EdgeR analysis
Files
- Adapter file
- Arabidopsis annotation in bed format
- metadata
- HTSeq counts
- FeatureCounts / Salmon counts
- list of DE genes
- clean and annotated list of DE genes
- Ensembl IDs of upregulated genes
- Ensembl IDs of DE genes
Extra links
- QuantSeq data analysis
- Slides on how cutadapt works
- Cutadapt manual
- featureCounts or htseq-count?
- R Script for RNASeq variant analysis
- Instruction for hands-on RNASeq variant analysis
- presentation RNASeq variant analysis
Single cell RNA-Seq analysis
Tools
Install the latest version of R and RStudio. List of R packages used in the training:
- dplyr
- gridExtra
- rgl
- Seurat
- stringr
- Bioconductor: scater
Slides
- Slides of the analysis of aggregated brain data sets of 2000 and 1000 cells
Exercises
- simple R script for Seurat analysis of aggregated brain data sets of 2000 and 1000 cells
- R script with extra functions for Seurat analysis of aggregated brain data sets of 2000 and 1000 cells
- full R notebook for Seurat analysis of aggregated brain data sets of 2000 and 1000 cells
- notebook on full Seurat analysis (open in web browser)
Files
- aggregated data: output of CellRanger aggregate to be used as input of the script for Seurat analysis of aggregated brain data sets
Extra links
- Slides introduction to 10xGenomics made by Mike Stubbington from 10xGenomics
- bcl2fastq tutorial (the tool that was used as a basis for cellranger mkfastq)
- Slides on trajectory analysis
- Tutorial on trajectory analysis
- R script for trajectory analysis
- 2000 brain cells mouse data set
- 1000 brain cells mouse data set
Summer school 2018
Scenic
Experimental design
Integration of omics data
What to do after the summer school?
Bulk RNA-Seq - from raw reads to counts:
- We have two GenePattern servers running that contain all the tools discussed in the training. Send an email to bits@vib.be to get an account.
- We can provide a snapshot of the server you worked on during the training. You can then make your own server on Google cloud (it's easy starting from a snapshot). You will have to pay for that.
Bulk RNA-Seq - finding DE genes:
- You can do the R analysis on your own computer: see this section for the list of packages you need to install.
- We can provide a snapshot of the server you worked on during the training. You can then make your own server on Google cloud (it's easy starting from a snapshot). You will have to pay for that.
Single cell RNA-Seq:
- You can do the Seurat analysis on your own computer: see this section for the list of packages you need to install.
- We can provide a snapshot of the server you worked on during the training. You can then make your own server on Google cloud (it's easy starting from a snapshot). You will have to pay for that.
- In the future you can get support from Niels and Liesbet. Contact scRNAseq@irc.vib-ugent.be for more information.
- We will check whether Cell Ranger is installed on the KU Leuven VSC (accessible to people from KU Leuven and UHasselt).
A GitHub page has been started where you can post your issues and share them with us. You can reach it at https://github.com/BITS-VIB/Summer_school_2018
- NGS_data_analysis_tools - a page listing tools encountered during the day that you may want to install on your computer
Archive
Session of March 20th and 23rd, 2015 (Stéphane Plaisance)
repeated September 25, 2015
Hands-on_introduction_to_NGS_RNASeq_DE_analysis - the pages of the actual training, containing a hands-on workflow of RNA-Seq analysis for differential expression using command line tools.
creating ENV variables for the training
Create a new file /etc/profile.d/bits.sh as superuser (for example by opening it with "sudo" and a text editor of your choice) and paste the following content:
# system wide ENV variables to ease path in training exercises
export SUMMER=/usr/summer
export SOFT=$SUMMER/software
export REFS=$SUMMER/refs
export DATA=/mnt/userdata/$(whoami)
Source (= execute) the file by typing ". /etc/profile.d/bits.sh"
You now have shortcuts (env variables) that can be typed to reach the very long exercise locations as follows:
- $SUMMER leads to /usr/summer
- $SOFT leads to $SUMMER/software
- $REFS leads to $SUMMER/refs
- $DATA leads to /mnt/userdata/<your username>
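The shortcuts above can be checked with a short shell sketch. It repeats the definitions from bits.sh so that it runs on its own, then reports whether each target directory exists on the current machine. The "found" and "missing" labels are illustrative only and are not part of the training material.

```shell
# Same definitions as in /etc/profile.d/bits.sh, repeated so the sketch is self-contained.
export SUMMER=/usr/summer
export SOFT=$SUMMER/software
export REFS=$SUMMER/refs
export DATA=/mnt/userdata/$(whoami)

# Report where each shortcut points and whether that directory exists here.
for dir in "$SUMMER" "$SOFT" "$REFS" "$DATA"; do
  if [ -d "$dir" ]; then
    echo "found:   $dir"
  else
    echo "missing: $dir"
  fi
done
```

On a machine that is not the training server, all four lines will most likely read "missing", which only means the folders have not been created there.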
edgeR / DESeq2
Archive
Session of January 20th and 27th, 2014 using Galaxy (Joachim Jacob)
Training 4: ChIP-Seq analysis
Introduction
The aims of this session are to:
- Have an understanding of the nature of ChIP-Seq data
- Perform a complete analysis workflow including QC, read mapping, visualization in a genome browser and peak-calling
- Use the GenePattern platform for each step of the workflow and get a feel for the complexity of the task
- Have an overview of possible downstream analyses
- Perform a motif analysis with online web programs
This training gives an introduction to ChIP-seq data analysis, covering the processing steps starting from the reads to the peaks. Among all possible downstream analyses, the practical aspect will focus on motif analyses. A particular emphasis will be put on deciding which downstream analyses to perform depending on the biological question. This training does not cover all methods available today. It does not aim at bringing users to a professional NGS analyst level, but it provides enough information to allow biologists to understand what DNA sequencing practically is and to communicate with NGS experts for more in-depth needs.
For this training, we will use a dataset produced by Myers et al. [1] on factors involved in the regulation of gene expression under anaerobic conditions in bacteria. We will focus on one factor: FNR. The advantage of this dataset is its small size, which allows all steps of the workflow to be executed in real time.
Suggested reading:
- Bailey et al. Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data. PLoS Comput Biol 9, e1003326 (2013) [2]. PDF
- Thomas-Chollier et al. A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nature Protocols 7, 1551–1568 (2012) [3]. PDF
Raw data:
- all experiments: GEO: GSE41187
- used subset: FNR IP ChIP-seq Anaerobic A (=> ENA/SRA: SRX189773 - SRR576933)
- used control: anaerobic INPUT DNA (=> ENA/SRA: SRX189778 - SRR576938)
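The two run accessions listed above can be turned into download addresses with a small helper. This is a minimal sketch: the function name ena_fastq_url is hypothetical, and the directory layout on the ENA FTP server (first six characters of the accession, then the full accession) is an assumption that should be checked against the ENA record of each run before downloading.

```shell
# Hypothetical helper: build the ENA FTP address of a single-end FASTQ file
# from a 9-character run accession such as SRR576933.
# ASSUMPTION: ENA stores such runs under vol1/fastq/<first 6 characters>/<accession>/.
ena_fastq_url() {
  acc=$1
  prefix=${acc%???}   # drop the last three characters: SRR576933 -> SRR576
  echo "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${prefix}/${acc}/${acc}.fastq.gz"
}

# Print the assumed addresses for the ChIP sample and its control.
for run in SRR576933 SRR576938; do
  ena_fastq_url "$run"
done
```

The printed address can then be passed to a download tool such as wget. Nothing is downloaded by the sketch itself.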
additional files:
- zip file containing the E. coli K12 genome, the .bam and the .bai file for the ChIP sample
- link to the E. coli gene annotations in gff3 format (download this gff3 file and use it in IGV)
- zip file containing the .bam and .bai file for the control sample (download this file and use it in IGV)
- normalized mapping results in BigWig format for the ChIP sample (download this file and use it in IGV)
- normalized mapping results in BigWig format for the control sample (download this file and use it in IGV)
Exercises
- Downloading the data
- GenePattern tutorial
- Quality control of the data
- Mapping the reads with Bowtie
- Peak calling with MACS
- Visualization with deepTools
- Visualizing the peaks in a genome browser
- Motif analysis
Same training in command line instead of GenePattern
Links
- From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis
- GEO database[4]
- EBI ENA[5]
- Bowtie manual[6]
- MACS manual
- RSAT (European mirror)[7]
- HMCan [8] when working with cancer samples or cell lines (by V. Boeva from Institut Curie)
- UCSC microbial genome browser
- UCSC microbial genome tables
Archive
Session of June 1st, 2015 by Morgane Thomas-Chollier
Session of February 24th, 2014 by Morgane Thomas-Chollier
- Clean adaptor containing reads from FastQ data at command line
- Install ChIP-Seq training command line software
Training 5: metagenomics
Slides
Data files
- mice faecal samples data
- Excel file linking samples to barcodes, runs and additional info about the mice
- MiSeq file 1
- MiSeq file 2
- MiSeq file 3
- MiSeq file 4
- MiSeq file 5
- MiSeq file 6
- Excel file linking samples to barcodes and runs
Tools
- Lotus pipeline
- usearch: download usearch version 8 and copy it into the /usr/bin/tools/ folder (you need to be superuser for this)
Make it executable:
sudo chmod +x /usr/bin/tools/usearch8.1.1861_i86linux32
Create a symbolic link in the folder where Lotus will search for it:
sudo ln -s /usr/bin/tools/usearch8.1.1861_i86linux32 /usr/bin/tools/lotus_pipeline/bin/usearch_bin
- You also need R with the vegan package installed
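After the usearch steps above, it can help to confirm that the symbolic link really resolves to an executable file. The following is a minimal sketch: check_executable is a hypothetical helper, not part of Lotus or usearch, and the path is the one used in the link command above.

```shell
# Hypothetical helper (not part of Lotus): report whether a path, possibly a
# symbolic link, resolves to an executable regular file.
check_executable() {
  if [ -x "$1" ] && [ ! -d "$1" ]; then
    echo "ok: $1"
  else
    echo "missing or not executable: $1"
    return 1
  fi
}

# Path taken from the symbolic link step of the usearch installation.
check_executable /usr/bin/tools/lotus_pipeline/bin/usearch_bin || true
```

If the message reads "missing or not executable", the chmod or ln step probably needs to be repeated, or the file name of the downloaded usearch binary differs from the one used in the commands.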
Exercises
References:
- [1] Kevin S Myers, Huihuang Yan, Irene M Ong, Dongjun Chung, Kun Liang, Frances Tran, Sündüz Keleş, Robert Landick, Patricia J Kiley. Genome-scale analysis of Escherichia coli FNR reveals complex features of transcription factor binding. PLoS Genet: 2013, 9(6);e1003565 [PubMed:23818864]
- [2] Timothy Bailey, Pawel Krajewski, Istvan Ladunga, Celine Lefebvre, Qunhua Li, Tao Liu, Pedro Madrigal, Cenny Taslim, Jie Zhang. Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Comput Biol: 2013, 9(11);e1003326 [PubMed:24244136]
- [3] Morgane Thomas-Chollier, Elodie Darbo, Carl Herrmann, Matthieu Defrance, Denis Thieffry, Jacques van Helden. A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nat Protoc: 2012, 7(8);1551-68 [PubMed:22836136]
- [4] http://www.ncbi.nlm.nih.gov/geo/
- [5] http://www.ebi.ac.uk/ena/
- [6] http://bowtie-bio.sourceforge.net/
- [7] http://rsat.eu
- [8] http://www.cbrc.kaust.edu.sa/hmcan/