Downloading NGS data from the internet


Introduction

NGS data repositories

First of all, you need data to analyze. You can generate your own data, but a lot of NGS data is also available on the internet.

The main repositories for NGS data:

                                         | NCBI - US                  | EBI - Europe
                                         |                            | Close by, so faster downloads
Gene expression database                 | GEO                        | ArrayExpress
Contain processed NGS data, no raw data  | ID starts with G           | ID starts with E-
NGS sequence database                    | SRA                        | ENA
Contain raw NGS data                     | ID starts with SR          | ID starts with ER
                                         | ENA IDs also used by SRA   | SRA IDs also used by ENA
                                         | stores reads in sra format | stores reads in fastq format

Both GEO and SRA use multiple types of IDs, ordered according to a certain hierarchy:

GEO ID             | points to  | definition
ID starts with GSE | experiment | Data of a full NGS experiment consisting of multiple samples. The samples belong to different groups that are to be compared, e.g. treated and control samples.
ID starts with GSM | sample     | Data of one single sample.

SRA ID             | points to  | definition
ID starts with SRP | study      | Studies have an overall goal and may comprise several experiments.
ID starts with SRX | experiment | An experiment describes what was sequenced and the method used. Info on the source of the DNA, samples, sequencing platform and the processing of the data.
ID starts with SRR | run        | Data of a particular sequencing experiment. Experiments may contain many runs depending on the number of instrument runs that were needed.
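The ID prefixes listed above can be turned into a small lookup, which is handy when an article mentions an ID and you want to know which database and which level of the hierarchy it refers to. The following is a minimal sketch in Python; the function name and the mapping are illustrative assumptions built only from the prefixes on this page, and the level is left empty for ArrayExpress and ENA because the page does not break those down further.

```python
# Prefixes copied from the overview on this page.
# The lookup itself is an illustrative helper, not an official NCBI or EBI tool.
ACCESSION_PREFIXES = [
    ("GSE", ("GEO", "experiment")),
    ("GSM", ("GEO", "sample")),
    ("SRP", ("SRA", "study")),
    ("SRX", ("SRA", "experiment")),
    ("SRR", ("SRA", "run")),
    ("E-", ("ArrayExpress", None)),  # level not specified on this page
    ("ER", ("ENA", None)),           # level not specified on this page
]


def classify_accession(accession):
    """Return (database, level) for an ID, or None if the prefix is unknown."""
    acc = accession.strip().upper()
    for prefix, info in ACCESSION_PREFIXES:
        if acc.startswith(prefix):
            return info
    return None
```

For example, `classify_accession("GSE41195")` returns `("GEO", "experiment")`, and an unrecognised ID returns `None`.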

There are two other resources of NGS data: InSilico DB [1] and the Illumina data library [2].

If you have an article describing an NGS dataset that is of interest to you, you should search in the article for a sentence mentioning the ID of the data in one of these databases.


Metadata of NGS data sets

You need not only the data but also extra information to do the analysis. For instance, you need to know where each sample comes from: in clinical datasets it is important to know whether the reads come from a patient or from someone in the control group.

This kind of information is called metadata and is stored together with the actual data.


Exercise 1: Downloading a data set for the introduction training

For the introduction training we will use a data set containing short Illumina reads from Arabidopsis thaliana infected with a pathogen, Pseudomonas syringae, versus mock treated controls. The data set is described in the article of Cumbie et al., 2011.

The authors provide an ArrayExpress ID (E-GEOD-25818) in the section Analysis of a pilot RNA-Seq experiment, but this ID points to Affymetrix microarray data and not to NGS data:

Go to the ArrayExpress home page

Warning: IDs provided in articles are not always accurate!

Fortunately I could find the data in NCBI's SRA database, so we know the SRA ID. Since the connection with NCBI is too slow, we will do the download from ENA using the SRA ID.

Go to the EBI website.

It can take some time to download the file since it's very big. Firefox will give you an estimate on how long it's going to take. If it takes too long, cancel the download and use the file that is already present on the BITS laptops in the /Documents/NGSdata folder as SRR074262.fastq.
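Instead of clicking through the website, the location of the fastq file can also be looked up with a script. The sketch below is in Python and assumes the ENA Portal API "filereport" endpoint with the `read_run` result type and a `fastq_ftp` field; check the current ENA documentation before relying on these names. It also assumes that the `fastq_ftp` cells hold semicolon-separated locations without a URL scheme. The function names are hypothetical.

```python
import csv
import io
from urllib.parse import urlencode

# Assumed ENA Portal API endpoint; verify against the current ENA documentation.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def ena_filereport_url(accession, fields=("run_accession", "fastq_ftp")):
    """Build the URL that asks ENA which fastq files belong to an accession."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def fastq_links(report_tsv):
    """Extract fastq download links from the tab-separated report text.

    Assumes a 'fastq_ftp' column whose cells hold one or more
    semicolon-separated locations without a URL scheme.
    """
    links = []
    for row in csv.DictReader(io.StringIO(report_tsv), delimiter="\t"):
        for location in (row.get("fastq_ftp") or "").split(";"):
            if location:
                links.append("ftp://" + location)
    return links
```

To use it, fetch the report text with `urllib.request.urlopen(ena_filereport_url("SRR074262"))`, pass the decoded text to `fastq_links`, and save each link with `urllib.request.urlretrieve`. Nothing is downloaded by the functions themselves, so you can inspect the links first.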

In a normal analysis we would of course download all 6 data files of this study. It is only because of time constraints that we use a single sample during the training. If you are analyzing all 6 samples, you need to look at the metadata to know which samples are controls and which are treated (in this case, treated with a plant pathogen).

In ENA and SRA, annotation is found in the record of the NGS study.
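Once the metadata has been exported as a tab-separated table, sorting the runs into controls and treated samples can be scripted. The Python sketch below is only an illustration: the column names `run_accession` and `sample_title` and the keywords `mock` and `control` are assumptions, so inspect the actual metadata of the study and adapt them.

```python
import csv
import io

# Hypothetical keywords; adapt them to how the study actually labels its samples.
CONTROL_KEYWORDS = ("mock", "control")


def split_controls(metadata_tsv, id_column="run_accession", label_column="sample_title"):
    """Split run IDs into controls and treated samples based on a free-text label.

    The column names are assumptions about the exported metadata table.
    """
    groups = {"control": [], "treatment": []}
    for row in csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"):
        label = row[label_column].lower()
        is_control = any(keyword in label for keyword in CONTROL_KEYWORDS)
        groups["control" if is_control else "treatment"].append(row[id_column])
    return groups
```

Keyword matching on free text is fragile, so always check the resulting groups by eye before starting the analysis.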


Exercise 2: Downloading a data set for the ChIP-Seq training

Exercise created by Morgane Thomas-Chollier

For the ChIP-Seq training, we are going to use the data set that is described in the article of Myers et al., 2013 [3]. The article contains the following sentence at the end of the Materials and Methods section:
"All genome-wide data from this publication have been deposited in NCBI’s Gene Expression Omnibus (GSE41195)."
In this case GSE41195 is the ID of the experiment in the GEO database.

Go to the GEO home page

Again, downloading the data from NCBI would take too long, so we will download from EBI instead.

Go to the EBI website.

It took only a few minutes to download the data on my laptop at work, but the internet connection at work will be faster than the one in the training room. Firefox will give you an estimate of the time it takes for the download. If it is too long, cancel the download and use the file that has already been downloaded and is available on the BITS laptops:

  • on Windows: in the /Documents/NGSdata folder as SRR576933.fastq
  • on Linux: in the /home/bits/NGS/ChIPSeq folder as SRR576933.fastq
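The fallback described above (use the copy that is already on the laptop when the download takes too long) can be sketched in Python as follows. The folder names are the ones mentioned on this page; the function name is hypothetical, and you should adjust the folders to your own machine.

```python
from pathlib import Path

# Folders named on this page; adjust them to your own machine.
LOCAL_DATA_DIRS = ("/Documents/NGSdata", "/home/bits/NGS/ChIPSeq")


def find_local_copy(filename, search_dirs=LOCAL_DATA_DIRS):
    """Return the first existing, non-empty copy of filename, or None."""
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file() and candidate.stat().st_size > 0:
            return candidate
    return None
```

A script could call `find_local_copy("SRR576933.fastq")` first and start the download only when the result is `None`.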

ChIP-Seq always compares the ChIP sample to a control sample, consisting of genomic DNA isolated from cells that were cross-linked and fragmented under the same conditions as the ChIP sample or of DNA fragments isolated in a “mock” ChIP reaction using an antibody that reacts with an irrelevant, non-nuclear protein.

In this data set, the control samples consist of full genomic DNA. To download a control sample, we would repeat the same steps, starting from the GEO record of the ChIP-Seq experiment and clicking the GEO sample ID of the anaerobic INPUT DNA sample. However, the fastq file is already available in the same data folders (SRR576938.fastq).


Downloading data sets via Linux command line

See the exercises in the section on the Linux command line

Downloading data sets via R

Exercise created by Stephane Plaisance

Once you know the SRA or ENA ID of the data set you can download the data and the metadata automatically via an R script.
See the exercises of the RNA-Seq training to learn how to do this.


References:
  1. https://insilicodb.org/
  2. http://www.illumina.com/science/data_library.ilmn
  3. http://www.ncbi.nlm.nih.gov/pubmed/23818864