Downloading NGS data from the internet
NGS data repositories
First of all, you need data to analyze. You can generate your own data, but a lot of NGS data is already available on the internet.
The main repositories for NGS data:
| ||NCBI - US||EBI - Europe|
| || ||Close by, so faster downloads|
|Gene expression database||GEO||ArrayExpress|
|Contain processed NGS data, no raw data||ID starts with G||ID starts with E-|
|NGS sequence database||SRA||ENA|
|Contain raw NGS data||ID starts with SR||ID starts with ER|
| ||ENA IDs also used by SRA||SRA IDs also used by ENA|
| ||Stores reads in sra format||Stores reads in fastq format|
Both GEO and SRA use multiple types of IDs, ordered according to a certain hierarchy:
|GEO ID||points to||definition|
|ID starts with GSE||experiment||Data of a full NGS experiment consisting of multiple samples. The samples belong to different groups that are to be compared, e.g. treated and control samples.|
|ID starts with GSM||sample||Data of one single sample.|
|SRA ID||points to||definition|
|ID starts with SRP||study||Studies have an overall goal and may comprise several experiments.|
|ID starts with SRX||experiment||An experiment describes what was sequenced and the method used. It includes info on the source of the DNA, samples, sequencing platform and the processing of the data.|
|ID starts with SRR||run||Data of a particular sequencing experiment. Experiments may contain many runs, depending on the number of instrument runs that were needed.|
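The ID prefixes in the two tables above can be turned into a small lookup. The following is a minimal sketch, not part of the original training material: the function name `classify_accession` and the list `ACCESSION_PREFIXES` are made up for illustration, and only the prefixes listed in the tables are covered.

```python
# Prefix-to-level mapping copied from the GEO and SRA tables above.
# Names in this block are illustrative assumptions, not an official API.
ACCESSION_PREFIXES = [
    ("GSE", "GEO", "experiment"),
    ("GSM", "GEO", "sample"),
    ("SRP", "SRA", "study"),
    ("SRX", "SRA", "experiment"),
    ("SRR", "SRA", "run"),
]


def classify_accession(accession):
    """Return (database, level) for an ID, or None if its prefix is not in the tables."""
    acc = accession.strip().upper()
    for prefix, database, level in ACCESSION_PREFIXES:
        if acc.startswith(prefix):
            return database, level
    return None


if __name__ == "__main__":
    for acc in ["GSE41195", "SRX189773", "SRR576933", "E-GEOD-25818"]:
        print(acc, classify_accession(acc))
```

The ArrayExpress ID in the last example yields `None`, because the two ID tables only describe the GEO and SRA hierarchies.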
There are two other resources of NGS data:
If you have an article describing an NGS data set that interests you, search the article for a sentence mentioning the ID of the data in one of these databases.
Metadata of NGS data sets
You need more than just the data: you also need extra information to be able to do the analysis. For instance, you need to know where each sample comes from: in clinical data sets it is important to know whether the reads come from a patient or from someone in the control group.
This kind of information is called metadata and is stored together with the actual data.
Exercise 1: Downloading a data set for the introduction training
For the introduction training we will use a data set containing short Illumina reads from Arabidopsis thaliana infected with a pathogen, Pseudomonas syringae, versus mock treated controls. The data set is described in the article of Cumbie et al., 2011.
The authors provide an ArrayExpress ID (E-GEOD-25818) in the section Analysis of a pilot RNA-Seq experiment, but this ID points to Affymetrix microarray data and not to NGS data:
Find the description of the experiment with ArrayExpress ID E-GEOD-25818.
Fortunately, I could find the data in NCBI's SRA database, so we know the SRA ID. Because the connection with NCBI is too slow, we will download the data from ENA using the SRA ID.
Go to the EBI website.
Download the data set with SRA ID SRR074262 from ENA.
It can take some time to download the file since it's very big. Firefox will give you an estimate of how long it's going to take. If it takes too long, cancel the download and use the file that is already present on the BITS laptops in the /Documents/NGSdata folder as SRR074262.fastq.
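The same lookup can also be scripted instead of clicking through the website. The sketch below assumes that ENA's Portal API offers a `filereport` endpoint that returns a tab-separated table with a `fastq_ftp` column for a run ID; check the current ENA documentation before relying on it. The function names are made up for illustration, and the code only lists the file locations, it does not download the reads.

```python
import csv
import io
import urllib.parse
import urllib.request

# Assumption: ENA Portal API "filereport" endpoint (verify against the ENA docs).
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession):
    """Build the query URL asking ENA for the FASTQ locations of an accession."""
    query = urllib.parse.urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_fastq_urls(report_text):
    """Extract FASTQ locations from a tab-separated file report."""
    urls = []
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    for row in reader:
        # Assumption: several files of one run are separated by ';'.
        for location in (row.get("fastq_ftp") or "").split(";"):
            location = location.strip()
            if not location:
                continue
            # Assumption: locations are listed without a scheme, so add one.
            if "://" not in location:
                location = "ftp://" + location
            urls.append(location)
    return urls


def fetch_fastq_urls(accession):
    """Ask ENA (network access needed) where the FASTQ files of an accession are."""
    with urllib.request.urlopen(filereport_url(accession)) as response:
        return parse_fastq_urls(response.read().decode("utf-8"))
```

Calling `fetch_fastq_urls("SRR074262")` contacts ENA over the network. Each returned location could then be saved with `urllib.request.urlretrieve(url, filename)`, which will take as long as the browser download for a file of this size.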
In a normal analysis we would of course download all 6 data files of this study. It's only because of time limits that we will only use a single sample during the training. If you are analyzing the 6 samples you need to take a look at the metadata to know which samples represent controls and which samples represent the treatment (in this case treatment with a plant pathogen).
In ENA and SRA, this annotation is found in the record of the NGS study.
Go to the ENA record of the study the downloaded sample belongs to and look at the grouping of the samples.
The sample that we have downloaded for the introduction training thus comes from the group of infected samples.
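Once the metadata is saved as a table, splitting the runs into treatment and control groups takes only a few lines. This is a minimal sketch under stated assumptions: the column names `run` and `condition` and the IDs in `EXAMPLE` are placeholders invented for illustration, not the real runs or labels of this study, so adapt them to the columns of the table you actually export.

```python
import csv
import io
from collections import defaultdict


def group_runs(metadata_tsv, id_column="run", group_column="condition"):
    """Group run IDs by the value found in the chosen metadata column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"):
        groups[row[group_column]].append(row[id_column])
    return dict(groups)


# Placeholder table: IDs and labels are made up, NOT taken from ENA or SRA.
EXAMPLE = "run\tcondition\nRUN_A\tinfected\nRUN_B\tmock\nRUN_C\tinfected\n"

if __name__ == "__main__":
    print(group_runs(EXAMPLE))
```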
Exercise 2: Downloading a data set for the ChIP-Seq training
Exercise created by Morgane Thomas-Chollier
For the ChIP-Seq training, we are going to use the data set that is described in the article of Myers et al., 2013. The article contains the following sentence at the end of the Materials and Methods section:
"All genome-wide data from this publication have been deposited in NCBI’s Gene Expression Omnibus (GSE41195)."
In this case GSE41195 is the ID of the experiment in the GEO database.
Go to the GEO home page.
Download the data of the experiment with GEO ID GSE41195.
Again, it will take too long to download the data from NCBI. So we will do the download from EBI.
Go to the EBI website.
Download the data with SRA ID SRX189773.
It took only a few minutes to download the data on my laptop at work, but the internet connection at work will be faster than the one in the training room. Firefox will give you an estimate of the time it takes for the download. If it is too long, cancel the download and use the file that has already been downloaded and is available on the BITS laptops:
- On Windows: in the /Documents/NGSdata folder as SRR576933.fastq
- On Linux: in the /home/bits/NGS/ChIPSeq folder as SRR576933.fastq
ChIP-Seq always compares the ChIP sample to a control sample. The control consists either of genomic DNA isolated from cells that were cross-linked and fragmented under the same conditions as the ChIP sample, or of DNA fragments isolated in a “mock” ChIP reaction using an antibody that reacts with an irrelevant, non-nuclear protein.
In this data set, control samples consist of full genomic DNA. To download a control sample, we would repeat the same steps, starting from the GEO record of the ChIP-Seq experiment and clicking the GEO sample ID of the anaerobic INPUT DNA sample. However, the fastq file is already available in the same data folders (SRR576938.fastq).
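Before starting a long download, you can check whether a copy of the file is already present on the laptop. The sketch below is only an illustration: the helper name `find_local_fastq` is made up, and the folder paths are copied as written in the text above, so they may need adjusting on your own machine.

```python
from pathlib import Path

# Folders named in the text above; adjust them to your own machine.
DEFAULT_FOLDERS = ["/Documents/NGSdata", "/home/bits/NGS/ChIPSeq"]


def find_local_fastq(filename, folders=DEFAULT_FOLDERS):
    """Return the path of the first existing copy of filename, or None."""
    for folder in folders:
        candidate = Path(folder) / filename
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # ChIP sample and control sample mentioned in this exercise.
    for name in ["SRR576933.fastq", "SRR576938.fastq"]:
        print(name, "->", find_local_fastq(name))
```

A result of `None` means no local copy was found in the listed folders, so the file still has to be downloaded.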
Downloading data sets via Linux command line
Downloading data sets via R
Exercise created by Stephane Plaisance
Once you know the SRA or ENA ID of the data set you can download the data and the metadata automatically via an R script.
See the exercises of the RNA-Seq training to learn how to do this.