Downloading NGS data from NCBI

From BITS wiki

First of all, you need data to analyze. You can generate your own data, but a lot of NGS data is also available on the internet. Check whether someone has already done the experiments you want to do: it might save you a lot of time and money. The main repositories for NGS data are:

  • Sequence Read Archive (SRA) from NCBI [1]: stores raw data files in SRA format, which FastQC does not accept
  • Gene Expression Omnibus (GEO) from NCBI [2]: stores processed data files from gene-expression related experiments: RNA-Seq, ChIP-Seq
  • European Nucleotide Archive (ENA) from EBI [3]: stores raw data files in FASTQ format
  • ArrayExpress from EBI [4]: stores processed data files from gene-expression related experiments: RNA-Seq, ChIP-Seq
  • In Silico DB from the ULB [5]

If you have an article describing an NGS dataset that is of interest to you, you should search in the article for a sentence mentioning the deposition of the NGS data in a database.


Exercise 1: Downloading a data set for the introduction training

For the introduction training we will use a data set containing short Illumina reads from Arabidopsis thaliana infected with a pathogen, Pseudomonas syringae, versus mock treated controls. The data set is described in the article of Cumbie et al., 2011.

Although the authors provide an ArrayExpress ID (E-GEOD-25818) in the section "Analysis of a pilot RNA-Seq experiment", this ID points to an Affymetrix microarray data set:

Go to the ArrayExpress home page

Warning: IDs that are provided in articles are not always accurate!

Fortunately I could find the data in NCBI's SRA database, so we could download the data there. However, the internet connection with NCBI is very slow: I tried it on my laptop using the slow internet connection in the training room and it took several hours. This is why we will not download the data set from NCBI but from EBI in Europe, using the SRA ID. ENA is the sequence database of EBI. NCBI and EBI exchange the contents of their sequence databases daily, so the contents of the SRA database are also available in ENA. As a result, ENA recognizes SRA identifiers.

Go to the EBI website.

It can take some time to download the file since it's very big. Firefox will give you an estimate of how long it's going to take. If it takes too long, cancel the download and use the file that is already present on the BITS laptops in the /Documents/NGSdata folder as SRR074262.fastq.
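If you prefer a script to the browser, the download location can be derived from the run ID. The sketch below is only an illustration: it assumes that ENA serves gzipped FASTQ files from ftp.sra.ebi.ac.uk under the /vol1/fastq directory layout, and the function name ena_fastq_url is made up for this example. Always compare the result with the link shown on the ENA page of the run, since not every run has a FASTQ file at the expected path.

```python
from typing import Optional

# Assumption: base location of ENA's FASTQ files; check the ENA run page.
ENA_FASTQ_BASE = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq"


def ena_fastq_url(run_accession: str, mate: Optional[int] = None) -> str:
    """Build the expected ENA FTP URL of the gzipped FASTQ file of a run.

    mate is None for single-end runs, or 1 / 2 for the files of a
    paired-end run.
    """
    acc = run_accession.strip().upper()
    if len(acc) < 9 or acc[:3] not in ("SRR", "ERR", "DRR") or not acc[3:].isdigit():
        raise ValueError(f"not a run accession: {run_accession!r}")

    # Files are grouped by the first six characters of the accession.
    # Accessions longer than nine characters get an extra subdirectory made
    # of the remaining digits, zero-padded to three characters.
    extra = acc[9:]
    group = f"{acc[:6]}/{extra.zfill(3)}" if extra else acc[:6]

    suffix = f"_{mate}" if mate else ""
    return f"{ENA_FASTQ_BASE}/{group}/{acc}/{acc}{suffix}.fastq.gz"


if __name__ == "__main__":
    # The run used in this exercise.
    print(ena_fastq_url("SRR074262"))
```

The printed URL can be pasted into a browser or passed to a download tool. The file is gzip-compressed, so it has to be unpacked to obtain SRR074262.fastq.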


Exercise 2: Downloading a data set for the ChIP-Seq training

Exercise created by Morgane Thomas-Chollier

For the ChIP-Seq training, we are going to use the data set that is described in the article of Myers et al., 2013 [6]. The data consists of reads from ChIP enriched genomic DNA fragments that interact with FNR, a well-studied global transcription regulator of anaerobiosis. As a control, reads from fragmented genomic DNA were used.

NGS datasets are (usually) made freely accessible by depositing them in specialized databases. The Sequence Read Archive (SRA), located in the USA and hosted by NCBI, and its European equivalent, the European Nucleotide Archive (ENA), located in England and hosted by EBI, both contain raw, unprocessed reads.

Processed reads from functional genomics datasets (transcriptomics, genome-wide binding such as ChIP-Seq, ...) are deposited in Gene Expression Omnibus (GEO) or its European equivalent, ArrayExpress.

The article contains the following sentence at the end of the Materials and Methods section:
"All genome-wide data from this publication have been deposited in NCBI’s Gene Expression Omnibus (GSE41195)."
In this case, GSE41195 is the identifier that allows you to retrieve the dataset from the NCBI GEO (Gene Expression Omnibus) database.

GEO hosts processed data files from experiments related to gene expression studies, based on NGS or microarrays. The files of NGS experiments can include alignments, peaks and/or counts.
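A GEO identifier can also be turned directly into the address of its record page. The following is a minimal sketch, assuming that GEO records are reachable through the acc.cgi query page with an acc parameter; the function name geo_record_url is invented for this example, and only series (GSE) and sample (GSM) identifiers are handled.

```python
from urllib.parse import urlencode

# Assumption: GEO's accession query page.
GEO_QUERY = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def geo_record_url(accession: str) -> str:
    """Return the URL of the GEO record page of a series or sample."""
    acc = accession.strip().upper()
    # GSE = series (a whole experiment), GSM = a single sample.
    if not (acc.startswith(("GSE", "GSM")) and acc[3:].isdigit()):
        raise ValueError(f"not a GEO series or sample accession: {accession!r}")
    return f"{GEO_QUERY}?{urlencode({'acc': acc})}"


if __name__ == "__main__":
    # The series mentioned in the article of Myers et al.
    print(geo_record_url("GSE41195"))
```

Opening the printed address in a browser should show the same record that you reach by searching for GSE41195 on the GEO home page.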

Go to the GEO home page

Although direct access to the SRA database at NCBI is possible, SRA does not store sequences in FASTQ format. So, in practice, it's simpler (and quicker!) to download datasets from the ENA database (European Nucleotide Archive) hosted by EBI (European Bioinformatics Institute) in the UK. ENA encompasses the data from SRA.

SRA identifiers are also recognized by ENA so we can download the file from EBI.
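Because the different databases use different kinds of identifiers, it helps to recognize them by their prefix. The sketch below is a rough illustration only: it covers the identifier types that appear on this page plus a few closely related ones, the patterns are simplified, and the function name classify_accession is made up for the example.

```python
import re

# Simplified patterns; each is paired with a short description.
# S/E/D prefixes are issued by NCBI, EBI and DDBJ respectively.
ACCESSION_PATTERNS = [
    (re.compile(r"^[SED]RR\d+$"), "SRA/ENA run (raw reads)"),
    (re.compile(r"^[SED]RX\d+$"), "SRA/ENA experiment"),
    (re.compile(r"^[SED]RP\d+$"), "SRA/ENA study"),
    (re.compile(r"^GSE\d+$"), "GEO series"),
    (re.compile(r"^GSM\d+$"), "GEO sample"),
    (re.compile(r"^E-[A-Z]{4}-\d+$"), "ArrayExpress experiment"),
]


def classify_accession(accession: str) -> str:
    """Return a description of the database record type, or 'unknown'."""
    acc = accession.strip().upper()
    for pattern, description in ACCESSION_PATTERNS:
        if pattern.match(acc):
            return description
    return "unknown"


if __name__ == "__main__":
    # Identifiers taken from the two exercises on this page.
    for acc in ("SRR074262", "GSE41195", "E-GEOD-25818"):
        print(acc, "->", classify_accession(acc))
```

Only run identifiers (the ones ending up in the first category) point to files of raw reads; series and experiment identifiers point to records from which you still have to navigate to the individual runs.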

Go to the EBI website.

For the training you do not have to download the data: it's already on the GenePattern server.

To download the replicate and the control data set, we should redo the same steps starting from the GEO web page of the ChIP-Seq experiment (click the sample ID of the FNR IP ChIP-seq Anaerobic B and the anaerobic INPUT DNA sample). The fastq file of the control sample is also available on the GenePattern server.


References:
  1. http://www.ncbi.nlm.nih.gov/sra
  2. http://www.ncbi.nlm.nih.gov/geo/
  3. http://www.ebi.ac.uk/ena/
  4. http://www.ebi.ac.uk/arrayexpress/
  5. https://insilicodb.org/
  6. http://www.ncbi.nlm.nih.gov/pubmed/23818864