Downloading NGS data from NCBI

From BITS wiki


First of all, you need data to analyze. You can generate your own data, but a lot of NGS data is also available on the internet. Check whether someone has already done the experiments you want to do: it might save you a lot of time and money. The main repositories for NGS data are:

  • Sequence Read Archive (SRA) from NCBI [1]: stores raw data files in sra format, which FastQC does not accept
  • Gene Expression Omnibus (GEO) from NCBI [2]: stores processed data files from gene-expression related experiments: RNA-Seq, ChIP-Seq
  • European Nucleotide Archive (ENA) from EBI [3]: stores raw data files in fastq format
  • ArrayExpress from EBI [4]: stores processed data files from gene-expression related experiments: RNA-Seq, ChIP-Seq
  • In Silico DB from the ULB [5]

If you have an article describing an NGS dataset that is of interest to you, you should search in the article for a sentence mentioning the deposition of the NGS data in a database.
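
If the text of the article is available as plain text, a simple pattern search can surface candidate IDs. The sketch below is only an illustration: the patterns cover the ID styles that appear on this page (GEO series such as GSE41195, SRA runs such as SRR074262, ArrayExpress IDs such as E-GEOD-25818) and are assumptions, not a complete list of accession formats. The function name find_accessions is hypothetical.

```python
import re

# Assumed patterns, based only on the ID styles mentioned on this page.
ACCESSION_PATTERNS = {
    "GEO series": r"\bGSE\d+\b",
    "SRA run": r"\b[SED]RR\d+\b",
    "ArrayExpress": r"\bE-[A-Z]{4}-\d+\b",
}

def find_accessions(text):
    """Return a dict mapping each database label to the sorted IDs found in text."""
    found = {}
    for label, pattern in ACCESSION_PATTERNS.items():
        matches = sorted(set(re.findall(pattern, text)))
        if matches:
            found[label] = matches
    return found

sentence = ("All genome-wide data from this publication have been deposited "
            "in NCBI's Gene Expression Omnibus (GSE41195).")
print(find_accessions(sentence))
```

A hit only tells you where to start looking: always check in the database that the ID really points to the data you expect.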


Exercise 1: Downloading a data set for the introduction training

For the introduction training we will use a data set containing short Illumina reads from Arabidopsis thaliana infected with a pathogen, Pseudomonas syringae, versus mock treated controls. The data set is described in the article of Cumbie et al., 2011.

Although the authors provide an ArrayExpress ID (E-GEOD-25818) in the section "Analysis of a pilot RNA-Seq experiment", this ID points to an Affymetrix microarray data set:

Go to the ArrayExpress home page

Warning: IDs that are provided in articles are not always accurate!

Fortunately, I could find the data in NCBI's SRA database, so we could download it from there. However, the internet connection to NCBI is very slow: when I tried it on my laptop in the training room, the download took several hours. This is why we will not download the data set from NCBI but from EBI in Europe, using the SRA ID. ENA is the sequence database of EBI. NCBI and EBI exchange the contents of their sequence databases daily, so the contents of the SRA database are also available in ENA. As a result, ENA recognizes SRA identifiers.
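
Because ENA recognizes SRA identifiers, the location of the fastq file can also be looked up with a script instead of through the website. The sketch below only builds the lookup URL and does not contact EBI. It assumes the ENA Portal API "filereport" service with the parameters accession, result, fields and format; the endpoint and parameter names are assumptions that should be checked against the current ENA documentation, and the function name ena_filereport_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: the "filereport" service of the ENA Portal API.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"

def ena_filereport_url(run_accession):
    """Build the URL that asks ENA where the fastq file(s) of a run are stored."""
    query = urlencode({
        "accession": run_accession,  # SRA run ID, e.g. SRR074262
        "result": "read_run",        # assumed result type for sequencing runs
        "fields": "fastq_ftp",       # assumed field holding the fastq location
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"

print(ena_filereport_url("SRR074262"))
```

Opening the printed URL in a browser should return a small table with the location of the fastq file, if the assumptions above hold.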

Go to the EBI website.

It can take some time to download the file since it is very big. Firefox will give you an estimate of how long the download is going to take. If it takes too long, cancel the download and use the file that is already present on the BITS laptops in the /Documents/NGSdata folder as SRR074262.fastq.


Exercise 2: Downloading a data set for the ChIP-Seq training

Exercise created by Morgane Thomas-Chollier

For the ChIP-Seq training, we are going to use the data set that is described in the article of Myers et al., 2013 [6]. The data consists of reads from ChIP enriched genomic DNA fragments that interact with FNR, a well-studied global transcription regulator of anaerobiosis. As a control, reads from fragmented genomic DNA were used.
The article contains the following sentence at the end of the Materials and Methods section:
"All genome-wide data from this publication have been deposited in NCBI’s Gene Expression Omnibus (GSE41195)."
In this case GSE41195 is the identifier that allows you to retrieve the dataset from the NCBI GEO (Gene Expression Omnibus) database.

GEO hosts processed data files from experiments related to gene expression studies, based on NGS or on microarrays. The processed data files of NGS experiments can include alignment, peak and/or count data.

Go to the GEO home page

Again, it will take too long to download the data from the NCBI website. This is why we will do the download from the European Bioinformatics Institute.

SRA identifiers are also recognized by ENA, the sequence database of EBI, so we can download the file from EBI instead. Since EBI is located in Europe, downloading data sets from there is faster and also simpler.

Go to the EBI website.

It took only a few minutes to download the data file on my laptop at work, but the internet connection there may be faster than the one in the training room. Firefox will give you an estimate of how long the download will take. If it takes too long, cancel the download and use the file that has already been downloaded and is available on the BITS laptops:

  • On Windows: in the /Documents/NGSdata folder as SRR576933.fastq
  • On Linux: in the /home/bits/NGS/ChIP-Seq folder as SRR576933.fastq

To download the control data set, repeat the same steps, starting from the GEO web page of the ChIP-Seq experiment, and click the sample ID of the anaerobic INPUT DNA sample. This fastq file is also available in the same data folders (as SRR576938.fastq).
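
Before starting a long download, it can help to check whether the file is already on the laptop. The following is a minimal sketch that looks for <run ID>.fastq in the two data folders named above. The function name find_local_fastq and the variable DEFAULT_DIRS are hypothetical, and the folder paths are copied from this page as written, so they may need adjusting on your machine.

```python
from pathlib import Path

# Data folders named on this page (Windows and Linux BITS laptops).
DEFAULT_DIRS = [Path("/Documents/NGSdata"), Path("/home/bits/NGS/ChIP-Seq")]

def find_local_fastq(run_accession, search_dirs=DEFAULT_DIRS):
    """Return the first existing <run_accession>.fastq in search_dirs, or None."""
    for folder in search_dirs:
        candidate = Path(folder) / f"{run_accession}.fastq"
        if candidate.is_file():
            return candidate
    return None

# ChIP sample and INPUT control of the ChIP-Seq exercise.
for run in ("SRR576933", "SRR576938"):
    local = find_local_fastq(run)
    print(run, "->", local if local else "not found locally, download from EBI")
```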



References:
  1. http://www.ncbi.nlm.nih.gov/sra
  2. http://www.ncbi.nlm.nih.gov/geo/
  3. http://www.ebi.ac.uk/ena/
  4. http://www.ebi.ac.uk/arrayexpress/
  5. https://insilicodb.org/
  6. http://www.ncbi.nlm.nih.gov/pubmed/23818864