Downloading NGS data from NCBI

First of all, you need data to analyze. You can generate your own data, but a lot of NGS data is also available on the internet. Check whether someone has already done the experiments you want to do: it might save you a lot of time and money. The main repositories for NGS data are:

Sequence Read Archive (SRA) from NCBI^[1]: stores raw data files in SRA format, which FastQC does not accept
Gene Expression Omnibus (GEO) from NCBI^[2]: stores processed data files from gene-expression related experiments: RNA-Seq, ChIP-Seq
European Nucleotide Archive (ENA) from EBI^[3]: stores raw data files in fastq format
ArrayExpress from EBI^[4]: stores processed data files from gene-expression related experiments: RNA-Seq, ChIP-Seq
In Silico DB from the ULB^[5]

If you have an article describing an NGS dataset that is of interest to you, you should search in the article for a sentence mentioning the deposition of the NGS data in a database.

Exercise 1: Downloading a data set for the introduction training

For the introduction training we will use a data set containing short Illumina reads from Arabidopsis thaliana infected with a pathogen, Pseudomonas syringae, versus mock treated controls. The data set is described in the article of Cumbie et al., 2011.

Although the authors provide an ArrayExpress ID (E-GEOD-25818) in the section "Analysis of a pilot RNA-Seq experiment", this ID points to an Affymetrix microarray data set:

Go to the ArrayExpress home page

Find the description of the experiment with ArrayExpress ID E-GEOD-25818:
Type the ID in the search box on the ArrayExpress home page.
Click Search.
You immediately see that the experiment is stored as a Transcription profiling by array experiment (red) and that Affymetrix GeneChip Arabidopsis Genome [ATH1-121501] is described as the platform that was used (green).
Click the "Click for detailed sample information and links to data" link (blue).
You see that you will effectively download .CEL files, the file type for storing raw Affymetrix microarray data.

Warning: IDs provided in articles are not always accurate!

Fortunately, I could find the data in NCBI's SRA database, so we could download the data there. However, the internet connection with the NCBI is very slow. I tried it on my laptop using the slow internet connection in the training room and it took several hours. This is why we will not download the data set from NCBI but from EBI in Europe, using the SRA ID. ENA is the sequence database of EBI. NCBI and EBI exchange the contents of their sequence databases daily, so the contents of the SRA database are also part of ENA. As a result, ENA recognizes SRA identifiers.

Go to the EBI website.

Download the data set with ID SRR074262 from ENA:
Type SRR074262 in the search box.
Click Search.
Since we are using an SRA ID as a search term, we're doing a very specific search, so the search returns a single SRA record.
Click the SRA ID on the results page that leads to the actual data of the run.
To download the data in fastq format, scroll to the table at the bottom of the page.
Click the link in the Fastq files (ftp) column.
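
The same lookup can probably be scripted instead of clicked through. The sketch below is not part of the original exercise: it assumes the ENA Portal API "filereport" endpoint and its run_accession and fastq_ftp fields, which this page does not mention, so check the current ENA documentation before relying on it. It only builds the request URL and parses a tab-separated report; fetching the report is left to your browser or download tool.

```python
from urllib.parse import urlencode

# Assumption: ENA Portal API file report endpoint (not named in this tutorial).
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession, fields=("run_accession", "fastq_ftp")):
    """Build the URL of a tab-separated file report for a run or experiment ID."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",       # one row per sequencing run
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_fastq_links(tsv_text):
    """Return {run_accession: [fastq paths]} from the text of a file report."""
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    if not lines:
        return {}
    header = lines[0].split("\t")
    run_col = header.index("run_accession")
    ftp_col = header.index("fastq_ftp")
    links = {}
    for line in lines[1:]:
        cells = line.split("\t")
        # A run with several fastq files lists them separated by ';'.
        links[cells[run_col]] = [path for path in cells[ftp_col].split(";") if path]
    return links


if __name__ == "__main__":
    # Prints the URL to open for the run used in this exercise.
    print(filereport_url("SRR074262"))
```

Opening the printed URL should return a small table whose fastq_ftp column corresponds to the Fastq files (ftp) column on the web page; pass its text to parse_fastq_links to get the file paths per run.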

It can take some time to download the file since it's very big. Firefox will give you an estimate of how long it's going to take. If it takes too long, cancel the download and use the file that is already present on the BITS laptops in the /Documents/NGSdata folder as SRR074262.fastq.
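
After a long download (or a cancelled one) it is worth checking that the fastq file is complete. The following is a minimal sketch, assuming the common layout of four lines per read (header starting with @, sequence, separator, qualities); the function names are made up for this example and files with sequences wrapped over several lines are not handled.

```python
import gzip


def open_fastq(path):
    """Open a plain or gzip-compressed fastq file as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path):
    """Count reads, assuming exactly four lines per fastq record."""
    n_lines = 0
    with open_fastq(path) as handle:
        for n_lines, line in enumerate(handle, start=1):
            # The first line of every record is a header starting with '@'.
            if n_lines % 4 == 1 and not line.startswith("@"):
                raise ValueError(f"line {n_lines} is not a fastq header")
    if n_lines % 4:
        raise ValueError("file ends in the middle of a record (truncated download?)")
    return n_lines // 4
```

For example, count_reads("SRR074262.fastq") returns the number of reads in the file, and raises an error if the last record is incomplete.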

Exercise 2: Downloading a data set for the ChIP-Seq training

Exercise created by Morgane Thomas-Chollier

For the ChIP-Seq training, we are going to use the data set that is described in the article of Myers et al., 2013 ^[6]. The data consists of reads from ChIP enriched genomic DNA fragments that interact with FNR, a well-studied global transcription regulator of anaerobiosis. As a control, reads from fragmented genomic DNA were used.

NGS datasets are (usually) made freely accessible by depositing them in specialized databases. The Sequence Read Archive (SRA), located in the USA and hosted by NCBI, and its European equivalent, the European Nucleotide Archive (ENA), located in England and hosted by EBI, both contain raw, unprocessed reads.

Processed reads from functional genomics datasets (transcriptomics, genome-wide binding such as ChIP-Seq, ...) are deposited in Gene Expression Omnibus (GEO) or its European equivalent ArrayExpress.

The article contains the following sentence at the end of the Materials and Methods section:
"All genome-wide data from this publication have been deposited in NCBI’s Gene Expression Omnibus (GSE41195)."
In this case GSE41195 is the identifier that allows you to retrieve the dataset from the NCBI GEO (Gene Expression Omnibus) database.
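
The prefix of an identifier tells you which database and record type it belongs to. As an illustration, here is a small sketch that scans a piece of article text for the kinds of identifiers used on this page. The patterns and labels are simplified assumptions made for this example and cover only a few accession types, not every one the databases issue.

```python
import re

# Simplified patterns for the identifier types used in this tutorial.
# Assumption: digits-only suffixes; many other accession types exist.
ACCESSION_TYPES = [
    (r"E-[A-Z]{4}-\d+", "ArrayExpress experiment"),
    (r"GSE\d+", "GEO series"),
    (r"GSM\d+", "GEO sample"),
    (r"[SED]RX\d+", "SRA/ENA experiment"),
    (r"[SED]RR\d+", "SRA/ENA run"),
]

_ANY_ACCESSION = re.compile(
    r"\b(?:" + "|".join(pattern for pattern, _ in ACCESSION_TYPES) + r")\b"
)


def classify(accession):
    """Return a label for one identifier, or 'unknown' if no pattern fits."""
    for pattern, label in ACCESSION_TYPES:
        if re.fullmatch(pattern, accession):
            return label
    return "unknown"


def find_accessions(text):
    """Return (identifier, label) pairs in the order they appear in the text."""
    return [(m.group(0), classify(m.group(0))) for m in _ANY_ACCESSION.finditer(text)]


if __name__ == "__main__":
    sentence = ("All genome-wide data from this publication have been deposited "
                "in NCBI's Gene Expression Omnibus (GSE41195).")
    print(find_accessions(sentence))  # [('GSE41195', 'GEO series')]
```

The label only describes what kind of ID it is. As Exercise 1 shows, you still have to open the record to check that it contains the data you expect.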

GEO hosts processed data files from experiments related to gene expression studies, based on NGS or microarrays. The files of NGS experiments can include alignments, peaks and/or counts.

Go to the GEO home page

Download the data of the experiment with GEO ID GSE41195:
Type the ID in the search box on the GEO home page.
Click Search.
This redirects you to the GEO record of the full experiment, consisting of microarrays, tiling arrays and a ChIP-Seq experiment. In the Experiment type section you can see that this GEO record indeed reports a mixture of expression analysis and ChIP-Seq experiments.
Scroll to the bottom of the page. You can see that the ChIP-Seq data have their own GEO ID: GSE41187.
Click the ChIP-Seq data ID: GSE41187. This brings us to the GEO record of the ChIP-Seq experiment.
In the GEO record, scroll down to the Samples section. To save time, we will focus in the training on a single sample: FNR IP ChIP-seq Anaerobic A.
Click the GEO ID GSM1010219 of the sample that we will use in the training. This brings us to the GEO record of the sample.
Scroll to the Relations section at the bottom of the GEO record of the sample. GEO only contains processed data, no raw data. The raw data is stored in the SRA database. In the Relations section you can find the SRA identifier of this data set. For the training we would like to have a fastq file containing the raw data.
Copy the SRA identifier.

Although direct access to the SRA database at the NCBI is possible, SRA does not store sequences in FASTQ format. So, in practice, it's simpler (and quicker!) to download datasets from the ENA database (European Nucleotide Archive) hosted by EBI (European Bioinformatics Institute) in the UK. ENA encompasses the data from SRA.

SRA identifiers are also recognized by ENA so we can download the file from EBI.

Go to the EBI website.

Download the data with SRA ID SRX189773:
Type the ID in the search box on the EBI home page.
Click the search icon.
This returns two results: a link to the record of the experiment and a link to the record of the run.
Click the first result (red).
The table at the bottom of the page contains a column called Fastq files (ftp).
Click the link in this column to download the data in fastq format.
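
If you prefer a script to a browser for the download itself, the link from the Fastq files (ftp) column can be fetched with Python's standard library. This is a minimal sketch with made-up helper names. It assumes you have copied the link from the ENA page yourself, and that a link given without a scheme should be read as an ftp address, as the column name suggests.

```python
import shutil
import urllib.request


def as_url(path, scheme="ftp"):
    """Prefix a bare 'host/dir/file' path with a scheme; leave full URLs alone."""
    if "://" in path:
        return path
    return f"{scheme}://{path}"


def download(url, destination, chunk_size=1024 * 1024):
    """Stream url to destination in chunks, so a big file never sits in memory."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return destination
```

Call download(as_url(link), "reads.fastq.gz") with the link you copied, where the output file name is your own choice. Unlike the browser, this sketch does not show progress or resume an interrupted transfer.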

For the training you do not have to download the data: it's already on the GenePattern server.

To download the replicate and the control data set, we should redo the same steps starting from the GEO web page of the ChIP-Seq experiment (click the sample ID of the FNR IP ChIP-seq Anaerobic B and the anaerobic INPUT DNA sample). The fastq file of the control sample is also available on the GenePattern server.

References:

[1] http://www.ncbi.nlm.nih.gov/sra

[2] http://www.ncbi.nlm.nih.gov/geo/

[3] http://www.ebi.ac.uk/ena/

[4] http://www.ebi.ac.uk/arrayexpress/

[5] https://insilicodb.org/

[6] http://www.ncbi.nlm.nih.gov/pubmed/23818864
