Download read information and FASTQ data from the SRA
Download GSE37211 (SRP012167) read information and FASTQ data from the web
This document illustrates how information and read data can be fetched from the SRA (ENA) website using web-links and command-line calls.
Choosing a dataset for a hands-on session
Choosing data for a hands-on training session is constrained by the need to find a small quantity of data that can be analyzed in a short time by many users sharing the same internet connection. Despite these limitations, the data should still be interesting for an average audience of biologists. The practical session should, like a real experiment, analyze several replicates and allow quality control steps that reveal key features found in real data and expose common pitfalls.
A classical solution to this problem is for the trainer to start from a full-size dataset, perform the complete analysis, and create single-chromosome samples from each step of the analysis. These single-chromosome data can in turn be shared with the trainees; they present most of the real features of the full data while requiring less time for re-analysis during the session.
Obtaining information and reads for the 'SRP012167' SRA dataset
The 'SRP012167' dataset was selected for this training because it is frequently used by [R-Bioconductor] packages dedicated to RNASeq and is available as a standalone R-data library ("parathyroidSE"). This will make it easier for participants to pursue self-training after this session.
Note that while we use [R-Bioconductor] several times in this session, this training is not exclusive to the use of [R] and is more generally aimed at learning the most popular and current command-line applications used in RNASeq workflows on typical Unix computers.
The publication associated with the 'SRP012167' project
Full details about the biology behind this dataset can be found by reading the original paper [1].
'SRP012167' fastq-links and metadata annotations at the EBI ENA
Before analyzing any NGS data, it is good practice to check how the data was generated and which platform was used. The SRA page dedicated to SRP012167[2] provides detailed information on read structure (101 bp reads, PAIRED, forward|reverse, insert distance = 333-101-101 = 131 bp on average) and on other important NGS parameters of each sample.
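The insert-distance arithmetic quoted above can be reproduced in the shell. This is only a sketch: it assumes that 333 is the average fragment length reported on the SRA page and that both mates are 101 bp long. The function name `inner_distance` is introduced here for illustration.

```shell
#!/bin/bash
# Inner distance between mates = fragment length - 2 x read length.
# The values are those quoted in the text for SRP012167.
fragment_len=333   # assumed average fragment length in bp
read_len=101       # length of each mate in bp

inner_distance() {
    # $1 = fragment length, $2 = read length
    echo $(( $1 - 2 * $2 ))
}

inner_distance "$fragment_len" "$read_len"   # prints 131
```

Many paired-end mappers ask for this value (or for the full fragment length), so it is worth noting down before the mapping step.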
As in any high throughput data analysis, a metadata table is required to relate SRA sample IDs to names and corresponding annotations.
The ENA[3] offers a convenient search utility for obtaining such information, which was used here to extract selected columns. You can get data annotation for any SRA/ENA project from http://www.ebi.ac.uk/ena/data/view/ as shown below.
A short [R] call was made in RStudio to download the relevant information about SRP012167 directly from the ENA page into a 'data.frame'. Only the fields relevant to downloading the reads are extracted here; other fields, including the sample description, are available on the server for other purposes.
PID <- "SRP012167"
ena.url <- paste("http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=",
                 PID,
                 "&result=read_run",
                 "&fields=run_accession,library_name,",
                 "read_count,fastq_ftp,fastq_aspera,",
                 "fastq_galaxy,sra_ftp,sra_aspera,sra_galaxy,",
                 "&download=text",
                 sep="")
metadata <- read.table(url(ena.url), header=TRUE, sep="\t")

# get first line and transpose
t(metadata[1,])
The first row reports the following information:
|---------------|----------------------------------------------------------------------|
| library_name | GSM913873:24h-Adenoma1 |
| read_count | 12647534 |
| fastq_ftp | ftp.sra.ebi.ac.uk/vol1/fastq/SRR479/SRR479052/SRR479052_1.fastq.gz; |
| | ftp.sra.ebi.ac.uk/vol1/fastq/SRR479/SRR479052/SRR479052_2.fastq.gz |
| fastq_aspera | fasp.sra.ebi.ac.uk/vol1/fastq/SRR479/SRR479052/SRR479052_1.fastq.gz; |
| | fasp.sra.ebi.ac.uk/vol1/fastq/SRR479/SRR479052/SRR479052_2.fastq.gz |
| fastq_galaxy | ftp.sra.ebi.ac.uk/vol1/fastq/SRR479/SRR479052/SRR479052_1.fastq.gz; |
| | ftp.sra.ebi.ac.uk/vol1/fastq/SRR479/SRR479052/SRR479052_2.fastq.gz |
| sra_ftp | ftp.sra.ebi.ac.uk/vol1/srr/SRR479/SRR479052 |
| sra_aspera | fasp.sra.ebi.ac.uk/vol1/srr/SRR479/SRR479052 |
| sra_galaxy | ftp.sra.ebi.ac.uk/vol1/srr/SRR479/SRR479052 |
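The same report can also be requested from the command line. The sketch below rebuilds the URL used in the R call above and leaves the download itself to `curl`. The function names `ena_report_url` and `fetch_ena_report` and the output file name are hypothetical, and the URL layout is simply copied from the R code, so it is assumed to be still served by the ENA.

```shell
#!/bin/bash
# Build the ENA file-report URL for a project accession.
# $1 = project accession, $2 = comma-separated field list (optional)
ena_report_url() {
    pid="$1"
    fields="${2:-run_accession,library_name,read_count,fastq_ftp}"
    printf 'http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=%s&result=read_run&fields=%s&download=text\n' \
        "$pid" "$fields"
}

# Hypothetical helper: save the report as a tab-separated text file.
fetch_ena_report() {
    curl --fail --silent --show-error --location \
         --output "${1}_ena_report.txt" "$(ena_report_url "$1")"
}

# Print the URL; run 'fetch_ena_report SRP012167' to download (needs network).
ena_report_url "SRP012167"
```

Keeping the URL construction separate from the download makes it easy to check the request in a browser before scripting it.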
A second piece of code was prepared to extract sample information and build a table relating samples to experimental conditions:
library("stringr")

PID <- "SRP012167"
ena.url <- paste("http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=",
                 PID,
                 "&result=read_run",
                 "&fields=run_accession,library_name",
                 sep="")

# assemble into a data.frame
metadata <- as.data.frame(read.table(url(ena.url), header=TRUE, sep="\t"))

# sample the first line
t(metadata[1,])

# add new columns
metadata$treatment <- str_extract(metadata$library_name, "\\b[a-zA-Z]+\\b")
metadata$treatment <- gsub("Control", "CON", metadata$treatment)
metadata$time <- str_extract(metadata$library_name, "\\b[0-9]{2}")
metadata$patient <- str_extract(metadata$library_name, "\\b[0-4]$")

# remove the original merged-column
metadata <- metadata[,-2]
colnames(metadata) <- c("sample", "treatment", "time", "patient")

# view table
metadata
sample treatment time patient
1 SRR479052 CON 24 1
2 SRR479053 CON 48 1
3 SRR479054 DPN 24 1
4 SRR479055 DPN 48 1
5 SRR479056 OHT 24 1
6 SRR479057 OHT 48 1
7 SRR479058 CON 24 2
8 SRR479059 CON 48 2
9 SRR479060 DPN 24 2
10 SRR479061 DPN 24 2
11 SRR479062 DPN 48 2
12 SRR479063 OHT 24 2
13 SRR479064 OHT 24 2
14 SRR479065 OHT 48 2
15 SRR479066 CON 24 3
16 SRR479067 CON 48 3
17 SRR479068 DPN 24 3
18 SRR479069 DPN 48 3
19 SRR479070 OHT 24 3
20 SRR479071 OHT 48 3
21 SRR479072 CON 48 4
22 SRR479073 DPN 24 4
23 SRR479074 DPN 48 4
24 SRR479075 DPN 48 4
25 SRR479076 OHT 24 4
26 SRR479077 OHT 48 4
27 SRR479078 OHT 48 4
# save to file for reuse
write.table(metadata, file="GSE6943_metadata.txt", row.names=FALSE, sep="\t", quote=FALSE)
As seen in the first result above, FTP, Aspera, and Galaxy links are returned; any of them can be used to download the reads or to import them into your favorite Galaxy server.
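For paired-end runs, the 'fastq_ftp' value holds two links separated by a semicolon and without a protocol prefix. The following sketch splits such a value into one complete URL per line. The function name `split_fastq_links` is hypothetical, and adding the `ftp://` scheme is an assumption based on the host name.

```shell
#!/bin/bash
# Split a semicolon-separated 'fastq_ftp' value into one URL per line,
# trim stray whitespace, drop empty entries, and prepend the ftp:// scheme.
split_fastq_links() {
    echo "$1" | tr ';' '\n' \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' -e 's|^|ftp://|'
}

# Value reported for the first run (SRR479052)
links="ftp.sra.ebi.ac.uk/vol1/fastq/SRR479/SRR479052/SRR479052_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR479/SRR479052/SRR479052_2.fastq.gz"
split_fastq_links "$links"
```

Each resulting line can then be handed to a standard downloader such as `wget` or `curl`.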
Download SRA-formatted data and convert it to FASTQ using the SRA Toolkit
The SRA data is not in the FASTQ format required for mapping but in an NCBI-specific format that is more convenient for file exchange. We therefore need to download the data and convert it. This can be done manually or with a script that eases the process and runs in the background, or overnight, while you do other work at the bench.
Typical ftp and aspera links for the first 'SRP012167' sample 'SRR479052' are:
# ftp link
http://www.ncbi.nlm.nih.gov/public/?/ftp/sra/sra-instant/reads/ByRun/sra/SRR/SRR479/SRR479052
# corresponding aspera link (from the European archive)
fasp.sra.ebi.ac.uk/vol1/srr/SRR479/SRR479052
The easiest way to download SRA data is to proceed manually, file by file, from the browser. However, this can prove quite lengthy when you need 27 files, as we now do. An alternative is to create a batch download script that uses the list of ftp links or the equivalent list of aspera links.
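A list of links for such a batch download can be generated rather than typed. The sketch below prints one SRA-format link for each run from SRR479052 to SRR479078. It assumes that every run follows the same path layout as the 'sra_ftp' value reported for SRR479052, and the function names `list_runs` and `list_sra_links` are hypothetical.

```shell
#!/bin/bash
# Print the run accessions SRR479052 to SRR479078, one per line.
list_runs() {
    for i in $(seq 479052 479078); do
        echo "SRR${i}"
    done
}

# Print one SRA-format link per run, following the layout of the
# 'sra_ftp' value shown for SRR479052 (assumed to hold for every run).
list_sra_links() {
    for run in $(list_runs); do
        prefix=$(echo "$run" | cut -c1-6)   # first six characters, e.g. SRR479
        echo "ftp://ftp.sra.ebi.ac.uk/vol1/srr/${prefix}/${run}"
    done
}

list_sra_links
```

Redirecting the output to a file (for example `list_sra_links > sra_links.txt`) gives a list that `wget --continue --input-file=sra_links.txt` can process unattended.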
Aspera is a proprietary accelerated file transfer protocol that includes error correction and is much faster than conventional FTP. Please refer to the Aspera Transfer Guide in the SRA Handbook on the NCBI Bookshelf [4].
We prepared a batch script 'get_reads_with_aspera.sh' to get all 27 SRA files and regenerate paired FASTQ read files from each. The code is reproduced below and can be adapted for other datasets with few changes.
get_reads_with_aspera.sh script
#!/bin/bash
# get_reads_with_aspera.sh
# required: a functional aspera connect installation

## Aspera main parameters
# -Q (for adaptive flow control) - needed for disk throttling!
# -T to disable encryption
# -k1 enable resume of failed transfers
# -l (maximum bandwidth of request, try 200M and go up from there)
# -r recursive copy
# -i <private key file>

## set the following two variables according to your own Aspera location
# this should match sra-connect installed for a single user under unix
# (${HOME} is used rather than ~, which is not expanded inside quotes)
exefile="${HOME}/.aspera/connect/bin/ascp"
sshcert="${HOME}/.aspera/connect/etc/asperaweb_id_dsa.openssh"

# create a LIST of files to download and proceed with aspera
LIST="SRR479052 SRR479053 SRR479054 SRR479055 SRR479056 \
SRR479057 SRR479058 SRR479059 SRR479060 SRR479061 \
SRR479062 SRR479063 SRR479064 SRR479065 SRR479066 \
SRR479067 SRR479068 SRR479069 SRR479070 SRR479071 \
SRR479072 SRR479073 SRR479074 SRR479075 SRR479076 \
SRR479077 SRR479078"

baseurl="anonftp@ftp-private.ncbi.nlm.nih.gov:"

# adapt the following line if you need other reads
# it should end with the first letters of your SRA files
uri="/sra/sra-instant/reads/ByRun/sra/SRR/SRR479"

# create container for data and move into it
mkdir -p SRP012167_fastq && cd SRP012167_fastq

# loop in the LIST and download one file at a time
for sra in $LIST; do

    ## download SRA data
    cmd="${exefile} \
        -i ${sshcert} \
        -k1 -QTr -l10000m \
        ${baseurl}${uri}/${sra} \
        sra_downloads"
    echo "# $cmd"
    eval "$cmd"

    # test if transfer succeeded or die
    RESULT=$?
    if [ $RESULT -eq 0 ]; then
        echo "# Aspera transfer succeeded for ${sra}"
    else
        echo "# Aspera transfer failed for ${sra}, aborting!"
        exit 1
    fi

    ## convert SRA to fastq
    # archive data in gzip format to save space
    fastq-dump --split-3 --gzip "sra_downloads/${sra}.sra"

    # test if conversion succeeded or die
    RESULT=$?
    if [ $RESULT -eq 0 ]; then
        echo "## SRA to FASTQ conversion succeeded for ${sra}"
    else
        echo "## SRA to FASTQ conversion failed for ${sra}, aborting!"
        exit 1
    fi
done

# return where you came from
cd -

# test for happy ending
if [ $? -eq 0 ]; then
    echo "### all steps succeeded"
else
    echo "### something went wrong, please check!"
fi
- This script relies on the Aspera program and the required certificate being present at the locations defined in the code. If your Aspera executable and certificate are located elsewhere, please edit the 'exefile' and 'sshcert' lines.
- The script takes one to a few hours to fetch the full data, depending on your internet connection and processing speed, and stores 47GB of gzipped read data.
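After such a long unattended run, it is worth confirming that every run produced both mate files. The following sketch reports any missing or empty file. It assumes the `<run>_1.fastq.gz` and `<run>_2.fastq.gz` names that `fastq-dump --split-3 --gzip` writes for paired reads, and the function name `check_fastq_pairs` is hypothetical.

```shell
#!/bin/bash
# Report runs whose paired FASTQ files are missing or empty.
# $1 = directory holding the FASTQ files; remaining arguments = run accessions.
# Returns 0 when every file is present and non-empty, 1 otherwise.
check_fastq_pairs() {
    dir="$1"
    shift
    status=0
    for run in "$@"; do
        for mate in 1 2; do
            if [ ! -s "${dir}/${run}_${mate}.fastq.gz" ]; then
                echo "missing: ${run}_${mate}.fastq.gz"
                status=1
            fi
        done
    done
    return "$status"
}

# Example call, run only if the download folder exists
if [ -d SRP012167_fastq ]; then
    check_fastq_pairs SRP012167_fastq SRR479052 SRR479053
fi
```

The check only tests for presence and size; running `gzip -t` on each file would additionally detect truncated archives.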
Conclusion
The data has been copied to the local computer and is now ready for QC and re-analysis as detailed in the remainder of the hands-on tutorial.
download exercise files
Download exercise files here.
References:
- [1] Felix Haglund, Ran Ma, Mikael Huss, Luqman Sulaiman, Ming Lu, Inga-Lena Nilsson, Anders Höög, C Christofer Juhlin, Johan Hartman, Catharina Larsson. Evidence of a functional estrogen receptor in parathyroid adenomas. J Clin Endocrinol Metab: 2012, 97(12);4631-9 [PubMed:23024189]
- [2] http://www.ncbi.nlm.nih.gov/sra?term=SRP012167
- [3] http://www.ebi.ac.uk/ena/
- [4] http://www.ncbi.nlm.nih.gov/books/NBK47527/