Download read information and FASTQ data from the SRA

Download GSE37211 (SRP012167) read information and FASTQ data from the web

This document illustrates how information and read data can be fetched from the SRA (ENA) website using web-links and command-line calls.

Choosing a dataset for a hands-on session

Choosing data for a hands-on training session is constrained by the need to find a small quantity of data that can be analyzed in a short time by many users sharing the same internet connection. Despite these limitations, the data should remain interesting for an audience of biologists. The practical session should, like a real experiment, analyze several replicates and allow quality control steps that reveal key features found in real data and detect pitfalls.

A classical solution is for the trainer to start from a full-size dataset, perform the complete analysis, and create single-chromosome samples from each step of the analysis. This single-chromosome data can then be shared with the trainees; it presents most of the real features of the full data while requiring less time for re-analysis during the session.

Obtaining information and reads for the 'SRP012167' SRA dataset

The 'SRP012167' dataset was selected for this training because it is frequently used by [R-Bioconductor] packages dedicated to RNASeq and is available as a standalone R-data library ("parathyroidSE"). This makes it easier for participants to continue training on their own after this session.

Note that while we use [R-Bioconductor] several times in this session, this training is not restricted to [R]; it is more generally aimed at learning the most popular current command-line applications used in RNASeq workflows on ordinary Unix computers.

The publication associated with the 'SRP012167' project

Full details about the biology behind this dataset can be found by reading the original paper [1].

'SRP012167' fastq-links and metadata annotations at the EBI ENA

Before analyzing any NGS data, it is good practice to check how the data was generated and which platform was used. The SRA page dedicated to SRP012167[2] gives details about read structure (101 bp reads, PAIRED, forward|reverse, average insert distance = 333 - 101 - 101 = 131 bp) and other important NGS parameters, shown below for one selected sample.
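
The insert-distance arithmetic quoted above can be checked in the shell. This is a minimal sketch: the variable names are illustrative, and it assumes that 333 is the nominal fragment length reported for the library and that both mates are 101 bases long.

```shell
#!/bin/bash
# Values taken from the SRA record described above.
fragment_len=333   # nominal fragment length (assumed meaning of "333")
read_len=101       # length of each mate

# Insert distance = fragment length minus both mates.
inner_distance=$(( fragment_len - 2 * read_len ))
echo "insert distance: ${inner_distance} bp"
```

Several aligners ask for this value (or for the full fragment length) as a parameter, so it is worth noting down before mapping.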

[Figure: SRA information page for experiment SRX140503]

As in any high throughput data analysis, a metadata table is required to relate SRA sample IDs to names and corresponding annotations.

The ENA[3] offers a convenient search page for obtaining such information, which was used here to extract selected columns. You can get data annotation for any SRA/ENA project from http://www.ebi.ac.uk/ena/data/view/ as shown below.

A small [R] call was made in RStudio to download the relevant information about SRP012167 directly from the ENA page into a 'data.frame'. Only fields relevant to downloading the reads are extracted here, but other fields are available on the server for other purposes, including the sample description.

PID <- "SRP012167"
ena.url <- paste("http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=",
                 PID,
                 "&result=read_run",
                 "&fields=run_accession,library_name,",
                 "read_count,fastq_ftp,fastq_aspera,",
                 "fastq_galaxy,sra_ftp,sra_aspera,sra_galaxy",
                 "&download=text",
                 sep="")
metadata <- read.table(url(ena.url), header=TRUE, sep="\t")
 
# get first line and transpose
t(metadata[1,])

The first row reports the following information:

| run_accession | SRR479052                                                            |
|---------------|----------------------------------------------------------------------|
| library_name  | GSM913873:24h-Adenoma1                                               |
| read_count    | 12647534                                                             |
| fastq_ftp     | ftp.sra.ebi.ac.uk/vol1/fastq/SRR479/SRR479052/SRR479052_1.fastq.gz;  |
|               | ftp.sra.ebi.ac.uk/vol1/fastq/SRR479/SRR479052/SRR479052_2.fastq.gz   |
| fastq_aspera  | fasp.sra.ebi.ac.uk/vol1/fastq/SRR479/SRR479052/SRR479052_1.fastq.gz; |
|               | fasp.sra.ebi.ac.uk/vol1/fastq/SRR479/SRR479052/SRR479052_2.fastq.gz  |
| fastq_galaxy  | ftp.sra.ebi.ac.uk/vol1/fastq/SRR479/SRR479052/SRR479052_1.fastq.gz;  |
|               | ftp.sra.ebi.ac.uk/vol1/fastq/SRR479/SRR479052/SRR479052_2.fastq.gz   |
| sra_ftp       | ftp.sra.ebi.ac.uk/vol1/srr/SRR479/SRR479052                          |
| sra_aspera    | fasp.sra.ebi.ac.uk/vol1/srr/SRR479/SRR479052                         |
| sra_galaxy    | ftp.sra.ebi.ac.uk/vol1/srr/SRR479/SRR479052                          |
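
The same report can also be requested from the command line. The sketch below only assembles the URL, using the endpoint and field names from the R call above; `ena_filereport_url` is a hypothetical helper name, and the commented `curl` line assumes the endpoint is still served at that address.

```shell
#!/bin/bash
# Build an ENA file-report URL for a project accession.
# ena_filereport_url is a hypothetical helper, not part of any ENA tool.
ena_filereport_url() {
    # $1 = project accession, $2 = comma-separated list of fields
    printf '%s?accession=%s&result=read_run&fields=%s&download=text\n' \
        "http://www.ebi.ac.uk/ena/data/warehouse/filereport" "$1" "$2"
}

url=$(ena_filereport_url "SRP012167" "run_accession,library_name,fastq_ftp")
echo "$url"

# To fetch the table (needs network access; the endpoint is an assumption):
# curl -fsSL "$url" -o SRP012167_read_run.tsv
```

Keeping the URL construction in a function makes it easy to reuse the call for another project accession or another set of fields.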

A second piece of code was prepared to extract sample information and build a table relating samples to experimental conditions:

library("stringr")
 
PID <- "SRP012167"
ena.url <- paste("http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=",
                 PID,
                 "&result=read_run",
                 "&fields=run_accession,library_name",
                 sep="")
 
# assemble into a data.frame
metadata <- as.data.frame(read.table(url(ena.url), header=TRUE, sep="\t"))
 
# sample the first line
t(metadata[1,])
 
# add new columns
metadata$treatment <- str_extract(metadata$library_name, "\\b[a-zA-Z]+\\b")
metadata$treatment <- gsub("Control", "CON", metadata$treatment)
metadata$time <- str_extract(metadata$library_name, "\\b[0-9]{2}")
metadata$patient <- str_extract(metadata$library_name, "\\b[0-4]$")
 
# remove the original merged-column
metadata <- metadata[,-2]
colnames(metadata) <- c("sample", "treatment", "time", "patient")
 
# view table
metadata
      sample treatment time patient
1  SRR479052       CON   24       1
2  SRR479053       CON   48       1
3  SRR479054       DPN   24       1
4  SRR479055       DPN   48       1
5  SRR479056       OHT   24       1
6  SRR479057       OHT   48       1
7  SRR479058       CON   24       2
8  SRR479059       CON   48       2
9  SRR479060       DPN   24       2
10 SRR479061       DPN   24       2
11 SRR479062       DPN   48       2
12 SRR479063       OHT   24       2
13 SRR479064       OHT   24       2
14 SRR479065       OHT   48       2
15 SRR479066       CON   24       3
16 SRR479067       CON   48       3
17 SRR479068       DPN   24       3
18 SRR479069       DPN   48       3
19 SRR479070       OHT   24       3
20 SRR479071       OHT   48       3
21 SRR479072       CON   48       4
22 SRR479073       DPN   24       4
23 SRR479074       DPN   48       4
24 SRR479075       DPN   48       4
25 SRR479076       OHT   24       4
26 SRR479077       OHT   48       4
27 SRR479078       OHT   48       4
# save to file for reuse
write.table(metadata, file="GSE6943_metadata.txt", row.names=FALSE, sep="\t", quote=FALSE)
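
Once the table is saved, it can be inspected without R. The sketch below counts runs per treatment with awk; it assumes a tab-separated file with the header shown above, and uses a three-row excerpt written to a temporary file so that it runs on its own.

```shell
#!/bin/bash
# Write a small excerpt of the metadata table (rows copied from above).
tmpfile=$(mktemp)
printf 'sample\ttreatment\ttime\tpatient\n' >  "$tmpfile"
printf 'SRR479052\tCON\t24\t1\n'            >> "$tmpfile"
printf 'SRR479054\tDPN\t24\t1\n'            >> "$tmpfile"
printf 'SRR479058\tCON\t24\t2\n'            >> "$tmpfile"

# Skip the header, count rows per treatment (column 2), sort by name.
counts=$(awk -F'\t' 'NR > 1 { n[$2]++ } END { for (t in n) print t, n[t] }' "$tmpfile" | sort)
echo "$counts"

rm -f "$tmpfile"
```

Replacing "$tmpfile" with the name of the saved file runs the same count on the full table.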

As seen in the first report above, FTP, Aspera, and Galaxy links are returned; any of them can be used to download the reads or to import them into your favorite Galaxy server.
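
For paired-end runs the `fastq_ftp` field holds two paths separated by a semicolon. A minimal sketch of turning that field into one URL per line follows; the `ftp://` prefix is an assumption (the report lists the paths without a scheme), and the `wget` line is left commented because it needs network access.

```shell
#!/bin/bash
# fastq_ftp value copied from the first row of the report above.
fastq_ftp="ftp.sra.ebi.ac.uk/vol1/fastq/SRR479/SRR479052/SRR479052_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR479/SRR479052/SRR479052_2.fastq.gz"

# Split on ';' and prepend the scheme (assumed to be ftp://).
urls=$(printf '%s\n' "$fastq_ftp" | tr ';' '\n' | sed 's|^|ftp://|')
echo "$urls"

# Download each file in turn, resuming partial transfers:
# for u in $urls; do wget -c "$u"; done
```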

Download SRA-formatted data and convert it to FASTQ using the SRA Toolkit

The SRA data is not in the FASTQ format required for mapping but in an NCBI-specific format that is handier for file exchange. We therefore need to download the data and convert it. This can be done manually, or with a script that eases the process and runs in the background while you do other work at the bench, or overnight.

Typical ftp and aspera links for the first 'SRP012167' sample 'SRR479052' are:

# ftp link
http://www.ncbi.nlm.nih.gov/public/?/ftp/sra/sra-instant/reads/ByRun/sra/SRR/SRR479/SRR479052
 
# corresponding aspera link (from the European archive)
fasp.sra.ebi.ac.uk/vol1/srr/SRR479/SRR479052
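
Both locations follow the same layout: a directory named after the first six characters of the run accession, then the accession itself. A small sketch of deriving the Aspera path under that assumption is given here; `sra_aspera_path` is a hypothetical helper name, and the layout has only been checked against the example above.

```shell
#!/bin/bash
# Derive the EBI Aspera path for a run accession.
# Assumes the layout seen above: <first 6 characters>/<accession>.
sra_aspera_path() {
    prefix=$(printf '%.6s' "$1")
    printf 'fasp.sra.ebi.ac.uk/vol1/srr/%s/%s\n' "$prefix" "$1"
}

path=$(sra_aspera_path "SRR479052")
echo "$path"
```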

The easiest way to download SRA data is to proceed manually, file by file, from the browser. However, this can prove quite lengthy when you need 27 files, as we do now. An alternative is to create a batch download script that uses the list of ftp links or the similar list of aspera links.

Aspera is a proprietary accelerated file transfer protocol that includes error correction and is much faster than conventional FTP. Please refer to the Aspera Transfer Guide - SRA Handbook - NCBI Bookshelf [4].

We prepared a batch script 'get_reads_with_aspera.sh' to get all 27 SRA files and regenerate paired-read FASTQ files from each. The code is reproduced below and can be adapted for other datasets with a few changes.
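
The run accessions of this project are consecutive, from SRR479052 to SRR479078, so the list used by such a script can be generated rather than typed. A minimal sketch, assuming `seq` is available:

```shell
#!/bin/bash
# Build the space-separated list of run accessions SRR479052..SRR479078.
LIST=""
for n in $(seq 479052 479078); do
    LIST="${LIST}SRR${n} "
done
LIST=${LIST% }   # drop the trailing space

echo "$LIST"
echo "$(echo "$LIST" | wc -w) runs"
```

This only works when the accessions are contiguous; for other projects the run_accession column of the ENA report is the safer source.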

get_reads_with_aspera.sh script

#!/bin/bash
# get_reads_with_aspera.sh
 
# required: a functional aspera connect installation
 
## Aspera main parameters
# -Q adaptive flow control (needed for disk throttling!)
# -T disable encryption
# -k1 enable resume of failed transfers
# -l maximum bandwidth of request (try 200M and go up from there)
# -r recursive copy
# -i <private key file>
 
## set the following two variables according to your own Aspera location
# this should match sra-connect installed for a single user under unix
# use ${HOME} rather than ~, which is not expanded inside double quotes
exefile="${HOME}/.aspera/connect/bin/ascp"
sshcert="${HOME}/.aspera/connect/etc/asperaweb_id_dsa.openssh"
 
# create a LIST of files to download and proceed with aspera
LIST="SRR479052 SRR479053 SRR479054 SRR479055 SRR479056 \
SRR479057 SRR479058 SRR479059 SRR479060 SRR479061 \
SRR479062 SRR479063 SRR479064 SRR479065 SRR479066 \
SRR479067 SRR479068 SRR479069 SRR479070 SRR479071 \
SRR479072 SRR479073 SRR479074 SRR479075 SRR479076 \
SRR479077 SRR479078"
 
baseurl="anonftp@ftp-private.ncbi.nlm.nih.gov:"
 
# adapt the following line if you need other reads
# it should end with the first letters of your SRA files
uri="/sra/sra-instant/reads/ByRun/sra/SRR/SRR479"
 
# create container for data and move into it
mkdir -p SRP012167_fastq && cd SRP012167_fastq
 
# loop in the LIST and download one file at a time
for sra in $LIST; do
 
	## download SRA data
	cmd="${exefile} \
		-i ${sshcert} \
		-k1 -QTr -l10000m \
		${baseurl}${uri}/${sra} \
		sra_downloads"
 
	echo "# $cmd"
	eval $cmd
 
	# test if transfer succeeded or die
	RESULT=$?
 
	if [ $RESULT -eq 0 ]; then
		echo "# Aspera transfer succeeded for ${sra}"
	else
		echo "# Aspera transfer failed for ${sra}, aborting!"
		exit 1
	fi
 
	## convert SRA to fastq
	# archive data in gzip format to save space
	fastq-dump --split-3 --gzip sra_downloads/${sra}.sra
 
	# test if conversion succeeded or die
	RESULT=$?
	if [ $RESULT -eq 0 ]; then
		echo "## SRA to FASTQ conversion succeeded for ${sra}"
	else
		echo "## SRA to FASTQ conversion failed for ${sra}, aborting!"
		exit 1
	fi
 
done
 
# return where you came from
cd -

# any failed download or conversion has already aborted the script,
# so this only reports whether the final 'cd' worked
if [ $? -eq 0 ]; then
	echo "### all steps succeeded"
else
	echo "### something went wrong, please check!"
fi
  • This script relies on the presence of the Aspera program and the required certificate at the locations defined in the code. If your Aspera executable and certificate are located elsewhere, please edit these two lines.
  • The script takes one to a few hours to get the full data, depending on your internet connection and processing speed, and stores 47 GB of gzipped read data.
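
After a long unattended run it is worth confirming that the compressed files are intact before starting the analysis. The sketch below uses `gzip -t` for that; `check_gz` is a hypothetical helper, and the two files are small stand-ins created on the spot so that the example runs without the real data.

```shell
#!/bin/bash
# Report whether a gzip file passes the integrity test.
# check_gz is a hypothetical helper name.
check_gz() {
    if gzip -t "$1" 2>/dev/null; then
        echo "OK      $1"
    else
        echo "BROKEN  $1"
        return 1
    fi
}

# Stand-in files: one valid archive and one that is not gzip data.
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' | gzip > "$workdir/good_1.fastq.gz"
printf 'not gzip data' > "$workdir/bad_1.fastq.gz"

good_status=0
check_gz "$workdir/good_1.fastq.gz" || good_status=$?
bad_status=0
check_gz "$workdir/bad_1.fastq.gz" || bad_status=$?

rm -rf "$workdir"
```

On the real data the function would be called in a loop over the files in SRP012167_fastq, for example `for f in SRP012167_fastq/*.fastq.gz; do check_gz "$f"; done`.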

Conclusion

The data has been copied to the local computer and is now ready for QC and re-analysis, as detailed in the remainder of the hands-on tutorial.

Download exercise files

Download exercise files here.

Use the appropriate application to open the files in the NGSRNADE2015 archive.

References:
  1. Felix Haglund, Ran Ma, Mikael Huss, Luqman Sulaiman, Ming Lu, Inga-Lena Nilsson, Anders Höög, C Christofer Juhlin, Johan Hartman, Catharina Larsson
    Evidence of a functional estrogen receptor in parathyroid adenomas.
    J Clin Endocrinol Metab: 2012, 97(12);4631-9
    [PubMed:23024189]

  2. http://www.ncbi.nlm.nih.gov/sra?term=SRP012167
  3. http://www.ebi.ac.uk/ena/
  4. http://www.ncbi.nlm.nih.gov/books/NBK47527/
