Variation data



Sequence variation information

*Ensembl

*SNPs in human F9

We return to the human F9 (Coagulation factor IX precursor) gene in Ensembl, since Ensembl contains a vast amount of sequence variation information. First go to the Gene page.
Click on Sequence in the left menu.


Ensembl36c.png

You now see the genomic sequence of F9 on the gene page. Exons are highlighted in red.
Click Configure this page in the left menu to open the configure page window:

  • For Show variations select Yes and show links
  • For Number of base pairs per row select 90 bps
  • For Line numbering select Relative to this sequence


Ensembl37b.png

Close the window to allow the changes to take effect.
Now all SNPs and small variations are highlighted on the sequence, following a colour code that represents the consequences of the variations and is explained at the top. At the right you see links that you can click to find more info on each specific variation. The IDs that start with "rs" are from dbSNP (check the overview of all variation databases that are used in Ensembl's Variation pages).

Click the rs number of the variant on position 722.


Ensembl50.png

This takes you to the corresponding Variation page.

More detailed information is available when you click one of the icons. For instance click the Population genetics icon.

For this variant, there's no phenotypic info available but for some of the variations, their phenotypic implications are known and documented in Ensembl.

*Somatic mutations in human F9

Go back to the Gene page of F9 and click the COSMIC ID (COSM385264) of the variant on position 635. COSMIC is a database of somatic mutations found in human cancers (check the overview of all variant databases used by Ensembl).

*All diseases F9 is linked to

Go back to the Gene page of F9.

*Overview of all variations in F9

*Large variations in F9

All information of dbVar, NCBI's database of large(r) variations, is included in Ensembl.

Large variations can be viewed by clicking Structural variants in the left menu. This opens a graph and a table of known structural variations in the F9 gene sequence. Remember that CNV stands for copy number variants.

*Variations in a transcript

Go to the transcript page of F9-001, its longest transcript.
Here, too, you can view the variations in the sequence.

  • Click Exons under the Sequence menu item in the left menu to show the sequences of the exons. Remember that UTRs are displayed in purple, exons in black, introns in blue and upstream and downstream sequences in green.
  • Use Configure this page to show variations as we did on the gene page (it might be a good idea to select Show full intronic sequences if you want to view variations in introns).
  • To export the view with the variations highlighted click Download Sequence and select RTF format.

If you only want to view variations in the CDS, click cDNA under the Sequence menu item in the left menu. Variations are displayed by default.

Again there are also other ways to view variation info.

On the Locations page you can visualize variations by adding Variation tracks.

*Variant effect predictor

Exercise developed by EBI
Resequencing the genomic region of the human CFTR (cystic fibrosis transmembrane conductance regulator gene) has revealed the following variants (alleles defined in the forward strand):

  • G/A at 7:117,171,039
  • T/C at 7:117,171,092
  • T/C at 7:117,171,122

The Variant Effect Predictor tool allows you to predict the functional consequences of these variants.
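Before the variants can be pasted into the tool, they have to be written in an input format it accepts. The following is a minimal sketch, assuming the whitespace-separated Ensembl default format (chromosome, start, end, alleles, strand) and assuming that the first allele of each pair listed above is the reference allele. The function name `to_vep_line` is purely illustrative and not part of any Ensembl library.

```python
def to_vep_line(alleles: str, location: str, strand: str = "+") -> str:
    """Turn e.g. ("G/A", "7:117,171,039") into one line of VEP input.

    Assumption: Ensembl default format "chrom start end ref/alt strand".
    """
    chrom, pos_text = location.split(":")
    pos = int(pos_text.replace(",", ""))  # drop the thousands separators
    # For a single-base substitution, start and end are the same position.
    return f"{chrom} {pos} {pos} {alleles} {strand}"


# The three CFTR variants from the exercise (alleles on the forward strand).
variants = [
    ("G/A", "7:117,171,039"),
    ("T/C", "7:117,171,092"),
    ("T/C", "7:117,171,122"),
]

vep_lines = [to_vep_line(alleles, location) for alleles, location in variants]
for line in vep_lines:
    print(line)
```

The printed lines can be copied into the tool's input box; check the tool's documentation for the formats it currently accepts.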


*NCBI's variation resources

*Variant reporter

We sequenced blood samples from various patients with a disease and their family members. After analyzing the reads we obtained a list of sequence variants that are specific for the patients. We want to check if the variants that we find are novel.

To see if variants are already known you need to use NCBI's variant reporter tool. We are going to do the analysis on three variants but you can submit as many variants as you want.

The variants have to be submitted in a supported format; check the help files to see which formats are accepted. We will use the VCF format since this is typically generated by NGS workflows. It is a tab-delimited text file; check the VCF specification for the details of this format.
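To give an idea of what a VCF file looks like, here is a minimal sketch that reads the eight fixed columns of each data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The record in `example` is made up for illustration and is not taken from the exercise file; the function names are hypothetical as well.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line: str) -> dict:
    """Split one tab-delimited VCF data line into its fixed columns."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])        # 1-based position
    record["ALT"] = record["ALT"].split(",")  # several ALT alleles are comma-separated
    return record


def read_vcf(lines):
    """Yield parsed records, skipping header lines (they start with '#')."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)


# Made-up example content, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tA\tG,T\t.\t.\t.",
]

records = list(read_vcf(example))
print(records[0])
```

For real analyses a dedicated VCF library is a safer choice, since this sketch ignores the genotype columns and does not interpret the INFO field.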

Download the variants file in VCF format.

Feel free to check out the dbSNP records of these SNPs.

Genomic Variation Server

GVS provides easy access to variation data from dbSNP, HapMap and other resources. You can define a location or region of the human genome to search in, and GVS will give you access to all available variation info in that region. As an example, we'll take a look at the variation in BRCA2.

Go to the GVS website.

This returns an overview of all data sets containing data on variants in BRCA2.

Now you see an overview of the SNPs in the BRCA2 gene that occur in more than 48% of the Japanese population. As you can see, these are all intron variants and one synonymous substitution (different codon encoding same amino acid). It is expected that variants that occur at such high frequencies have no impact on the BRCA2 protein sequence.


GVS3.png

For each pair of SNPs, the r2 score (ranging from 0 to 1) measures how strongly the two SNPs are correlated, i.e. how consistently their alleles occur together in the same person. The idea behind this is that if two individuals share a variant, we expect them to share not just that variant but also the surrounding chromosomal region.
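As an illustration of how such a score behaves, the sketch below computes the textbook r2 from phased haplotypes: D = p(AB) - p(A)p(B), and r2 = D² / (p(A)(1 - p(A)) p(B)(1 - p(B))). This is an assumption about the definition; GVS works from genotype data and may estimate the value differently. The haplotypes are made up.

```python
def r_squared(haplotypes, allele_a, allele_b):
    """r2 between two SNPs from a list of (allele at SNP 1, allele at SNP 2) haplotypes."""
    n = len(haplotypes)
    p_a = sum(h[0] == allele_a for h in haplotypes) / n
    p_b = sum(h[1] == allele_b for h in haplotypes) / n
    p_ab = sum(h == (allele_a, allele_b) for h in haplotypes) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r2 is undefined when one of the SNPs is not polymorphic")
    return d * d / denom


# Made-up haplotypes: the alleles always travel together, so r2 is 1.
linked = [("A", "G"), ("A", "G"), ("C", "T"), ("C", "T")]
# Made-up haplotypes: every combination is equally frequent, so r2 is 0.
unlinked = [("A", "G"), ("A", "T"), ("C", "G"), ("C", "T")]

print(r_squared(linked, "A", "G"))
print(r_squared(unlinked, "A", "G"))
```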

This groups SNPs into bins based on r2 values. This is useful for the development of a minimal set of SNPs for genotyping similar populations (by selecting one SNP from each bin). The Tag SNPs are those for which the pairwise r2 values exceed the r2 Threshold. The Other SNPs are those for which the pairwise r2 values are less than the threshold. It is better to choose a SNP from the Tag SNPs to represent the bin.
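The binning idea can be sketched with a simple greedy procedure: repeatedly pick the SNP that is linked (r2 at or above the threshold) to the most remaining SNPs, and put it in a bin together with those SNPs. This is only an illustration of the principle and not necessarily the algorithm GVS uses. The SNP names, r2 values and the threshold of 0.8 are all made up, and the sketch reports a single tag per bin.

```python
def greedy_bins(snps, r2, threshold=0.8):
    """Group SNPs into bins; `r2` maps frozenset({snp_a, snp_b}) to their r2 value."""
    def linked(a, b):
        return r2.get(frozenset((a, b)), 0.0) >= threshold

    remaining = list(snps)
    bins = []
    while remaining:
        # The tag is the SNP linked to the largest number of remaining SNPs.
        tag = max(
            remaining,
            key=lambda s: sum(linked(s, other) for other in remaining if other != s),
        )
        members = [s for s in remaining if s == tag or linked(tag, s)]
        bins.append({"tag": tag, "members": members})
        remaining = [s for s in remaining if s not in members]
    return bins


# Made-up pairwise r2 values; pairs that are not listed count as 0.
pairwise_r2 = {
    frozenset(("snp1", "snp2")): 0.95,
    frozenset(("snp1", "snp3")): 0.90,
    frozenset(("snp2", "snp3")): 0.50,
}

bins = greedy_bins(["snp1", "snp2", "snp3", "snp4"], pairwise_r2)
for b in bins:
    print(b["tag"], b["members"])
```

Genotyping one SNP per bin (here snp1 and snp4) would then capture most of the variation at the other SNPs in the same bin.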


Database of genomic variants

Go to the query tool of the Database of Genomic Variants (DGV). The database is a set of inter-related tables containing all the data from the studies included in DGV. You can search and filter the data in different ways, e.g.

  • data that come from a particular study
  • variants of a certain type e.g. copy number variations
  • sample size e.g. variants coming from large population studies
  • ...

You can set multiple filters at the same time.
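Combining filters simply means that a variant is kept only if it passes every filter. The sketch below shows this logic on a few made-up records. The field names (`study`, `variant_type`, `chrom`, `samples`) are hypothetical and do not reflect the actual DGV table layout.

```python
# Made-up records, for illustration only.
records = [
    {"study": "StudyA", "variant_type": "CNV", "chrom": "Y", "samples": 2500},
    {"study": "StudyB", "variant_type": "inversion", "chrom": "Y", "samples": 40},
    {"study": "StudyA", "variant_type": "CNV", "chrom": "7", "samples": 2500},
]


def apply_filters(rows, **criteria):
    """Keep the rows for which every criterion holds.

    A criterion is either a value to match exactly or a function that
    returns True for acceptable values.
    """
    def matches(row):
        for field, wanted in criteria.items():
            value = row[field]
            ok = wanted(value) if callable(wanted) else value == wanted
            if not ok:
                return False
        return True

    return [row for row in rows if matches(row)]


# CNVs on chromosome Y from studies with at least 1000 samples.
hits = apply_filters(
    records,
    chrom="Y",
    variant_type="CNV",
    samples=lambda n: n >= 1000,
)
print(hits)
```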

This returns over 6000 variants on chromosome Y (found by mapping to assembly version hg19) coming from different studies. As you can see at the top right of the list, you can save the output in various formats.


OMIM

Exercise obtained from OpenHelix.

Exercise 1: the human RANKL gene

I would like to know if there are any phenotypes associated with the human RANKL gene, and whether this association is due to variation in the gene. Do a basic OMIM search for the gene RANKL, also known as TNFSF11 (see slides).

Exercise 2: myopathy

Imagine that you have a patient displaying evidence of myopathy that has been linked to the chromosomal location 2p13. Conduct an OMIM Gene Map search for the area (see slides). On the search results page assess phenotypes in the region that may be the cause.

To learn more about the Dysferlin gene, under the Gene/Locus MIM number column, click the 606768 link to open the gene report.

Exercise 3: rheumatoid arthritis

Analyzing NGS data for variant analysis

Go to our NGS wiki page for the introductory tutorial on NGS data analysis (checking/improving the quality of your data, mapping the reads, obtaining the tools/data) and the tutorial on variant analysis.