Exercises for the Sequence Assembly tutorial

From BITS wiki

In this tutorial, we will see how to assemble sequencing data generated by conventional "Sanger" sequencing techniques into a contig (= a set of overlapping sequence reads). We will also see how to find and inspect any conflicts that may exist between different reads. For Next-Generation Sequencing (NGS) data, we refer to the CLC Genomics Workbench.

We will use the trace data in the Sequencing data folder of the Example Data and do the following steps in the Workbench:

  • Trim sequences
  • Assemble using a reference sequence
  • Find and edit conflicts
  • Tabular view of an assembled contig (easy data overview)

Open the Sequencing reads folder in the Sequencing data folder.

Trim the sequences

The first thing to do is to trim the sequences (= cut off low-quality parts of the sequence). Trimming serves a dual purpose: it takes care of parts of the reads with poor quality, and it removes potential vector contamination. Trimmed data give better results in further analysis.

The parts annotated as trimmed will be ignored in the subsequent assembly. A natural question is: Why not simply delete the trimmed regions instead of annotating them? In some cases, these regions could potentially contain valuable information, and this information would be lost if the regions were deleted instead of annotated.
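The idea of annotating rather than deleting can be sketched in code. The example below is a minimal illustration in Python, not the Workbench's actual trimming algorithm: it assumes one Phred quality score per base, an example error-probability limit of 0.05, and hypothetical function names. It finds the best-scoring stretch with a running sum and reports the regions outside it as trim annotations, leaving the sequence itself intact.

```python
def keep_region(phred_scores, limit=0.05):
    """Return (start, end), the half-open index range of the stretch to keep.

    Each base contributes (limit - error probability) to a running sum;
    the stretch with the highest sum is kept. Returns (0, 0) if no base
    is good enough. The limit of 0.05 is an assumed example value.
    """
    running = 0.0
    best = 0.0
    candidate_start = 0
    start, end = 0, 0
    for i, q in enumerate(phred_scores):
        p_error = 10 ** (-q / 10)      # Phred score -> error probability
        running += limit - p_error
        if running <= 0:               # stretch so far is not worth keeping
            running = 0.0
            candidate_start = i + 1
        elif running > best:           # best stretch seen so far ends here
            best = running
            start, end = candidate_start, i + 1
    return start, end


def trim_annotations(sequence, phred_scores, limit=0.05):
    """Return the (start, end) regions to annotate as trimmed.

    The sequence is not modified, so the trimmed bases stay available.
    """
    start, end = keep_region(phred_scores, limit)
    regions = [(0, start), (end, len(sequence))]
    return [(s, e) for s, e in regions if e > s]
```

For a made-up read such as `trim_annotations("NNACGTNN", [5, 5, 30, 30, 30, 30, 5, 5])`, the two low-quality bases at each end are annotated as trimmed, while the full eight-base sequence is preserved.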

Assemble the sequencing data

The next step is to assemble the sequences. This is the technical term for aligning the sequences where they overlap and reversing the reverse reads to make a contiguous sequence (also called a contig). In this tutorial, we will assemble to a reference sequence, that is, a sequence that you know is similar to your sequencing data.
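To make "reversing the reverse reads" concrete, here is a small Python sketch. It is a toy illustration under strong assumptions, not how the Workbench assembles: reads are placed without gaps, a read must fit entirely inside the reference, and all function names are hypothetical. Each read is tried in both orientations, and the orientation with fewer mismatches wins.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def best_ungapped_offset(reference, read):
    """Return (offset, mismatches) for the best ungapped placement of read."""
    if len(read) > len(reference):
        raise ValueError("read is longer than the reference")
    best = None
    for offset in range(len(reference) - len(read) + 1):
        # zip stops at the end of the read, so only the window is compared
        mismatches = sum(1 for a, b in zip(reference[offset:], read) if a != b)
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best


def place_read(reference, read):
    """Return (orientation, offset, mismatches) for the better orientation."""
    fwd = best_ungapped_offset(reference, read)
    rev = best_ungapped_offset(reference, reverse_complement(read))
    if rev[1] < fwd[1]:
        return ("reverse", rev[0], rev[1])
    return ("forward", fwd[0], fwd[1])
```

With the made-up reference `"AAACGTTTGGGCCC"`, the read `"CGTTTG"` is placed in the forward orientation at offset 3, while `"CAAACG"` (its reverse complement) is placed at the same offset after being reversed.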

Assembly should generate the following result:


Assemble8.png

The result of the assembly is a contig: an alignment of the nine reads to the reference sequence.

You can adjust the contig overview and how it is displayed (colors, coverage graph...) by changing parameters in the Alignment info in the Contig Settings side panel.

Resolving conflicts

Set the compactness to Not compact in the Read layout settings in the Side Panel.

To determine which of the two reads you should trust, you should assess the quality of the reads at this position.

Based on this, we decide not to trust Rev3.
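In the Workbench this judgment is made by looking at the traces. As a rough numeric analogue, the Python sketch below picks the base call with the highest Phred quality at a conflicting position. The read names and quality values in the usage note are invented for illustration, and the function name is hypothetical.

```python
def most_trusted_call(calls):
    """Pick the most trustworthy base call at one alignment position.

    calls is a list of (read_name, base, phred_quality) tuples.
    Returns the tuple with the highest quality, or None if there are no
    calls or if the top quality is shared by reads calling different bases.
    """
    if not calls:
        return None
    top_quality = max(quality for _, _, quality in calls)
    top_calls = [call for call in calls if call[2] == top_quality]
    if len({base for _, base, _ in top_calls}) > 1:
        return None  # equally good reads disagree: inspect the traces by hand
    return top_calls[0]
```

For example, `most_trusted_call([("read_a", "A", 12), ("read_b", "G", 45)])` returns the call from `read_b`, since a quality of 45 corresponds to a far lower error probability than a quality of 12.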

This conflict is the beginning of a stretch of gaps in the consensus sequence. This is because the reads have been trimmed at this position. However, if you look at the read at the bottom, Fwd2, you see that a lot of the peaks seem to be fine, so we could include this information in the contig. If you scroll a little to the right, you can see where the trimmed region begins.

Click the Find Conflict button again to find the next conflict.


Assemble15.png

Here both reads differ from the reference sequence. We now inspect the traces in more detail.

We have sequenced the coding part of the gene. Often you want to know what a variation like this would mean on the protein level.

The variation is at the third base of a codon coding for threonine, which makes it a synonymous substitution: the amino acid does not change. That is why the T is orange. A non-synonymous substitution would be shown in red.
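The distinction between synonymous and non-synonymous substitutions can be checked by translating both codons. The Python sketch below builds the standard genetic code and compares the two amino acids. The codons used in the usage note (ACC, ACT, GCC) are assumed examples, since the actual codon is only visible in the Workbench, and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def classify_substitution(ref_codon, read_codon):
    """Return (kind, reference amino acid, read amino acid).

    kind is "synonymous" if both codons encode the same amino acid,
    otherwise "non-synonymous". Amino acids use one-letter codes.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    read_aa = CODON_TABLE[read_codon.upper()]
    kind = "synonymous" if ref_aa == read_aa else "non-synonymous"
    return kind, ref_aa, read_aa
```

All four ACN codons encode threonine (T), so `classify_substitution("ACC", "ACT")` reports a synonymous change, whereas `classify_substitution("ACC", "GCC")` changes threonine to alanine and is non-synonymous.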

Getting a list of all variations

Browsing the conflicts by clicking the Find Conflict button is useful in many cases, but you will often want an overview of all the conflicts in the entire contig.
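What such an overview contains can be illustrated with a short Python sketch. It assumes a simplified, hypothetical data model that is not the Workbench's own: the reference is a string, and each read is a string of the same length that has already been aligned, with a space wherever the read does not cover the position. Every column in which the reads and the reference do not all agree becomes one row of the table.

```python
def list_conflicts(reference, aligned_reads):
    """Return one row per column where reference and reads do not all agree.

    aligned_reads maps read name -> aligned sequence (space = not covered).
    Each row is (1-based position, reference base, {read name: base}).
    """
    rows = []
    for index, ref_base in enumerate(reference):
        calls = {
            name: seq[index]
            for name, seq in aligned_reads.items()
            if index < len(seq) and seq[index] != " "
        }
        observed = set(calls.values()) | {ref_base}
        if len(observed) > 1:
            rows.append((index + 1, ref_base, calls))
    return rows
```

For the made-up input `list_conflicts("ACGTAC", {"r1": "ACGTAC", "r2": " CTTA "})`, the only row is position 3, where the reference and `r1` have G but `r2` has T.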

Getting a list of all the edits you made

When you make a change, it will be recorded in the contig's history.

Using the result for further analysis

This will make it possible to use this sequence for further analyses in the CLC Main Workbench. All the conflict annotations are preserved, and in the sequence's history, you will find a reference to the original contig. As long as you also save the original contig, you will always be able to go back to it by choosing the reference contig in the consensus sequence's history.