Exercises for the Sequence Assembly tutorial
In this tutorial, we will see how to assemble sequencing data generated by conventional "Sanger" sequencing techniques into a contig (= a set of overlapping sequence reads). We will also see how to find and inspect any conflicts that may exist between different reads. For Next-Generation Sequencing (NGS) data, we refer to the CLC Genomics Workbench.
We will use the trace data in the Sequencing data folder of the Example Data and do the following steps in the Workbench:
- Trim sequences
- Assemble using a reference sequence
- Find and edit conflicts
- Tabular view of an assembled contig (easy data overview)
Open the Sequencing reads folder in the Sequencing data folder.
Trim the sequences
The first thing to do is to trim the sequences (= cut off low quality parts of the sequence). Trimming serves a dual purpose: it takes care of parts of the reads with poor quality and it removes potential vector contamination. Trimmed data give better results in further analysis.
Trim the sequences |
---|
|
The parts annotated as trimmed will be ignored in the subsequent assembly. A natural question is: Why not simply delete the trimmed regions instead of annotating them? In some cases, these regions could potentially contain valuable information, and this information would be lost if the regions were deleted instead of annotated.
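The idea of annotating rather than deleting can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes each base has a Phred-style quality score, the function name `trim_region` and the `limit` parameter are hypothetical, and the running-sum approach shown here is a generic quality-trimming technique, not necessarily what the Workbench does internally.

```python
def trim_region(qualities, limit=0.05):
    """Return the half-open span (start, end) of bases worth keeping.

    Assumption: `qualities` are Phred-style scores, so the error
    probability of a base is 10 ** (-q / 10). Bases outside the returned
    span would be annotated as trimmed; nothing is deleted.
    """
    best_sum = 0.0
    best_span = (0, 0)
    running = 0.0
    start = 0
    for i, q in enumerate(qualities):
        p_error = 10 ** (-q / 10)
        # Good bases add to the running sum, poor bases subtract from it.
        running += limit - p_error
        if running <= 0:
            # The segment so far is not worth keeping; restart after it.
            running = 0.0
            start = i + 1
        elif running > best_sum:
            best_sum = running
            best_span = (start, i + 1)
    return best_span


# Hypothetical quality values: poor at both ends, good in the middle.
qualities = [5, 8, 30, 35, 40, 38, 32, 9, 6]
keep_start, keep_end = trim_region(qualities)
trimmed_annotations = [(0, keep_start), (keep_end, len(qualities))]
```

Because the function only returns coordinates, the low-quality ends stay in the read and can be brought back later by moving the annotation boundary.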
Assemble the sequencing data
The next step is to assemble the sequences. This is the technical term for aligning the sequences where they overlap and reversing the reverse reads to make a contiguous sequence (also called a contig). In this tutorial, we will assemble to a reference sequence, that is, a sequence that you know is similar to your sequencing data.
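To make the "reverse the reverse reads" step concrete, here is a minimal Python sketch. It is a toy under stated assumptions: the function names are hypothetical, the placement is an ungapped search for the offset with the most matching bases, and a real assembler (including the Workbench) uses a proper alignment that allows gaps.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def best_placement(read, reference):
    """Return (offset, matches) for the best ungapped placement of read."""
    best = (0, -1)
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        matches = sum(a == b for a, b in zip(read, window))
        if matches > best[1]:
            best = (offset, matches)
    return best


def orient_and_place(read, reference):
    """Try both orientations and keep the one that matches better."""
    forward = best_placement(read, reference)
    reverse = best_placement(reverse_complement(read), reference)
    if reverse[1] > forward[1]:
        return ("reverse", reverse[0], reverse[1])
    return ("forward", forward[0], forward[1])


# Hypothetical toy data, not taken from the example reads.
reference = "ATGGCGTACGTTAGCCTGAA"
placement = orient_and_place("TCAGGCTA", reference)
```

The read in the usage lines only matches the reference after it has been reverse-complemented, which is the situation a reverse sequencing read is in.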
Assemble the sequences to the ATP8a1 mRNA reference sequence |
---|
|
The result of the assembly is a contig: an alignment of the nine reads to the reference sequence.
Show an overview of the contig. |
---|
Click the Fit width button in the bottom toolbar to see an overview of the contig.
|
You can adjust the contig overview and how it is displayed (colors, coverage graph...) by changing parameters in the Alignment info in the Contig Settings side panel.
Compact the contig as much as you can. |
---|
If you want a more compact view of the contig, change Not compact to Packed in the Read layout section of the Side Panel. You can compact the view even further by removing the annotations in the Annotation layout section.
|
Resolving conflicts
Zoom to 100% on the residues at the start of the contig. |
---|
Click the Zoom to 100% button in the bottom toolbar to zoom in on the contig.
|
Set the compactness to Not compact in the Read layout settings in the Side Panel.
Find the first conflict in the contig. |
---|
Here, the first read has a T (marked in light-pink), whereas the second read and the reference have a gap. |
To determine which of the two reads you should trust, you should assess the quality of the reads at this position.
Check the quality of the reads at this position. |
---|
A quick look at the regularity of the peaks of read Rev2 compared to read Rev3 indicates that we should trust the Rev2 read. In addition, you can see that we are close to the end of Rev3, and the quality of the chromatogram traces is often low near the ends. |
Based on this, we decide not to trust Rev3.
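The same judgment can be approximated numerically when per-base quality scores are available. The sketch below assumes Phred-style scores exist for each read, which this tutorial does not state; the function names, the `flank` parameter, and the quality values are all hypothetical. It complements, and does not replace, looking at the trace peaks.

```python
def mean_quality(qualities, position, flank=5):
    """Average quality in a window around `position` (0-based, within the read)."""
    start = max(0, position - flank)
    window = qualities[start:position + flank + 1]
    return sum(window) / len(window)


def more_trustworthy(reads, flank=5):
    """Return the name of the read with the higher local quality.

    `reads` maps a read name to (qualities, position_in_read).
    """
    return max(reads, key=lambda name: mean_quality(*reads[name], flank))


# Hypothetical scores: Rev2 is uniformly good, Rev3 degrades toward its end.
reads = {
    "Rev2": ([40] * 20, 10),
    "Rev3": ([40, 38, 35, 30, 22, 15, 10, 8, 6, 5], 8),
}
trusted = more_trustworthy(reads)
```

A window is used instead of the single base at the conflict because quality near the end of a read tends to drop over a stretch of bases rather than at one position.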
Resolve the conflict. |
---|
This will resolve the conflict. |
Find the next conflict. |
---|
Click the Find Conflict button again to find the next conflict.
|
This conflict is the beginning of a stretch of gaps in the consensus sequence. This is because the reads have been trimmed at this position. However, if you look at the read at the bottom, Fwd2, you see that a lot of the peaks seem to be fine, so we could include this information in the contig. If you scroll a little to the right, you can see where the trimmed region begins.
Include the region of Fwd2 in the contig. |
---|
To include this region in the contig, move the vertical slider on the Trimmed region annotation at position 2073 to the left.
You will now see how the gaps in the consensus are replaced by the sequence information of Fwd2. |
Click the Find Conflict button again to find the next conflict.
Here, both reads differ from the reference sequence. We will now inspect the traces in more detail.
Zoom in to see the details of this conflict. |
---|
Zoom in on this conflict position by clicking Zoom to selection in the bottom toolbar.
This gives more space between the residues. To inspect the peaks even more closely, drag them up or down with the mouse. |
We have sequenced the coding part of the gene. Often you want to know what a variation like this would mean at the protein level.
Show the effect of the variation on the protein. |
---|
To do this, show the translation along the contig:
|
The variation is at the third base of a codon coding for threonine, which makes it a synonymous substitution. That is why the T is shown in orange. A non-synonymous substitution would be shown in red.
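The distinction between synonymous and non-synonymous substitutions can be checked with a small Python sketch. It assumes the standard genetic code; the function name `classify_substitution` and the example codons are hypothetical and are not read from the contig in this tutorial.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, read_codon):
    """Translate both codons and report whether the amino acid changes."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    read_aa = CODON_TABLE[read_codon.upper()]
    kind = "synonymous" if ref_aa == read_aa else "non-synonymous"
    return ref_aa, read_aa, kind


# All four ACN codons encode threonine (T), so a change at the third
# base is synonymous; a change at the second base is not.
third_base_change = classify_substitution("ACC", "ACT")
second_base_change = classify_substitution("ACC", "AAC")
```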
Getting a list of all variations
Browsing the conflicts by clicking the Find Conflict button is useful in many cases, but most people want to get an overview of all the conflicts in the entire contig.
Retrieve a table containing all conflicts. |
---|
Click Show Table on the bottom toolbar.
You can right-click the Notes field and select Edit conflict annotation to enter your own annotation. |
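Conceptually, such a table is a scan over the columns of the alignment. The following Python sketch assumes a simplified representation that is not the Workbench's data model: every read is a string of the same length as the aligned reference, `-` marks a gap, and a space marks positions the read does not cover. The function name `conflict_table` and the toy sequences are hypothetical, and a column is counted as a conflict here whenever a covering read differs from the reference or from another read.

```python
def conflict_table(reference, reads):
    """List the alignment columns where covering reads and reference disagree.

    `reads` maps a read name to its aligned sequence.
    Positions in the result are 1-based.
    """
    rows = []
    for index, ref_base in enumerate(reference):
        # Only reads that cover this column take part in the comparison.
        observed = {
            name: seq[index] for name, seq in reads.items() if seq[index] != " "
        }
        if observed and len(set(observed.values()) | {ref_base}) > 1:
            rows.append(
                {"position": index + 1, "reference": ref_base, "reads": observed}
            )
    return rows


# Toy alignment: one read has a T where the other read and the reference have a gap.
reference = "ACG-TAC"
reads = {"Rev2": "ACG-TA ", "Rev3": "ACGTTAC"}
for row in conflict_table(reference, reads):
    print(row["position"], row["reference"], row["reads"])
```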
Getting a list of all the edits you made
When you make a change, it will be recorded in the contig's history.
Open the History. |
---|
Click the History icon on the bottom toolbar.
|
Using the result for further analysis
Save the consensus sequence. |
---|
Right-click the name Consensus and select Open Sequence.
This opens the consensus sequence in a new window. Right-click the tab of this window and select Save As.
|
This will make it possible to use this sequence for further analyses in the CLC Main Workbench. All the conflict annotations are preserved, and in the sequence's history, you will find a reference to the original contig. As long as you also save the original contig, you will always be able to go back to it by choosing the reference contig in the consensus sequence's history.