Exercises for the Sequence Assembly tutorial

In this tutorial, we will see how to assemble sequencing data generated by conventional "Sanger" sequencing techniques into a contig (= a set of overlapping sequence reads). We will also see how to find and inspect any conflicts that may exist between different reads. For Next-Generation Sequencing (NGS) data, we refer to the CLC Genomics Workbench.

We will use the trace data in the Sequencing data folder of the Example Data and do the following steps in the Workbench:

Trim sequences
Assemble using a reference sequence
Find and edit conflicts
Tabular view of an assembled contig (easy data overview)

Open the Sequencing reads folder in the Sequencing data folder.

Trim the sequences

The first thing to do is to trim the sequences (= cut off low quality parts of the sequence). Trimming serves a dual purpose: it takes care of parts of the reads with poor quality and it removes potential vector contamination. Trimmed data give better results in further analysis.

Trim the sequences
Open the Toolbox in the top menu and select Sequencing Data Analysis, then Trim Sequences.
Select the 9 trace files and click Next. In the next dialog, you can specify how the trimming should be performed.
If you have previously trimmed the sequences, you can check Ignore existing trim information to remove the existing trimming annotation.
For this data, we want very stringent trimming, so set the limit of the trim quality score to 0.02.
Check Trim ambiguous nucleotides to trim the sequence ends based on the presence of N’s. The algorithm takes as input the maximal number of N’s allowed in the sequence after trimming.
There is no vector contamination in these data, so we only trim for poor quality. If there were vector contamination, you would check Trim contamination from vectors in UniVec database. When this option is checked, the program matches the sequence reads against all vectors in the UniVec database and removes sequence ends with significant matches. The UniVec database is included when you install the Workbench. A list of all the vectors in the UniVec database can be found at http://www.ncbi.nlm.nih.gov/VecScreen/replist.html . If your vector is not in the list, you can check Trim contamination from saved sequences, which lets you select your own vector sequences.
Click Next and choose to Save the results.
When the trimming is performed, the trimmed parts of the sequences are annotated as trimmed rather than removed, so you do not lose any data.

The parts annotated as trimmed will be ignored in the subsequent assembly. A natural question is: Why not simply delete the trimmed regions instead of annotating them? In some cases, these regions could potentially contain valuable information, and this information would be lost if the regions were deleted instead of annotated.
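To give a feel for how a quality limit such as 0.02 can act on a read, here is a minimal sketch of a running-sum trim (in the style of the modified Mott algorithm). This is an assumption about the general technique, not a description of what the Workbench actually does internally. The function name `trim_region` and the Phred scores in the usage example are hypothetical. The function returns the interval to keep instead of cutting the read, which mirrors the idea of annotating trimmed regions rather than deleting them.

```python
def trim_region(phred_scores, limit=0.02):
    """Return (start, end) of the region to keep, as a half-open interval.

    Sketch of a running-sum trim; assumed for illustration only,
    not the Workbench's confirmed algorithm.
    """
    running = 0.0      # running sum of (limit - error probability)
    best = 0.0         # highest running sum seen so far
    best_start = 0     # start of the region that reached the best sum
    best_end = 0       # end (exclusive) of that region
    start = 0          # position just after the most recent reset to zero
    for i, q in enumerate(phred_scores):
        p_error = 10 ** (-q / 10)    # Phred score -> error probability
        running += limit - p_error
        if running <= 0:
            # Quality too poor: reset and start a new candidate region
            running = 0.0
            start = i + 1
        elif running > best:
            best = running
            best_start = start
            best_end = i + 1
    return best_start, best_end


# Hypothetical read: poor quality at both ends, good quality in the middle
scores = [5, 5, 30, 30, 30, 30, 5, 5]
keep_start, keep_end = trim_region(scores)
# The trimmed parts are recorded as annotations; the read itself is untouched
trimmed_annotations = [(0, keep_start), (keep_end, len(scores))]
```

A lower limit makes the trim more stringent, because each base then needs a smaller error probability to contribute positively to the running sum.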

Assemble the sequencing data

The next step is to assemble the sequences. This is the technical term for aligning the sequences where they overlap and reversing the reverse reads to make a contiguous sequence (also called a contig). In this tutorial, we will assemble to a reference sequence, i.e. a sequence that you know is similar to your sequencing data.
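Reversing a reverse read means taking its reverse complement, so that it reads in the same direction as the forward reads. A minimal sketch, with a hypothetical helper name `reverse_complement` and a made-up sequence:

```python
# Translation table mapping each nucleotide to its complement (N stays N)
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# A reverse read is flipped before it is aligned with the forward reads
forward_oriented = reverse_complement("AACGN")
```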

Assemble the sequences to the ATP8a1 mRNA reference sequence
Open the Toolbox in the top menu and select Sequencing Data Analysis, then Assemble Sequences to Reference.
Select the 9 trace files and click Next.
Select the reference sequence: click the Browse and select element button, select ATP8a1 mRNA (reference) from the Sequencing data folder and click OK.
You can leave the other options in the Set reference parameters window at their defaults. If Include reference sequence(s) in contig(s) is checked, the contig will show the reference sequence at the top and the reads aligned below. This option is useful when comparing sequence reads to a closely related reference sequence, e.g. for SNP characterization.
Click Next.
The minimum aligned length is the minimum number of nucleotides in a read that must be successfully aligned to the contig. If a read does not meet this criterion, it is excluded from the assembly.
If there is a conflict, i.e. a position where the reads disagree about the nucleotide (A, C, T or G), you can specify how the contig should reflect it:
Vote (A, C, G, T): the conflict is resolved by counting instances of each nucleotide and taking the majority. In case of a tie, A, C, G and T are given priority in this order.
Unknown nucleotide (N): the contig is assigned an 'N' in positions with conflicts.
Ambiguity nucleotides (R, Y, etc.): the contig displays an ambiguity nucleotide in positions with conflicts. For an overview of ambiguity codes, see IUPAC codes for nucleotides.
Note that conflicts will always be highlighted and marked no matter which option you choose, so the details of the conflicts are retained and can be used when the result of the sequence analysis is interpreted.
Choose Use existing trim information (the trimming you have just created) and click Next.
Now choose whether to open or save the data: if you select Open, click Finish; if you select Save, click Next, select the location to save the assembly and click Finish.
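The three conflict options can be sketched for a single alignment column as follows. This is an illustration of the rules as described above, not the Workbench's code. The function name `resolve_column` and the mode names are hypothetical, and as a simplifying assumption gaps are skipped rather than voted on.

```python
from collections import Counter

# IUPAC ambiguity codes, keyed by the set of nucleotides each one represents
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def resolve_column(bases, mode="vote"):
    """Return the consensus symbol for one column of aligned reads."""
    # Assumption of this sketch: anything other than A, C, G, T is ignored
    observed = [b for b in bases if b in ("A", "C", "G", "T")]
    distinct = set(observed)
    if not distinct:
        return "-"
    if len(distinct) == 1:
        return observed[0]          # no conflict in this column
    if mode == "vote":
        counts = Counter(observed)
        # Majority wins; ties are broken in the order A, C, G, T
        return max("ACGT", key=lambda b: (counts[b], -"ACGT".index(b)))
    if mode == "unknown":
        return "N"
    if mode == "ambiguity":
        return IUPAC[frozenset(distinct)]
    raise ValueError(f"unknown mode: {mode}")
```

For example, a column with one G and one T resolves to G under voting (a tie, so the A, C, G, T priority decides), to N under the unknown-nucleotide option, and to K under the ambiguity option.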

The result of the assembly is a contig: an alignment of the nine reads to the reference sequence.

Show an overview of the contig.
Click the Fit width button in the bottom toolbar to see an overview of the contig.

You can adjust the contig overview and how it is displayed (colors, coverage graph...) by changing parameters in the Alignment info in the Contig Settings side panel.

Compact the contig as much as you can.
If you want a more compact view of the contig, change Not compact to Packed in the Read layout section of the Side Panel. You can compact the view even further by removing the annotations in the Annotation layout section.

Resolving conflicts

Zoom to 100% on the residues at the start of the contig.
Click the Zoom to 100% button in the bottom toolbar to zoom in on the contig.

Set the compactness to Not compact in the Read layout settings in the Side Panel.

Find the first conflict in the contig.
Add Conflict and Trimmed region annotations in the Annotation type settings in the Side Panel. Click the Find Conflict button in the Contig Settings section of the side panel to find the first position of disagreement between the reads. Here, the first read has a T (marked in light-pink), whereas the second read and the reference have a gap.

To determine which of the two reads you should trust, you should assess the quality of the reads at this position.

Check the quality of the reads at this position.
A quick look at the regularity of the peaks of read Rev2 compared to read Rev3 indicates that we should trust the Rev2 read. In addition, you can see that we are close to the end of Rev3, and the quality of the chromatogram traces is often low near the ends.

Based on this, we decide not to trust Rev3.

Resolve the conflict.
To correct the read, select the T in the Rev3 sequence by placing the cursor to the left of it and dragging the cursor across the T. Right click and select Delete selection. This will resolve the conflict.

Find the next conflict.
Click the Find Conflict button again to find the next conflict.

This conflict is the beginning of a stretch of gaps in the consensus sequence. This is because the reads have been trimmed at this position. However, if you look at the read at the bottom, Fwd2, you see that a lot of the peaks seem to be fine, so we could include this information in the contig. If you scroll a little to the right, you can see where the trimmed region begins.

Include the region of Fwd2 in the contig.
To include this region in the contig, move the vertical slider on the Trimmed region annotation at position 2073 to the left. You will now see how the gaps in the consensus are replaced by the sequence information of Fwd2.

Click the Find Conflict button again to find the next conflict.

Here both reads differ from the reference sequence. We now inspect the traces in more detail.

Zoom in to see the details of this conflict.
Zoom in on this conflict position by clicking Zoom to selection in the bottom toolbar. This gives more space between the residues. If you want to inspect the peaks more closely, drag them up and down with the mouse.

We have sequenced the coding part of the gene. Often you want to know what a variation like this would mean on the protein level.

Show the effect of the variation on the protein.
To do this, show the translation along the contig:
Expand Nucleotide info in the Side Panel.
In the Translation section, select Show.
Select ORF/CDS in the Frame box.

The variation is on the third base of the codon coding for Threonine, a synonymous substitution. That is why the T is orange. A non-synonymous substitution would be in red.
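The distinction between synonymous and non-synonymous substitutions can be checked by translating the codon before and after the change with the standard genetic code. The sketch below is only an illustration: the function name `classify_substitution` is hypothetical, and the codons in the usage lines are examples chosen for this sketch, since the actual codon in the contig is not stated here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, read_codon):
    """Return (reference amino acid, read amino acid, kind of substitution)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    read_aa = CODON_TABLE[read_codon.upper()]
    kind = "synonymous" if ref_aa == read_aa else "non-synonymous"
    return ref_aa, read_aa, kind


# Hypothetical example codons: all four ACN codons encode threonine (T),
# so a change in the third base leaves the protein unchanged
third_base_change = classify_substitution("ACC", "ACT")
# A change in the second base of the same codon alters the amino acid
second_base_change = classify_substitution("ACC", "AAC")
```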

Getting a list of all variations

Browsing the conflicts by clicking the Find Conflict button is useful in many cases, but most people want to get an overview of all the conflicts in the entire contig.

Retrieve a table containing all conflicts.
Click Show Table on the bottom toolbar. You can right-click the Notes field and select Edit conflict annotation to enter your own annotation.
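Conceptually, such a table has one row per alignment column where the reads disagree. The sketch below builds a comparable list from gapped, equal-length read strings. The function name `conflict_table`, the row fields and the example sequences are all hypothetical, and the layout of the Workbench's own table may differ.

```python
def conflict_table(aligned_reads):
    """List the columns where aligned reads disagree.

    aligned_reads maps a read name to its gapped sequence; all sequences
    must have the same length. A space marks a column that the read does
    not cover, and a '-' is a gap that counts as a residue.
    """
    lengths = {len(seq) for seq in aligned_reads.values()}
    if len(lengths) != 1:
        raise ValueError("aligned reads must all have the same length")
    rows = []
    for pos in range(lengths.pop()):
        column = {name: seq[pos] for name, seq in aligned_reads.items()}
        residues = set(column.values()) - {" "}
        if len(residues) > 1:
            # 'note' is left empty so that a comment can be added later
            rows.append({"position": pos + 1, "residues": column, "note": ""})
    return rows


# Made-up alignment: one read has a T where the other has a gap
rows = conflict_table({"Rev2": "ACG-T", "Rev3": "ACGTT"})
```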

Getting a list of all the edits you made

When you make a change, it will be recorded in the contig's history.

Open the History.
Click the History icon on the bottom toolbar.

Using the result for further analysis

Save the consensus sequence.
Right click the name Consensus and select Open Sequence. This opens the consensus sequence in a new window. Right click the tab of this window and select Save As.

This will make it possible to use this sequence for further analyses in the CLC Main Workbench. All the conflict annotations are preserved, and in the sequence's history, you will find a reference to the original contig. As long as you also save the original contig, you will always be able to go back to it by choosing the reference contig in the consensus sequence's history.
