.contig


original source
The .contig format is a concatenation of the .align files produced by TIGR Assembler. This format is a more concise representation of the output of the assembler (reported in the verbose .asm file) and is an extension of the GDE multiple alignment format.

Example

##56487 19 1623 bases, 00000000 checksum.
TTAGACCCAGGAGAAG-CATAAAATTTTCAGAGCCATCTGATGTAGGAGGAAGTTATGAA
#000035230611N10F(0) [RC] 711 bases, 00000000 checksum. {720 10} <1 710>
TTAGACCCAGGAGAAG-CATAAAATTTTCAGAGCCATCTGATGTAGGAGGAAGTTATGAA

  • Each contig is preceded by a header starting with "##", followed by the contig identifier, the number of reads aligned to it, and the number of bases in the padded consensus. If generated by TIGR Assembler, these records also contain an 8-digit checksum; most converters, however, generate a blank checksum (no code uses it anyway).
  • The contig sequence, listed after the "##" header, is padded with the gap character.
  • Each read aligned to the consensus is preceded by a header starting with a single "#" character. The 0-based offset of the read in the consensus is given in parentheses. Within the square brackets, the string "RC" indicates that the read was reverse complemented, a fact also indicated by the reversed order of the clear range within the braces ({720 10}). The clear range is 1-based with respect to the unpadded/ungapped read sequence. Note that the low number is 10, meaning the first 9 bases (1-9) have been trimmed from the beginning (5' end) of the read. Bases may also have been trimmed from the end of the read (3' end) beyond base 720, but this format does not record how many. Next, the coordinates of the read along the ungapped, 1-based consensus are given within angle brackets (<1 710>). This header also contains a checksum (largely ignored) and a base count (711 bases in the example).
  • After the read header, the aligned section of the read (the bases within the clear range alone) is provided in padded form and in the correct orientation (reverse complemented if necessary).
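The header fields described above can be pulled apart with two regular expressions. The following Python sketch is based only on the single example shown on this page, so the patterns are assumptions rather than a format specification: it assumes forward reads carry empty square brackets, that a blank checksum is simply missing digits, and that read identifiers contain no parentheses. The names CONTIG_HEADER, READ_HEADER, parse_contig_header and parse_read_header are made up for this illustration.

```python
import re

# Patterns inferred from the one example on this page; real files may differ.
CONTIG_HEADER = re.compile(
    r"^##(\S+)\s+(\d+)\s+(\d+) bases(?:,\s*(\d*)\s*checksum\.)?"
)
READ_HEADER = re.compile(
    r"^#([^#\s(]+)\((\d+)\)\s+\[(RC)?\]\s+(\d+) bases,"
    r"\s*(\d*)\s*checksum\.\s+\{(\d+) (\d+)\}\s+<(\d+) (\d+)>"
)


def parse_contig_header(line):
    """Return the contig id, read count and padded consensus length."""
    m = CONTIG_HEADER.match(line)
    if m is None:
        raise ValueError("not a contig header: %r" % line)
    return {
        "id": m.group(1),
        "num_reads": int(m.group(2)),
        "padded_length": int(m.group(3)),
    }


def parse_read_header(line):
    """Return the fields of a read header as a dictionary."""
    m = READ_HEADER.match(line)
    if m is None:
        raise ValueError("not a read header: %r" % line)
    name, offset, rc, nbases, _checksum, clr_a, clr_b, con_a, con_b = m.groups()
    clr_a, clr_b = int(clr_a), int(clr_b)
    return {
        "id": name,
        "offset": int(offset),            # 0-based offset in the consensus
        "reverse": rc == "RC",            # True if reverse complemented
        "bases": int(nbases),
        "clear_range": (clr_a, clr_b),    # 1-based, reversed order for RC reads
        "trimmed_5prime": min(clr_a, clr_b) - 1,
        "consensus_range": (int(con_a), int(con_b)),  # 1-based, ungapped
    }
```

For the read header in the example, this reports an offset of 0, a reverse-complemented read, and 9 bases trimmed from the 5' end.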

Note: the .contig format can be easily parsed in Perl using the AMOS::ParseFasta module as follows:

    $pf = new AMOS::ParseFasta(\*STDIN, "#", "");

For more information, run perldoc AMOS::ParseFasta.
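Where Perl or AMOS is not available, the record structure is simple enough to split by hand. The Python sketch below is not a port of AMOS::ParseFasta and makes no claim about how that module behaves; it only groups lines by the "##" and "#" prefixes described on this page, assuming that sequences may wrap across several lines. The function names read_contigs and _finish are hypothetical.

```python
def read_contigs(lines):
    """Yield (contig_header, consensus, reads) for each contig.

    reads is a list of (read_header, aligned_sequence) pairs.
    Wrapped sequence lines are joined into a single string.
    """
    contig = None   # [header, consensus_parts, reads]
    target = None   # list currently collecting sequence lines
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue                      # skip blank lines
        if line.startswith("##"):         # contig header
            if contig is not None:
                yield _finish(contig)
            contig = [line, [], []]
            target = contig[1]
        elif line.startswith("#"):        # read header
            if contig is None:
                raise ValueError("read header before any contig header")
            read = [line, []]
            contig[2].append(read)
            target = read[1]
        else:                             # sequence line
            if target is None:
                raise ValueError("sequence before any header")
            target.append(line.strip())
    if contig is not None:
        yield _finish(contig)


def _finish(contig):
    """Join the collected sequence parts of one contig and its reads."""
    header, seq_parts, reads = contig
    return header, "".join(seq_parts), [(h, "".join(p)) for h, p in reads]
```

Passing an open file or sys.stdin to read_contigs yields one tuple per contig, which can be looped over in the usual way.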