Cgatools commands
for analysis of Complete Genomics data
Full listing of commands used during the CGI training
Make sure you copy the full text of each command: some are spread over several lines or extend beyond the dotted box.
UCSC browser direct link [1]
back to [training page]
set path and other inits
people with a bits computer do not need to do this!
Setting this variable in bash lets you shorten the commands in the following exercises: the full path to the 'training_data' folder is replaced by ${path_to_data}.
command:
cd <<the full path to the 'training_data' folder>>
path_to_data=`pwd`
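Before starting the exercises, it can help to confirm that the variable points at the expected folder. The following is a minimal sketch, assuming the `NA19240` and `cgatools` subfolders used in the later commands sit directly under 'training_data'; the function name `check_training_data` is hypothetical and not part of cgatools.

```shell
# check_training_data DIR: report which expected subfolders exist under DIR.
# The subfolder names come from the paths used in the exercises; the helper
# itself is illustrative only.
check_training_data() {
    dir=$1
    status=0
    for sub in NA19240 cgatools; do
        if [ -d "${dir}/${sub}" ]; then
            echo "found:   ${dir}/${sub}"
        else
            echo "missing: ${dir}/${sub}"
            status=1
        fi
    done
    return $status
}

# Usage, once path_to_data has been set:
#   check_training_data "${path_to_data}"
```

A non-zero return value means at least one subfolder was not found, which usually indicates that `cd` went to the wrong folder.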
Reference Coverage File
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/REF/coverageRefScore-chr21-GS19240-180-36-ASM.tsv.bz2 \
  | head -20
result:
… in a region
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/REF/coverageRefScore-chr21-GS19240-180-36-ASM.tsv.bz2 \
  | perl -F'\t' -ane 'print if ($F[0]>= 40647496 && $F[0]<= 40647501);'
result:
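As a side note, the Perl one-liner above keeps the rows whose first column (the offset) lies inside a closed interval. The same filter can be sketched with awk, shown here on a few made-up rows so that it runs without the training data. The function name `filter_by_offset` and the sample values are assumptions for illustration, not real coverage data.

```shell
# filter_by_offset START END: print tab-separated rows from stdin whose
# first column is a number within [START, END] (both ends inclusive).
# Header lines are skipped because their first column is not numeric.
filter_by_offset() {
    awk -F'\t' -v lo="$1" -v hi="$2" \
        '$1 ~ /^[0-9]+$/ && $1 >= lo && $1 <= hi'
}

# Made-up rows: offset <TAB> coverage (not real data).
printf '40647495\t31\n40647496\t32\n40647501\t30\n40647502\t29\n' |
    filter_by_offset 40647496 40647501
```

With the real file, the function would take the place of the Perl step: `bzcat <coverage file> | filter_by_offset 40647496 40647501`.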
Variant File
command:
To see more than the first 20 lines, increase the number at the end of the command accordingly.
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/var-GS19240-180-36-ASM.tsv.bz2 \
  | head -20
result:
… in a region
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/var-GS19240-180-36-ASM.tsv.bz2 \
  | grep -w chr21 | perl -F'\t' -ane 'print if ($F[4]>=3946599 && $F[5]<=9719804);'
result:
Evidence Interval File
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/EVIDENCE/evidenceIntervals-chr21-GS19240-180-36-ASM.tsv.bz2 \
  | head -20
result:
… in a locus
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/EVIDENCE/evidenceIntervals-chr21-GS19240-180-36-ASM.tsv.bz2 \
  | perl -F'\t' -ane 'print if ($F[0]==19995);'
result:
… in a region
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/EVIDENCE/evidenceIntervals-chr21-GS19240-180-36-ASM.tsv.bz2 \
  | perl -F'\t' -ane 'print if ($F[2]<=9719803 && ($F[2]+$F[3])>=9719804);'
result:
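As a side note, the condition in the command above is an interval test: a row is kept when its interval starts at or before 9719803 and ends at or after 9719804. A minimal awk sketch of that logic follows, assuming zero-based, half-open coordinates and that columns 3 and 4 hold the offset and the length (mirroring `$F[2]` and `$F[3]`). The function name `intervals_covering` and the sample rows are made up for illustration.

```shell
# intervals_covering POS: print rows whose half-open interval
# [offset, offset+length) contains the zero-based position POS.
# Column 3 is taken as the offset and column 4 as the length
# (an assumption about the file layout).
intervals_covering() {
    awk -F'\t' -v pos="$1" \
        '$3 ~ /^[0-9]+$/ && $3 <= pos && ($3 + $4) >= pos + 1'
}

# Made-up rows: id, chromosome, offset, length (not real data).
printf '1\tchr21\t9719700\t100\n2\tchr21\t9719750\t80\n3\tchr21\t9719804\t50\n' |
    intervals_covering 9719803
```

Only the second sample row passes: the first interval ends before the position and the third starts after it.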
Evidence DNB File
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/EVIDENCE/evidenceDnbs-chr21-GS19240-180-36-ASM.tsv.bz2 \
  | head -20
result:
… in a locus
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/EVIDENCE/evidenceDnbs-chr21-GS19240-180-36-ASM.tsv.bz2 \
  | perl -F'\t' -ane 'print if ($F[0]==19995);'
result:
Evidence Correlation File
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/EVIDENCE/correlation-GS19240-180-36-ASM.tsv.bz2 \
  | head -20
result:
[long running]
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/EVIDENCE/correlation-GS19240-180-36-ASM.tsv.bz2 \
  | perl -F'\t' -ane 'print if (($F[0] eq "chr21" && $F[1] == 9720357) or ($F[3] eq "chr21" && $F[4] == 9720357));'
result:
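As a side note, the filter above keeps a correlation row when either of its two ends matches the requested chromosome and offset. The sketch below expresses the same test with awk on made-up rows, assuming that columns 1-2 and 4-5 hold the chromosome and offset of each end (mirroring `$F[0]`, `$F[1]`, `$F[3]` and `$F[4]`). The function name `pairs_touching` is hypothetical.

```shell
# pairs_touching CHR POS: print rows where either end of the pair
# (columns 1-2 or columns 4-5) equals chromosome CHR at offset POS.
pairs_touching() {
    awk -F'\t' -v chr="$1" -v pos="$2" \
        '($1 == chr && $2 == pos) || ($4 == chr && $5 == pos)'
}

# Made-up rows: chrA, offsetA, lengthA, chrB, offsetB (not real data).
printf 'chr21\t9720357\t30\tchr21\t9720900\nchr21\t100\t30\tchr21\t200\n' |
    pairs_touching chr21 9720357
```

Passing the chromosome and offset as arguments avoids embedding quoted strings inside the one-liner.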
Reference Genome File
command:
cgatools decodecrr --reference ${path_to_data}/cgatools/cgatools_input/hg18.crr --range chr21,45724800,45724870
result:
… chromosome mode
command:
cgatools listcrr --reference ${path_to_data}/cgatools/cgatools_input/hg18.crr --mode chromosome
result:
… contig mode
command:
cgatools listcrr --reference ${path_to_data}/cgatools/cgatools_input/hg18.crr --mode contig
result:
Gene Variation Summary File
command:
cat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/geneVarSummary-GS19240-180-36-ASM.tsv \
  | head -20
result:
… summarized
command:
grep '[^0-9]' ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/geneVarSummary-GS19240-180-36-ASM.tsv \
  | sort -k18nr | head
result:
Gene File
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/gene-GS19240-180-36-ASM.tsv.bz2 \
  | head -20
result:
… for a given snpID
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/gene-GS19240-180-36-ASM.tsv.bz2 \
  | grep 'rs1042522'
result:
[long running, generates count of components]
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/gene-GS19240-180-36-ASM.tsv.bz2 \
  | grep '^[0-9]' | cut -f16 | sort | uniq -c | sort -k1nr
result:
[long running, generates count of impacts]
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/gene-GS19240-180-36-ASM.tsv.bz2 \
  | grep '^[0-9]' | cut -f19 | sort | uniq -c | sort -k1nr
result:
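As a side note, the two counting commands above share one pattern: keep the data rows, cut out one column, then tally its distinct values and list the most frequent first. The sketch below wraps that pattern in a function and runs it on made-up rows. The name `count_column` is hypothetical, and the sample values are not real gene annotations.

```shell
# count_column N: tally the distinct values found in tab-separated
# column N of the data rows on stdin (rows that start with a digit),
# most frequent value first.
count_column() {
    grep '^[0-9]' | cut -f"$1" | sort | uniq -c | sort -k1,1nr
}

# Made-up rows: index <TAB> impact (not real data).
printf '>index\timpact\n1\tMISSENSE\n2\tSYNONYMOUS\n3\tMISSENSE\n' |
    count_column 2
```

With the real gene file, `bzcat <gene file> | count_column 16` and `bzcat <gene file> | count_column 19` would correspond to the component and impact counts.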
… FRAMESHIFT in TEKT4 only
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/gene-GS19240-180-36-ASM.tsv.bz2 \
  | grep -w 'TEKT4' | perl -F'\t' -ane 'print if ($F[18] =~ /FRAMESHIFT/);'
result:
… DELETE or INSERT in TEKT4 only
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/gene-GS19240-180-36-ASM.tsv.bz2 \
  | grep -w 'TEKT4' | perl -F'\t' -ane 'print if ($F[18] =~ /DELETE/ || $F[18] =~ /INSERT/);'
result:
… MISSENSE in gene TEKT4 summarized
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/gene-GS19240-180-36-ASM.tsv.bz2 \
  | grep -w 'TEKT4' | grep -w 'MISSENSE' | cut -f1 | sort | uniq -c | sort -k1nr | head -5
result:
chr21 Varfile Subset
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/var-GS19240-180-36-ASM.tsv.bz2 \
  | head -200 | grep -v '^[0-9]' > ${path_to_data}/cgatools/cgatools_input/var-NA19240-chr21.tsv
result:
… subset chr21 from all
command:
bzcat ${path_to_data}/NA19240/GS00028-DNA_C01/ASM/var-GS19240-180-36-ASM.tsv.bz2 \
  | grep 'chr21' >> ${path_to_data}/cgatools/cgatools_input/var-NA19240-chr21.tsv
result:
# see the results with
head -20 ${path_to_data}/cgatools/cgatools_input/var-NA19240-chr21.tsv
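The two subset steps above (header lines first, then the chr21 rows) can also be sketched as a single pass with awk. This is an illustrative alternative rather than the method used in the training. It assumes that data rows start with a digit and that the chromosome is in the fourth tab-separated column, and unlike the `head -200` step it keeps every non-data line, not only those near the top. The function name `subset_chromosome` is hypothetical.

```shell
# subset_chromosome CHR: from a var-style stream on stdin, keep every row
# that does not start with a digit (header and metadata lines) plus the
# data rows whose fourth tab-separated column equals CHR.
subset_chromosome() {
    awk -F'\t' -v chr="$1" '!/^[0-9]/ || $4 == chr'
}

# Made-up rows in a var-like layout (not real data).
printf '#GENERATED_BY\tdemo\n>locus\tploidy\tallele\tchromosome\n1\t2\tall\tchr20\n2\t2\tall\tchr21\n' |
    subset_chromosome chr21
```

Matching on the chromosome column alone avoids picking up rows that merely mention chr21 in another field.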
cgatools calldiff
command:
cgatools calldiff --reference ${path_to_data}/cgatools/cgatools_input/hg18.crr \
  --variantsA ${path_to_data}/cgatools/cgatools_input/var-NA19240-chr21.tsv \
  --variantsB ${path_to_data}/cgatools/cgatools_input/var-NA19238-chr21.tsv \
  --superlocus-output ${path_to_data}/cgatools/cgatools_output/NA19240vsNA19238-superlocus-output.tsv \
  --superlocus-stats ${path_to_data}/cgatools/cgatools_output/NA19240vsNA19238-superlocus-stats.csv \
  --locus-output ${path_to_data}/cgatools/cgatools_output/NA19240vsNA19238-locus-output.tsv \
  --locus-stats ${path_to_data}/cgatools/cgatools_output/NA19240vsNA19238-locus-stats.csv \
  --debug-call-output ${path_to_data}/cgatools/cgatools_output/NA19240vsNA19238-debug-call-output.tsv \
  --debug-superlocus-output ${path_to_data}/cgatools/cgatools_output/NA19240vsNA19238-debug-superlocus-output.tsv
result:
# preview results with
# superlocus-output
head -20 ${path_to_data}/cgatools/cgatools_output/NA19240vsNA19238-superlocus-output.tsv
# superlocus-stats
cat ${path_to_data}/cgatools/cgatools_output/NA19240vsNA19238-superlocus-stats.csv
# locus-output
head -20 ${path_to_data}/cgatools/cgatools_output/NA19240vsNA19238-locus-output.tsv
# locus-stats
cat ${path_to_data}/cgatools/cgatools_output/NA19240vsNA19238-locus-stats.csv
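Previewing several output files one at a time can be wrapped in a small loop. The sketch below is a hypothetical helper (`preview_files` is not a cgatools command) that prints the first lines of each file it is given and reports any file that is missing instead of stopping.

```shell
# preview_files N FILE...: print the first N lines of each existing file,
# preceded by its name; missing files are reported and skipped.
preview_files() {
    n=$1
    shift
    for f in "$@"; do
        if [ -f "$f" ]; then
            echo "== $f"
            head -n "$n" "$f"
        else
            echo "== $f (not found)"
        fi
    done
}

# Usage with the calldiff outputs named above:
#   preview_files 20 ${path_to_data}/cgatools/cgatools_output/NA19240vsNA19238-*-output.tsv
```

The glob in the usage line is an assumption that all output files of interest share the `NA19240vsNA19238-` prefix, as in the command above.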
cgatools snpdiff
command:
cgatools snpdiff --reference ${path_to_data}/cgatools/cgatools_input/hg18.crr \
  --variants ${path_to_data}/cgatools/cgatools_input/var-NA19240-chr21.tsv \
  --genotypes ${path_to_data}/cgatools/cgatools_input/NA19238_Infinium_Genotypes.tsv \
  --output ${path_to_data}/cgatools/cgatools_output/NA19238_out.tsv \
  --verbose ${path_to_data}/cgatools/cgatools_output/NA19238_verb.tsv \
  --stats ${path_to_data}/cgatools/cgatools_output/NA19238_stats.csv
result:
# preview output with
# output
head -20 ${path_to_data}/cgatools/cgatools_output/NA19238_out.tsv
# stats
cat ${path_to_data}/cgatools/cgatools_output/NA19238_stats.csv
end of 'September-9th' exercises
New commands for version 1.1 will be added here once they are available.