ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Genotype calling


We do not recommend basing analyses on called genotypes; it is better to incorporate the genotype uncertainty directly into the analysis you want to perform. However, we recognise that many methods still rely on called genotypes, and have therefore implemented a basic genotype caller in angsd.


Genotype calling in ANGSD is based on the posterior probability of the genotypes. -doGeno is therefore a simple wrapper around -doPost, with some extra filtering options. See Allele Frequencies for more information.


Brief Overview

./angsd -dogeno         -> Wed Mar  2 12:39:19 2016
-----------------
abcCallGenotypes.cpp:

-doGeno 0
        1: write major and minor
        2: write the called genotype encoded as -1,0,1,2, -1=not called
        4: write the called genotype directly: eg AA,AC etc 
        8: write the posterior probability of all possible genotypes
        16: write the posterior probability of called genotype
        32: write the posterior probabilities of the 3 gentypes as binary
        -> A combination of the above can be choosen by summing the values, EG write 0,1,2 types with majorminor as -doGeno 3
        -postCutoff=0.333333 (Only genotype to missing if below this threshold)
        -geno_minDepth=-1       (-1 indicates no cutof)
        -geno_maxDepth=-1       (-1 indicates no cutof)
        -geno_minMM=-1.000000   (minimum fraction af major-minor bases)
        -minInd=0       (only keep sites if you call genotypes from this number of individuals)

        NB When writing the posterior the -postCutoff is not used
        NB geno_minDepth requires -doCounts
        NB geno_maxDepth requires -doCounts


angsd can also use the full information of the sample allele frequencies when calling genotypes; see SFS Estimation.

Options

-doGeno [int]

1: print out major minor

2: print the called genotype as -1,0,1,2 (count of minor)

4: print the called genotype as AA, AC, AG, ...

8: print all 3 posts (major,major),(major,minor),(minor,minor)

16: print the posterior of the called genotype

32: unlike the text outputs above, dumps the posteriors for all samples in binary, encoded as 3*nind doubles

Use the sum of the above to get the output you want. For example, -doGeno 5 (1+4) prints the major and minor allele followed by the genotype (AA, AC, ...) for each individual.
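Because the values are powers of two, a summed -doGeno value can be read as a set of bit flags. The following Python sketch splits a value back into its components. The helper name decode_dogeno and the description strings are illustrative only and are not part of ANGSD; the sketch also does not check whether ANGSD itself accepts every combination.

```python
# Output components selectable with -doGeno, keyed by flag value.
# The descriptions paraphrase the option list above.
DOGENO_FLAGS = {
    1: "major and minor allele",
    2: "called genotype as -1,0,1,2",
    4: "called genotype as bases (AA, AC, ...)",
    8: "posterior of all three genotypes",
    16: "posterior of the called genotype",
    32: "binary dump of the posteriors",
}


def decode_dogeno(value):
    """Return the flag values whose bits are set in a summed -doGeno value."""
    if value < 0 or value > sum(DOGENO_FLAGS):
        raise ValueError("value is outside the range of known -doGeno flags")
    return [flag for flag in sorted(DOGENO_FLAGS) if value & flag]


# -doGeno 5 is 1 + 4: major/minor plus the genotype written as bases.
for flag in decode_dogeno(5):
    print(flag, DOGENO_FLAGS[flag])
```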

-doPost [int]

1: estimate the posterior genotype probability based on the allele frequency as a prior

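2: estimate the posterior genotype probability assuming a uniform prior

3: estimate the posterior genotype probability using the SFS as a prior (supplied with -pest; see the second example)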
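The difference between the frequency-based prior (1) and the uniform prior (2) can be illustrated with a small Python sketch of Bayes' rule at a diallelic site. This is not ANGSD's code. It assumes that the frequency-based prior follows Hardy-Weinberg proportions for a minor allele frequency f, and the function name genotype_posterior and the example likelihoods are made up for illustration.

```python
def genotype_posterior(likelihoods, minor_freq=None):
    """Posterior over (major,major), (major,minor), (minor,minor).

    likelihoods: three genotype likelihoods P(data | genotype).
    minor_freq:  minor allele frequency used for the prior;
                 None selects a uniform prior.
    """
    if minor_freq is None:
        prior = [1.0 / 3.0] * 3
    else:
        f = minor_freq
        # Assumption: Hardy-Weinberg genotype proportions given f.
        prior = [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]
    weighted = [lik * p for lik, p in zip(likelihoods, prior)]
    total = sum(weighted)
    if total == 0:
        raise ValueError("all weighted likelihoods are zero")
    return [w / total for w in weighted]


gl = [0.2, 0.6, 0.2]                   # made-up genotype likelihoods
print(genotype_posterior(gl))          # uniform prior
print(genotype_posterior(gl, 0.05))    # frequency-based prior
```

With the uniform prior the posterior is just the normalised likelihoods. With a rare minor allele, the prior moves most of the probability to the homozygous major genotype, so the same data can lead to a different call.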

-geno_minDepth [int]

set genotypes to missing if the individual depth is less than [int]

-geno_maxDepth [int]

set genotypes to missing if the individual depth is larger than [int]

-geno_minMM [float]

set genotypes to missing if less than [float] of the bases are the major or minor allele (likely a triallelic site). E.g. with 0.1, a genotype is set to missing when less than 10% of the reads in that individual are either the major or the minor allele

-postCutoff [float]

Call a genotype only if its posterior probability is above this threshold; otherwise the genotype is set to missing.

NB if the raw posterior dump is requested the -postCutoff is not used
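The filters described above can be summarised in a short Python sketch that calls one individual's genotype at one site. It is an illustration, not ANGSD's implementation: the function call_genotype and its argument names are hypothetical, and the treatment of a posterior exactly equal to the cutoff is an assumption.

```python
MISSING = -1


def call_genotype(posterior, depth, mm_fraction, post_cutoff=1.0 / 3.0,
                  min_depth=-1, max_depth=-1, min_mm=-1.0):
    """Return 0, 1 or 2 (count of minor alleles), or -1 for missing.

    posterior:   probabilities of (major,major), (major,minor), (minor,minor)
    depth:       sequencing depth of this individual at the site
    mm_fraction: fraction of this individual's bases that are major or minor
    """
    # Depth filters; -1 disables a filter, as in the option defaults.
    if min_depth != -1 and depth < min_depth:
        return MISSING
    if max_depth != -1 and depth > max_depth:
        return MISSING
    # Too few major/minor bases suggests a third allele at the site.
    if mm_fraction < min_mm:
        return MISSING
    # Pick the genotype with the highest posterior probability.
    best = max(range(3), key=lambda g: posterior[g])
    # Assumption: a posterior exactly at the cutoff is still called.
    if posterior[best] < post_cutoff:
        return MISSING
    return best


print(call_genotype([0.01, 0.02, 0.97], depth=10, mm_fraction=1.0,
                    post_cutoff=0.95))
print(call_genotype([0.50, 0.30, 0.20], depth=10, mm_fraction=1.0,
                    post_cutoff=0.95))
```

A failed filter always gives a missing genotype, so the order in which the checks run does not change the result.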

Examples

Allele frequency as prior

./angsd -bam bam.filelist -GL 1 -out outfile -doMaf 2 -doMajorMinor 1 -SNP_pval 0.000001 -doGeno 5 -doPost 1 -postCutoff 0.95

gives output like this:

1       14000202        G       A       GG      NN      NN      GA      NN      
1       14000873        G       A       GG      GG      GG      AA      GA      
1       14001018        T       C       NN      NN      NN      CC      NN      
1       14001867        A       G       NN      AA      AA      NN      NN      
1       14002342        C       T       CC      CC      CC      CC      CC      
1       14002422        A       T       AA      NN      NN      NN      NN      
1       14002474        T       C       TC      TT      TT      TT      TT      
1       14003581        C       T       CC      CC      NN      NN      CT      
1       14004623        T       C       TT      TT      TT      NN      TC      
1       14005069        A       G       AA      AA      AA      AA      AA
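Output of this shape is easy to post-process. The Python sketch below parses such lines, counts the called genotypes per site, and converts a base-pair genotype to the count of the minor allele. It assumes the columns are chromosome, position, major, minor, then one genotype per individual, and that NN marks a genotype that was not called; the function names are illustrative.

```python
def parse_geno_line(line):
    """Split one whitespace-separated -doGeno 5 style line into a record."""
    fields = line.split()
    return {
        "chrom": fields[0],
        "pos": int(fields[1]),
        "major": fields[2],
        "minor": fields[3],
        "genotypes": fields[4:],
    }


def count_called(record, missing="NN"):
    """Number of individuals with a called genotype at this site."""
    return sum(g != missing for g in record["genotypes"])


def minor_count(genotype, minor, missing="NN"):
    """Count of minor alleles in a genotype such as GA; -1 if not called."""
    if genotype == missing:
        return -1
    return genotype.count(minor)


# Two lines taken from the example output above.
sample = """\
1       14000202        G       A       GG      NN      NN      GA      NN
1       14000873        G       A       GG      GG      GG      AA      GA
"""

for line in sample.splitlines():
    rec = parse_geno_line(line)
    counts = [minor_count(g, rec["minor"]) for g in rec["genotypes"]]
    print(rec["chrom"], rec["pos"], count_called(rec), counts)
```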

Sample allele frequency with SFS as prior

1. First get an estimate of the site frequency spectrum

./angsd -dosaf 1 -anc ../hg19ancNoChr.fa.gz -gl 1 -b list
./realSFS angsdput.saf.idx >angsdput.saf.idx.ml

2. Now calculate the diallelic genotype posterior probabilities with

./angsd -dopost 3 -b list -gl 1 -domajorminor 1 -domaf 1 -pest angsdput.saf.idx.ml -dogeno 2 -r 1 -out angsdput2