ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Allele Frequencies: Difference between revisions

Line 13: Line 13:
INT=16 frequencies from genotype probabilities
INT=16 frequencies from genotype probabilities


Multiple estimators can be used simultaneously by summing the above numbers. Thus -doMaf 7 (1+2+4) will use the first three estimators
Multiple estimators can be used simultaneously by summing the above numbers. Thus -doMaf 7 (1+2+4) will use the first three estimators. If the allele frequencies are estimated from the genotype likelihoods, you need to infer the major and minor allele (-doMajorMinor).


==Allele frequencies from genotype likelihoods==
==Allele frequencies from genotype likelihoods==
Line 20: Line 20:
===ML estimator with known minor===
===ML estimator with known minor===


First infer the [[Inferring_Major_and_Minor_alleles|Major and Minor]] allele and then use BFGS optimization to estimate the allele frequencies.  
First infer the [[Inferring_Major_and_Minor_alleles|Major and Minor]] allele and then use BFGS (-doMaf 1) optimization or the EM algorithm (-doMaf 2) to estimate the allele frequencies.  
<math>
<math>
L(D|f) \propto \prod_i^N p(D_i|f) = \prod_i^N \sum_{g\in\{0,1,2\}}p(D_i|G=g)p(G=g|f)
L(D|f) \propto \prod_i^N p(D_i|f) = \prod_i^N \sum_{g\in\{0,1,2\}}p(D_i|G=g)p(G=g|f)
</math>
</math>
<math>
<math>
   \hat{f}=argmax_{f} L(D|f)
   \hat{f}=argmax_{f} L(D|f)
</math>
</math>


Let <math>\{M,m\}</math> denote the two possible alleles at the diallelic site, then the maximum likelihood estimate of this pair is found using the likelihood function
===ML estimator with unknown minor===
 
First infer the [[Inferring_Major_and_Minor_alleles|Major]] allele and then use BFGS (-doMaf 4) optimization or the EM algorithm (-doMaf 8) to estimate the allele frequencies. Here only the Major allele needs to be known, and the uncertainty of inferring the minor allele is modelled.

Let <math>\{M,m\}</math> denote the major and minor allele, assuming a diallelic site; then the maximum likelihood estimate of this pair is found using the likelihood function


<math>
   P(D|\{m,M\}) =  \prod_i P(D_i|\{m,M\})
  =\prod_i \sum_{A_1,A_2 \in \{m,M\}} P(D_i|G=A_1A_2)p(G=A_1A_2|\{m,M\}),
</math>
<math>
   P(D|M,f) =  \prod_i P(D_i|M,f) = \sum_m \sum_{A_1,A_2} P(D_i|G=A_1A_2)p(G=A_1A_2|m,M)p(m),
</math>


===ML estimator with unknown minor===
First infer the [[Inferring_Major_and_Minor_alleles|Major and Minor]] allele and then use the EM algorithm to estimate the allele frequencies.


==Estimator from genotype probabilities==
==Estimator from genotype probabilities==

Revision as of 13:26, 18 June 2012

Allele Frequency estimation

-doMaf [int]

INT=1 BFGS known minor

INT=2 EM known minor

INT=4 BFGS unknown minor

INT=8 EM unknown minor

INT=16 frequencies from genotype probabilities

Multiple estimators can be used simultaneously by summing the above numbers. Thus -doMaf 7 (1+2+4) will use the first three estimators. If the allele frequencies are estimated from the genotype likelihoods, you need to infer the major and minor allele (-doMajorMinor).
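The summing rule works because each estimator code is a distinct power of two, so a summed value identifies its parts uniquely. A minimal Python sketch of that decomposition follows; the table and the function name `decode_domaf` are illustrative only and are not part of ANGSD.

```python
# Illustrative only: decompose a summed -doMaf value into the estimators
# listed above. Names here are hypothetical, not ANGSD code.
DOMAF_ESTIMATORS = {
    1: "BFGS known minor",
    2: "EM known minor",
    4: "BFGS unknown minor",
    8: "EM unknown minor",
    16: "frequencies from genotype probabilities",
}


def decode_domaf(value):
    """Return the estimators selected by a summed -doMaf value."""
    if not 0 <= value < 32:
        raise ValueError("value must be a sum of the codes 1, 2, 4, 8, 16")
    # Each code is one bit, so a bitwise AND tests whether it was included.
    return [name for bit, name in sorted(DOMAF_ESTIMATORS.items()) if value & bit]


if __name__ == "__main__":
    for name in decode_domaf(7):  # 7 = 1 + 2 + 4
        print(name)
```

Here 7 = 1 + 2 + 4 selects both known-minor estimators and the BFGS unknown-minor estimator.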

Allele frequencies from genotype likelihoods

The allele frequency estimators are described in citation. For testing purposes two optimizations are available: BFGS and the EM algorithm. The EM algorithm is much faster than BFGS. The allele frequencies are estimated assuming that the site is diallelic; the major and minor alleles can either be inferred prior to the estimation, or the uncertainty of the minor allele can be incorporated into the model.

ML estimator with known minor

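First infer the Major and Minor allele and then use BFGS (-doMaf 1) optimization or the EM algorithm (-doMaf 2) to estimate the allele frequencies.

<math>
L(D|f) \propto \prod_i^N p(D_i|f) = \prod_i^N \sum_{g\in\{0,1,2\}}p(D_i|G=g)p(G=g|f)
</math>
<math>
   \hat{f}=argmax_{f} L(D|f)
</math>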
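As a rough illustration of the EM variant, the sketch below maximizes a likelihood of the form prod_i sum_g p(D_i|G=g) p(G=g|f) over f, where g counts copies of the minor allele. It assumes Hardy-Weinberg proportions for p(G=g|f), which this page does not state explicitly, and the function name, starting value and tolerances are hypothetical. It is not ANGSD's implementation.

```python
# Sketch of an EM estimator for the minor allele frequency f, given
# per-individual genotype likelihoods p(D_i | G=g) for g = 0, 1, 2 copies
# of the minor allele.
# ASSUMPTION: p(G=g | f) follows Hardy-Weinberg proportions.


def em_allele_frequency(gls, f_init=0.2, tol=1e-8, max_iter=500):
    """gls: list of (L0, L1, L2) genotype likelihoods, one tuple per individual."""
    f = f_init
    for _ in range(max_iter):
        # Genotype prior under the Hardy-Weinberg assumption.
        prior = ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)
        expected_minor = 0.0
        for gl in gls:
            # E-step: posterior genotype weights for this individual.
            joint = [lik * p for lik, p in zip(gl, prior)]
            total = sum(joint)
            expected_minor += (joint[1] + 2 * joint[2]) / total
        # M-step: expected minor allele count over 2N chromosomes.
        f_new = expected_minor / (2 * len(gls))
        converged = abs(f_new - f) < tol
        f = f_new
        if converged:
            break
    return f


if __name__ == "__main__":
    # Four individuals with fully informative data: genotypes 0, 1, 2, 0.
    gls = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 0, 0)]
    print(em_allele_frequency(gls))
```

With fully informative likelihoods the estimate reduces to simple allele counting (3 minor alleles out of 8 chromosomes in the usage example), which is a convenient sanity check.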

ML estimator with unknown minor

First infer the Major allele and then use BFGS (-doMaf 4) optimization or the EM algorithm (-doMaf 8) to estimate the allele frequencies. Here only the Major allele needs to be known, and the uncertainty of inferring the minor allele is modelled.

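Let <math>\{M,m\}</math> denote the major and minor allele, assuming a diallelic site; then the maximum likelihood estimate of this pair is found using the likelihood function

<math>
   P(D|M,f) =  \prod_i P(D_i|M,f) = \sum_m \sum_{A_1,A_2} P(D_i|G=A_1A_2)p(G=A_1A_2|m,M)p(m),
</math>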


Estimator from genotype probabilities

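If the genotype probabilities are known, the frequencies can be estimated by summing the posterior probabilities <math>p(G_i=g|D_i)</math>, where <math>D_i</math> is the sequencing data for individual <math>i</math> and <math>g</math> is the allele count of the minor allele. The frequency estimate is

<math>
   \hat{f}=\frac{1}{2N}\sum_{i=1}^{N}\sum_{g\in\{0,1,2\}} g\, p(G_i=g|D_i)
</math>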
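A minimal Python sketch of this estimator: each individual contributes its expected minor allele count under the genotype posterior, and the total is divided by the 2N chromosomes sampled. The function name and input layout are assumptions for illustration, not ANGSD's interface.

```python
# Sketch: allele frequency from genotype posterior probabilities.
# Input layout (one (P0, P1, P2) tuple per individual, indexed by the
# number of minor alleles) is an assumption for illustration.


def maf_from_posteriors(posteriors):
    """Return the expected minor allele count divided by 2N."""
    n = len(posteriors)
    expected_minor = sum(p[1] + 2 * p[2] for p in posteriors)
    return expected_minor / (2 * n)


if __name__ == "__main__":
    posteriors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.5, 0.5, 0.0)]
    print(maf_from_posteriors(posteriors))
```

In the usage example the expected minor allele counts are 0, 1, 2 and 0.5, giving 3.5 minor alleles over 8 chromosomes.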


Estimator from sequencing data