The allele frequency is the relative frequency of an allele across all alleles for a site.

This can be polarized according to the major/minor, reference/non-reference or ancestral/derived alleles. The choice of allele frequency estimator is therefore closely related to choosing which alleles are segregating (see Inferring_Major_and_Minor_alleles).

Brief Overview

./angsd -doMaf 
	-> angsd version: 0.570	 build(Dec 17 2013 14:11:25)
	-> Analysis helpbox/synopsis information:
------------------------
analysisMaf.cpp:
-doMaf	0
	1: BFGS frequency (known major minor)
	2: EM frequency (known major minor)
	4: BFGS frequency (unknown major minor)
	8: EM frequency (unknown major minor)
	16: Frequency from genotype probabilities
	32: AlleleCounts based method (known major minor)
	-doSNP	0
	-minMaf	0.010000 0
	-minLRT	24.000000 0
	-ref	(null)
	-anc	(null)
	-eps	0.001000 [Only used for -doMaf &32]
-doPost	0	(Calculate posterior prob 3xgprob)
	1: Using frequency as prior
	2: Using uniform prior
-beagleProb	0 (Dump beagle style postprobs)
NB these frequency estimators requires major/minor -doMajorMinor

Allele Frequency estimation

-doMaf [int]

INT=1 BFGS known minor

INT=2 EM known minor

INT=4 BFGS unknown minor

INT=8 EM unknown minor

INT=16 frequencies from genotype probabilities

Multiple estimators can be used simultaneously by summing the above numbers. Thus -doMaf 7 (1+2+4) will use the first three estimators. If the allele frequencies are estimated from the genotype likelihoods, you also need to infer the major and minor allele (-doMajorMinor).
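The summing of option values can be illustrated with a small Python sketch that decodes a -doMaf integer into the estimators it selects. The flag table is copied from the help text above; the function name `decode_domaf` is hypothetical and is not part of ANGSD.

```python
# Flag values and descriptions copied from the -doMaf help text.
DOMAF_FLAGS = {
    1: "BFGS frequency (known major minor)",
    2: "EM frequency (known major minor)",
    4: "BFGS frequency (unknown major minor)",
    8: "EM frequency (unknown major minor)",
    16: "Frequency from genotype probabilities",
    32: "AlleleCounts based method (known major minor)",
}


def decode_domaf(value):
    """Return the descriptions of the estimators selected by a -doMaf value.

    Hypothetical helper for illustration only; each power of two is tested
    with a bitwise AND, which is what "summing up the numbers" amounts to.
    """
    return [name for bit, name in DOMAF_FLAGS.items() if value & bit]


if __name__ == "__main__":
    print(decode_domaf(7))   # 1 + 2 + 4
    print(decode_domaf(10))  # 2 + 8
```

For example, -doMaf 10 decomposes into 2 + 8, that is, the two EM estimators.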

Allele frequencies from genotype likelihoods

The allele frequency estimators are described in citation. For testing purposes two optimizations are available: BFGS and the EM algorithm. The EM algorithm is much faster than BFGS. The allele frequencies are estimated by assuming that the site is diallelic. Either the major and minor alleles are inferred prior to the estimation, or the uncertainty of the minor allele is incorporated into the model.

ML estimator with known minor

First infer the major and minor allele and then use BFGS optimization (-doMaf 1) or the EM algorithm (-doMaf 2) to estimate the allele frequencies.

$L(D|f)\propto \prod _{i}^{N}p(D_{i}|f)=\prod _{i}^{N}\sum _{g\in \{0,1,2\}}p(D_{i}|G=g)p(G=g|f)$

${\hat {f}}=argmax_{f}L(D|f)$
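As a rough illustration of how the likelihood above can be maximized with EM, the following Python sketch iterates the expected minor allele count. It is not ANGSD's implementation. It assumes that $p(G=g|f)$ follows Hardy-Weinberg proportions, which the text above does not state, and the function name, starting value and stopping rule are all hypothetical.

```python
def em_allele_frequency(gls, f_init=0.2, max_iter=200, tol=1e-8):
    """EM estimate of the minor allele frequency from genotype likelihoods.

    gls: list of (L0, L1, L2) tuples, one per individual, where Lg is
         p(D_i | G=g) and g counts copies of the minor allele.
    Assumption: p(G=g | f) is Hardy-Weinberg, ((1-f)^2, 2f(1-f), f^2).
    """
    f = f_init
    n = len(gls)
    for _ in range(max_iter):
        prior = ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)
        expected_minor = 0.0
        for l0, l1, l2 in gls:
            # E-step: unnormalized posterior of each genotype.
            w0, w1, w2 = l0 * prior[0], l1 * prior[1], l2 * prior[2]
            total = w0 + w1 + w2
            if total == 0.0:
                continue  # data incompatible with current f; contributes 0
            expected_minor += (w1 + 2 * w2) / total
        # M-step: expected minor allele count over 2N chromosomes.
        f_new = expected_minor / (2 * n)
        if abs(f_new - f) < tol:
            return f_new
        f = f_new
    return f


if __name__ == "__main__":
    # Four individuals with unambiguous genotypes 0, 1, 2, 0.
    gls = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 0, 0)]
    print(em_allele_frequency(gls))
```

With unambiguous genotype likelihoods the estimate reduces to simple allele counting; with flat likelihoods the data carry no information and the estimate stays at the starting value.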

ML estimator with unknown minor

First infer the major allele and then use BFGS optimization (-doMaf 4) or the EM algorithm (-doMaf 8) to estimate the allele frequencies. Here only the major allele needs to be known, and the uncertainty of inferring the minor allele is modelled.

Let $\{M,m\}$ denote the major and minor allele, assuming a diallelic site. The maximum likelihood estimate is then found using the likelihood function

$P(D|M,f)=\prod _{i}P(D_{i}|M,f)=\prod _{i}\sum _{m}\sum _{A_{1},A_{2}}P(D_{i}|G=A_{1}A_{2})p(G=A_{1}A_{2}|m,M,f)p(m)$

Example

Example of estimating the allele frequencies both while assuming known major and minor alleles and while taking the uncertainty of the minor allele inference into account (-doMaf 10 = 2+8, the two EM estimators). The major and minor alleles are inferred directly from the genotype likelihoods.

./angsd -out out -doMajorMinor 1 -doMaf 10 -bam bam.filelist

Estimator from genotype probabilities

If the genotype probabilities are known, the frequencies can be estimated by summing the posterior probabilities $p(G=g|D)$, where $D$ is the sequencing data and $g\in \{0,1,2\}$ is the allele count of the minor allele. The frequency estimate is

${\hat {f}}={\frac {1}{2N}}\sum _{i}^{N}\left(2p(G=2|D)+p(G=1|D)\right)$
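The estimate above is a direct average, which a few lines of Python can make concrete. This is only a sketch of the formula, not ANGSD code; the function name and the input layout (one tuple of three posteriors per individual) are assumptions made for illustration.

```python
def frequency_from_posteriors(posteriors):
    """Minor allele frequency from per-individual genotype posteriors.

    posteriors: list of (p0, p1, p2) tuples, where pg = p(G=g | D) for one
                individual and g counts copies of the minor allele.
    Implements f = 1/(2N) * sum_i (2*p2 + p1).
    """
    n = len(posteriors)
    expected_minor = sum(p1 + 2 * p2 for _p0, p1, p2 in posteriors)
    return expected_minor / (2 * n)


if __name__ == "__main__":
    # Three individuals: confident 0/0, uncertain between 0/0 and 0/1, confident 1/1.
    example = [(1.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.0, 0.0, 1.0)]
    print(frequency_from_posteriors(example))
```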

Example

Example using a genotype probability file, for instance the output from beagle.

./angsd -out out -doMaf 16 -beagle beagle.file.gz

Estimator from sequencing data

The allele frequencies can be inferred directly from the sequencing data citation. This works by using "counts" of alleles, and should be invoked like

-doCounts 1 -doPhat 1

Output data

.mafs

chromo	position	major	minor	ref	knownEM	unknownEM	nInd
21      9719788 T       A       0.000001        -0.000012       3
21      9719789 G       A       0.000000        -0.000001       3
21      9719790 A       C       0.000000        -0.000004       3
21      9719791 G       A       0.000000        -0.000001       3
21      9719792 G       A       0.000000        -0.000002       3
21      9719793 G       T       0.498277        41.932766       3
21      9719794 T       A       0.000000        -0.000001       3
21      9719795 T       A       0.000000        -0.000001       3

The first 4 columns are always defined to be:

1. chromosome name
2. position
3. major allele
4. minor allele
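Since only the first four columns are fixed, a reader of .mafs data can split them off and keep the remainder as an untyped list. The Python sketch below does this for one whitespace-separated line; the function name and the returned dictionary keys are hypothetical, and nothing is assumed about the meaning of the remaining columns.

```python
def parse_mafs_line(line):
    """Split one data line of a .mafs file into its fixed and variable parts.

    The first four fields are chromosome name, position, major allele and
    minor allele. The remaining fields depend on the options used, so they
    are returned unparsed under the key "rest".
    """
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 columns, got %d" % len(fields))
    return {
        "chromo": fields[0],
        "position": int(fields[1]),
        "major": fields[2],
        "minor": fields[3],
        "rest": fields[4:],
    }


if __name__ == "__main__":
    # Data line taken from the example output above.
    row = parse_mafs_line("21      9719793 G       T       0.498277        41.932766       3")
    print(row["chromo"], row["position"], row["major"], row["minor"], row["rest"])
```

The header line should be skipped before calling the function, since its second field is not an integer.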

Depending on whether a reference and/or an ancestral fasta file has been supplied, these can occur as columns 5 and 6. There are 4 different MAF estimators; their estimates are given in the columns named knownEM, unknownEM, knownBFGS and unknownBFGS.

Furthermore, if -doSNP is included, the corresponding LRT will be printed.

The nInd column is the effective sample size, as determined by the genotype likelihoods.

Anders check below:

This is pretty self-explanatory: nInd is the number of individuals for which we have "reliable" reads (see bugs section). Depending on -doMaf INT, -ref FILENAME and -anc FILENAME, extra columns will be output.

Allele Frequencies