ANGSD: Analysis of next generation Sequencing Data
Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.
Allele Frequencies
The allele frequency is the relative frequency of an allele across all alleles for a site.
This can be polarized according to major/minor, reference/non-reference or ancestral/derived. Therefore the choice of allele frequency estimator is closely related to the choice of which alleles are segregating (see Inferring_Major_and_Minor_alleles).
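As a small illustration of the definition and of the three polarizations, the sketch below counts alleles at a single diallelic site. It assumes called genotypes, which is a simplification made only for this example; the genotype strings and the helper names `allele_frequencies` and `polarize` are hypothetical and not part of ANGSD.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Relative frequency of each allele among all alleles observed at one site."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


def polarize(freqs, fixed_allele):
    """Frequency of the allele that is NOT `fixed_allele` at a diallelic site."""
    return 1.0 - freqs.get(fixed_allele, 0.0)


# Hypothetical called genotypes for four diploid individuals at one site.
genotypes = ["GG", "GT", "GG", "TT"]
freqs = allele_frequencies(genotypes)

major = max(freqs, key=freqs.get)  # most common allele
minor = min(freqs, key=freqs.get)  # least common allele

minor_freq = freqs[minor]             # major/minor polarization
non_ref_freq = polarize(freqs, "T")   # if the reference allele were T
derived_freq = polarize(freqs, "G")   # if the ancestral allele were G
```

The same site gives a different number depending on which allele the frequency is anchored to, which is why the choice of major and minor allele matters for the estimators described on this page.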
We allow for frequency estimation from different input data:
- Genotype Likelihoods
- Genotype posteriors
- Counts of bases
The allele frequency estimator from genotype likelihoods is from this publication, and the base counts method is from this publication. Unless you have very specific reasons for using the count-based estimator, we recommend that users use -doMaf 2.
./angsd -doMaf
-> angsd version: 0.572 build(Jan 7 2014 02:33:35)
-> Analysis helpbox/synopsis information:
------------------------
analysisMaf.cpp:
-doMaf 0 (Calculate persite frequencies '.mafs.gz')
    1: Frequency (fixed major and minor)
    2: Frequency (fixed major unknown minor)
    4: Frequency from genotype probabilities
    8: AlleleCounts based method (known major minor)
    Filedumping is supressed if value is negative
-doSNP 0 (Perform an LRT of variability)
-minMaf 0.010000 0
-minLRT 24.000000 0
-ref (null) (Filename for fasta reference)
-anc (null) (Filename for fasta ancestral)
-eps 0.001000 [Only used for -doMaf &8]
-doPost 0 (Calculate posterior prob 3xgprob)
    1: Using frequency as prior
    2: Using uniform prior
-beagleProb 0 (Dump beagle style postprobs)
-indFname (null) (file containing individual inbreedcoeficients)
NB These frequency estimators requires major/minor -doMajorMinor
Output data
.mafs
chromo position major minor ref knownEM unknownEM nInd
21 9719788 T A 0.000001 -0.000012 3
21 9719789 G A 0.000000 -0.000001 3
21 9719790 A C 0.000000 -0.000004 3
21 9719791 G A 0.000000 -0.000001 3
21 9719792 G A 0.000000 -0.000002 3
21 9719793 G T 0.498277 41.932766 3
21 9719794 T A 0.000000 -0.000001 3
21 9719795 T A 0.000000 -0.000001 3
The first 4 columns are always defined to be:
- 1. chromosome name
- 2. position
- 3. major allele
- 4. minor allele
Depending on whether a reference and/or an ancestral fasta file has been supplied, these can occur as columns 5 and 6. There are 4 different MAF estimators; their estimates are given in the columns named knownEM, unknownEM, knownBFGS and unknownBFGS.
Furthermore, if -doSNP is included, the corresponding LRT will be printed.
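To make the idea of a likelihood ratio test of variability concrete, here is a minimal sketch. It assumes a diallelic site, Hardy-Weinberg genotype proportions, and that the statistic compares the likelihood at the estimated frequency with the likelihood at frequency zero; this is an illustration of the general technique, not a transcription of ANGSD's implementation. The function names and the layout of `gls` (one tuple per individual holding the likelihoods for 0, 1 and 2 copies of the minor allele) are hypothetical.

```python
import math


def log_likelihood(gls, f):
    """Log-likelihood of minor allele frequency f given genotype likelihoods.

    gls: list of (L0, L1, L2) tuples, the likelihood of each individual's
    data given 0, 1 or 2 copies of the minor allele.
    Assumes Hardy-Weinberg proportions (an assumption of this sketch).
    """
    total = 0.0
    for l0, l1, l2 in gls:
        site_lik = l0 * (1 - f) ** 2 + l1 * 2 * f * (1 - f) + l2 * f ** 2
        total += math.log(site_lik)
    return total


def lrt_of_variability(gls, f_hat):
    """Twice the log-likelihood gain of f_hat over a monomorphic site (f = 0)."""
    return 2.0 * (log_likelihood(gls, f_hat) - log_likelihood(gls, 0.0))


# Hypothetical genotype likelihoods for one individual.
example = [(0.5, 1.0, 0.5)]
statistic = lrt_of_variability(example, 0.5)
```

A large statistic means the data are much better explained by a segregating site than by a fixed one; the -minLRT value in the help text is the threshold applied to such a statistic.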
The nInd column is the effective sample size, as determined by the genotype likelihoods.
This is fairly self-explanatory: nInd is the number of individuals for which we have "reliable" reads (see bugs section).
Depending on -doMaf INT, -ref FILENAME and -anc FILENAME, extra columns will be output.
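Because the set of columns varies with the options used, it is safer to read a .mafs table by its header names than by column position. The sketch below does this with the standard library only. The function name `read_mafs` and the simplified six-column example table are assumptions made for illustration; a real file will have whichever columns your run produced.

```python
import io

# Columns kept as text; all other columns are parsed as numbers.
TEXT_COLUMNS = {"chromo", "major", "minor", "ref", "anc"}
INT_COLUMNS = {"position", "nInd"}


def read_mafs(handle):
    """Yield one dict per site from a whitespace-delimited table with a header row."""
    header = handle.readline().split()
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(header):
            raise ValueError("row has %d fields, header has %d" % (len(fields), len(header)))
        site = {}
        for name, value in zip(header, fields):
            if name in TEXT_COLUMNS:
                site[name] = value
            elif name in INT_COLUMNS:
                site[name] = int(value)
            else:
                site[name] = float(value)
        yield site


# Hypothetical, simplified table used only for this example.
EXAMPLE = """chromo position major minor knownEM nInd
21 9719792 G A 0.000000 3
21 9719793 G T 0.498277 3
21 9719794 T A 0.000000 3
"""

sites = list(read_mafs(io.StringIO(EXAMPLE)))
# Keep sites whose estimated frequency is at least 0.01.
variable_sites = [s for s in sites if s["knownEM"] >= 0.01]
```

For a compressed '.mafs.gz' file, the same function can be given a handle opened with `gzip.open(path, "rt")` from the Python standard library.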
Theory
ML estimator with known minor
First infer the major and minor alleles, then estimate the allele frequency by BFGS optimization or with the EM algorithm. In the help output above this is the "fixed major and minor" estimator (-doMaf 1).
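The EM iteration for the known-minor case can be sketched as follows. This is a textbook EM for a diallelic site under Hardy-Weinberg proportions, given as an illustration and not as ANGSD's own code; the function name, the starting value, the convergence settings and the layout of `gls` (one tuple per individual with the likelihoods for 0, 1 and 2 copies of the minor allele) are all assumptions of this example.

```python
def em_allele_frequency(gls, start=0.25, tol=1e-8, max_iter=500):
    """EM estimate of the minor allele frequency from genotype likelihoods.

    gls: list of (L0, L1, L2) tuples, the likelihood of each individual's
    data given 0, 1 or 2 copies of the minor allele.
    """
    f = start
    for _ in range(max_iter):
        expected_minor = 0.0
        informative = 0
        for l0, l1, l2 in gls:
            # E-step: weight each genotype by likelihood times its HWE prior.
            w0 = l0 * (1 - f) ** 2
            w1 = l1 * 2 * f * (1 - f)
            w2 = l2 * f ** 2
            norm = w0 + w1 + w2
            if norm == 0.0:
                continue  # individual carries no usable information
            expected_minor += (w1 + 2 * w2) / norm
            informative += 1
        if informative == 0:
            return f
        # M-step: expected minor allele count over the total allele count.
        f_new = expected_minor / (2 * informative)
        if abs(f_new - f) < tol:
            return f_new
        f = f_new
    return f


# Hypothetical individuals with unambiguous genotypes 0, 1 and 2.
certain = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
estimate = em_allele_frequency(certain)
```

With unambiguous genotype likelihoods the estimate reduces to simple allele counting; with uncertain likelihoods each individual contributes its expected number of minor alleles instead of a hard call.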
ML estimator with unknown minor
First infer the major allele, then estimate the allele frequency by BFGS optimization or with the EM algorithm. Here only the major allele needs to be known, and the uncertainty in inferring the minor allele is modelled. In the help output above this is the "fixed major unknown minor" estimator (-doMaf 2).
Let denote the major and minor allele, assuming a diallelic site; the maximum likelihood estimate of this pair is then found using the likelihood function