SFS Estimation: Difference between revisions

Revision as of 21:04, 19 December 2012

This method will estimate the site frequency spectrum, the method is described in Nielsen2012.

This is a 2 step procedure first generate a ".sfs" file, followed by an optimization of the .sfs file which will estimate the Site frequency spectrum. For the optimization we have implemented 2 different approaches both found in the misc subdir of the root subdir.This is shown in the diagram below.

NB the ancestral state needs to be supplied for this methodd <classdiagram type="dir:LR">

[sequence data{bg:orange}]->GL[genotype likelihoods|SAMtools;GATK;SOAPsnp;Kim et.al]

[genotype likelihoods|SAMtools;GATK;SOAPsnp;Kim et.al]->realSFS[.sfs file{bg:blue}] [.sfs file{bg:blue}]->optimize[.sfs.ml file{bg:red}] [.sfs file{bg:blue}]->optimize[.sfs.em.ml file{bg:red}]

</classdiagram>

-realSFS 1: an sfs file will be generated.

-realSFS 2: Taking into account perIndividual inbreeding coefficients. This is the work of Filipe G. Vieira

-realSFS 4: snpcalling (not implemented, in this angsd)

-realSFS 8: genotypecalling (not implemented, int this angsd)

For the inbreeding version you need to supply a file containing all the inbreeding coefficients. -indF

options

-underFlowProtect [INT]

a very basic underflowprotection

Example

A full example is shown below, here we use GATK genotype likelihoods and hg19.fa as the ancestral. The 'optimSFS and emOptim can be found in the misc subfolder.

#first generate .sfs file
./angsd -bam smallBam.filelist -realSFS 1 -out small -anc  hg19.fa -GL 2
#now try the EM optimization with 4 threads
./emOptim.g++ -binput small.sfs -nChr 20 -maxIter 100 -nThread 4 
#lets also try the optimization that uses derivates (bfgs)
./optimSFS.gcc small.sfs -nChr 20 -nThreads 4

NB

If you have say 10 diploid samples, then you should do -nChr 20
if you have say 12 diploid samples, then you should do -nChr 24.

The outpiles are then called small.sfs.em.ml and small.sfs.ml

0.995120	0.001202	0.000469	0.000255	0.000239	0.000254	0.000125 #capped

This is to be interpreted as:

column1 := probabilty of sampling zero derived alleles

column2 := probabilty of sampling one derived allele

column3 := probabilty of sampling two derived allele

column4 := probabilty of sampling three derived allele

etc

NB

The generation of the .sfs file is done on a persite basis, whereas the optimization requires information for a region of the genome.The optimization will therefore use large amounts of memory.

@@ Line 16: / Line 16: @@
 ;-realSFS 1: an sfs file will be generated.
-;-realSFS 2: snpcalling (not implemented, in this angsd)
+;-realSFS 2: Taking into account perIndividual inbreeding coefficients. This is the work of Filipe G. Vieira
-;-realSFS 4: genotypecalling (not implemented, int this angsd)
+;-realSFS 4: snpcalling (not implemented, in this angsd)
+;-realSFS 8: genotypecalling (not implemented, int this angsd)
+For the inbreeding version you need to supply a file containing all the inbreeding coefficients. -indF
 ==options==

SFS Estimation: Difference between revisions

Revision as of 21:04, 19 December 2012

options

Example

NB

Navigation menu