RealSFSmethod - Revision history

Thorfinn: /* Likelihood of the SFS */

2014-06-30T11:23:56Z

Likelihood of the SFS

← Older revision		Revision as of 13:23, 30 June 2014
Line 24:		Line 24:
	</math>		</math>

	Here <math>\theta</math> is our SFS. In the case of the folded SFS, we use n instead of 2n in the summation. We can find the MLE of the SFS either with a BFGS approach that uses derivatives or with an EM algorithm. Both are implemented in the ~~emOptim2~~ program.		Here <math>\theta</math> is our SFS. In the case of the folded SFS, we use n instead of 2n in the summation. We can find the MLE of the SFS either with a BFGS approach that uses derivatives or with an EM algorithm. Both are implemented in the '''realSFS''' program.

Albrecht: /* Theory */

2014-03-06T10:09:04Z

Theory

← Older revision		Revision as of 12:09, 6 March 2014
Line 1:		Line 1:
	~~=Theory=~~
	~~<pre>~~		We will try to elaborate on the theory behind the methods. Below is only a preliminary version of the theory. This method is described in detail in [[Nielsen2012]].
	We will try to elaborate on the theory behind the methods. Below is only a preliminary version of the theory.
	~~</pre>~~
	This method is described in [[Nielsen2012]].
	==SFS definition==		==SFS definition==
	For 'n' diploid samples, the site frequency spectrum '''(SFS)''' is the vector of length (2n+1) containing the proportion of sites carrying 'k' mutations. This means that the first element of the SFS is the proportion of sites where we do not observe any mutations, and the second element is the proportion of sites where we observe 1 mutation. The last element is the proportion of sites where we observe only mutations. It follows that the first and last elements are the invariable categories and, assuming that the SFS contains relative frequencies, the variability in the sample can be estimated by:		For 'n' diploid samples, the site frequency spectrum '''(SFS)''' is the vector of length (2n+1) containing the proportion of sites carrying 'k' mutations. This means that the first element of the SFS is the proportion of sites where we do not observe any mutations, and the second element is the proportion of sites where we observe 1 mutation. The last element is the proportion of sites where we observe only mutations. It follows that the first and last elements are the invariable categories and, assuming that the SFS contains relative frequencies, the variability in the sample can be estimated by:

Albrecht: Created page with "=Theory=

 We will try to elaborate on the theory behind the methods. Below is only a preliminary version of the theory.

This method is described in Nielsen2012..."

2014-03-06T10:08:42Z

Created page with "=Theory= <pre> We will try to elaborate on the theory behind the methods. Below is only a preliminary version of the theory. </pre> This method is described in Nielsen2012..."

New page

=Theory=
<pre>
We will try to elaborate on the theory behind the methods. Below is only a preliminary version of the theory.
</pre>
This method is described in [[Nielsen2012]].
==SFS definition==
For 'n' diploid samples, the site frequency spectrum '''(SFS)''' is the vector of length (2n+1) containing the proportion of sites carrying 'k' mutations. This means that the first element of the SFS is the proportion of sites where we do not observe any mutations, and the second element is the proportion of sites where we observe 1 mutation. The last element is the proportion of sites where we observe only mutations. It follows that the first and last elements are the invariable categories and, assuming that the SFS contains relative frequencies, the variability in the sample can be estimated by:

<math>pvar=1-sfs_0-sfs_{2n}=\sum_{i=1}^{2n-1}sfs_i</math>
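As a small worked example of the formula above, the following Python sketch computes the proportion of variable sites from an SFS. The function name <code>proportion_variable</code> and the example values are hypothetical and chosen only for illustration; they are not part of angsd.

```python
def proportion_variable(sfs):
    """Proportion of variable sites for an SFS with 2n+1 entries."""
    total = sum(sfs)
    # Normalize so that the entries are relative frequencies.
    rel = [x / total for x in sfs]
    # Drop the two invariable categories (first and last element).
    return 1.0 - rel[0] - rel[-1]


# Hypothetical SFS for n = 2 diploid samples, i.e. 2n + 1 = 5 categories.
example_sfs = [0.90, 0.04, 0.02, 0.01, 0.03]
pvar = proportion_variable(example_sfs)
```

Because the SFS is normalized first, the same function can also be given expected counts of sites instead of proportions.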

==Sample allele frequency/Multisample GL==
By supplying the -doSaf 1 flag to angsd, angsd will calculate the likelihood of the sample allele frequency for each site and dump these into the .saf file. In this context, the likelihood of the sample allele frequency is the likelihood of sampling j derived alleles. This is estimated on the basis of the 10 possible genotype likelihoods for all individuals by summing over all combinations, using the recursive algorithm described in [[Nielsen2012]]. We write this as <math>p(X^s\mid j)</math>, meaning the likelihood of sampling j derived alleles at site s. The folded version is calculated as

<math>
p_{fold}(X^s\mid j) =p(X^s\mid j) + p(X^s\mid 2n- j),\qquad j\in\{0,1,2,\ldots,n-1\},
</math>

<math>
p_{fold}(X^s\mid j) =2p(X^s\mid j) ,\qquad j=n
</math>
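The folding step can be sketched in Python as follows. This is only an illustration of the two equations above for a single site, assuming the per-site likelihoods are held in a plain list of length 2n+1; the function name <code>fold_saf</code> is hypothetical and does not describe how angsd stores or folds the .saf file internally.

```python
def fold_saf(saf):
    """Fold a per-site likelihood vector of length 2n+1 into length n+1."""
    two_n = len(saf) - 1
    if two_n < 0 or two_n % 2 != 0:
        raise ValueError("expected a vector of length 2n+1")
    n = two_n // 2
    # Categories j = 0, ..., n-1: add the likelihood of the mirrored category 2n-j.
    folded = [saf[j] + saf[two_n - j] for j in range(n)]
    # Category j = n is its own mirror image, hence the factor 2.
    folded.append(2 * saf[n])
    return folded


# Hypothetical likelihoods for one site with n = 2 diploid samples.
example_saf = [0.5, 0.2, 0.1, 0.15, 0.05]
folded = fold_saf(example_saf)
```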

==Likelihood of the SFS==
The likelihood of the SFS is then given as:

<math>
p(X|\theta) = \prod_{s=0}^S\sum_{i=0}^{2n} p(X^s\mid i )\theta_i
</math>

Here <math>\theta</math> is our SFS. In the case of the folded SFS, we use n instead of 2n in the summation. We can find the MLE of the SFS either with a BFGS approach that uses derivatives or with an EM algorithm. Both are implemented in the emOptim2 program.
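To make the EM approach concrete, here is a minimal Python sketch that maximizes the likelihood above. It is not the implementation used by the program; it assumes the per-site likelihoods <math>p(X^s\mid i)</math> are available as a list of lists, and the function names, the uniform starting point, the stopping rule and the toy data are all assumptions made for illustration.

```python
import math


def log_likelihood(saf, theta):
    """Log of the product over sites of sum_i p(X^s | i) * theta_i."""
    return sum(math.log(sum(p * t for p, t in zip(site, theta))) for site in saf)


def em_step(saf, theta):
    """One EM update: average the per-site posterior of each category."""
    new_theta = [0.0] * len(theta)
    for site in saf:
        weighted = [p * t for p, t in zip(site, theta)]
        norm = sum(weighted)
        for i, w in enumerate(weighted):
            new_theta[i] += w / norm
    return [x / len(saf) for x in new_theta]


def estimate_sfs(saf, n_iter=100, tol=1e-8):
    """Run EM from a uniform start until the log-likelihood gain is below tol."""
    k = len(saf[0])
    theta = [1.0 / k] * k
    ll = log_likelihood(saf, theta)
    for _ in range(n_iter):
        theta = em_step(saf, theta)
        new_ll = log_likelihood(saf, theta)
        if new_ll - ll < tol:
            break
        ll = new_ll
    return theta


# Hypothetical likelihoods for 4 sites and n = 1 diploid sample (3 categories).
toy_saf = [
    [0.90, 0.08, 0.02],
    [0.80, 0.15, 0.05],
    [0.10, 0.70, 0.20],
    [0.85, 0.10, 0.05],
]
theta_hat = estimate_sfs(toy_saf)
```

Each EM update keeps <math>\theta</math> non-negative and summing to one, so no explicit constraint handling is needed. A derivative-based optimizer such as BFGS has to enforce these constraints separately.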