PCA MDS

2018-10-04T07:37:24Z

Albrecht: Undo revision 3053 by Albrecht (talk)

= single read sampling approach for PCA or MDS =

This function is available from version '''0.912''' onwards and in the latest development version on [https://github.com/ANGSD/angsd GitHub].

The PCA / MDS methods should be run on called SNP sites (use [[PCA]] if you do not want to call SNPs). SNPs can be called based on genotype likelihoods (see [[SNP_calling]]), or you can supply the variable sites you want to analyse using the [[Sites|-sites]] option.

__TOC__

=Brief Overview=
<pre>
./angsd -doIBS
-> angsd version: 0.911-26-gf1cb0e0-dirty (htslib: 1.3-1-gc72ae90) build(Apr 27 2016 11:15:33)
-> Analysis helpbox/synopsis information:
-> Command:
../angsd/angsd -doIBS -> Wed Apr 27 12:38:35 2016
--------------
abcIBS.cpp:
-doIBS 0
(Sampling strategies)
0: no IBS
1: (Sample single base)
2: (Concensus base)
-doCounts 0 Must choose -doCount 1
Optional
-minMinor 0 Minimum observed minor alleles
-minFreq 0.000 Minimum minor allele frequency
-output01 0 output 0 and 1s instead of based
-maxMis -1 Maximum missing bases (per site)
-doMajorMinor 0 use input files or data to select major and minor alleles
-makeMatrix 0 print out the ibs matrix
-doCov 0 print out the cov matrix
</pre>

==Options==
;-doIBS [int]
Print a single base from each individual at each position. 1: randomly sampled read. 2: consensus base.

;-doCounts [int]
The method requires counting the different bases at each position. Therefore, -doCounts 1 must be used.

;-doMajorMinor [int]
The covariance matrix can only be calculated for diallelic sites. Therefore, choose a method for selecting the major and minor allele (see [[Inferring_Major_and_Minor_alleles]]). This option can also be used if you only want to make this assumption for the IBS matrix, or only want to print bases that are either the major or the minor allele.

;-minMinor [int]
Minimum number of observed minor alleles. The default is 0. If you do not use -doMajorMinor, the number of minor alleles is the summed count of the 3 least common alleles.

;-minFreq [float]
Minimum minor allele frequency based on the sampled bases. The default is 0. If you do not use -doMajorMinor, the frequency is the sum of the frequencies of the 3 least common alleles.

;-output01 [int]
Output the sampled reads as 0s and 1s instead of the actual bases (the coding is described for the indX columns under Output).

;-maxMis [int]
Maximum number of missing bases per site, i.e. the maximum number of sampled bases that are allowed to be neither the major nor the minor allele.

;-makeMatrix [int]
1 prints the pairwise IBS matrix, i.e. the average distance between pairs of individuals. At each site the distance is 0 if the sampled bases are the same and 1 otherwise. You can use this for MDS (see below).

;-doCov [int]
1 prints the covariance matrix, which can be used for PCA (see below). You should use the -minFreq option to avoid sites with low allele frequency.
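
As a rough illustration of how -minMinor and -minFreq are described above when -doMajorMinor is not used, the following R sketch counts the sampled bases at one site and sums the three least common alleles. The vector <code>sampled</code> and the helper <code>minor_summary</code> are made up for this example and are not part of ANGSD.

```r
# Toy sampled bases for one site, one base per individual ("N" = missing).
# These values are invented for illustration.
sampled <- c("A", "A", "G", "A", "N", "T", "A", "A")

# Hypothetical helper: sums the three least common alleles, which is how the
# minor allele count is described when -doMajorMinor is not used.
minor_summary <- function(bases) {
  counts <- table(factor(bases[bases != "N"], levels = c("A", "C", "G", "T")))
  n_minor <- sum(counts) - max(counts)  # everything except the most common base
  c(minor_count = n_minor, minor_freq = n_minor / sum(counts))
}

res <- minor_summary(sampled)
print(res)
```

A site like this one would pass a -minMinor threshold of 2 but not a threshold of 3, under the assumption that the filter keeps sites whose count reaches the threshold.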

=== run example ===

<pre>
./angsd -bam all.files -minMapQ 30 -minQ 20 -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 -doIBS 1 -doCounts 1 -doCov 1 -makeMatrix 1 -minMaf 0.05 -P 5
</pre>

This will produce the output files described below, including the pairwise differences (.ibsMat) and the covariance matrix (.covMat). These can be used for MDS and PCA respectively (see the R example below). Note that only the PCA method requires SNP calling and allele frequency estimation.

==Output==

=== sampled bases *ibs.gz ===
This function prints the sampled bases to *.ibs.gz.
<div class="toccolours mw-collapsible mw-collapsed">
Example of output *.ibs.gz with -doMajorMinor>0 and -output01 1
<pre class="mw-collapsible-content">
chr pos major minor ind0 ind1 ind2 ind3 ind4 ind5 ind6 ind7
1 14000873 A G 0 1 1 1 1 1 1
1 14001018 C T 0 1 1 1 1 1 1
1 14001867 G A 0 1 1 1 1 0 1
1 14002342 T C 1 1 1 1 1 -1 1
1 14002422 T A 0 1 1 1 1 0 -1
1 14003581 T C 0 1 1 1 1 1 1
1 14004623 C T 0 1 1 1 1 0 1
1 14006543 T G 0 -1 1 1 1 0 1
1 14007493 G A 0 0 1 -1 1 0 1
1 14007558 T C 0 0 1 1 -1 -1 1
1 14007649 A G 0 1 1 1 1 0 1
1 14008269 A G 1 1 0 -1 1 -1 1
</pre>
</div>

<div class="toccolours mw-collapsible mw-collapsed">
Example of output *.ibs.gz with -doMajorMinor>0 and -output01 0
<pre class="mw-collapsible-content">
chr pos major minor ind0 ind1 ind2 ind3 ind4 ind5 ind6 ind7
1 13116 G T N G T T N G N T
1 13118 G A N G A A N G N A
1 14930 A G G G G A N N A N
1 15211 T G N G T G N N N G
1 54490 A G N G N G N N N N
1 54716 T C T C C C T N N N
1 58814 A G N G N G G G N N
1 62777 T A N N A N A A A N
1 63268 C T N T N T C N T N
1 63671 A G N G N N G G G N
1 69428 G T N G T N N T T N
1 69761 T A A A T A N A N N
</pre>
</div>
<div class="toccolours mw-collapsible mw-collapsed">
Example of output *.ibs.gz with -doMajorMinor 0 and -output01 0
<pre class="mw-collapsible-content">
chr pos major ind0 ind1 ind2 ind3 ind4 ind5 ind6 ind7 ind8
1 13116 T N G T T N G N T T
1 13118 A N G A A N G N A A
1 14930 A G G G A N N A N G
1 15211 G N G T G N N N G G
1 54490 G N G N G N N N N A
1 54716 C T C C C T N N N C
1 58814 G N G N G G G N N G
1 62777 A N N A N A A A N A
1 63268 T N T N T C N T N N
1 63336 C C C C C C N C N N
1 63671 G N G N N G G G N N
</pre>
</div>

'''chr''' is the chromosome

'''pos''' is the position

'''major''' is the major allele

'''minor''' is the minor allele. Requires -doMajorMinor.

'''indX''' is the sampled base for individual number X. If -output01 1 is used, it is 1 for major, 0 for non-major and -1 for missing.
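
The 0/1 layout can be loaded into R for quick checks. The sketch below is only an illustration: the table in <code>txt</code> is invented toy data in the -output01 1 layout (for a real run you would pass the *.ibs.gz file name to <code>read.table</code> instead), and the variable names are arbitrary.

```r
# Toy excerpt in the -output01 1 layout; values are invented for illustration.
txt <- "chr pos major minor ind0 ind1 ind2 ind3
1 100 A G 1 1 0 -1
1 200 C T 1 -1 -1 1
1 300 G A 0 1 1 1"
ibs <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

# Keep only the indX columns and recode -1 (missing) as NA
h <- as.matrix(ibs[, grep("^ind", names(ibs))])
h[h == -1] <- NA

# Number of missing bases per site and per individual
missing_per_site <- rowSums(is.na(h))
missing_per_ind <- colSums(is.na(h))
print(missing_per_site)
print(missing_per_ind)
```

Counting missing entries per site in this way gives a feel for how many sites a given -maxMis value would leave.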

=== sample based IBS matrix *.ibsMat ===
This function will print the pairwise IBS distance
<div class="toccolours mw-collapsible mw-collapsed">
Example of output *.ibsMat with -makeMatrix 1
<pre class="mw-collapsible-content">
0.000000 0.510638 0.606383 0.595745 0.545455 0.428571
0.510638 0.000000 0.154639 0.154639 0.108911 0.408602
0.606383 0.154639 0.000000 0.121212 0.137255 0.489362
0.595745 0.154639 0.121212 0.000000 0.106796 0.484211
0.545455 0.108911 0.137255 0.106796 0.000000 0.404040
0.428571 0.408602 0.489362 0.484211 0.404040 0.000000
0.577320 0.121212 0.181818 0.171717 0.097087 0.473684
0.536082 0.090000 0.138614 0.118812 0.047619 0.428571
0.262500 0.571429 0.702381 0.694118 0.632184 0.353659
0.458333 0.383838 0.484848 0.494949 0.398058 0.368421
</pre>
</div>

Nind x Nind matrix with pairwise IBS distance

=== sample based covariance matrix *.covMat ===
This function will print the covariance matrix based on a single sampled read
<div class="toccolours mw-collapsible mw-collapsed">
Example of output *.covMat with -doCov 1
<pre class="mw-collapsible-content">

1.098251 -0.026225 -0.005617 -0.014726 -0.022438 -0.021786
-0.026225 1.115986 -0.017167 0.000735 -0.017163 -0.016899
-0.005617 -0.017167 1.074779 -0.015685 -0.019819 -0.015473
-0.014726 0.000735 -0.015685 1.072853 -0.013641 -0.007789
-0.022438 -0.017163 -0.019819 -0.013641 1.094612 -0.016045
-0.021786 -0.016899 -0.015473 -0.007789 -0.016045 1.059264
-0.005831 -0.009854 -0.001269 -0.002362 -0.018479 -0.011942
-0.015399 -0.020010 -0.001296 -0.022947 -0.006515 -0.003938
-0.001730 -0.040534 -0.002295 -0.017442 -0.024194 -0.007469
-0.016094 -0.015303 -0.018302 -0.022502 -0.030503 -0.001208
-0.122045 -0.106068 -0.103089 -0.104443 -0.110237 -0.103610
-0.106553 -0.100202 -0.104754 -0.109399 -0.107645 -0.111665
-0.108945 -0.102440 -0.105292 -0.101372 -0.107110 -0.106639
</pre>
</div>

Nind x Nind covariance matrix

==Model==

=== IBS ===

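The pairwise distance between individuals i and j is
<math>
d_{ij} = \frac{1}{M}\sum_{m=1}^{M} \left(1-I_{b_j}(b_i)\right)
</math>
where M is the number of sites with a read for both individuals, and <math>b_i</math> and <math>b_j</math> are the bases sampled for individuals i and j at site m. <math> I_{b_j}(b_i) </math> is the indicator function, which is equal to one when the two individuals have the same base and zero otherwise, so each site adds 0 to the distance when the bases match and 1 when they differ.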
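
A minimal R sketch of this distance, assuming the sampled bases are held in a character matrix with sites in rows, individuals in columns and "N" for missing. The matrix <code>bases</code> and the helper <code>ibs_dist</code> are hypothetical and only mirror the formula; they are not ANGSD's implementation.

```r
# Toy matrix of sampled bases: rows are sites, columns are individuals.
# "N" marks a site without a read; values are invented for illustration.
bases <- rbind(
  c("A", "A", "G"),
  c("C", "T", "C"),
  c("G", "G", "N"),
  c("T", "N", "T")
)

# Hypothetical helper: d_ij is the fraction of mismatching sites among the
# sites where both individuals have a sampled base.
ibs_dist <- function(b) {
  n <- ncol(b)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      both <- b[, i] != "N" & b[, j] != "N"      # sites counted in M
      d[i, j] <- mean(b[both, i] != b[both, j])  # mean of 1 - I over those sites
    }
  }
  d
}

d <- ibs_dist(bases)
print(d)
```

Note that M differs between pairs, because each pair is compared only at the sites where both individuals have data.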

=== Covariance ===

Allele frequency based on single reads.

<math>
f_{m} = \frac{N_{minor}}{N_{major} + N_{minor}}
</math>

<math>
cov(ij) = \frac{1}{M}\sum_{m=1}^{M} \frac{ (h^i_m-f_m)(h^j_m-f_m) }{f_m(1-f_m)}
</math>

where M is the number of sites with a read for both individuals, and <math> h^i_m</math> is 1 if individual i has the major allele at site m and zero otherwise.
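
The covariance formula can be sketched in R as follows. This is an illustration under stated assumptions, not ANGSD's code: the 0/1 matrix <code>h</code> is invented toy data, <code>single_read_cov</code> is a hypothetical helper, and the sketch takes h and f to refer to the same allele (the one coded 1). Swapping both to the other allele flips the sign of both factors in the numerator and leaves the denominator unchanged, so each term stays the same.

```r
# Toy 0/1 matrix: rows are sites, columns are individuals, NA = missing.
# Values are invented for illustration.
h <- rbind(
  c(1, 0, 1, 1),
  c(0, 1, 1, 1),
  c(1, 1, 0, NA),
  c(1, 1, 0, 0)
)

# Hypothetical helper mirroring the covariance formula
single_read_cov <- function(h) {
  f <- rowMeans(h, na.rm = TRUE)    # per-site frequency of the allele coded 1
  z <- (h - f) / sqrt(f * (1 - f))  # (h - f) scaled by sqrt(f(1-f)), row by row
  n <- ncol(h)
  cv <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      both <- !is.na(z[, i]) & !is.na(z[, j])  # sites counted in M
      cv[i, j] <- mean(z[both, i] * z[both, j])
    }
  }
  cv
}

cv <- single_read_cov(h)
print(round(cv, 3))
```

Sites where f is 0 or 1 would give a division by zero here, which is one reason to filter on allele frequency before computing the matrix.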

=MDS/PCA using R=

[[File:PCA_MDS.png|thumb]]

<pre>
## MDS
name <- "angsdput.ibsMat"
m <- as.matrix(read.table(name))
mds <- cmdscale(as.dist(m))
## col colours the individuals in three consecutive groups of 10; adjust to your data
plot(mds,lwd=2,ylab="Dist",xlab="Dist",main="multidimensional scaling",col=rep(1:3,each=10))

## PCA
name <- "angsdput.covMat"
m <- as.matrix(read.table(name))
e <- eigen(m)
plot(e$vectors[,1:2],lwd=2,ylab="PC 2",xlab="PC 1",main="Principal components",col=rep(1:3,each=10),pch=16)

</pre>
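
When plotting principal components it is often useful to report how much of the variance each component carries. The sketch below shows one way to get this from the eigenvalues; the small matrix <code>m</code> is an invented stand-in for a *.covMat file, so the numbers themselves mean nothing.

```r
# Toy symmetric matrix standing in for a *.covMat file (invented values)
m <- matrix(c( 1.0, -0.2, -0.3,
              -0.2,  1.1, -0.4,
              -0.3, -0.4,  1.2), nrow = 3, byrow = TRUE)

e <- eigen(m, symmetric = TRUE)

# Share of the total variance carried by each component, in percent
pct <- 100 * e$values / sum(e$values)
print(round(pct, 1))

# Axis labels that include the percentages
labs <- paste0("PC ", 1:2, " (", round(pct[1:2], 1), "%)")
print(labs)
```

With real data some eigenvalues can be slightly negative, in which case these percentages are only a rough guide.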

=other fun stuff=

<pre>
## heatmap / clustering / trees
name <- "angsdput.ibsMat" # or covMat
m <- as.matrix(read.table(name))
# heat map
heatmap(m)
# neighbour joining (requires the ape package)
plot(ape::nj(m))
# hierarchical clustering with average linkage
plot(hclust(dist(m), "ave"))
</pre>

PCA MDS

2018-10-04T07:31:29Z

Albrecht: /* sampled bases *ibs.gz */

= single read sampling approach for PCA or MDS =

This function is new and works from version '''0.912''' and in the latest developmental version from [https://github.com/ANGSD/angsd github]

For the PCA / MDS methods you should called SNP sites (use [[PCA]] if you do not want to call SNPs). SNPs can be called based on genotype likelihoods (see [[SNP_calling]]) or you can give the variable sites you want analysis using the [[Sites|-sites]] options.

__TOC__

=Brief Overview=
<pre>
./angsd -doIBS
-> angsd version: 0.911-26-gf1cb0e0-dirty (htslib: 1.3-1-gc72ae90) build(Apr 27 2016 11:15:33)
-> Analysis helpbox/synopsis information:
-> Command:
../angsd/angsd -doIBS -> Wed Apr 27 12:38:35 2016
--------------
abcIBS.cpp:
-doIBS 0
(Sampling strategies)
0: no IBS
1: (Sample single base)
2: (Concensus base)
-doCounts 0 Must choose -doCount 1
Optional
-minMinor 0 Minimum observed minor alleles
-minFreq 0.000 Minimum minor allele frequency
-output01 0 output 0 and 1s instead of based
-maxMis -1 Maximum missing bases (per site)
-doMajorMinor 0 use input files or data to select major and minor alleles
-makeMatrix 0 print out the ibs matrix
-doCov 0 print out the cov matrix
</pre>

==Options==
;-doIBS [int]
Print a single base from each individual at each position. 1: random sampled read. 2: Consensus base

;-doCounts [int]
Method requeres counting the different bases at each position. Therefore, -doCounts 1 must be used

;-doMajorMinor [int]
The covariance matrix can only be calculated for diallelic sites. Therefore, choose a methods for selecting the major and minor allele (see [[Inferring_Major_and_Minor_alleles]]). This can also be use if you only want to make this assumption for the IBS matrix or only want to print out bases that are either the major or minor.

;-minMinor [int]
Minimum observed minor alleles. The default in 0. If you do not use -doMajorMinor then the number of minor alleles are the sum of the 3 most uncommon alleles.

;-minFreq [float]
Minimum minor allele frequency based on the sampled bases. The default in 0. If you do not use -doMajorMinor then the frequency is the sum of the frequencies of the 3 most uncommon alleles.

;-output01 [int]
output the samples reads as 0 (for major) and 1s (for non major) instead of actual base

;-maxMis [int]
Maximum missing bases (per site) i.e. is the maximum number of allowed non-major/minor sampled bases

;-makeMatrix [int]
1 prints out the pairwise IBS matrix. This is the avg. distance between pairs of individuals. Distance is zero if the base in the same and 1 otherwise. You can use this for MDS (see below)

;-doCov [int]
1 print out the covariance matrix which can be used for PCA (see below). You should use the -minFreq option to avoid sites with low allele frequency.

=== run example ===

<pre>
./angsd -bam all.files -minMapQ 30 -minQ 20 -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 -doIBS 1 -doCounts 1 -doCov 1 -makeMatrix 1 -minMaf 0.05 -P 5
</pre>

This will produce the output (see below) which includes pairwise differences (.ibsMat) and the covariance matrix (.covMat). These can be used for MDS and PCA respectively (see R example below). Note that only the PCA method require SNP calling and allele frequency estimation.

==Output==

=== sampled bases *ibs.gz ===
This function will print the sampled based *ibs.gz.
<div class="toccolours mw-collapsible mw-collapsed">
Example of output *.ibs.gz with -doMajorMinor>0 and -output01 1
<pre class="mw-collapsible-content">
chr pos major minor ind0 ind1 ind2 ind3 ind4 ind5 ind6 ind7
1 14000873 A G 0 1 1 1 1 1 1
1 14001018 C T 0 1 1 1 1 1 1
1 14001867 G A 0 1 1 1 1 0 1
1 14002342 T C 1 1 1 1 1 -1 1
1 14002422 T A 0 1 1 1 1 0 -1
1 14003581 T C 0 1 1 1 1 1 1
1 14004623 C T 0 1 1 1 1 0 1
1 14006543 T G 0 -1 1 1 1 0 1
1 14007493 G A 0 0 1 -1 1 0 1
1 14007558 T C 0 0 1 1 -1 -1 1
1 14007649 A G 0 1 1 1 1 0 1
1 14008269 A G 1 1 0 -1 1 -1 1
</pre>
</div>

<div class="toccolours mw-collapsible mw-collapsed">
Example of output *.ibs.gz with -doMajorMinor>0 and -output01 0
<pre class="mw-collapsible-content">
chr pos major minor ind0 ind1 ind2 ind3 ind4 ind5 ind6 ind7
1 13116 G T N G T T N G N T
1 13118 G A N G A A N G N A
1 14930 A G G G G A N N A N
1 15211 T G N G T G N N N G
1 54490 A G N G N G N N N N
1 54716 T C T C C C T N N N
1 58814 A G N G N G G G N N
1 62777 T A N N A N A A A N
1 63268 C T N T N T C N T N
1 63671 A G N G N N G G G N
1 69428 G T N G T N N T T N
1 69761 T A A A T A N A N N
</pre>
</div>
<div class="toccolours mw-collapsible mw-collapsed">
Example of output *.ibs.gz with -doMajorMinor 0 and -output01 0
<pre class="mw-collapsible-content">
chr pos major ind0 ind1 ind2 ind3 ind4 ind5 ind6 ind7 ind8
1 13116 T N G T T N G N T T
1 13118 A N G A A N G N A A
1 14930 A G G G A N N A N G
1 15211 G N G T G N N N G G
1 54490 G N G N G N N N N A
1 54716 C T C C C T N N N C
1 58814 G N G N G G G N N G
1 62777 A N N A N A A A N A
1 63268 T N T N T C N T N N
1 63336 C C C C C C N C N N
1 63671 G N G N N G G G N N
</pre>
</div>

'''chr''' is the chromosome

'''pos''' is the position
'''major''' is the major allele

'''minor''' is the minor allele. Needs -doMajorMinor

'''indX''' is the sampled base for individual number X. if -output01 1 then it is 1 for major, 0 for non major and -1 for missing

=== sample based IBS matrix *.ibsMat ===
This function will print the pairwise IBS distance
<div class="toccolours mw-collapsible mw-collapsed">
Example of output *.ibsMat with -makeMatrix 1
<pre class="mw-collapsible-content">
0.000000 0.510638 0.606383 0.595745 0.545455 0.428571
0.510638 0.000000 0.154639 0.154639 0.108911 0.408602
0.606383 0.154639 0.000000 0.121212 0.137255 0.489362
0.595745 0.154639 0.121212 0.000000 0.106796 0.484211
0.545455 0.108911 0.137255 0.106796 0.000000 0.404040
0.428571 0.408602 0.489362 0.484211 0.404040 0.000000
0.577320 0.121212 0.181818 0.171717 0.097087 0.473684
0.536082 0.090000 0.138614 0.118812 0.047619 0.428571
0.262500 0.571429 0.702381 0.694118 0.632184 0.353659
0.458333 0.383838 0.484848 0.494949 0.398058 0.368421
</pre>
</div>

Nind x Nind matrix with pairwise IBS distance

=== sample based covariance matrix *.covMat ===
This function will print the covariance matrix based on a single sampled read
<div class="toccolours mw-collapsible mw-collapsed">
Example of output *.covMat with -doCov 1
<pre class="mw-collapsible-content">

1.098251 -0.026225 -0.005617 -0.014726 -0.022438 -0.021786
-0.026225 1.115986 -0.017167 0.000735 -0.017163 -0.016899
-0.005617 -0.017167 1.074779 -0.015685 -0.019819 -0.015473
-0.014726 0.000735 -0.015685 1.072853 -0.013641 -0.007789
-0.022438 -0.017163 -0.019819 -0.013641 1.094612 -0.016045
-0.021786 -0.016899 -0.015473 -0.007789 -0.016045 1.059264
-0.005831 -0.009854 -0.001269 -0.002362 -0.018479 -0.011942
-0.015399 -0.020010 -0.001296 -0.022947 -0.006515 -0.003938
-0.001730 -0.040534 -0.002295 -0.017442 -0.024194 -0.007469
-0.016094 -0.015303 -0.018302 -0.022502 -0.030503 -0.001208
-0.122045 -0.106068 -0.103089 -0.104443 -0.110237 -0.103610
-0.106553 -0.100202 -0.104754 -0.109399 -0.107645 -0.111665
-0.108945 -0.102440 -0.105292 -0.101372 -0.107110 -0.106639
</pre>
</div>

Nind x Nind covariance matrix

==Model==

=== IBS ===

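Pairwise distance between individuals:
<math>
d_{ij} = \frac{\sum_{m=1}^M \left(1-I_{b_j}(b_i)\right)}{M}
</math>
where M is the number of sites with a read for both individuals, and <math>b_i</math> and <math>b_j</math> are the bases sampled at site m for individuals i and j. <math> I_{b_j}(b_i) </math> is the indicator function, equal to one when the two individuals have the same base and zero otherwise, so <math> 1-I_{b_j}(b_i) </math> counts the sites where they differ.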
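The distance can be sketched in a few lines of Python. This is a minimal illustration, assuming the sampled bases of each individual are held as one string with 'N' for missing data (as in the *.ibs.gz output); the function name <code>ibs_distance</code> is hypothetical and not part of ANGSD.

```python
def ibs_distance(bases_i, bases_j, missing="N"):
    """Fraction of differing bases among sites where both individuals have data."""
    shared = 0      # M: sites with a sampled base in both individuals
    mismatches = 0  # sites where the two sampled bases differ
    for b_i, b_j in zip(bases_i, bases_j):
        if b_i == missing or b_j == missing:
            continue
        shared += 1
        if b_i != b_j:
            mismatches += 1
    # With no shared sites the distance is undefined
    return mismatches / shared if shared else float("nan")


# Toy data: five sites, the last one missing in the second individual
print(ibs_distance("GGGGG", "TAGTN"))  # 0.75
```

Sites missing in either individual are skipped, so M can differ between pairs of individuals.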

=== Covariance ===

Allele frequency based on single reads.

<math>
f_{m} = \frac{N_{minor}}{N_{major} + N_{minor}}
</math>

<math>
cov(ij) = \frac{1}{M}\sum_{m=1}^M \frac{ (h^i_m-f_m)(h^j_m-f_m) }{f_m(1-f_m)}
</math>

where M is the number of sites with a read for both individuals, and <math> h^i_m</math> is 1 if individual i has the major allele at site m and zero otherwise.
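A minimal Python sketch of the covariance for one pair of individuals. It assumes that the 0/1 indicators and the frequency refer to the same allele (the value of the expression is unchanged if both are switched to the other allele), that missing reads are coded as <code>None</code>, and that monomorphic sites are skipped to avoid dividing by zero. The function name and these conventions are illustrative and not taken from the ANGSD source.

```python
def single_read_cov(h_i, h_j, freqs):
    """Covariance between two individuals from 0/1 allele indicators."""
    total = 0.0
    shared = 0  # M: sites with a read in both individuals
    for a, b, f in zip(h_i, h_j, freqs):
        if a is None or b is None or f <= 0.0 or f >= 1.0:
            continue  # skip missing data and monomorphic sites
        total += (a - f) * (b - f) / (f * (1.0 - f))
        shared += 1
    return total / shared if shared else float("nan")


# Toy data: four sites, the third one missing in the first individual
h_i = [1, 0, None, 1]
h_j = [1, 1, 0, 0]
freqs = [0.5, 0.5, 0.5, 0.25]
print(single_read_cov(h_i, h_j, freqs))
```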

=MDS/PCA using R=

[[File:PCA_MDS.png|thumb]]

<pre>
## MDS
name <- "angsdput.ibsMat"
m <- as.matrix(read.table(name))
mds <- cmdscale(as.dist(m))
plot(mds,lwd=2,ylab="Dist",xlab="Dist",main="multidimensional scaling",col=rep(1:3,each=10))

## PCA
name <- "angsdput.covMat"
m <- as.matrix(read.table(name))
e <- eigen(m)
plot(e$vectors[,1:2],lwd=2,ylab="PC 2",xlab="PC 1",main="Principal components",col=rep(1:3,each=10),pch=16)

</pre>

=other fun stuff=

<pre>
## heatmap / clustering / trees
name <- "angsdput.ibsMat" # or covMat
m <- as.matrix(read.table(name))
#heat map
heatmap(m)
#neighbour joining
plot(ape::nj(m))
#hierarchical clustering
plot(hclust(dist(m), "ave"))
</pre>

Genotype Likelihoods

2018-05-18T12:14:11Z

Albrecht: /* SYK (Kim et al.) */

Many methods in ANGSD are based on genotype likelihoods, and ANGSD has 4 different genotype likelihood models implemented.

Genotype likelihoods and the four models are described in the [[#Theory | Theory]] section at the bottom of this page.

The SOAPsnp model requires that a reference is supplied. Preferably the recalibration should only be performed on non-variable sites, so we recommend modifying the reference fasta such that all SNP sites have an 'N'.

We also allow for output of the calculated genotype likelihoods in various formats that might be handy for some users.

;NB the GATK model described and implemented in this program is the one described in the first GATK paper. This might be drastically different from the one used in newer versions of GATK.
__TOC__
=Brief Overview=
<pre>
./angsd -GL
-> angsd version: 0.567 build(Dec 7 2013 14:56:25)
-> Analysis helpbox/synopsis information:
---------------------
analysisEstLikes.cpp:
-GL=0:
1: SAMtools
2: GATK
3: SOAPsnp
4: SYK
-trim 0 (zero means no trimming)
-tmpdir angsd_tmpdir/ (used by SOAPsnp)
-errors (null) (used by SYK)
-minInd 0 (0 indicates no filtering)

Filedumping:
-doGlf 0
1: binary glf (10 log likes) .glf.gz
2: beagle likelihood file .beagle.gz
3: binary 3 times likelihood .glf.gz
4: text version (10 log likes) .glf.gz
</pre>
=Options=
; -GL [int]
If your input is a sequencing file you can estimate genotype likelihoods from the mapped reads. Four different methods are available.
# SAMtools model
# GATK model
# SOAPsnp model
# SYK model

; NB
When estimating GL with SOAPsnp we need to generate a calibration matrix. This is done automatically if the matrix files do not exist. They are located in angsd_tmpdir/basenameNUM.count and angsd_tmpdir/basenameNUM.qual. The read length is not allowed to exceed 256 base pairs.

;-trim [int]
This will discard [int] bases at both ends of the reads when calculating the genotype likelihoods.

;-tmpdir [directoryPath]
Default is `angsd_tmpdir`. SOAPsnp generates a mismatch matrix for each BAM file, and from this matrix it calculates the type specific errors for each position in the read. It therefore generates two files for each BAM file; to avoid cluttering up the working directory you can specify the folder that should be used. SOAPsnp assumes that all reads have the same length. If this is not the case the model might not be suited (this is also true for other recalibration tools).

;-errors [fileName]
SYK model requires a file containing the type specific errors, as estimated from [[Error estimation | -doError 1]].

;-minInd [int]
Discard the sites where we don't have data from '''-minInd''' individuals. If you have 100 individuals, and you only want to base your downstream analysis on the sites where you have data for at least half your samples then set '''-minInd 50'''.

==Filtering==
See [[Input#BAM_files]] for Bam specific filters.

=Examples=
SAMtools and GATK likelihood are chosen simply with

<pre>
./angsd -GL 1 #SAMtools
./angsd -GL 2 #GATK
</pre>

SOAPsnp and SYK require some extra arguments as shown below.
==SOAPsnp==
First run through the BAM files once to generate the calibration matrix
<pre>
./angsd -bam bam.filelist -GL 3 -out outfile -ref hg19.fa -minQ 0
#NB important to set -minQ to zero, ANGSD defaults to minQ 13
</pre>

This first run estimates nothing but the calibration matrix.

After this run we can estimate the genotype likelihoods and perform any further analysis we desire.

<pre>
./angsd -bam bam.filelist -GL 3 -out outfile -doGlf 1
</pre>

==SYK (Kim et al.)==
<pre>
./angsd -bam bam.filelist -GL 4 -out outfile -errors error.file -doCounts 1
</pre>

This model is based on counts of bases and therefore needs [[Alleles_counts]] "-doCounts 1". The error file is one line of 16 values as output by -doError.

=Output genotype likelihoods=
; -doGlf [int]
Output the log genotype likelihoods to a file

;0. don't output the genotype likelihoods (default)

;1. binary all 10 log genotype likelihood

;2. beagle genotype likelihood format (use directly for imputation)

;3. beagle binary

;4. textoutput of all 10 log genotype likelihoods.

==Binary==
Glf file in binary doubles. All 10 genotype likelihoods are printed to a file. For each printed site there are 10*N doubles where N is the number of individuals. The order of the 10 genotypes is alphabetical: AA AC AG AT CC CG CT GG GT TT. The values are log scaled likelihood ratios relative to the most likely genotype.

Pseudocode for parsing these files in '''c/c++'''.
<pre>
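#include <stdio.h>
/* read the 10 log likelihoods of every individual for the first site */
FILE *fp = fopen("genotypelikelihood.bin", "rb");
int nInd = 5;
double gls[5*10];
fread(gls, sizeof(double), nInd*10, fp);
fclose(fp);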
</pre>
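The same parsing can be sketched in Python. This is an illustration under stated assumptions: the file is a gzip-compressed stream of native-endian doubles (as the .glf.gz suffix in the help text suggests), and within a site the 10 values of the first individual come first, followed by the 10 values of the second individual, and so on. The function name <code>read_glf_sites</code> is hypothetical.

```python
import gzip
import struct


def read_glf_sites(path, n_ind):
    """Yield, for each site, n_ind lists of 10 log likelihoods."""
    site_bytes = 10 * n_ind * 8  # 10 doubles per individual, 8 bytes per double
    with gzip.open(path, "rb") as fh:
        while True:
            chunk = fh.read(site_bytes)
            if len(chunk) < site_bytes:
                break  # end of file, or a truncated final record
            values = struct.unpack("%dd" % (10 * n_ind), chunk)
            # Assumed layout: individual by individual within a site
            yield [list(values[k * 10:(k + 1) * 10]) for k in range(n_ind)]
```

The binary file holds no chromosome or position, so the number of individuals has to be supplied by the caller.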

==Beagle format==
Beagle haplotype imputation can be performed directly on genotype likelihoods. To generate a beagle input file use

; -doGlf 2

In order to make this file the major and minor allele have to be inferred [[Inferring Major and Minor alleles | -doMajorMinor]]. It is also a good idea to only use the polymorphic sites.

===Example===
In this example our input files are BAM files. We use the SAMtools genotype likelihood method with 10 threads. We infer the major and minor allele from the likelihoods and estimate the allele frequencies. We test for polymorphic sites and only output those with a likelihood ratio test statistic of at least 24 (ca. p-value<1e-6).
<pre>
./angsd -GL 1 -out genolike -nThreads 10 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 2e-6 -bam bam.filelist
</pre>

===Output===
The above command generates the file genolike.beagle.gz that can be used as input for the beagle software

<pre>
marker allele1 allele2 Ind0 Ind0 Ind0 Ind1 Ind1 Ind1 Ind2 Ind2 Ind2 Ind3 Ind3 Ind3
1_14000023 1 0 0.941177 0.058822 0.000001 0.799685 0.199918 0.000397 0.666316 0.333155 0.000529
1_14000072 2 3 0.709983 0.177493 0.112525 0.941178 0.058822 0.000000 0.665554 0.332774 0.001672
1_14000113 0 2 0.855993 0.106996 0.037010 0.333333 0.333333 0.333333 0.799971 0.199989 0.000040
1_14000202 2 0 0.835380 0.104420 0.060201 0.799685 0.199918 0.000397 0.333333 0.333333 0.333333
...
</pre>
Note that the above values sum to one per site for each individual. This is just a normalization of the genotype likelihoods to avoid underflow problems in the beagle software; it does not mean that they are genotype probabilities.

; column 1 (marker)
the chromosome and position
; column 2 (allele 1)
the major allele codes as 0=A, 1=C, 2=G, 3=T
; column 3 (allele 2)
the minor allele codes as 0=A, 1=C, 2=G, 3=T
; column 4 (Ind0)
Genotype likelihood for the major/major genotype for the first individual
; column 5 (Ind0)
Genotype likelihood for the major/minor genotype for the first individual
; column 6 (Ind0)
Genotype likelihood for the minor/minor genotype for the first individual
; column 7 (Ind1)
Genotype likelihood for the major/major genotype for the second individual
...
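The three values per individual can be derived from the 10 log genotype likelihoods once the major and minor allele are known. The sketch below is illustrative only: it assumes natural-log likelihoods in the alphabetical genotype order and alleles coded 0=A, 1=C, 2=G, 3=T. The helper names are hypothetical and do not come from ANGSD.

```python
import math

# Index of the first genotype starting with A, C, G, T in the
# alphabetical order AA AC AG AT CC CG CT GG GT TT
OFFSETS = [0, 4, 7, 9]


def genotype_index(a, b):
    """Index (0-9) of the unordered genotype of alleles a and b (0=A..3=T)."""
    lo, hi = min(a, b), max(a, b)
    return OFFSETS[lo] + (hi - lo)


def beagle_triplet(log_gls, major, minor):
    """Normalised likelihoods for major/major, major/minor and minor/minor."""
    picks = [genotype_index(major, major),
             genotype_index(major, minor),
             genotype_index(minor, minor)]
    likes = [math.exp(log_gls[k]) for k in picks]
    total = sum(likes)
    return [x / total for x in likes]


log_gls = [-2.072327, -0.693156, -2.072327, -2.072327, 0.0,
           -0.693156, -0.693156, -2.072327, -2.072327, -2.072327]
print(beagle_triplet(log_gls, major=1, minor=0))  # C as major, A as minor
```

The returned values sum to one by construction, which is the normalization described above and not a posterior probability.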

==Simple Text Format==
<pre>
./angsd -GL 1 -bam bam.filelist -doGlf 4 -nInd 1
</pre>
We use SAMtools genotype likelihoods from the first sample ('''-nInd 1''') in the file list called '''bam.filelist'''.

Generates '''angsdput.glf.gz''', which looks like:
<pre>
1 13999965 -2.072327 -0.693156 -2.072327 -2.072327 0.000000 -0.693156 -0.693156 -2.072327 -2.072327 -2.072327
1 13999966 -2.072327 -2.072327 -0.693156 -2.072327 -2.072327 -0.693156 -2.072327 0.000000 -0.693156 -2.072327
1 13999967 0.000000 -0.693156 -0.693156 -0.693156 -2.072327 -2.072327 -2.072327 -2.072327 -2.072327 -2.072327
1 13999968 -2.072327 -2.072327 -0.693156 -2.072327 -2.072327 -0.693156 -2.072327 0.000000 -0.693156 -2.072327
1 13999969 0.000000 -0.693156 -0.693156 -0.693156 -2.072327 -2.072327 -2.072327 -2.072327 -2.072327 -2.072327
1 13999970 -2.072327 -2.072327 -2.072327 -0.693156 -2.072327 -2.072327 -0.693156 -2.072327 -0.693156 0.000000
</pre>
The first 2 columns are the chromosome and position, and the final 10 values are the genotype likelihoods in the usual ordering.
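A line of this text output can be parsed as sketched below. The sketch assumes whitespace-separated fields and, when several individuals are present, 10 consecutive values per individual; the function names are hypothetical.

```python
GENOTYPES = ["AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"]


def parse_glf_text_line(line, n_ind=1):
    """Return (chromosome, position, per-individual log likelihoods)."""
    fields = line.split()
    chrom, pos = fields[0], int(fields[1])
    values = [float(x) for x in fields[2:]]
    if len(values) != 10 * n_ind:
        raise ValueError("expected %d likelihoods, got %d" % (10 * n_ind, len(values)))
    per_ind = [values[k * 10:(k + 1) * 10] for k in range(n_ind)]
    return chrom, pos, per_ind


def most_likely_genotype(log_gls):
    """Genotype with the highest log likelihood (the first one on ties)."""
    best = max(range(10), key=lambda k: log_gls[k])
    return GENOTYPES[best]


line = ("1 13999965 -2.072327 -0.693156 -2.072327 -2.072327 0.000000 "
        "-0.693156 -0.693156 -2.072327 -2.072327 -2.072327")
chrom, pos, gls = parse_glf_text_line(line)
print(chrom, pos, most_likely_genotype(gls[0]))  # 1 13999965 CC
```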
=Which genotype likelihood model should I choose ?=
It depends on the data. As shown in the [[Glcomparison]] example, there was a huge difference between '''-GL 1''' and '''-GL 2''' for older 1000genomes BAM files, but little difference for newer BAM files.

=Theory=
Genotype likelihoods are in this context the likelihood of the data given a genotype. That is, we take all the information in our data at a specific position for a single individual and use it to calculate the likelihood of each possible genotype. Since we assume diploid individuals and four possible bases, there are 10 different genotypes.

{| class="wikitable" style="text-align: center; color: green;"
|0
|1
|2
|3
|4
|5
|6
|7
|8
|9
|-
|AA
|AC
|AG
|AT
|CC
|CG
|CT
|GG
|GT
|TT
|}
We write the genotype likelihood as
<math>
L(G=\{A_1 ,A_2\}|D ) \propto Pr (D|G=\{A_1 ,A_2\} ),\qquad A_1 ,A_2 \in \{A,C,G,T\}.
</math>

==GATK genotype likelihoods==
In angsd we use the direct method of the first version of GATK (dragon). This is simply

<math>
Pr(D|G=\{A_1,A_2\})=\prod_{i=1}^M Pr \left ( b_i|G=\{A_1,A_2\} \right) = \prod_{i=1}^M (\frac{1}{2}Pr( b_i|A_1) + \frac{1}{2}Pr( b_i|A_2) )
</math>

<math>
Pr(b|A) =\left\{
\begin{array}{lr}
\frac{e}{3} & : b \neq A\\
1-e & : b = A
\end{array}
\right.
</math>

where M is the sequencing depth, <math>b_i</math> is the observed base in read ''i'', and ''e'' is the probability of error calculated from the phred-scaled qscore, i.e. <math> e=10^{-q/10} </math>
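The two formulas above translate directly into code. The following Python sketch is for illustration and is not the ANGSD implementation: it takes the read bases and phred scores at one site for one individual, and the function names and the use of natural logarithms are assumptions.

```python
import math

GENOTYPES = ["AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"]


def base_prob(b, allele, e):
    """Pr(b | A): 1 - e if the read base matches the allele, e / 3 otherwise."""
    return 1.0 - e if b == allele else e / 3.0


def gatk_log_likelihoods(bases, quals):
    """Log likelihood of each of the 10 genotypes given bases and phred scores."""
    log_gls = []
    for a1, a2 in GENOTYPES:  # each genotype string unpacks into its two alleles
        total = 0.0
        for b, q in zip(bases, quals):
            e = 10.0 ** (-q / 10.0)  # phred score to error probability
            total += math.log(0.5 * base_prob(b, a1, e) + 0.5 * base_prob(b, a2, e))
        log_gls.append(total)
    return log_gls


def rescale_to_best(log_gls):
    """Subtract the maximum so the most likely genotype gets the value 0."""
    best = max(log_gls)
    return [x - best for x in log_gls]


# Three reads at one site: two A and one C, all with qscore 30
print(rescale_to_best(gatk_log_likelihoods("AAC", [30, 30, 30])))
```

Summing logs over reads instead of multiplying probabilities avoids underflow at high depth; the final rescaling gives values relative to the most likely genotype, as in the glf output formats.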

==SAMtools genotype likelihoods==
<pre>
This subsection with SAMtools gl are preliminary
</pre>
Define:

<math>
fk_i = 0.83^i*0.97+0.03
</math>

<math>
lhet_{n,k} = \log \frac{\binom{n}{k}}{2^n}
</math>

<math>
\beta_{n,k} = \frac{\beta_{n,k-1}}{\beta_{n,k-1}+\binom{n}{k}\cdot k \cdot log(prob(e))+(n-k)*log(1-prob(e))}
</math>

==SOAPsnp genotype likelihoods==
==SYK genotype likelihoods==

Dodepth

2018-04-30T11:37:17Z

Albrecht: Redirected page to Allele Counts

#REDIRECT [[Allele_Counts]]

Docounts

2018-04-30T11:36:26Z

Albrecht: Redirected page to Allele Counts

#REDIRECT [[Allele_Counts]]

Docount

2018-04-02T10:02:47Z

Albrecht: Redirected page to Allele Counts

#REDIRECT [[Allele_Counts]]

Dogeno

2018-04-02T10:01:44Z

Albrecht: Redirected page to Genotype calling

#REDIRECT [[Genotype_calling]]

Domajorminor

2018-04-02T10:00:55Z

Albrecht: Redirected page to Major Minor

#REDIRECT [[Major_Minor]]

Rmtriallelic

2018-04-02T09:59:21Z

Albrecht: Redirected page to Allele Frequencies

#REDIRECT [[Allele_Frequencies]]

Snp pval

2018-04-02T09:58:37Z

Albrecht: Redirected page to Allele Frequencies

#REDIRECT [[Allele_Frequencies]]

Dopost

2018-04-02T09:57:41Z

Albrecht: Redirected page to Allele Frequencies

#REDIRECT [[Allele_Frequencies]]

Domaf

2018-04-02T09:56:11Z

Albrecht: Redirected page to Allele Frequencies

#REDIRECT [[Allele_Frequencies]]

Allele Frequencies

2018-04-02T09:53:41Z

Albrecht:

<div class="keywords"> -domaf,-domaf,-domaf,-domaf,-domaf, domaf, domaf, domaf, domaf, domaf, domaf, dopost, SNP_pval </div>

The allele frequency is the relative frequency of an allele for a site. This can be polarized according to the major/minor, reference/non-reference or ancestral/derived allele. Therefore the choice of allele frequency estimator is closely related to choosing which alleles are segregating (see [[Inferring_Major_and_Minor_alleles]]).

We allow for frequency estimation from different input data:

# Genotype Likelihoods
# Genotype posterior probabilities
# Counts of bases

The allele frequency estimator from genotype likelihoods is from this [[suYeon | publication]], and the base counts method is from this [[Li2010 |publication]].

For the case of the genotype likelihood based methods we allow for deviations from Hardy-Weinberg, namely we allow for users to supply a file containing inbreeding coefficients for each individual.

=Brief Overview=

<pre>
./angsd -doMaf
-> angsd version: 0.910-76-gad32889 (htslib: 1.3-32-gecdc348) build(Mar 2 2016 12:38:33)
-> Analysis helpbox/synopsis information:
-> Command:
./angsd -doMaf -> Wed Mar 2 12:45:40 2016
------------------------
abcFreq.cpp:
-doMaf 0 (Calculate persite frequencies '.mafs.gz')
1: Frequency (fixed major and minor)
2: Frequency (fixed major unknown minor)
4: Frequency from genotype probabilities
8: AlleleCounts based method (known major minor)
NB. Filedumping is supressed if value is negative
-doPost 0 (Calculate posterior prob 3xgprob)
1: Using frequency as prior
2: Using uniform prior
3: Using SFS as prior (still in development)
Filters:
-minMaf -1.000000 (Remove sites with MAF below)
-SNP_pval 1.000000 (Remove sites with a pvalue larger)
-rmTriallelic 0.000000 (Remove sites with a pvalue lower)
Extras:
-ref (null) (Filename for fasta reference)
-anc (null) (Filename for fasta ancestral)
-eps 0.001000 [Only used for -doMaf &8]
-beagleProb 0 (Dump beagle style postprobs)
-indFname (null) (file containing individual inbreedcoeficients)
NB These frequency estimators requires major/minor -doMajorMinor
</pre>

=Allele Frequency estimation=
The major and minor allele is first inferred from the data or given by the user (see [[Inferring_Major_and_Minor_alleles]]). This includes information from both major and minor allele, a reference genome (for major) or an ancestral genome.

; -doMaf [int]

1: Known major, and known minor. Here both the major and minor allele are assumed to be known (inferred or given by user). The allele frequency is then estimated from the genotype likelihoods. The allele frequency estimator from genotype likelihoods is from this [[suYeon | publication]] but using the EM algorithm and is briefly described [[SYKmaf|here]].

2: Known major, unknown minor. Here the major allele is assumed to be known (inferred or given by user), but the minor allele is not determined. Instead we sum over the 3 possible minor alleles weighted by their probabilities. The allele frequency estimator from genotype likelihoods is from this [[suYeon | publication]] but using the EM algorithm and is briefly described [[SYKmaf|here]].

4: Frequency based on genotype posterior probabilities. If genotype probabilities are used as input to ANGSD the allele frequency is estimated directly from these by [[postFreq|summing over the probabilities]].

8: frequency based on base counts. This method does not rely on genotype likelihood or probabilities but instead infers the allele frequency directly on the base counts. The base counts method is from this [[Li2010 |publication]].

Multiple estimators can be used simultaneously by summing the above numbers. Thus -doMaf 7 (1+2+4) will use the first three estimators. If the allele frequencies are estimated from the genotype likelihoods then you need to infer the major and minor allele (-doMajorMinor).
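Because the values 1, 2, 4 and 8 are distinct powers of two, a summed -doMaf value can be decoded with bitwise tests. A small Python illustration (the dictionary and function are hypothetical helpers, and negative values, which suppress file dumping, are not handled):

```python
DOMAF_ESTIMATORS = {
    1: "known major, known minor (EM)",
    2: "known major, unknown minor (EM)",
    4: "genotype posterior probabilities",
    8: "base counts",
}


def decode_domaf(value):
    """List the estimators selected by a summed -doMaf value."""
    return [name for bit, name in sorted(DOMAF_ESTIMATORS.items()) if value & bit]


print(decode_domaf(7))
```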

;NB using -doMaf 4 is only supported if the posteriors are supplied as external files, since the estimation of genotype posteriors in itself requires a maf estimator.

=Example=

==From genotype likelihood==
Example of estimating the allele frequencies both while assuming known major and minor alleles and while taking the uncertainty of the minor allele inference into account. The [[Inferring_Major_and_Minor_alleles|inference of the major and minor]] allele is done directly from the genotype likelihoods

<pre>
./angsd -out out -doMajorMinor 1 -doMaf 3 -bam bam.filelist -GL 1
</pre>
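The idea behind the known major, known minor estimator can be illustrated with a textbook EM iteration. The Python sketch below assumes Hardy-Weinberg proportions, no inbreeding, and likelihoods (not logs) for the three genotypes major/major, major/minor and minor/minor per individual. It is a simplified illustration and not necessarily identical to what ANGSD implements; the function name and default parameters are hypothetical.

```python
def em_minor_allele_freq(gls, start=0.25, max_iter=200, tol=1e-8):
    """EM estimate of the minor allele frequency.

    gls: one (L_majmaj, L_majmin, L_minmin) tuple per individual with data.
    """
    f = start
    for _ in range(max_iter):
        expected = 0.0  # expected number of minor alleles over all individuals
        for l0, l1, l2 in gls:
            # Likelihood times Hardy-Weinberg prior for 0, 1, 2 minor alleles
            w0 = l0 * (1.0 - f) ** 2
            w1 = l1 * 2.0 * f * (1.0 - f)
            w2 = l2 * f ** 2
            expected += (w1 + 2.0 * w2) / (w0 + w1 + w2)
        f_new = expected / (2.0 * len(gls))
        if abs(f_new - f) < tol:
            return f_new
        f = f_new
    return f


# Toy data: three individuals with fairly informative likelihoods
gls = [(0.94, 0.06, 0.0), (0.10, 0.85, 0.05), (0.98, 0.02, 0.0)]
print(em_minor_allele_freq(gls))
```

Each iteration computes the posterior expected number of minor alleles per individual under the current frequency and then updates the frequency to that expectation divided by the 2N chromosomes.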

==From genotype probabilities==
Example using a genotype probability file, for example the output from beagle.

<pre>
./angsd -out out -doMaf 4 -beagle beagle.file.gz
</pre>

==Estimator from base counts==

The allele frequencies can be inferred directly from the sequencing data [[Li2010|citation]].
This works by using "counts" of alleles, and should be invoked like

<pre>
./angsd -out out -doMajorMinor 2 -doMaf 8 -bam bam.filelist -doCounts 1
</pre>

=Output data=
==.mafs.gz==
<pre>
chromo position major minor ref knownEM unknownEM nInd
21 9719788 T A 0.000001 -0.000012 3
21 9719789 G A 0.000000 -0.000001 3
21 9719790 A C 0.000000 -0.000004 3
21 9719791 G A 0.000000 -0.000001 3
21 9719792 G A 0.000000 -0.000002 3
21 9719793 G T 0.498277 41.932766 3
21 9719794 T A 0.000000 -0.000001 3
21 9719795 T A 0.000000 -0.000001 3

</pre>

;chromo
chromosome name
;position
position
;major
major allele
;minor
minor allele
;knownEM
frequency using -doMaf 1
;unknownEM
frequency using -doMaf 2
;phat
frequency using -doMaf 8
;nInd
is the number of individuals with data

MediaWiki:Common.css

2018-04-02T09:52:01Z

Albrecht:

/* CSS placed here will be applied to all skins */
.keywords {
display: none;
}

MediaWiki:Common.css

2018-04-02T09:50:14Z

Albrecht:

/* CSS placed here will be applied to all skins */
div.keywords {
display: none;
}

Allele Frequencies

2018-04-02T09:49:05Z

Albrecht:


MediaWiki:Common.css

2018-04-02T09:48:46Z

Albrecht: Created page with "/* CSS placed here will be applied to all skins */ .keywords { display: none; }"

/* CSS placed here will be applied to all skins */
.keywords {
display: none;
}

PCA MDS

2018-03-12T15:39:59Z

2018-02-07T14:15:34Z

Albrecht:

Genotype calling

2018-01-19T12:59:11Z

Albrecht: /* Options */

We do not recommend basing analyses on called genotypes; instead, incorporate the genotype uncertainty directly into the analysis you want to perform. We recognise, however, that many methods still rely on called genotypes, and have therefore implemented a basic genotype caller in angsd.

Genotype calling in ANGSD is based on calculating the posterior probability of the genotypes. The '''-doGeno''' is therefore a simple wrapper around the '''-doPost''' along with some extra filtering options. See [[Allele Frequencies]] for more information.

=Brief Overview=
<pre>
./angsd -dogeno -> Wed Mar 2 12:39:19 2016
-----------------
abcCallGenotypes.cpp:

-doGeno 0
1: write major and minor
2: write the called genotype encoded as -1,0,1,2, -1=not called
4: write the called genotype directly: eg AA,AC etc
8: write the posterior probability of all possible genotypes
16: write the posterior probability of called genotype
32: write the posterior probabilities of the 3 gentypes as binary
-> A combination of the above can be choosen by summing the values, EG write 0,1,2 types with majorminor as -doGeno 3
-postCutoff=0.333333 (Only genotype to missing if below this threshold)
-geno_minDepth=-1 (-1 indicates no cutof)
-geno_maxDepth=-1 (-1 indicates no cutof)
-geno_minMM=-1.000000 (minimum fraction af major-minor bases)
-minInd=0 (only keep sites if you call genotypes from this number of individuals)

NB When writing the posterior the -postCutoff is not used
NB geno_minDepth requires -doCounts
NB geno_maxDepth requires -doCounts

</pre>

angsd can also use the full information of the sample allele frequencies for calling genotypes; see [[SFS Estimation]].
==Options==
;-doGeno [int]
1: print out major minor

2: print the called genotype as -1,0,1,2

4: print the called genotype as AA, AC, AG, ...

8: print all 3 posts (major,major),(major,minor),(minor,minor)

16: print the posterior of the called genotype

32: somewhat different: dumps the binary posteriors for all samples, encoded as 3*nind doubles

Use the sum of the above to give the output you want. For example, -doGeno 5 (1+4) prints the major and minor allele followed by the genotype (AA, AC ...) for each individual

; -doPost [int]
1: estimate the posterior genotype probability based on the allele frequency as a prior

2: estimate the posterior genotype probability assuming a uniform prior

; -geno_minDepth [int]
set genotypes to missing if the individual depth is less than [int]

; -geno_maxDepth [int]
set genotypes to missing if the individual depth is larger than [int]

; -geno_minMM [float]
set genotypes to missing if less than [float] of the bases are the major or minor (likely a triallelic site). E.g. 0.1 sets the genotype to missing if less than 10% of the reads in this individual are either the major or the minor allele

; -postCutoff [float]
Call only a genotype with a posterior above this threshold.

NB if the raw posterior dump is requested the -postCutoff is not used
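How -doPost 1 and -postCutoff interact can be sketched as follows. The sketch assumes a Hardy-Weinberg prior built from the minor allele frequency, likelihoods (not logs) for major/major, major/minor and minor/minor, and genotypes coded as the number of minor alleles with -1 for not called. Treating a posterior exactly equal to the cutoff as called is a guess, and the function name is hypothetical.

```python
def call_genotype(gls, freq, post_cutoff=1.0 / 3.0):
    """Return (call, posteriors); call is 0, 1, 2 or -1 if not called."""
    # Hardy-Weinberg prior for 0, 1 and 2 copies of the minor allele
    prior = [(1.0 - freq) ** 2, 2.0 * freq * (1.0 - freq), freq ** 2]
    weighted = [l * p for l, p in zip(gls, prior)]
    total = sum(weighted)
    if total == 0.0:
        return -1, [0.0, 0.0, 0.0]
    post = [w / total for w in weighted]
    best = max(range(3), key=lambda g: post[g])
    # A best posterior below the cutoff gives a missing genotype
    call = best if post[best] >= post_cutoff else -1
    return call, post


gls = (0.941177, 0.058822, 0.000001)
print(call_genotype(gls, freq=0.2, post_cutoff=0.95)[0])  # 0
print(call_genotype(gls, freq=0.5, post_cutoff=0.95)[0])  # -1
```

The same likelihoods give a call with one prior and a missing genotype with another, which is why the choice of prior matters at low depth.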

==Examples==
===Allele frequency as prior===
<pre>
./angsd -bam bam.filelist -GL 1 -out outfile -doMaf 2 -doMajorMinor 1 -SNP_pval 0.000001 -doGeno 5 -doPost 1 -postCutoff 0.95
</pre>

gives an output like this:

<pre>
1 14000202 G A GG NN NN GA NN
1 14000873 G A GG GG GG AA GA
1 14001018 T C NN NN NN CC NN
1 14001867 A G NN AA AA NN NN
1 14002342 C T CC CC CC CC CC
1 14002422 A T AA NN NN NN NN
1 14002474 T C TC TT TT TT TT
1 14003581 C T CC CC NN NN CT
1 14004623 T C TT TT TT NN TC
1 14005069 A G AA AA AA AA AA
</pre>
===Sample allele frequency with SFS as prior===
1. First get an estimate of the site frequency spectrum
<pre>
./angsd -dosaf 1 -anc ../hg19ancNoChr.fa.gz -gl 1 -b list
./realSFS angsdput.saf.idx >angsdput.saf.idx.ml
</pre>
2. Now calculate the diallelic genotype posterior probability with
<pre>
./angsd -dopost 3 -b list -gl 1 -domajorminor 1 -domaf 1 -pest angsdput.saf.idx.ml -dogeno 2 -r 1 -out angsdput2
</pre>

Genotype calling

2018-01-19T12:58:38Z

Albrecht: /* Options */

We really don't recommend doing analysis based on called genotypes, but incorporate the uncertainty directly into the analysis you want to perform. But we recognise that many methods are still relying on called genotypes, and have therefore implemented a basic genotype caller into angsd.

Genotype calling in ANGSD is based on calculating the posterior probability of the genotypes. The '''-doGeno''' is therefore a simple wrapper around the '''-doPost''' along with some extra filtering options. See [[Allele Frequencies]] for more information.

=Brief Overview=
<pre>
./angsd -dogeno -> Wed Mar 2 12:39:19 2016
-----------------
abcCallGenotypes.cpp:

-doGeno 0
1: write major and minor
2: write the called genotype encoded as -1,0,1,2, -1=not called
4: write the called genotype directly: eg AA,AC etc
8: write the posterior probability of all possible genotypes
16: write the posterior probability of called genotype
32: write the posterior probabilities of the 3 gentypes as binary
-> A combination of the above can be choosen by summing the values, EG write 0,1,2 types with majorminor as -doGeno 3
-postCutoff=0.333333 (Only genotype to missing if below this threshold)
-geno_minDepth=-1 (-1 indicates no cutof)
-geno_maxDepth=-1 (-1 indicates no cutof)
-geno_minMM=-1.000000 (minimum fraction af major-minor bases)
-minInd=0 (only keep sites if you call genotypes from this number of individuals)

NB When writing the posterior the -postCutoff is not used
NB geno_minDepth requires -doCounts
NB geno_maxDepth requires -doCounts

</pre>

angsd can also use the full information of the sample allele frequencies for calling genotypes see [[SFS Estimation]].
==Options==
;-doGeno [int]
1: print out major minor

2: print the called genotype as -1,0,1,2

4: print the called genotype as AA, AC, AG, ...

8: print all 3 posts (major,major),(major,minor),(minor,minor)

16: print the posterior of the called genotype

32: somewhat different dumps the binary posterior for all samples, encoded as 3*nind double

Use the sum of the above to give the output you want. Forexample -doGeno 5 (1+4) prins the major and minor allele followed by the genotype (AA, AC ...) for each individual

; -doPost [int]
1: estimate the posterior genotype probability based on the allele frequency as a prior

2: estimate the posterior genotype probability assuming a uniform prior

; -geno_minDepth [int]
set genotypes to missing if the individual depth is less than [int]

; -geno_maxDepth [int]
set genotypes to missing if the individual depth is larger than [int]

; -geno_minMM [float]
set genotypes to missing if less than [float] of the bases are the major or minor (likely a triallic site). e.g. 0.1 means that less than 10% of reads are either the major or the minor in this indivual

; -postCutoff [float]
Call only a genotype with a posterior above this threshold.

NB if the raw posterior dump is requested the -postCutoff is not used

==Examples==
===Allele frequency as prior===
<pre>
./angsd -bam bam.filelist -GL 1 -out outfile -doMaf 2 -doMajorMinor 1 -SNP_pval 0.000001 -doGeno 5 -doPost 1 -postCutoff 0.95
</pre>

gives an output like this:

<pre>
1 14000202 G A GG NN NN GA NN
1 14000873 G A GG GG GG AA GA
1 14001018 T C NN NN NN CC NN
1 14001867 A G NN AA AA NN NN
1 14002342 C T CC CC CC CC CC
1 14002422 A T AA NN NN NN NN
1 14002474 T C TC TT TT TT TT
1 14003581 C T CC CC NN NN CT
1 14004623 T C TT TT TT NN TC
1 14005069 A G AA AA AA AA AA
</pre>
===Sample allele frequency with SFS as prior===
1. First get an estimate of the site frequency spectrum
<pre>
./angsd -dosaf 1 -anc ../hg19ancNoChr.fa.gz -gl 1 -b list
./realSFS angsdput.saf.idx >angsdput.saf.idx.ml
</pre>
2. Now calculate the diallelic genotype posterior probability with
<pre>
./angsd -dopost 3 -b list -gl 1 -domajorminor 1 -domaf 1 -pest angsdput.saf.idx.ml -dogeno 2 -r 1 -out angsdput2
</pre>

Association

2017-08-10T08:32:11Z

Albrecht: /* Output */

Association can be performed using two approaches.
# Based on testing differences in allele frequencies between cases and controls, using genotype likelihoods
# Based on a generalized linear framework, which allows for both quantitative and binary traits and for including additional covariates, using genotype posteriors.

__TOC__
We recommend that users don't perform association analysis on all sites, but limit the analysis to informative sites. In the case of alignment data (BAM), we advise that users filter away the low mapping quality reads and the low qscore bases.

The filtering of the alignment data is described in [[Input]], and filtering based on frequencies/polymorphic sites is described [[Filters#Allele_frequencies| here]].
<div class="toccolours mw-collapsible mw-collapsed">
This can be done easily at the command line by adding the below commands
<pre class="mw-collapsible-content">
-minQ 20 -minMapQ 30 -SNP_pval 1e-6 #Use polymorphic sites with a p-value of 10^-6
-minQ 20 -minMapQ 30 -minMaf 0.05 #Use sites with a MAF >0.05
</pre>
</div>
=Brief Overview=
<pre>
./angsd -doAsso
analysisAsso.cpp:
-doAsso 0
1: Frequency Test (Known Major and Minor)
2: Score Test
3: Frequency Test (Unknown Minor)
Frequency Test Options:
-yBin (null) (File containing disease status)

Score Test Options:
-yBin (null) (File containing disease status)
-yQuant (null) (File containing phenotypes)
-minHigh 10 (Require atleast minHigh number of high credible genotypes)
-minCount 10 (Require this number of minor alleles, estimated from MAF)
-cov (null) (File containing additional covariates)
-model 1
1: Additive/Log-Additive (Default)
2: Dominant
3: Recessive

</pre>
=Case control association using allele frequencies=
To test for differences in the allele frequencies, genotype likelihoods need to be provided or [[Genotype_likelihoods_from_alignments | estimated]]. The test is an implementation of the likelihood ratio test for differences between cases and controls described in detail in [[Kim2011]].

;-doAsso [int]
'''1''': The test is performed assuming the minor allele is known. <br>

;-yBin [Filename]
A file containing the case control status. 0 being the controls, 1 being the cases and -999 being missing phenotypes. The file should contain a single phenotype entry per line.
<div class="toccolours mw-collapsible mw-collapsed">
Example of case-control phenotype file
<pre class="mw-collapsible-content">
1
0
0
0
1
1
1
1
0
-999
1
0
0
0
0
1
</pre>
</div>
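Before running the test it can be useful to confirm that a phenotype file only contains the three codes described above. The following is a minimal sketch; the function name <code>read_ybin</code> is hypothetical, and it works on a list of lines so that it can be tried without a file.

```python
def read_ybin(lines):
    """Parse case/control codes: 0 = control, 1 = case, -999 = missing.

    Returns a list with 0, 1 or None (missing), one entry per line.
    """
    allowed = {"0": 0, "1": 1, "-999": None}
    status = []
    for number, line in enumerate(lines, start=1):
        token = line.strip()
        if token not in allowed:
            raise ValueError(f"line {number}: expected 0, 1 or -999, got {token!r}")
        status.append(allowed[token])
    return status

example = ["1", "0", "0", "-999", "1"]
status = read_ybin(example)
print(status.count(1), "cases,", status.count(0), "controls,",
      status.count(None), "missing")
```

For a real file the lines can be obtained with <code>open('pheno.ybin').read().splitlines()</code>.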

==Example==

Create a large number of individuals (500) by recycling the example files, and simulate some phenotypes (case/control) using R:

<pre>
for i in `seq 1 50`;do cat bam.filelist>>large.filelist;done
Rscript -e "write.table(cbind(rbinom(500,1,0.5)),'pheno.ybin',row=F,col=F)"
</pre>

<pre>
./angsd -yBin pheno.ybin -doAsso 1 -GL 1 -out out -doMajorMinor 1 -doMaf 1 -SNP_pval 1e-6 -bam large.filelist -r 1: -P 5
</pre>
Note that because you are reading 500 bam files, it takes a little while.

<div class="toccolours mw-collapsible mw-collapsed">
gunzip -c out.lrt0.gz | head
<pre class="mw-collapsible-content">
Chromosome Position Major Minor Frequency LRT
1 14000003 G A 0.057070 0.016684
1 14000013 G A 0.067886 0.029014
1 14000019 G T 0.052904 0.569061
1 14000023 C A 0.073336 0.184060
1 14000053 T C 0.038903 0.604695
1 14000170 C T 0.050756 0.481033
1 14000176 G A 0.053157 0.424910
1 14000200 C A 0.085332 0.485030
1 14000202 G A 0.257132 0.025047
</pre>
</div>

The LRT is the likelihood ratio statistic, which is chi-square distributed with one degree of freedom.
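Because the statistic has one degree of freedom, a p-value can be obtained without extra libraries: for a chi-square variable with one degree of freedom the upper tail probability equals erfc(sqrt(x/2)). The sketch below assumes that negative values (such as the -999 given to filtered sites) should be treated as missing; the function name is hypothetical.

```python
import math

def lrt_to_pvalue(lrt):
    """Upper-tail chi-square (1 df) probability of an LRT value.

    Returns None for negative values, which are assumed to mark sites
    that failed a filter (reported as -999).
    """
    if lrt < 0:
        return None
    return math.erfc(math.sqrt(lrt / 2.0))

# Values taken from the example output above
for lrt in (0.016684, 0.569061, 0.025047):
    print(lrt, lrt_to_pvalue(lrt))
```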

==Dependency Chain==
The method is based on estimating frequencies from genotype likelihoods. If alignment data has been supplied you need to specify the following.

# [[Genotype_likelihoods_from_alignments | Genotype likelihood model (-GL)]].
#[[Inferring_Major_and_Minor_alleles |Determine Major/Minor (-doMajorMinor)]].
#[[Allele_Frequency_estimation| Maf estimator (-doMaf)]].

If you have supplied genotype likelihood files as input for angsd you can skip 1.

=Score statistic=
To perform the test in a generalized linear framework, posterior genotype probabilities must be provided or [[Genotype_calling|estimated]]. The approach is published here [[skotte2012]].
;-doAsso 2

;-yBin [Filename]
A file containing the case control status. 0 being the controls, 1 being the cases and -999 being missing phenotypes.
<div class="toccolours mw-collapsible mw-collapsed">
Example of case-control phenotype file
<pre class="mw-collapsible-content">
1
0
0
0
1
1
1
1
0
-999
1
0
0
0
0
1
</pre>
</div>
;-yQuant [Filename]
File containing the phenotype values, with -999 being missing phenotypes. The file should contain a single phenotype entry per line.
<div class="toccolours mw-collapsible mw-collapsed">
Example of quantitative phenotype file
<pre class="mw-collapsible-content">
-999
2.06164722761138
-0.091935218675602
-0.287527686061831
-999
-999
-1.20996664036026
0.0188541092307412
-2.1122713873334
-999
-1.32920529536579
-1.10582299663753
-0.391773417823766
-0.501400984567535
-999
1.06014677976046
-1.10582299663753
-999
0.223156127557052
-0.189660869820135
</pre>
</div>
;-cov [Filename]
File containing additional covariates for the analysis. Each line should contain the additional covariates for a single individual. Thus the number of lines should match the number of individuals and the number of columns should match the number of additional covariates.

<div class="toccolours mw-collapsible mw-collapsed">
Example of covariate file
<pre class="mw-collapsible-content">
1 0 0 1
1 0.1 0 0
2 0 1 0
2 0 1 0
2 0.1 0 1
1 0 0 1
1 0.3 0 0
2 0 0 0
1 0 0 0
2 0.2 0 1
1 0 1 0
1 0 0 0
1 0.1 0 0
1 0 0 0
2 0 0 1
2 0 0 0
2 0 0 0
1 0 0 1
1 0.5 0 0
2 0 0 0
</pre>
</div>
;-minHigh [int]
default = 10 <br>
This approach needs a certain amount of variability in the genotype probabilities. minHigh filters out sites that do not have at least [int] high-confidence homozygous major, heterozygous or homozygous minor genotypes. At least two of the three genotype categories need at least [int] individuals with a genotype probability above 0.9. This filter avoids the scenario where all individuals have genotypes with the same probability, e.g. all are heterozygous with a high probability or all have 0.33333333 probability for all three genotypes.
;-minCount [int]
default = 10 <br>
The minimum expected number of minor alleles in the sample. This is the frequency multiplied by two times the number of individuals. Performing association on extremely low minor allele frequencies does not make sense.
;-model [int]
# Additive/Log-additive for Linear/Logistic Regression (Default).
# Dominant.
# Recessive.
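The -minHigh and -minCount filters described above can be sketched as follows. This is an illustration of the rules as worded on this page, not angsd's code: the function names are hypothetical, and whether angsd uses strict or non-strict comparisons at the boundaries is an assumption.

```python
def passes_min_high(posteriors, min_high=10, threshold=0.9):
    """posteriors: one (P_WT, P_HE, P_HO) tuple per individual.

    Keep the site if at least two of the three genotype categories have
    at least min_high individuals with a posterior above threshold.
    """
    counts = [sum(1 for p in posteriors if p[g] > threshold) for g in range(3)]
    return sum(1 for c in counts if c >= min_high) >= 2

def passes_min_count(maf, n_individuals, min_count=10):
    """Expected number of minor alleles = frequency * 2 * individuals."""
    return maf * 2 * n_individuals >= min_count

confident_wt = [(0.95, 0.04, 0.01)] * 12
confident_he = [(0.03, 0.95, 0.02)] * 11
uninformative = [(1 / 3, 1 / 3, 1 / 3)] * 20

print(passes_min_high(confident_wt + confident_he + uninformative))  # two categories qualify
print(passes_min_high(confident_he * 3))                             # only heterozygotes
print(passes_min_count(0.01, 330))                                   # 6.6 expected minor alleles
```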

==Example==
Create a large number of individuals (500) by recycling the example files, and simulate some phenotypes (case/control and quantitative) and covariates using R:

<pre>
rm -f large.filelist
for i in `seq 1 50`;do cat bam.filelist>>large.filelist;done
Rscript -e "write.table(cbind(rbinom(500,1,0.5)),'pheno.ybin',row=F,col=F)"
Rscript -e "write.table(cbind(rnorm(500)),'pheno.yquant',row=F,col=F)"
Rscript -e "set.seed(1);write.table(cbind(rbinom(500,1,0.5),rnorm(500)),'cov.file',row=F,col=F)"
</pre>
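Since the phenotype and covariate files are matched to the bam list by line order, a quick consistency check before the run can save time. The sketch below is hypothetical (the function <code>check_inputs</code> is not part of angsd) and assumes one phenotype line and one covariate line per entry in the bam list.

```python
def check_inputs(bam_list, phenotypes, covariates=None):
    """Check that per-individual inputs line up; returns the number of individuals.

    All arguments are lists of non-empty lines read from the respective files.
    """
    n = len(bam_list)
    if len(phenotypes) != n:
        raise ValueError(f"{len(phenotypes)} phenotypes for {n} individuals")
    if covariates is not None:
        if len(covariates) != n:
            raise ValueError(f"{len(covariates)} covariate lines for {n} individuals")
        # Every individual must have the same number of covariate columns.
        widths = {len(row.split()) for row in covariates}
        if len(widths) != 1:
            raise ValueError(f"covariate lines differ in column count: {sorted(widths)}")
    return n

bams = ["a.bam", "b.bam", "c.bam"]
pheno = ["1", "0", "-999"]
cov = ["1 0.1", "2 0", "1 0.3"]
print(check_inputs(bams, pheno, cov), "individuals")
```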

For case-control data, restricted to polymorphic sites (SNP p-value < 1e-6):
<pre>
./angsd -yBin pheno.ybin -doAsso 2 -GL 1 -doPost 1 -out out -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -bam large.filelist -P 5 -r 1:
</pre>

For quantitative traits (normally distributed errors), restricted to polymorphic sites (SNP p-value < 1e-6) and with additional covariates:
<pre>
./angsd -yQuant pheno.yquant -doAsso 2 -cov cov.file -GL 1 -doPost 1 -out out -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -bam large.filelist -P 5 -r 1:
</pre>

==Example with imputation (using BEAGLE)==

First the polymorphic sites to be analysed need to be selected (-doMaf 1 -SNP_pval -doMajorMinor) and the genotype likelihoods estimated (-GL 1) for use in [http://faculty.washington.edu/browning/beagle/beagle.html the Beagle software] (-doGlf 2).

<pre>
./angsd -GL 1 -out input -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -bam large.filelist -P 5 -r 1: -doGlf 2
</pre>

Perform the imputation

<pre>
java -Xmx15000m -jar beagle.jar like=input.beagle.gz out=beagleOut
</pre>

The reference fai can be obtained by indexing the reference genome or by using the header of a bam file:
<pre>
samtools view -H bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | grep SN |cut -f2,3 | sed 's/SN\://g' | sed 's/LN\://g' > ref.fai
</pre>

The association can then be performed on the genotype probabilities using the score statistics
<pre>
./angsd -doMaf 4 -beagle beagleOut.impute.beagle.gz.gprobs.gz -fai ref.fai -yBin pheno.ybin -doAsso 2
</pre>

==Dependency Chain==
The method is based on genotype probabilities. If alignment data has been supplied you need to specify the following.

# [[Genotype_likelihoods_from_alignments | Genotype likelihood model (-GL)]].
#[[Inferring_Major_and_Minor_alleles |Determine Major/Minor (-doMajorMinor)]].
#[[Allele_Frequency_estimation| Maf estimator (-doMaf)]].
#[[Genotype_calling| Calculate posterior genotype probability (-doPost)]]. If you use the score statistics -doAsso 2 then calculate the posterior using the allele frequency as prior (-doPost 1).

If you have supplied genotype likelihoods for angsd, then you should skip 1.<br>

If you have supplied genotype probabilities (as beagle output format), there are no dependencies.

=Output=
==Output format==
The output from the association analysis is a list of files called '''prefix.lrt''' (in the examples above, the gzipped file out.lrt0.gz). These are tab separated plain text files. The score test writes the eight columns below; the frequency test writes only the columns without an asterisk.
{| class="wikitable"
|-
! scope="col"| Chromosome
! scope="col"| Position
! scope="col"| Major
! scope="col"| Minor
! scope="col"| Frequency
! scope="col"| N*
! scope="col"| LRT
! scope="col"| high_WT/HE/HO*
|}
'''*''' Indicates that these columns are only used for the score test.
{| class="wikitable"
|-
! scope="col"| Field
! scope="col"| Description
|-
! scope="row"| Chromosome
| Chromosome.
|-
! scope="row"| Position
| Physical Position.
|-
! scope="row"| Major
| The Major allele as determined by [[MajorMinor |-doMajorMinor]]. If posterior genotype files have been supplied as input, this column is not defined.
|-
! scope="row"| Minor
| The Minor allele as determined by [[MajorMinor |-doMajorMinor]]. If posterior genotype files have been supplied as input, this column is not defined.
|-
! scope="row"| Frequency
| The Minor allele frequency as determined by [[Maf|-doMaf]].
|-
! scope="row"| N*
| Number of individuals. That is the number of samples that have both sequencing data and phenotypic data.
|-
! scope="row"| LRT
| The likelihood ratio statistic. This statistic is chi-square distributed with one degree of freedom. Sites that fail one of the filters are given the value -999.000000.
|-
! scope="row"| high_WT/HE/HO*
| Number of individuals with a WT/HE/HO genotype posterior probability above 0.9. WT=major/major, HE=major/minor, HO=minor/minor.
|}

Example:
<pre>
Chromosome Position Major Minor Frequency N LRT high_WT/HE/HO
1 14000023 C A 0.052976 330 2.863582 250/10/0
1 14000072 G T 0.020555 330 1.864555 320/10/0
1 14000113 A G 0.019543 330 0.074985 320/10/0
1 14000202 G A 0.270106 330 0.181530 50/90/0
1 14000375 T C 0.020471 330 1.845881 320/10/0
1 14000851 T C 0.016849 330 0.694058 320/10/0
1 14000873 G A 0.305990 330 0.684507 140/60/10
1 14001008 T C 0.018434 330 0.031631 320/10/0
1 14001018 T C 0.296051 330 0.761196 110/40/10
</pre>
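A line of the score test output shown above can be split into typed fields as sketched below. The function name and the dictionary keys are hypothetical, and the sketch assumes the eight-column layout of the example, with the last column holding three counts separated by "/".

```python
def parse_score_line(line):
    """Split one data line of the score test output into typed fields."""
    chrom, pos, major, minor, freq, n, lrt, high = line.split()
    high_wt, high_he, high_ho = (int(x) for x in high.split("/"))
    return {
        "chromosome": chrom,
        "position": int(pos),
        "major": major,
        "minor": minor,
        "frequency": float(freq),
        "n": int(n),
        "lrt": float(lrt),
        "high_wt": high_wt,
        "high_he": high_he,
        "high_ho": high_ho,
    }

record = parse_score_line("1\t14000873\tG\tA\t0.305990\t330\t0.684507\t140/60/10")
print(record["position"], record["lrt"], record["high_he"])
```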


==Problems with inflation of p-values==

You can evaluate the behavior of the tests by making a QQ plot of the LRT. There are several reasons why it might show signs of inflation:
; -doPost (when using -doAsso 2 without posterior input from -beagle)
If you estimate the posterior genotype probability using a uniform prior (-doPost 2) then small differences in depth between samples will inflate the test statistics (see [[Skotte2012]]). Use the allele frequency as a prior (-doPost 1).
; -minCount/-minHigh
If you set this too low then it will result in inflation of the test statistics.
; -yQuant (when using -doAsso 2 with a quantitative trait)
If your trait is not continuous, or the distribution of the trait is skewed or has outliers, then you will get inflation of p-values. The same rules apply as for a standard regression. Consider transforming your trait to a normal distribution.
; Population structure
If you have population structure then you will have to adjust for it in the regression model (-doAsso 2). Consider using NGSadmix or PCAngsd and use the results as covariates. Note that the model will still have some issues because it uses the allele frequency as a prior. For the adventurous, you can use PCAngsd or NGSadmix to estimate the individual allele frequencies and calculate your own genotype probabilities that take structure into account. These can then be used in angsd using the -beagle input format.
; low N
Usually a GWAS is performed on thousands of samples and we have only tested the use of the score statistics on hundreds of samples. If you have a low number of samples then try to figure out what minor allele frequency you would need in order to have some power. Also be careful with reducing -minCount/-minHigh.
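The QQ plot mentioned at the start of this section compares the sorted LRT values with the quantiles expected under a chi-square distribution with one degree of freedom. The sketch below computes those pairs with the Python standard library only. The inflation factor (median observed LRT divided by the expected median) is not described on this page; it is added here as a commonly used one-number summary. All function names are hypothetical, and negative values are assumed to be the -999 given to filtered sites.

```python
from statistics import NormalDist, median

def chi2_1df_quantile(p):
    """Quantile of chi-square (1 df): the square of a standard normal quantile."""
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)
    return z * z

def qq_points(lrt_values):
    """Return (expected, observed) pairs for a QQ plot, skipping filtered sites."""
    observed = sorted(x for x in lrt_values if x >= 0)
    n = len(observed)
    expected = [chi2_1df_quantile((i + 0.5) / n) for i in range(n)]
    return list(zip(expected, observed))

def inflation_factor(lrt_values):
    """Median observed LRT over the expected median (about 0.455)."""
    observed = [x for x in lrt_values if x >= 0]
    return median(observed) / chi2_1df_quantile(0.5)

lrt = [2.863582, 1.864555, 0.074985, 0.181530, -999.0, 0.694058]
for expected, observed in qq_points(lrt):
    print(round(expected, 3), observed)
print("inflation factor:", round(inflation_factor(lrt), 3))
```

Points lying well above the diagonal, or an inflation factor clearly above 1, point to one of the issues listed above. With only a handful of sites, as in this toy input, neither is informative.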