<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://www.popgen.dk/angsd/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Samuele</id>
	<title>angsd - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://www.popgen.dk/angsd/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Samuele"/>
	<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php/Special:Contributions/Samuele"/>
	<updated>2026-05-02T09:31:29Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.40.1</generator>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2939</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2939"/>
		<updated>2017-03-07T14:55:47Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each of the groups. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeFile		       (null)   file with sizes of the populations	&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
        -useLast                        0       1=use the last group of bam files as outgroup&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites for each of the 4-population trees that can be built from the data, where the outgroup is fixed.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4 and 5 are the numerator and denominator of the D-statistic for that block. Column 6 is the number of sites containing data in that block. The remaining 256 columns are the normalized counts of the 256 allele patterns across the 4 populations, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. Every block is repeated once for each tree that is built.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
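As an illustration of the column layout described above, the following Python sketch builds the 256 pattern labels and splits one line into named fields. It assumes whitespace-separated columns and no header line, neither of which is stated on this page; the function names pattern_labels and parse_block_line are hypothetical and not part of ANGSD.&lt;br /&gt;

```python
from itertools import product

BASES = "ACGT"  # 0=A, 1=C, 2=G, 3=T, as stated in the text


def pattern_labels():
    """Return (label, alleles) pairs in the order X0000 ... X3333."""
    labels = []
    for digits in product(range(4), repeat=4):
        name = "X" + "".join(str(d) for d in digits)
        alleles = "".join(BASES[d] for d in digits)
        labels.append((name, alleles))
    return labels


def parse_block_line(line):
    """Split one line of a *.abbababa2 file into named fields.

    Assumption: columns are whitespace-separated and the line is not a header.
    """
    fields = line.split()
    counts = [float(x) for x in fields[6:262]]
    names = [name for name, _ in pattern_labels()]
    return {
        "chrom": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "numerator": float(fields[3]),
        "denominator": float(fields[4]),
        "n_sites": int(float(fields[5])),
        "patterns": dict(zip(names, counts)),
    }
```

The first two labels are X0000 (AAAA) and X0001 (AAAC), and the last one is X3333 (TTTT), matching the ordering given in the text.&lt;br /&gt;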
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default). 1: remove ancient transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [int]&lt;br /&gt;
Size of each block. Choose a number that is higher than the LD in the populations. For human 5Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -useLast [int]&lt;br /&gt;
1: use the last group of bam files as outgroup for the D-stat analysis. Default: 0 (use the fasta file as outgroup)&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 in order to count the bases at each site after filters.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup have the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual. 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
allows for a maximum depth in each site to avoid overflow of the ABBA BABA counts. Default 100.&lt;br /&gt;
; -sizeFile [fileName]&lt;br /&gt;
file that specifies number of individuals in each population (more than 4 populations can be defined). If not provided, it is assumed that each population has only one individual.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
&lt;br /&gt;
For more elaborate filtering of bases based on quality scores, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
This tutorial requires Samtools to be installed, as well as the R library 'pracma'.&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of angsd from the GitHub repository into your working folder&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R DSTAT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, place them in the folder ./bams/ and create a file bam.filelist listing the pathnames of those datasets&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam; do samtools index &amp;quot;$i&amp;quot;; done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the 4-population test.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, generate a fasta file from one of our 10 bam files. We assume that this genome is of very high quality, so that we can use it as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -doCounts 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
We will apply error correction to the group with 3 individuals, using &amp;quot;perfectSampleCEU&amp;quot; as high-quality reference genome.&lt;br /&gt;
The population of 3 individuals affected by transition errors occupies lines 6 to 8 of the file bam.filelist. We select those individuals and write them to another file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 6,8p bam.filelist &amp;gt; bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later serve as input for the R script that calculates the D-statistic. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out errorFile -bam bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==4-population test==&lt;br /&gt;
In this tutorial we perform the ABBABABA test on all the combinations of 4 populations among 6 populations of size 1,2,2,3,2,1 individuals, where the last population is fixed as the outgroup (so that there are 30 possible combinations). The last population is represented by the fasta file given with the option -anc, which is used as the outgroup when -useLast 0 is set. Alternatively, the last population of .bam files can be used as the outgroup with the option -useLast 1. Create a file named sizeFile.size and write the size of each population, one per line (IT IS NECESSARY to include the size of the -anc outgroup population, which is always 1):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
2&lt;br /&gt;
2&lt;br /&gt;
3&lt;br /&gt;
2&lt;br /&gt;
1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
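The relation between sizeFile.size and bam.filelist, and the count of 30 trees, can be checked with a short Python sketch. It assumes that populations are assigned to consecutive lines of bam.filelist in the order of the size file, and that a swap of H1 and H2 is not counted as a separate tree; this is one counting that reproduces the 30 stated above, not necessarily how ANGSD enumerates the trees internally. The function names are hypothetical.&lt;br /&gt;

```python
from itertools import combinations

# Contents of sizeFile.size; the last entry is the -anc outgroup (with -useLast 0)
sizes = [1, 2, 2, 3, 2, 1]


def bam_line_ranges(sizes):
    """Map each BAM-backed population to its 1-based (first, last) lines in bam.filelist."""
    ranges = []
    first = 1
    for n in sizes[:-1]:  # skip the outgroup, which comes from the fasta file
        ranges.append((first, first + n - 1))
        first += n
    return ranges


def count_trees(n_populations):
    """Count trees (((H1,H2),H3),H4) with H4 fixed as the last population.

    Assumption: swapping H1 and H2 gives the same tree, so each trio of
    ingroup populations contributes three trees (one per choice of H3).
    """
    ingroups = range(n_populations - 1)
    return 3 * len(list(combinations(ingroups, 3)))


print(bam_line_ranges(sizes))
print(count_trees(len(sizes)))
```

With the sizes above, the group of 3 individuals maps to lines 6 to 8 of bam.filelist, matching the sed command used for the error correction, and the tree count is 30.&lt;br /&gt;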
We target three chromosomes, restricting one of them to the loci between positions 10Mb and 15Mb. Create a file called regions.txt containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1:&lt;br /&gt;
5:&lt;br /&gt;
16:10000000-15000000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The output of ANGSD will show no data for chromosome 1. This happens when none of the blocks within a chromosome contain data, so they are not printed.&lt;br /&gt;
&lt;br /&gt;
After running ANGSD to count ABBA and BABA combinations, we call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -rf regions.txt -useLast 0 -minQ 20 -minMapQ 30 -p 1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[The output file is]]&lt;br /&gt;
[[bam.Angsd.abbababa2 (used for the 4-population test)]]
Each line represents a block for one of the 30 possible trees (so each block is written on 30 lines), with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4, 5 and 6 are the numerator, the denominator and the number of sites analyzed in the block. The next 256 columns are the counted allele patterns in the tree, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script specifying the error files for the population with 3 individuals. This is done by listing the error file of each population in a text file, one line per population (including a line for the outgroup population). If a population has no error file, it is sufficient to write NA. Create a file called errorList.error containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
./errorFile.ancError&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create a file popNames.name containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Population1&lt;br /&gt;
Population2&lt;br /&gt;
Population3&lt;br /&gt;
PopWithError&lt;br /&gt;
Population4&lt;br /&gt;
Chimpanzee&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Run the R script with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript DSTAT angsdFile=&amp;quot;bam.Angsd&amp;quot; out=&amp;quot;result&amp;quot; sizeFile=sizeFile.size errFile=errorList.error nameFile=popNames.name&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script will show the calculated D-statistic along with the Z-score, p-value, standard deviation and other quantities for all 30 4-population trees. Note: if error correction is not needed, simply do not specify any error file. If no names are provided, the script assigns Population_* as the default names. If no size file is provided, the script assigns size 1 to each population. At least one of the name file and the size file is required. The size file used for ANGSD can be reused.&lt;br /&gt;
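To give an idea of how a D-statistic and a Z score can be obtained from the per-block numerators and denominators (columns 4 and 5 of the ANGSD output file), here is a minimal Python sketch of an unweighted delete-one block jackknife. It is an illustration under that assumption only; estAvgError.R may weight blocks differently and additionally applies error correction. The function name jackknife_d is hypothetical.&lt;br /&gt;

```python
import math


def jackknife_d(numerators, denominators):
    """Unweighted delete-one block jackknife for D = sum(numerators) / sum(denominators).

    Returns (D, jackknife estimate of D, jackknife variance, Z score).
    This is a generic sketch, not the computation performed by estAvgError.R.
    """
    n = len(numerators)
    total_num = sum(numerators)
    total_den = sum(denominators)
    d_all = total_num / total_den
    # D recomputed with each block left out in turn
    d_drop = [
        (total_num - num) / (total_den - den)
        for num, den in zip(numerators, denominators)
    ]
    mean_drop = sum(d_drop) / n
    d_jk = n * d_all - (n - 1) * mean_drop  # bias-corrected estimate
    var_jk = (n - 1) / n * sum((d - mean_drop) ** 2 for d in d_drop)
    z = d_all / math.sqrt(var_jk)
    return d_all, d_jk, var_jk, z
```

The inputs would be the numerator and denominator columns of all blocks belonging to a single tree; blocks of different trees should not be mixed.&lt;br /&gt;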
&lt;br /&gt;
The D-statistics and other information are written to four distinct files, depending on whether error correction and ancient transition removal are applied. The files are named as follows:&lt;br /&gt;
;[[1)result.Observed.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[2) result.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[3) result.ErrorCorr.TransRem.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
;[[4) result.TransRem.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBlocks=number of blocks with observed data, H*=the names of the four populations for the specific tree. Note that the number of patterns might not be an integer because of how ANGSD treats multiple genomes per population.&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2935</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2935"/>
		<updated>2017-02-28T17:51:27Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each of the groups. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeFile		       (null)   file with sizes of the populations	&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
        -useLast                        0       1=use the last group of bam files as outgroup&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites for each of the 4-population trees that can be built from the data, where the outgroup is fixed.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4 and 5 are the numerator and denominator of the D-statistic for that block. Column 6 is the number of sites containing data in that block. The remaining 256 columns are the normalized counts of the 256 allele patterns across the 4 populations, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. Every block is repeated once for each tree that is built.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default). 1: remove ancient transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [int]&lt;br /&gt;
Size of each block. Choose a number that is higher than the LD in the populations. For human 5Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -useLast [int]&lt;br /&gt;
1: use the last group of bam files as outgroup for the D-stat analysis. Default: 0 (use the fasta file as outgroup)&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 in order to count the bases at each site after filters.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup have the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual. 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
allows for a maximum depth in each site to avoid overflow of the ABBA BABA counts. Default 100.&lt;br /&gt;
; -sizeFile [fileName]&lt;br /&gt;
file that specifies number of individuals in each population (more than 4 populations can be defined). If not provided, it is assumed that each population has only one individual.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
&lt;br /&gt;
For more elaborate filtering of bases based on quality scores, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
This tutorial requires Samtools to be installed, as well as the R library 'pracma'.&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of angsd from the GitHub repository into your working folder&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R DSTAT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, place them in the folder ./bams/ and create a file bam.filelist listing the pathnames of those datasets&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam; do samtools index &amp;quot;$i&amp;quot;; done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the 4-population test.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, generate a fasta file from one of our 10 bam files. We assume that this genome is of very high quality, so that we can use it as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -doCounts 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
We will apply error correction to the group with 3 individuals, using &amp;quot;perfectSampleCEU&amp;quot; as high-quality reference genome.&lt;br /&gt;
The population of 3 individuals affected by transition errors occupies lines 6 to 8 of the file bam.filelist. We select those individuals and write them to another file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 6,8p bam.filelist &amp;gt; bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later serve as input for the R script that calculates the D-statistic. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out errorFile -bam bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==4-population test==&lt;br /&gt;
In this tutorial we perform the ABBABABA test on all the combinations of 4 populations among 6 populations of size 1,2,2,3,2,1 individuals, where the last population is fixed as the outgroup (so that there are 30 possible combinations). The last population is represented by the fasta file given with the option -anc, which is used as the outgroup when -useLast 0 is set. Alternatively, the last population of .bam files can be used as the outgroup with the option -useLast 1. Create a file named sizeFile.size and write the size of each population, one per line (IT IS NECESSARY to include the size of the -anc outgroup population, which is always 1):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
2&lt;br /&gt;
2&lt;br /&gt;
3&lt;br /&gt;
2&lt;br /&gt;
1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We target three chromosomes, restricting one of them to the loci between positions 10Mb and 15Mb. Create a file called regions.txt containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1:&lt;br /&gt;
5:&lt;br /&gt;
16:10000000-15000000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The output of ANGSD will show no data for chromosome 1. This happens when none of the blocks within a chromosome contain data, so they are not printed.&lt;br /&gt;
&lt;br /&gt;
After running ANGSD to count ABBA and BABA combinations, we call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -rf regions.txt -useLast 0 -minQ 20 -minMapQ 30 -p 1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[The output file is]]&lt;br /&gt;
[[bam.Angsd.abbababa2 (used for the 4-population test)]]
Each line represents a block for one of the 30 possible trees (so each block is written on 30 lines), with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4, 5 and 6 are the numerator, the denominator and the number of sites analyzed in the block. The next 256 columns are the counted allele patterns in the tree, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script specifying the error files for the population with 3 individuals. This is done by listing the error file of each population in a text file, one line per population (including a line for the outgroup population). If a population has no error file, it is sufficient to write NA. Create a file called errorList.error containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
./errorFile.ancError&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We also want to study the effect of error correction when an error rate between -0.005 and 0.005, in steps of 0.001, is added to each population individually for the substitution A-&amp;gt;T. More substitutions can be specified by separating the letters with commas. We need to create a file with the names of the 6 populations. Create a file popNames.name containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Population1&lt;br /&gt;
Population2&lt;br /&gt;
Population3&lt;br /&gt;
PopWithError&lt;br /&gt;
Population4&lt;br /&gt;
Chimpanzee&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Run the R script with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript DSTAT angsdFile=&amp;quot;bam.Angsd&amp;quot; out=&amp;quot;result&amp;quot; sizeFile=sizeFile.size errFile=errorList.error nameFile=popNames.name addErr=&amp;quot;-0.005,0.005,0.001;A;T&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script will show the calculated D-statistic along with the Z-score, p-value, standard deviation and other quantities for all 30 4-population trees. Note: if error correction is not needed, simply do not specify any error file. If no names are provided, the script assigns Population_* as the default names. If no size file is provided, the script assigns size 1 to each population. At least one of the name file and the size file is required. The size file used for ANGSD can be reused.&lt;br /&gt;
&lt;br /&gt;
The plots of error rates for type-specific errors and added errors are saved as pdf files in the folder result.errorDataFolder (only if you have provided an error rate file and added error rates in the options for the R script). The other text files are used by the R script for the necessary computations and do not contain useful information.&lt;br /&gt;
The D-statistics are contained in four distinct files for each combination of populations. For example, for the tree (((Population1, Population2)PopWithError)Chimpanzee) the files containing the results are:&lt;br /&gt;
;[[1)result.Population1.Population2.PopWithError.Chimpanzee.Observed.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[2) result.Population1.Population2.PopWithError.Chimpanzee.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[3) result.Population1.Population2.PopWithError.Chimpanzee.TransRem.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
;[[4) result.Population1.Population2.PopWithError.Chimpanzee.TransRem.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, and the names of the four populations. Note that the number of patterns might not be an integer because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
&lt;br /&gt;
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:&lt;br /&gt;
;-the file barPlotErrors.Population1.Population2.PopWithError.Chimpanzee.pdf showing a barplot of the error rates&lt;br /&gt;
;-the file plotAddErr.A2T.Population1.Population2.PopWithError.Chimpanzee.pdf showing the effect of error correction on the substitution A--&amp;gt;T&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2934</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2934"/>
		<updated>2017-02-28T17:49:40Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each of the groups. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeFile		       (null)   file with sizes of the populations	&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
        -useLast                        0       1=use the last group of bam files as outgroup&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the ABBA and BABA sites for every 4-population tree that can be built from the data, with the outgroup fixed.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4 and 5 are the numerator and denominator of the D-statistic for that block. Column 6 is the number of sites containing data in that block. The remaining 256 columns are the normalized counts of the 256 allele patterns across the 4 populations, running from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T. Every block is repeated once for each tree that is built.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
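As a small illustration of the column naming described above, the following Python sketch enumerates the 256 pattern labels and decodes them into bases. The function names are hypothetical, and the ordering is assumed from the X0000, X0001, ..., X3333 sequence given in the text rather than taken from the ANGSD source.&lt;br /&gt;

```python
from itertools import product

BASES = "ACGT"  # correspondence 0=A, 1=C, 2=G, 3=T, as stated in the text


def pattern_names():
    """List the 256 pattern labels in base-4 counting order (assumed order)."""
    return ["X" + "".join(str(d) for d in digits)
            for digits in product(range(4), repeat=4)]


def decode(name):
    """Translate a label such as X0001 into its four bases (AAAC)."""
    return "".join(BASES[int(ch)] for ch in name[1:])


names = pattern_names()
print(len(names), names[1], decode(names[1]))  # 256 X0001 AAAC
```

Decoding the labels this way can help when picking individual pattern columns out of the file by name.&lt;br /&gt;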
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove ancient transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [int]&lt;br /&gt;
Size of each block in bases. Choose a size larger than the extent of LD in the populations. For humans, 5Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -useLast [int]&lt;br /&gt;
1: use the last group of bam files as the outgroup for the D-stat analysis. Default: 0 (use the fasta file as the outgroup)&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all outgroup reads carry the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts. Default: 100.&lt;br /&gt;
; -sizeFile [fileName]&lt;br /&gt;
file that specifies number of individuals in each population (more than 4 populations can be defined). If not provided, it is assumed that each population has only one individual.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
This tutorial requires Samtools and the R library 'pracma' to be installed.&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of ANGSD into your working folder from the GitHub repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R DSTAT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, place them in the folder ./bams/ and create a file bam.filelist listing their paths&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the 4-population test.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, generate a fasta file from one of our 10 bam files. We assume that this genome is of very high quality, so that we can use it as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -doCounts 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
We will apply error correction to the group with 3 individuals, using &quot;perfectSampleCEU&quot; as the high-quality reference genome.&lt;br /&gt;
The population of 3 individuals affected by transition errors occupies lines 6 to 8 of the file bam.filelist. We select those individuals and write them to another file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 6,8p bam.filelist &amp;gt; bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &quot;doAncError&quot; to generate the intermediate files that will later serve as input for the R script that calculates the D-statistic. &quot;doAncError&quot; applies the so-called &quot;perfect individual assumption&quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out errorFile -bam bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==4-population test==&lt;br /&gt;
In this tutorial we perform the ABBABABA test on all combinations of 4 populations among 6 populations of sizes 1, 2, 2, 3, 2 and 1 individuals, where the last population is fixed as the outgroup (so that there are 30 possible combinations). The last population is represented by the fasta file given with the option -anc, which is used as the outgroup when -useLast 0 is set. Alternatively, the last population of .bam files can be used as the outgroup with the option -useLast 1. Create a file named sizeFile.size and write the size of each population, one per line (IT IS NECESSARY to include the size of the -anc outgroup population, which is always 1):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
2&lt;br /&gt;
2&lt;br /&gt;
3&lt;br /&gt;
2&lt;br /&gt;
1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We target three chromosomes, restricting one of them to loci between positions 10Mb and 15Mb. To do so, create a file called regions.txt containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1:&lt;br /&gt;
5:&lt;br /&gt;
16:10000000-15000000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
After running ANGSD to count the ABBA and BABA combinations, we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -rf regions.txt -useLast 0 -minQ 20 -minMapQ 30 -p 1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[The output file is]]&lt;br /&gt;
[[bam.Angsd.abbababa2 (used for the 4-population test)]]&lt;br /&gt;
Each line represents a block for one of the 30 possible trees (so each block is written on 30 lines), with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4, 5 and 6 are the numerator, the denominator and the number of sites analyzed in the block. The next 256 columns are the counts of the allele patterns in the tree, running from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
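The count of 30 trees can be checked with a short Python sketch. It assumes that a tree is defined by choosing three of the five non-outgroup populations and then deciding which of the three takes the H3 position; the enumeration order and the generic labels are illustrative and need not match the order in which ANGSD writes the trees.&lt;br /&gt;

```python
from itertools import combinations


def enumerate_trees(ingroups, outgroup):
    """Return (H1, H2, H3, H4) tuples with H4 fixed to the outgroup."""
    trees = []
    for trio in combinations(ingroups, 3):
        for h3 in trio:
            # The two remaining populations take the H1 and H2 positions.
            h1, h2 = [p for p in trio if p != h3]
            trees.append((h1, h2, h3, outgroup))
    return trees


ingroups = ["pop1", "pop2", "pop3", "pop4", "pop5"]  # illustrative labels
trees = enumerate_trees(ingroups, "outgroup")
print(len(trees))  # 30
```

There are 10 ways to choose three populations out of five and 3 choices of H3 for each, giving the 30 trees mentioned above.&lt;br /&gt;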
&lt;br /&gt;
We run the R script specifying the error file for the population with 3 individuals. This is done by listing the error file of each population in a text file (including a line for the outgroup population). If a population has no error file, simply write NA. Create a file called errorList.error containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
./errorFile.ancError&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We also want to study the effect of error correction when an error rate between -0.005 and 0.005, in steps of 0.001, is added to each population individually for the transition A-&gt;T. More transitions can be analyzed by separating the letters with commas. We also need a file with the names of the 6 populations. Create a file popNames.name containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Population1&lt;br /&gt;
Population2&lt;br /&gt;
Population3&lt;br /&gt;
PopWithError&lt;br /&gt;
Population4&lt;br /&gt;
Chimpanzee&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Run the Rscript with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript DSTAT angsdFile=&amp;quot;bam.Angsd&amp;quot; out=&amp;quot;result&amp;quot; sizeFile=sizeFile.size errFile=errorList.error nameFile=popNames.name addErr=&amp;quot;-0.005,0.005,0.001;A;T&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script will show the calculated D-statistic along with the Z-score, p-value, standard deviation and other quantities for all 30 4-population trees. Note: if error correction is not needed, simply do not specify any error file. If no names are provided, the script assigns Population_* as default names. If no size file is provided, the script assigns size 1 to each population. At least one of the name file and the size file is required. The size file used with ANGSD can be reused.&lt;br /&gt;
&lt;br /&gt;
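To make the addErr argument used in the command above more concrete, the following Python sketch splits a string of that form into the grid of added error rates and the two base lists. It is only an illustration of the format as described in this tutorial: the function name is hypothetical, the inclusion of both end points is an assumption, and the R script may parse the argument differently.&lt;br /&gt;

```python
def parse_add_err(spec):
    """Split an addErr string of the form "start,stop,step;FROM;TO"."""
    range_part, from_part, to_part = spec.split(";")
    start, stop, step = (float(x) for x in range_part.split(","))
    # Assumption: both end points of the range are included.
    n_values = int(round((stop - start) / step)) + 1
    # Rounding hides floating-point noise in the generated grid.
    rates = [round(start + i * step, 10) for i in range(n_values)]
    return rates, from_part.split(","), to_part.split(",")


rates, from_bases, to_bases = parse_add_err("-0.005,0.005,0.001;A;T")
print(len(rates), rates[0], rates[-1], from_bases, to_bases)
```

With the string from the tutorial this gives 11 error rates, from -0.005 to 0.005, applied to the change from A to T.&lt;br /&gt;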
The plots of the type-specific error rates and of the added errors are saved as PDF files in the folder result.errorDataFolder (only if you provided an error file and added error rates in the options of the R script). The other text files are used by the R script for its computations and do not contain useful information.&lt;br /&gt;
The D-statistics are contained in four distinct files for each combination of populations. For example, for the tree (((Population1, Population2)PopWithError)Chimpanzee) the files containing the results are:&lt;br /&gt;
;[[1)result.Population1.Population2.PopWithError.Chimpanzee.Observed.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[2) result.Population1.Population2.PopWithError.Chimpanzee.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[3) result.Population1.Population2.PopWithError.Chimpanzee.TransRem.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
;[[4) result.Population1.Population2.PopWithError.Chimpanzee.TransRem.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&lt;br /&gt;
Specifically, the four files contain the following values: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, followed by the names of the four populations. Note that the pattern counts may not be integers because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
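As a rough illustration of how quantities of this kind relate to the per-block numerators and denominators, the Python sketch below applies a plain delete-one block jackknife. This is a generic textbook version under stated assumptions, not the procedure implemented in estAvgError.R, which may weight blocks differently; the function name and the input numbers are made up.&lt;br /&gt;

```python
import math


def block_jackknife(numerators, denominators):
    """Delete-one block jackknife for D = sum(num) / sum(den)."""
    n = len(numerators)
    total_num = sum(numerators)
    total_den = sum(denominators)
    d_all = total_num / total_den
    # D recomputed with each block left out in turn.
    loo = [(total_num - a) / (total_den - b)
           for a, b in zip(numerators, denominators)]
    mean_loo = sum(loo) / n
    jk_d = n * d_all - (n - 1) * mean_loo       # bias-corrected estimate
    var_jk = (n - 1) / n * sum((x - mean_loo) ** 2 for x in loo)
    z = jk_d / math.sqrt(var_jk)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return {"D": d_all, "JK-D": jk_d, "V(JK-D)": var_jk,
            "Z": z, "pvalue": p_value}


# Made-up per-block numerators and denominators, for illustration only.
result = block_jackknife([2.0, -1.0, 3.0, 0.5], [10.0, 8.0, 12.0, 9.0])
print(result)
```

The sketch needs at least two blocks whose leave-one-out estimates differ, otherwise the variance is zero and the Z score is undefined.&lt;br /&gt;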
&lt;br /&gt;
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:&lt;br /&gt;
;-the file barPlotErrors.Population1.Population2.PopWithError.Chimpanzee.pdf showing a barplot of the error rates&lt;br /&gt;
;-the file plotAddErr.A2T.Population1.Population2.PopWithError.Chimpanzee.pdf showing the effect of error correction on transition A--&amp;gt;T&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2933</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2933"/>
		<updated>2017-02-28T17:09:25Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or an incorrect tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeFile		       (null)   file with sizes of the populations	&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
        -useLast                        0       1=use the last group of bam files as outgroup&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the ABBA and BABA sites for every 4-population tree that can be built from the data, with the outgroup fixed.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4 and 5 are the numerator and denominator of the D-statistic for that block. Column 6 is the number of sites containing data in that block. The remaining 256 columns are the normalized counts of the 256 allele patterns across the 4 populations, running from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T. Every block is repeated once for each tree that is built.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove ancient transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [int]&lt;br /&gt;
Size of each block in bases. Choose a size larger than the extent of LD in the populations. For humans, 5Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -useLast [int]&lt;br /&gt;
1: use the last group of bam files as the outgroup for the D-stat analysis. Default: 0 (use the fasta file as the outgroup)&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all outgroup reads carry the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts. Default: 100.&lt;br /&gt;
; -sizeFile [fileName]&lt;br /&gt;
file that specifies number of individuals in each population (more than 4 populations can be defined). If not provided, it is assumed that each population has only one individual.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
This tutorial requires Samtools and the R library 'pracma' to be installed.&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of ANGSD into your working folder from the GitHub repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R DSTAT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, place them in the folder ./bams/ and create a file bam.filelist listing their paths&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the 4-population test.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, generate a fasta file from one of our 10 bam files. We assume that this genome is of very high quality, so that we can use it as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -doCounts 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
We will apply error correction to the group with 3 individuals, using &quot;perfectSampleCEU&quot; as the high-quality reference genome.&lt;br /&gt;
The population of 3 individuals affected by transition errors occupies lines 6 to 8 of the file bam.filelist. We select those individuals and write them to another file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 6,8p bam.filelist &amp;gt; bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &quot;doAncError&quot; to generate the intermediate files that will later serve as input for the R script that calculates the D-statistic. &quot;doAncError&quot; applies the so-called &quot;perfect individual assumption&quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out errorFile -bam bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==4-population test==&lt;br /&gt;
In this tutorial we perform the ABBABABA test on all combinations of 4 populations among 6 populations of sizes 1, 2, 2, 3, 2 and 1 individuals, where the last population is fixed as the outgroup (so that there are 30 possible combinations). The last population is represented by the fasta file given with the option -anc, which is used as the outgroup when -useLast 0 is set. Alternatively, the last population of .bam files can be used as the outgroup with the option -useLast 1. Create a file named sizeFile.size and write the size of each population, one per line (IT IS NECESSARY to include the size of the -anc outgroup population, which is always 1):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
2&lt;br /&gt;
2&lt;br /&gt;
3&lt;br /&gt;
2&lt;br /&gt;
1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We target two chromosomes, restricting one of them to loci between positions 10Mb and 15Mb. To do so, create a file called regions.txt containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1:&lt;br /&gt;
16:10000000-15000000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
After running ANGSD to count the ABBA and BABA combinations, we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -rf regions.txt -useLast 0 -minQ 20 -minMapQ 30 -p 1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[The output file is]]&lt;br /&gt;
[[bam.Angsd.abbababa2 (used for the 4-population test)]]&lt;br /&gt;
Each line represents a block for one of the 30 possible trees (so each block is written on 30 lines), with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4, 5 and 6 are the numerator, the denominator and the number of sites analyzed in the block. The next 256 columns are the counts of the allele patterns in the tree, running from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script specifying the error file for the population with 3 individuals. This is done by listing the error file of each population in a text file (including a line for the outgroup population). If a population has no error file, simply write NA. Create a file called errorList.error containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
./errorFile.ancError&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We also want to study the effect of error correction when an error rate between -0.005 and 0.005, in steps of 0.001, is added to each population individually for the transition A-&gt;T. More transitions can be analyzed by separating the letters with commas. We also need a file with the names of the 6 populations. Create a file popNames.name containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Population1&lt;br /&gt;
Population2&lt;br /&gt;
Population3&lt;br /&gt;
PopWithError&lt;br /&gt;
Population4&lt;br /&gt;
Chimpanzee&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Run the Rscript with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript DSTAT angsdFile=&amp;quot;bam.Angsd&amp;quot; out=&amp;quot;result&amp;quot; sizeFile=sizeFile.size errFile=errorList.error nameFile=popNames.name addErr=&amp;quot;-0.005,0.005,0.001;A;T&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script will show the calculated D-statistic along with the Z-score, p-value, standard deviation and other quantities for all 30 4-population trees. Note: if error correction is not needed, simply do not specify any error file. If no names are provided, the script assigns Population_* as default names. If no size file is provided, the script assigns size 1 to each population. At least one of the name file and the size file is required.&lt;br /&gt;
&lt;br /&gt;
The plots of the type-specific error rates and of the added errors are saved as PDF files in the folder result.errorDataFolder (only if you provided an error file and added error rates in the options of the R script). The other text files are used by the R script for its computations and do not contain useful information.&lt;br /&gt;
The D-statistics are contained in four distinct files for each combination of populations. For example, for the tree (((Population1, Population2)PopWithError)Chimpanzee) the files containing the results are:&lt;br /&gt;
;[[1)result.Population1.Population2.PopWithError.Chimpanzee.Observed.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[2) result.Population1.Population2.PopWithError.Chimpanzee.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[3) result.Population1.Population2.PopWithError.Chimpanzee.TransRem.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
;[[4) result.Population1.Population2.PopWithError.Chimpanzee.TransRem.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&lt;br /&gt;
Specifically, the four files contain the following values: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, followed by the names of the four populations. Note that the pattern counts may not be integers because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
&lt;br /&gt;
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:&lt;br /&gt;
;-the file barPlotErrors.Population1.Population2.PopWithError.Chimpanzee.pdf showing a barplot of the error rates&lt;br /&gt;
;-the file plotAddErr.A2T.Population1.Population2.PopWithError.Chimpanzee.pdf showing the effect of error correction on transition A--&amp;gt;T&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2932</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2932"/>
		<updated>2017-02-24T08:50:28Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or an incorrect tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeFile		       (null)   file with sizes of the populations	&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
        -useLast                        0       1=use the last group of bam files as outgroup&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites for each 4-population tree that can be built from the data, with the outgroup fixed.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, identified by a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4 and 5 are the numerator and denominator of the D-statistic for that block. Column 6 is the number of sites containing data in the block. The remaining 256 columns are the normalized counts of the 256 allele patterns across the 4 populations, ordered X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. Every block is repeated once for each tree that is built.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
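As an illustration of the column layout described above, the following Python sketch decodes the pattern labels and splits one line of the file. It assumes whitespace-separated fields and that any header line has already been skipped; the function names are hypothetical and not part of ANGSD.&lt;br /&gt;

```python
from itertools import product

BASES = 'ACGT'  # index 0=A, 1=C, 2=G, 3=T, as stated in the text


def pattern_names():
    # Enumerate the 256 labels in the order X0000, X0001, ..., X3333.
    return ['X' + ''.join(str(i) for i in idx)
            for idx in product(range(4), repeat=4)]


def decode(label):
    # Turn a label such as 'X0001' into the allele pattern 'AAAC'.
    return ''.join(BASES[int(digit)] for digit in label[1:])


def split_block_line(line):
    # Assumption: fields are separated by whitespace (tabs or spaces).
    fields = line.split()
    return {
        'chrom': fields[0],
        'start': int(fields[1]),
        'end': int(fields[2]),
        'numerator': float(fields[3]),
        'denominator': float(fields[4]),
        'nsites': float(fields[5]),
        'patterns': dict(zip(pattern_names(), map(float, fields[6:262]))),
    }
```

For example, decode('X0001') gives 'AAAC' and decode('X3333') gives 'TTTT'.&lt;br /&gt;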
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove ancient transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [int]&lt;br /&gt;
Size of each block. Choose a value larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -useLast [int]&lt;br /&gt;
1: use the last group of bam files as outgroup for the D-stat analysis. Default: 0 (use the fasta file as outgroup)&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 in order to count the bases at each site after filters.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup show the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts. Default: 100.&lt;br /&gt;
; -sizeFile [fileName]&lt;br /&gt;
file that specifies number of individuals in each population (more than 4 populations can be defined). If not provided, it is assumed that each population has only one individual.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
This tutorial requires Samtools and the R library 'pracma' to be installed.&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of angsd from the GitHub repository into your working folder, and build it before continuing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R DSTAT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, place them in the folder ./bams/ and create a file bam.filelist listing the paths of those datasets&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam; do samtools index &amp;quot;$i&amp;quot;; done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the 4-population test.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
rm chimpHg19.fa.gz&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now generate a fasta file from one of our 10 bam files. We assume this genome is of very high quality, so that we can use it as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -doCounts 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
We will apply error correction to the group with 3 individuals, using &amp;quot;perfectSampleCEU&amp;quot; as the high-quality reference genome.&lt;br /&gt;
The population of 3 individuals affected by transition errors occupies lines 6 to 8 of the file bam.filelist. We select those individuals and write them to another file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 6,8p bam.filelist &amp;gt; bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
and then we use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later serve as input for the R script that calculates the D-statistic. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out errorFile -bam bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==4-population test==&lt;br /&gt;
In this tutorial we perform the ABBABABA test on all combinations of 4 populations among 6 populations of size 1,2,2,3,2,1 individuals, where the last population is fixed as the outgroup (so that there are 30 possible combinations). The last population is the fasta file given with the option -anc, which is used as the outgroup when -useLast 0 is set. To use the last population of .bam files as the outgroup instead, set -useLast 1. Create a file named sizeFile.size and write the size of each population (it is NOT necessary to define the size of the -anc fasta outgroup):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
2&lt;br /&gt;
2&lt;br /&gt;
3&lt;br /&gt;
2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
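The count of 30 trees can be checked with a short Python sketch. It assumes that each tree consists of an unordered pair (H1,H2), a third population H3 and the fixed outgroup; the names P1 to P5 and Outgroup are placeholders, and the order in which ANGSD actually writes the trees is not implied.&lt;br /&gt;

```python
from itertools import combinations


def enumerate_trees(ingroups, outgroup):
    # For every set of three ingroup populations, each member can take
    # the H3 position once; the remaining two form the pair (H1, H2).
    trees = []
    for trio in combinations(ingroups, 3):
        for h3 in trio:
            h1, h2 = [pop for pop in trio if pop != h3]
            trees.append((h1, h2, h3, outgroup))
    return trees


# Placeholder names for the five ingroup populations of the tutorial.
all_trees = enumerate_trees(['P1', 'P2', 'P3', 'P4', 'P5'], 'Outgroup')
print(len(all_trees))  # prints 30
```

With 5 ingroup populations there are 10 ways to choose three of them and 3 ways to pick H3 in each, which gives the 30 trees mentioned above.&lt;br /&gt;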
We target positions 10000000 to 15000000 on chromosomes 1, 2 and 3. Create a file called regions.txt containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1:10000000-15000000&lt;br /&gt;
2:10000000-15000000&lt;br /&gt;
3:10000000-15000000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
After running ANGSD to count ABBA and BABA combinations, we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -rf regions.txt -useLast 0 -minQ 20 -minMapQ 30 -p 1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[The output file is]]&lt;br /&gt;
[[bam.Angsd.abbababa2 (used for the 4-population test)]]&lt;br /&gt;
Each line represents a block for one of the 30 possible trees (so each block is written on 30 lines), with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4, 5 and 6 are the numerator, the denominator and the number of sites analyzed in the block. The next 256 columns are the counted patterns of alleles in the tree, ordered X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script specifying the error file for the population with 3 individuals. This is done by listing the error file of each population in a text file (including a line for the outgroup population). If a population has no error file, write NA. Create a file called errorList.error containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
./errorFile.ancError&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We also want to study the effect of error correction when an error rate between -0.005 and 0.005, in steps of 0.001, is added to each population individually for the substitution A-&gt;T. You can specify more than one substitution by separating the letters with a comma. We need to create a file with the names of the 6 populations. Create a file popNames.name containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Population1&lt;br /&gt;
Population2&lt;br /&gt;
Population3&lt;br /&gt;
PopWithError&lt;br /&gt;
Population4&lt;br /&gt;
Chimpanzee&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Run the Rscript with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript DSTAT angsdFile=&amp;quot;bam.Angsd&amp;quot; out=&amp;quot;result&amp;quot; sizeFile=sizeFile.size errFile=errorList.error nameFile=popNames.name addErr=&amp;quot;-0.005,0.005,0.001;A;T&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script will show the calculated D-statistic along with the Z-score, p-value, standard deviation and other quantities for all 30 4-population trees. Note: if error correction is not needed, simply do not specify any error file. If no names are provided, the script assigns Population_* as default names. If no size file is provided, the script assigns size 1 to each population. At least one of the name file and the size file is required.&lt;br /&gt;
&lt;br /&gt;
The plots of error rates for type specific errors and added errors are in the folder result.errorDataFolder as pdf files (only if you have provided an error rate file and added error rates in the options for the R script). The other text files are used by the Rscript for the necessary computations and do not contain useful information.&lt;br /&gt;
The D-statistics are contained in four distinct files for each combination of populations. For example, for the tree (((Population1, Population2)PopWithError)Chimpanzee) the files containing the results are:&lt;br /&gt;
;[[1)result.Population1.Population2.PopWithError.Chimpanzee.Observed.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[2) result.Population1.Population2.PopWithError.Chimpanzee.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[3) result.Population1.Population2.PopWithError.Chimpanzee.TransRem.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
;[[4) result.Population1.Population2.PopWithError.Chimpanzee.TransRem.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, followed by the names of the four populations. Note that the pattern counts might not be integers because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
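To illustrate how these quantities relate to each other, here is a simplified Python sketch of a delete-one block jackknife on the per-block numerators and denominators. It is unweighted and assumes a two-sided normal p-value, so it is not necessarily identical to what estAvgError.R computes; the function name is hypothetical.&lt;br /&gt;

```python
import math


def jackknife_d(numerators, denominators):
    # Overall D: ratio of the summed block numerators and denominators.
    n_blocks = len(numerators)
    total_num = sum(numerators)
    total_den = sum(denominators)
    d_stat = total_num / total_den

    # Delete-one estimates: recompute D leaving out one block at a time.
    leave_one_out = [(total_num - num) / (total_den - den)
                     for num, den in zip(numerators, denominators)]
    mean_loo = sum(leave_one_out) / n_blocks

    # Unweighted jackknife variance (assumption: blocks weigh equally).
    variance = (n_blocks - 1) / n_blocks * sum(
        (value - mean_loo) ** 2 for value in leave_one_out)

    # Z score and two-sided p-value under a standard normal.
    z_score = d_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z_score) / math.sqrt(2))
    return d_stat, variance, z_score, p_value
```

The sketch needs at least two blocks and a non-zero variance; with real data the blocks differ in the number of informative sites, which is why a weighted jackknife is usually preferred.&lt;br /&gt;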
&lt;br /&gt;
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:&lt;br /&gt;
;-the file barPlotErrors.Population1.Population2.PopWithError.Chimpanzee.pdf showing a barplot of the error rates&lt;br /&gt;
;-the file plotAddErr.A2T.Population1.Population2.PopWithError.Chimpanzee.pdf showing the effect of error correction on the substitution A--&amp;gt;T&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2931</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2931"/>
		<updated>2017-02-15T11:03:35Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeFile		       (null)   file with sizes of the populations	&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
        -useLast                        0       set fasta file defined by -anc as outgroup in the analysis&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites for each of the possible 4-population trees that can be built from the data.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, identified by a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, ordered X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. Every block is repeated once for each combination of 4 populations that can be formed from the data (with one population fixed as outgroup).&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove ancient transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [int]&lt;br /&gt;
Size of each block. Choose a value larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -useLast [int]&lt;br /&gt;
1: use the file defined by -anc as outgroup of the D-stat analysis. Default: 0 (use last population as a fixed outgroup)&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 in order to count the bases at each site after filters.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup show the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts. Default: 100.&lt;br /&gt;
; -sizeFile [fileName]&lt;br /&gt;
file that specifies number of individuals in each population (more than 4 populations can be defined)&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
This tutorial requires Samtools and the R library 'pracma' to be installed.&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of angsd from the GitHub repository into your working folder, and build it before continuing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R DSTAT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, place them in the folder ./bams/ and create a file bam.filelist listing the paths of those datasets&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam; do samtools index &amp;quot;$i&amp;quot;; done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the 4-population test.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
rm chimpHg19.fa.gz&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now generate a fasta file from one of our 10 bam files. We assume this genome is of very high quality, so that we can use it as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -doCounts 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
We will apply error correction to the group with 3 individuals, using &amp;quot;perfectSampleCEU&amp;quot; as the high-quality reference genome.&lt;br /&gt;
The population of 3 individuals affected by transition errors occupies lines 6 to 8 of the file bam.filelist. We select those individuals and write them to another file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 6,8p bam.filelist &amp;gt; bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
and then we use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later serve as input for the R script that calculates the D-statistic. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out errorFile -bam bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==4-population test==&lt;br /&gt;
In this tutorial we perform the ABBABABA test on all combinations of 4 populations among 6 populations of size 1,2,2,3,2,1 individuals, where the last population is fixed as the outgroup (so that there are 30 possible combinations). The last population is the fasta file given with the option -anc, which is used as the outgroup when -useLast 1 is set. To use the last population of .bam files as the outgroup instead, set -useLast 0. Create a file named sizeFile.size and write the size of each population (it is NOT necessary to define the size of the -anc fasta outgroup):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
2&lt;br /&gt;
2&lt;br /&gt;
3&lt;br /&gt;
2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We target positions 10000000 to 15000000 on chromosomes 1, 2 and 3. Create a file called regions.txt containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1:10000000-15000000&lt;br /&gt;
2:10000000-15000000&lt;br /&gt;
3:10000000-15000000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
After running ANGSD to count ABBA and BABA combinations, we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -rf regions.txt -useLast 1 -minQ 20 -minMapQ 30 -p 3&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[The output file is]]&lt;br /&gt;
[[bam.Angsd.abbababa2 (used for the 4-population test)]]&lt;br /&gt;
Each line represents a block for one of the 30 possible combinations (so each block is written on 30 lines), with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4, 5 and 6 are the numerator, the denominator and the number of sites analyzed. The next 256 columns are the counted patterns of alleles, ordered X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script specifying the error file for the population with 3 individuals. This is done by listing the error file of each population in a text file (including a line for the outgroup population). If a population has no error file, write NA. Create a file called errorList.error containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
./errorFile.ancError&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We also want to study the effect of error correction when an error rate between -0.005 and 0.005, in steps of 0.001, is added to each population individually for the substitution A-&gt;T. You can specify more than one substitution by separating the letters with a comma. We need to create a file with the names of the 6 populations. Create a file popNames.name containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Population1&lt;br /&gt;
Population2&lt;br /&gt;
Population3&lt;br /&gt;
PopWithError&lt;br /&gt;
Population4&lt;br /&gt;
Chimpanzee&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Run the Rscript with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript DSTAT angsdFile=&amp;quot;bam.Angsd&amp;quot; out=&amp;quot;result&amp;quot; sizeFile=sizeFile.size errFile=errorList.error nameFile=popNames.name addErr=&amp;quot;-0.005,0.005,0.001;A;T&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script will show the calculated D-statistic along with the Z-score, p-value, standard deviation and other quantities for all 30 4-population trees.&lt;br /&gt;
The plots of error rates for type specific errors and added errors are in the folder result.errorDataFolder as pdf files (only if you have provided an error rate file and added error rates in the options for the R script). The other text files are used by the Rscript for the necessary computations and do not contain useful information.&lt;br /&gt;
The D-statistics are contained in four distinct files for each combination of populations. For example, for the tree (((Population1, Population2)PopWithError)Chimpanzee) the files containing the results are:&lt;br /&gt;
;[[1)result.Population1.Population2.PopWithError.Chimpanzee.Observed.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[2) result.Population1.Population2.PopWithError.Chimpanzee.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[3) result.Population1.Population2.PopWithError.Chimpanzee.TransRem.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
;[[4) result.Population1.Population2.PopWithError.Chimpanzee.TransRem.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, followed by the names of the four populations. Note that the pattern counts might not be integers because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
&lt;br /&gt;
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:&lt;br /&gt;
;-the file barPlotErrors.Population1.Population2.PopWithError.Chimpanzee.pdf showing a barplot of the error rates&lt;br /&gt;
;-the file plotAddErr.A2T.Population1.Population2.PopWithError.Chimpanzee.pdf showing the effect of error correction on the substitution A--&amp;gt;T&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2911</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2911"/>
		<updated>2017-02-06T12:53:06Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeFile		       (null)   file with sizes of the populations	&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
        -useLast                        0       include fasta file defined by -anc in the analysis&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The new columns are the counts of all 256 possible combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. Every block is repeated once for each combination of 4 populations that can be formed from the data.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
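The naming of the 256 count columns can be decoded mechanically. The following is a minimal sketch; decode_pattern is a hypothetical helper written for illustration and is not part of ANGSD. It only applies the correspondence 0=A,1=C,2=G,3=T to a column name.&lt;br /&gt;

```shell
# Hypothetical helper (not part of ANGSD): translate a column name such as
# X0123 into its allele pattern, using the correspondence 0=A,1=C,2=G,3=T.
decode_pattern() {
    # strip the leading X, then map each digit to its base
    echo "$1" | sed 's/^X//' | tr '0123' 'ACGT'
}

decode_pattern X0000   # AAAA
decode_pattern X0001   # AAAC
decode_pattern X3333   # TTTT
```

The first and last calls reproduce the two end points of the column list, X0000=AAAA and X3333=TTTT.&lt;br /&gt;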
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove ancient transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [int]&lt;br /&gt;
Size of each block. Choose a size larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -useLast [int]&lt;br /&gt;
1: include the file defined by -anc in the D-stat analysis. Default: 0&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup have the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts. Default: 100.&lt;br /&gt;
; -sizeFile [fileName]&lt;br /&gt;
file that specifies number of individuals in each population (more than 4 populations can be defined)&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
&lt;br /&gt;
To filter bases by quality score, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of ANGSD into your working folder from the GitHub repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R DSTAT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download 10 example BAM files, place them in the folder ./bams/, and create a file bam.filelist listing their paths&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. It will be used as the outgroup for the 4-population test. A BAM file can be used as well (one of the other examples after the tutorial shows how).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #decompress; this also removes the .gz file&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now generate a fasta file from one of our 10 BAM files. We assume this genome is of very high quality, so that it can serve as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -doCounts 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
We will apply error correction to the group with 3 individuals, using &amp;quot;perfectSampleCEU&amp;quot; as the high-quality reference genome.&lt;br /&gt;
The population of 3 individuals affected by transition errors occupies lines 6 to 8 of the file bam.filelist. We select those individuals and write them to another file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 6,8p bam.filelist &amp;gt; bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later serve as input for the R script that calculates the D-statistic. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out errorFile -bam bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
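The line range 6 to 8 selected with sed is what one obtains if the populations occupy consecutive lines of bam.filelist, in order, with sizes 1,2,2,3,2. That layout is an assumption made here for illustration, and pop_ranges is a hypothetical helper, not part of ANGSD. A minimal sketch:&lt;br /&gt;

```shell
# Hypothetical helper (not part of ANGSD): print the range of lines of
# bam.filelist covered by each population, assuming the populations occupy
# consecutive lines in the order in which their sizes are given.
pop_ranges() {
    start=1
    pop=1
    for size in "$@"; do
        end=$((start + size - 1))
        echo "pop${pop}: lines ${start}-${end}"
        start=$((end + 1))
        pop=$((pop + 1))
    done
}

pop_ranges 1 2 2 3 2   # the fourth line printed is "pop4: lines 6-8"
```

Under this assumption the population with 3 individuals is the fourth one, which matches the sed command used to build bamWithErrors.filelist.&lt;br /&gt;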
&lt;br /&gt;
==4-population test==&lt;br /&gt;
In this tutorial we perform the ABBABABA test on all the combinations of 4 populations among 6 populations of 1,2,2,3,2,1 individuals (so that there are 360 possible combinations). The last population is represented by the fasta file given with the option -anc, whose use we enable with the option -useLast 1. Create a file named sizeFile.size and write the size of each population (omit the last one, which corresponds to the -anc fasta file):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
2&lt;br /&gt;
2&lt;br /&gt;
3&lt;br /&gt;
2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
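The figure of 360 combinations is consistent with counting ordered choices of 4 populations out of 6, that is 6*5*4*3. That the order of the populations matters is an inference from this number, not something taken from the ANGSD source, and count_quadruples is a hypothetical helper. A minimal sketch:&lt;br /&gt;

```shell
# Hypothetical helper (not part of ANGSD): number of ordered ways to pick
# 4 distinct populations out of n, i.e. n*(n-1)*(n-2)*(n-3).
count_quadruples() {
    n=$1
    echo $(( n * (n - 1) * (n - 2) * (n - 3) ))
}

count_quadruples 6   # 360
```

The same count gives the number of lines expected per block in the output when more or fewer populations are defined, under the same assumption.&lt;br /&gt;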
We target the range 10000000 to 15000000 on each of chromosomes 1, 2 and 3. Create a file called regions.txt containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1:10000000-15000000&lt;br /&gt;
2:10000000-15000000&lt;br /&gt;
3:10000000-15000000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
After running ANGSD, we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -rf regions.txt -useLast 1 -minQ 20 -minMapQ 30 -p 3&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[The output file is]]&lt;br /&gt;
[[bam.Angsd.abbababa2 (used for the 4-population test)]]&lt;br /&gt;
Each line represents a block for one of the 360 possible combinations (so each block is written on 360 lines), with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4, 5 and 6 are the numerator, the denominator and the number of sites analyzed. The next 256 columns are the counts of the allele patterns, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script, specifying the error file for the population with 3 individuals. This is done by listing the error file of each population in a text file, one line per population. If a population has no error file, write NA. Create a file called errorList.error containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
./errorFile.ancError&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We also want to study the effect of error correction when an error rate between -0.005 and 0.005, in steps of 0.001, is added to each population individually for the transition A-&amp;gt;T. Several transitions can be tested by separating the letters with commas. We also need a file with the names of the 6 populations. Create a file popNames.name containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Population1&lt;br /&gt;
Population2&lt;br /&gt;
Population3&lt;br /&gt;
PopWithError&lt;br /&gt;
Population4&lt;br /&gt;
Chimpanzee&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Run the R script with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript DSTAT angsdFile=&amp;quot;bam.Angsd&amp;quot; out=&amp;quot;result&amp;quot; sizeFile=sizeFile.size errFile=errorList.error nameFile=popNames.name addErr=&amp;quot;-0.005,0.005,0.001;A;T&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script will show the calculated D-statistic along with the Z-score, p-value, standard deviation and other quantities for all combinations.&lt;br /&gt;
The plots of the type-specific error rates and of the added errors are saved as pdf files in the folder result.errorDataFolder. The other files are used by the R script for the necessary computations and do not contain useful information.&lt;br /&gt;
The D-statistics are contained in four distinct files for each combination of populations. For example, for the populations Population1, Chimpanzee, Population2, Population3 the files containing the results are:&lt;br /&gt;
;[[1)result.Population1.Chimpanzee.Population2.Population3.Observed.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[2) result.Population1.Chimpanzee.Population2.Population3.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[3) result.Population1.Chimpanzee.Population2.Population3.TransRem.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
;[[4) result.Population1.Chimpanzee.Population2.Population3.RemTrans.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, and the names of the four populations. Note that the number of patterns might not be an integer because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
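To make the relation between these quantities concrete, the sketch below assumes the usual definitions D = (nABBA - nBABA) / (nABBA + nBABA) and Z = D / sqrt(variance). The exact computation performed by estAvgError.R is not reproduced here; d_stat and z_score are hypothetical helpers and the input numbers are made up for illustration.&lt;br /&gt;

```shell
# Hypothetical helpers (not part of ANGSD or estAvgError.R).
# d_stat nABBA nBABA : D = (nABBA - nBABA) / (nABBA + nBABA)
d_stat() {
    awk -v abba="$1" -v baba="$2" 'BEGIN { printf "%.4f\n", (abba - baba) / (abba + baba) }'
}

# z_score D variance : Z = D / sqrt(variance)
z_score() {
    awk -v d="$1" -v v="$2" 'BEGIN { printf "%.2f\n", d / sqrt(v) }'
}

d_stat 60.5 39.5      # 0.2100
z_score 0.21 0.0049   # 3.00
```

The counts passed to d_stat are deliberately non-integer, as the pattern counts can be when several genomes per population are used.&lt;br /&gt;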
&lt;br /&gt;
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:&lt;br /&gt;
;-the file barPlotErrors.Population1.Chimpanzee.Population2.Population3.pdf showing a barplot of the error rates&lt;br /&gt;
;-the file plotAddErr.A2T.Population1.Chimpanzee.Population2.Population3.pdf showing the effect of error correction on transition A--&amp;gt;T&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2910</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2910"/>
		<updated>2017-02-06T12:50:55Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or an incorrect tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeFile		       (null)   file with sizes of the populations	&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
        -useLast                        0       include fasta file defined by -anc in the analysis&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The new columns are the counts of all 256 possible combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. Every block is repeated once for each combination of 4 populations that can be formed from the data.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove ancient transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [int]&lt;br /&gt;
Size of each block. Choose a size larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -useLast [int]&lt;br /&gt;
1: include the file defined by -anc in the D-stat analysis. Default: 0&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup have the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts. Default: 100.&lt;br /&gt;
; -sizeFile [fileName]&lt;br /&gt;
file that specifies number of individuals in each population (more than 4 populations can be defined)&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
&lt;br /&gt;
To filter bases by quality score, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of ANGSD into your working folder from the GitHub repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R DSTAT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download 10 example BAM files, place them in the folder ./bams/, and create a file bam.filelist listing their paths&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. It will be used as the outgroup for the 4-population test. A BAM file can be used as well (one of the other examples after the tutorial shows how).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #decompress; this also removes the .gz file&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now generate a fasta file from one of our 10 BAM files. We assume this genome is of very high quality, so that it can serve as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -doCounts 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this tutorial we perform the ABBABABA test on all the combinations of 4 populations among 6 populations of 1,2,2,3,2,1 individuals (so that there are 360 possible combinations). The last population is represented by the fasta file given with the option -anc, whose use we enable with the option -useLast 1. Create a file named sizeFile.size and write the size of each population (omit the last one, which corresponds to the -anc fasta file):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
2&lt;br /&gt;
2&lt;br /&gt;
3&lt;br /&gt;
2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
We will apply error correction to the group with 3 individuals, using &amp;quot;perfectSampleCEU&amp;quot; as the high-quality reference genome.&lt;br /&gt;
The population of 3 individuals affected by transition errors occupies lines 6 to 8 of the file bam.filelist. We select those individuals and write them to another file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 6,8p bam.filelist &amp;gt; bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later serve as input for the R script that calculates the D-statistic. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out errorFile -bam bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==4-population test==&lt;br /&gt;
&lt;br /&gt;
Now we run the four-population test using the 6 groups of 1,2,2,3,2,1 individuals (where the last one is the fasta file). We target three regions, between positions 10000000 and 15000000 on chromosomes 1, 2 and 3. Create a file called regions.txt containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1:10000000-15000000&lt;br /&gt;
2:10000000-15000000&lt;br /&gt;
3:10000000-15000000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
After running ANGSD, we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -rf regions.txt -useLast 1 -minQ 20 -minMapQ 30 -p 3&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[The output file is]]&lt;br /&gt;
[[bam.Angsd.abbababa2 (used for the 4-population test)]]&lt;br /&gt;
Each line represents a block for one of the 360 possible combinations (so each block is written on 360 lines), with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4, 5 and 6 are the numerator, the denominator and the number of sites analyzed. The next 256 columns are the counts of the allele patterns, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script, specifying the error file for the population with 3 individuals. This is done by listing the error file of each population in a text file, one line per population. If a population has no error file, write NA. Create a file called errorList.error containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
./errorFile.ancError&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We also want to study the effect of error correction when an error rate between -0.005 and 0.005, in steps of 0.001, is added to each population individually for the transition A-&amp;gt;T. Several transitions can be tested by separating the letters with commas. We also need a file with the names of the 6 populations. Create a file popNames.name containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Population1&lt;br /&gt;
Population2&lt;br /&gt;
Population3&lt;br /&gt;
PopWithError&lt;br /&gt;
Population4&lt;br /&gt;
Chimpanzee&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Run the R script with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript DSTAT angsdFile=&amp;quot;bam.Angsd&amp;quot; out=&amp;quot;result&amp;quot; sizeFile=sizeFile.size errFile=errorList.error nameFile=popNames.name addErr=&amp;quot;-0.005,0.005,0.001;A;T&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script will show the calculated D-statistic along with the Z-score, p-value, standard deviation and other quantities for all combinations.&lt;br /&gt;
The plots of the type-specific error rates and of the added errors are saved as pdf files in the folder result.errorDataFolder. The other files are used by the R script for the necessary computations and do not contain useful information.&lt;br /&gt;
The D-statistics are contained in four distinct files for each combination of populations. For example, for the populations Population1, Chimpanzee, Population2, Population3 the files containing the results are:&lt;br /&gt;
;[[1)result.Population1.Chimpanzee.Population2.Population3.Observed.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[2) result.Population1.Chimpanzee.Population2.Population3.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[3) result.Population1.Chimpanzee.Population2.Population3.TransRem.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
;[[4) result.Population1.Chimpanzee.Population2.Population3.RemTrans.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, and the names of the four populations. Note that the number of patterns might not be an integer because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
&lt;br /&gt;
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:&lt;br /&gt;
;-the file barPlotErrors.Population1.Chimpanzee.Population2.Population3.pdf showing a barplot of the error rates&lt;br /&gt;
;-the file plotAddErr.A2T.Population1.Chimpanzee.Population2.Population3.pdf showing the effect of error correction on transition A--&amp;gt;T&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2909</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2909"/>
		<updated>2017-02-06T12:43:39Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or an incorrect tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeFile		       (null)   file with sizes of the populations	&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
        -useLast                        0       include fasta file defined by -anc in the analysis&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The new columns are the counts of all 256 possible combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. Every block is repeated once for each combination of 4 populations that can be formed from the data.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove ancient transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [int]&lt;br /&gt;
Size of each block. Choose a size larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -useLast [int]&lt;br /&gt;
1: include the file defined by -anc in the D-stat analysis. Default: 0&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup have the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts. Default: 100.&lt;br /&gt;
; -sizeFile [fileName]&lt;br /&gt;
file that specifies number of individuals in each population (more than 4 populations can be defined)&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of angsd into your working folder from the github repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R DSTAT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, position them in the folder ./bams/ and create a file bam.filelist listing the pathnames of those datasets &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the 4-population test. One can use a bam file as well (see in one of the other examples after the tutorial how to do it).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #decompress; gunzip also removes the .gz file&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, generate a fasta file for one of our 10 bam files. We assume that this genome has very high quality, so that we can use it as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -doCounts 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this tutorial we perform the ABBABABA test on all the combinations of 4 populations among 6 populations of size 1,2,2,3,2,1 individuals (so that there are 360 possible combinations). The last population is represented by the fasta file defined with the option -anc, whose use we enable with the option -useLast 1. Create a file named sizeFile.size and write the size of each population, one per line (omit the last population, which is the -anc fasta file):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
2&lt;br /&gt;
2&lt;br /&gt;
3&lt;br /&gt;
2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
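As a sanity check on the numbers above, the sizes in sizeFile.size should add up to the 10 BAM files in bam.filelist, and the figure of 360 is consistent with counting ordered choices of 4 populations out of 6 (6*5*4*3). The sketch below is illustrative only: sum_sizes and ordered_quartets are hypothetical helper names, and treating the combinations as ordered is an assumption based on that figure.&lt;br /&gt;

```shell
# Hypothetical helper: sum the population sizes given one per line on stdin.
sum_sizes() {
  awk '{ s += $1 } END { print s + 0 }'
}

# Hypothetical helper: number of ordered choices of 4 populations
# out of n, that is n*(n-1)*(n-2)*(n-3).
ordered_quartets() {
  echo $(( $1 * ($1 - 1) * ($1 - 2) * ($1 - 3) ))
}

printf '%s\n' 1 2 2 3 2 | sum_sizes   # 10, one per BAM file in bam.filelist
ordered_quartets 6                     # 360
```

If the sum does not match the number of lines in bam.filelist, the sizes and the list of BAM files are out of step.&lt;br /&gt;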
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
We will apply error correction to the group with 3 individuals, using &amp;quot;perfectSampleCEU&amp;quot; as the high-quality reference genome.&lt;br /&gt;
The 3 individuals of the population affected by transition errors are listed on lines 6 to 8 of the file bam.filelist. We select those individuals and write them to another file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 6,8p bam.filelist &amp;gt; bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that we will later use as input for the R script that calculates the D-statistic. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out errorFile -bam bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
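The line range 6 to 8 passed to sed above follows from the population sizes: a population starts on the line after the cumulative size of the populations before it. A minimal sketch, assuming the BAM files in bam.filelist are listed in the same order as the populations in sizeFile.size; pop_line_range is a hypothetical helper, not part of ANGSD.&lt;br /&gt;

```shell
# Hypothetical helper: print "first,last" line numbers in bam.filelist
# for population number $1, reading the sizes (one per line) from stdin.
pop_line_range() {
  awk -v k="$1" 'NR == k { print start + 1 "," start + $1 } { start += $1 }'
}

printf '%s\n' 1 2 2 3 2 | pop_line_range 4   # 6,8
```

The printed range has the form of a sed address, matching the 6,8 used in the sed command of this section.&lt;br /&gt;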
&lt;br /&gt;
==4-population test==&lt;br /&gt;
&lt;br /&gt;
Now we want to run the four-population test using the 6 groups of 1,2,2,3,2,1 individuals (where the last one is the fasta file). We decide to target three regions, between positions 10000000 and 15000000 on chromosomes 1, 2 and 3. Thus create a file called regions.txt containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1:10000000-15000000&lt;br /&gt;
2:10000000-15000000&lt;br /&gt;
3:10000000-15000000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
After running ANGSD we will call the R script that applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -rf regions.txt -useLast 1 -minQ 20 -minMapQ 30 -p 3&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[The output file is]]&lt;br /&gt;
[[bam.Angsd.abbababa2 (used for the 4-population test)]]&lt;br /&gt;
Each line represents a block for one of the 360 possible combinations (so each block is written on 360 lines), with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4, 5 and 6 are the numerator, the denominator and the number of sites analyzed. The next 256 columns are the counted patterns of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script specifying the error files for the population with 3 individuals. This is done by listing the error file of each population, one per line, in a text file. If a population has no error file, it is sufficient to write NA. Create a file called errorList.error containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
./errorFile.ancError&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We also want to study the effect of error correction when an error rate between -0.005 and 0.005, with step 0.001, is added individually to each population for the substitution A-&amp;gt;T. You can test more substitutions by separating the letters with a comma. We need to create a file with the names of the 6 populations. Create a file popNames.name containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Population1&lt;br /&gt;
Population2&lt;br /&gt;
Population3&lt;br /&gt;
PopWithError&lt;br /&gt;
Population4&lt;br /&gt;
Chimpanzee&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Run the Rscript with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript DSTAT angsdFile=&amp;quot;bam.Angsd&amp;quot; out=&amp;quot;result&amp;quot; sizeFile=sizeFile.size errFile=errorList.error nameFile=popNames.name addErr=&amp;quot;-0.005,0.005,0.001;A;T&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
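The first field of addErr reads as minimum, maximum and step. The sketch below expands such a triple into the list of error rates it would denote, assuming both end points are included (an assumption about the R script, not something stated above); error_grid is a hypothetical helper.&lt;br /&gt;

```shell
# Hypothetical helper: expand "min,max,step" into one error rate per line,
# assuming both end points are included.
error_grid() {
  printf '%s\n' "$1" | awk -F, '{
    n = int(($2 - $1) / $3 + 0.5)            # number of steps, rounded
    for (i = 0; i != n + 1; i++)
      printf "%.3f\n", $1 + i * $3
  }'
}

error_grid "-0.005,0.005,0.001" | wc -l   # 11 values under the inclusive assumption
```

Under that assumption the command above evaluates the correction at 11 added error rates per population.&lt;br /&gt;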
The script will show the calculated D statistic along with Z-score, Pvalues, Standard deviation and other quantities for all combinations.&lt;br /&gt;
The plots of error rates for type specific errors and added errors are in the folder result.errorDataFolder as pdf files. The other files are used by the Rscript for the necessary computations and do not contain useful information.&lt;br /&gt;
The D-statistics are contained in four distinct files for each combination of populations. For example, for the populations Population1, Chimpanzee, Population2, Population3 the files containing the results are:&lt;br /&gt;
;[[1)result.Observed.Population1.Chimpanzee.Population2.Population3.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[2) result.Population1.Chimpanzee.Population2.Population3.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[3) result.Population1.Chimpanzee.Population2.Population3.TransRem.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
;[[4) result.RemTrans.Population1.Chimpanzee.Population2.Population3.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=pvalue from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, followed by the names of the four populations. Note that the number of patterns might not be an integer because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
&lt;br /&gt;
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:&lt;br /&gt;
;-the file barPlotErrors.Population1.Chimpanzee.Population2.Population3.pdf showing a barplot of the error rates&lt;br /&gt;
;-the file plotAddErr.A2T.Population1.Chimpanzee.Population2.Population3.pdf showing the effect of error correction on transition A--&amp;gt;T&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2908</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2908"/>
		<updated>2017-02-06T12:41:01Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the abbababa test (aka D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), multiple individuals are allowed in each group. As with all methods in ANGSD, we require that the headers of the BAM files are the same.&lt;br /&gt;
; some of the options only work in the development version available from github&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeFile		       (null)   file with sizes of the populations	&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
        -useLast                        0       include fasta file defined by -anc in the analysis&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The following columns are the counts of all 256 possible combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. Every block is repeated once for each combination of 4 populations that can be formed from the data.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default). 1: remove ancient transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [int]&lt;br /&gt;
Size of each block. Choose a size larger than the extent of LD in the populations. For humans, 5Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -useLast [int]&lt;br /&gt;
1: include the file defined by -anc in the D-stat analysis. Default: 0&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 in order to count the bases at each site after filters.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup have the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual. 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA BABA counts. Default 100.&lt;br /&gt;
; -sizeFile [fileName]&lt;br /&gt;
file that specifies number of individuals in each population (more than 4 populations can be defined)&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of angsd into your working folder from the github repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R DSTAT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, position them in the folder ./bams/ and create a file bam.filelist listing the pathnames of those datasets &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the 4-population test. One can use a bam file as well (see in one of the other examples after the tutorial how to do it).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #decompress; gunzip also removes the .gz file&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, generate a fasta file for one of our 10 bam files. We assume that this genome has very high quality, so that we can use it as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -doCounts 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this tutorial we perform the ABBABABA test on all the combinations of 4 populations among 6 populations of size 1,2,2,3,2,1 individuals (so that there are 360 possible combinations). The last population is represented by the fasta file defined with the option -anc, whose use we enable with the option -useLast 1. Create a file named sizeFile.size and write the size of each population, one per line (omit the last population, which is the -anc fasta file):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
2&lt;br /&gt;
2&lt;br /&gt;
3&lt;br /&gt;
2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
We will apply error correction to the group with 3 individuals, using &amp;quot;perfectSampleCEU&amp;quot; as the high-quality reference genome.&lt;br /&gt;
The 3 individuals of the population affected by transition errors are listed on lines 6 to 8 of the file bam.filelist. We select those individuals and write them to another file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 6,8p bam.filelist &amp;gt; bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that we will later use as input for the R script that calculates the D-statistic. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out errorFile -bam bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==4-population test==&lt;br /&gt;
&lt;br /&gt;
Now we want to run the four-population test using the 6 groups of 1,2,2,3,2,1 individuals (where the last one is the fasta file). We decide to target three regions, between positions 10000000 and 15000000 on chromosomes 1, 2 and 3. Thus create a file called regions.txt containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1:10000000-15000000&lt;br /&gt;
2:10000000-15000000&lt;br /&gt;
3:10000000-15000000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
After running ANGSD we will call the R script that applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -rf regions.txt -useLast 1 -minQ 20 -minMapQ 30 -p 3&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[The output file is]]&lt;br /&gt;
[[bam.Angsd.abbababa2 (used for the 4-population test)]]&lt;br /&gt;
Each line represents a block for one of the 360 possible combinations (so each block is written on 360 lines), with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4, 5 and 6 are the numerator, the denominator and the number of sites analyzed. The next 256 columns are the counted patterns of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script specifying the error files for the population with 3 individuals. This is done by listing the error file of each population, one per line, in a text file. If a population has no error file, it is sufficient to write NA. Create a file called errorList.error containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
./errorFile.ancError&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We also want to study the effect of error correction when an error rate between -0.005 and 0.005, with step 0.001, is added individually to each population for the substitution A-&amp;gt;T. You can test more substitutions by separating the letters with a comma. We need to create a file with the names of the 6 populations. Create a file popNames.name containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Population1&lt;br /&gt;
Population2&lt;br /&gt;
Population3&lt;br /&gt;
PopWithError&lt;br /&gt;
Population4&lt;br /&gt;
Chimpanzee&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Run the Rscript with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript DSTAT angsdFile=&amp;quot;bam.Angsd&amp;quot; out=&amp;quot;result&amp;quot; sizeFile=sizeFile.size errFile=errorList.error nameFile=popNames.name addErr=&amp;quot;-0.005,0.005,0.001;A;T&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script will show the calculated D statistic along with Z-score, Pvalues, Standard deviation and other quantities for all combinations.&lt;br /&gt;
The plots of error rates for type specific errors and added errors are in the folder result.errorDataFolder as pdf files. The other files are used by the Rscript for the necessary computations and do not contain useful information.&lt;br /&gt;
The D-statistics are contained in four distinct files for each combination of populations. For example, for the populations Population1, Chimpanzee, Population2, Population3 the files containing the results are:&lt;br /&gt;
;[[1)result.Observed.Population1.Chimpanzee.Population2.Population3.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[2) result.Population1.Chimpanzee.Population2.Population3.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[3) result.Population1.Chimpanzee.Population2.Population3.TransRem.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
;[[4) result.RemTrans.Population1.Chimpanzee.Population2.Population3.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=pvalue from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the number of patterns might not be an integer because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
&lt;br /&gt;
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:&lt;br /&gt;
;-the file barPlotErrors.Population1.Chimpanzee.Population2.Population3.pdf showing a barplot of the error rates&lt;br /&gt;
;-the file plotAddErr.A2T.Population1.Chimpanzee.Population2.Population3.pdf showing the effect of error correction on transition A--&amp;gt;T (plot shown below)&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2907</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2907"/>
		<updated>2017-02-06T12:32:11Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the abbababa test (aka D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), multiple individuals are allowed in each group. As with all methods in ANGSD, we require that the headers of the BAM files are the same.&lt;br /&gt;
; some of the options only work in the development version available from github&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeFile		       (null)   file with sizes of the populations	&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
        -useLast                        0       include fasta file defined by -anc in the analysis&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The following columns are the counts of all 256 possible combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. Every block is repeated once for each combination of 4 populations that can be formed from the data.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default). 1: remove ancient transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [int]&lt;br /&gt;
Size of each block. Choose a size larger than the extent of LD in the populations. For humans, 5Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -useLast [int]&lt;br /&gt;
1: include the file defined by -anc in the D-stat analysis. Default: 0&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 in order to count the bases at each site after filters.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup have the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual. 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA BABA counts. Default 100.&lt;br /&gt;
; -sizeFile [fileName]&lt;br /&gt;
file that specifies number of individuals in each population (more than 4 populations can be defined)&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of angsd into your working folder from the github repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R DSTAT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, position them in the folder ./bams/ and create a file bam.filelist listing the pathnames of those datasets &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the 4-population test. One can use a bam file as well (see in one of the other examples after the tutorial how to do it).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
rm chimpHg19.fa.gz&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, generate a fasta file from one of our 10 bam files. We assume this genome has very high quality, so that we can use it as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -doCounts 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this tutorial we perform the ABBABABA test on all the combinations of 4 populations amongst 6 populations of size 1,2,2,3,2,1 individuals (so that there are 360 possible combinations). The last population is represented by the fasta file defined with the option -anc, whose use we enable with the option -useLast 1. Create a file named sizeFile.size and write the size of each population, one per line (omit the last one, which corresponds to the -anc fasta file):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
2&lt;br /&gt;
2&lt;br /&gt;
3&lt;br /&gt;
2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
We will apply error correction to the group with 3 individuals, using &amp;quot;perfectSampleCEU&amp;quot; as high-quality reference genome.&lt;br /&gt;
The population containing 3 individuals affected by transition error goes from line 6 to line 8 in the file bam.filelist. We select those individuals and write them in another file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 6,8p bam.filelist &amp;gt; bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
and then we use &amp;quot;doAncError&amp;quot; to generate the intermediate files that we will use later as input for the R script that calculates the D-statistic. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in our preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out errorFile -bam bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==4-population test==&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four-population test using the 6 groups of 1,2,2,3,2,1 individuals (where the last one is the fasta file). We decide to target three regions between positions 10000000 and 15000000 on chromosomes 1, 2 and 3. Thus create a file called regions.txt containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1:10000000-15000000&lt;br /&gt;
2:10000000-15000000&lt;br /&gt;
3:10000000-15000000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
After running ANGSD we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -rf regions.txt -useLast 1 -minQ 20 -minMapQ 30 -p 3&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[The output file is]]&lt;br /&gt;
[[bam.Angsd.abbababa2 (used for the 4-population test)]]&lt;br /&gt;
Each line represents a block for one of the 360 possible combinations (so each block is written on 360 lines), with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4, 5 and 6 are the numerator, denominator and number of sites analyzed. The next 256 columns are the counts of the allele patterns, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
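The column names of the 256 patterns can be decoded with the correspondence given above. The helper decode_pattern below is a hypothetical convenience function, not part of ANGSD:&lt;br /&gt;

```shell
# Translate a pattern name such as X0123 into its bases (0=A, 1=C, 2=G, 3=T).
decode_pattern() {
  # Drop the leading X, then map each digit to a base.
  printf '%s\n' "${1#X}" | tr '0123' 'ACGT'
}

decode_pattern X0000   # AAAA
decode_pattern X0001   # AAAC
decode_pattern X3333   # TTTT
```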
&lt;br /&gt;
We run the R script specifying the error file for the population with 3 individuals. This is done by listing the error file of each population, one per line, in a text file. If a population has no error file, it is sufficient to write NA. Create a file called errorList.error containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
./errorFile.ancError&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We also want to study the effect of error correction when we add to each population, one at a time, an error rate between -0.005 and 0.005 with step 0.001 for the substitution A-&amp;gt;T. You can test more substitutions by separating the letters with a comma. We need to create a file with the names of the 6 populations. Create a file popNames.name containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Population1&lt;br /&gt;
Population2&lt;br /&gt;
Population3&lt;br /&gt;
PopWithError&lt;br /&gt;
Population4&lt;br /&gt;
Chimpanzee&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Run the Rscript with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript DSTAT angsdFile=&amp;quot;bam.Angsd&amp;quot; out=&amp;quot;result&amp;quot; sizeFile=sizeFile.size errFile=errorList.error nameFile=popNames.name addErr=&amp;quot;0,0.005,0.001;A;T&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script will show the calculated D statistic along with Z-score, Pvalues, Standard deviation and other quantities for all combinations.&lt;br /&gt;
The plots of error rates for type specific errors and added errors are in the folder result.errorDataFolder as pdf files. The other files are used by the Rscript for the necessary computations and do not contain useful information.&lt;br /&gt;
The D-statistics are contained in four distinct files for each combination of populations. For example, for the populations Population1, Chimpanzee, Population2, Population3 the files containing the results are:&lt;br /&gt;
;[[1)result.Observed.Population1.Chimpanzee.Population2.Population3.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[2) result.Population1.Chimpanzee.Population2.Population3.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[3) result.Population1.Chimpanzee.Population2.Population3.TransRem.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
;[[4) result.RemTrans.Population1.Chimpanzee.Population2.Population3.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=pvalue from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the pattern counts might not be integers because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
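To make the relation between these quantities concrete, the sketch below computes D and Z from made-up values, assuming the usual definitions D=(nABBA-nBABA)/(nABBA+nBABA) and Z=D/sqrt(V). The numbers are invented for illustration, and the R script may differ in detail (for instance in how the jackknife estimate enters Z).&lt;br /&gt;

```shell
# Hypothetical (non-integer) pattern counts and jackknife variance.
nABBA=150.5
nBABA=100.5
V=0.0025

# D = (nABBA - nBABA) / (nABBA + nBABA)
D=$(awk -v a="$nABBA" -v b="$nBABA" 'BEGIN { printf "%.4f", (a - b) / (a + b) }')

# Z = D / sqrt(V), computed from the unrounded D
Z=$(awk -v a="$nABBA" -v b="$nBABA" -v v="$V" 'BEGIN { printf "%.4f", ((a - b) / (a + b)) / sqrt(v) }')

echo "D=$D Z=$Z"
```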
&lt;br /&gt;
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:&lt;br /&gt;
;-the file barPlotErrors.Population1.Chimpanzee.Population2.Population3.pdf showing a barplot of the error rates&lt;br /&gt;
;-the file plotAddErr.A2T.Population1.Chimpanzee.Population2.Population3.pdf showing the effect of error correction on transition A--&amp;gt;T (plot shown below)&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2906</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2906"/>
		<updated>2017-02-06T12:31:35Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the abbababa test (aka D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), multiple individuals are allowed in each of the groups. As with all methods in ANGSD, we require that the headers of the BAM files are the same.&lt;br /&gt;
; some of the options only work in the development version available from github&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeFile		       (null)   file with sizes of the populations	&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
        -useLast                        0       include fasta file defined by -anc in the analysis&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. Every block is repeated once for each combination of 4 populations that can be formed from the data.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove ancient transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [int]&lt;br /&gt;
Size of each block. Choose a number that is larger than the extent of LD in the populations. For humans, 5Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -useLast [int]&lt;br /&gt;
1: include the file defined by -anc in the D-stat analysis. Default: 0&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 in order to count the bases at each site after filters.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all the reads for the outgroup have the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: all bases at each position are used for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
allows for a maximum depth in each site to avoid overflow of the ABBA BABA counts. Default 100.&lt;br /&gt;
; -sizeFile [fileName]&lt;br /&gt;
file that specifies number of individuals in each population (more than 4 populations can be defined)&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of angsd from the github repository into your working folder&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R DSTAT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, position them in the folder ./bams/ and create a file bam.filelist listing the pathnames of those datasets &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the 4-population test. One can use a bam file as well (see in one of the other examples after the tutorial how to do it).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
rm chimpHg19.fa.gz&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, generate a fasta file from one of our 10 bam files. We assume this genome has very high quality, so that we can use it as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -doCounts 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this tutorial we perform the ABBABABA test on all the combinations of 4 populations amongst 6 populations of size 1,2,2,3,2,1 individuals (so that there are 360 possible combinations). The last population is represented by the fasta file defined with the option -anc, whose use we enable with the option -useLast 1. Create a file named sizeFile.size and write the size of each population, one per line (omit the last one, which corresponds to the -anc fasta file):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
2&lt;br /&gt;
2&lt;br /&gt;
3&lt;br /&gt;
2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
We will apply error correction to the group with 3 individuals, using &amp;quot;perfectSampleCEU&amp;quot; as high-quality reference genome.&lt;br /&gt;
The population containing 3 individuals affected by transition error goes from line 6 to line 8 in the file bam.filelist. We select those individuals and write them in another file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 6,8p bam.filelist &amp;gt; bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
and then we use &amp;quot;doAncError&amp;quot; to generate the intermediate files that we will use later as input for the R script that calculates the D-statistic. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in our preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out errorFile -bam bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==4-population test==&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four-population test using the 6 groups of 1,2,2,3,2,1 individuals (where the last one is the fasta file). We decide to target three regions between positions 10000000 and 15000000 on chromosomes 1, 2 and 3. Thus create a file called regions.txt containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1:10000000-15000000&lt;br /&gt;
2:10000000-15000000&lt;br /&gt;
3:10000000-15000000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
After running ANGSD we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -rf regions.txt -useLast 1 -minQ 20 -minMapQ 30 -p 3&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[The output file is]]&lt;br /&gt;
[[bam.Angsd.abbababa2 (used for the 4-population test)]]&lt;br /&gt;
Each line represents a block for one of the 360 possible combinations (so each block is written on 360 lines), with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4, 5 and 6 are the numerator, denominator and number of sites analyzed. The next 256 columns are the counts of the allele patterns, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script specifying the error file for the population with 3 individuals. This is done by listing the error file of each population, one per line, in a text file. If a population has no error file, it is sufficient to write NA. Create a file called errorList.error containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
./errorFile.ancError&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We also want to study the effect of error correction when we add to each population, one at a time, an error rate between -0.005 and 0.005 with step 0.001 for the substitution A-&amp;gt;T. You can test more substitutions by separating the letters with a comma. We need to create a file with the names of the 6 populations. Create a file popNames.name containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Population1&lt;br /&gt;
Population2&lt;br /&gt;
Population3&lt;br /&gt;
PopWithError&lt;br /&gt;
Population4&lt;br /&gt;
Chimpanzee&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Run the Rscript with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript DSTAT angsdFile=&amp;quot;bam.Angsd&amp;quot; out=&amp;quot;result&amp;quot; sizeFile=sizeFile.size errFile=errorList.error nameFile=popNames.name addErr=&amp;quot;0,0.005,0.001;A;T&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script will show the calculated D statistic along with Z-score, Pvalues, Standard deviation and other quantities for all combinations.&lt;br /&gt;
The plots of error rates for type specific errors and added errors are in the folder result.errorDataFolder as pdf files. The other files are used by the Rscript for the necessary computations and do not contain useful information.&lt;br /&gt;
The D-statistics are contained in four distinct files for each combination of populations. For example, for the populations Population1, Chimpanzee, Population2, Population3 the files containing the results are:&lt;br /&gt;
;[[1)result.Observed.Population1.Chimpanzee.Population2.Population3.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[2) result.Population1.Chimpanzee.Population2.Population3.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[3) result.Population1.Chimpanzee.Population2.Population3.TransRem.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
;[[4) result.RemTrans.Population1.Chimpanzee.Population2.Population3.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=pvalue from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the pattern counts might not be integers because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
&lt;br /&gt;
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:&lt;br /&gt;
;-the file barPlotErrors.Population1.Chimpanzee.Population2.Population3.pdf showing a barplot of the error rates&lt;br /&gt;
;-the file plotAddErr.A2T.Population1.Chimpanzee.Population2.Population3.pdf showing the effect of error correction on transition A--&amp;gt;T (plot shown below)&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2905</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2905"/>
		<updated>2017-02-06T12:22:42Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the abbababa test (aka D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), multiple individuals are allowed in each of the groups. As with all methods in ANGSD, we require that the headers of the BAM files are the same.&lt;br /&gt;
; some of the options only work in the development version available from github&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeFile		       (null)   file with sizes of the populations	&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
        -useLast                        0       include fasta file defined by -anc in the analysis&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. Every block is repeated once for each combination of 4 populations that can be formed from the data.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove ancient transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [int]&lt;br /&gt;
Size of each block. Choose a number that is larger than the extent of LD in the populations. For humans, 5Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -useLast [int]&lt;br /&gt;
1: include the file defined by -anc in the D-stat analysis. Default: 0&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 in order to count the bases at each site after filters.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all the reads for the outgroup have the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: all bases at each position are used for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
allows for a maximum depth in each site to avoid overflow of the ABBA BABA counts. Default 100.&lt;br /&gt;
; -sizeFile [fileName]&lt;br /&gt;
file that specifies number of individuals in each population (more than 4 populations can be defined)&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of angsd from the github repository into your working folder&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R DSTAT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, position them in the folder ./bams/ and create a file bam.filelist listing the pathnames of those datasets &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the 4-population test. One can use a bam file as well (see in one of the other examples after the tutorial how to do it).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
rm chimpHg19.fa.gz&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, generate a fasta file from one of our 10 bam files. We assume this genome has very high quality, so that we can use it as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -doCounts 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this tutorial we perform the ABBABABA test on all the combinations of 4 populations amongst 6 populations of size 1,2,2,3,2,1 individuals (so that there are 360 possible combinations). The last population is represented by the fasta file defined with the option -anc, whose use we enable with the option -useLast 1. Create a file named sizeFile.size and write the size of each population, one per line (omit the last one, which corresponds to the -anc fasta file):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
2&lt;br /&gt;
2&lt;br /&gt;
3&lt;br /&gt;
2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
We will apply error correction to the group with 3 individuals, using &amp;quot;perfectSampleCEU&amp;quot; as high-quality reference genome.&lt;br /&gt;
The population containing 3 individuals affected by transition error goes from line 6 to line 8 in the file bam.filelist. We select those individuals and write them in another file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 6,8p bam.filelist &amp;gt; bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Then we use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later be used as input for the R script that calculates the D-statistic. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out errorFile -bam bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==4-population test==&lt;br /&gt;
&lt;br /&gt;
Now we run the four-population test on the 6 groups of 1,2,2,3,2,1 individuals (where the last one is the fasta file). We restrict the analysis to positions 10000000 to 15000000 on each chromosome listed below. Create a file called regions.txt containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1:10000000-15000000&lt;br /&gt;
2:10000000-15000000&lt;br /&gt;
3:10000000-15000000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
After running ANGSD we will call the R script that applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -rf regions.txt -useLast 1 -minQ 20 -minMapQ 30 -p 3&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[The output file is]]&lt;br /&gt;
[[bam.Angsd.abbababa2 (used for the 4-population test)]]&lt;br /&gt;
Each line represents a block for one of the 360 possible combinations (so each block is written on 360 lines), with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4, 5 and 6 are the numerator, denominator and number of sites analyzed. The next 256 columns are the counts of the allele patterns, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
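The column layout described above can be sketched in the shell. The pattern labels read as base-4 numbers, so under the assumption that the 256 pattern columns directly follow the 6 leading columns in the order X0000,...,X3333, the column of a pattern is 7 plus its base-4 value. The helper name pattern_column is hypothetical, and the snippet relies on the bash base#number arithmetic syntax:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
# hypothetical helper: map a pattern label such as X0110 to its column number&lt;br /&gt;
pattern_column() {&lt;br /&gt;
  digits=${1#X}                 # strip the leading X, keeping the four digits 0-3&lt;br /&gt;
  echo $(( 7 + 4#$digits ))     # 6 leading columns, then X0000 in column 7&lt;br /&gt;
}&lt;br /&gt;
pattern_column X0000   # first pattern column (AAAA)&lt;br /&gt;
pattern_column X0110   # the pattern ACCA&lt;br /&gt;
pattern_column X3333   # last pattern column (TTTT)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;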
&lt;br /&gt;
We run the R script specifying the error file for the population with 3 individuals. This is done by listing the error file of each population in a text file, one line per population. If a population has no error file, it is sufficient to write NA. Create a file called errorList.error containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
./errorFile.ancError&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We also want to study the effect of error correction when an error rate between -0.005 and 0.005, with step 0.001, is added individually to each population for the transition A-&amp;gt;T. More transitions can be tested by separating the letters with a comma. We also need a file with the names of the 6 populations. Create a file popNames.name containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Population1&lt;br /&gt;
Population2&lt;br /&gt;
Population3&lt;br /&gt;
PopWithError&lt;br /&gt;
Population4&lt;br /&gt;
Chimpanzee&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Run the Rscript with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript DSTAT angsdFile=&amp;quot;bam.Angsd&amp;quot; out=&amp;quot;result&amp;quot; sizeFile=sizeFile.size errFile=errorList.error nameFile=popNames.name addErr=&amp;quot;0,0.005,0.001;A;T&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
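Since errorList.error and popNames.name are both expected to hold one line per population (6 here, including the fasta outgroup), a small check before launching the script can catch a mismatch. This is only a sketch that assumes the file names used above; it recreates the two files so that it is self-contained:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# one entry per population, in the same order in both files&lt;br /&gt;
printf '%s\n' NA NA NA ./errorFile.ancError NA NA &amp;gt; errorList.error&lt;br /&gt;
printf '%s\n' Population1 Population2 Population3 PopWithError Population4 Chimpanzee &amp;gt; popNames.name&lt;br /&gt;
# count the lines of both files and compare them&lt;br /&gt;
nErr=$(wc -l &amp;lt; errorList.error)&lt;br /&gt;
nNames=$(wc -l &amp;lt; popNames.name)&lt;br /&gt;
if [ $nErr -eq $nNames ]; then echo 'line counts agree:' $nErr; else echo 'line counts differ'; fi&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;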
The script will print the calculated D statistic along with the Z-score, p-value, standard deviation and other quantities for all combinations.&lt;br /&gt;
The plots of error rates for type specific errors and added errors are in the folder result.errorDataFolder as pdf files. The other files are used by the Rscript for the necessary computations and do not contain useful information.&lt;br /&gt;
The D-statistics are contained in four distinct files for each combination of populations. For example, for the populations Population1, Chimpanzee, Population2, Population3 the files containing the results are:&lt;br /&gt;
;[[1)result.Observed.Population1.Chimpanzee.Population2.Population3.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[2) result.Population1.Chimpanzee.Population2.Population3.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
;[[3) result.Population1.Chimpanzee.Population2.Population3.TransRem.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
;[[4) result.RemTrans.Population1.Chimpanzee.Population2.Population3.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the number of patterns might not be an integer because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
&lt;br /&gt;
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:&lt;br /&gt;
;-the file barPlotErrors.Population1.Chimpanzee.Population2.Population3.pdf showing a barplot of the error rates&lt;br /&gt;
;-the file plotAddErr.A2T.Population1.Chimpanzee.Population2.Population3.pdf showing the effect of error correction on transition A--&amp;gt;T (plot shown below)&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2904</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2904"/>
		<updated>2017-02-06T10:41:25Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the abbababa test (aka D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), multiple individuals are allowed in each of the groups. As with all methods in ANGSD, we require that the headers of the BAM files are the same.&lt;br /&gt;
; some of the options only work in the development version available from github&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeFile		       (null)   file with sizes of the populations	&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
        -useLast                        0       include fasta file defined by -anc in the analysis&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The following columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. Every block is repeated once for each combination of 4 populations that can be formed from the data.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default). 1: remove ancient transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [int]&lt;br /&gt;
Size of each block. Choose a size larger than the extent of LD in the populations. For humans, 5Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -useLast [int]&lt;br /&gt;
1: include the file defined by -anc in the D-stat analysis. Default: 0&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 in order to count the bases at each site after filters.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup have the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual. 0: all bases at each position are used for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
allows for a maximum depth in each site to avoid overflow of the ABBA BABA counts. Default 100.&lt;br /&gt;
; -sizeFile [fileName]&lt;br /&gt;
file that specifies number of individuals in each population (more than 4 populations can be defined)&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of angsd into your working folder from the github repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, place them in the folder ./bams/ and create a file bam.filelist listing the paths of those datasets&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee, which will be used as the outgroup for the 4-population test. A bam file can be used as well (see the other examples after the tutorial).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
rm chimpHg19.fa.gz&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now generate a fasta file from one of our 10 bam files. We assume that this genome is of very high quality, so that it can be used as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -doCounts 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this tutorial we perform the ABBABABA test on all combinations of 4 populations among 6 populations of size 1,2,2,3,2,1 individuals (so that there are 360 possible combinations). The last population is represented by the fasta file given with the option -anc, whose use is enabled by the option -useLast 1. Create a file named sizeFile.size and write the size of each population, one per line (omit the last population, which is the -anc fasta file):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
2&lt;br /&gt;
2&lt;br /&gt;
3&lt;br /&gt;
2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
We will apply error correction to the group with 3 individuals, using &amp;quot;perfectSampleCEU&amp;quot; as high-quality reference genome.&lt;br /&gt;
The population containing 3 individuals affected by transition error goes from line 6 to line 8 in the file bam.filelist. We select those individuals and write them in another file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 6,8p bam.filelist &amp;gt; bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Then we use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later be used as input for the R script that calculates the D-statistic. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out errorFile -bam bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==4-population test==&lt;br /&gt;
&lt;br /&gt;
Now we run the four-population test on the 6 groups of 1,2,2,3,2,1 individuals (where the last one is the fasta file). We restrict the analysis to positions 10000000 to 15000000 on each chromosome listed below. Create a file called regions.txt containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1:10000000-15000000&lt;br /&gt;
2:10000000-15000000&lt;br /&gt;
3:10000000-15000000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
After running ANGSD we will call the R script that applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -rf regions.txt -useLast 1 -minQ 20 -minMapQ 30 -p 3&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
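Since regions given with -rf must appear in the same order as in the fai file, the ordering can be checked before running ANGSD. The following is only a sketch: the helper name check_region_order and the temporary and demo file names are hypothetical, and it assumes that each chromosome appears at most once in the region file:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# hypothetical helper: succeeds when the chromosomes in a region file follow the fai order&lt;br /&gt;
check_region_order() {&lt;br /&gt;
  cut -d: -f1 $1 &amp;gt; regions.chr                         # chromosome names as listed in the region file&lt;br /&gt;
  cut -f1 $2 | grep -Fx -f regions.chr &amp;gt; fai.chr       # the same names, in fai order&lt;br /&gt;
  cmp -s regions.chr fai.chr                           # exit status 0 when both orders agree&lt;br /&gt;
}&lt;br /&gt;
# toy input: only the first fai column (the sequence name) matters here&lt;br /&gt;
printf '1\t100\n2\t100\n3\t100\n' &amp;gt; demo.fai&lt;br /&gt;
printf '1:10-50\n3:10-50\n' &amp;gt; demo.regions&lt;br /&gt;
if check_region_order demo.regions demo.fai; then echo 'order OK'; else echo 'order differs'; fi&lt;br /&gt;
# on the tutorial files: check_region_order regions.txt chimpHg19.fa.fai&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;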
[[The output file is]]&lt;br /&gt;
[[bam.Angsd.abbababa2 (used for the 4-population test)]]&lt;br /&gt;
Each line represents a block for one of the 360 possible combinations (so each block is written on 360 lines), with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). Columns 4, 5 and 6 are the numerator, denominator and number of sites analyzed. The next 256 columns are the counts of the allele patterns, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script specifying the error file for the population with 3 individuals. This is done by listing the error file of each population in a text file, one line per population. If a population has no error file, it is sufficient to write NA. Create a file called errorList.error containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
./errorFile.ancError&lt;br /&gt;
NA&lt;br /&gt;
NA&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We also want to study the effect of error correction when an error rate between -0.005 and 0.005, with step 0.001, is added individually to each population for the transition A-&amp;gt;T. More transitions can be tested by separating the letters with a comma. We also need a file with the names of the 6 populations. Create a file popNames.name containing&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Population1&lt;br /&gt;
Population2&lt;br /&gt;
Population3&lt;br /&gt;
PopWithError&lt;br /&gt;
Population4&lt;br /&gt;
Chimpanzee&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Run the Rscript with the command&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript DSTAT angsdFile=&amp;quot;bam.Angsd&amp;quot; out=&amp;quot;result&amp;quot; sizeFile=sizeFile.size errFile=errorList.error nameFile=popNames.name addErr=&amp;quot;0,0.005,0.001;A;T&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script will print the calculated D statistic along with the Z-score, p-value, standard deviation and other quantities for all combinations.&lt;br /&gt;
The plots of error rates for type specific errors and added errors are in the folder result.errorDataFolder.&lt;br /&gt;
&lt;br /&gt;
The D-statistics are written to four distinct files&lt;br /&gt;
;[[1)result.Observed.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;[[2) result.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;[[3) result.TransRemErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;[[4) result.RemTrans.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the number of patterns might not be an integer because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
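As an illustration of how these columns relate, the values shown above for result.Observed.txt are consistent with the usual definitions D=(nABBA-nBABA)/(nABBA+nBABA) and Z=D/sqrt(V(JK-D)). The following is a sketch based on these standard formulas, not a copy of what estAvgError.R does internally; the helper name d_and_z is hypothetical, and small differences in the last digits are expected because the printed values are rounded:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# hypothetical helper: d_and_z nABBA nBABA variance&lt;br /&gt;
d_and_z() {&lt;br /&gt;
  awk -v nABBA=$1 -v nBABA=$2 -v varJK=$3 'BEGIN {&lt;br /&gt;
    D = (nABBA - nBABA) / (nABBA + nBABA)   # normalized difference of the pattern counts&lt;br /&gt;
    Z = D / sqrt(varJK)                     # D divided by its jackknife standard error&lt;br /&gt;
    printf &amp;quot;D=%.6f Z=%.3f\n&amp;quot;, D, Z&lt;br /&gt;
  }'&lt;br /&gt;
}&lt;br /&gt;
# values copied from the result.Observed.txt example above&lt;br /&gt;
d_and_z 246.033565 279.248560 0.004878&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;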
&lt;br /&gt;
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:&lt;br /&gt;
;-the file barPlotErrors.pdf showing a barplot of the error rates (plot shown below)&lt;br /&gt;
;-the file plotAddErr.A2T.pdf showing the effect of error correction on transition A--&amp;gt;T (plot shown below)&lt;br /&gt;
;-the file errorRates.txt showing in each line transition errors for each population, respectively&lt;br /&gt;
;-all the files related to the addition of error correction to H1,H2,H3, necessary to plot the files plotAddErr.A2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrorJPGs.jpg]]&lt;br /&gt;
[[File:PlotAddErr.A2TJPG.jpg]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2903</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2903"/>
		<updated>2017-02-06T10:15:29Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the abbababa test (aka D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), multiple individuals are allowed in each of the groups. As with all methods in ANGSD, we require that the headers of the BAM files are the same.&lt;br /&gt;
; some of the options only work in the development version available from github&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeFile		       (null)   file with sizes of the populations	&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
        -useLast                        0       include fasta file defined by -anc in the analysis&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The following columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. Every block is repeated once for each combination of 4 populations that can be formed from the data.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default). 1: remove ancient transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [int]&lt;br /&gt;
Size of each block. Choose a size larger than the extent of LD in the populations. For humans, 5Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -useLast [int]&lt;br /&gt;
1: include the file defined by -anc in the D-stat analysis. Default: 0&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 in order to count the bases at each site after filters.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup have the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual. 0: all bases at each position are used for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
allows for a maximum depth in each site to avoid overflow of the ABBA BABA counts. Default 100.&lt;br /&gt;
; -sizeFile [fileName]&lt;br /&gt;
file that specifies number of individuals in each population (more than 4 populations can be defined)&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of angsd into your working folder from the github repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, place them in the folder ./bams/ and create a file bam.filelist listing the paths of those datasets&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee, which will be used as the outgroup for the 4-population test. A bam file can be used as well (see the other examples after the tutorial).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
rm chimpHg19.fa.gz&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now generate a fasta file from one of our 10 bam files. We assume that this genome is of very high quality, so that it can be used as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -doCounts 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this tutorial we perform the ABBABABA test on all combinations of 4 populations among 6 populations of size 1,2,2,3,2,1 individuals (so that there are 360 possible combinations). The last population is represented by the fasta file given with the option -anc, whose use is enabled by the option -useLast 1. Create a file named sizeFile.size and write the size of each population, one per line (omit the last population, which is the -anc fasta file):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
1&lt;br /&gt;
2&lt;br /&gt;
2&lt;br /&gt;
3&lt;br /&gt;
2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
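The figure of 360 is what one obtains when counting ordered choices of 4 distinct populations out of 6 (6*5*4*3). The following sketch, which is only an illustration and not part of ANGSD, enumerates them with nested loops:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
count=0&lt;br /&gt;
for h1 in 1 2 3 4 5 6; do for h2 in 1 2 3 4 5 6; do&lt;br /&gt;
for h3 in 1 2 3 4 5 6; do for h4 in 1 2 3 4 5 6; do&lt;br /&gt;
  # number of distinct population indices in this quadruple&lt;br /&gt;
  distinct=$(printf '%s\n' $h1 $h2 $h3 $h4 | sort -u | wc -l)&lt;br /&gt;
  # keep only quadruples made of four different populations&lt;br /&gt;
  if [ $distinct -eq 4 ]; then count=$((count + 1)); fi&lt;br /&gt;
done; done; done; done&lt;br /&gt;
echo $count&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;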
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
We will apply error correction to the group with 3 individuals, using &amp;quot;perfectSampleCEU&amp;quot; as high-quality reference genome.&lt;br /&gt;
The population containing 3 individuals affected by transition error goes from line 6 to line 8 in the file bam.filelist. We select those individuals and write them in another file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 6,8p bam.filelist &amp;gt; bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Then we use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later be used as input for the R script that calculates the D-statistic. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out errorFile -bam bamWithErrors.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==4-population test==&lt;br /&gt;
&lt;br /&gt;
Now we run the four-population test on the 6 groups of 1,2,2,3,2,1 individuals (where the last one is the fasta file).&lt;br /&gt;
After running ANGSD we will call the R script that applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -useLast 1 -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[The output file is]]&lt;br /&gt;
[[bam.AllelePatterns.abbababa2 (used for the 4-population test)]]&lt;br /&gt;
Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 possible combinations of alleles, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script, specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an error rate between 0 and 0.005 (in steps of 0.001) for the base change A-&amp;gt;T is added to each population in turn. More base changes can be given by separating the letters with commas. The names of H1,H2,H3 shown on the plot can also be specified (omit them to keep the standard names); here we use the generic names CEU1,CEU2,CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
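To make the layout of the addErr argument easier to read, the Python sketch below splits the string used in the command above into its parts. The field order (rate range; source base; target base; names) is inferred from the description in this tutorial, and the helper parse_add_err is hypothetical; it is not part of ANGSD or estAvgError.R.&lt;br /&gt;

```python
def parse_add_err(spec):
    """Split an addErr string into rates, base letters and display names."""
    rate_part, from_part, to_part, name_part = spec.split(";")
    start, stop, step = (float(x) for x in rate_part.split(","))
    # Assumption: the range includes both end points.
    n_steps = round((stop - start) / step)
    rates = [round(start + i * step, 10) for i in range(n_steps + 1)]
    return {
        "rates": rates,
        "from_bases": from_part.split(","),
        "to_bases": to_part.split(","),
        "names": name_part.split(","),
    }

parsed = parse_add_err("0,0.005,0.001;A;T;CEU1,CEU2,CEU3")
print(parsed["rates"])
print(parsed["from_bases"], parsed["to_bases"], parsed["names"])
```

Under these assumptions the example string describes six added error rates, from 0 to 0.005, for the single base change A to T.&lt;br /&gt;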
The R script prints the calculated D-statistic along with its standard deviation, jackknife estimate, Z-score and p-value.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
These results are also written to four separate files:&lt;br /&gt;
;[[1)result.Observed.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;[[2) result.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;[[3) result.TransRemErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;[[4) result.RemTrans.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the numbers of patterns might not be integers because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
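As a rough consistency check on the columns described above, the Python sketch below recomputes the quantities for result.Observed.txt. It assumes the usual definition D=(nABBA-nBABA)/(nABBA+nBABA), a Z score equal to JK-D divided by the square root of V(JK-D), and a two-sided normal p-value; these formulas are assumptions that reproduce the printed values up to rounding, not a description of the internals of estAvgError.R.&lt;br /&gt;

```python
import math

# Values copied from result.Observed.txt in the tutorial output.
n_abba = 246.033565
n_baba = 279.248560
jk_d = -0.063233
var_jk_d = 0.004878

# Standard D-statistic from the weighted pattern counts.
d_stat = (n_abba - n_baba) / (n_abba + n_baba)

# Z-score: jackknife estimate divided by its standard deviation.
z_score = jk_d / math.sqrt(var_jk_d)

# Two-sided p-value under a standard normal distribution.
p_value = math.erfc(abs(z_score) / math.sqrt(2.0))

print(round(d_stat, 6), round(z_score, 3), round(p_value, 3))
```

Small differences in the last digits are expected, because the inputs are the rounded values printed in the file.&lt;br /&gt;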
&lt;br /&gt;
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:&lt;br /&gt;
;-the file barPlotErrors.pdf showing a barplot of the error rates (plot shown below)&lt;br /&gt;
;-the file plotAddErr.A2T.pdf showing the effect of error correction on the base change A--&amp;gt;T (plot shown below)&lt;br /&gt;
;-the file errorRates.txt listing the error rates of each population, one population per line&lt;br /&gt;
;-all the files related to the errors added to H1,H2,H3, which are needed to produce the plot plotAddErr.A2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrorJPGs.jpg]]&lt;br /&gt;
[[File:PlotAddErr.A2TJPG.jpg]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2902</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2902"/>
		<updated>2017-02-06T09:24:35Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the abbababa test (aka D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), multiple individuals are allowed in each of the groups. As with all methods in ANGSD, the BAM files must all have the same header.&lt;br /&gt;
; some of the options only work in the development version available from github&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeFile		       (null)   file with sizes of the populations	&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
        -useLast                        0       include fasta file defined by -anc in the analysis&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 possible combinations of alleles, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. Every block is repeated once for each combination of 4 populations that can be formed from the data.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
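The column naming described above can be reproduced with a few lines of Python. The sketch below generates the 256 labels and picks out the ABBA and BABA patterns; it assumes that the four digits refer to H1,H2,H3,H4 in that order, which is not stated explicitly here, and the helper names are illustrative only.&lt;br /&gt;

```python
from itertools import product

BASES = "ACGT"  # digit i in a column name stands for BASES[i]

def column_name(digits):
    """Build a column label such as X0001 from four base indices."""
    return "X" + "".join(str(d) for d in digits)

def decode(name):
    """Translate a column label such as X0001 into the allele string AAAC."""
    return "".join(BASES[int(ch)] for ch in name[1:])

# All 4**4 = 256 labels, in the order X0000, X0001, ..., X3333.
columns = [column_name(d) for d in product(range(4), repeat=4)]

def is_abba(name):
    # Assumption: positions are ordered H1, H2, H3, H4.
    h1, h2, h3, h4 = name[1:]
    return h1 == h4 and h2 == h3 and h1 != h2

def is_baba(name):
    # Same ordering assumption as in is_abba.
    h1, h2, h3, h4 = name[1:]
    return h1 == h3 and h2 == h4 and h1 != h2

abba_columns = [c for c in columns if is_abba(c)]
baba_columns = [c for c in columns if is_baba(c)]
print(len(columns), columns[1], decode(columns[1]))
print(len(abba_columns), len(baba_columns))
```

With this ordering, X0110 (ACCA) is an ABBA pattern and X0101 (ACAC) is a BABA pattern.&lt;br /&gt;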
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [int]&lt;br /&gt;
Size of each block in bases. Choose a value larger than the extent of LD in the populations. For humans, 5Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup have the same base (use this option only when the outgroup consists of a single genome; it will not work otherwise).&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
sets the maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts&lt;br /&gt;
; -sizeFile [fileName]&lt;br /&gt;
file specifying the number of individuals in each population&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: set the H4 outgroup allele to A at each site.&lt;br /&gt;
&lt;br /&gt;
For more advanced filtering of bases based on quality scores, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of angsd into your working folder from the github repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, place them in the folder ./bams/ and create a file bam.filelist listing their paths&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. It will be used as the outgroup for the 4-population test. A bam file can be used as well (one of the other examples after the tutorial shows how).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #decompress (this also removes the .gz file)&lt;br /&gt;
samtools faidx chimpHg19.fa #index the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now generate a fasta file from one of our 10 bam files. We assume that this genome is of very high quality, so that we can use it as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this tutorial we perform the ABBABABA test on H1,H2,H3,H4 consisting of 2,5,3,1 individuals, respectively. Here we use a fasta file for the outgroup H4 (we use a bam file in the next tutorial). We will apply error correction to groups H1 and H2, taking one of the samples from H3 as the high-quality reference.&lt;br /&gt;
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes of our list, while population H2 consists of the genomes 3 to 7. We want to apply error correction to those genomes, because we know they have been subjected to contamination. We generate two files containing the pathnames of the genomes of H1 and H2 on which we want to apply error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
and then we use &amp;quot;doAncError&amp;quot; to generate the intermediate files that we will use later to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==4-population test==&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test using:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
After running ANGSD we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[The output file is]]&lt;br /&gt;
[[bam.AllelePatterns.abbababa2 (used for the 4-population test)]]&lt;br /&gt;
Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 possible combinations of alleles, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script, specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an error rate between 0 and 0.005 (in steps of 0.001) for the base change A-&amp;gt;T is added to each population in turn. More base changes can be given by separating the letters with commas. The names of H1,H2,H3 shown on the plot can also be specified (omit them to keep the standard names); here we use the generic names CEU1,CEU2,CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The R script prints the calculated D-statistic along with its standard deviation, jackknife estimate, Z-score and p-value.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
These results are also written to four separate files:&lt;br /&gt;
;[[1)result.Observed.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;[[2) result.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;[[3) result.TransRemErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;[[4) result.RemTrans.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the numbers of patterns might not be integers because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
&lt;br /&gt;
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:&lt;br /&gt;
;-the file barPlotErrors.pdf showing a barplot of the error rates (plot shown below)&lt;br /&gt;
;-the file plotAddErr.A2T.pdf showing the effect of error correction on the base change A--&amp;gt;T (plot shown below)&lt;br /&gt;
;-the file errorRates.txt listing the error rates of each population, one population per line&lt;br /&gt;
;-all the files related to the errors added to H1,H2,H3, which are needed to produce the plot plotAddErr.A2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrorJPGs.jpg]]&lt;br /&gt;
[[File:PlotAddErr.A2TJPG.jpg]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2839</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2839"/>
		<updated>2016-08-31T19:47:17Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the abbababa test (aka D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), multiple individuals are allowed in each of the groups. As with all methods in ANGSD, the BAM files must all have the same header.&lt;br /&gt;
; some of the options only work in the development version available from github&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.RemTrans.txt|D stat and Z scores with Ancient Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 possible combinations of alleles, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
;2)*.abbababa2counts (optional file)&lt;br /&gt;
The columns of this file are the same as above. The file contains the sum of the allele patterns observed over every possible combination of 4 genomes in the tree. This file is NOT USED in the 4-population test. To produce it, see the option -combFile.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [int]&lt;br /&gt;
Size of each block in bases. Choose a value larger than the extent of LD in the populations. For humans, 5Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup have the same base (use this option only when the outgroup consists of a single genome; it will not work otherwise).&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
sets the maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
number of individuals in each group (the file list must contain the BAM files ordered from population H1 to H4).&lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: set the H4 outgroup allele to A at each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts file with the allele patterns between single individuals (this file is NOT USED for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
For more advanced filtering of bases based on quality scores, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of angsd into your working folder from the github repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, place them in the folder ./bams/ and create a file bam.filelist listing their paths&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. It will be used as the outgroup for the 4-population test. A bam file can be used as well (one of the other examples after the tutorial shows how).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #decompress (this also removes the .gz file)&lt;br /&gt;
samtools faidx chimpHg19.fa #index the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now generate a fasta file from one of our 10 bam files. We assume that this genome is of very high quality, so that we can use it as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this tutorial we perform the ABBABABA test on H1,H2,H3,H4 consisting of 2,5,3,1 individuals, respectively. Here we use a fasta file for the outgroup H4 (we use a bam file in the next tutorial). We will apply error correction to groups H1 and H2, taking one of the samples from H3 as the high-quality reference.&lt;br /&gt;
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes of our list, while population H2 consists of the genomes 3 to 7. We want to apply error correction to those genomes, because we know they have been subjected to contamination. We generate two files containing the pathnames of the genomes of H1 and H2 on which we want to apply error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later be used to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in FASTA format. We already prepared the two FASTA files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==4-population test==&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test using:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
After running ANGSD we call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[The output file is]]&lt;br /&gt;
[[bam.AllelePatterns.abbababa2 (used for the 4-population test)]]&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script, specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an error rate between 0 and 0.005, in steps of 0.001, is added to each population individually for the base change A-&amp;gt;T. More base changes can be tested by separating the letters with a comma. It is also possible to specify the names of H1,H2,H3 shown on the plot (omit them to keep the standard names). In this case we use the generic names CEU1,CEU2,CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script prints the calculated D statistic along with its standard deviation, Z score, p-value and other quantities.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
These results are also written to four separate files:&lt;br /&gt;
;[[1)result.Observed.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;[[2) result.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;[[3) result.TransRemErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;[[4) result.RemTrans.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the number of patterns might not be an integer because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
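As a rough cross-check of how these columns relate to each other, the sketch below recomputes D, Z and the p-value from the numbers printed in result.Observed.txt. It assumes the usual definition D = (nABBA - nBABA) / (nABBA + nBABA) and a two-sided normal p-value; neither formula is stated on this page, and the function names are hypothetical, so treat it as an illustration rather than as the logic of estAvgError.R.&lt;br /&gt;

```python
import math


def d_statistic(n_abba, n_baba):
    """Return D = (nABBA - nBABA) / (nABBA + nBABA) (assumed definition)."""
    return (n_abba - n_baba) / (n_abba + n_baba)


def z_and_pvalue(d_jack, var_jack):
    """Return the Z score and a two-sided normal p-value (assumed test)."""
    z = d_jack / math.sqrt(var_jack)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Values copied from the result.Observed.txt listing above.
d = d_statistic(246.033565, 279.248560)
z, p = z_and_pvalue(-0.063233, 0.004878)
print(f"D={d:.6f} Z={z:.3f} p={p:.3f}")
```

Because the inputs are the rounded values from the file, the recomputed Z and p-value can differ from the printed ones in the last digits.&lt;br /&gt;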
&lt;br /&gt;
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:&lt;br /&gt;
;-the file barPlotErrors.pdf showing a barplot of the error rates (plot shown below)&lt;br /&gt;
;-the file plotAddErr.A2T.pdf showing the effect of error correction on transition A--&amp;gt;T (plot shown below)&lt;br /&gt;
;-the file errorRates.txt showing in each line transition errors for each population, respectively&lt;br /&gt;
;-all the files related to the addition of error correction to H1,H2,H3, which are needed to produce the plot plotAddErr.A2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrorJPGs.jpg]]&lt;br /&gt;
[[File:PlotAddErr.A2TJPG.jpg]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2838</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2838"/>
		<updated>2016-08-29T05:49:22Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;''ABBABABA (multipop)'' computes the abbababa test (also known as the D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be the same.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.RemTrans.txt|D stat and Z scores with Ancient Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
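The column naming described above can be sketched in a few lines. The snippet below only decodes the 256 column names into their four-letter allele patterns using the stated correspondence 0=A,1=C,2=G,3=T; the helper name is hypothetical, and the sketch makes no assumption about which digit belongs to which population.&lt;br /&gt;

```python
from itertools import product

BASES = "ACGT"  # 0=A, 1=C, 2=G, 3=T, as stated above


def pattern_columns():
    """Return the 256 (column name, allele pattern) pairs, X0000 to X3333."""
    columns = []
    for digits in product(range(4), repeat=4):
        name = "X" + "".join(str(d) for d in digits)
        pattern = "".join(BASES[d] for d in digits)
        columns.append((name, pattern))
    return columns


cols = pattern_columns()
print(len(cols), cols[0], cols[1], cols[-1])
```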
&lt;br /&gt;
;2)bam.AllelePatterns.abbababa2counts (optional file)&lt;br /&gt;
The columns of this file are the same as above. The file shows the sum of the allele patterns observed over every possible combination of 4 genomes in the tree. This file is NOT USED in the 4-population test. To produce it, see the option -combFile in the overview above.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default). 1: remove transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a size larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in FASTA format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads of the outgroup show the same base (use this only when the outgroup consists of a single genome; it will not work otherwise).&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual. 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts.&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population 1 to 4). &lt;br /&gt;
If you are using a FASTA file (option -anc) for population H4, leave -sizeH4 at its default value.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts to obtain the allele events between single individuals (however, this file is NOT USED for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of angsd into your working folder from the GitHub repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example BAM datasets, place them in the folder ./bams/ and create a file bam.filelist listing their pathnames &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a FASTA file for the chimpanzee, which will be used as the outgroup for the 4-population test. A BAM file can be used instead (see the other examples after the tutorial).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
rm chimpHg19.fa.gz&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now generate a FASTA file from one of our 10 BAM files. We assume this genome is of very high quality, so that it can serve as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this tutorial we perform the ABBABABA test on H1, H2, H3, H4, consisting of 2, 5, 3, 1 individuals respectively. Here we use a FASTA file for the outgroup H4 (a BAM file is used in the next tutorial). We will apply error correction to groups H1 and H2, taking one of the samples from H3 as the high-quality reference.&lt;br /&gt;
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes in our list, while population H2 consists of genomes 3 to 7. We want to apply error correction to those genomes because we know they have been subject to contamination. We generate two files containing the pathnames of the H1 and H2 genomes to which error correction will be applied.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later be used to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in FASTA format. We already prepared the two FASTA files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==4-population test==&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test using:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
After running ANGSD we call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[The output file is]]&lt;br /&gt;
[[bam.AllelePatterns.abbababa2 (used for the 4-population test)]]&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script, specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an error rate between 0 and 0.005, in steps of 0.001, is added to each population individually for the base change A-&amp;gt;T. More base changes can be tested by separating the letters with a comma. It is also possible to specify the names of H1,H2,H3 shown on the plot (omit them to keep the standard names). In this case we use the generic names CEU1,CEU2,CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
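To make the layout of the addErr argument concrete, here is a minimal sketch that splits the string used above into its parts. It is a hypothetical parser written for illustration, not the code of estAvgError.R, and it assumes exactly four semicolon-separated fields with a single base change.&lt;br /&gt;

```python
def parse_add_err(arg):
    """Split an addErr-style string into rates, base change and labels."""
    grid, base_from, base_to, names = arg.split(";")
    start, stop, step = (float(x) for x in grid.split(","))
    # Number of steps between start and stop, rounded to absorb float error.
    n_steps = int(round((stop - start) / step))
    rates = [round(start + i * step, 6) for i in range(n_steps + 1)]
    return rates, (base_from, base_to), names.split(",")


rates, change, labels = parse_add_err("0,0.005,0.001;A;T;CEU1,CEU2,CEU3")
print(rates, change, labels)
```

For the command above this yields six added error rates from 0 to 0.005, the base change A to T, and the three plot labels.&lt;br /&gt;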
The script prints the calculated D statistic along with its standard deviation, Z score, p-value and other quantities.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
These results are also written to four separate files:&lt;br /&gt;
[[1)result.Observed.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[2) result.ErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[3) result.TransRemErrorCorr.txt]]&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
[[4) result.RemTrans.txt]]&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the number of patterns might not be an integer because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
&lt;br /&gt;
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:&lt;br /&gt;
-the file barPlotErrors.pdf showing a barplot of the error rates (plot shown below)&lt;br /&gt;
-the file plotAddErr.A2T.pdf showing the effect of error correction on transition A--&amp;gt;T (plot shown below)&lt;br /&gt;
-the file errorRates.txt showing in each line transition errors for each population, respectively&lt;br /&gt;
-all the files related to the addition of error correction to H1,H2,H3, which are needed to produce the plot plotAddErr.A2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrorJPGs.jpg]]&lt;br /&gt;
[[File:PlotAddErr.A2TJPG.jpg]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2837</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2837"/>
		<updated>2016-08-29T05:37:08Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;''ABBABABA (multipop)'' computes the abbababa test (also known as the D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be the same.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.RemTrans.txt|D stat and Z scores with Ancient Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
;2)bam.AllelePatterns.abbababa2counts (optional file)&lt;br /&gt;
The columns of this file are the same as above. The file shows the sum of the allele patterns observed over every possible combination of 4 genomes in the tree. This file is NOT USED in the 4-population test. To produce it, see the option -combFile in the overview above.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default). 1: remove transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a size larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in FASTA format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads of the outgroup show the same base (use this only when the outgroup consists of a single genome; it will not work otherwise).&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual. 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts.&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population 1 to 4). &lt;br /&gt;
If you are using a FASTA file (option -anc) for population H4, leave -sizeH4 at its default value.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts to obtain the allele events between single individuals (however, this file is NOT USED for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
== Prepare BAM and FASTA files ==&lt;br /&gt;
Download the latest version of angsd into your working folder from the GitHub repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example BAM datasets, place them in the folder ./bams/ and create a file bam.filelist listing their pathnames &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee, which will be used as the outgroup for the 4-population test. A BAM file can be used as well (one of the other examples after the tutorial shows how).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #decompress; gunzip removes the .gz file itself&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now generate a fasta file from one of our 10 BAM files. We assume that this genome has very high quality, so that it can serve as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this tutorial we perform the ABBABABA test on H1, H2, H3, H4 consisting of 2, 5, 3, 1 individuals, respectively. Here we use a fasta file for the outgroup H4 (a BAM file is used in the next tutorial). We will apply error correction to groups H1 and H2, using one of the samples from H3 as the high-quality reference.&lt;br /&gt;
&lt;br /&gt;
==Generate files for the error correction==&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes of our list, while population H2 consists of genomes 3 to 7. We want to apply error correction to these genomes because we know they have been subject to contamination. We generate two files containing the paths of the H1 and H2 genomes to which error correction will be applied.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later be used to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==4-population test==&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test using:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
After running ANGSD we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The output file is&lt;br /&gt;
;bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
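The column naming can be reproduced with a few lines of Python. The sketch below decodes column names with the stated correspondence 0=A,1=C,2=G,3=T and lists which columns would count as ABBA or BABA. It assumes that the four digits are ordered H1,H2,H3,H4, which the text does not state explicitly; the helper names are hypothetical.&lt;br /&gt;

```python
from itertools import product

BASES = "ACGT"  # 0=A, 1=C, 2=G, 3=T, as stated in the text

def decode(column):
    """Turn a column name such as X0123 into the allele pattern ACGT."""
    return "".join(BASES[int(digit)] for digit in column[1:])

# All 256 column names in the order X0000, X0001, ..., X3333.
columns = ["X" + "".join(digits) for digits in product("0123", repeat=4)]

def is_abba(pattern):
    # Assumed digit order H1,H2,H3,H4: H1 and H4 share one allele, H2 and H3 the other.
    h1, h2, h3, h4 = pattern
    return h1 == h4 and h2 == h3 and h1 != h2

def is_baba(pattern):
    # H1 and H3 share one allele, H2 and H4 share the other.
    h1, h2, h3, h4 = pattern
    return h1 == h3 and h2 == h4 and h1 != h2

abba_columns = [c for c in columns if is_abba(decode(c))]
baba_columns = [c for c in columns if is_baba(decode(c))]
print(len(columns), len(abba_columns), len(baba_columns))
print(decode("X0001"), decode("X3333"))
```

Under the assumed digit order, 12 of the 256 columns are ABBA patterns and 12 are BABA patterns (four choices for the first allele times three for the second).&lt;br /&gt;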
&lt;br /&gt;
We run the R script specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an error rate between 0 and 0.005, in steps of 0.001, is added to each population individually for the base change A-&amp;gt;T. Several base changes can be tested by separating the letters with commas. It is also possible to specify the names of H1,H2,H3 shown on the plot (omit them to keep the standard names). In this case we use the generic names CEU1,CEU2,CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script prints the calculated D-statistic along with its Z-score, p-value, standard deviation and other quantities.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
These results are also written to four separate files&lt;br /&gt;
;1) result.Observed.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;2) result.ErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;3) result.TransRemErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;4) result.RemTrans.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z-score, pvalue=p-value from the Z-score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the number of patterns may not be an integer because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
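As a consistency check, the columns of result.Observed.txt can be related to each other with a few lines of Python. The sketch assumes the usual definition D=(nABBA-nBABA)/(nABBA+nBABA), a Z-score equal to D divided by the square root of V(JK-D), and a two-sided p-value from the standard normal distribution; it is not taken from estAvgError.R.&lt;br /&gt;

```python
import math

# Values copied from result.Observed.txt above.
n_abba = 246.033565
n_baba = 279.248560
variance = 0.004878  # V(JK-D)

# Usual definition of the D-statistic from ABBA and BABA counts.
d_stat = (n_abba - n_baba) / (n_abba + n_baba)

# Z-score: the estimate divided by its standard deviation.
z_score = d_stat / math.sqrt(variance)

# Two-sided p-value under a standard normal distribution.
p_value = math.erfc(abs(z_score) / math.sqrt(2))

print("D = %.6f, Z = %.3f, p = %.3f" % (d_stat, z_score, p_value))
```

Up to the rounding of the printed variance, these values agree with the mean(D), Z and pvalue columns reported in the file.&lt;br /&gt;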
&lt;br /&gt;
In case of error correction, the R script also creates the folder result.errorDataFolder containing:&lt;br /&gt;
-the file barPlotErrors.pdf showing a barplot of the error rates (plot shown below)&lt;br /&gt;
-the file plotAddErr.A2T.pdf showing the effect of error correction on transition A--&amp;gt;T (plot shown below)&lt;br /&gt;
-the file errorRates.txt showing in each line transition errors for each population, respectively&lt;br /&gt;
-all the files related to the addition of error correction to H1,H2,H3, necessary to plot the files plotAddErr.A2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrorJPGs.jpg]]&lt;br /&gt;
[[File:PlotAddErr.A2TJPG.jpg]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2836</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2836"/>
		<updated>2016-08-29T05:33:06Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), multiple individuals are allowed in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.RemTrans.txt|D stat and Z scores with Ancient Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)*.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
;2)bam.AllelePatterns.abbababa2counts (optional file)&lt;br /&gt;
The columns of this file are the same as above. The file shows the sum of the allele patterns observed over every possible combination of 4 genomes in the tree. This file has NO USE in the 4-population test. To print this file, see the option -combFile in the summary.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a value larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all outgroup reads carry the same base. Use this option only when the outgroup consists of a single genome; it will not work otherwise.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site; this avoids overflow of the ABBA and BABA counts.&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population H1 to H4).&lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: set the H4 (outgroup) allele to A at every site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts file with the allele patterns counted between single individuals (however, this file is NOT USED for the estimation of the D-statistic).&lt;br /&gt;
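The blocks defined by -blockSize are the units of the jackknife used to obtain the variance of D. As a rough illustration, the Python sketch below implements a plain (unweighted) delete-one block jackknife on per-block ABBA and BABA counts. This is an assumption made for clarity: the function name and the counts are invented, and the R script may use a different (for example weighted) scheme.&lt;br /&gt;

```python
import math

def jackknife_d(abba, baba):
    """Delete-one block jackknife for D; abba and baba hold per-block counts."""
    n = len(abba)
    total_abba, total_baba = sum(abba), sum(baba)
    d_full = (total_abba - total_baba) / (total_abba + total_baba)

    # Recompute D with each block left out in turn.
    leave_one_out = []
    for a, b in zip(abba, baba):
        rest_abba, rest_baba = total_abba - a, total_baba - b
        leave_one_out.append((rest_abba - rest_baba) / (rest_abba + rest_baba))

    mean_loo = sum(leave_one_out) / n
    d_jack = n * d_full - (n - 1) * mean_loo
    variance = (n - 1) / n * sum((d - mean_loo) ** 2 for d in leave_one_out)
    return d_full, d_jack, variance

# Made-up counts for three blocks, for illustration only.
d_full, d_jack, variance = jackknife_d([10, 12, 8], [8, 12, 10])
print(d_full, d_jack, math.sqrt(variance))
```

Blocks should be long enough to be approximately independent, which is why the block size is chosen larger than the extent of LD.&lt;br /&gt;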
&lt;br /&gt;
To filter bases by quality score, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial for the ABBABABA (Multipop) test=&lt;br /&gt;
&lt;br /&gt;
== Some preparation steps before using ANGSD ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Clone the latest version of ANGSD from the GitHub repository into your working folder (the angsd binary must then be compiled before use)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download 10 example BAM files, extract them into the folder ./bams/, and create a file bam.filelist listing their paths&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee, which will be used as the outgroup for the four-population test. A BAM file can be used as well (one of the other examples after the tutorial shows how).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #decompress; gunzip removes the .gz file itself&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now generate a fasta file from one of our 10 BAM files. We assume that this genome has very high quality, so that it can serve as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this tutorial we perform the ABBABABA test on H1, H2, H3, H4 consisting of 2, 5, 3, 1 individuals, respectively. Here we use a fasta file for the outgroup H4 (a BAM file is used in the next tutorial). We will apply error correction to groups H1 and H2, using one of the samples from H3 as the high-quality reference.&lt;br /&gt;
&lt;br /&gt;
;Prepare files for the estimation of type-specific error rates&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes of our list, while population H2 consists of genomes 3 to 7. We want to apply error correction to these genomes because we know they have been subject to contamination. We generate two files containing the paths of the H1 and H2 genomes to which error correction will be applied.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later be used to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;ABBABABA test&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test using:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
After running ANGSD we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The output file is&lt;br /&gt;
;bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an error rate between 0 and 0.005, in steps of 0.001, is added to each population individually for the base changes A-&amp;gt;T and C-&amp;gt;T. It is also possible to specify the names of H1,H2,H3 shown on the plot. In this case we use the generic names CEU1,CEU2,CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
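The addErr argument packs four fields separated by semicolons: the error-rate range (start, end, step), the source bases, the target base and the population names. The Python sketch below only illustrates that layout as described in the text; parse_add_err is a hypothetical helper, and pairing every source base with every target base is an assumption, not the documented behaviour of the R script.&lt;br /&gt;

```python
def parse_add_err(spec):
    """Split an addErr string into its four semicolon-separated fields."""
    rate_field, from_field, to_field, name_field = spec.split(";")
    start, stop, step = (float(value) for value in rate_field.split(","))
    # Number of steps between start and stop, rounded to absorb float noise.
    n_steps = round((stop - start) / step)
    rates = [round(start + i * step, 10) for i in range(n_steps + 1)]
    # Assumption: every source base is combined with every target base.
    changes = [(src, dst) for src in from_field.split(",") for dst in to_field.split(",")]
    return rates, changes, name_field.split(",")

rates, changes, names = parse_add_err("0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3")
print(rates)
print(changes)
print(names)
```

For the string used in the command above, this gives six error rates from 0 to 0.005, the two base changes A to T and C to T, and the three plot names.&lt;br /&gt;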
The script prints the calculated D-statistic along with its Z-score, p-value, standard deviation and other quantities.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
These results are also written to four separate files&lt;br /&gt;
;1) result.Observed.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;2) result.ErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;3) result.TransRemErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;4) result.RemTrans.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z-score, pvalue=p-value from the Z-score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the number of patterns may not be an integer because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
&lt;br /&gt;
In case of error correction, the R script also creates the folder result.errorDataFolder containing:&lt;br /&gt;
-the file barPlotErrors.pdf showing a barplot of the error rates (plot shown below)&lt;br /&gt;
-the file plotAddErr.A2T.pdf showing the effect of error correction on transition A--&amp;gt;T (plot shown below)&lt;br /&gt;
-the file errorRates.txt showing in each line transition errors for each population, respectively&lt;br /&gt;
-all the files related to the addition of error correction to H1,H2,H3, necessary to plot the files plotAddErr.A2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrorJPGs.jpg]]&lt;br /&gt;
[[File:PlotAddErr.A2TJPG.jpg]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2835</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2835"/>
		<updated>2016-08-26T13:59:23Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), multiple individuals are allowed in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.RemTrans.txt|D stat and Z scores with Ancient Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
;2)bam.AllelePatterns.abbababa2counts (optional file)&lt;br /&gt;
As above, each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. This file is not used as input for the ABBABABA test.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1:&lt;br /&gt;
Take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a size larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all outgroup reads show the same base. Use this option only when the outgroup consists of a single genome; it will not work otherwise.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts.&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population 1 to 4).&lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: set the H4 allele to A at each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts file containing the allele combination counts before the samples are weighted (this file is not used for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
To filter bases by quality score, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial of the ABBABABA (Multipop) test=&lt;br /&gt;
&lt;br /&gt;
;Some preparation steps before using ANGSD&lt;br /&gt;
&lt;br /&gt;
Download the latest version of angsd into your working folder from the GitHub repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, place them in the folder ./bams/ and create a file bam.filelist listing the pathnames of those datasets&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. It will be used as the outgroup for the four-population test. A bam file can be used as well (one of the other examples after the tutorial shows how).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file (gunzip has already removed chimpHg19.fa.gz)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, generate a fasta file from one of our 10 bam files. We assume this genome has very high quality, so that it can be used as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this tutorial we perform the ABBABABA test on H1,H2,H3,H4 consisting of 2,5,3,1 individuals, respectively. In this case we use a fasta file for the outgroup H4 (a bam file is used in the next tutorial). We will apply error correction to groups H1 and H2, using one of the samples from H3 as the high-quality reference.&lt;br /&gt;
&lt;br /&gt;
;Prepare files for the estimation of type specific error rates&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes of our list, while population H2 consists of the genomes 3 to 7. We want to apply error correction to those genomes, because we know they have been subjected to contamination. We generate two files containing the pathnames of the genomes of H1 and H2 on which we want to apply error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later be used to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;ABBABABA test&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test using:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
After running ANGSD we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The output file is&lt;br /&gt;
;bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 allele combinations, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an additional error rate between 0 and 0.005, in steps of 0.001, is added individually to each population for the substitutions A-&amp;gt;T and C-&amp;gt;T. It is also possible to specify the names of H1,H2,H3 shown on the plot. In this case we use the generic names CEU1,CEU2,CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script prints the calculated D-statistic along with the Z-score, p-value, standard deviation and other quantities.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
These results are also written to four separate files&lt;br /&gt;
;1) result.Observed.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;2) result.ErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;3) result.TransRemErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;4) result.RemTrans.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the pattern counts may not be integers because of how ANGSD handles multiple genomes per population.&lt;br /&gt;
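As a consistency check, the columns of result.Observed.txt shown above can be related to each other with a short shell sketch. It assumes that D is the normalized difference (nABBA-nBABA)/(nABBA+nBABA) and that Z is the jackknife estimate divided by the square root of its variance. These formulas are not stated on this page, but they reproduce the reported values to the precision shown. The function names d_stat and z_score are hypothetical.&lt;br /&gt;

```shell
# Hypothetical helpers, not part of ANGSD or estAvgError.R.
# d_stat: normalized difference of the ABBA and BABA counts (assumed formula).
d_stat() {
    awk -v abba="$1" -v baba="$2" \
        'BEGIN { printf "%.6f\n", (abba - baba) / (abba + baba) }'
}

# z_score: estimate divided by its standard error sqrt(V) (assumed formula).
z_score() {
    awk -v d="$1" -v v="$2" 'BEGIN { printf "%.3f\n", d / sqrt(v) }'
}

# Values copied from the result.Observed.txt row shown above.
d_stat 246.033565 279.248560    # prints -0.063233
z_score -0.063233 0.004878      # prints -0.905
```

Because the inputs are the rounded values from the file, the Z score agrees with the reported one only to about three decimals.&lt;br /&gt;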
&lt;br /&gt;
In case of error correction, the R script also creates the folder result.errorDataFolder containing:&lt;br /&gt;
-the file barPlotErrors.pdf showing a barplot of the error rates (plot shown below)&lt;br /&gt;
-the file plotAddErr.A2T.pdf showing the effect of error correction on the substitution A-&amp;gt;T (plot shown below)&lt;br /&gt;
-the file errorRates.txt showing in each line transition errors for each population, respectively&lt;br /&gt;
-all the files related to the addition of error correction to H1,H2,H3, necessary to plot the file plotAddErr.A2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrorJPGs.jpg]]&lt;br /&gt;
[[File:PlotAddErr.A2TJPG.jpg]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2834</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2834"/>
		<updated>2016-08-26T13:58:10Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the abbababa test (aka D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), multiple individuals are allowed in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.RemTrans.txt|D stat and Z scores with Ancient Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 allele combinations, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
;2)bam.AllelePatterns.abbababa2counts (optional file)&lt;br /&gt;
As above, each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 allele combinations, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. This file is not used as input for the ABBABABA test.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1:&lt;br /&gt;
Take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a size larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all outgroup reads show the same base. Use this option only when the outgroup consists of a single genome; it will not work otherwise.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts.&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population 1 to 4).&lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: set the H4 allele to A at each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts file containing the allele combination counts before the samples are weighted (this file is not used for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
To filter bases by quality score, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
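Because the groups are defined only by the order of the BAM files in the file list, it can help to check that the -sizeH* values add up to the number of listed files before running the test. The following is a minimal shell sketch under the assumption that the outgroup is given with -anc, so that only H1, H2 and H3 come from the list; check_sizes is a hypothetical helper, not an ANGSD command.&lt;br /&gt;

```shell
# check_sizes FILELIST SIZE_H1 SIZE_H2 SIZE_H3
# Hypothetical helper: succeeds only if the three group sizes sum to the
# number of non-empty lines in the file list.
check_sizes() {
    n_files=$(awk 'NF { n++ } END { print n + 0 }' "$1")
    n_groups=$(( $2 + $3 + $4 ))
    if [ "$n_files" -eq "$n_groups" ]; then
        echo "OK: $n_files files, $n_groups individuals"
    else
        echo "MISMATCH: $n_files files, $n_groups individuals"
        return 1
    fi
}
```

For the tutorial below, check_sizes bam.filelist 2 5 3 should report 10 files and 10 individuals.&lt;br /&gt;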
&lt;br /&gt;
=Tutorial of the ABBABABA (Multipop) test=&lt;br /&gt;
&lt;br /&gt;
;Some preparation steps before using ANGSD&lt;br /&gt;
&lt;br /&gt;
Download the latest version of angsd into your working folder from the GitHub repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, place them in the folder ./bams/ and create a file bam.filelist listing the pathnames of those datasets&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. It will be used as the outgroup for the four-population test. A bam file can be used as well (one of the other examples after the tutorial shows how).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file (gunzip has already removed chimpHg19.fa.gz)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, generate a fasta file from one of our 10 bam files. We assume this genome has very high quality, so that it can be used as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this tutorial we perform the ABBABABA test on H1,H2,H3,H4 consisting of 2,5,3,1 individuals, respectively. In this case we use a fasta file for the outgroup H4 (a bam file is used in the next tutorial). We will apply error correction to groups H1 and H2, using one of the samples from H3 as the high-quality reference.&lt;br /&gt;
&lt;br /&gt;
;Prepare files for the estimation of type specific error rates&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes of our list, while population H2 consists of the genomes 3 to 7. We want to apply error correction to those genomes, because we know they have been subjected to contamination. We generate two files containing the pathnames of the genomes of H1 and H2 on which we want to apply error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later be used to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;ABBABABA test&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test using:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
After running ANGSD we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The output file is&lt;br /&gt;
;bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 allele combinations, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an additional error rate between 0 and 0.005, in steps of 0.001, is added individually to each population for the substitutions A-&amp;gt;T and C-&amp;gt;T. It is also possible to specify the names of H1,H2,H3 shown on the plot. In this case we use the generic names CEU1,CEU2,CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script prints the calculated D-statistic along with the Z-score, p-value, standard deviation and other quantities.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
These results are also written to four separate files&lt;br /&gt;
;1) result.Observed.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;2) result.ErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;3) result.TransRemErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;4) result.RemTrans.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the pattern counts may not be integers because of how ANGSD handles multiple genomes per population.&lt;br /&gt;
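The result files with transition removal depend on which substitutions count as transitions: changes between A and G or between C and T are transitions, and every other change between two different bases is a transversion. The following small shell sketch illustrates that classification; the function name subst_type is hypothetical, and how estAvgError.R applies the rule internally is not shown on this page.&lt;br /&gt;

```shell
# subst_type is a hypothetical helper, not part of ANGSD.
# Both arguments are assumed to be one of A, C, G, T.
# Transitions are purine/purine (A,G) or pyrimidine/pyrimidine (C,T) changes.
subst_type() {
    case "$1$2" in
        AG|GA|CT|TC) echo transition ;;
        AA|CC|GG|TT) echo identical ;;
        *)           echo transversion ;;
    esac
}

subst_type C T   # prints transition
subst_type A T   # prints transversion
```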
&lt;br /&gt;
In case of error correction, the R script also creates the folder result.errorDataFolder containing:&lt;br /&gt;
-the file barPlotErrors.pdf showing a barplot of the error rates (plot shown below)&lt;br /&gt;
-the file plotAddErr.A2T.pdf showing the effect of error correction on the substitution A-&amp;gt;T (plot shown below)&lt;br /&gt;
-the file errorRates.txt showing in each line transition errors for each population, respectively&lt;br /&gt;
-all the files related to the addition of error correction to H1,H2,H3, necessary to plot the file plotAddErr.A2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrorJPGs.jpg]]&lt;br /&gt;
[[File:PlotAddErr.A2TJPG.jpg]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2833</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2833"/>
		<updated>2016-08-26T13:53:23Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the abbababa test (aka D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), multiple individuals are allowed in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.RemTrans.txt|D stat and Z scores with Ancient Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
=Output=&lt;br /&gt;
;1)bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 possible combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
;2)bam.AllelePatterns.abbababa2counts (optional file)&lt;br /&gt;
As above, each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 possible combinations of alleles, labelled in the same way (X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with 0=A,1=C,2=G,3=T) but computed before the individuals are weighted (see -combFile). This file is not used as input for the ABBABABA test.&lt;br /&gt;
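The labelling of the 256 count columns described above can be reproduced with a short Python sketch. The helper names (pattern_columns, is_abba, is_baba) are hypothetical, and the assumption that the four digits of a label are ordered H1,H2,H3,H4 is ours; it is not stated on this page.&lt;br /&gt;

```python
from itertools import product

BASES = "ACGT"  # 0=A, 1=C, 2=G, 3=T, as stated in the Output section


def pattern_columns():
    """Map the 256 column labels X0000..X3333 to their allele patterns."""
    columns = {}
    for digits in product(range(4), repeat=4):
        label = "X" + "".join(str(d) for d in digits)
        columns[label] = "".join(BASES[d] for d in digits)
    return columns


def is_abba(pattern):
    """ASSUMPTION: positions are ordered H1,H2,H3,H4 (not confirmed by the page)."""
    h1, h2, h3, h4 = pattern
    return h1 == h4 and h2 == h3 and h1 != h2


def is_baba(pattern):
    """Same ordering assumption as is_abba."""
    h1, h2, h3, h4 = pattern
    return h1 == h3 and h2 == h4 and h1 != h2


cols = pattern_columns()
print(len(cols), cols["X0000"], cols["X0001"], cols["X3333"])
print(sum(is_abba(p) for p in cols.values()), sum(is_baba(p) for p in cols.values()))
```

Under the stated ordering assumption, 12 of the 256 labels are ABBA patterns and 12 are BABA patterns (4 choices for the first allele times 3 for the second).&lt;br /&gt;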
&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a value larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is commonly used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup show the same base (use this only when the outgroup consists of a single genome; it will not work otherwise).&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, used to avoid overflow of the ABBA and BABA counts&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
decide how many individuals are in each group (the file list must contain the BAM files ordered from population 1 to 4). &lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts file containing the allele combination counts before the samples are weighted (this file is not used for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
For more advanced filtering of bases based on quality scores, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial of the ABBABABA (Multipop) test=&lt;br /&gt;
&lt;br /&gt;
;Some preparation steps before using ANGSD&lt;br /&gt;
&lt;br /&gt;
Download the latest version of ANGSD from the GitHub repository into your working folder&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, place them in the folder ./bams/ and create a file bam.filelist listing their pathnames&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. It will be used as the outgroup for the four-population test. A bam file can be used instead (one of the other examples after the tutorial shows how).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #decompress; gunzip also removes the .gz file&lt;br /&gt;
samtools faidx chimpHg19.fa #index the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, generate a fasta file from one of our 10 bam files. We assume this genome is of very high quality, so that it can be used as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this tutorial we perform the ABBABABA test on H1,H2,H3,H4, consisting of 2,5,3,1 individuals, respectively. Here we use a fasta file for the outgroup H4 (a bam file is used in the next tutorial). We will apply error correction to groups H1 and H2, using one of the samples from H3 as the high-quality reference.&lt;br /&gt;
&lt;br /&gt;
;Prepare files for the estimation of type specific error rates&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes in our list, while population H2 consists of genomes 3 to 7. We want to apply error correction to these genomes because we know they have been subject to contamination. We generate two files containing the pathnames of the H1 and H2 genomes to which error correction will be applied.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later be used to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;ABBABABA test&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test using:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
After running ANGSD we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The output file is&lt;br /&gt;
;bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 possible combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
&lt;br /&gt;
We run the R script, specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an additional error rate between 0 and 0.005, in steps of 0.001, is added to each population individually for the substitutions A--&amp;gt;T and C--&amp;gt;T. It is also possible to specify the names of H1,H2,H3 to be shown on the plot; here we use the generic names CEU1,CEU2,CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script prints the calculated D-statistic along with its standard deviation, jackknife estimate, Z-score and p-value.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
These results are also written to four separate files&lt;br /&gt;
;1) result.Observed.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;2) result.ErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;3) result.TransRemErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;4) result.RemTrans.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the pattern counts may not be integers because of how ANGSD handles multiple genomes per population.&lt;br /&gt;
&lt;br /&gt;
In case of error correction, the R script also creates the folder result.errorDataFolder containing:&lt;br /&gt;
-the file barPlotErrors.pdf, showing a barplot of the error rates (plot shown below)&lt;br /&gt;
-the file plotAddErr.A2T.pdf, showing the effect of error correction for the A--&amp;gt;T substitution (plot shown below)&lt;br /&gt;
-the file errorRates.txt, showing the error rates, one line per population&lt;br /&gt;
-all the files related to the added errors for H1,H2,H3, needed to produce the plot plotAddErr.A2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrorJPGs.jpg]]&lt;br /&gt;
[[File:PlotAddErr.A2TJPG.jpg]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2832</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2832"/>
		<updated>2016-08-26T13:50:50Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the abbababa test (also known as the D-statistic), which tests for ancient admixture (or an incorrect tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.RemTrans.txt|D stat and Z scores with Ancient Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a value larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is commonly used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup show the same base (use this only when the outgroup consists of a single genome; it will not work otherwise).&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, used to avoid overflow of the ABBA and BABA counts&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
decide how many individuals are in each group (the file list must contain the BAM files ordered from population 1 to 4). &lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts file containing the allele combination counts before the samples are weighted (this file is not used for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
For more advanced filtering of bases based on quality scores, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial of the ABBABABA (Multipop) test=&lt;br /&gt;
&lt;br /&gt;
;Some preparation steps before using ANGSD&lt;br /&gt;
&lt;br /&gt;
Download the latest version of ANGSD from the GitHub repository into your working folder&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, place them in the folder ./bams/ and create a file bam.filelist listing their pathnames&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. It will be used as the outgroup for the four-population test. A bam file can be used instead (one of the other examples after the tutorial shows how).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #decompress; gunzip also removes the .gz file&lt;br /&gt;
samtools faidx chimpHg19.fa #index the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, generate a fasta file from one of our 10 bam files. We assume this genome is of very high quality, so that it can be used as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In this tutorial we perform the ABBABABA test on H1,H2,H3,H4, consisting of 2,5,3,1 individuals, respectively. Here we use a fasta file for the outgroup H4 (a bam file is used in the next tutorial). We will apply error correction to groups H1 and H2, using one of the samples from H3 as the high-quality reference.&lt;br /&gt;
&lt;br /&gt;
;Prepare files for the estimation of type specific error rates&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes in our list, while population H2 consists of genomes 3 to 7. We want to apply error correction to these genomes because we know they have been subject to contamination. We generate two files containing the pathnames of the H1 and H2 genomes to which error correction will be applied.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later be used to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;ABBABABA test&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test using:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
After running ANGSD we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The output file is&lt;br /&gt;
;bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 possible combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
Optionally, one can also produce&lt;br /&gt;
;bam.AllelePatterns.abbababa2counts (optional file)&lt;br /&gt;
As above, each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 possible combinations of alleles, labelled in the same way (X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with 0=A,1=C,2=G,3=T) but computed before the individuals are weighted (see -combFile). This file is not used as input for the ABBABABA test.&lt;br /&gt;
&lt;br /&gt;
We run the R script, specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an additional error rate between 0 and 0.005, in steps of 0.001, is added to each population individually for the substitutions A--&amp;gt;T and C--&amp;gt;T. It is also possible to specify the names of H1,H2,H3 to be shown on the plot; here we use the generic names CEU1,CEU2,CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script prints the calculated D-statistic along with its standard deviation, jackknife estimate, Z-score and p-value.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
These results are also written to four separate files&lt;br /&gt;
;1) result.Observed.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;2) result.ErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;3) result.TransRemErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;4) result.RemTrans.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the pattern counts may not be integers because of how ANGSD handles multiple genomes per population.&lt;br /&gt;
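The relationship between these columns can be checked by hand. The Python sketch below assumes, since the page does not state it, that mean(D) equals (nABBA - nBABA)/(nABBA + nBABA), that Z equals mean(D) divided by the square root of V(JK-D), and that the p-value is two-sided under a standard normal distribution. The function name d_stat_summary is hypothetical and is not part of estAvgError.R.&lt;br /&gt;

```python
import math


def d_stat_summary(n_abba, n_baba, variance):
    """Return (D, Z, two-sided p-value) from weighted ABBA/BABA counts.

    ASSUMPTION: these formulas are the standard D-statistic definitions;
    the page itself does not spell them out.
    """
    d = (n_abba - n_baba) / (n_abba + n_baba)
    z = d / math.sqrt(variance)
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return d, z, p


# Values copied from result.Observed.txt
d, z, p = d_stat_summary(246.033565, 279.248560, 0.004878)
print(d, z, p)
```

With the values from result.Observed.txt, these assumed formulas give figures that agree with the reported mean(D), Z and pvalue up to the rounding of the printed variance.&lt;br /&gt;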
&lt;br /&gt;
When error correction is applied, the R script also creates the folder result.errorDataFolder, containing:&lt;br /&gt;
- the file barPlotErrors.pdf, a barplot of the error rates (plot shown below)&lt;br /&gt;
- the file plotAddErr.A2T.pdf, showing the effect of error correction on the substitution A--&amp;gt;T (plot shown below)&lt;br /&gt;
- the file errorRates.txt, listing on each line the substitution error rates of the corresponding population&lt;br /&gt;
- all the files related to the addition of error correction to H1, H2, H3, which are needed to produce plotAddErr.A2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrorJPGs.jpg]]&lt;br /&gt;
[[File:PlotAddErr.A2TJPG.jpg]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2831</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2831"/>
		<updated>2016-08-26T13:15:49Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or for an incorrect tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be the same.&lt;br /&gt;
; Some of the options only work in the development version available from GitHub.&lt;br /&gt;
; If you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.RemTrans.txt|D stat and Z scores with Ancient Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
Run the analysis using all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block. Choose a value larger than the extent of LD in the populations. For humans, 5Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all outgroup reads show the same base (use this only when the outgroup consists of a single genome; it will not work otherwise).&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, used to avoid overflow of the ABBA and BABA counts.&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population 1 to 4). &lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts to obtain the allele events before weighting the samples (however, this file is not used for the estimation of the D-statistic).&lt;br /&gt;
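&lt;br /&gt;
To illustrate why -blockSize matters, the following is a minimal Python sketch of a delete-one block jackknife over per-block ABBA and BABA counts. It is an unweighted textbook version written for illustration only: the function names are hypothetical, the counts are made up, and estAvgError.R may use a different (for example weighted) scheme.&lt;br /&gt;

```python
def d_stat(abba, baba):
    """D = (sum ABBA - sum BABA) / (sum ABBA + sum BABA) over the given blocks."""
    total_abba, total_baba = sum(abba), sum(baba)
    return (total_abba - total_baba) / (total_abba + total_baba)


def block_jackknife(abba, baba):
    """Delete-one block jackknife (unweighted; an assumption, not ANGSD's code).

    abba, baba: lists with one count per genomic block.
    Returns (D on all blocks, jackknife estimate, jackknife variance).
    """
    n = len(abba)
    full = d_stat(abba, baba)
    # Recompute D with each block left out in turn.
    leave_one_out = [
        d_stat(abba[:i] + abba[i + 1:], baba[:i] + baba[i + 1:])
        for i in range(n)
    ]
    mean_loo = sum(leave_one_out) / n
    jk_estimate = n * full - (n - 1) * mean_loo
    jk_variance = (n - 1) / n * sum((d - mean_loo) ** 2 for d in leave_one_out)
    return full, jk_estimate, jk_variance


# Toy counts for four blocks (made-up numbers, not ANGSD output).
full, jk, var = block_jackknife([10.0, 12.0, 8.0, 11.0], [9.0, 13.0, 10.0, 12.0])
```

The jackknife treats blocks as approximately independent, which is why the block size should exceed the extent of LD.&lt;br /&gt;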
&lt;br /&gt;
To filter bases by quality score, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial of the ABBABABA (Multipop) test=&lt;br /&gt;
;Some preparation steps before using ANGSD&lt;br /&gt;
&lt;br /&gt;
Download the latest version of ANGSD into your working folder from the GitHub repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to the angsd executable and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download 10 example BAM files, place them in the folder ./bams/ and create a file bam.filelist listing their paths &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This is what the file bam.filelist looks like&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cat bam.filelist&lt;br /&gt;
&lt;br /&gt;
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. It will be used as the outgroup for the four-population test. A BAM file can be used as well (see the other examples after the tutorial).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #decompress; this also removes the .gz file&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now generate a fasta file from one of our 10 BAM files. We assume this genome is of very high quality, so that it can serve as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;Prepare files for the estimation of type specific error rates&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes of our list, while population H2 consists of the genomes 3 to 7. We want to apply error correction to those genomes, because we know they have been subjected to contamination. We generate two files containing the pathnames of the genomes of H1 and H2 on which we want to apply error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later be used to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;ABBABABA test&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test on:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
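The grouping listed above follows purely from the order of the lines in bam.filelist. As a rough Python sketch (the helper name and the sample paths are hypothetical, and treating a size mismatch as an error is an assumption of this sketch, not documented ANGSD behaviour), the assignment of files to H1, H2 and H3 can be pictured as:&lt;br /&gt;

```python
def split_groups(paths, size_h1, size_h2, size_h3):
    """Split an ordered list of BAM paths into H1, H2, H3 (in that order)."""
    sizes = (size_h1, size_h2, size_h3)
    if sum(sizes) != len(paths):
        raise ValueError("group sizes do not add up to the number of BAM files")
    groups, start = [], 0
    for size in sizes:
        # Each group takes the next `size` consecutive lines of the file list.
        groups.append(paths[start:start + size])
        start += size
    return groups


# Hypothetical stand-ins for the ten lines of bam.filelist.
paths = [f"bams/sample{i}.bam" for i in range(1, 11)]
h1, h2, h3 = split_groups(paths, 2, 5, 3)
```

With sizes 2, 5 and 3, H1 receives lines 1 to 2, H2 lines 3 to 7 and H3 lines 8 to 10, while H4 comes from the fasta file given with -anc.&lt;br /&gt;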
After running ANGSD, we call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The output file is&lt;br /&gt;
;bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 possible combinations of alleles, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
Optionally, with -combFile 1, one can also produce&lt;br /&gt;
;bam.AllelePatterns.abbababa2counts (optional file)&lt;br /&gt;
As above, each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 possible combinations of alleles, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T, counted before the individuals are weighted. This file is not used as input for the ABBABABA test.&lt;br /&gt;
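&lt;br /&gt;
The column labels can be decoded mechanically. The Python sketch below is illustrative only: it assumes that the four digits of a label refer to H1, H2, H3 and H4 in that order, and it uses the usual definition of ABBA (H1=H4, H2=H3) and BABA (H1=H3, H2=H4); the function names are hypothetical.&lt;br /&gt;

```python
from itertools import product

BASES = "ACGT"  # 0=A, 1=C, 2=G, 3=T


def decode(label):
    """Turn a column label such as 'X0123' into the allele string 'ACGT'."""
    return "".join(BASES[int(digit)] for digit in label[1:])


def classify(pattern):
    """Classify a four-letter pattern, assumed to be ordered H1, H2, H3, H4."""
    h1, h2, h3, h4 = pattern
    if h1 == h4 and h2 == h3 and h1 != h2:
        return "ABBA"
    if h1 == h3 and h2 == h4 and h1 != h2:
        return "BABA"
    return "other"


# All 256 labels in the order X0000, X0001, ..., X3333.
columns = ["X" + "".join(digits) for digits in product("0123", repeat=4)]
abba_columns = [c for c in columns if classify(decode(c)) == "ABBA"]
```

Under these assumptions, 12 of the 256 columns are ABBA patterns (for example X0110=ACCA) and 12 are BABA patterns (for example X1010=CACA).&lt;br /&gt;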
&lt;br /&gt;
We run the R script, specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an error rate between 0 and 0.005, in steps of 0.001, is added individually to each population for the substitutions A--&amp;gt;T and C--&amp;gt;T. It is also possible to specify the names of H1, H2, H3 shown on the plot; here we use the generic names CEU1, CEU2, CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script prints the calculated D-statistic along with its Z-score, p-value, standard deviation and other quantities.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Those results are also contained in four distinct files&lt;br /&gt;
;1) result.Observed.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;2) result.ErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;3) result.TransRemErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;4) result.RemTrans.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the columns in the four files are: mean(D) = average D-statistic, JK-D = jackknife estimate of the D-statistic, V(JK-D) = variance of the D-statistic, Z = Z-score, pvalue = p-value derived from the Z-score, nABBA = number of ABBA patterns observed, nBABA = number of BABA patterns observed, nBBAA = all the other observed patterns. Note that the pattern counts may not be integers because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
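&lt;br /&gt;
The relation between these columns can be checked by hand. The Python sketch below assumes that D = (nABBA - nBABA) / (nABBA + nBABA), that Z = D / sqrt(V(JK-D)), and that the p-value is two-sided under a standard normal distribution. These assumptions are consistent with the example numbers in result.Observed.txt above, but they are not taken from the R script itself, and the function names are hypothetical.&lt;br /&gt;

```python
import math


def d_from_counts(n_abba, n_baba):
    """D-statistic from the (possibly non-integer) ABBA and BABA counts."""
    return (n_abba - n_baba) / (n_abba + n_baba)


def z_and_pvalue(d, variance):
    """Z-score and two-sided normal p-value (assumed form, see lead-in)."""
    z = d / math.sqrt(variance)
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Values copied from result.Observed.txt above.
d = d_from_counts(246.033565, 279.248560)
z, p = z_and_pvalue(d, 0.004878)
print(f"D={d:.6f} Z={z:.3f} p={p:.3f}")
```

With these inputs the sketch gives approximately D=-0.0632, Z=-0.905 and p=0.365, in line with the first row of the results table.&lt;br /&gt;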
&lt;br /&gt;
When error correction is applied, the R script also creates the folder result.errorDataFolder, containing:&lt;br /&gt;
- the file barPlotErrors.pdf, a barplot of the error rates (plot shown below)&lt;br /&gt;
- the file plotAddErr.A2T.pdf, showing the effect of error correction on the substitution A--&amp;gt;T (plot shown below)&lt;br /&gt;
- the file errorRates.txt, listing on each line the substitution error rates of the corresponding population&lt;br /&gt;
- all the files related to the addition of error correction to H1, H2, H3, which are needed to produce plotAddErr.A2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrorJPGs.jpg]]&lt;br /&gt;
[[File:PlotAddErr.A2TJPG.jpg]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2830</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2830"/>
		<updated>2016-08-26T13:12:30Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or for an incorrect tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be the same.&lt;br /&gt;
; Some of the options only work in the development version available from GitHub.&lt;br /&gt;
; If you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.RemTrans.txt|D stat and Z scores with Ancient Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
Run the analysis using all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block. Choose a value larger than the extent of LD in the populations. For humans, 5Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all outgroup reads show the same base (use this only when the outgroup consists of a single genome; it will not work otherwise).&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, used to avoid overflow of the ABBA and BABA counts.&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population 1 to 4). &lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts to obtain the allele events before weighting the samples (however, this file is not used for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
To filter bases by quality score, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial of the ABBABABA (Multipop) test=&lt;br /&gt;
;Some preparation steps before using ANGSD&lt;br /&gt;
&lt;br /&gt;
Download the latest version of ANGSD into your working folder from the GitHub repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to the angsd executable and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download 10 example BAM files, place them in the folder ./bams/ and create a file bam.filelist listing their paths &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. It will be used as the outgroup for the four-population test. A BAM file can be used as well.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #decompress; this also removes the .gz file&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now generate a fasta file from one of our 10 BAM files. We assume this genome is of very high quality, so that it can serve as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;Prepare files for the estimation of type specific error rates&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes of our list, while population H2 consists of the genomes 3 to 7. We want to apply error correction to those genomes, because we know they have been subjected to contamination. We generate two files containing the pathnames of the genomes of H1 and H2 on which we want to apply error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later be used to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;ABBABABA test&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test on:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
After running ANGSD, we call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The output file is&lt;br /&gt;
;bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 possible combinations of alleles, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
Optionally, with -combFile 1, one can also produce&lt;br /&gt;
;bam.AllelePatterns.abbababa2counts (optional file)&lt;br /&gt;
As above, each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 possible combinations of alleles, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T, counted before the individuals are weighted. This file is not used as input for the ABBABABA test.&lt;br /&gt;
&lt;br /&gt;
We run the R script, specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an error rate between 0 and 0.005, in steps of 0.001, is added individually to each population for the substitutions A--&amp;gt;T and C--&amp;gt;T. It is also possible to specify the names of H1, H2, H3 shown on the plot; here we use the generic names CEU1, CEU2, CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script prints the calculated D-statistic along with its Z-score, p-value, standard deviation and other quantities.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Those results are also contained in four distinct files&lt;br /&gt;
;1) result.Observed.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;2) result.ErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;3) result.TransRemErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;4) result.RemTrans.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the columns in the four files are: mean(D) = average D-statistic, JK-D = jackknife estimate of the D-statistic, V(JK-D) = variance of the D-statistic, Z = Z-score, pvalue = p-value derived from the Z-score, nABBA = number of ABBA patterns observed, nBABA = number of BABA patterns observed, nBBAA = all the other observed patterns. Note that the pattern counts may not be integers because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
&lt;br /&gt;
In case of error correction, the R script also creates the folder result.errorDataFolder containing:&lt;br /&gt;
-the file barPlotErrors.pdf showing a barplot of the error rates (plot shown below)&lt;br /&gt;
-the file plotAddErr.A2T.pdf showing the effect of error correction on the substitution A--&amp;gt;T (plot shown below)&lt;br /&gt;
-the file errorRates.txt listing the type-specific error rates, one line per population&lt;br /&gt;
-all the files related to the addition of error correction to H1,H2,H3, which are needed to produce the plot plotAddErr.A2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrorJPGs.jpg]]&lt;br /&gt;
[[File:PlotAddErr.A2TJPG.jpg]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2829</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2829"/>
		<updated>2016-08-26T13:12:08Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), multiple individuals are allowed in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.RemTrans.txt|D stat and Z scores with Ancient Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block. Choose a number that is higher than the LD in the populations. For human 5Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads of the outgroup carry the same base. Use this option only when the outgroup consists of a single genome; it will not work otherwise.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
decide how many individuals are in each group (the file list must contain the BAM files ordered from population 1 to 4). &lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts to obtain the allele events before weighting the samples (however, this file is not used for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial of the ABBABABA (Multipop) test=&lt;br /&gt;
;Some preparation steps before using ANGSD&lt;br /&gt;
&lt;br /&gt;
Download the latest version of ANGSD from the GitHub repository into your working folder&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to the compiled angsd binary and to the required R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example BAM files, place them in the folder ./bams/ and create a file bam.filelist listing their paths&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam; do samtools index &amp;quot;$i&amp;quot;; done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the four-population test. One can use a bam file as well.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #decompress; gunzip deletes the .gz file itself&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, generate a fasta file from one of our 10 BAM files. We assume this genome is of very high quality, so that we can use it as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;Prepare files for the estimation of type specific error rates&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes of our list, while population H2 consists of the genomes 3 to 7. We want to apply error correction to those genomes, because we know they have been subjected to contamination. We generate two files containing the pathnames of the genomes of H1 and H2 on which we want to apply error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
and then we use &amp;quot;doAncError&amp;quot; to generate the intermediate files that we will use later to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;ABBABABA test&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test on:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
After running ANGSD, we call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The output file is&lt;br /&gt;
;bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
Optionally, with -combFile 1, one can also produce&lt;br /&gt;
;bam.AllelePatterns.abbababa2counts (optional file)&lt;br /&gt;
As above, each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3), followed by the counts of all 256 combinations of alleles, from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. These counts are taken before the individuals are weighted, and the file is not used as input for the ABBABABA test.&lt;br /&gt;
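&lt;br /&gt;
The column naming described above can be sketched in Python as follows. The sketch generates the 256 labels in order, decodes a label into its four bases, and classifies a pattern as ABBA, BABA or BBAA. It assumes that the four digits are ordered H1,H2,H3,H4 and that the H4 base plays the role of allele A; this ordering is an assumption, not something stated on this page. All function names are hypothetical helpers, not part of ANGSD.&lt;br /&gt;

```python
from itertools import product

BASES = "ACGT"  # index 0=A, 1=C, 2=G, 3=T


def column_names():
    """All 256 column labels in the order X0000, X0001, ..., X3333."""
    return ["X" + "".join(str(i) for i in idx) for idx in product(range(4), repeat=4)]


def decode(label):
    """Turn a label such as 'X0001' into the allele string 'AAAC'."""
    return "".join(BASES[int(ch)] for ch in label[1:])


def classify(label):
    """Return 'ABBA', 'BABA', 'BBAA' or 'other' (assumes digit order H1,H2,H3,H4)."""
    h1, h2, h3, h4 = decode(label)
    if h1 == h4 and h2 == h3 and h2 != h4:
        return "ABBA"
    if h2 == h4 and h1 == h3 and h1 != h4:
        return "BABA"
    if h3 == h4 and h1 == h2 and h1 != h4:
        return "BBAA"
    return "other"


names = column_names()
abba_columns = [n for n in names if classify(n) == "ABBA"]
print(len(names), decode("X0001"), len(abba_columns))
```

Under this assumption each of the three pattern types corresponds to 12 of the 256 columns (4 choices for A times 3 choices for B).&lt;br /&gt;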
&lt;br /&gt;
We run the R script, specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an error rate between 0 and 0.005, in steps of 0.001, is added to each population individually for the substitutions A--&amp;gt;T and C--&amp;gt;T. It is also possible to specify the names of H1,H2,H3 to be shown on the plots; here we use the generic names CEU1,CEU2,CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
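The structure of the addErr argument used above can be illustrated with a short Python sketch. It assumes, based only on the description in this tutorial, that the four semicolon-separated fields are: the error-rate range as start,stop,step; the bases the error starts from; the bases it changes into; and the population names for the plots. How estAvgError.R parses the argument internally is not shown here, and parse_add_err is a hypothetical helper.&lt;br /&gt;

```python
def parse_add_err(spec):
    """Split an addErr string into an error-rate grid, base pairs and names."""
    rates, from_bases, to_bases, names = spec.split(";")
    start, stop, step = (float(x) for x in rates.split(","))
    n_steps = int(round((stop - start) / step))
    # Round to suppress floating-point noise in the grid values
    grid = [round(start + i * step, 10) for i in range(n_steps + 1)]
    pairs = [(f, t) for f in from_bases.split(",") for t in to_bases.split(",")]
    return grid, pairs, names.split(",")


grid, pairs, names = parse_add_err("0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3")
print(grid)
print(pairs)
print(names)
```

For the argument in this tutorial the sketch yields six error rates, the two substitutions A to T and C to T, and three population names.&lt;br /&gt;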
The script prints the calculated D-statistic along with the Z-score, p-value, standard deviation and other quantities.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
These results are also written to four separate files&lt;br /&gt;
;1) result.Observed.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;2) result.ErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;3) result.TransRemErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;4) result.RemTrans.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the jackknife estimate of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the pattern counts need not be integers because of how ANGSD weights multiple genomes per population.&lt;br /&gt;
&lt;br /&gt;
In case of error correction, the R script also creates the folder result.errorDataFolder containing:&lt;br /&gt;
-the file barPlotErrors.pdf showing a barplot of the error rates (plot shown below)&lt;br /&gt;
-the file plotAddErr.A2T.pdf showing the effect of error correction on the substitution A--&amp;gt;T (plot shown below)&lt;br /&gt;
-the file errorRates.txt listing the type-specific error rates, one line per population&lt;br /&gt;
-all the files related to the addition of error correction to H1,H2,H3, which are needed to produce the plot plotAddErr.A2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrorJPGs.jpg]]&lt;br /&gt;
[[File:PlotAddErr.A2TJPG.jpg]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2828</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2828"/>
		<updated>2016-08-26T13:11:02Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), multiple individuals are allowed in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.RemTrans.txt|D stat and Z scores with Ancient Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block. Choose a number that is higher than the LD in the populations. For human 5Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads of the outgroup carry the same base. Use this option only when the outgroup consists of a single genome; it will not work otherwise.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
decide how many individuals are in each group (the file list must contain the BAM files ordered from population 1 to 4). &lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts to obtain the allele events before weighting the samples (however, this file is not used for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial of the ABBABABA (Multipop) test=&lt;br /&gt;
;Some preparation steps before using ANGSD&lt;br /&gt;
&lt;br /&gt;
Download the latest version of ANGSD from the GitHub repository into your working folder&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to the compiled angsd binary and to the required R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example BAM files, place them in the folder ./bams/ and create a file bam.filelist listing their paths&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam; do samtools index &amp;quot;$i&amp;quot;; done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the four-population test. One can use a bam file as well.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #decompress; gunzip deletes the .gz file itself&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, generate a fasta file from one of our 10 BAM files. We assume this genome is of very high quality, so that we can use it as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;Prepare files for the estimation of type specific error rates&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes of our list, while population H2 consists of the genomes 3 to 7. We want to apply error correction to those genomes, because we know they have been subjected to contamination. We generate two files containing the pathnames of the genomes of H1 and H2 on which we want to apply error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
and then we use &amp;quot;doAncError&amp;quot; to generate the intermediate files that we will use later to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;ABBABABA test&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test on:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
After running ANGSD, we call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
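The way the group sizes in the command above map onto the ordered file list can be sketched in Python. The sketch assumes that files are assigned to H1, H2, H3 and H4 strictly in list order, as the description of -sizeH* suggests. Note that the hypothetical size_h4 argument counts BAM files only, so it is 0 when the outgroup comes from a fasta file; it is not the ANGSD -sizeH4 flag, which should be left at its default in that case. The file names are placeholders.&lt;br /&gt;

```python
def split_groups(paths, size_h1, size_h2, size_h3, size_h4=0):
    """Assign consecutive file-list entries to H1..H4 in list order."""
    sizes = [size_h1, size_h2, size_h3, size_h4]
    if sum(sizes) != len(paths):
        raise ValueError("group sizes do not add up to the number of files")
    groups, start = {}, 0
    for name, size in zip(["H1", "H2", "H3", "H4"], sizes):
        groups[name] = paths[start:start + size]
        start += size
    return groups


files = ["bam%02d.bam" % i for i in range(1, 11)]  # placeholder names, not the real files
groups = split_groups(files, 2, 5, 3)
for name in ["H1", "H2", "H3", "H4"]:
    print(name, groups[name])
```

Running such a check before the analysis can catch a file list whose order does not match the intended populations.&lt;br /&gt;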
The output file is&lt;br /&gt;
;bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
Optionally, with -combFile 1, one can also produce&lt;br /&gt;
;bam.AllelePatterns.abbababa2counts (optional file)&lt;br /&gt;
As above, each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3), followed by the counts of all 256 combinations of alleles, from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. These counts are taken before the individuals are weighted, and the file is not used as input for the ABBABABA test.&lt;br /&gt;
&lt;br /&gt;
We run the R script, specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an error rate between 0 and 0.005, in steps of 0.001, is added to each population individually for the substitutions A--&amp;gt;T and C--&amp;gt;T. It is also possible to specify the names of H1,H2,H3 to be shown on the plots; here we use the generic names CEU1,CEU2,CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script prints the calculated D-statistic along with the Z-score, p-value, standard deviation and other quantities.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
These results are also written to four separate files&lt;br /&gt;
;1) result.Observed.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;2) result.ErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;3) result.TransRemErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;4) result.RemTrans.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the jackknife estimate of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the pattern counts need not be integers because of how ANGSD weights multiple genomes per population.&lt;br /&gt;
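&lt;br /&gt;
To illustrate how a jackknife variance can be obtained from the per-block counts, here is a minimal Python sketch of an unweighted delete-one-block jackknife. It is only an illustration under simplifying assumptions: estAvgError.R may weight blocks differently, and the per-block counts below are toy numbers, not real data. The function name is hypothetical.&lt;br /&gt;

```python
import math


def block_jackknife_d(blocks):
    """Delete-one-block jackknife for D from per-block (nABBA, nBABA) pairs."""
    total_abba = sum(a for a, _ in blocks)
    total_baba = sum(b for _, b in blocks)
    d_full = (total_abba - total_baba) / (total_abba + total_baba)
    n = len(blocks)
    # Recompute D with each block left out in turn
    leave_one_out = []
    for a, b in blocks:
        num = (total_abba - a) - (total_baba - b)
        den = (total_abba - a) + (total_baba - b)
        leave_one_out.append(num / den)
    mean_loo = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((d - mean_loo) ** 2 for d in leave_one_out)
    return d_full, variance, d_full / math.sqrt(variance)


blocks = [(10.0, 12.0), (8.0, 9.0), (11.0, 10.0), (9.0, 13.0)]  # toy counts
d_full, variance, z = block_jackknife_d(blocks)
print(d_full, variance, z)
```

Blocks are left out one at a time so that linkage within a block does not inflate the significance of D, which is why -blockSize should exceed the extent of LD.&lt;br /&gt;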
&lt;br /&gt;
In case of error correction, the R script also creates the folder result.errorDataFolder containing:&lt;br /&gt;
-the file barPlotErrors.pdf showing a barplot of the error rates&lt;br /&gt;
-the file plotAddErr.A2T.pdf showing the effect of error correction on the substitution A--&amp;gt;T&lt;br /&gt;
-the file plotAddErr.C2T.pdf showing the effect of error correction on the substitution C--&amp;gt;T&lt;br /&gt;
-the file errorRates.txt listing the type-specific error rates, one line per population&lt;br /&gt;
-all the files related to the addition of error correction to H1,H2,H3, which are needed to produce the plots plotAddErr.A2T.pdf and plotAddErr.C2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrorJPGs.jpg]]&lt;br /&gt;
[[File:PlotAddErr.A2TJPG.jpg]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2827</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2827"/>
		<updated>2016-08-26T13:09:04Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the abbababa test (aka D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), multiple individuals are allowed in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.RemTrans.txt|D stat and Z scores with Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block. Choose a number that is higher than the LD in the populations. For human 5Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup show the same base (this only works when the outgroup consists of a single genome).&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual 0: all bases at each position are used for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
allows for a maximum depth in each site to avoid overflow of the ABBA BABA counts&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
decide how many individuals are in each group (the file list must contain the BAM files ordered from population 1 to 4). &lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts to obtain the allele events before weighting the samples (however, this file is not used for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
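As a rough illustration of why the data are split into blocks (-blockSize), the sketch below runs a plain delete-one block jackknife on per-block ABBA and BABA counts. It is not the code of estAvgError.R: the function name block_jackknife_d and the example counts are hypothetical, and the unweighted form of the jackknife is an assumption made for simplicity.&lt;br /&gt;

```python
def block_jackknife_d(blocks):
    """Delete-one block jackknife for D; blocks is a list of (n_abba, n_baba)."""
    total_abba = sum(a for a, _ in blocks)
    total_baba = sum(b for _, b in blocks)
    # D computed from the counts summed over all blocks
    d_full = (total_abba - total_baba) / (total_abba + total_baba)

    n = len(blocks)
    leave_one_out = []
    for a, b in blocks:
        # D recomputed with one block removed
        rest_abba = total_abba - a
        rest_baba = total_baba - b
        leave_one_out.append((rest_abba - rest_baba) / (rest_abba + rest_baba))

    mean_loo = sum(leave_one_out) / n
    # bias-corrected jackknife estimate and its variance (unweighted, an assumption)
    d_jack = n * d_full - (n - 1) * mean_loo
    variance = (n - 1) / n * sum((d - mean_loo) ** 2 for d in leave_one_out)
    return d_full, d_jack, variance


# Hypothetical per-block counts, for illustration only.
example_blocks = [(10.0, 12.0), (8.0, 9.0), (11.0, 13.0), (9.0, 9.0)]
d_full, d_jack, variance = block_jackknife_d(example_blocks)
print(d_full, d_jack, variance)
```

Blocks should be long enough to be approximately independent, which is why the block size is chosen to exceed the extent of LD.&lt;br /&gt;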
&lt;br /&gt;
=Tutorial of the ABBABABA (Multipop) test=&lt;br /&gt;
;Some preparation steps before using ANGSD&lt;br /&gt;
&lt;br /&gt;
Download the latest version of angsd in your working folder from the github repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, position them in the folder ./bams/ and create a file bam.filelist listing the pathnames of those datasets &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the four-population test. One can use a bam file as well.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #decompress; gunzip deletes the .gz file itself, so no separate rm is needed&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, generate a fasta file from one of our 10 bam files. We assume this genome is of very high quality, so that it can serve as a reference for estimating error rates in the other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;Prepare files for the estimation of type specific error rates&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes of our list, while population H2 consists of the genomes 3 to 7. We want to apply error correction to those genomes, because we know they have been subjected to contamination. We generate two files containing the pathnames of the genomes of H1 and H2 on which we want to apply error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that we will use later to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
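The grouping relies only on the order of the entries in bam.filelist: the first files belong to H1, the next ones to H2, and so on. The sketch below mirrors that convention in Python as a sanity check. The helper split_by_group and the file names are hypothetical and are not part of ANGSD.&lt;br /&gt;

```python
def split_by_group(paths, sizes):
    """Assign consecutive file-list entries to groups of the given sizes."""
    if sum(sizes.values()) != len(paths):
        raise ValueError("group sizes must add up to the number of listed files")
    groups = {}
    start = 0
    for name, size in sizes.items():
        # each group takes the next `size` entries of the list
        groups[name] = paths[start:start + size]
        start += size
    return groups


# Hypothetical file names standing in for the 10 example BAM files.
example_paths = ["bams/sample%02d.bam" % i for i in range(1, 11)]
groups = split_by_group(example_paths, {"H1": 2, "H2": 5, "H3": 3})
for name, members in groups.items():
    print(name, len(members), members[0])
```

With sizes 2, 5 and 3 this corresponds to lines 1-2, 3-7 and 8-10 of the file list, the same split produced by the sed commands above.&lt;br /&gt;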
&lt;br /&gt;
;ABBABABA test&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test on:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
After running ANGSD we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The output file is&lt;br /&gt;
;bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, in the order X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
Optionally, with -combFile 1, one can also produce&lt;br /&gt;
;bam.AllelePatterns.abbababa2counts (optional file)&lt;br /&gt;
It has the same layout: each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3), followed by the counts of all 256 combinations of alleles, but computed before the individuals are weighted. This file is not used as input for the ABBABABA test.&lt;br /&gt;
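The column naming can be made concrete with a short Python sketch that translates a column name into its four bases and enumerates the 256 names in the stated order. The function names are hypothetical. The classification into ABBA and BABA additionally assumes that the four digits refer to H1, H2, H3 and H4 in that order, which the text above does not state explicitly.&lt;br /&gt;

```python
from itertools import product

BASES = "ACGT"  # 0=A, 1=C, 2=G, 3=T, as stated in the text


def pattern_label(code):
    """Translate a column name such as 'X0123' into the bases 'ACGT'."""
    return "".join(BASES[int(digit)] for digit in code.lstrip("X"))


def all_pattern_codes():
    """List the 256 column names in the order X0000, X0001, ..., X3333."""
    return ["X" + "".join(str(d) for d in combo) for combo in product(range(4), repeat=4)]


def classify(code):
    """Label a pattern, ASSUMING the digits are ordered H1, H2, H3, H4."""
    h1, h2, h3, h4 = code.lstrip("X")
    if h1 == h4 and h2 == h3 and h1 != h2:
        return "ABBA"
    if h1 == h3 and h2 == h4 and h1 != h2:
        return "BABA"
    return "other"


codes = all_pattern_codes()
print(len(codes), codes[0], codes[1], codes[-1])
print(pattern_label("X0110"), classify("X0110"))
```

Under that assumption there are 12 ABBA and 12 BABA columns among the 256, one for each ordered pair of distinct bases.&lt;br /&gt;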
&lt;br /&gt;
We run the R script specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an error rate between 0 and 0.005, in steps of 0.001, is added to each population individually for the substitutions A--&amp;gt;T and C--&amp;gt;T. It is also possible to specify the names of H1,H2,H3 shown on the plots; here we use the generic names CEU1,CEU2,CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
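The layout of the addErr string, as described above, can be unpacked as follows. This Python sketch only illustrates the four semicolon-separated fields (rate range, source bases, target bases, population names). The function parse_add_err is hypothetical and does not reproduce the parser inside estAvgError.R.&lt;br /&gt;

```python
def parse_add_err(spec):
    """Split an addErr string into rates, substitutions and population names."""
    rate_part, from_part, to_part, name_part = spec.split(";")
    start, end, step = (float(x) for x in rate_part.split(","))
    # number of steps between start and end, rounded to absorb float noise
    n_steps = int(round((end - start) / step))
    rates = [round(start + i * step, 10) for i in range(n_steps + 1)]
    # every source base is paired with every target base
    substitutions = [(f, t) for f in from_part.split(",") for t in to_part.split(",")]
    names = name_part.split(",")
    return rates, substitutions, names


rates, substitutions, names = parse_add_err("0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3")
print(rates)
print(substitutions)
print(names)
```

For the string used in the command above this gives six error rates, the two substitutions A to T and C to T, and the three population names.&lt;br /&gt;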
The script prints the calculated D statistic along with its standard deviation, jackknife estimate, Z-score and p-value.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Those results are also contained in four distinct files&lt;br /&gt;
;1) result.Observed.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;2) result.ErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;3) result.TransRemErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;4) result.RemTrans.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the pattern counts need not be integers because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
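The relation between these columns can be checked by hand. The Python sketch below assumes the usual definitions D=(nABBA-nBABA)/(nABBA+nBABA), Z=D/sqrt(V) and a two-sided normal p-value. These formulas are an assumption about what the R script computes, although they agree with the numbers of result.Observed.txt shown above. The function names are hypothetical.&lt;br /&gt;

```python
import math


def d_statistic(n_abba, n_baba):
    """D from the (possibly non-integer) ABBA and BABA counts."""
    return (n_abba - n_baba) / (n_abba + n_baba)


def z_and_pvalue(d, variance):
    """Z score and two-sided normal p-value for a D estimate and its variance."""
    z = d / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Values taken from result.Observed.txt above.
d = d_statistic(246.033565, 279.248560)
z, p = z_and_pvalue(d, 0.004878)
print(round(d, 6), round(z, 3), round(p, 3))
```

With the rounded variance from the file, Z and the p-value agree with the listed values to about three decimal places.&lt;br /&gt;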
&lt;br /&gt;
When error correction is applied, the R script also creates the folder result.errorDataFolder containing:&lt;br /&gt;
-the file barPlotErrors.pdf showing a barplot of the error rates&lt;br /&gt;
-the file plotAddErr.A2T.pdf showing the effect of error correction on the substitution A--&amp;gt;T&lt;br /&gt;
-the file plotAddErr.C2T.pdf showing the effect of error correction on the substitution C--&amp;gt;T&lt;br /&gt;
-the file errorRates.txt listing the error rates, one line per population&lt;br /&gt;
-all the files related to the addition of error correction to H1,H2,H3, which are needed to produce plotAddErr.A2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrorJPGs.jpg]]&lt;br /&gt;
[[File:PlotAddErr.A2TJPG.jpg]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2826</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2826"/>
		<updated>2016-08-26T13:08:06Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the abbababa test (aka D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), multiple individuals are allowed in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.RemTrans.txt|D stat and Z scores with Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block. Choose a number that is higher than the LD in the populations. For human 5Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup show the same base (this only works when the outgroup consists of a single genome).&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual 0: all bases at each position are used for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
allows for a maximum depth in each site to avoid overflow of the ABBA BABA counts&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
decide how many individuals are in each group (the file list must contain the BAM files ordered from population 1 to 4). &lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts to obtain the allele events before weighting the samples (however, this file is not used for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial of the ABBABABA (Multipop) test=&lt;br /&gt;
;Some preparation steps before using ANGSD&lt;br /&gt;
&lt;br /&gt;
Download the latest version of angsd in your working folder from the github repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, position them in the folder ./bams/ and create a file bam.filelist listing the pathnames of those datasets &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the four-population test. One can use a bam file as well.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #decompress; gunzip deletes the .gz file itself, so no separate rm is needed&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, generate a fasta file from one of our 10 bam files. We assume this genome is of very high quality, so that it can serve as a reference for estimating error rates in the other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;Prepare files for the estimation of type specific error rates&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes of our list, while population H2 consists of the genomes 3 to 7. We want to apply error correction to those genomes, because we know they have been subjected to contamination. We generate two files containing the pathnames of the genomes of H1 and H2 on which we want to apply error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that we will use later to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;ABBABABA test&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test on:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
After running ANGSD we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The output file is&lt;br /&gt;
;bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, in the order X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
Optionally, with -combFile 1, one can also produce&lt;br /&gt;
;bam.AllelePatterns.abbababa2counts (optional file)&lt;br /&gt;
It has the same layout: each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3), followed by the counts of all 256 combinations of alleles, but computed before the individuals are weighted. This file is not used as input for the ABBABABA test.&lt;br /&gt;
&lt;br /&gt;
We run the R script specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an error rate between 0 and 0.005, in steps of 0.001, is added to each population individually for the substitutions A--&amp;gt;T and C--&amp;gt;T. It is also possible to specify the names of H1,H2,H3 shown on the plots; here we use the generic names CEU1,CEU2,CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script prints the calculated D statistic along with its standard deviation, jackknife estimate, Z-score and p-value.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Those results are also contained in four distinct files&lt;br /&gt;
;1) result.Observed.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;2) result.ErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;3) result.TransRemErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;4) result.RemTrans.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the pattern counts need not be integers because of how ANGSD treats multiple genomes per population.&lt;br /&gt;
&lt;br /&gt;
When error correction is applied, the R script also creates the folder result.errorDataFolder containing:&lt;br /&gt;
-the file barPlotErrors.pdf showing a barplot of the error rates&lt;br /&gt;
-the file plotAddErr.A2T.pdf showing the effect of error correction on the substitution A--&amp;gt;T&lt;br /&gt;
-the file plotAddErr.C2T.pdf showing the effect of error correction on the substitution C--&amp;gt;T&lt;br /&gt;
-the file errorRates.txt listing the error rates, one line per population&lt;br /&gt;
-all the files related to the addition of error correction to H1,H2,H3, which are needed to produce plotAddErr.A2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrorsJPG.jpg]]&lt;br /&gt;
[[File:PlotAddErr.A2TJPG.jpg]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=File:PlotAddErr.A2TJPG.jpg&amp;diff=2825</id>
		<title>File:PlotAddErr.A2TJPG.jpg</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=File:PlotAddErr.A2TJPG.jpg&amp;diff=2825"/>
		<updated>2016-08-26T13:06:50Z</updated>

		<summary type="html">&lt;p&gt;Samuele: Error correction A to T doAbbababa2&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Error correction A to T doAbbababa2&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=File:BarPlotErrorJPGs.jpg&amp;diff=2824</id>
		<title>File:BarPlotErrorJPGs.jpg</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=File:BarPlotErrorJPGs.jpg&amp;diff=2824"/>
		<updated>2016-08-26T13:05:21Z</updated>

		<summary type="html">&lt;p&gt;Samuele: bar plot error estimation abbababa2&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;bar plot error estimation abbababa2&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2823</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2823"/>
		<updated>2016-08-26T13:00:55Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or an incorrect tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each of the groups. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.RemTrans.txt|D stat and Z scores with Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
Take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default). 1: remove transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a value larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is typically used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup carry the same base. Use this option only when the outgroup consists of a single genome; it will not work otherwise.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
0: use all bases at each position for the ABBABABA test. 1: sample only one base at each position for every individual.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts.&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population H1 to H4).&lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: set the H4 (outgroup) allele to A at each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts file containing the counts of allele combinations before the individuals are weighted (this file is not used to estimate the D-statistic).&lt;br /&gt;
&lt;br /&gt;
To filter bases on quality scores, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial of the ABBABABA (Multipop) test=&lt;br /&gt;
;Some preparation steps before using ANGSD&lt;br /&gt;
&lt;br /&gt;
Download the latest version of ANGSD from the GitHub repository into your working folder (it must then be compiled to obtain the angsd executable)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download 10 example BAM files into the folder ./bams/ and create a file bam.filelist listing their paths&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee, which will be used as the outgroup for the four-population test. A BAM file can be used as the outgroup as well.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #this also removes the .gz file&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now generate a fasta file from one of our 10 BAM files. We assume this genome is of very high quality, so that it can serve as a reference for estimating error rates in the other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;Prepare files for the estimation of type specific error rates&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes in our list, while population H2 consists of genomes 3 to 7. We want to apply error correction to those genomes, because we know they have been subject to contamination. We generate two files containing the paths of the genomes of H1 and H2 to which we want to apply error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will be used later to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;ABBABABA test&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test on:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
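Since the group sizes must match the number and order of the entries in bam.filelist, a quick sanity check can be sketched as follows. The helper name check_group_sizes is hypothetical (it is not part of ANGSD), and the check assumes that H4 is supplied as a fasta file through -anc, so that only H1, H2 and H3 are listed in the file.&lt;br /&gt;

```shell
# Hypothetical helper, not part of ANGSD: compares the number of paths in a
# file list with the sum of the H1, H2 and H3 group sizes.
check_group_sizes() {
  filelist=$1
  expected=$(( $2 + $3 + $4 ))
  # Count the lines of the file list (one BAM path per line).
  actual=$(awk 'END { print NR }' "$filelist")
  if [ "$actual" -eq "$expected" ]; then
    echo "OK: $actual files for H1+H2+H3"
  else
    echo "MISMATCH: $actual files listed, group sizes sum to $expected"
    return 1
  fi
}

# Usage in this tutorial (2 + 5 + 3 = 10 files):
# check_group_sizes bam.filelist 2 5 3
```

The function returns a non-zero status on a mismatch, so it can be used to stop a pipeline before the ANGSD run is started.&lt;br /&gt;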
After running ANGSD, we call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The output file is&lt;br /&gt;
;bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (column 1), a start position (column 2) and an end position (column 3). The remaining columns are the counts of all 256 combinations of alleles, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
Optionally, with -combFile 1, one can also produce&lt;br /&gt;
;bam.AllelePatterns.abbababa2counts (optional file)&lt;br /&gt;
As above, each line represents a block, with a chromosome name (column 1), a start position (column 2) and an end position (column 3). The remaining columns are the counts of all 256 combinations of alleles, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T. These counts are taken before the individuals are weighted, and the file is not used as input for the ABBABABA test.&lt;br /&gt;
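The column labels can be translated back into bases with a small helper. This is only a sketch: decode_pattern is a hypothetical name, and the only thing it relies on is the correspondence 0=A, 1=C, 2=G, 3=T stated above.&lt;br /&gt;

```shell
# Hypothetical helper: translate a column label such as X0123 into bases,
# using the correspondence 0=A, 1=C, 2=G, 3=T given in the text.
decode_pattern() {
  # Strip the leading X, then map each digit to its base.
  printf '%s\n' "${1#X}" | tr '0123' 'ACGT'
}

decode_pattern X0000   # prints AAAA
decode_pattern X0001   # prints AAAC
decode_pattern X3333   # prints TTTT
```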
&lt;br /&gt;
We run the R script, specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an error rate between 0 and 0.005, in steps of 0.001, is added individually to each population for the substitutions A--&amp;gt;T and C--&amp;gt;T. It is also possible to specify the names of H1, H2 and H3 to be shown on the plots; here we use the generic names CEU1, CEU2 and CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
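The addErr argument packs four semicolon-separated fields. Judging from the description above, they appear to be the error-rate range (start, end, step), the original bases, the replacement bases and the population names. The sketch below only splits and labels such a string; describe_adderr is a hypothetical helper, not part of estAvgError.R, and the field labels are an interpretation of the text rather than documented names.&lt;br /&gt;

```shell
# Hypothetical helper, not part of estAvgError.R: split an addErr string on
# semicolons and label the four fields as described in the text.
describe_adderr() {
  old_ifs=$IFS
  IFS=';'
  # Unquoted on purpose so that the shell splits the string on semicolons.
  set -- $1
  IFS=$old_ifs
  echo "error rates (start,end,step): $1"
  echo "original bases: $2"
  echo "replacement bases: $3"
  echo "population names (H1,H2,H3): $4"
}

describe_adderr "0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3"
```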
The script prints the calculated D-statistic along with its standard deviation, jackknife estimate, Z-score and p-value.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
These results are also written to four separate files&lt;br /&gt;
;1) result.Observed.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;2) result.ErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;3) result.TransRemErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;4) result.RemTrans.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z-score, pvalue=p-value computed from the Z-score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the pattern counts need not be integers because of how ANGSD weights multiple genomes per population.&lt;br /&gt;
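The example values above are consistent with the usual definitions D = (nABBA - nBABA) / (nABBA + nBABA) and Z = JK-D / sqrt(V(JK-D)). The following sketch recomputes both from the numbers in result.Observed.txt. The two helper names are hypothetical, and this is a consistency check on the printed output, not the code used by estAvgError.R.&lt;br /&gt;

```shell
# Hypothetical helpers, not part of ANGSD or estAvgError.R.

# D = (nABBA - nBABA) / (nABBA + nBABA)
dstat_from_counts() {
  awk -v abba="$1" -v baba="$2" 'BEGIN { printf "%.6f\n", (abba - baba) / (abba + baba) }'
}

# Z = JK-D / sqrt(V(JK-D))
zscore_from_jackknife() {
  awk -v d="$1" -v v="$2" 'BEGIN { printf "%.3f\n", d / sqrt(v) }'
}

# Values taken from result.Observed.txt above.
dstat_from_counts 246.033565 279.248560       # prints -0.063233
zscore_from_jackknife -0.063233 0.004878      # prints -0.905
```

The square root of V(JK-D) is also the value reported as sd(Dstat) in the table of results (about 6.985e-02 for the Observed row).&lt;br /&gt;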
&lt;br /&gt;
When error correction is applied, the R script also creates the folder result.errorDataFolder containing:&lt;br /&gt;
-the file barPlotErrors.pdf, a bar plot of the error rates&lt;br /&gt;
-the file plotAddErr.A2T.pdf, showing the effect of error correction on the substitution A--&amp;gt;T&lt;br /&gt;
-the file plotAddErr.C2T.pdf, showing the effect of error correction on the substitution C--&amp;gt;T&lt;br /&gt;
-the file errorRates.txt, listing the estimated error rates for each population, one population per line&lt;br /&gt;
-all the files related to the addition of error correction to H1, H2 and H3, which are needed to produce plotAddErr.A2T.pdf and plotAddErr.C2T.pdf.&lt;br /&gt;
&lt;br /&gt;
[[File:barPlotErrors.pdf]]&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=File:PlotAddErr.C2T.pdf&amp;diff=2822</id>
		<title>File:PlotAddErr.C2T.pdf</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=File:PlotAddErr.C2T.pdf&amp;diff=2822"/>
		<updated>2016-08-26T12:59:36Z</updated>

		<summary type="html">&lt;p&gt;Samuele: (Samuele) Error correction C to T doAbbababa2&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;(Samuele) Error correction C to T doAbbababa2&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=File:PlotAddErr.A2T.pdf&amp;diff=2821</id>
		<title>File:PlotAddErr.A2T.pdf</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=File:PlotAddErr.A2T.pdf&amp;diff=2821"/>
		<updated>2016-08-26T12:59:20Z</updated>

		<summary type="html">&lt;p&gt;Samuele: (Samuele) Error correction A to T doAbbababa2&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;(Samuele) Error correction A to T doAbbababa2&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=File:BarPlotErrors.pdf&amp;diff=2820</id>
		<title>File:BarPlotErrors.pdf</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=File:BarPlotErrors.pdf&amp;diff=2820"/>
		<updated>2016-08-26T12:58:35Z</updated>

		<summary type="html">&lt;p&gt;Samuele: (Samuele)bar plot error estimation&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;(Samuele)bar plot error estimation&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2819</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2819"/>
		<updated>2016-08-26T12:52:17Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or an incorrect tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each of the groups. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.RemTrans.txt|D stat and Z scores with Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
Take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default). 1: remove transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a value larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is typically used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup carry the same base. Use this option only when the outgroup consists of a single genome; it will not work otherwise.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
0: use all bases at each position for the ABBABABA test. 1: sample only one base at each position for every individual.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts.&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population H1 to H4).&lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: set the H4 (outgroup) allele to A at each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts file containing the counts of allele combinations before the individuals are weighted (this file is not used to estimate the D-statistic).&lt;br /&gt;
&lt;br /&gt;
To filter bases on quality scores, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial of the ABBABABA (Multipop) test=&lt;br /&gt;
;Some preparation steps before using ANGSD&lt;br /&gt;
&lt;br /&gt;
Download the latest version of ANGSD from the GitHub repository into your working folder (it must then be compiled to obtain the angsd executable)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download 10 example BAM files into the folder ./bams/ and create a file bam.filelist listing their paths&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam;do samtools index $i;done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee, which will be used as the outgroup for the four-population test. A BAM file can be used as the outgroup as well.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #this also removes the .gz file&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now generate a fasta file from one of our 10 BAM files. We assume this genome is of very high quality, so that it can serve as a reference for estimating error rates in the other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;Prepare files for the estimation of type specific error rates&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes in our list, while population H2 consists of genomes 3 to 7. We want to apply error correction to those genomes, because we know they have been subject to contamination. We generate two files containing the paths of the genomes of H1 and H2 to which we want to apply error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will be used later to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;ABBABABA test&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test on:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
After running ANGSD, we call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The output file is&lt;br /&gt;
;bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (column 1), a start position (column 2) and an end position (column 3). The remaining columns are the counts of all 256 combinations of alleles, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
Optionally, with -combFile 1, one can also produce&lt;br /&gt;
;bam.AllelePatterns.abbababa2counts (optional file)&lt;br /&gt;
As above, each line represents a block, with a chromosome name (column 1), a start position (column 2) and an end position (column 3). The remaining columns are the counts of all 256 combinations of alleles, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T. These counts are taken before the individuals are weighted, and the file is not used as input for the ABBABABA test.&lt;br /&gt;
&lt;br /&gt;
We run the R script, specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an error rate between 0 and 0.005, in steps of 0.001, is added individually to each population for the substitutions A--&amp;gt;T and C--&amp;gt;T. It is also possible to specify the names of H1, H2 and H3 to be shown on the plots; here we use the generic names CEU1, CEU2 and CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script prints the calculated D-statistic along with its standard deviation, jackknife estimate, Z-score and p-value.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
These results are also written to four separate files&lt;br /&gt;
;1) result.Observed.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;2) result.ErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;3) result.TransRemErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;4) result.RemTrans.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z-score, pvalue=p-value computed from the Z-score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the pattern counts need not be integers because of how ANGSD weights multiple genomes per population.&lt;br /&gt;
&lt;br /&gt;
In case of error correction, the R script also creates the folder result.errorDataFolder containing:&lt;br /&gt;
-the file barPlotErrors.pdf, showing a barplot of the error rates
-the file plotAddErr.A2T.pdf, showing the effect of error correction on the substitution A--&amp;gt;T
-the file plotAddErr.C2T.pdf, showing the effect of error correction on the transition C--&amp;gt;T
-the file errorRates.txt, listing the error rates of each population, one population per line
-all the files related to the addition of error correction to H1,H2,H3, necessary to plot the files plotAddErr.A2T.pdf and plotAddErr.C2T.pdf.&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2818</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2818"/>
		<updated>2016-08-26T12:51:12Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each of the groups. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRem.txt|D stat and Z scores with Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a size larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is commonly used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads of the outgroup show the same base. Use this option only when the outgroup consists of a single genome; it will not work otherwise.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population H1 to H4). &lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts to obtain the allele events before weighting the samples (however, this file is not used for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
To filter bases based on quality scores, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
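As a rough illustration of what -rmTrans refers to, the following Python sketch classifies a substitution as a transition (an A/G or C/T exchange) or a transversion. It is only a minimal example: the function name is_transition is hypothetical and not part of ANGSD, and the sketch does not reproduce how ANGSD filters sites internally.&lt;br /&gt;

```python
# Transitions swap two purines (A, G) or two pyrimidines (C, T);
# every other substitution is a transversion.
PURINES = {'A', 'G'}
PYRIMIDINES = {'C', 'T'}


def is_transition(ref, alt):
    # Hypothetical helper: True when ref to alt is an A/G or C/T exchange.
    if ref == alt:
        return False
    pair = {ref, alt}
    return pair == PURINES or pair == PYRIMIDINES


for ref, alt in [('C', 'T'), ('G', 'A'), ('A', 'T'), ('C', 'G')]:
    kind = 'transition' if is_transition(ref, alt) else 'transversion'
    print(ref, 'to', alt, 'is a', kind)
```

Post-mortem damage in ancient DNA mostly shows up as C to T (and G to A) changes, which are transitions; this is why removing them matters for ancient samples.&lt;br /&gt;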
&lt;br /&gt;
=Tutorial of the ABBABABA (Multipop) test=&lt;br /&gt;
;Some preparation steps before using ANGSD&lt;br /&gt;
&lt;br /&gt;
Download the latest version of ANGSD into your working folder from the GitHub repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, position them in the folder ./bams/ and create a file bam.filelist listing the pathnames of those datasets &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam; do samtools index &amp;quot;$i&amp;quot;; done #index bam files&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz #remove zipped file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the four-population test. One can use a bam file as well.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz #gunzip also removes the .gz file&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now generate a fasta file from one of our 10 bam files. We assume that this genome is of very high quality, so that it can serve as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;Prepare files for the estimation of type specific error rates&lt;br /&gt;
&lt;br /&gt;
Assume population H1 consists of the first two genomes of our list, while population H2 consists of the genomes 3 to 7. We want to apply error correction to those genomes, because we know they have been subjected to contamination. We generate two files containing the pathnames of the genomes of H1 and H2 on which we want to apply error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later be used to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
;ABBABABA test&lt;br /&gt;
&lt;br /&gt;
Now, we want to run the four population test on:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
After running ANGSD, we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
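The mapping between -sizeH1, -sizeH2, -sizeH3 and the lines of bam.filelist can be pictured with the Python sketch below, which assumes (as stated in the options) that groups are taken as consecutive blocks of the file list. The function split_groups and the sample file names are hypothetical and only serve the illustration; the size check is our own addition, not a documented ANGSD behaviour.&lt;br /&gt;

```python
def split_groups(paths, sizes):
    # Assign consecutive entries of the file list to H1, H2, H3, ...
    if sum(sizes) != len(paths):
        raise ValueError('group sizes must add up to the number of files')
    groups = {}
    start = 0
    for index, size in enumerate(sizes, start=1):
        groups['H%d' % index] = paths[start:start + size]
        start += size
    return groups


# Hypothetical names standing in for the 10 lines of bam.filelist.
paths = ['bams/sample%02d.bam' % i for i in range(1, 11)]
groups = split_groups(paths, [2, 5, 3])
for name, members in groups.items():
    print(name, len(members), 'files, from', members[0], 'to', members[-1])
```

With sizes 2, 5 and 3, lines 1 to 2 go to H1, lines 3 to 7 to H2 and lines 8 to 10 to H3, while H4 comes from the fasta file given with -anc.&lt;br /&gt;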
The output file is&lt;br /&gt;
;bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 allele combinations, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
Optionally (with -combFile 1), one can also produce&lt;br /&gt;
;bam.AllelePatterns.abbababa2counts (optional file)&lt;br /&gt;
As above, each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 allele combinations, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. This file is not used as input for the ABBABABA test.&lt;br /&gt;
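The labelling of the 256 count columns can be reproduced with a few lines of Python. The sketch below only encodes the correspondence 0=A,1=C,2=G,3=T given above; the helper names pattern_names and decode are hypothetical, and no assumption is made about which position of a label belongs to which population.&lt;br /&gt;

```python
from itertools import product

BASES = 'ACGT'  # index 0=A, 1=C, 2=G, 3=T


def pattern_names():
    # The 256 labels X0000 ... X3333 in counting order.
    return ['X' + ''.join(digits) for digits in product('0123', repeat=4)]


def decode(label):
    # Translate a label such as X0123 into its four alleles (ACGT).
    return ''.join(BASES[int(digit)] for digit in label[1:])


names = pattern_names()
print(len(names), names[0], names[1], names[-1])
print(decode('X0000'), decode('X0001'), decode('X3333'))
```

Together with the three block columns, this gives 259 columns per line in total.&lt;br /&gt;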
&lt;br /&gt;
We run the R script, specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an error rate between 0 and 0.005, in steps of 0.001, is added individually to each population for the substitutions A--&amp;gt;T and C--&amp;gt;T. It is also possible to specify the names of H1,H2,H3 to be shown on the plots; here we use the generic names CEU1,CEU2,CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
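The addErr string used above packs four fields separated by semicolons. The Python sketch below shows one way to read it, assuming the fields are the start, stop and step of the added error rate, the source bases, the target bases and the population names. parse_add_err is a hypothetical helper and does not show how estAvgError.R parses its arguments.&lt;br /&gt;

```python
def parse_add_err(spec):
    # Split the string into its four semicolon-separated fields.
    rate_part, from_part, to_part, name_part = spec.split(';')
    start, stop, step = (float(x) for x in rate_part.split(','))
    # Number of grid points, both ends included.
    count = int(round((stop - start) / step)) + 1
    rates = [round(start + i * step, 10) for i in range(count)]
    return rates, from_part.split(','), to_part.split(','), name_part.split(',')


spec = '0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3'
rates, sources, targets, names = parse_add_err(spec)
print(rates)
print([s + '2' + t for s in sources for t in targets])
print(names)
```

Under these assumptions the example yields six error rates, the two substitution types A2T and C2T, and three population names.&lt;br /&gt;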
The script prints the calculated D-statistic along with its standard deviation, jackknife estimate, Z score and p-value.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
These results are also written to four separate files:&lt;br /&gt;
;1) result.Observed.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;2) result.ErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;3) result.TransRemErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;4) result.RemTrans.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the pattern counts need not be integers, because of how ANGSD weights multiple genomes per population.&lt;br /&gt;
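The relation between these columns can be checked by hand. The Python sketch below recomputes D from nABBA and nBABA, and Z and the p-value from mean(D) and V(JK-D), using the numbers of result.Observed.txt shown above. It assumes a two-sided p-value under a standard normal distribution, which agrees with the example output; the function names are hypothetical and the sketch does not perform the block jackknife itself.&lt;br /&gt;

```python
import math


def d_statistic(n_abba, n_baba):
    # D = (nABBA - nBABA) / (nABBA + nBABA)
    return (n_abba - n_baba) / (n_abba + n_baba)


def z_and_pvalue(d, variance):
    # Z score of D and its two-sided p-value under a standard normal.
    z = d / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Values copied from the result.Observed.txt example.
d = d_statistic(246.033565, 279.248560)
z, p = z_and_pvalue(d, 0.004878)
print(round(d, 6), round(z, 3), round(p, 3))
```

The square root of V(JK-D) is also the quantity reported as sd(Dstat) in the table printed by the script.&lt;br /&gt;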
&lt;br /&gt;
In case of error correction, the R script also creates the folder result.errorDataFolder containing:&lt;br /&gt;
-the file barPlotErrors.pdf, showing a barplot of the error rates
-the file plotAddErr.A2T.pdf, showing the effect of error correction on the substitution A--&amp;gt;T
-the file plotAddErr.C2T.pdf, showing the effect of error correction on the transition C--&amp;gt;T
-the file errorRates.txt, listing the error rates of each population, one population per line
-all the files related to the addition of error correction to H1,H2,H3, necessary to plot the files plotAddErr.A2T.pdf and plotAddErr.C2T.pdf.&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2817</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2817"/>
		<updated>2016-08-26T12:46:15Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBA-BABA test (also known as the D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each of the groups. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRem.txt|D stat and Z scores with Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a size larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is commonly used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads of the outgroup show the same base. Use this option only when the outgroup consists of a single genome; it will not work otherwise.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population H1 to H4). &lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts to obtain the allele events before weighting the samples (however, this file is not used for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
To filter bases based on quality scores, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Tutorial of the ABBABABA Multipop test=&lt;br /&gt;
&lt;br /&gt;
;Some preparation steps before using ANGSD&lt;br /&gt;
Download the latest version of ANGSD into your working folder from the GitHub repository&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://github.com/ANGSD/angsd.git&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Create symbolic links to angsd and the necessary R script&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ln -s ./angsd/angsd ANGSD&lt;br /&gt;
ln -s ./angsd/R/estAvgError.R RSCRIPT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Get 10 example .bam datasets, place them in the folder ./bams/ and create a file bam.filelist containing the pathnames of those datasets &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/bams.tar.gz&lt;br /&gt;
tar xf bams.tar.gz&lt;br /&gt;
for i in bams/*.bam; do samtools index &amp;quot;$i&amp;quot;; done&lt;br /&gt;
ls bams/*.bam &amp;gt; bam.filelist&lt;br /&gt;
rm bams.tar.gz&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the four-population test. One can use a bam file as well.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz&lt;br /&gt;
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz&lt;br /&gt;
gunzip chimpHg19.fa.gz&lt;br /&gt;
samtools faidx chimpHg19.fa #indexing the fasta file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now generate a fasta file from one of our 10 bam files. We assume that this genome is of very high quality, so that it can serve as a reference for estimating error rates in our other datasets.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -out perfectSampleCEU&lt;br /&gt;
gunzip perfectSampleCEU.fa.gz&lt;br /&gt;
samtools faidx perfectSampleCEU.fa&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;Prepare files for the estimation of type specific error rates&lt;br /&gt;
Assume population H1 consists of the first two genomes of our list, while population H2 consists of the genomes 3 to 7. We want to apply error correction to those genomes, because we know they have been subjected to contamination. We generate two files containing the pathnames of the genomes of H1 and H2 on which we want to apply error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
sed -n 1,2p bam.filelist &amp;gt; bamH1.filelist&lt;br /&gt;
sed -n 3,7p bam.filelist &amp;gt; bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
We then use &amp;quot;doAncError&amp;quot; to generate the intermediate files that will later be used to estimate the error rates for the two groups H1 and H2. &amp;quot;doAncError&amp;quot; applies the so-called &amp;quot;perfect individual assumption&amp;quot;, under which error rates are estimated using a high-quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in the preparation phase.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH1 -bam bamH1.filelist&lt;br /&gt;
./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out bamH2 -bam bamH2.filelist&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now, we want to run the four population test on:&lt;br /&gt;
H1: first 2 bam files&lt;br /&gt;
H2: bam files from 3 to 7&lt;br /&gt;
H3: bam files from 8 to 10&lt;br /&gt;
H4: chimpHg19.fa file&lt;br /&gt;
After running ANGSD, we will call the R script, which applies error correction to the ABBA and BABA allele combinations and produces the final output files.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./ANGSD -doAbbababa2 1 -bam bam.filelist -doCounts 1 -out bam.AllelePatterns -sizeH1 2 -sizeH2 5 -sizeH3 3 -anc chimpHg19.fa -minQ 20 -minMapQ 30&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The output file is&lt;br /&gt;
;bam.AllelePatterns.abbababa2 (used for the 4-population test)&lt;br /&gt;
Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 allele combinations, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R.&lt;br /&gt;
Optionally (with -combFile 1), one can also produce&lt;br /&gt;
;bam.AllelePatterns.abbababa2counts (optional file)&lt;br /&gt;
As above, each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 allele combinations, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. This file is not used as input for the ABBABABA test.&lt;br /&gt;
&lt;br /&gt;
We run the R script, specifying the intermediate error files for populations H1 and H2. We also want to study the effect of error correction when an error rate between 0 and 0.005, in steps of 0.001, is added individually to each population for the substitutions A--&amp;gt;T and C--&amp;gt;T. It is also possible to specify the names of H1,H2,H3 to be shown on the plots; here we use the generic names CEU1,CEU2,CEU3. When at least one error file is given as input, the script applies error correction.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Rscript RSCRIPT angsdFile=&amp;quot;bam.AllelePatterns&amp;quot; out=&amp;quot;result&amp;quot; file1=&amp;quot;bamH1.ancError&amp;quot; file2=&amp;quot;bamH2.ancError&amp;quot;  addErr=&amp;quot;0,0.005,0.001;A,C;T;CEU1,CEU2,CEU3&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The script prints the calculated D-statistic along with its standard deviation, jackknife estimate, Z score and p-value.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
--- Table of Results ---&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
  Mode		|Dstat		|sd(Dstat)	|Djack		|Zscore	|Pvalue&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Observed	|-6.323e-02	|6.985e-02	|-6.323e-02	|-0.905	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|-6.430e-02	|7.226e-02	|-6.431e-02	|-0.890	|3.7e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
No Trans	|-1.141e-02	|6.311e-02	|-1.141e-02	|-0.181	|8.6e-01&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
Err Corr	|		|		|		|	|	&lt;br /&gt;
   and		|-1.494e-02	|6.615e-02	|-1.496e-02	|-0.226	|8.2e-01&lt;br /&gt;
No Trans	|		|		|		|	|	&lt;br /&gt;
---------------------------------------------------------------------------------&lt;br /&gt;
plots with effect of removed errors and D statistic files for all the removed errors are in folder result.errorDataFolder&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
These results are also written to four separate files:&lt;br /&gt;
;1) result.Observed.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.063233       -0.063233       0.004878        -0.905320       0.365296        246.033565      279.248560      292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;2) result.ErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.064295       -0.064309       0.005221        -0.889833       0.373555        238.242964      270.983960      293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;3) result.TransRemErrorCorr.txt&lt;br /&gt;
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.014939       -0.014959       0.004376        -0.225829       0.821335        81.636843       84.112983       293.326044&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
;4) result.RemTrans.txt&lt;br /&gt;
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D) JK-D    V(JK-D) Z       pvalue  nABBA   nBABA   nBBAA&lt;br /&gt;
-0.011406       -0.011406       0.003983        -0.180730       0.856580        85.730478       87.708709       292.834879&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=p-value from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, nBBAA=all the other observed patterns. Note that the pattern counts need not be integers, because of how ANGSD weights multiple genomes per population.&lt;br /&gt;
&lt;br /&gt;
In case of error correction, the R script also creates the folder result.errorDataFolder containing:&lt;br /&gt;
-the file barPlotErrors.pdf, showing a barplot of the error rates
-the file plotAddErr.A2T.pdf, showing the effect of error correction on the substitution A--&amp;gt;T
-the file plotAddErr.C2T.pdf, showing the effect of error correction on the transition C--&amp;gt;T
-the file errorRates.txt, listing the error rates of each population, one population per line
-all the files related to the addition of error correction to H1,H2,H3, necessary to plot the files plotAddErr.A2T.pdf and plotAddErr.C2T.pdf.&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2816</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2816"/>
		<updated>2016-08-26T11:48:25Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBABABA test (aka D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each of the groups. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; Some of the options only work in the development version available from GitHub.&lt;br /&gt;
; If you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRem.txt|D stat and Z scores with Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block. Choose a number that is higher than the LD in the populations. For human 5Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all outgroup reads carry the same base. Use this option only when the outgroup consists of a single genome; it will not work otherwise.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
decide how many individuals are in each group (the file list must contain the BAM files ordered from population 1 to 4). &lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts to obtain the allele events before weighting the samples (however, this file is not used for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
;*.abbababa2&lt;br /&gt;
Output: Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R. Type &amp;quot;Rscript R/estAvgError.R&amp;quot; to see additional options.&lt;br /&gt;
&lt;br /&gt;
;*.abbababa2counts (optional file)&lt;br /&gt;
Output: As above, each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. This file is not used as input for the ABBABABA test.&lt;br /&gt;
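The labelling of the 256 count columns can be reproduced with a few lines of Python. This is only a sketch: it assumes that the columns between X0001 and X3333 follow plain lexicographic order, which the description above suggests but does not state explicitly, and the helper names are hypothetical.&lt;br /&gt;

```python
from itertools import product

BASES = "ACGT"  # 0=A, 1=C, 2=G, 3=T, as stated above

def combination_labels():
    """Return the 256 labels X0000 ... X3333 in lexicographic order (assumed)."""
    return ["X" + "".join(digits) for digits in product("0123", repeat=4)]

def decode(label):
    """Translate a label such as X0123 into its allele string."""
    return "".join(BASES[int(d)] for d in label[1:])

labels = combination_labels()
print(len(labels), labels[0], labels[1], labels[-1])
print(decode("X0001"))
```

Such a list can serve as column names when the counts are loaded into a table for inspection.&lt;br /&gt;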
 &lt;br /&gt;
==Example==&lt;br /&gt;
Run the ABBABABA (multipop), without sampling a single allele for each site. &lt;br /&gt;
We use 2 individuals for each group, thus the smallBam.filelist file contains the 8 file names ordered from population H1 to H4.&lt;br /&gt;
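Because the file list is read in the order H1 to H4, the group sizes have to account for every listed file when all four groups come from BAM files. A minimal Python sketch of that bookkeeping follows; the helper and the file names are hypothetical and not part of ANGSD.&lt;br /&gt;

```python
def split_into_groups(bam_files, sizes):
    """Split an ordered list of BAM paths into H1..H4 using the -sizeH1..-sizeH4 values."""
    if sum(sizes) != len(bam_files):
        raise ValueError(
            "group sizes sum to %d but the list has %d files"
            % (sum(sizes), len(bam_files))
        )
    groups, start = {}, 0
    for i, size in enumerate(sizes, start=1):
        groups["H%d" % i] = bam_files[start:start + size]
        start += size
    return groups

# hypothetical names standing in for the 8 lines of smallBam.filelist
files = ["ind%d.bam" % i for i in range(1, 9)]
groups = split_into_groups(files, [2, 2, 2, 2])
print(groups["H3"])
```

Running such a check before launching a long analysis catches a size that does not match the file list.&lt;br /&gt;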
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# select 8 individuals (suppose they are already in the desired order H1,..,H4)&lt;br /&gt;
head -n8 bam.filelist &amp;gt; smallBam.filelist&lt;br /&gt;
&lt;br /&gt;
#run angsd&lt;br /&gt;
./angsd -out out -doAbbababa2 1 -bam smallBam.filelist -doCounts 1 -sample 0 -enhance 0 -sizeH1 2 -sizeH2 2 -sizeH3 2 -sizeH4 2 -combFile 0&lt;br /&gt;
&lt;br /&gt;
#estimate Z score&lt;br /&gt;
Rscript R/estAvgError.R file=out outfile=outDstat&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This results in multiple output files with all the results.&lt;br /&gt;
&lt;br /&gt;
1) a file outDstatStd.txt, which contains the results of the ABBABABA test with neither error correction nor ancient transition removal&lt;br /&gt;
&lt;br /&gt;
2) a file outDstatNoErrorNoTrans.txt, that contains the results of the ABBABABA test without applying error correction but removing ancient transitions&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
1)outDstatStd.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2)outDstatNoErrorNoTrans.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''mean(D)''' The average of test statistics: (nABBA-nBABA)/(nABBA+nBABA), each one calculated for a block of data.&lt;br /&gt;
&lt;br /&gt;
'''JK-D''' The estimated test statistic: (nABBA-nBABA)/(nABBA+nBABA) after being bias corrected. This value should be similar to the one in column 1. A negative value means that H1 is closer to H3 than H2 is. A positive value means that H2 is closer to H3 than H1 is.&lt;br /&gt;
&lt;br /&gt;
'''V(JK-D)''' estimated m-delete blocked Jackknife variance of the estimator of column 2. It's used to compute the Z-value.&lt;br /&gt;
&lt;br /&gt;
'''Z''' Z value that can be used to determine the significance of the test. As in Reich et al. an absolute value of the Z score above 3 is often used as a critical value.&lt;br /&gt;
&lt;br /&gt;
'''pvalue''' p-value corresponding to Z for a two-sided standard normal test. The critical value |Z|=3 corresponds to a p-value of about 0.0027.&lt;br /&gt;
&lt;br /&gt;
'''nABBA''' the total counts of ABBA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBABA''' the total counts of BABA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBBAA''' the total counts of BBAA patterns&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2815</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2815"/>
		<updated>2016-08-26T11:45:55Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBABABA test (aka D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each of the groups. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; Some of the options only work in the development version available from GitHub.&lt;br /&gt;
; If you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|Observed D stat and Z scores{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.ErrorCorr.txt|D stat and Z scores Error Corrected{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRemErrorCorr.txt|D stat and Z scores Error Corrected with ancient Transition Removal{bg:blue}]&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.TransRem.txt|D stat and Z scores with Transition Removal{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block. Choose a number that is higher than the LD in the populations. For human 5Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all outgroup reads carry the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
decide how many individuals are in each group (the file list must contain the BAM files ordered from population 1 to 4). &lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts to obtain the allele events before weighting the samples (however, this file is not used for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
;*.abbababa2&lt;br /&gt;
Output: Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R. Type &amp;quot;Rscript R/estAvgError.R&amp;quot; to see additional options.&lt;br /&gt;
&lt;br /&gt;
;*.abbababa2counts (optional file)&lt;br /&gt;
Output: As above, each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. This file is not used as input for the ABBABABA test.&lt;br /&gt;
 &lt;br /&gt;
==Example==&lt;br /&gt;
Run the ABBABABA (multipop), without sampling a single allele for each site. &lt;br /&gt;
We use 2 individuals for each group, thus the smallBam.filelist file contains the 8 file names ordered from population H1 to H4.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# select 8 individuals (suppose they are already in the desired order H1,..,H4)&lt;br /&gt;
head -n8 bam.filelist &amp;gt; smallBam.filelist&lt;br /&gt;
&lt;br /&gt;
#run angsd&lt;br /&gt;
./angsd -out out -doAbbababa2 1 -bam smallBam.filelist -doCounts 1 -sample 0 -enhance 0 -sizeH1 2 -sizeH2 2 -sizeH3 2 -sizeH4 2 -combFile 0&lt;br /&gt;
&lt;br /&gt;
#estimate Z score&lt;br /&gt;
Rscript R/estAvgError.R file=out outfile=outDstat&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This results in multiple output files with all the results.&lt;br /&gt;
&lt;br /&gt;
1) a file outDstatStd.txt, which contains the results of the ABBABABA test with neither error correction nor ancient transition removal&lt;br /&gt;
&lt;br /&gt;
2) a file outDstatNoErrorNoTrans.txt, that contains the results of the ABBABABA test without applying error correction but removing ancient transitions&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
1)outDstatStd.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2)outDstatNoErrorNoTrans.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''mean(D)''' The average of test statistics: (nABBA-nBABA)/(nABBA+nBABA), each one calculated for a block of data.&lt;br /&gt;
&lt;br /&gt;
'''JK-D''' The estimated test statistic: (nABBA-nBABA)/(nABBA+nBABA) after being bias corrected. This value should be similar to the one in column 1. A negative value means that H1 is closer to H3 than H2 is. A positive value means that H2 is closer to H3 than H1 is.&lt;br /&gt;
&lt;br /&gt;
'''V(JK-D)''' estimated m-delete blocked Jackknife variance of the estimator of column 2. It's used to compute the Z-value.&lt;br /&gt;
&lt;br /&gt;
'''Z''' Z value that can be used to determine the significance of the test. As in Reich et al. an absolute value of the Z score above 3 is often used as a critical value.&lt;br /&gt;
&lt;br /&gt;
'''pvalue''' p-value corresponding to Z for a two-sided standard normal test. The critical value |Z|=3 corresponds to a p-value of about 0.0027.&lt;br /&gt;
&lt;br /&gt;
'''nABBA''' the total counts of ABBA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBABA''' the total counts of BABA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBBAA''' the total counts of BBAA patterns&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2814</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2814"/>
		<updated>2016-08-26T11:42:17Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBABABA test (aka D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each of the groups. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; Some of the options only work in the development version available from GitHub.&lt;br /&gt;
; If you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;-&amp;gt;estAvgError.R[*.Observed.txt|D stat and Z scores + Error est and Plots{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block. Choose a number that is higher than the LD in the populations. For human 5Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all outgroup reads carry the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
decide how many individuals are in each group (the file list must contain the BAM files ordered from population 1 to 4). &lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts to obtain the allele events before weighting the samples (however, this file is not used for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
;*.abbababa2&lt;br /&gt;
Output: Each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R. Type &amp;quot;Rscript R/estAvgError.R&amp;quot; to see additional options.&lt;br /&gt;
&lt;br /&gt;
;*.abbababa2counts (optional file)&lt;br /&gt;
Output: As above, each line represents a block with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 combinations of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. This file is not used as input for the ABBABABA test.&lt;br /&gt;
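To see how the 256 allele combinations relate to the ABBA, BABA and BBAA patterns, the labels can be classified as in the Python sketch below. It assumes that the four digits of a label are given in the order H1, H2, H3, H4, which is not stated explicitly above, and the function name is hypothetical.&lt;br /&gt;

```python
from itertools import product

def classify(label):
    """Classify a label such as X0110 as ABBA, BABA, BBAA or other.

    Assumes the digits are ordered H1, H2, H3, H4 (H4 being the outgroup).
    """
    h1, h2, h3, h4 = label[1:]
    if h1 == h4 and h2 == h3 and h1 != h2:
        return "ABBA"
    if h1 == h3 and h2 == h4 and h1 != h2:
        return "BABA"
    if h1 == h2 and h3 == h4 and h1 != h3:
        return "BBAA"
    return "other"

all_labels = ["X" + "".join(d) for d in product("0123", repeat=4)]
abba_labels = [lab for lab in all_labels if classify(lab) == "ABBA"]
print(len(abba_labels), abba_labels[:3])
```

Summing the columns of each class over a block gives per-block pattern totals under that assumption.&lt;br /&gt;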
 &lt;br /&gt;
==Example==&lt;br /&gt;
Run the ABBABABA (multipop), without sampling a single allele for each site. &lt;br /&gt;
We use 2 individuals for each group, thus the smallBam.filelist file contains the 8 file names ordered from population H1 to H4.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# select 8 individuals (suppose they are already in the desired order H1,..,H4)&lt;br /&gt;
head -n8 bam.filelist &amp;gt; smallBam.filelist&lt;br /&gt;
&lt;br /&gt;
#run angsd&lt;br /&gt;
./angsd -out out -doAbbababa2 1 -bam smallBam.filelist -doCounts 1 -sample 0 -enhance 0 -sizeH1 2 -sizeH2 2 -sizeH3 2 -sizeH4 2 -combFile 0&lt;br /&gt;
&lt;br /&gt;
#estimate Z score&lt;br /&gt;
Rscript R/estAvgError.R file=out outfile=outDstat&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This results in multiple output files with all the results.&lt;br /&gt;
&lt;br /&gt;
1) a file outDstatStd.txt, which contains the results of the ABBABABA test with neither error correction nor ancient transition removal&lt;br /&gt;
&lt;br /&gt;
2) a file outDstatNoErrorNoTrans.txt, that contains the results of the ABBABABA test without applying error correction but removing ancient transitions&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
1)outDstatStd.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2)outDstatNoErrorNoTrans.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''mean(D)''' The average of test statistics: (nABBA-nBABA)/(nABBA+nBABA), each one calculated for a block of data.&lt;br /&gt;
&lt;br /&gt;
'''JK-D''' The estimated test statistic: (nABBA-nBABA)/(nABBA+nBABA) after being bias corrected. This value should be similar to the one in column 1. A negative value means that H1 is closer to H3 than H2 is. A positive value means that H2 is closer to H3 than H1 is.&lt;br /&gt;
&lt;br /&gt;
'''V(JK-D)''' estimated m-delete blocked Jackknife variance of the estimator of column 2. It's used to compute the Z-value.&lt;br /&gt;
&lt;br /&gt;
'''Z''' Z value that can be used to determine the significance of the test. As in Reich et al. an absolute value of the Z score above 3 is often used as a critical value.&lt;br /&gt;
&lt;br /&gt;
'''pvalue''' p-value corresponding to Z for a two-sided standard normal test. The critical value |Z|=3 corresponds to a p-value of about 0.0027.&lt;br /&gt;
&lt;br /&gt;
'''nABBA''' the total counts of ABBA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBABA''' the total counts of BABA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBBAA''' the total counts of BBAA patterns&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2813</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2813"/>
		<updated>2016-08-26T11:41:16Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBABABA test (aka D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each of the groups. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; Some of the options only work in the development version available from GitHub.&lt;br /&gt;
; If you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.bam and/or *.cram| NGS genome datasets{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[*.bam and/or *.cram]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[*.bam and/or *.cram]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;-&amp;gt;estAvgError.R[*.Observed.txt|D stat and Z scores + Error est and Plots{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA)&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block. Choose a number that is higher than the LD in the populations. For human 5Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all outgroup reads carry the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
decide how many individuals are in each group (the file list must contain the BAM files ordered from population 1 to 4). &lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: H4 allele is A in each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts to obtain the allele events before weighting the samples (however, this file is not used for the estimation of the D-statistic).&lt;br /&gt;
&lt;br /&gt;
In order to do fancy filtering of bases based on quality scores see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
;*.abbababa2&lt;br /&gt;
Output: Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 allele combinations, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R. Type &amp;quot;Rscript R/estAvgError.R&amp;quot; to see additional options.&lt;br /&gt;
&lt;br /&gt;
;*.abbababa2counts (optional file)&lt;br /&gt;
Output: As above, each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 allele combinations, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T. This file is not used as input for the ABBABABA test.&lt;br /&gt;
 &lt;br /&gt;
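The column naming described above can be illustrated with a short Python sketch. It relies only on the stated encoding (0=A, 1=C, 2=G, 3=T) and on the labels running from X0000 to X3333; the function names are hypothetical and are not part of ANGSD.&lt;br /&gt;

```python
from itertools import product

BASES = "ACGT"  # index 0=A, 1=C, 2=G, 3=T, as stated in the text


def column_names():
    """Return the 256 column labels X0000 ... X3333 in counting order."""
    return ["X" + "".join(digits) for digits in product("0123", repeat=4)]


def decode(label):
    """Translate a label such as 'X0001' into its allele pattern 'AAAC'."""
    return "".join(BASES[int(d)] for d in label[1:])


names = column_names()
print(len(names), names[0], names[1], names[-1])
print(decode("X0001"), decode("X3333"))
```

Such a helper may be handy when inspecting individual columns of a *.abbababa2 file by hand.&lt;br /&gt;
&lt;br /&gt;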
==Example==&lt;br /&gt;
Run the ABBABABA (multipop), without sampling a single allele for each site. &lt;br /&gt;
We use 2 individuals for each group, thus the smallBam.filelist file contains the 8 file names ordered from population H1 to H4.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# select 8 individuals (suppose they are already in the desired order H1,..,H4)&lt;br /&gt;
head -n8 bam.filelist &amp;gt; smallBam.filelist&lt;br /&gt;
&lt;br /&gt;
#run angsd&lt;br /&gt;
./angsd -out out -doAbbababa2 1 -bam smallBam.filelist -doCounts 1 -sample 0 -enhance 0 -sizeH1 2 -sizeH2 2 -sizeH3 2 -sizeH4 2 -combFile 0&lt;br /&gt;
&lt;br /&gt;
#estimate Z score&lt;br /&gt;
Rscript R/estAvgError.R file=out outfile=outDstat&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This produces multiple output files:&lt;br /&gt;
&lt;br /&gt;
1) a file outDstatStd.txt, which contains the results of the ABBABABA test with neither error correction nor ancient transition removal applied&lt;br /&gt;
&lt;br /&gt;
2) a file outDstatNoErrorNoTrans.txt, which contains the results of the ABBABABA test without error correction but with ancient transitions removed&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
1)outDstatStd.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2)outDstatNoErrorNoTrans.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
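&lt;br /&gt;
The Z and pvalue columns shown above are linked through the standard normal distribution. The Python sketch below assumes that Z is the jackknife estimate divided by the square root of its variance (the usual definition; the text here only says that the variance is used to compute Z) and that the p-value is two-sided. The function names are hypothetical and are not part of ANGSD or estAvgError.R.&lt;br /&gt;

```python
import math


def z_score(jk_d, jk_var):
    """Z = estimate divided by its standard error (square root of the variance)."""
    return jk_d / math.sqrt(jk_var)


def two_sided_pvalue(z):
    """Two-sided p-value of z under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Values copied from the example table (rounded to six decimals there).
print(round(z_score(-0.042945, 0.000643), 2))
print(round(two_sided_pvalue(-1.690314), 4))
print(round(two_sided_pvalue(3.0), 4))
```

With the tabulated Z of -1.690314 this gives a p-value of about 0.091, in line with the pvalue column, while |Z|=3 gives about 0.0027.&lt;br /&gt;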
&lt;br /&gt;
&lt;br /&gt;
'''mean(D)''' The average of the test statistic (nABBA-nBABA)/(nABBA+nBABA), calculated separately for each block of data.&lt;br /&gt;
&lt;br /&gt;
'''JK-D''' The estimated test statistic (nABBA-nBABA)/(nABBA+nBABA) after bias correction. This value should be similar to the one in column 1. A negative value means that H1 is closer to H3 than H2 is. A positive value means that H2 is closer to H3 than H1 is.&lt;br /&gt;
&lt;br /&gt;
'''V(JK-D)''' Estimated m-delete block jackknife variance of the estimator in column 2. It is used to compute the Z value.&lt;br /&gt;
&lt;br /&gt;
'''Z''' Z value used to determine the significance of the test. As in Reich et al., an absolute Z score above 3 is often used as the critical value.&lt;br /&gt;
&lt;br /&gt;
'''pvalue''' p-value corresponding to Z for a two-sided test against the standard normal distribution. The critical value |Z|=3 corresponds to a p-value of about 0.0027.&lt;br /&gt;
&lt;br /&gt;
'''nABBA''' the total counts of ABBA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBABA''' the total counts of BABA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBBAA''' the total counts of BBAA patterns&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2812</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2812"/>
		<updated>2016-08-26T11:36:48Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBABABA test (also known as the D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.BAM| NGS genomes{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;-&amp;gt;estAvgError.R[*.Observed.txt|D stat and Z scores + Error est and Plots{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a value larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all outgroup reads show the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts.&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population H1 to H4). &lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: set the H4 outgroup allele to A at every site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts file containing the allele combination counts before the individuals are weighted (this file is not used to estimate the D-statistic).&lt;br /&gt;
&lt;br /&gt;
To filter bases by quality score, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
;*.abbababa2&lt;br /&gt;
Output: Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 allele combinations, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R. Type &amp;quot;Rscript R/estAvgError.R&amp;quot; to see additional options.&lt;br /&gt;
&lt;br /&gt;
;*.abbababa2counts (optional file)&lt;br /&gt;
Output: As above, each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 allele combinations, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T. This file is not used as input for the ABBABABA test.&lt;br /&gt;
 &lt;br /&gt;
==Example==&lt;br /&gt;
Run the ABBABABA (multipop), without sampling a single allele for each site. &lt;br /&gt;
We use 2 individuals for each group, thus the smallBam.filelist file contains the 8 file names ordered from population H1 to H4.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# select 8 individuals (suppose they are already in the desired order H1,..,H4)&lt;br /&gt;
head -n8 bam.filelist &amp;gt; smallBam.filelist&lt;br /&gt;
&lt;br /&gt;
#run angsd&lt;br /&gt;
./angsd -out out -doAbbababa2 1 -bam smallBam.filelist -doCounts 1 -sample 0 -enhance 0 -sizeH1 2 -sizeH2 2 -sizeH3 2 -sizeH4 2 -combFile 0&lt;br /&gt;
&lt;br /&gt;
#estimate Z score&lt;br /&gt;
Rscript R/estAvgError.R file=out outfile=outDstat&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This produces multiple output files:&lt;br /&gt;
&lt;br /&gt;
1) a file outDstatStd.txt, which contains the results of the ABBABABA test with neither error correction nor ancient transition removal applied&lt;br /&gt;
&lt;br /&gt;
2) a file outDstatNoErrorNoTrans.txt, which contains the results of the ABBABABA test without error correction but with ancient transitions removed&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
1)outDstatStd.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2)outDstatNoErrorNoTrans.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''mean(D)''' The average of the test statistic (nABBA-nBABA)/(nABBA+nBABA), calculated separately for each block of data.&lt;br /&gt;
&lt;br /&gt;
'''JK-D''' The estimated test statistic (nABBA-nBABA)/(nABBA+nBABA) after bias correction. This value should be similar to the one in column 1. A negative value means that H1 is closer to H3 than H2 is. A positive value means that H2 is closer to H3 than H1 is.&lt;br /&gt;
&lt;br /&gt;
'''V(JK-D)''' Estimated m-delete block jackknife variance of the estimator in column 2. It is used to compute the Z value.&lt;br /&gt;
&lt;br /&gt;
'''Z''' Z value used to determine the significance of the test. As in Reich et al., an absolute Z score above 3 is often used as the critical value.&lt;br /&gt;
&lt;br /&gt;
'''pvalue''' p-value corresponding to Z for a two-sided test against the standard normal distribution. The critical value |Z|=3 corresponds to a p-value of about 0.0027.&lt;br /&gt;
&lt;br /&gt;
'''nABBA''' the total counts of ABBA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBABA''' the total counts of BABA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBBAA''' the total counts of BBAA patterns&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2811</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2811"/>
		<updated>2016-08-26T11:36:32Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBABABA test (also known as the D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.BAM| NGS genomes{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]--&amp;gt;estAvgError.R[*.Observed.txt|D stat and Z scores + Error est and Plots{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a value larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all outgroup reads show the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts.&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population H1 to H4). &lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: set the H4 outgroup allele to A at every site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts file containing the allele combination counts before the individuals are weighted (this file is not used to estimate the D-statistic).&lt;br /&gt;
&lt;br /&gt;
To filter bases by quality score, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
;*.abbababa2&lt;br /&gt;
Output: Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 allele combinations, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R. Type &amp;quot;Rscript R/estAvgError.R&amp;quot; to see additional options.&lt;br /&gt;
&lt;br /&gt;
;*.abbababa2counts (optional file)&lt;br /&gt;
Output: As above, each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 allele combinations, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T. This file is not used as input for the ABBABABA test.&lt;br /&gt;
 &lt;br /&gt;
==Example==&lt;br /&gt;
Run the ABBABABA (multipop), without sampling a single allele for each site. &lt;br /&gt;
We use 2 individuals for each group, thus the smallBam.filelist file contains the 8 file names ordered from population H1 to H4.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# select 8 individuals (suppose they are already in the desired order H1,..,H4)&lt;br /&gt;
head -n8 bam.filelist &amp;gt; smallBam.filelist&lt;br /&gt;
&lt;br /&gt;
#run angsd&lt;br /&gt;
./angsd -out out -doAbbababa2 1 -bam smallBam.filelist -doCounts 1 -sample 0 -enhance 0 -sizeH1 2 -sizeH2 2 -sizeH3 2 -sizeH4 2 -combFile 0&lt;br /&gt;
&lt;br /&gt;
#estimate Z score&lt;br /&gt;
Rscript R/estAvgError.R file=out outfile=outDstat&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This produces multiple output files:&lt;br /&gt;
&lt;br /&gt;
1) a file outDstatStd.txt, which contains the results of the ABBABABA test with neither error correction nor ancient transition removal applied&lt;br /&gt;
&lt;br /&gt;
2) a file outDstatNoErrorNoTrans.txt, which contains the results of the ABBABABA test without error correction but with ancient transitions removed&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
1)outDstatStd.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2)outDstatNoErrorNoTrans.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''mean(D)''' The average of the test statistic (nABBA-nBABA)/(nABBA+nBABA), calculated separately for each block of data.&lt;br /&gt;
&lt;br /&gt;
'''JK-D''' The estimated test statistic (nABBA-nBABA)/(nABBA+nBABA) after bias correction. This value should be similar to the one in column 1. A negative value means that H1 is closer to H3 than H2 is. A positive value means that H2 is closer to H3 than H1 is.&lt;br /&gt;
&lt;br /&gt;
'''V(JK-D)''' Estimated m-delete block jackknife variance of the estimator in column 2. It is used to compute the Z value.&lt;br /&gt;
&lt;br /&gt;
'''Z''' Z value used to determine the significance of the test. As in Reich et al., an absolute Z score above 3 is often used as the critical value.&lt;br /&gt;
&lt;br /&gt;
'''pvalue''' p-value corresponding to Z for a two-sided test against the standard normal distribution. The critical value |Z|=3 corresponds to a p-value of about 0.0027.&lt;br /&gt;
&lt;br /&gt;
'''nABBA''' the total counts of ABBA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBABA''' the total counts of BABA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBBAA''' the total counts of BBAA patterns&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2810</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2810"/>
		<updated>2016-08-26T11:35:16Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBABABA test (also known as the D-statistic), which tests for ancient admixture (or a wrong tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [*.BAM| NGS genomes{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per population]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Elaborate multiple genomes per population]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[*.Observed.txt|D stat and Z scores + Error est and Plots{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
take all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default); 1: remove transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a value larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is usually used. &lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in fasta format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all outgroup reads show the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual; 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts.&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population H1 to H4). &lt;br /&gt;
If you are using a fasta file (option -anc) for population H4, leave -sizeH4 at its default value.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: set the H4 outgroup allele to A at every site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts file containing the allele combination counts before the individuals are weighted (this file is not used to estimate the D-statistic).&lt;br /&gt;
&lt;br /&gt;
To filter bases by quality score, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
;*.abbababa2&lt;br /&gt;
Output: Each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 allele combinations, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R. Type &amp;quot;Rscript R/estAvgError.R&amp;quot; to see additional options.&lt;br /&gt;
&lt;br /&gt;
;*.abbababa2counts (optional file)&lt;br /&gt;
Output: As above, each line represents a block, with a chromosome name (Column 1), a start position (Column 2) and an end position (Column 3). The remaining columns are the counts of all 256 allele combinations, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T. This file is not used as input for the ABBABABA test.&lt;br /&gt;
 &lt;br /&gt;
==Example==&lt;br /&gt;
Run the ABBABABA (multipop), without sampling a single allele for each site. &lt;br /&gt;
We use 2 individuals for each group, thus the smallBam.filelist file contains the 8 file names ordered from population H1 to H4.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# select 8 individuals (suppose they are already in the desired order H1,..,H4)&lt;br /&gt;
head -n8 bam.filelist &amp;gt; smallBam.filelist&lt;br /&gt;
&lt;br /&gt;
#run angsd&lt;br /&gt;
./angsd -out out -doAbbababa2 1 -bam smallBam.filelist -doCounts 1 -sample 0 -enhance 0 -sizeH1 2 -sizeH2 2 -sizeH3 2 -sizeH4 2 -combFile 0&lt;br /&gt;
&lt;br /&gt;
#estimate Z score&lt;br /&gt;
Rscript R/estAvgError.R file=out outfile=outDstat&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This produces multiple output files:&lt;br /&gt;
&lt;br /&gt;
1) a file outDstatStd.txt, which contains the results of the ABBABABA test with neither error correction nor ancient transition removal applied&lt;br /&gt;
&lt;br /&gt;
2) a file outDstatNoErrorNoTrans.txt, which contains the results of the ABBABABA test without error correction but with ancient transitions removed&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
1)outDstatStd.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2)outDstatNoErrorNoTrans.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''mean(D)''' The average of the test statistic (nABBA-nBABA)/(nABBA+nBABA), calculated separately for each block of data.&lt;br /&gt;
&lt;br /&gt;
'''JK-D''' The estimated test statistic (nABBA-nBABA)/(nABBA+nBABA) after bias correction. This value should be similar to the one in column 1. A negative value means that H1 is closer to H3 than H2 is. A positive value means that H2 is closer to H3 than H1 is.&lt;br /&gt;
&lt;br /&gt;
'''V(JK-D)''' Estimated m-delete block jackknife variance of the estimator in column 2. It is used to compute the Z value.&lt;br /&gt;
&lt;br /&gt;
'''Z''' Z value used to determine the significance of the test. As in Reich et al., an absolute Z score above 3 is often used as the critical value.&lt;br /&gt;
&lt;br /&gt;
'''pvalue''' p-value corresponding to Z for a two-sided test against the standard normal distribution. The critical value |Z|=3 corresponds to a p-value of about 0.0027.&lt;br /&gt;
&lt;br /&gt;
'''nABBA''' the total counts of ABBA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBABA''' the total counts of BABA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBBAA''' the total counts of BBAA patterns&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2809</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2809"/>
		<updated>2016-08-26T11:32:26Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBABABA test (also known as the D-statistic), which tests for ancient admixture (or an incorrect tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [BAM files{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes|per populations]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Weighting data]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores|Error est and Plots{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
Use all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default). 1: remove transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a value larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in FASTA format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup show the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual. 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts.&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population H1 to H4).&lt;br /&gt;
If you are using a FASTA file (option -anc) for population H4, leave -sizeH4 at its default value.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: set the H4 allele to A at each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts file containing the allele combination counts before the samples are weighted (this file is not used to estimate the D-statistic).&lt;br /&gt;
&lt;br /&gt;
To filter bases by quality score, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
;*.abbababa2&lt;br /&gt;
Each line represents a block, with a chromosome name (column 1), a start position (column 2) and an end position (column 3). The remaining columns are the counts of all 256 allele combinations, from X0000=AAAA, X0001=AAAC, ..., X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R. Type &amp;quot;Rscript R/estAvgError.R&amp;quot; to see additional options.&lt;br /&gt;
&lt;br /&gt;
;*.abbababa2counts (optional file)&lt;br /&gt;
As above, each line represents a block, with a chromosome name (column 1), a start position (column 2) and an end position (column 3). The remaining columns are the counts of all 256 allele combinations, from X0000=AAAA, X0001=AAAC, ..., X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T. This file is not used as input for the ABBABABA test.&lt;br /&gt;
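&lt;br /&gt;
The column naming can be illustrated with a small Python sketch. It is not part of ANGSD, and the helper name decode_column is hypothetical; it assumes only the correspondence 0=A, 1=C, 2=G, 3=T and the label order X0000, X0001, ..., X3333 described here.&lt;br /&gt;

```python
from itertools import product

# Digit-to-base correspondence stated in the text: 0=A, 1=C, 2=G, 3=T.
BASES = "ACGT"

def decode_column(name):
    """Translate a column label such as 'X0001' into its allele pattern 'AAAC'."""
    digits = name.lstrip("X")
    return "".join(BASES[int(d)] for d in digits)

# All 256 labels in the order X0000, X0001, ..., X3333.
ALL_COLUMNS = ["X" + "".join(combo) for combo in product("0123", repeat=4)]

print(decode_column("X0000"), decode_column("X0001"), decode_column("X3333"))
```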
 &lt;br /&gt;
==Example==&lt;br /&gt;
Run ABBABABA (multipop) without sampling a single base at each site (-sample 0).&lt;br /&gt;
We use 2 individuals for each group, so the smallBam.filelist file contains 8 file names ordered from population H1 to H4.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# select 8 individuals (suppose they are already in the desired order H1,..,H4)&lt;br /&gt;
head -n8 bam.filelist &amp;gt; smallBam.filelist&lt;br /&gt;
&lt;br /&gt;
#run angsd&lt;br /&gt;
./angsd -out out -doAbbababa2 1 -bam smallBam.filelist -doCounts 1 -sample 0 -enhance 0 -sizeH1 2 -sizeH2 2 -sizeH3 2 -sizeH4 2 -combFile 0&lt;br /&gt;
&lt;br /&gt;
#estimate Z score&lt;br /&gt;
Rscript R/estAvgError.R file=out outfile=outDstat&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This produces several output files:&lt;br /&gt;
&lt;br /&gt;
1) outDstatStd.txt, which contains the results of the ABBABABA test with neither error correction nor transition removal&lt;br /&gt;
&lt;br /&gt;
2) outDstatNoErrorNoTrans.txt, which contains the results of the ABBABABA test without error correction but with transitions removed&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
1)outDstatStd.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2)outDstatNoErrorNoTrans.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
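&lt;br /&gt;
As a rough cross-check of how the columns in these files relate, the following Python sketch may help. It is not the code of estAvgError.R, and the function names are hypothetical. It computes D from a pair of counts, a Z value as the estimate divided by the square root of its variance (an assumption about how Z is formed), and a two-sided p-value from the standard normal distribution.&lt;br /&gt;

```python
import math

def d_statistic(n_abba, n_baba):
    """D = (nABBA - nBABA) / (nABBA + nBABA)."""
    return (n_abba - n_baba) / (n_abba + n_baba)

def z_score(estimate, variance):
    """Assumed form: the estimate divided by its standard error."""
    return estimate / math.sqrt(variance)

def two_sided_pvalue(z):
    """Two-sided p-value of z under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Toy counts, not taken from any real data set.
print(d_statistic(60.0, 40.0))
# Z value from the example row above.
print(round(two_sided_pvalue(-1.690314), 6))
```

With the rounded JK-D and V(JK-D) values printed above, z_score gives roughly -1.69, close to the reported Z; small differences are expected because the printed values are rounded.&lt;br /&gt;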
&lt;br /&gt;
&lt;br /&gt;
'''mean(D)''' The mean of the per-block test statistics (nABBA-nBABA)/(nABBA+nBABA), where each value is computed from one block of data.&lt;br /&gt;
&lt;br /&gt;
'''JK-D''' The estimated test statistic (nABBA-nBABA)/(nABBA+nBABA) after bias correction. This value should be similar to the one in column 1. A negative value means that H1 is closer to H3 than H2 is; a positive value means that H2 is closer to H3 than H1 is.&lt;br /&gt;
&lt;br /&gt;
'''V(JK-D)''' The estimated m-delete block jackknife variance of the estimator in column 2. It is used to compute the Z value.&lt;br /&gt;
&lt;br /&gt;
'''Z''' The Z value used to assess the significance of the test. As in Reich et al., an absolute Z score above 3 is often used as the critical value.&lt;br /&gt;
&lt;br /&gt;
'''pvalue''' The p-value corresponding to Z for a two-sided test against the standard normal distribution. An absolute Z value of 3 corresponds to a p-value of about 0.0027.&lt;br /&gt;
&lt;br /&gt;
'''nABBA''' the total counts of ABBA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBABA''' the total counts of BABA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBBAA''' the total counts of BBAA patterns&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2808</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2808"/>
		<updated>2016-08-26T11:31:56Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBABABA test (also known as the D-statistic), which tests for ancient admixture (or an incorrect tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [BAM files{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Elaborate multiple genomes per populations]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Weighting data]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores|Error est and Plots{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
Use all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default). 1: remove transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a value larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in FASTA format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup show the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual. 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts.&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population H1 to H4).&lt;br /&gt;
If you are using a FASTA file (option -anc) for population H4, leave -sizeH4 at its default value.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: set the H4 allele to A at each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts file containing the allele combination counts before the samples are weighted (this file is not used to estimate the D-statistic).&lt;br /&gt;
&lt;br /&gt;
To filter bases by quality score, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
;*.abbababa2&lt;br /&gt;
Each line represents a block, with a chromosome name (column 1), a start position (column 2) and an end position (column 3). The remaining columns are the counts of all 256 allele combinations, from X0000=AAAA, X0001=AAAC, ..., X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R. Type &amp;quot;Rscript R/estAvgError.R&amp;quot; to see additional options.&lt;br /&gt;
&lt;br /&gt;
;*.abbababa2counts (optional file)&lt;br /&gt;
As above, each line represents a block, with a chromosome name (column 1), a start position (column 2) and an end position (column 3). The remaining columns are the counts of all 256 allele combinations, from X0000=AAAA, X0001=AAAC, ..., X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T. This file is not used as input for the ABBABABA test.&lt;br /&gt;
 &lt;br /&gt;
==Example==&lt;br /&gt;
Run ABBABABA (multipop) without sampling a single base at each site (-sample 0).&lt;br /&gt;
We use 2 individuals for each group, so the smallBam.filelist file contains 8 file names ordered from population H1 to H4.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# select 8 individuals (suppose they are already in the desired order H1,..,H4)&lt;br /&gt;
head -n8 bam.filelist &amp;gt; smallBam.filelist&lt;br /&gt;
&lt;br /&gt;
#run angsd&lt;br /&gt;
./angsd -out out -doAbbababa2 1 -bam smallBam.filelist -doCounts 1 -sample 0 -enhance 0 -sizeH1 2 -sizeH2 2 -sizeH3 2 -sizeH4 2 -combFile 0&lt;br /&gt;
&lt;br /&gt;
#estimate Z score&lt;br /&gt;
Rscript R/estAvgError.R file=out outfile=outDstat&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This produces several output files:&lt;br /&gt;
&lt;br /&gt;
1) outDstatStd.txt, which contains the results of the ABBABABA test with neither error correction nor transition removal&lt;br /&gt;
&lt;br /&gt;
2) outDstatNoErrorNoTrans.txt, which contains the results of the ABBABABA test without error correction but with transitions removed&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
1)outDstatStd.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2)outDstatNoErrorNoTrans.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''mean(D)''' The mean of the per-block test statistics (nABBA-nBABA)/(nABBA+nBABA), where each value is computed from one block of data.&lt;br /&gt;
&lt;br /&gt;
'''JK-D''' The estimated test statistic (nABBA-nBABA)/(nABBA+nBABA) after bias correction. This value should be similar to the one in column 1. A negative value means that H1 is closer to H3 than H2 is; a positive value means that H2 is closer to H3 than H1 is.&lt;br /&gt;
&lt;br /&gt;
'''V(JK-D)''' The estimated m-delete block jackknife variance of the estimator in column 2. It is used to compute the Z value.&lt;br /&gt;
&lt;br /&gt;
'''Z''' The Z value used to assess the significance of the test. As in Reich et al., an absolute Z score above 3 is often used as the critical value.&lt;br /&gt;
&lt;br /&gt;
'''pvalue''' The p-value corresponding to Z for a two-sided test against the standard normal distribution. An absolute Z value of 3 corresponds to a p-value of about 0.0027.&lt;br /&gt;
&lt;br /&gt;
'''nABBA''' the total counts of ABBA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBABA''' the total counts of BABA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBBAA''' the total counts of BBAA patterns&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2807</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2807"/>
		<updated>2016-08-26T11:30:29Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBABABA test (also known as the D-statistic), which tests for ancient admixture (or an incorrect tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [BAM files{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Weighting data|Weighted combination of individuals]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Weighting data]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores|Error est and Plots{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
Use all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default). 1: remove transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a value larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in FASTA format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup show the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual. 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts.&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population H1 to H4).&lt;br /&gt;
If you are using a FASTA file (option -anc) for population H4, leave -sizeH4 at its default value.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: set the H4 allele to A at each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts file containing the allele combination counts before the samples are weighted (this file is not used to estimate the D-statistic).&lt;br /&gt;
&lt;br /&gt;
To filter bases by quality score, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
;*.abbababa2&lt;br /&gt;
Each line represents a block, with a chromosome name (column 1), a start position (column 2) and an end position (column 3). The remaining columns are the counts of all 256 allele combinations, from X0000=AAAA, X0001=AAAC, ..., X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R. Type &amp;quot;Rscript R/estAvgError.R&amp;quot; to see additional options.&lt;br /&gt;
&lt;br /&gt;
;*.abbababa2counts (optional file)&lt;br /&gt;
As above, each line represents a block, with a chromosome name (column 1), a start position (column 2) and an end position (column 3). The remaining columns are the counts of all 256 allele combinations, from X0000=AAAA, X0001=AAAC, ..., X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T. This file is not used as input for the ABBABABA test.&lt;br /&gt;
 &lt;br /&gt;
==Example==&lt;br /&gt;
Run ABBABABA (multipop) without sampling a single base at each site (-sample 0).&lt;br /&gt;
We use 2 individuals for each group, so the smallBam.filelist file contains 8 file names ordered from population H1 to H4.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# select 8 individuals (suppose they are already in the desired order H1,..,H4)&lt;br /&gt;
head -n8 bam.filelist &amp;gt; smallBam.filelist&lt;br /&gt;
&lt;br /&gt;
#run angsd&lt;br /&gt;
./angsd -out out -doAbbababa2 1 -bam smallBam.filelist -doCounts 1 -sample 0 -enhance 0 -sizeH1 2 -sizeH2 2 -sizeH3 2 -sizeH4 2 -combFile 0&lt;br /&gt;
&lt;br /&gt;
#estimate Z score&lt;br /&gt;
Rscript R/estAvgError.R file=out outfile=outDstat&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This produces several output files:&lt;br /&gt;
&lt;br /&gt;
1) outDstatStd.txt, which contains the results of the ABBABABA test with neither error correction nor transition removal&lt;br /&gt;
&lt;br /&gt;
2) outDstatNoErrorNoTrans.txt, which contains the results of the ABBABABA test without error correction but with transitions removed&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
1)outDstatStd.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2)outDstatNoErrorNoTrans.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''mean(D)''' The mean of the per-block test statistics (nABBA-nBABA)/(nABBA+nBABA), where each value is computed from one block of data.&lt;br /&gt;
&lt;br /&gt;
'''JK-D''' The estimated test statistic (nABBA-nBABA)/(nABBA+nBABA) after bias correction. This value should be similar to the one in column 1. A negative value means that H1 is closer to H3 than H2 is; a positive value means that H2 is closer to H3 than H1 is.&lt;br /&gt;
&lt;br /&gt;
'''V(JK-D)''' The estimated m-delete block jackknife variance of the estimator in column 2. It is used to compute the Z value.&lt;br /&gt;
&lt;br /&gt;
'''Z''' The Z value used to assess the significance of the test. As in Reich et al., an absolute Z score above 3 is often used as the critical value.&lt;br /&gt;
&lt;br /&gt;
'''pvalue''' The p-value corresponding to Z for a two-sided test against the standard normal distribution. An absolute Z value of 3 corresponds to a p-value of about 0.0027.&lt;br /&gt;
&lt;br /&gt;
'''nABBA''' the total counts of ABBA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBABA''' the total counts of BABA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBBAA''' the total counts of BBAA patterns&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
	<entry>
		<id>https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2806</id>
		<title>Abbababa2</title>
		<link rel="alternate" type="text/html" href="https://www.popgen.dk/angsd/index.php?title=Abbababa2&amp;diff=2806"/>
		<updated>2016-08-26T11:29:49Z</updated>

		<summary type="html">&lt;p&gt;Samuele: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The ''ABBABABA (multipop)'' method computes the ABBABABA test (also known as the D-statistic), which tests for ancient admixture (or an incorrect tree topology).&lt;br /&gt;
Unlike ABBABABA (D_stat), it allows multiple individuals in each group. As with all methods in ANGSD, the headers of the BAM files must be identical.&lt;br /&gt;
; some of the options only work in the development version available from GitHub&lt;br /&gt;
; if you use -rf to specify regions, these MUST appear in the same order as in your fai file.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
 [BAM files{bg:orange}]-&amp;gt;[Sequence data|All bases or Random bases]&lt;br /&gt;
[Sequence data]-&amp;gt;[Weighting data|Weighted combination of individuals]&lt;br /&gt;
[Sequence data]-&amp;gt;[*.abbababa2counts|ABBA and BABA intermediate counts file {bg:blue}]&lt;br /&gt;
[Weighting data]-&amp;gt;[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;classdiagram type=&amp;quot;dir:LR&amp;quot;&amp;gt;&lt;br /&gt;
[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]-&amp;gt;estAvgError.R[D stat and Z scores{bg:blue}]&lt;br /&gt;
&amp;lt;/classdiagram&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Method=&lt;br /&gt;
&lt;br /&gt;
[[File:tree.png|800px]]&lt;br /&gt;
&lt;br /&gt;
=Brief Overview=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ./angsd -doAbbababa2&lt;br /&gt;
&lt;br /&gt;
--------------&lt;br /&gt;
abcDstat2.cpp:&lt;br /&gt;
	-doAbbababa2	                0	run the abbababa analysis&lt;br /&gt;
	-rmTrans		        0       remove transitions&lt;br /&gt;
	-blockSize		       5000000	size of each block in bases&lt;br /&gt;
	-anc			       (null)	fasta file with outgroup&lt;br /&gt;
	-sample			        0	sample a single base in each individual&lt;br /&gt;
	-maxDepth		        100	max depth of each site allowed&lt;br /&gt;
	-sizeH1			        1	num of individuals in group H1&lt;br /&gt;
	-sizeH2			        1	num of individuals in group H2&lt;br /&gt;
	-sizeH3			        1	num of individuals in group H3&lt;br /&gt;
	-sizeH4			        1	num of individuals in group H4&lt;br /&gt;
	-enhance			0	only analyze sites where outgroup H4 is non poly&lt;br /&gt;
	-Aanc			        0	set H4 outgroup allele as A in each site&lt;br /&gt;
	-combFile		        0	create an optional *.abbababa2counts file where are printed the &lt;br /&gt;
                                     numbers of alleles combinations without having weighted the individuals&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This function counts the number of ABBA and BABA sites.&lt;br /&gt;
&lt;br /&gt;
=Options=&lt;br /&gt;
;-doAbbababa2 1: &lt;br /&gt;
Use all bases at each position.&lt;br /&gt;
;-rmTrans [int]&lt;br /&gt;
0: use all reads (default). 1: remove transitions (important for ancient DNA).&lt;br /&gt;
;-blockSize [INT]&lt;br /&gt;
Size of each block in bases. Choose a value larger than the extent of LD in the populations. For humans, 5 Mb (5000000) is usually used.&lt;br /&gt;
; -anc [fileName.fa]&lt;br /&gt;
Include an outgroup in FASTA format.&lt;br /&gt;
; -doCounts 1&lt;br /&gt;
Use -doCounts 1 to count the bases at each site after filtering.&lt;br /&gt;
; -enhance [int]&lt;br /&gt;
1: use only sites where all reads for the outgroup show the same base.&lt;br /&gt;
; -sample [int]&lt;br /&gt;
1: sample only one base at each position for every individual. 0: use all bases at each position for the ABBABABA test.&lt;br /&gt;
; -maxDepth [int]&lt;br /&gt;
Maximum depth allowed at each site, to avoid overflow of the ABBA and BABA counts.&lt;br /&gt;
; -sizeH* [int]&lt;br /&gt;
Number of individuals in each group (the file list must contain the BAM files ordered from population H1 to H4).&lt;br /&gt;
If you are using a FASTA file (option -anc) for population H4, leave -sizeH4 at its default value.&lt;br /&gt;
; -Aanc [int]&lt;br /&gt;
1: set the H4 allele to A at each site.&lt;br /&gt;
; -combFile [int]&lt;br /&gt;
1: create an intermediate *.abbababa2counts file containing the allele combination counts before the samples are weighted (this file is not used to estimate the D-statistic).&lt;br /&gt;
&lt;br /&gt;
To filter bases by quality score, see the [[Alleles_counts|Allele counts]] options.&lt;br /&gt;
&lt;br /&gt;
=Output=&lt;br /&gt;
;*.abbbababa2&lt;br /&gt;
Output: Each line represents a block with a chromsome name (Column 1), a start position (Column 2), an end postion (Column 3). The new columns are the counts of all 256 counted combination of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.&lt;br /&gt;
This file is used as input for the R script estAvgError.R. Type &amp;quot;Rscript R/estAvgError.R&amp;quot; to see additional options.&lt;br /&gt;
&lt;br /&gt;
;*.abbababa2counts (optional file)&lt;br /&gt;
Output: As above, each line represents a block, giving the chromosome name (column 1), the start position (column 2) and the end position (column 3). The remaining columns are the counts of all 256 allele combinations, from X0000=AAAA, X0001=AAAC, ..., to X3333=TTTT, with the correspondence 0=A, 1=C, 2=G, 3=T. This file is not used as input for the ABBABABA test.&lt;br /&gt;
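&lt;br /&gt;
The column naming used in both files can be reproduced programmatically. The following Python sketch is only an illustration: it generates the 256 labels in the stated order and decodes them with 0=A, 1=C, 2=G, 3=T. The classification into ABBA, BABA and BBAA assumes that the four digits are ordered H1, H2, H3, H4, which is not stated explicitly above; the function names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import product

BASES = "ACGT"  # 0=A, 1=C, 2=G, 3=T

# 256 labels in file order: X0000, X0001, ..., X3333 (last digit varies fastest)
LABELS = ["X" + "".join(digits) for digits in product("0123", repeat=4)]


def decode(label):
    """Turn a column label such as X0132 into its alleles, here ACTG."""
    return "".join(BASES[int(d)] for d in label[1:])


def pattern_type(label):
    """Classify a label as ABBA, BABA, BBAA or other.

    Assumption: the four digits are ordered H1, H2, H3, H4.
    """
    h1, h2, h3, h4 = label[1:]
    if h1 == h4 and h2 == h3 and h1 != h2:
        return "ABBA"
    if h1 == h3 and h2 == h4 and h1 != h2:
        return "BABA"
    if h1 == h2 and h3 == h4 and h1 != h3:
        return "BBAA"
    return "other"


# List the labels that count as ABBA under the assumption above
abba = [(lab, decode(lab)) for lab in LABELS if pattern_type(lab) == "ABBA"]
print(abba)
```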
 &lt;br /&gt;
==Example==&lt;br /&gt;
Run ABBABABA (multipop) using all bases at each site (-sample 0) rather than sampling a single base.&lt;br /&gt;
We use 2 individuals for each group, thus the smallBam.filelist file contains the 8 file names ordered from population H1 to H4.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# select 8 individuals (suppose they are already in the desired order H1,..,H4)&lt;br /&gt;
head -n8 bam.filelist &amp;gt; smallBam.filelist&lt;br /&gt;
&lt;br /&gt;
#run angsd&lt;br /&gt;
./angsd -out out -doAbbababa2 1 -bam smallBam.filelist -doCounts 1 -sample 0 -enhance 0 -sizeH1 2 -sizeH2 2 -sizeH3 2 -sizeH4 2 -combFile 0&lt;br /&gt;
&lt;br /&gt;
#estimate Z score&lt;br /&gt;
Rscript R/estAvgError.R file=out outfile=outDstat&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The R script writes its results to several output files, including:
&lt;br /&gt;
1) a file outDstatStd.txt, which contains the results of the ABBABABA test with neither error correction nor ancient transition removal applied

2) a file outDstatNoErrorNoTrans.txt, which contains the results of the ABBABABA test without error correction but with ancient transitions removed
&lt;br /&gt;
=Output=&lt;br /&gt;
1) outDstatStd.txt
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2) outDstatNoErrorNoTrans.txt
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mean(D)          JK-D          V(JK-D)            Z            pvalue           nABBA         nBABA          nBBAA&lt;br /&gt;
-0.042874       -0.042945       0.000643        -1.690314       0.090968        1697.447933     1704.675933     6374.948767	&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
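
The Z and pvalue columns above can be cross-checked by hand. The sketch below is an illustration in Python, not the code of estAvgError.R: it assumes an equal-weight delete-one block jackknife (the weighting used by the script may differ), Z = JK-D / sqrt(V(JK-D)), and a two-sided p-value under the standard normal distribution. The function names and the toy block counts are hypothetical.

```python
import math


def d_stat(n_abba, n_baba):
    """D = (nABBA - nBABA) / (nABBA + nBABA)."""
    return (n_abba - n_baba) / (n_abba + n_baba)


def block_jackknife(blocks):
    """Equal-weight delete-one block jackknife of D.

    blocks: list of (nABBA, nBABA) pairs, one per genomic block.
    Returns (bias-corrected D, jackknife variance).
    """
    g = len(blocks)
    tot_abba = sum(a for a, _ in blocks)
    tot_baba = sum(b for _, b in blocks)
    d_all = d_stat(tot_abba, tot_baba)
    # D recomputed with each block left out in turn
    loo = [d_stat(tot_abba - a, tot_baba - b) for a, b in blocks]
    loo_mean = sum(loo) / g
    d_jk = g * d_all - (g - 1) * loo_mean
    var = (g - 1) / g * sum((d - loo_mean) ** 2 for d in loo)
    return d_jk, var


def two_sided_p(z):
    """Two-sided p-value of z under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical per-block (nABBA, nBABA) counts, for illustration only
blocks = [(10.0, 12.0), (8.0, 9.0), (11.0, 10.0), (9.0, 13.0)]
mean_d = sum(d_stat(a, b) for a, b in blocks) / len(blocks)
d_jk, var = block_jackknife(blocks)
z = d_jk / math.sqrt(var)
print("mean(D)", mean_d, "JK-D", d_jk, "V(JK-D)", var)
print("Z", z, "pvalue", two_sided_p(z))
```

Feeding the tabulated Z of -1.690314 into two_sided_p gives roughly 0.091, in line with the pvalue column.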
&lt;br /&gt;
&lt;br /&gt;
'''mean(D)''' The average of the per-block test statistics (nABBA-nBABA)/(nABBA+nBABA), each computed from one block of data.
&lt;br /&gt;
'''JK-D''' The estimated test statistic: (nABBA-nBABA)/(nABBA+nBABA) after being bias corrected. This value should be similar to the one in column 1. A negative value means that H1 is closer to H3 than H2 is. A positive value means that H2 is closer to H3 than H1 is.&lt;br /&gt;
&lt;br /&gt;
'''V(JK-D)''' Estimated delete-m block jackknife variance of the estimator in column 2. It is used to compute the Z value.
&lt;br /&gt;
'''Z''' Z value that can be used to determine the significance of the test. As in Reich et al., an absolute Z score above 3 is often used as the critical value.
&lt;br /&gt;
'''pvalue''' Two-sided p-value corresponding to Z under the standard normal distribution. An absolute Z score of 3 corresponds to a p-value of about 0.0027.
&lt;br /&gt;
'''nABBA''' the total counts of ABBA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBABA''' the total counts of BABA patterns&lt;br /&gt;
&lt;br /&gt;
'''nBBAA''' the total counts of BBAA patterns&lt;/div&gt;</summary>
		<author><name>Samuele</name></author>
	</entry>
</feed>