ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Abbababa2: Difference between revisions

From angsd
Jump to navigation Jump to search
No edit summary
No edit summary
Line 34: Line 34:
-enhance 0 only analyze sites where outgroup H4 is non poly
-enhance 0 only analyze sites where outgroup H4 is non poly
-Aanc         0 set H4 outgroup allele as A in each site
-Aanc         0 set H4 outgroup allele as A in each site
         -useLast                        0      include fasta file defined by -anc in the analysis
         -useLast                        0      set fasta file defined by -anc as outgroup in the analysis


</pre>
</pre>
This function will counts the number of ABBA and BABA sites.
This function will counts the number of ABBA and BABA sites of the possible 4-population trees that can be built from the data.
=Output=
=Output=
;1)*.abbbababa2 (used for the 4-population test)
;1)*.abbbababa2 (used for the 4-population test)
Each line represents a block with a chromsome name (Column 1), a start position (Column 2), an end postion (Column 3). The new columns are the counts of all 256 counted combination of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. Every block is repeated a number of times corresponding to the combinations of 4 populations that are possible to do create from the data
Each line represents a block with a chromsome name (Column 1), a start position (Column 2), an end postion (Column 3). The new columns are the counts of all 256 counted combination of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. Every block is repeated a number of times corresponding to the combinations of 4 populations that are possible to create from the data (with one population fixed as outgroup)
This file is used as input for the R script estAvgError.R.
This file is used as input for the R script estAvgError.R.


Line 53: Line 53:
Include an outgroup in fasta format.
Include an outgroup in fasta format.
; -useLast [int]
; -useLast [int]
1: include the file defined by -anc in the D-stat analysys. Default: 0
1: use the file defined by -anc as outgroup of the D-stat analysys. Default: 0 (use last population as a fixed outgroup)
; -doCounts 1
; -doCounts 1
use -doCounts 1 in order to count the bases at each sites after filters.
use -doCounts 1 in order to count the bases at each sites after filters.
Line 70: Line 70:


=Tutorial for the ABBABABA (Multipop) test=
=Tutorial for the ABBABABA (Multipop) test=
This tutorial require having Samtools previously installed, and the library 'pracma' previously installed in R.
== Prepare BAM and FASTA files ==
== Prepare BAM and FASTA files ==
Download the latest version of angsd in your working folder from the github repository
Download the latest version of angsd in your working folder from the github repository
Line 104: Line 105:
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam
</pre>
</pre>
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the 4-population test. One can use a bam file as well (see in one of the other examples after the tutorial how to do it).
Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the 4-population test.
<pre>
<pre>
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz
wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz
Line 131: Line 132:


==4-population test==
==4-population test==
In this tutorial we perform the ABBABABA test on all the combinations of 4 populations amongst 6 populations of size 1,2,2,3,2,1 individuals (so that there are 360 possible combinations). The last population is represented by the fasta file defined with the option -anc, of which we enable the use by the option -useLast 1. Create a file named sizeFile.size and write the size of each population (skip the last referred to the -anc fasta file):
In this tutorial we perform the ABBABABA test on all the combinations of 4 populations amongst 6 populations of size 1,2,2,3,2,1 individuals, where the last population is fixed as outgroup (so that there are 30 possible combinations). The last population is represented by the fasta file defined with the option -anc, of which we enable the use as outgroup by the option -useLast 1. One can use the last population of .bam files as outgroup with the option -useLast 0. Create a file named sizeFile.size and write the size of each population (it is NOT necessary to define the size of the -anc fasta outgroup):
<pre>
<pre>
1
1
Line 145: Line 146:
3:10000000-15000000
3:10000000-15000000
</pre>
</pre>
After running ANGSD we will call the R script who applies error correction to the ABBA and BABA allele combinations and produces the final output files.
After running ANGSD to count ABBA and BABA combinations, we will call the R script who applies error correction to the ABBA and BABA allele combinations and produces the final output files.
<pre>
<pre>
./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -rf regions.txt -useLast 1 -minQ 20 -minMapQ 30 -p 3
./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -rf regions.txt -useLast 1 -minQ 20 -minMapQ 30 -p 3
Line 151: Line 152:
[[The output file is]]
[[The output file is]]
[[bam.Angsd.abbbababa2 (used for the 4-population test)]]
[[bam.Angsd.abbbababa2 (used for the 4-population test)]]
Each line represents a block with a chromsome name (Column 1) for one of the possible 360 combinations (so each block is written on 360 lines), a start position (Column 2), an end postion (Column 3). Columns 4,5 and 6 are the numerator, denominator and number of sites analyzed. The next 256 columns are the counted patterns of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.
Each line represents a block with a chromsome name (Column 1) for one of the possible 30 combinations (so each block is written on 30 lines), a start position (Column 2), an end postion (Column 3). Columns 4,5 and 6 are the numerator, denominator and number of sites analyzed. The next 256 columns are the counted patterns of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T.
This file is used as input for the R script estAvgError.R.
This file is used as input for the R script estAvgError.R.


We run the R script specifying the error files for the population with 3 individuals. This is done defining the error files in each populations inside a text file. If a population has no error file, it is sufficient to write NA. Create a file called errorList.error with written
We run the R script specifying the error files for the population with 3 individuals. This is done defining the error files in each populations inside a text file (including a line for the outgroup population). If a population has no error file, it is sufficient to write NA. Create a file called errorList.error with written
<pre>
<pre>
NA
NA
Line 163: Line 164:
NA
NA
</pre>
</pre>
We also want to study the effect of error correction if we add individually to each population an error rate between -0.005 and 0.005 with step 0.001 and involving transition A->T. You can run more transitions separating the letters with a comma. We need to create a file with the names of the 6 populations. Create a file popNames.name with written
We also want to study the effect of error correction if we add individually to each population an error rate between -0.005 and 0.005 of steps 0.001 and involving transition A->T. You can run more transitions separating the letters with a comma. We need to create a file with the names of the 6 populations. Create a file popNames.name with written
<pre>
<pre>
Population1
Population1
Line 176: Line 177:
Rscript DSTAT angsdFile="bam.Angsd" out="result" sizeFile=sizeFile.size errFile=errorList.error nameFile=popNames.name addErr="-0.005,0.005,0.001;A;T"
Rscript DSTAT angsdFile="bam.Angsd" out="result" sizeFile=sizeFile.size errFile=errorList.error nameFile=popNames.name addErr="-0.005,0.005,0.001;A;T"
</pre>
</pre>
The script will show the calculated D statistic along with Z-score, Pvalues, Standard deviation and other quantities for all combinations.
The script will show the calculated D statistic along with Z-score, Pvalues, Standard deviation and other quantities for all 30 4-populations trees.
The plots of error rates for type specific errors and added errors are in the folder result.errorDataFolder as pdf files. The other files are used by the Rscript for the necessary computations and do not contain useful information.
The plots of error rates for type specific errors and added errors are in the folder result.errorDataFolder as pdf files (only if you have provided an error rate file and added error rates in the options for the R script). The other text files are used by the Rscript for the necessary computations and do not contain useful information.
The D-statistics are contained in four distinct files for each combination of populations. For example, for the populations Population1, Chimpanzee, Population2, Population3 the files containing the results are:
The D-statistics are contained in four distinct files for each combination of populations. For example, for the tree (((Population1, Population2)PopWithError)Chimpanzee) the files containing the results are:
;[[1)result.Population1.Chimpanzee.Population2.Population3.Observed.txt]]
;[[1)result.Population1.Population2.PopWithError.Chimpanzee.Observed.txt]]
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal
D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal
;[[2) result.Population1.Chimpanzee.Population2.Population3.ErrorCorr.txt]]
;[[2) result.Population1.Population2.PopWithError.Chimpanzee.ErrorCorr.txt]]
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal
D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal
;[[3) result.Population1.Chimpanzee.Population2.Population3.TransRem.ErrorCorr.txt]]
;[[3) result.Population1.Population2.PopWithError.Chimpanzee.TransRem.ErrorCorr.txt]]
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal
D-statistic calculated WITH Error Correction and WITH Ancient Transition removal
;[[4) result.Population1.Chimpanzee.Population2.Population3.RemTrans.txt]]
;[[4) result.Population1.Population2.PopWithError.Chimpanzee.TransRem.txt]]
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal
D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal


Line 191: Line 192:


In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:
In case of error correction, the R script also creates the folder [[result.errorDataFolder]] containing:
;-the file barPlotErrors.Population1.Chimpanzee.Population2.Population3.pdf showing a barplot of the error rates
;-the file barPlotErrors.Population1.Population2.PopWithError.Chimpanzee.pdf showing a barplot of the error rates
;-the file plotAddErr.A2T.Population1.Chimpanzee.Population2.Population3.pdf showing the effect of error correction on transition A-->T
;-the file plotAddErr.A2T.Population1.Population2.PopWithError.Chimpanzee.pdf showing the effect of error correction on transition A-->T

Revision as of 13:03, 15 February 2017

The ABBABABA (multipop) compute the abbababa test (aka D-statistic), that means testing for an ancient admixture (or wrong tree topology). Differently from ABBABABA (D_stat) multiple individuals for each one of the groups are allowed. As all methods in ANGSD we require that the header of the BAM files are the same.

some of the options only works for the developmental version availeble from github
if you use -rf to specify regions. These MUST appear in the same ordering as your fai file.

<classdiagram type="dir:LR">

[*.bam and/or *.cram| NGS genome datasets{bg:orange}]->[Sequence data|All bases or Random bases]

[Sequence data]->[Elaborate multiple genomes per population] [Elaborate multiple genomes per population]->[*.abbababa2|weighted ABBA and BABA counts file {bg:blue}] </classdiagram>

<classdiagram type="dir:LR"> [*.abbababa2|weighted ABBA and BABA counts file {bg:blue}]->estAvgError.R[D stat and Z scores{bg:blue}] </classdiagram>

Method

Brief Overview

> ./angsd -doAbbababa2

--------------
abcDstat2.cpp:
	-doAbbababa2	                0	run the abbababa analysis
	-rmTrans		        0       remove transitions
	-blockSize		       5000000	size of each block in bases
	-anc			       (null)	fasta file with outgroup
	-sample			        0	sample a single base in each individual
	-maxDepth		        100	max depth of each site allowed
	-sizeFile		       (null)   file with sizes of the populations	
	-enhance			0	only analyze sites where outgroup H4 is non poly
	-Aanc			        0	set H4 outgroup allele as A in each site
        -useLast                        0       set fasta file defined by -anc as outgroup in the analysis

This function will counts the number of ABBA and BABA sites of the possible 4-population trees that can be built from the data.

Output

1)*.abbbababa2 (used for the 4-population test)

Each line represents a block with a chromsome name (Column 1), a start position (Column 2), an end postion (Column 3). The new columns are the counts of all 256 counted combination of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. Every block is repeated a number of times corresponding to the combinations of 4 populations that are possible to create from the data (with one population fixed as outgroup) This file is used as input for the R script estAvgError.R.

Options

-doAbbababa2 1

take all bases at each position.

-rmTrans [int]

0; use all reads (default), 1 Remove ancient transitions (important for ancient DNA)

-blockSize [int]

Size of each block. Choose a number that is higher than the LD in the populations. For human 5Mb (5000000) is usually used.

-anc [fileName.fa]

Include an outgroup in fasta format.

-useLast [int]

1: use the file defined by -anc as outgroup of the D-stat analysys. Default: 0 (use last population as a fixed outgroup)

-doCounts 1

use -doCounts 1 in order to count the bases at each sites after filters.

-enhance [int]

1: use only sites where the reads for the outgroup has the same base for all reads.

-sample [int]

1: sample only one base at each position for every individual 0: all bases at each position are used for the ABBABABA test

-maxDepth [int]

allows for a maximum depth in each site to avoid overflow of the ABBA BABA counts. Default 100.

-sizeFile [fileName]

file that specifies number of individuals in each population (more than 4 populations can be defined)

-Aanc [int]

1: H4 allele is A in each site.

In order to do fancy filtering of bases based on quality scores see the Allele counts options.

Tutorial for the ABBABABA (Multipop) test

This tutorial require having Samtools previously installed, and the library 'pracma' previously installed in R.

Prepare BAM and FASTA files

Download the latest version of angsd in your working folder from the github repository

https://github.com/ANGSD/angsd.git

Create symbolic links to angsd and the necessary R script

ln -s ./angsd/angsd ANGSD
ln -s ./angsd/R/estAvgError.R DSTAT

Get 10 example .bam datasets, position them in the folder ./bams/ and create a file bam.filelist listing the pathnames of those datasets

wget http://popgen.dk/software/download/angsd/bams.tar.gz
tar xf bams.tar.gz
for i in bams/*.bam;do samtools index $i;done #index bam files
ls bams/*.bam > bam.filelist
rm bams.tar.gz #remove zipped file

This is how the file bam.filelist looks like

cat bam.filelist
bams/smallNA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam
bams/smallNA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam
bams/smallNA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam
bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam
bams/smallNA07357.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam
bams/smallNA11829.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam
bams/smallNA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam
bams/smallNA11831.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam
bams/smallNA11832.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam
bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam

Download a fasta file for the chimpanzee. This is going to be used as the outgroup for the 4-population test.

wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz
mv hg19ancNoChr.fa.gz chimpHg19.fa.gz
gunzip chimpHg19.fa.gz
samtools faidx chimpHg19.fa #indexing the fasta file
rm chimpHg19.fa.gz

Now, generate a fasta file for one of our 10 bam file. We assume such a genome has very high quality and we can use it as a reference for estimating error rates in others of our datasets.

./ANGSD -i bams/smallNA11840.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1 -doCounts 1 -out perfectSampleCEU
gunzip perfectSampleCEU.fa.gz
samtools faidx perfectSampleCEU.fa

Generate files for the error correction

We will apply error correction to the group with 3 individuals, using "perfectSampleCEU" as high-quality reference genome. The population containing 3 individuals affected by transition error goes from line 6 to line 8 in the file bam.filelist. We select those individuals and write them in another file.

sed -n 6,8p bam.filelist > bamWithErrors.filelist

and then we use "doAncError" to generate the intermediate files that we will use later as input for the R script that calculates the D-statistic. "doAncError" applies the so called "perfect individual assumption", based on which error rates are estimate using a high quality genome (option -ref) and an outgroup (option -anc), both in fasta format. We have already prepared the two fasta files in our preparation phase.

./ANGSD -doAncError 1 -anc chimpHg19.fa -ref perfectSampleCEU.fa -out errorFile -bam bamWithErrors.filelist

4-population test

In this tutorial we perform the ABBABABA test on all the combinations of 4 populations amongst 6 populations of size 1,2,2,3,2,1 individuals, where the last population is fixed as outgroup (so that there are 30 possible combinations). The last population is represented by the fasta file defined with the option -anc, of which we enable the use as outgroup by the option -useLast 1. One can use the last population of .bam files as outgroup with the option -useLast 0. Create a file named sizeFile.size and write the size of each population (it is NOT necessary to define the size of the -anc fasta outgroup):

1
2
2
3
2

We decide to target three chromosomes with loci in the range 10000000 to 15000000 for chromosomes 1,2 and 3. Thus create a file called regions.txt in which is written

1:10000000-15000000
2:10000000-15000000
3:10000000-15000000

After running ANGSD to count ABBA and BABA combinations, we will call the R script who applies error correction to the ABBA and BABA allele combinations and produces the final output files.

./ANGSD -doAbbababa2 1 -bam bam.filelist -sizeFile sizeFile.size -doCounts 1 -out bam.Angsd -anc chimpHg19.fa -rf regions.txt -useLast 1 -minQ 20 -minMapQ 30 -p 3

The output file is bam.Angsd.abbbababa2 (used for the 4-population test) Each line represents a block with a chromsome name (Column 1) for one of the possible 30 combinations (so each block is written on 30 lines), a start position (Column 2), an end postion (Column 3). Columns 4,5 and 6 are the numerator, denominator and number of sites analyzed. The next 256 columns are the counted patterns of alleles, starting from X0000=AAAA,X0001=AAAC,....,X3333=TTTT, with the correspondence 0=A,1=C,2=G,3=T. This file is used as input for the R script estAvgError.R.

We run the R script specifying the error files for the population with 3 individuals. This is done defining the error files in each populations inside a text file (including a line for the outgroup population). If a population has no error file, it is sufficient to write NA. Create a file called errorList.error with written

NA
NA
NA
./errorFile.ancError
NA
NA

We also want to study the effect of error correction if we add individually to each population an error rate between -0.005 and 0.005 of steps 0.001 and involving transition A->T. You can run more transitions separating the letters with a comma. We need to create a file with the names of the 6 populations. Create a file popNames.name with written

Population1
Population2
Population3
PopWithError
Population4
Chimpanzee

Run the Rscript with the command

Rscript DSTAT angsdFile="bam.Angsd" out="result" sizeFile=sizeFile.size errFile=errorList.error nameFile=popNames.name addErr="-0.005,0.005,0.001;A;T"

The script will show the calculated D statistic along with Z-score, Pvalues, Standard deviation and other quantities for all 30 4-populations trees. The plots of error rates for type specific errors and added errors are in the folder result.errorDataFolder as pdf files (only if you have provided an error rate file and added error rates in the options for the R script). The other text files are used by the Rscript for the necessary computations and do not contain useful information. The D-statistics are contained in four distinct files for each combination of populations. For example, for the tree (((Population1, Population2)PopWithError)Chimpanzee) the files containing the results are:

1)result.Population1.Population2.PopWithError.Chimpanzee.Observed.txt

D-statistic calculated WITHOUT Error Correction and WITHOUT Ancient Transition removal

2) result.Population1.Population2.PopWithError.Chimpanzee.ErrorCorr.txt

D-statistic calculated WITH Error Correction and WITHOUT Ancient Transition removal

3) result.Population1.Population2.PopWithError.Chimpanzee.TransRem.ErrorCorr.txt

D-statistic calculated WITH Error Correction and WITH Ancient Transition removal

4) result.Population1.Population2.PopWithError.Chimpanzee.TransRem.txt

D-statistic calculated WITHOUT Error Correction and WITH Ancient Transition removal

Specifically, the values contained in the four files are: mean(D)=average D-stat, JK-D=jackknife estimate of the D-stat, V(JK-D)=variance of the D-stat, Z=Z score, pvalue=pvalue from the Z score, nABBA=number of ABBA patterns observed, nBABA=number of BABA patterns observed, the names of the four populations. Note that the number of patterns might not be integer because of how ANGSD treats multiple genomes per populations.

In case of error correction, the R script also creates the folder result.errorDataFolder containing:

-the file barPlotErrors.Population1.Population2.PopWithError.Chimpanzee.pdf showing a barplot of the error rates
-the file plotAddErr.A2T.Population1.Population2.PopWithError.Chimpanzee.pdf showing the effect of error correction on transition A-->T