ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

2d SFS Estimation

From angsd
Revision as of 14:34, 11 May 2015 by Thorfinn (talk | contribs)
Jump to navigation Jump to search

Angsd can estimate a 2d site frequency spectrum. This is an extension of the 1d site frequency spectrum method. Never versions of ANGSD can estimate even higher dimensions (upto 4)

And is best explained by a full example.

Example

  • Assume you have a 12 bamfiles for population in the file pop1.list
  • Assume you have a 14 bamfiles for population in the file pop2.list
  • Assume you have a fastafile containing the ancestral state in the anc.fa
  • Assume we are only interested in chr1

Let's start by finding the positions for which we have data in population1 and population2

# as always you can add -minMapQ 1 and -minQ 20 to only keep high quality data.
angsd -GL 1 -b pop1.list -anc anc.fa -r chr1: -P 10 -out pop1 -doSaf 1
angsd -GL 1 -b pop2.list -anc anc.fa -r chr1: -P 10 -out pop2 -doSaf 1

Each run will generate 2 files of interest: pop1.saf,pop1.saf.pos and pop2.saf,pop2.saf.pos

If we were interested in estimating the 1d sfs for each population we could do it like this using the realSFS program. (See more on page )

realSFS pop1.saf 24 -P 24 >pop1.saf.sfs
realSFS pop2.saf 28 -P 24 >pop2.saf.sfs
#first argument is saf file, second argument is the number of chromosomes, -P 24 is the number of cores we want to use

Now we find the positions that occurs both in population1 and population2 using the uniq POSIX program.

gunzip -c pop1.saf.pos.gz pop2.saf.pos.gz|sort  -S 50%|uniq -d|sort -k1,1  -S 50% >intersect.txt

And now we redo the angsd sample allelefrequence calculation by conditioning on the sites that occur in both populations

# as always you can add -minMapQ 1 and -minQ 20 to only keep high quality data.
angsd -GL 1 -b pop1.list -anc anc.fa -r chr1: -P 10 -out pop1 -sites intersect.txt -doSaf 1
angsd -GL 1 -b pop2.list -anc anc.fa -r chr1: -P 10 -out pop2 -sites intersect.txt -doSaf 1

Notice that the last 2 commands will overwrite the: pop1.saf,pop1.saf.pos and pop2.saf,pop2.saf.pos files.


And we now estimate the joint site frequency spectra by using the realSFS program

realSFS 2dsfs pop1.saf pop2.saf  24 28 -P 24 >2dsfs.sfs

The output is then located in a nice matrix format(25x29) in the file: 2dsfs.sfs. Good luck visualising it, some people are using dadi, we have been using heat maps in R.