ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Thetas,Tajima,Neutrality tests: Difference between revisions

From angsd
Jump to navigation Jump to search
Line 118: Line 118:


<pre>
<pre>
(59999,69999)(60000,70000)(60000,70000) chr1 65000  2349.039592    2008.865974    2791.401569    3817.828656    2913.347320    -0.545594       -0.626967       -0.486984       -0.617873       0.195337        10000
## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
(69999,79999)(70000,80000)(70000,80000) chr1  75000  2349.113388    1993.792014    2764.051812    3979.987797    2986.889940    -0.569871       -0.617112       -0.456779       -0.678388       0.220762       10000
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop)  Chr    WinCenter      tW      tP      tF      tH      tL      Tajima fuf    fud    fayh    zeng    nSites
(79999,89999)(80000,90000)(80000,90000) chr1  85000  2349.154140    2035.577279    2649.132059    3902.254435    2968.915852    -0.502912      -0.491556      -0.330221       -0.637555      0.214522       10000
(0,98316)(14000032,14100082)(0,14100082)        1      7050041 51.002623      46.171402      64.683834      51.290955      48.731178      -0.392892       -0.647071       -0.595302       -0.099654       -0.048444      98316
(89999,99999)(90000,100000)(90000,100000)      chr1  95000  2349.462773    2048.143641    2533.193917    3881.554872    2964.849262    -0.483190      -0.388552      -0.202228      -0.626111       0.212980       10000
(0,98474)(13999910,14100060)(0,14100060)       2      7050030 92.689100      88.806005      101.768262      122.422498      105.614255      -0.174701       -0.252477       -0.220588       -0.360944       0.152373       98474
(99999,109999)(100000,110000)(100000,110000)   chr1  105000  2349.306947    2103.402129    2608.611593    3738.658529    2921.030347    -0.394355      -0.404727      -0.285429      -0.558478       0.197881       10000
(0,93269)(14000529,14100095)(0,14100095)       3      7050047 70.757874      76.248087      75.447438      68.354514      72.301301      0.322902        0.020330        -0.148419       0.110921        0.023794       93269
(109999,119999)(110000,120000)(110000,120000)   chr1  115000  2348.965451    1867.325681    2725.815492    4491.310734    3179.318214    -0.772512      -0.687843      -0.414876      -0.896283       0.287438       10000
(0,96339)(13999912,14100064)(0,14100064)       4       7050032 99.748624      107.898618      94.265208      130.283528      119.091076      0.340878        0.247030        0.123956        -0.223386       0.211971       96339
(119999,129999)(120000,130000)(120000,130000)   chr1  125000  2349.437816    2077.636124    2623.517860    3755.631838    2916.633993    -0.435861       -0.437286       -0.301676       -0.573043       0.196304       10000
(0,99659)(13999926,14100063)(0,14100063)       5      7050031 120.941697      132.667821      86.726667      163.908351      148.288088      0.404945        0.688320        0.639821        -0.257254       0.247395       99659
(0,99541)(13999918,14100103)(0,14100103)       6      7050051 96.666344      112.146685      69.740992      143.403712      127.775201      0.667988        0.792499        0.627735        -0.321842       0.351730       99541
(0,99786)(13999926,14100047)(0,14100047)       7      7050023 93.164548      92.023886      92.742574      142.413716      117.218807      -0.051058       -0.013928       0.010201        -0.538288      0.282133        99786
(0,98759)(13999923,14100082)(0,14100082)        8      7050041 133.567125      177.157879      72.197498       204.069028      190.613463      1.363708        1.425567        1.040517        -0.200700       0.467490       98759
 
</pre>
</pre>
Format is:
Format is:

Revision as of 13:19, 30 June 2014

This method will estimate different thetas (population scaled mutation rate) and can based on these thetas calculate Tajima's D and various other neutrality test statistics. Method is described in Korneliussen2013.

  • NB Information on this website is for version 0.551 or higher.
  • NB The Korneliussen2013 covers two methods,
  1. using an ML method
  2. using the emperical Bayes (EB) method. The information on this page relates to the EB method.

For performing the ML method, you should the use the SFS Estimation method and define the region af interest.

Example

Below is a chain of commands used for caculating statistics. These are based on the test files that can be dowloaded on the Quick Start page.

Its a 3 step procedure

  1. Estimate an site frequency spectrum. Output is out.sfs file. This is what is being used as the -pest argument in step2.
  2. Calculate per-site thetas. Output is a .thetas.gz file.
  3. Calculate neutrality tests statistics. Output is a .thetas.gz.pestPG file.


First estimate the site allele frequency likelihood

./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 2 -P 24 -out out


        -> Reading fasta: chimpHg19.fa
        -> Parsing 10 number of samples 
        -> Printing at chr: 20 pos:14095817 chunknumber 3500
        -> Done reading data waiting for calculations to finish
        -> Calling destroy
        -> Done waiting for threads
        -> Output filenames:
                ->"out.arg"
                ->"out.saf"
                ->"out.saf.pos.gz"
        -> Mon Jun 30 12:02:58 2014
        -> Arguments and parameters for all analysis are located in .arg file
        [ALL done] cpu-time used =  47.19 sec
        [ALL done] walltime used =  43.00 sec




Obtain the maximum likelihood estimate of the SFS using the realSFS program found in the misc subfolder. (See more here realSFS)

misc/realSFS out.saf 20 -P 24 > out.sfs

To plot the SFS in R :

s<-exp(scan('out.sfs'))
s<-s[-c(1,length(s))]
s<-s/sum(s)
barplot(s,names=1:length(s),main='SFS')
 


Calculate the thetas for each site

./angsd -bam bam.filelist -out out -doThetas 1 -doSaf 1 -pest out.sfs -anc chimpHg19.fa -GL 2

Estimate Tajimas D

#create a binary version of thete.thetas.gz 
misc/thetaStat make_bed out.thetas.gz
#calculate Tajimas D
misc/thetaStat do_stat out.thetas.gz -nChr 20

Remember that you will need to supply the ancestral state for the SFS Estimation, and you should try to remove the worst data by -minMapQ and -minQ.

Sliding Window example

We can easily do a sliding window analysis by adding -win/-step arguments to the last command. thetaStat

misc/thetaStat do_stat theta.thetas.gz -nChr 20 -win 50000 -step 10000  -outnames theta.thetasWindow.gz

This will calculate the test statistic using a window size of 50kb and a step size of 10kb.

Example Output

.thetas.gz is

#Chromo Pos     Watterson       Pairwise        thetaSingleton  thetaH  thetaL
1       14000032        -9.457420       -10.372069      -8.319252       -13.025778      -10.997194
1       14000033        -9.463637       -10.379368      -8.324414       -13.035780      -11.004670
1       14000034        -9.463740       -10.379488      -8.324500       -13.035942      -11.004793
1       14000035        -9.463603       -10.379328      -8.324386       -13.035725      -11.004629
1       14000036        -9.323246       -10.218453      -8.204848       -12.826627      -10.840519
1       14000037        -9.179270       -10.048883      -8.086425       -12.596436      -10.666670
1       14000038        -9.004664       -9.845473       -7.941453       -12.328274      -10.458416
1       14000039        -9.327033       -10.222983      -8.207914       -12.833007      -10.845176
1       14000040        -9.621554       -10.557563      -8.461745       -13.262415      -11.185971
1       14000041        -9.617449       -10.552869      -8.458225       -13.256257      -11.181185
1       14000042        -7.337841       -8.161756       -204.045433     -5.457443       -6.085818
1       14000043        -9.570405       -10.502160      -8.415195       -13.197596      -11.129976
1       14000044        -9.511097       -10.434558      -8.364249       -13.110037      -11.061100
1       14000045        -9.563664       -10.494371      -8.409489       -13.187203      -11.122022
1       14000046        -9.617690       -10.555402      -8.456395       -13.265004      -11.184107
1       14000047        -9.563722       -10.494438      -8.409538       -13.187292      -11.122090
1       14000048        -9.856578       -10.819096      -8.669691       -13.587898      -11.451396
1. chromosome
2. position
3. ThetaWatterson
4. ThetaD (nucleotide diversity)
5. Theta? (singleton category)
6. ThetaH
7. ThetaL

.thetas.gz.pestPG

The .pestPG file is a 14 column file (tab seperated). The first column contains information about the region. The second and third column is the reference name and the center of the window.

We then have 5 different estimators of theta, these are: Watterson, pairwise, FuLi, fayH, L. And we have 5 different neutrality test statistics: Tajima's D, Fu&Li F's, Fu&Li's D, Fay's H, Zeng's E. The final column is the effetive number of sites with data in the window.

## thetaStat VERSION: 0.01 build:(Jun 30 2014,12:06:12)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop)   Chr     WinCenter       tW      tP      tF      tH      tL      Tajima  fuf     fud     fayh    zeng    nSites
(0,98316)(14000032,14100082)(0,14100082)        1       7050041 51.002623       46.171402       64.683834       51.290955       48.731178       -0.392892       -0.647071       -0.595302       -0.099654       -0.048444       98316
(0,98474)(13999910,14100060)(0,14100060)        2       7050030 92.689100       88.806005       101.768262      122.422498      105.614255      -0.174701       -0.252477       -0.220588       -0.360944       0.152373        98474
(0,93269)(14000529,14100095)(0,14100095)        3       7050047 70.757874       76.248087       75.447438       68.354514       72.301301       0.322902        0.020330        -0.148419       0.110921        0.023794        93269
(0,96339)(13999912,14100064)(0,14100064)        4       7050032 99.748624       107.898618      94.265208       130.283528      119.091076      0.340878        0.247030        0.123956        -0.223386       0.211971        96339
(0,99659)(13999926,14100063)(0,14100063)        5       7050031 120.941697      132.667821      86.726667       163.908351      148.288088      0.404945        0.688320        0.639821        -0.257254       0.247395        99659
(0,99541)(13999918,14100103)(0,14100103)        6       7050051 96.666344       112.146685      69.740992       143.403712      127.775201      0.667988        0.792499        0.627735        -0.321842       0.351730        99541
(0,99786)(13999926,14100047)(0,14100047)        7       7050023 93.164548       92.023886       92.742574       142.413716      117.218807      -0.051058       -0.013928       0.010201        -0.538288       0.282133        99786
(0,98759)(13999923,14100082)(0,14100082)        8       7050041 133.567125      177.157879      72.197498       204.069028      190.613463      1.363708        1.425567        1.040517        -0.200700       0.467490        98759

Format is:

(indexStart,indexStop)(posStart,posStop)(regStat,regStop) chrname wincenter tW tP tF tH tL tajD fulif fuliD fayH zengsE numSites

Most likely you are just interest in the wincenter (column 3) and the column 9 which is the Tajima's D statistic.

The first 3 columns relates to the region. The next 5 columns are 5 different estimators of theta, and the next 5 columns are neutrality test statistics. The final column is the number of sites with data in the region.


The first ()()() er mainly used for debugging the sliding window program. The interpretation is:

  • The posStart and posStop is the first physical position, and last physical postion of sites included in the analysis.
  • The regStat and regStop is the physical region for which the analysis is performed. Therefore the posStat and posStop is always included within the regStart and regStop
  • The indexStart and IndexStop is the position within the internal array.

Unknown ancestral state (folded sfs)

  • Below is for version 0.556 and above

If you don't have the ancestral states, you can still calculate the Watterson and Tajima theta, which means you can perform the Tajima's D neutrality test statistic. But this requires you to use the folded sfs. The output files will have the same format, but only the thetaW and thetaD, and tajimas D is meaningful.

Below is an example based on the earlier example where we now base our analysis on the folded spectrum. Notice the -fold 1 and that the second parameter to the emOptim2 is now 10 instead for 20.



First estimate the folded site allele frequency likelihood

./angsd -bam bam.filelist -doSaf 1 -anc hg19.fa -GL 2 -P 24 -out outFold -fold 1

Obtain the maximum likelihood estimate of the SFS

misc/emOptim2 outFold.saf 10 -P 24 > outFold.sfs

Calculate the thetas (remember to fold)

./angsd -bam bam.filelist -out outFold -doThetas 1 -doSaf 1 -pest outFold.sfs -anc hg19.fa -GL 2 -fold 1

Estimate Tajimas D

#create a binary version of thete.thetas.gz 
misc/thetaStat make_bed outFold.thetas.gz
#calculate Tajimas D
misc/thetaStat do_stat outFold.thetas.gz -nChr 10

Citation

Korneliussen2013