ANGSD: Analysis of next generation Sequencing Data
Allele Counts
Revision as of 15:47, 29 November 2013
Sometimes we want or need the frequency of the different bases. This is what -doCounts does.
You can refine which bases are included using the filter parameters -minMapQ/-minQ/-trim. Based on the total depth at each site, you can discard sites from further analysis if the total depth is below or above a threshold (-setMinDepth/-setMaxDepth), and you can discard a site if the effective sample size is below a threshold (-minInd).
You can dump summary statistics such as the qscore distribution (-doQsDist), the depth distribution (-doDepth), or various per-site counts (-dumpCounts). All output files have a header, which should make interpretation straightforward.
Brief Overview
./angsd -doCounts
    -> angsd version: 0.560    build(Nov 28 2013 16:47:03)
    -> Analysis helpbox/synopsis information:
---------------
analysisCount.cpp:
    -doCounts     0    (Count the number A,C,G,T. All sites, All samples)
    -minQ         13   (remove bases with qscore<minQ)
    -setMaxDepth  -1   (If total depth is larger then site is removed from analysis. -1 indicates no filtering)
    -setMinDepth  -1   (If total depth is smaller then site is removed from analysis. -1 indicates no filtering)
    -trim         0    (trim ends of reads)
    -minInd       0    (Discard site if effective sample size below value. 0 indicates no filtering)
Filedumping:
    -doDepth      0    (dump distribution of seqdepth)    .depthSample,.depthGlobal
    -maxDepth     100  (bin together high depths)
    -doQsDist     0    (dump distribution of qscores)     .qs
    -dumpCounts   0
      1: total seqdepth for site      .pos.gz
      2: seqdepth persample           .pos.gz,.counts.gz
      3: A,C,G,T sum all samples      .pos.gz,.counts.gz
      4: A,C,G,T sum every sample     .pos.gz,.counts.gz
Options
Filtering
- -minQ [int]
Default 13. Discard bases with a qscore below this threshold.
- -trim [int]
Default 0. Trim the ends of the reads. Useful for aDNA.
- -setMinDepth [int]
Default -1 (no filtering). If the total depth is below this value, the site is discarded from all analyses.
- -setMaxDepth [int]
Default -1 (no filtering). If the total depth is above this value, the site is discarded from all analyses.
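The two depth thresholds can be read as a simple keep/discard rule. The Python sketch below is only an illustration of that rule, not ANGSD's own code: the function name keep_site is hypothetical, and it assumes that a site whose depth is exactly equal to a threshold is kept, which the option descriptions suggest but do not state outright.

```python
def keep_site(total_depth, set_min_depth=-1, set_max_depth=-1):
    """Return True if a site passes the depth thresholds.

    A value of -1 disables the corresponding threshold, mirroring the
    documented defaults. Assumption: depth equal to a threshold is kept.
    """
    if set_min_depth != -1 and total_depth < set_min_depth:
        return False  # total depth below the minimum
    if set_max_depth != -1 and total_depth > set_max_depth:
        return False  # total depth above the maximum
    return True


# Illustrative total depths for nine sites.
depths = [1, 1, 1, 2, 2, 2, 2, 2, 2]
kept = [d for d in depths if keep_site(d, set_min_depth=2)]
print(len(kept))  # 6
```

With both thresholds left at -1 every site passes, which matches the "no filtering" wording in the help text.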
File dumping
- -dumpCounts [int]
Default 0. See examples below. Output files are called .pos.gz,.counts.gz.
- -doQsDist [int]
Default 0. Dump the distribution of qscores. Output files are called .qs.
- -doDepth [int]
Default 0. Dump the distribution of sequencing depths. Sites with a depth above -maxDepth will be binned together. Output files are called .depthSample,.depthGlobal.
- -maxDepth [int]
Default 100. See -doDepth parameter.
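To make the binning concrete, here is a small Python sketch of a depth tally in which every depth above a cap is counted in the cap's bin. This is an assumption about what "binned" means here (depths greater than -maxDepth are counted as -maxDepth); the function binned_depth_counts is hypothetical and not part of ANGSD.

```python
from collections import Counter


def binned_depth_counts(depths, max_depth=100):
    """Count sites per depth, folding depths above max_depth into max_depth."""
    return Counter(min(d, max_depth) for d in depths)


# Illustrative depths; 7 and 12 both exceed the cap of 5.
counts = binned_depth_counts([1, 1, 3, 7, 12], max_depth=5)
print(sorted(counts.items()))  # [(1, 2), (3, 1), (5, 2)]
```

Under this reading the last bin is a lower bound on depth, so any mean computed from the distribution underestimates the true mean when that bin is non-empty.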
Output formats
Printing Counts per site
- -dumpCounts [int]
1: Prints the overall depth in the .pos file. This depth is the sum of the reads covering a site across all individuals. The first column is the chromosome, the second is the position, and the third is the total depth.
chr  pos       totDepth
1    13999902  1
1    13999903  1
1    13999904  1
1    13999905  2
1    13999906  2
1    13999907  2
1    13999908  2
1    13999909  2
1    13999910  2
2: Prints the depth of each individual. The example below shows the depth for five individuals. Each line corresponds to the same line in the position file.
ind0TotDepth  ind1TotDepth  ind2TotDepth  ind3TotDepth  ind4TotDepth
0             0             0             7             0
0             3             0             0             0
0             0             4             4             0
0             0             0             0             1
5             0             0             0             0
0             0             10            0             0
0             0             0             0             1
0             4             0             10            0
0             0             0             2             0
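Since the counts file carries no coordinates, its rows have to be paired with the position file by line number. The Python sketch below shows one way to do that pairing. It assumes both files have a single header line and whitespace-separated columns, and it uses small in-memory stand-ins with made-up values rather than real ANGSD output; read_rows is a hypothetical helper. For the gzip-compressed files named in the help text, gzip.open(path, "rt") from the standard library could replace io.StringIO.

```python
import io


def read_rows(handle):
    """Return (header, rows) from a whitespace-separated file with one header line."""
    header = handle.readline().split()
    rows = [line.split() for line in handle if line.strip()]
    return header, rows


# Made-up stand-ins for a position file and a per-individual counts file.
pos_text = "chr\tpos\ttotDepth\n1\t13999902\t1\n1\t13999903\t1\n"
counts_text = "ind0TotDepth\tind1TotDepth\n0\t1\n1\t0\n"

_, pos_rows = read_rows(io.StringIO(pos_text))
_, count_rows = read_rows(io.StringIO(counts_text))

# The pairing is only meaningful if both files describe the same sites.
if len(pos_rows) != len(count_rows):
    raise ValueError("position and counts files have different numbers of rows")

per_site = {
    (chrom, int(pos)): [int(c) for c in counts]
    for (chrom, pos, _tot), counts in zip(pos_rows, count_rows)
}
print(per_site[("1", 13999903)])  # [1, 0]
```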
3: Prints the depth for each of the four bases summed across all individuals. Each line corresponds to the same line in the position file.
totA  totC  totG  totT
1     0     0     0
0     0     1     0
0     1     0     0
0     0     0     2
2     0     0     0
0     2     0     0
0     0     0     2
0     2     0     0
0     0     2     0
0     0     2     0
0     0     2     0
2     0     0     0
0     0     2     0
4: Prints the depth for each of the four bases for each individual at each site. In the example, the first four columns belong to the first individual and give its counts of A, C, G and T. Only three individuals are shown. Each line corresponds to the same line in the position file.
ind0_A  ind0_C  ind0_G  ind0_T  ind1_A  ind1_C  ind1_G  ind1_T  ind2_A  ind2_C  ind2_G  ind2_T
0       1       0       0       0       0       0       0       0       0       0       0
0       0       1       0       0       0       0       0       0       0       0       0
1       0       0       0       0       0       0       0       0       0       0       0
0       0       1       0       0       0       0       0       0       0       0       0
1       0       0       0       0       0       0       0       0       0       0       0
0       0       0       1       0       0       0       0       0       0       0       0
0       0       0       1       0       0       0       0       0       0       0       1
0       0       1       0       0       0       0       0       0       0       1       0
0       1       0       0       0       0       0       0       0       0       0       0
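The per-individual layout of -dumpCounts 4 can be collapsed into the other views. The Python sketch below sums a row over individuals to get the four per-base totals (the quantity option 3 is described as reporting) and sums over bases to get a depth per individual (as in option 2). It assumes the column order shown in the header, that is A, C, G, T repeated for each individual; collapse_counts is a hypothetical helper, not an ANGSD function.

```python
def collapse_counts(row, n_bases=4):
    """Split a flat per-individual count row into base totals and per-individual depths."""
    if len(row) % n_bases:
        raise ValueError("row length is not a multiple of the number of bases")
    # One [A, C, G, T] list per individual.
    per_ind = [row[i:i + n_bases] for i in range(0, len(row), n_bases)]
    base_totals = [sum(col) for col in zip(*per_ind)]  # totA, totC, totG, totT
    ind_depths = [sum(ind) for ind in per_ind]         # depth per individual
    return base_totals, ind_depths


# One row with three individuals: a T in the first and a T in the third.
row = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
base_totals, ind_depths = collapse_counts(row)
print(base_totals)  # [0, 0, 0, 2]
print(ind_depths)   # [1, 0, 1]
```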
Example
Print the per-individual depth from BAM files
./angsd -out out -doCounts 1 -dumpCounts 2 -bam bam.filelist
qscore Distribution
Column 1 is the qscore value, and column 2 is the corresponding count.
qscore  counts
13      87501
14      102888
15      113625
16      130494
17      145577
18      163049
19      180678
20      209447
21      247044
22      279325
23      332391
24      401459
25      484744
26      554127
27      609758
28      772123
29      1041218
30      1204349
31      1516248
32      1934112
33      2210498
34      2269812
35      2083536
36      1901735
37      1151146
38      441422
39      78625
40      21617
41      5870
42      1577
43      551
44      183
45      55
46      23
47      13
48      2
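A distribution in this two-column form is easy to summarise. The Python sketch below parses such a table and reports the fraction of bases at or above a quality cutoff together with the mean qscore. It assumes one header line followed by whitespace-separated qscore/count pairs; the helper names and the three-row input are made up for illustration and are not real ANGSD output.

```python
def read_qs(text):
    """Parse 'qscore count' lines (after one header line) into a dict."""
    lines = text.strip().splitlines()[1:]  # skip the header line
    return {int(q): int(n) for q, n in (line.split() for line in lines)}


def fraction_at_least(dist, cutoff):
    """Fraction of all counted bases with qscore >= cutoff."""
    total = sum(dist.values())
    return sum(n for q, n in dist.items() if q >= cutoff) / total


def mean_qscore(dist):
    """Count-weighted mean qscore."""
    total = sum(dist.values())
    return sum(q * n for q, n in dist.items()) / total


# Made-up miniature distribution.
qs_text = "qscore\tcounts\n13\t10\n20\t30\n30\t60\n"
dist = read_qs(qs_text)
print(fraction_at_least(dist, 20))  # 0.9
print(mean_qscore(dist))            # 25.3
```

Note that bases below -minQ are removed before counting, so the distribution only describes bases that passed the filter.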
Depth Distribution
Column 1 in the .depthSample and .depthGlobal files contains the number of sites with a sequencing depth of 1, column 2 the number of sites with a sequencing depth of 2, and so on.
The .depthSample file contains the depth distribution per sample. Line 1 corresponds to individual 1, line 2 to individual 2, and so on.
29403 87426 162912 229726 267115 259774 222153 170894 114295 71777 41654 22149 11030 5305 2425 1037 419 257 84 60 31 18 19 16 25 16 10
26318 88728 171544 244276 275342 263071 217952 162616 107571 65839 37466 20070 10150 4828 2237 1110 531 253 111 31 3 0 0 0 0 0 0
0 211936 393333 422459 322225 191564 95488 39672 15427 5220 1658 460 157 90 71 53 38 24 60 32 7 1 2 2 4 2
The .depthGlobal file contains the depth distribution across all individuals.
395 4299 7207 13203 23358 37489 56976 80588 107748 131669 150595 160482 161650 153690 138321 118217 96207 75735 57501 41561 29112 19549 12818 8200 5114 3247 1936 1123 646 378 238 165 105 75 71 43 43 33 27 19 15 17 17 21 24 11 7 7 14 5 1 3 3 3 1 1 3 2 3 1 1 5 4 5 6 11 4 2 1 2 0
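One row of a depth file can be reduced to a few summary numbers. The Python sketch below computes the number of sites, the mean depth and the most common depth from such a row. The depth assigned to the first column is a parameter, set to 1 by default to follow the description above; the function summarize_depth_row and the four-value input are hypothetical and only meant as an illustration.

```python
def summarize_depth_row(counts, first_depth=1):
    """Return (n_sites, mean_depth, modal_depth) for one row of a depth file.

    counts[i] is the number of sites with depth first_depth + i.
    """
    depths = range(first_depth, first_depth + len(counts))
    n_sites = sum(counts)
    if n_sites == 0:
        raise ValueError("row contains no sites")
    mean_depth = sum(d * n for d, n in zip(depths, counts)) / n_sites
    modal_depth = max(zip(depths, counts), key=lambda dn: dn[1])[0]
    return n_sites, mean_depth, modal_depth


# Made-up row: 2 sites at depth 1, 5 at depth 2, 2 at depth 3, 1 at depth 4.
row = [int(x) for x in "2 5 2 1".split()]
print(summarize_depth_row(row))  # (10, 2.2, 2)
```

Because high depths are binned at -maxDepth, the mean from a real file is a lower bound whenever the last column is non-zero.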