ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Allele Counts


Sometimes we want or need the frequency of the different bases. This is what -doCounts does.

You can refine which bases are included using the filter parameters -minMapQ/-minQ/-trim. Based on the total depth at each site, you can exclude sites from further analysis if the total depth is below or above some threshold (-setMinDepth/-setMaxDepth), and you can discard a site if the effective sample size is below some threshold (-minInd).

You can dump summary statistics such as the qscore distribution (-doQsDist), the depth distribution (-doDepth), or various per-site counts (-dumpCounts). All output files have a header, which should make interpretation straightforward.

Brief Overview

./angsd -doCounts 
	-> angsd version: 0.560	 build(Nov 28 2013 16:47:03)
	-> Analysis helpbox/synopsis information:
---------------
analysisCount.cpp:
	-doCounts	0	(Count the number A,C,G,T. All sites, All samples)
	-minQ		13	(remove bases with qscore<minQ)
	-setMaxDepth	-1	(If total depth is larger then site is removed from analysis.
				 -1 indicates no filtering)
	-setMinDepth	-1	(If total depth is smaller then site is removed from analysis.
				 -1 indicates no filtering)
	-trim		0	(trim ends of reads)
	-minInd		0	(Discard site if effective sample size below value.
				 0 indicates no filtering)
Filedumping:
	-doDepth	0	(dump distribution of seqdepth)	.depthSample,.depthGlobal
	  -maxDepth	100	(bin together high depths)
	-doQsDist	0	(dump distribution of qscores)	.qs
	-dumpCounts	0
	  1: total seqdepth for site	.pos.gz
	  2: seqdepth persample		.pos.gz,.counts.gz
	  3: A,C,G,T sum all samples	.pos.gz,.counts.gz
	  4: A,C,G,T sum every sample	.pos.gz,.counts.gz

Options

Filtering

-minQ [int]

Default 13. Discard bases with a qscore below this threshold.

-trim [int]

Default 0. Trim the ends of the reads. Useful for ancient DNA (aDNA).

-setMinDepth

Default -1 (no filtering). If the total depth is below this value, the site is discarded from all analyses.

-setMaxDepth

Default -1 (no filtering). If the total depth is above this value, the site is discarded from all analyses.
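
The two depth thresholds can be read as a simple keep/discard rule per site. The following is a minimal Python sketch of that rule as described above, not ANGSD's own code. The function name keep_site and its arguments are made up for illustration, and treating a depth exactly equal to a threshold as passing is an assumption based on the words "below" and "above".

```python
def keep_site(total_depth, set_min_depth=-1, set_max_depth=-1):
    """Return True if a site passes both depth thresholds.

    A threshold of -1 disables that bound, mirroring the documented defaults.
    Assumption: a depth equal to a threshold is kept.
    """
    if set_min_depth != -1 and total_depth < set_min_depth:
        return False
    if set_max_depth != -1 and total_depth > set_max_depth:
        return False
    return True


# Hypothetical total depths for four sites
depths = [1, 5, 20, 150]
kept = [d for d in depths if keep_site(d, set_min_depth=5, set_max_depth=100)]
print(kept)
```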

File dumping

-dumpCounts [int]

Default 0. See the examples below. Output files are called .pos.gz and .counts.gz.

-doQsDist [int]

Default 0. Dump the distribution of qscores. The output file is called .qs.

-doDepth [int]

Default 0. Dump the distribution of sequencing depths. Sites with a depth above -maxDepth will be binned together. Output files are called .depthSample and .depthGlobal.

-maxDepth [int]

Default 100. See -doDepth parameter.
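
To make the effect of -maxDepth concrete, here is a small Python sketch of a depth histogram in which all high depths share one final bin. It only illustrates the idea of binning; the function name depth_histogram is hypothetical, and the choice to pool every depth at or above the cap into the last bin is an assumption, since the exact layout of the .depthSample and .depthGlobal files is not described on this page.

```python
def depth_histogram(depths, max_depth=100):
    """Count sites per depth, pooling high depths into one final bin.

    Assumption: every depth >= max_depth lands in bin index max_depth.
    """
    hist = [0] * (max_depth + 1)
    for d in depths:
        hist[min(d, max_depth)] += 1
    return hist


# Hypothetical per-site depths with a small cap so the pooling is visible
hist = depth_histogram([0, 1, 1, 3, 7, 12], max_depth=5)
print(hist)
```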

Output formats

Printing Counts per site

-dumpCounts [int]

1: Print the overall depth in the .pos.gz file. This depth is the sum of reads covering a site across all individuals. The first column is the chromosome, the second is the position, and the third is the total depth.

chr	pos	totDepth
1	13999902	1
1	13999903	1
1	13999904	1
1	13999905	2
1	13999906	2
1	13999907	2
1	13999908	2
1	13999909	2
1	13999910	2
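
A file in this layout is easy to post-process. The sketch below, in Python, parses the three tab-separated columns and computes a mean depth. The function name read_total_depth is hypothetical, and the inline sample reuses a few rows from the table above rather than a real ANGSD run.

```python
import io


def read_total_depth(lines):
    """Yield (chrom, pos, depth) tuples from chr/pos/totDepth lines."""
    rows = iter(lines)
    next(rows)  # skip the header line
    for line in rows:
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        chrom, pos, depth = fields
        yield chrom, int(pos), int(depth)


# Rows copied from the example table above
sample = io.StringIO(
    "chr\tpos\ttotDepth\n"
    "1\t13999902\t1\n"
    "1\t13999905\t2\n"
    "1\t13999906\t2\n"
)
sites = list(read_total_depth(sample))
mean_depth = sum(depth for _, _, depth in sites) / len(sites)
print(sites[0], round(mean_depth, 2))
```

For a real file, the same function could be given a handle opened with gzip.open("out.pos.gz", "rt"), where the file name assumes the run used -out out.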

2: Print the depth of each individual. The example shows the depth of five individuals. Each line corresponds to the same line in the position file.

ind0TotDepth	ind1TotDepth	ind2TotDepth	ind3TotDepth	ind4TotDepth
0	0	0	7	0
0	3	0	0	0
0	0	4	4	0
0	0	0	0	1
5	0	0	0	0
0	0	10	0	0
0	0	0	0	1
0	4	0	10	0
0	0	0	2	0
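
As an illustration of how the per-individual depths might be summarised, the Python sketch below counts, for each site, how many individuals have at least one read and what the summed depth is. The function name individuals_with_data is hypothetical. Counting individuals with a non-zero depth is only a rough proxy for sample size; how ANGSD itself defines the effective sample size used by -minInd is not specified on this page.

```python
import io


def individuals_with_data(lines):
    """Yield (n_individuals_with_reads, total_depth) for each site line."""
    rows = iter(lines)
    next(rows)  # skip the header line
    for line in rows:
        depths = [int(x) for x in line.split()]
        if not depths:
            continue  # ignore blank lines
        yield sum(1 for d in depths if d > 0), sum(depths)


# Rows copied from the example table above
sample = io.StringIO(
    "ind0TotDepth\tind1TotDepth\tind2TotDepth\tind3TotDepth\tind4TotDepth\n"
    "0\t0\t0\t7\t0\n"
    "0\t0\t4\t4\t0\n"
    "0\t4\t0\t10\t0\n"
)
summary = list(individuals_with_data(sample))
print(summary)
```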

3: Print the depth for each of the four bases, summed across all individuals. Each line corresponds to the same line in the position file.

totA    totC    totG    totT
1       0       0       0
0       0       1       0
0       1       0       0
0       0       0       2
2       0       0       0
0       2       0       0
0       0       0       2
0       2       0       0
0       0       2       0
0       0       2       0
0       0       2       0
2       0       0       0
0       0       2       0
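
Since the stated purpose of -doCounts is to get at the frequency of the different bases, the Python sketch below turns one line of A, C, G, T totals into per-base frequencies. The function name base_frequencies is hypothetical, and the row "3 1 0 0" is an invented example added only to show a site with more than one observed base.

```python
def base_frequencies(line):
    """Convert a 'totA totC totG totT' count line into frequencies.

    Returns None when the site has no counted bases.
    """
    counts = [int(x) for x in line.split()]
    if len(counts) != 4:
        raise ValueError("expected four counts: A, C, G, T")
    total = sum(counts)
    if total == 0:
        return None
    return tuple(c / total for c in counts)


print(base_frequencies("0       0       0       2"))  # row from the table above
print(base_frequencies("3 1 0 0"))                    # invented mixed site
```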


4: Print the depth for each of the four bases for each individual at each site. In the example, the first four columns belong to the first individual and hold its counts of A, C, G and T; the next four columns belong to the second individual, and so on. Three individuals are shown. Each line corresponds to the same line in the position file.

ind0_A ind0_C ind0_G ind0_T ind1_A ind1_C ind1_G ind1_T ind2_A ind2_C ind2_G ind2_T 
0 1 0 0 0 0 0 0 0 0 0 0 
0 0 1 0 0 0 0 0 0 0 0 0 
1 0 0 0 0 0 0 0 0 0 0 0 
0 0 1 0 0 0 0 0 0 0 0 0 
1 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 1 0 0 0 0 0 0 0 0 
0 0 0 1 0 0 0 0 0 0 0 1 
0 0 1 0 0 0 0 0 0 0 1 0 
0 1 0 0 0 0 0 0 0 0 0 0 
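
Because every individual occupies four consecutive columns, a line can be regrouped into one (A, C, G, T) tuple per individual. The Python sketch below shows one way to do that; the function name split_by_individual is hypothetical.

```python
def split_by_individual(line):
    """Group a flat count line into one (A, C, G, T) tuple per individual."""
    values = [int(x) for x in line.split()]
    if len(values) % 4 != 0:
        raise ValueError("number of columns is not a multiple of four")
    return [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]


# Row copied from the example table above (three individuals)
per_individual = split_by_individual("0 0 0 1 0 0 0 0 0 0 0 1 ")
print(per_individual)
```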

Example

Print the per-individual depth from BAM files

./angsd -out out -doCounts 1 -dumpCounts 2 -bam bam.filelist

qscore Distribution

Column 1 is the qscore value, and column 2 is the corresponding count.

qscore	counts
13	87501
14	102888
15	113625
16	130494
17	145577
18	163049
19	180678
20	209447
21	247044
22	279325
23	332391
24	401459
25	484744
26	554127
27	609758
28	772123
29	1041218
30	1204349
31	1516248
32	1934112
33	2210498
34	2269812
35	2083536
36	1901735
37	1151146
38	441422
39	78625
40	21617
41	5870
42	1577
43	551
44	183
45	55
46	23
47	13
48	2
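
One use of this table is to check how much data would survive a stricter -minQ. The Python sketch below parses the two columns and reports the fraction of counted bases at or above a given qscore. The function names read_qs and fraction_at_or_above are hypothetical, and the inline sample uses only four rows from the table above, so the printed fraction does not describe the full distribution.

```python
import io


def read_qs(lines):
    """Parse qscore/counts lines (after a header) into {qscore: count}."""
    rows = iter(lines)
    next(rows)  # skip the header line
    dist = {}
    for line in rows:
        fields = line.split()
        if len(fields) == 2:
            dist[int(fields[0])] = int(fields[1])
    return dist


def fraction_at_or_above(dist, threshold):
    """Fraction of counted bases whose qscore is >= threshold."""
    total = sum(dist.values())
    if total == 0:
        return 0.0
    return sum(c for q, c in dist.items() if q >= threshold) / total


# Four rows copied from the table above
sample = io.StringIO("qscore\tcounts\n13\t87501\n20\t209447\n30\t1204349\n40\t21617\n")
dist = read_qs(sample)
print(fraction_at_or_above(dist, 30))
```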

Depth Distribution