ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Genotype likelihoods from alignments new

Genotype likelihoods are the likelihood of the data given the genotype. In angsd we have implemented four different genotype likelihood models.

  1. SAMtools
  2. GATK (simplified)
  3. SOAPsnp
  4. Su Yeon Kim
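
To make "the likelihood of the data given the genotype" concrete, here is a minimal Python sketch of a GATK-style model as it is commonly described. It is not taken from calcGL.cpp. It assumes independent reads, a per-base error rate derived from the Phred quality score, and errors spread evenly over the three other bases. The function names and the min_q argument (mirroring -minQ) are hypothetical.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The 10 unordered diploid genotypes in alphabetical order:
# AA AC AG AT CC CG CT GG GT TT
GENOTYPES = ["".join(g) for g in combinations_with_replacement(BASES, 2)]


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def base_prob(observed, allele, err):
    """P(observed base | true allele), assuming errors are uniform over 3 bases."""
    return 1 - err if observed == allele else err / 3


def genotype_log_likelihoods(bases, quals, min_q=13):
    """Natural-log likelihood of the observed bases for each of the 10 genotypes."""
    loglikes = {g: 0.0 for g in GENOTYPES}
    for b, q in zip(bases, quals):
        if q < min_q:  # assumption: mimic -minQ by skipping low-quality bases
            continue
        err = phred_to_error(q)
        for g in GENOTYPES:
            # Each read samples one of the two alleles with probability 1/2.
            p = 0.5 * base_prob(b, g[0], err) + 0.5 * base_prob(b, g[1], err)
            loglikes[g] += math.log(p)
    return loglikes


if __name__ == "__main__":
    ll = genotype_log_likelihoods("AACA", [30, 30, 20, 35])
    for g in GENOTYPES:
        print(g, round(ll[g], 3))
```

The SAMtools, SOAPsnp and SYK models differ in how they treat errors, so this sketch only illustrates the general idea.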

Brief Overview

calcGL.cpp:
	-calcGL=0: 
	1: SAMtools
	2: GATK
	3: SOAPsnp
	4: SYK
	-minQ		13		(remove bases with qscore<minQ)
	-trim		0		(zero means no trimming)
	-tmpdir		angsd_tmpdir/	(used by SOAPsnp)
	-errors		(null)		(used by SYK)
	-minInd		-1		(-1 indicates no filtering)

Filedumping:
	-writeGL	0
	1: binary glf (10 log likes)	.glf
	2: beagle likelihood file	.beagle.gz
	3: binary 3 times likelihood	.glf
	4: text version (10 log likes)	.glf

Genotype likelihoods from alignments

<classdiagram> // [input|bam files;SOAP files{bg:orange}]->[sequence data]

[sequence data]->[genotype likelihoods|SAMtools;GATK;SOAPsnp;Kim et.al]
</classdiagram>


-GL [int]

If your input is sequencing data (for example BAM files), you can estimate genotype likelihoods from the mapped reads. Four different methods are available.

Samtools

-GL 1

This method has a random component. SAMtools contains a stochastic component, so to get exactly the same results as SAMtools use -nThreads 1. The method itself is the same with multiple threads, but some sites will show small differences compared to the SAMtools output because of the stochastic component.

options

-minQ [int]

default 13. The minimum allowed base quality score.

example

./angsd -bam bam.filelist -GL 1 -out outfile

GATK

-GL 2

options

-minQ [int]

default 13. The minimum allowed base quality score.

example

./angsd -bam bam.filelist -GL 2 -out outfile

soapSNP

-GL 3

When estimating genotype likelihoods with soapSNP, a calibration matrix has to be generated first. This is done automatically if the matrix files do not exist. The files are located in angsd_tmpdir/basenameNUM.count and angsd_tmpdir/basenameNUM.qual.

options

-minQ [int]

default 13. The minimum allowed base quality score.

-tmpdir [dirname]

default angsd_tmpdir/. The directory containing the calibration matrix files.

example

./angsd -bam bam.filelist -GL 3 -out outfile -ref hg19.fa 

This first run estimates nothing other than the calibration matrix. Once it has finished, we can run the analysis we want:

./angsd -bam bam.filelist -GL 3 -out outfile -ref hg19.fa -doGlf 1

NB: internally, the maximum read length is not allowed to exceed 256.

Kim et al.

-GL 4 Citation Citation

options

-errors [filename]

A file with the estimated type-specific error rates (see Error_estimation).

example

./angsd -bam bam.filelist -GL 4 -out outfile -errors error.file 

output genotype likelihoods

-doGlf [int]

Output the log genotype likelihoods to a file

0. don't dump anything (default)
1. binary, all 10 log likelihoods
2. beagle text
3. beagle binary
4. text output of all 10 log likelihoods


binary

Glf file in binary doubles. All 10 genotype log likelihoods are written to a file. For each printed site there are 10*N doubles, where N is the number of individuals. The order of the 10 genotypes is alphabetical: AA AC AG AT CC CG CT GG GT TT.
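
The layout described above can be read back with a few lines of Python. This is a minimal sketch under stated assumptions: the file is uncompressed, the doubles are 8 bytes in native byte order, and within each site the 10 values of individual 0 come first, then those of individual 1, and so on. The function name read_glf is hypothetical, and the number of individuals must be supplied by the caller because the file carries no header.

```python
import struct

# Alphabetical genotype order, as stated in the text.
GENOTYPES = "AA AC AG AT CC CG CT GG GT TT".split()


def read_glf(path, n_ind):
    """Yield one list per site; each list holds one genotype->loglike dict per individual."""
    n_values = 10 * n_ind
    site_size = n_values * 8  # 8 bytes per double
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(site_size)
            if not chunk:
                break
            if len(chunk) != site_size:
                raise ValueError("file size is not a multiple of 10*N doubles")
            values = struct.unpack(f"{n_values}d", chunk)
            # Assumption: individuals are stored one after another within a site.
            yield [
                dict(zip(GENOTYPES, values[i * 10:(i + 1) * 10]))
                for i in range(n_ind)
            ]
```

If your ANGSD version writes a compressed file instead, decompress it first or replace open with a suitable reader. The sketch does not attempt to detect compression.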


Beagle format

Beagle haplotype imputation can be performed directly on genotype likelihoods. To generate a Beagle input file, use

-doGlf 2

In order to make this file, the major and minor alleles have to be inferred (-doMajorMinor). It is also a good idea to use only the polymorphic sites.
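
The reason the major and minor alleles are needed is that the Beagle file holds only three values per individual (major/major, major/minor, minor/minor) instead of all 10. The following Python sketch shows one way such a triplet could be derived from the 10 log likelihoods. It assumes natural-log likelihoods in the alphabetical genotype order; beagle_triplet is a hypothetical helper, not ANGSD's own code.

```python
import math

GENOTYPES = "AA AC AG AT CC CG CT GG GT TT".split()
INDEX = {g: i for i, g in enumerate(GENOTYPES)}


def beagle_triplet(loglikes10, major, minor):
    """Pick the three major/minor genotypes and scale them to sum to one."""
    def pos(a, b):
        # Genotypes are unordered, so sort the two alleles before lookup.
        return INDEX["".join(sorted(a + b))]

    logs = [
        loglikes10[pos(major, major)],
        loglikes10[pos(major, minor)],
        loglikes10[pos(minor, minor)],
    ]
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(logs)
    scaled = [math.exp(value - top) for value in logs]
    total = sum(scaled)
    return [s / total for s in scaled]


if __name__ == "__main__":
    # An individual without data: all genotypes equally likely.
    print(beagle_triplet([0.0] * 10, "A", "C"))
```

An individual with no reads at a site gets three equal values, which is the 0.333333 pattern that appears in Beagle output.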


Example

In this example our input files are BAM files. We use the SAMtools genotype likelihood method and 10 threads. We infer the major and minor alleles from the likelihoods and estimate the allele frequencies. We test for polymorphic sites and only output the ones with a likelihood ratio test statistic of at least 24 (approx. p-value < 1e-6).

./angsd -GL 1 -out genolike -nThreads 10 -doGlf 2 -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 2 -bam bam.filelist
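
The correspondence between a test statistic of 24 and a p-value of about 1e-6 can be checked with a short calculation. The sketch below assumes the likelihood ratio statistic follows a chi-square distribution with one degree of freedom, for which the upper tail probability equals erfc(sqrt(x/2)). The function name chi2_sf_1df is hypothetical.

```python
import math


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


if __name__ == "__main__":
    # A statistic of 24 gives a p-value just under 1e-6.
    print(chi2_sf_1df(24.0))
```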

output

The above command generates the file genolike.beagle.gz, which can be used as input for the Beagle software.

marker  allele1 allele2 Ind0    Ind0    Ind0    Ind1    Ind1    Ind1    Ind2    Ind2    Ind2    Ind3    Ind3    Ind3 
1_14000023      1       0       0.941177        0.058822        0.000001        0.799685        0.199918        0.000397        0.666316        0.333155        0.000529 
1_14000072      2       3       0.709983        0.177493        0.112525        0.941178        0.058822        0.000000        0.665554        0.332774        0.001672
1_14000113      0       2       0.855993        0.106996        0.037010        0.333333        0.333333        0.333333        0.799971        0.199989        0.000040 
1_14000202      2       0       0.835380        0.104420        0.060201        0.799685        0.199918        0.000397        0.333333        0.333333        0.333333
...

Note that the above values sum to one per site for each individual. This is just a normalization of the genotype likelihoods to avoid underflow problems in the Beagle software; it does not mean that they are genotype probabilities.
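
The normalization can be verified on a data line from the output above. The Python sketch below splits a line into the marker, the two allele codes and one triplet per individual, then checks that each triplet sums to one within rounding. parse_beagle_line is a hypothetical helper. The allele decoding (0=A, 1=C, 2=G, 3=T) is an assumption about the numeric codes and should be checked against your ANGSD version.

```python
# Assumption: numeric allele codes map to bases in this order.
ALLELE_CODES = {"0": "A", "1": "C", "2": "G", "3": "T"}


def parse_beagle_line(line):
    """Return (marker, allele1, allele2, triplets) for one data line."""
    fields = line.split()
    marker, allele1, allele2 = fields[0], fields[1], fields[2]
    values = [float(x) for x in fields[3:]]
    if len(values) % 3 != 0:
        raise ValueError("expected three values per individual")
    triplets = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return marker, allele1, allele2, triplets


if __name__ == "__main__":
    # First data line of the example output above.
    row = ("1_14000023 1 0 0.941177 0.058822 0.000001 "
           "0.799685 0.199918 0.000397 0.666316 0.333155 0.000529")
    marker, a1, a2, triplets = parse_beagle_line(row)
    print(marker, ALLELE_CODES[a1], ALLELE_CODES[a2])
    for ind, triplet in enumerate(triplets):
        # Values are printed with six decimals, so allow a small tolerance.
        assert abs(sum(triplet) - 1.0) < 1e-5
        print("Ind%d" % ind, triplet)
```

To read the real file, the lines could be taken from gzip.open("genolike.beagle.gz", "rt"), skipping the header line.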

simple text format