ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Association study

-doAsso

  • -adjust
  • -yBin
  • -yQuant
  • -cov
  • -assoCutoff
  • -sitePerm



Revision as of 17:34, 18 June 2012


About

ANGSD is software for analyzing next generation sequencing data. It can handle a number of different input types, from mapped reads to imputed genotype probabilities. All methods take genotype uncertainty into account instead of basing the analysis on called genotypes. This is especially useful for low and medium depth data. The software is written in C++ and can handle thousands of samples.
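To illustrate what taking genotype uncertainty into account can mean in practice, here is a minimal Python sketch that estimates an allele frequency directly from genotype likelihoods with a simple EM loop, next to the estimate obtained from called genotypes. It is not ANGSD's implementation; the function names, the Hardy-Weinberg prior and the starting value are assumptions made for illustration.

```python
def em_allele_frequency(likelihoods, start=0.2, max_iter=100, tol=1e-8):
    """Estimate the minor allele frequency from genotype likelihoods.

    likelihoods: one (L0, L1, L2) tuple per individual, the likelihood of the
    data given 0, 1 or 2 copies of the minor allele (not log-scaled).
    Hypothetical helper, assuming Hardy-Weinberg proportions.
    """
    f = start
    for _ in range(max_iter):
        expected_copies = 0.0
        for l0, l1, l2 in likelihoods:
            # Genotype likelihood weighted by its Hardy-Weinberg prior.
            w0 = l0 * (1 - f) ** 2
            w1 = l1 * 2 * f * (1 - f)
            w2 = l2 * f ** 2
            # Posterior expected number of minor alleles for this individual.
            expected_copies += (w1 + 2 * w2) / (w0 + w1 + w2)
        new_f = expected_copies / (2 * len(likelihoods))
        if abs(new_f - f) < tol:
            return new_f
        f = new_f
    return f


def called_genotype_frequency(likelihoods):
    """Frequency obtained by first calling the most likely genotype."""
    calls = [max(range(3), key=lambda g: gl[g]) for gl in likelihoods]
    return sum(calls) / (2 * len(calls))
```

With low depth data the likelihoods are often far from 0 or 1, so the two functions can give noticeably different answers; with fully informative likelihoods they agree.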

Overview of input and intermediary data

<classdiagram type="dir:LR"> [sequence data]->[genotype;likelihoods] [genotype;likelihoods]->[genotype;probabilities] [sequence files|bam files;SOAP files{bg:orange}]->[sequence data] [glf files|glfv3;soapSNP{bg:orange}]->[genotype;likelihoods] [genotype prob|beagle output{bg:orange}]->[genotype;probabilities] </classdiagram>

Analysis from sequencing data

<classdiagram> // [input|bam files;SOAP files{bg:orange}]->[sequence data]

[sequence data]->[output|summary stats;phat estimates;error estimates{bg:blue}]
</classdiagram>

Analysis from genotype likelihoods

<classdiagram> //[input data|glf files{bg:orange}]->[genotype;likelihoods] [genotype;likelihoods]->[output|glf files;beagle files;MAF estimates;MAF associations;SNP Calling;realSFS;error estimates;Inbreeding{bg:blue}]

</classdiagram>


Analysis from genotype probabilities

<classdiagram> //[input data|beagle output{bg:orange}]->[genotype;probabilities] [genotype;probabilities]->[output|genotype calling;MAF estimates;associations{bg:blue}]

</classdiagram>


Commands

samtools

angsd.g++ -outfiles samtools -GL 1 uppile -b ceu5.mapped.list

soapsnp

If no recalibration matrices exist, they will be created first.

angsd.g++ -outfiles soapsnp -GL 0 uppile -b ceu5.mapped.list

gatk

angsd.g++ -outfiles gatk -GL 2 uppile -b ceu5.mapped.list


angsd.g++ -outfiles suyeon -doCounts 1 -qs 20 -doLike 1 uppile -b ceu5.mapped.list

All options

Error rates

-errorFile

Outfile name

  • -outfiles

Use subset of data

  • -start
  • -stop
  • -target
  • -strand
  • -minHigh
  • -minCount
  • -cutOff
  • -downsample
  • -minDepth
  • -minInd

The minimum number of individuals with genotype data to analyze and print results for.

  • -minMaf

The minimum estimated MAF to use in analysis and to print results for.
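The effect of the two thresholds described above can be sketched as a simple filter over per-site records. The `Site` record and `filter_sites` function are hypothetical, and treating both thresholds as inclusive is an assumption, not something this page states.

```python
from dataclasses import dataclass


@dataclass
class Site:
    position: int
    n_ind_with_data: int  # individuals with genotype data at this site
    maf: float            # estimated minor allele frequency


def filter_sites(sites, min_ind=0, min_maf=0.0):
    """Keep only the sites that pass both thresholds (assumed inclusive)."""
    return [
        site for site in sites
        if site.n_ind_with_data >= min_ind and site.maf >= min_maf
    ]


sites = [Site(100, 3, 0.01), Site(200, 10, 0.20), Site(300, 10, 0.04)]
kept = filter_sites(sites, min_ind=5, min_maf=0.05)
print([site.position for site in kept])
```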

Frequency estimation

  • -doMaf
  • -doSNP

Run options

  • -chunkSize
  • -nLines
  • -nThreads

Estimate genotype likelihoods

  • -doGLF INT

INT=1 binary output

INT=2 text beagle output

INT>2 text output

Estimate the covariance matrix

  • -getCovar


  • -emIter



  • -doError


  • -eps
  • -nInd

Estimate SFS realSFS

-realSFS 1

Run Examples

Using glfv3

../dirty -samglf ceu.glf.list -outfiles test.glf -doMaf 2 -fai numSort.Fai -nLines 50000 -chunkSize 500 -nThreads 16

Using mpileup

./dirty.g++ -chunkSize 200000 -outfiles ASDF -doMaf 2 -nThreads 10 mpileup -g -r 21:1-20000000 -I ~/sample/*.chr21 >bcfoutput

The first item is the program name, followed by the arguments used by dirty, followed by mpileup and the arguments that will be passed directly to SAMtools.

From version 0.25, we can now get the nucleotide count for every site, for every sample. This is done by omitting the -g parameter.
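The command-line layout described above (program name, the program's own arguments, then mpileup and everything after it) can be sketched in Python as follows. `split_command_line` is a hypothetical helper written for illustration; the program's actual argument parsing is not shown on this page.

```python
def split_command_line(argv):
    """Split argv into (program, own_args, samtools_args).

    Everything from the first "mpileup" token onwards is treated as
    arguments for SAMtools; the rest belongs to the program itself.
    """
    program, rest = argv[0], argv[1:]
    if "mpileup" in rest:
        cut = rest.index("mpileup")
        return program, rest[:cut], rest[cut:]
    return program, rest, []


argv = ("./dirty.g++ -chunkSize 200000 -outfiles ASDF -doMaf 2 "
        "-nThreads 10 mpileup -g -r 21:1-20000000").split()
program, own_args, samtools_args = split_command_line(argv)
print(program)
print(own_args)
print(samtools_args)
```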

Using tglf files

cd into the teststuff subfolder

../dirty.g++ -nThreads 1 -tglf lct.list -posfile lct.pos -nLines 100000 -outfiles GG -doMaf 15 -doSNP 1

If we want to estimate the SFS

../dirty.g++ -nThreads 5 -tglf lct.list -posfile lct.pos -nLines 100000 -outfiles KK -realSFS 1

Using soapfiles

../dirty.g++ -soap tsk.sub10.list -doMaf 2 -outfiles NEW10 -chunkSize 1000 -nLines 10000 -nThreads 4

-soap is a file list containing the soap files. Each soap file must be sorted by chromosome name (lexical ordering) and then by position.
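A quick way to check this ordering before running the program is sketched below. The function works on (chromosome, position) pairs that have already been extracted from a soap file; which columns hold these values is deliberately left out, and `first_unsorted_record` is a hypothetical helper. Note that lexical ordering places "chr10" before "chr2".

```python
def first_unsorted_record(records):
    """Return the index of the first out-of-order record, or None.

    records: iterable of (chromosome, position) pairs. Chromosome names are
    compared as plain strings (lexical order), positions as integers.
    """
    previous = None
    for index, (chrom, pos) in enumerate(records):
        key = (chrom, int(pos))
        if previous is not None and key < previous:
            return index
        previous = key
    return None


lexical = [("chr1", 5), ("chr10", 1), ("chr2", 3)]
natural = [("chr1", 5), ("chr2", 3), ("chr10", 1)]
print(first_unsorted_record(lexical))
print(first_unsorted_record(natural))
```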


Using simulated files

These are .glf.gz files generated by simnextgen in the misc subfolder. NB: remember to supply the -nInd argument, since the number of individuals can't be inferred from the binary file.

./dirty.g++ -sim1 misc/small.glf.gz -nInd 15 -outfiles results -doMaf 2

Example for getting the depths with bamfiles

./angsd.g++ -dumpCounts 1 -outfiles tspr3 mpileup -rchr11:130060569-130060569 -b /space/lucampBam/lucampHighBam.filelist

sfstools

Simple example for getting S, pi and tajima

1) no prior

./misc/sfstools.g++ -nChr 50 -sfsFile lctoutput.sfs -tajima tajimafile >moded

2) with prior

./misc/sfstools.g++ -nChr 50 -priorFile LCT.data/all.25.sfs.ml -sfsFile lctoutput.sfs -tajima tajimafile >moded


stdout is the normalized output from .sfs
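For readers unfamiliar with the three statistics, the following Python sketch computes S, pi and Tajima's D from a site frequency spectrum using the standard textbook formulas. It is not a description of how sfstools computes them. The input layout (a list indexed by allele count, with the number of chromosomes implied by its length, comparable to -nChr) and the function name are assumptions made for this example.

```python
from math import sqrt


def summary_stats(sfs):
    """Return (S, pi, D) for a site frequency spectrum.

    sfs[k] is the number of sites carrying k copies of the derived allele,
    for k = 0..n, where n = len(sfs) - 1 is the number of chromosomes.
    """
    n = len(sfs) - 1
    segregating = sfs[1:n]  # classes k = 1..n-1
    S = sum(segregating)
    pairs = n * (n - 1) / 2
    # Average number of pairwise differences.
    pi = sum(c * k * (n - k) for k, c in enumerate(segregating, start=1)) / pairs

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    variance = e1 * S + e2 * S * (S - 1)
    D = (pi - S / a1) / sqrt(variance) if variance > 0 else float("nan")
    return S, pi, D


# A spectrum proportional to 1/k (the neutral expectation) gives D close to 0.
print(summary_stats([0, 12, 6, 4, 3, 0]))
```

An excess of rare variants, for example a spectrum made up only of singletons, gives a negative D.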

Version notes

  • 0.16 Is now bundled with SAMtools-0.1.17, and the mpileup (and friends) commands can be used for passing data to dirty.
  • 0.17 Added extra options -minInd and -minMaf, for only printing and using sites above a threshold.
  • 0.18 Added option to pass reference and ancestral allele as fasta files (using faidx format). doMaf is now encoded internally as a MAF_(UN)KNOWN_TYPE.
  • 0.19 Added support for tglf input files (-tglf, -posfile; see run examples). Also added the likelihood ratio test for SNP calling.
  • 0.20 Added the check for missing data before the major/minor. Included -realSFS. Changed the deallocation of the -doMaf results so that they are properly cleaned up.
  • 0.21 Refactored pml.cpp into pml_estError_genLikes.cpp and pml_freq_asso.cpp (fixed a bug that prevented -samglf and samglfclean from working).
  • 0.22 This update was a mixture of edits from user:albrecht and BGI, so it's difficult to give a concise description.
  • 0.23 Program can now read simulated files (single pop only). An example can be seen in "full example ... sfs" and input types.
  • 0.24 Added the Tajima estimator. This should go in tandem with some R scripts. Had to modify parseargs, shared, and pml_freq_asso.
  • 0.25 The depth is now being populated when using mpileup -g. The program can now get the counts from mpileup.

ANGSD below

  • 0.01.a - 0.01.b The bfgs now supports threading; maybe anders implemented a heterozygosity estimator.
  • 0.01.c A problem where not observing any llh caused the MAF estimator to return 'nan'.
  • 0.02 Fixed a small bug in the bfgs optimization of the sfs. When choosing a region bigger than what was covered by the .sfs file, the program would hang. Added genotypecaller, added -sfsEst to the realSFS part of the program.
  • 0.03 Added and documented genotypecaller. Can dump counts. -realSFS 1 dumps positions, -realSFS 2 is deprecated. S, pi and Tajima have been added to sfstools, along with the possibility to use a prior.

Possible Bugs

0.18 If we don't get any reliable genotype likelihoods for a site, the site won't be included in the .mafs file. Whether this is what we want, I can't say.

0.20 The major/minor doesn't work for the 3-genotype likelihood format (beagle).

0.21 When using soap files, do we want to infer major/minor from the generated likelihoods or from counts?

0.24 Should we plug in the keepInd vector for the realSFS?

angsd

0.02 Modify -doGeno to set -doMaf, to avoid a segfault with pars->results->asso->freq.

0.03 When dumping beagle we need -doMaf 2, otherwise it segfaults.

Wish list

1. Make start/stop work with soap input (not going to happen, I think).