ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

ANGSD: Difference between revisions

 
(82 intermediate revisions by 2 users not shown)




ANGSD is a software for analyzing next generation sequencing data. The software can handle a number of different input types from mapped reads to imputed genotype probabilities. Most methods take genotype uncertainty into account instead of basing the analysis on called genotypes. This is especially useful for low and medium depth data. The software is written in C++ and has been used on large sample sizes.


=About=
This program is not for manipulating BAM/CRAM files, but solely a tool to perform various kinds of analysis. We recommend the excellent program [http://samtools.sourceforge.net/ SAMtools] for outputting and modifying BAM files.
 
ANGSD is a software for analyzing next generation sequencing data. The software can handle a number of different input types from mapped reads to imputed genotype probabilities. All methods take genotype uncertainty into account instead of basing the analysis on called genotypes. This is especially useful for low and medium depth data. The software is written in C++ and can handle thousands of samples.


ANGSD is also on github: https://github.com/ANGSD/angsd
<!-- **
==Overview of input and intermediary data==
The input and intermediary data structures of angsd.


<classdiagram type="dir:LR">
<classdiagram type="dir:LR">
<classdiagram>
<classdiagram>
//[input data|beagle output{bg:orange}]->[genotype;probabilities]
//[input data|beagle output{bg:orange}]->[genotype;probabilities]
[genotype;probabilities]->[output|genotype calling;MAF estimates;associations{bg:blue}]
[genotype;probabilities]->[output|genotype calling;MAF estimates;associations;SFS{bg:blue}]
  </classdiagram>
  </classdiagram>
-->


 
=Synopsis=
===Commands===
samtools
 
<code lang=sh>
<code lang=sh>
angsd.g++ -outfiles samtools  -GL 1 uppile -b ceu5.mapped.list
./angsd [OPTIONS]
</code>
soapsnp
If no recalibration matrices exist, they will be created first.
<code lang=sh>
angsd.g++ -outfiles soapsnp  -GL 0 uppile -b ceu5.mapped.list
</code>
gatk
 
<code lang=sh>
angsd.g++ -outfiles gatk  -GL 2 uppile -b ceu5.mapped.list
</code>
</code>


Example: estimating allele frequencies from genotype likelihoods, with BAM files as input and using 10 threads


<code lang=sh>
<code lang=sh>
angsd.g++ -outfiles suyeon  -doCounts 1 -qs 20 -doLike 1 uppile -b ceu5.mapped.list
./angsd -out outFileName -bam bam.filelist -GL 1 -doMaf 1 -doMajorMinor 1 -nThreads 10
</code>
</code>


=Version notes=
=Platform=
* 0.16 Is now bundled with SAMtools-0.1.17, and the mpileup (and friends) commands can be used for passing data to dirty.
The program is developed and tested on a Linux system with the gcc compiler. It compiles on OSX with clang, but it has not been tested much on OSX.
* 0.17 added extra options -minInd and -minMaf, for only printing and using sites above a threshold
* 0.18 added option to pass reference and ancestral allele as fasta files.(using faidx format) (doMaf is now encoded internally as a MAF_(UN)KNOWN_TYPE)
* 0.19 added support for tglf inputfiles, -tglf -posfile see runexamples, also added the likeratio test for snp calling
* 0.20 Added the check for missing data before the major/minor step. Included -realSFS. Changed the deallocation of the -doMAF results so that they are properly cleaned up.
* 0.21 Refactored pml.cpp into pml_estError_genLikes.cpp and pml_freq_asso.cpp (fixed a bug that prevented -samglf and samglfclean from working).
* 0.22 This update was a mixture of edits from [[user:albrecht]] and BGI, so it is difficult to give a concise description.
* 0.23 The program can now read simulated files (single pop only). An example can be seen in "full example ... sfs" and input types.
* 0.24 added the tajima estimator. This should go in tandem with some R scripts. Had to modify parseargs, shared, and pml_freq_asso
* 0.25 the depth is now being populated when using mpileup -g. The program can now get the counts from mpileup
ANGSD below
* 0.01.a - 0.01.b The bfgs now supports threading, maybe anders implemented a heterozygosity estimator.
* 0.01.c A problem if we didn't observe any llh, caused the MAF estimator to 'nan'.
* 0.02 Fixed small bug in bfgs optimization of sfs optimization. When choosing a region bigger than what was covered by the .sfs file the program would hang. Added genotypecaller, added -sfsEst to the realSFS part of the program.
* 0.03 Added and documented genotypecaller; can dump counts; -realSFS 1 dumps positions; -realSFS 2 is deprecated; S, pi and Tajima have been added to sfstools, along with the possibility of using a prior.
 
=Possible Bugs=
0.18 If we don't get any reliable genotype likelihoods for a site, the site won't be included in the .mafs file. Whether this is what we want, I can't say.
 
0.20 The major/minor inference doesn't work for the 3-genotype likelihood format (beagle).
 
0.21 When using soap files do we want to infer major/minor from generated likes or from counts?
 
0.24 Should we plug in the keepInd vector for the realSFS?
 
angsd
 
0.02 modify the -doGeno to set -doMaf to avoid a segfault with pars->results->asso->freq
 
0.03 when dumping beagle we need to -doMaf 2 otherwise segfault
 
0.3 Clean version with fewer features. The lost features will be reintroduced later.
 
=Wish list=
1. start stop to work with soap input (Not gonna happen I think)

Latest revision as of 15:07, 4 December 2015


ANGSD is a software for analyzing next generation sequencing data. The software can handle a number of different input types from mapped reads to imputed genotype probabilities. Most methods take genotype uncertainty into account instead of basing the analysis on called genotypes. This is especially useful for low and medium depth data. The software is written in C++ and has been used on large sample sizes.

This program is not for manipulating BAM/CRAM files, but solely a tool to perform various kinds of analysis. We recommend the excellent program SAMtools for outputting and modifying BAM files.

ANGSD is also on github: https://github.com/ANGSD/angsd

Synopsis

./angsd [OPTIONS]

Example: estimating allele frequencies from genotype likelihoods, with BAM files as input and using 10 threads

./angsd -out outFileName -bam bam.filelist -GL 1 -doMaf 1 -doMajorMinor 1 -nThreads 10
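
The command above expects a file list (here bam.filelist) to exist already. As a minimal sketch, assuming the list is a plain text file with one BAM path per line, the wrapper below builds that list and then runs the same command. The directory ./bams, the variable names and the helper functions are hypothetical and chosen only for illustration; the angsd flags are exactly those shown above.

```shell
#!/bin/sh
# Sketch only: BAM_DIR, LIST, OUT and both helper functions are
# hypothetical names, not part of ANGSD itself.
BAM_DIR="${BAM_DIR:-./bams}"
LIST="${LIST:-bam.filelist}"
OUT="${OUT:-outFileName}"

# Write one BAM path per line into the list file.
# The list ends up empty if the directory is missing or has no *.bam files.
make_filelist() {
    find "$1" -maxdepth 1 -name '*.bam' 2>/dev/null | sort > "$2"
}

# Print the allele frequency command from the example above
# for a given output prefix ($1) and file list ($2).
angsd_maf_cmd() {
    printf './angsd -out %s -bam %s -GL 1 -doMaf 1 -doMajorMinor 1 -nThreads 10\n' "$1" "$2"
}

make_filelist "$BAM_DIR" "$LIST"

if [ -s "$LIST" ] && [ -x ./angsd ]; then
    # Non-empty list and an angsd binary in the current directory: run it.
    ./angsd -out "$OUT" -bam "$LIST" -GL 1 -doMaf 1 -doMajorMinor 1 -nThreads 10
else
    # Otherwise do a dry run and only show the command.
    angsd_maf_cmd "$OUT" "$LIST"
fi
```

Running the script from the directory that holds the angsd binary, for example with BAM_DIR=/path/to/bams set in the environment, either starts the analysis or prints the command it would have run. Checking that the list is non-empty first avoids starting a 10-thread job with no input.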

Platform

The program is developed and tested on a Linux system with the gcc compiler. It compiles on OSX with clang, but it has not been tested much on OSX.