|
|
(96 intermediate revisions by 2 users not shown) |
Line 1: |
Line 1: |
|
| |
|
|
| |
|
| | ANGSD is a software for analyzing next generation sequencing data. The software can handle a number of different input types from mapped reads to imputed genotype probabilities. Most methods take genotype uncertainty into account instead of basing the analysis on called genotypes. This is especially useful for low and medium depth data. The software is written in C++ and has been used on large sample sizes. |
|
| |
|
| =About=
| | This program is not for manipulating BAM/CRAM files, but solely a tool to perform various kinds of analysis. We recommend the excellent program [http://samtools.sourceforge.net/ SAMtools] for outputting and modifying bamfiles. |
| | |
| ANGSD is a software for analyzing next generation sequencing data. The software can handle a number of different input types from mapped reads to imputed genotype probabilities. All methods talk genotype uncertainty into account instead of basis the analysis on called genotypes. This is especially useful for low and medium depth data. The software is written in C++ and can handle thousands of samples
| |
|
| |
|
| | ANGSD is also on github: https://github.com/ANGSD/angsd |
| | <!-- ** |
| ==Overview of input and intermediary data== | | ==Overview of input and intermediary data== |
| | The input and intermediary data structures of angsd. |
|
| |
|
| <classdiagram type="dir:LR"> | | <classdiagram type="dir:LR"> |
Line 33: |
Line 35: |
| <classdiagram> | | <classdiagram> |
| //[input data|beagle output{bg:orange}]->[genotype;probabilities] | | //[input data|beagle output{bg:orange}]->[genotype;probabilities] |
| [genotype;probabilities]->[output|genotype calling;MAF estimates;associations{bg:blue}] | | [genotype;probabilities]->[output|genotype calling;MAF estimates;associations;SFS{bg:blue}] |
| </classdiagram> | | </classdiagram> |
| | --> |
|
| |
|
| | | =Synopsis= |
| ===Commands=== | |
| gatk
| |
| | |
| <code lang=sh>
| |
| angsd.g++ -outfiles samtools -GL 1 uppile -b ceu5.mapped.list
| |
| </code>
| |
| soapsnp
| |
| If no recalibration matrix exists these will be created first.
| |
| <code lang=sh>
| |
| angsd.g++ -outfiles soapsnp -GL 0 uppile -b ceu5.mapped.list
| |
| </code>
| |
| gatk
| |
| | |
| <code lang=sh> | | <code lang=sh> |
| angsd.g++ -outfiles gatk -GL 2 uppile -b ceu5.mapped.list | | ./angsd [OPTIONS] |
| </code> | | </code> |
|
| |
|
| | example of allele frequency estimated from genotype likelihoods with bam files as input using 10 threads |
|
| |
|
| <code lang=sh> | | <code lang=sh> |
| angsd.g++ -outfiles suyeon -doCounts 1 -qs 20 -doLike 1 uppile -b ceu5.mapped.list
| | ./angsd -out outFileName -bam bam.filelist -GL 1 -doMaf 1 -doMajorMinor 1 -nThreads 10 |
| </code>
| |
| | |
| =Download and Installation=
| |
| | |
| ==Download==
| |
| http://popgen.dk/dirty_source
| |
| | |
| Misc folder contains:
| |
| | |
| 1. simnextgen a program for simulating genotype likelihoods
| |
| 2. getUnfolded, simnextgen only outputs the true folded frequency spectra. getUnfolded uses the "true" genotypes to estimate the unfolded spectra.
| |
| 3. testfolded/optimSFS uses the .sfs output from -realSFS 1 to estimate the site frequency spectra polarized to the ancestral type.
| |
| | |
| | |
| samtools-0.1.17 contains:
| |
| | |
| The sourcecode for samtools http://samtools.sourceforge.net/, this used for reading mpileup from within dirty.
| |
| | |
| ==Compile everything like==
| |
| <code lang="sh">
| |
| cd misc/;make;cd ../samtools-0.1.17;make;cd ..;make
| |
| </code>
| |
| | |
| | |
| =Methods=
| |
| ==Heterozygosity==
| |
| | |
| | |
| ==HWE and Inbreeding estimates==
| |
| For method details see [[inbreed|here]]
| |
| | |
| =Output data=
| |
| ==.mafs==
| |
| <pre>
| |
| chromo position major minor ref knownEM unknownEM nInd
| |
| 21 9719788 T A 0.000001 -0.000012 3
| |
| 21 9719789 G A 0.000000 -0.000001 3
| |
| 21 9719790 A C 0.000000 -0.000004 3
| |
| 21 9719791 G A 0.000000 -0.000001 3
| |
| 21 9719792 G A 0.000000 -0.000002 3
| |
| 21 9719793 G T 0.498277 41.932766 3
| |
| 21 9719794 T A 0.000000 -0.000001 3
| |
| 21 9719795 T A 0.000000 -0.000001 3
| |
| | |
| </pre>
| |
| This pretty explanatory, nInd is the number of individuals where we have "reliable" reads (see bugs section)
| |
| Depending on -doMaf INT, and -ref FILENAME and -anc FILENAME, extra column will be input.
| |
| | |
| ==Association==
| |
| | |
| ===Score statistic (prefix lrt*)===
| |
| {| border="1"
| |
| |Chromosome || Position || Frequency || N || LRT
| |
| |}
| |
| *Chromosome
| |
| The Chromosome
| |
| *Position
| |
| The physical Position
| |
| * Frequency
| |
| The frequency estimate. The choice of estimation is determined by the *doMaf option.
| |
| *N
| |
| The number of individuals with non-missing data. That is the individuals who have both some sequencing data for the given site and have phenotype data
| |
| *LRT
| |
| The likelihood ratio statistic. This statistic is chi square distributed with one degree of freedom. Sites that fails one of the filters are given the value -999.000000
| |
| | |
| example:
| |
| <pre>
| |
| 1 711153 0.012228 3200 -999.000000
| |
| 1 713682 0.047357 3200 0.133145
| |
| 1 713754 0.047357 3200 1.018738
| |
| 1 742429 0.096592 3200 0.174977
| |
| 1 743404 0.043796 3200 1.003485
| |
| 1 744055 0.097272 3200 2.334205
| |
| 1 751595 0.055826 3200 0.300824
| |
| 1 758311 0.054249 3200 1.242375
| |
| 1 765522 0.097715 3200 2.667515
| |
| 1 766409 0.345465 3200 0.162817
| |
| </pre>
| |
| | |
| | |
| =All options=
| |
| | |
| ==Input sequencing data==
| |
| | |
| ===Alignment data===
| |
| The input data is given as a file with the full path to all the individual filenames.
| |
| ;-soap:
| |
| Sorted soap alignment files.
| |
| | |
| ;-qs:
| |
| | |
| ;-maxHits:
| |
| | |
| ===Genotype likelihood data===
| |
| ;-samglf [filename]:
| |
| Samtools glf format (binary output). use the -g options in samtools to generate the files. This format is deprecated in newer versions of samtools.
| |
| ;-samglfclean [filename]:
| |
| Samtools glf format (text output). use the -g options in samtools to generate binary files followed the use of the samtools glfview. This format is deprecated in newer versions of samtools.
| |
| ;-tglf [filename]:
| |
| A simple format for genotype likelihoods
| |
| For fetching a number of sites, use -nLines INT
| |
| ;-fai:
| |
| | |
| ==Error rates==
| |
| ;-errorFile:
| |
| | |
| ===Association study===
| |
| | |
| ;-doAsso:
| |
| | |
| *-adjust
| |
| | |
| *-yBin
| |
| | |
| *-yQuant
| |
| | |
| *-cov
| |
| | |
| *-assoCutoff
| |
| | |
| *-sitePerm
| |
| | |
| ===Outfile name===
| |
| | |
| *-outfiles
| |
| | |
| ==Use subset of data==
| |
| *-start
| |
| | |
| *-stop
| |
| | |
| *-target
| |
| | |
| *-strand
| |
| | |
| *-minHigh
| |
| | |
| *-minCount
| |
| | |
| *-cutOff
| |
| | |
| *-downsample
| |
| | |
| *-minDepth
| |
| | |
| *-minInd
| |
| The minimum number of individuals with genotype data to analyze and print results for.
| |
| | |
| *-minMaf
| |
| The minimum estimated MAF to use in analysis and to print results for.
| |
| | |
| ==Frequency estimation==
| |
| *-doMaf
| |
| *-doSNP
| |
| | |
| ==Run options==
| |
| *-chunkSize
| |
| *-nLines
| |
| *-nThreads
| |
| | |
| ==Estimate genotype likelihoods==
| |
| *-doGLF INT
| |
| | |
| INT=1 binary output
| |
| | |
| INT=2 text beagle output
| |
| | |
| INT>2 textoutput
| |
| | |
| ==Estimate the covariance matrix==
| |
| *-getCovar
| |
| | |
| | |
| *-emIter
| |
| | |
| | |
| | |
| | |
| *-doError
| |
| | |
| | |
| | |
| *-eps
| |
| | |
| *-nInd
| |
| | |
| ==Estimate SFS realSFS==
| |
| | |
| -realSFS 1 | |
| | |
| =Run Examples=
| |
| ==Using glfv3==
| |
| <code lang="sh">
| |
| ../dirty -samglf ceu.glf.list -outfiles test.glf -doMaf 2 -fai numSort.Fai -nLines 50000 -chunkSize 500 -nThreads 16
| |
| </code>
| |
| | |
| ==Using mpileup==
| |
| <code lang="sh">
| |
| ./dirty.g++ -chunkSize 200000 -outfiles ASDF -doMaf 2 -nThreads 10 mpileup -g -r 21:1-20000000 -I ~/sample/*.chr21 >bcfoutput
| |
| </code>
| |
| | |
| First is programname.
| |
| Followed by the arguments used for dirty
| |
| Followed by mpileup and the arguments that will be bassed directly to SAMtools
| |
| | |
| From version 0.25, we can now get the nucleotide count for every site, for every sample. This is done by omitting the -g parameter
| |
| | |
| ==Using tglf files==
| |
| cd into the teststuff subfolder
| |
| | |
| <code lang="sh">
| |
| ../dirty.g++ -nThreads 1 -tglf lct.list -posfile lct.pos -nLines 100000 -outfiles GG -doMaf 15 -doSNP 1
| |
| </code>
| |
| | |
| If we want to estimate the SFS
| |
| | |
| <code lang="sh">
| |
| ../dirty.g++ -nThreads 5 -tglf lct.list -posfile lct.pos -nLines 100000 -outfiles KK -realSFS 1
| |
| </code>
| |
| | |
| ==Using soapfiles==
| |
| <code lang="sh">
| |
| ../dirty.g++ -soap tsk.sub10.list -doMaf 2 -outfiles NEW10 -chunkSize 1000 -nLines 10000 -nThreads 4
| |
| </code>
| |
| | |
| -soap is filelist containing the soapfiles, each soapfile must be sorted according the chromosomename (lexical ordering), and position.
| |
| | |
| | |
| ==Using simulated files==
| |
| These are .glf.gz files generated from simnextgen in misc subfolder, NB REMEMBER TO SUPPLY -nInd argument since these can't be inferred from the binary file.
| |
| <code lang="sh">
| |
| ./dirty.g++ -sim1 misc/small.glf.gz -nInd 15 -outfiles results -doMaf 2
| |
| </code> | | </code> |
|
| |
|
| = Example for getting the depths with bamfiles = | | =Platform= |
| | | The program is developed on tested on a Linux system with gcc compiler. It compiles on OSX with clang, but OSX is not really that tested. |
| <pre>
| |
| ./angsd.g++ -dumpCounts 1 -outfiles tspr3 mpileup -rchr11:130060569-130060569 -b /space/lucampBam/lucampHighBam.filelist
| |
| </pre>
| |
| | |
| = sfstools =
| |
| Simple example for getting S, pi and tajima
| |
| | |
| 1) no prior
| |
| ./misc/sfstools.g++ -nChr 50 -sfsFile lctoutput.sfs -tajima tajimafile >moded
| |
| | |
| 2) with prior
| |
| ./misc/sfstools.g++ -nChr 50 -priorFile LCT.data/all.25.sfs.ml -sfsFile lctoutput.sfs -tajima tajimafile >moded
| |
| | |
| | |
| stdout is the normalized output from .sfs
| |
| | |
| = Full example for simulating nextgen data and estimating SFS (somewhat deprecated should use sfstools now) =
| |
| First we simulate a small dataset with very high probability of being variabe.
| |
| | |
| Number of individuals =15,number of sites=10000, pvar=0.9
| |
| | |
| <pre>
| |
| ./simnextgen.gcc -nind 15 -nsites 10000 -outfiles small -pvar 0.9
| |
| ->Using args: -nind 15 -errate 0.007500 -depth 5.000000 -pvar 0.900000 -nsites 10000 -F 0.000000 -model 1
| |
| ->Dumping files: sequencefile: small.seq.gz glffile: small.glf.gz truefreq: small.frq args:small.args geno:small.geno
| |
| </pre>
| |
| | |
| The .frq file contains the folded spectra, we use getUnfolded on the "true" genotypes, to get the unfolded spectra
| |
| | |
| <pre>
| |
| ./getUnfolded.g++ -infile small.geno -nChr 30
| |
| -> opening filehandle for: small.geno
| |
| -> opening filehandle for: small.geno.var
| |
| -> opening filehandle for: small.geno.unfolded
| |
| </pre>
| |
| | |
| The second line in small.geno.unfolded now contains the unfolded spectra.
| |
| | |
| We now run dirty on the glf.gz file.
| |
| | |
| <pre>
| |
| ./dirty.g++ -sim1 misc/small.glf.gz -nInd 15 -outfiles results -realSFS 1
| |
| ##screen output removed
| |
| </pre>
| |
| | |
| And we now run the optimSFS program to get the spectra, we supply the filename, the number of chromosomes, and the number of threads we want to use
| |
| <pre>
| |
| ./optimSFS.gcc -binput ../results.sfs -ncat 30 -nThread 4
| |
| ##screen output removed
| |
| </pre>
| |
| | |
| The final output is in results.sfs.ml
| |
| | |
| We start R and plot our estimated SFS along with the true SFS
| |
| | |
| <pre>
| |
| mytrue <- as.numeric(read.table("small.geno.unfolded",skip=1))
| |
| myest <-scan("../results.sfs.ml")
| |
| barplot(rbind(mytrue,myest),beside=T)
| |
| </pre>
| |
| | |
| [[Image:SfsExample.png]]
| |
| | |
| =Version notes=
| |
| * 0.16 Is now bundled with SAMtools-0.1.17 and the mpileup (and friends) command be used for passing data to dirty.
| |
| * 0.17 added extra options -minInd and -minMaf, for only printing and using sites above a threshold
| |
| * 0.18 added option to pass reference and ancestral allele as fasta files.(using faidx format) (doMaf is now encoded internally as a MAF_(UN)KNOWN_TYPE)
| |
| * 0.19 added support for tglf inputfiles, -tglf -posfile see runexamples, also added the likeratio test for snp calling
| |
| * 0.20 Added the check for missing data, before the major/minor. included -realSFS, changed the deallocation of the -doMAF results, such that its proper cleaned up.
| |
| * 0.21 refactored pml.cpp into pml_estError_genLikes.cpp and pml_freq_asso.cpp (fixed a bug that preventede -samglf and samglfclean from working)
| |
| * 0.22 Well this update was a mixture of edits from [[user:albrecht]] and BGI so its difficult to give a concise description
| |
| *.0.23 Program can now read simulated files (single pop only) An example can be seen in "full example ... sfs" and input types.
| |
| * 0.24 added the tajima estimator. This should go in tandem with some R scripts. Had to modify parseargs, shared, and pml_freq_asso
| |
| * 0.25 the depth is now being populated when using mpileup -g. The program can now get the counts from mpileup
| |
| ANGSD below
| |
| * 0.01.a - 0.01.b The bfgs now supports threading, maybe anders implemented a heteorzygosity estimator.
| |
| * 0.01.c A problem if we didn't observe any llh, caused the MAF estimator to 'nan'.
| |
| * 0.02 Fixed small bug in bfgs optimization of sfs optimization. When choosing a region bigger than what was covered by the .sfs file the program would hang. Added genotypecaller, added -sfsEst to the realSFS part of the program.
| |
| * 0.03 added and documented genotypecaller, can dump counts,-realSFS 1 dumps positions, -realSFS 2 is deprecated,S,pi and tajima has been added to sfstools along with possibility to do prior
| |
| | |
| =Possible Bugs=
| |
| 0.18 If we don't get any reliable genotypelikelihoods for a site, the site wont be included in the .mafs file. If this is want we want I can't say.
| |
| | |
| 0.20 the major minor, doesn't work for the 3genotypes likelihood format (beagle)
| |
| | |
| 0.21 When using soap files do we want to infer major/minor from generated likes or from counts?
| |
| | |
| 0.24 Should we plugin the keepInd vector for the realSFS
| |
| | |
| angsd
| |
| | |
| 0.02 modify the -doGeno to set -doMaf to avoid a segfault with pars->results->asso->freq
| |
| | |
| 0.03 when dumping beagle we need to -doMaf 2 otherwise segfault
| |
| | |
| =Wish list=
| |
| 1. start stop to work with soap input (Not gonna happen I think)
| |