ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Input: Difference between revisions

Revision as of 11:27, 19 June 2012

ANGSD currently supports several kinds of input: mapped sequence data, genotype likelihood formats and imputed genotype probability files.

Mapped sequence file

bam files

ANGSD accepts bam files for mapped sequences. For information on the file specification and file creation, see the samtools website [1].

arguments

-bam [filelist]

The filelist is a file containing the full path for each bam file with one filename per row.

Example of a filelist with 6 individuals

/home/software/angsd/test/smallBam/smallNA12763.bam
/home/software/angsd/test/smallBam/smallNA11830.bam
/home/software/angsd/test/smallBam/smallNA12004.bam
/home/software/angsd/test/smallBam/smallNA06985.bam
/home/software/angsd/test/smallBam/smallNA11993.bam
/home/software/angsd/test/smallBam/smallNA12761.bam
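
A filelist like the one above can be written by hand, but for many individuals it may be easier to generate it. The following is a minimal sketch, not part of ANGSD: the function name write_bam_filelist and the directory layout are assumptions made for illustration. It collects every file ending in .bam in one directory and writes the full paths, one per row.

```python
from pathlib import Path


def write_bam_filelist(bam_dir, filelist_path):
    """Write the full path of every .bam file in bam_dir, one per row.

    Hypothetical helper for illustration; it is not shipped with ANGSD.
    Returns the list of paths that were written.
    """
    # resolve() gives absolute paths, matching the "full path" requirement.
    bam_paths = sorted(str(p.resolve()) for p in Path(bam_dir).glob("*.bam"))
    Path(filelist_path).write_text("".join(path + "\n" for path in bam_paths))
    return bam_paths


# Example call (the directory name is an assumption):
# write_bam_filelist("test/smallBam", "bam.filelist")
```

The resulting file can then be passed to -bam as in the example below.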

Example of estimating allele frequencies from bam files

./angsd -out out -doMaf 2 -bam bam.filelist

optional arguments

-r [region]

Specify a region within a chromosome using the syntax [chr]:[start-stop]. Examples:

chr1:1-10000              \\ first 10000 bases of chr1
chr2:50000-               \\ chr2 from position 50000 to the end
chr11:1-                  \\ all of chr11

-only_proper_pairs [int]=0

Include only proper pairs (pairs of reads with both mates mapped correctly). 0: include only proper pairs (default), 1: use all reads. If your data are not paired-end, you have to choose 1.

-rf [region file]

Specify multiple regions in a file.
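
Because a malformed region is easy to type, it can help to check region strings before starting a long run. The sketch below parses the [chr]:[start-stop] syntax described for -r. The function name parse_region is hypothetical and the checks are not ANGSD's own parser. If a region file for -rf holds one such region per line (an assumption; the layout is not specified here), the same function could be applied to each line.

```python
def parse_region(region):
    """Split '[chr]:[start-stop]' into (chromosome, start, stop).

    stop is None for open-ended regions such as 'chr2:50000-'.
    Illustrative only; ANGSD's own parsing may accept other forms.
    """
    chrom, colon, span = region.strip().partition(":")
    start_text, dash, stop_text = span.partition("-")
    if not (chrom and colon and dash and start_text.isdigit()):
        raise ValueError(f"not a [chr]:[start-stop] region: {region!r}")
    if stop_text and not stop_text.isdigit():
        raise ValueError(f"stop position is not a number: {region!r}")
    start = int(start_text)
    stop = int(stop_text) if stop_text else None
    if stop is not None and stop < start:
        raise ValueError(f"stop is before start: {region!r}")
    return chrom, start, stop


for example in ("chr1:1-10000", "chr2:50000-", "chr11:1-"):
    print(example, "->", parse_region(example))
```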

-nLines [int]=50

Number of lines to read per file at a time. Reducing this number lowers RAM usage at a small cost in speed.

-uniqueOnly [int]=0

Remove reads that have multiple best hits. 0: keep them (default), 1: remove them.

-remove_bads [int]=1

Same as the samtools flag -x, which removes reads with a flag value above 255 (not primary, failure and duplicate reads).

genotype likelihood files

For historical reasons the program can read binary glfv3 files and their text representations. These were generated by old versions of SAMtools and are deprecated in newer versions of SAMtools. Furthermore, for internal use, the program can read the in-house tglf format.

These formats are likely to be deprecated in future versions.

glfv3

-samglf [filename]

Samtools glf format (binary output). Use the pileup -g option in samtools to generate the files. This format is deprecated in newer versions of samtools.

-samglfclean [filename]

Samtools glf format (text output). Use the pileup -g option in samtools to generate binary files, then convert them with samtools glfview. This format is deprecated in newer versions of samtools.

tglf

-tglf [filename]

A simple format for genotype likelihoods. Each sample is stored in a separate file, and every genotype likelihood is saved as a log10-scaled binary double in the following order: AA, CC, GG, TT, etc.

genotype probability files

beagle format

Genotype probabilities in beagle format can be used as input. The format used is the haplotype imputation format output by beagle [2].

options

To include a beagle file, use the option

-beagle [file]

example

The file format has a single line per site. The first three columns are

  • markerName
  • alleleA
  • alleleB

For each individual three columns are added. These three columns should sum to one.

Example of a file with two individuals

marker alleleA alleleB NA06984 NA06984 NA06984 NA06986 NA06986 NA06986
chr9_95759065 G A 0.6563 0.3078 0.0358 0.5357 0.4016 0.0627
chr9_95759152 C A 1 0 0 0 1 0
chr9_95762332 G A 0.925 0.0734 0.0015 0.894 0.1031 0.0029
chr9_95762333 A T 0.8903 0.1067 0.003 0.811 0.1797 0.0093
chr9_95762343 G T 0.9149 0.0835 0.0017 0.8396 0.1541 0.0064
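
The layout described above (three leading columns, then three probabilities per individual that should sum to one) can be checked before the file is handed to ANGSD. The following is a rough sketch rather than anything ANGSD itself provides: the function name check_beagle_lines and the rounding tolerance of 0.001 are assumptions chosen for illustration.

```python
def check_beagle_lines(lines, tolerance=1e-3):
    """Return (marker, individual) pairs whose three probabilities do not sum to one.

    Hypothetical validator; the tolerance is an assumed allowance for rounding.
    """
    rows = [line.split() for line in lines if line.strip()]
    header, data = rows[0], rows[1:]
    if len(header) < 6 or (len(header) - 3) % 3 != 0:
        raise ValueError("expected 3 leading columns plus 3 columns per individual")
    # Each individual's name is repeated three times; take the first of each triple.
    individuals = header[3::3]
    problems = []
    for fields in data:
        if len(fields) != len(header):
            raise ValueError(f"{fields[0]}: expected {len(header)} columns, got {len(fields)}")
        probs = [float(value) for value in fields[3:]]
        for i, name in enumerate(individuals):
            total = sum(probs[3 * i:3 * i + 3])
            if abs(total - 1.0) > tolerance:
                problems.append((fields[0], name))
    return problems


example = """\
marker alleleA alleleB NA06984 NA06984 NA06984 NA06986 NA06986 NA06986
chr9_95759065 G A 0.6563 0.3078 0.0358 0.5357 0.4016 0.0627
chr9_95759152 C A 1 0 0 0 1 0
""".splitlines()
print(check_beagle_lines(example))
```

An empty list means every triple summed to one within the tolerance.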

Example of estimating allele frequencies from beagle files

./angsd -out out -doMaf 16 -beagle file.beagle