ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Fasta

From angsd
Revision as of 17:00, 26 September 2017 by Thorfinn (talk | contribs)
Jump to navigation Jump to search

This option creates a fasta.gz file from a sequencing data file (BAM file). The function uses genome information in the BAM header to determine the length and chromosome names. For the sites without data an "N" is written.

<classdiagram type="dir:LR">

[Single BAM file{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);Highest EBD (-doFasta 3); write iupac (-doFasta 4)]

[sequence data]->doFasta[fasta file{bg:blue}]

</classdiagram>


<classdiagram type="dir:LR">

[Multiple BAM files{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2)]

[sequence data]->doFasta[fasta file{bg:blue}]

</classdiagram>


This can be used as input for the ANGSD analysis:

  1. Error estimation
  2. ABBA-BABA


Brief Overview

./angsd -doFasta 	-> Wed Nov 16 16:58:52 2016
--------------
abcWriteFasta.cpp:
	-doFasta	0
	1: use a random (non N) base (needs -doCounts 1)
	2: use the most common (non N) base (needs -doCounts 1)
	3: use the base with highest ebd (under development) 
	-basesPerLine	50	(Number of bases perline in output file)
	-explode	0	 print chromosome where we have no data (0:no,1:yes)
	-rmTrans	0	 remove transitions as different from -ref bases (0:no,1:yes)
	-ref	(null)	 reference fasta, only used with -rmTrans 1
	-seed	0	 use non random seed of value 1

This function will dump a fasta file, the full header information from the SAM/BAM file will be used. This means that a fasta will be generated for the entire chromosome even if '-r/-rf -sites' is used.

Options

-doFasta 1
sample a random base at each position. N's or filtered based are ignored. The "-doCounts 1" options for allele counts is needed in order to determine the most common base. If multiple individuals are used the four bases are counted across individuals.
-doFasta 2
use the most common base. In the case of ties a random base is chosen among the bases with the same maximum counts. N's or filtered based are ignored. The "-doCounts 1" options for allele counts is needed in order to determine the most common base. If multiple individuals are used the four bases are counted across individuals.
-doFasta 3
use the base with thie highest effective depth (EBD). This only works for one individual
-basesPerLine [INT]

Number of bases perline in output fasta file (default is 50)

-explode [INT]

0 (default) only output chromosomes with data. 1: write out all chromosomes

-rmTrans [INT]

0 (default) all sites are used. 1: Remove transition. Here transitions are determined using a fasta file such as a reference genome.

-ref [fileName]

a fasta file used to determine if a site is a transitions (needed when using -rmTrans 1 is used)

-seed [INT]

Use a seed in order to replicate results

For filters see Filters

Output

Output is a fasta file, a normal looking fast file. Nothing special about this. For -doFasta 1, sometimes its big letters sometime small letters. This is due to the results being copied directly from the sequencing data. So small/big letters correspond to which strand for the original data. For the consensus fasta all letters are capital letters.

Example

Create a fasta file bases from a random samples of bases.

./angsd -i bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1


EBD

For four bases we have 4 different EBD, each EBD is the product of the mapping quality and scores for the base under consideration. The EBD is the effective base depth, as defined by [1]:

where is a certain base, is the number of reads with base