ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Fasta: Difference between revisions

From angsd
Jump to navigation Jump to search
No edit summary
No edit summary
Line 1: Line 1:
Available from version 0.559+.
 
<pre>
Newer githubversions will limit the output to the chrs with data. Earlier versions was printout 'N"
</pre>
This option creates a fasta.gz file from a sequencing data file (BAM file). The function uses genome information in the BAM header to determine the length and chromosome names. For the sites without data an "N" is written.  
This option creates a fasta.gz file from a sequencing data file (BAM file). The function uses genome information in the BAM header to determine the length and chromosome names. For the sites without data an "N" is written.  



Revision as of 17:19, 8 February 2016

This option creates a fasta.gz file from a sequencing data file (BAM file). The function uses genome information in the BAM header to determine the length and chromosome names. For the sites without data an "N" is written.

<classdiagram type="dir:LR">

[Single BAM file{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);Highest EBD (-doFasta 3)]

[sequence data]->doFasta[fasta file{bg:blue}]

</classdiagram>

This can be used as input for the ANGSD analysis:

  1. Error estimation
  2. ABBA-BABA


Brief Overview

> ./angsd -doFasta
--------------
analysisFasta.cpp:
	-doFasta	0
	1: use a random base
	2: use the most common base (needs -doCounts 1)
	3: use the base with highest ebd (under development) 
	-basesPerLine	50	(Number of bases perline in output file)
	-explode	0	(Should we include chrs with no data?)
	-rmTrans	0	(remove transitions (as different from -ref bases)?)
	-ref	(null)	(reference fasta, only used with -rmTrans 1)

This function will dump a fasta file, the full header information from the SAM/BAM file will be used. This means that a fasta will be generated for the entire chromosome even if '-r/-rf -sites' is used.


Options

-doFasta 1
sample a random base at each position.
-doFasta 2
use the most common base. In the case of ties a random base is chosen among the bases with the same maximum counts. The "-doCounts 1" options for allele counts is needed in order to determine the most common base.
-doFasta 3
use the base with thie highest effective depth (EBD).
-basesPerLine [INT]

Number of bases perline in output fasta file (default is 50)

-explode [INT]

0 (default) only output chromosome with data. 1: write out all chromosomes

For filters see Filters

Output

Output is a fasta file, a normal looking fast file. Nothing special about this. For -doFasta 1, sometimes its big letters sometime small letters. This is due to the results being copied directly from the sequencing data. So small/big letters correspond to which strand for the original data. For the consensus fasta all letters are capital letters.

Example

Create a fasta file bases from a random samples of bases.

./angsd -i bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1


EBD

For four bases we have 4 different EBD, each EBD is the product of the mapping quality and scores for the base under consideration. The EBD is the effective base depth, as defined by [1]: