ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Fasta: Difference between revisions

From angsd
Jump to navigation Jump to search
(Created page with "This option create a fasta file from a sequencing data file. <classdiagram type="dir:LR"> [One bam file{bg:orange}]->[sequencing data|random base (-doFasta 1);consensus base...")
 
 
(58 intermediate revisions by 2 users not shown)
Line 1: Line 1:
This option create a fasta file from a sequencing data file.
 
This option creates a fasta.gz file from a sequencing data file (BAM file). The function uses genome information in the BAM header to determine the length and chromosome names. For the sites without data an "N" is written.  
 
<classdiagram type="dir:LR">
[Single BAM file{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);Highest EBD (-doFasta 3); write iupac (-doFasta 4)]
[sequence data]->doFasta[fasta file{bg:blue}]
</classdiagram>
 
 


<classdiagram type="dir:LR">
<classdiagram type="dir:LR">
  [One bam file{bg:orange}]->[sequencing data|random base (-doFasta 1);consensus base (-doFasta 2)]
  [Multiple BAM files{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);write iupac (-doFasta 4)]
[sequencing data]->doFasta[fasta file{bg:blue}]
[sequence data]->doFasta[fasta file{bg:blue}]
  </classdiagram>
  </classdiagram>


This can be used as input for the ANGSD analysis:
# [[Error estimation]]
# [[ABBA-BABA]]
The iupac output code was kindly provided by Kristian Ullrich.
=Brief Overview=
=Brief Overview=
<pre>
<pre>
./angsd -dofasta -> Tue Sep 26 17:02:07 2017
--------------
--------------
analysisFasta.cpp:
abcWriteFasta.cpp:
-doFasta 0
-doFasta 0
1: use a random base
1: use a random (non N) base (needs -doCounts 1)
2: use the most common base (needs -doCounts 1)
2: use the most common (non N) base (needs -doCounts 1)
-minQ 13 (remove bases with qscore<minQ)
3: use the base with highest ebd (under development)
4: output iupac codes (under development)  
-basesPerLine 50 (Number of bases perline in output file)
-explode 0 print chromosome where we have no data (0:no,1:yes)
-rmTrans 0 remove transitions as different from -ref bases (0:no,1:yes)
-ref (null) reference fasta, only used with -rmTrans 1
-seed 0 use non random seed of value 1
</pre>
</pre>


=options=
This function will dump a fasta file, the full header information from the SAM/BAM file will be used. This means that a fasta will be generated for the entire chromosome even if '-r/-rf -sites' is used.
;-realSFS 1: an sfs file will be generated.


;-realSFS 2:(version above 0.503) Taking into account perIndividual inbreeding coefficients. This is the work of Filipe G. Vieira
=Options=
;-doFasta 1: sample a random base at each position. N's or filtered based are ignored. The "-doCounts 1" options for [[Alleles_counts|allele counts]] is needed in order to determine the most common base. If multiple individuals are used the four bases are counted across individuals.  


;-realSFS 4: snpcalling (not implemented, in this angsd)
;-doFasta 2: use the most common base. In the case of ties a random base is chosen among the bases with the same maximum counts. N's or filtered based are ignored. The "-doCounts 1" options for [[Alleles_counts|allele counts]] is needed in order to determine the most common base. If multiple individuals are used the four bases are counted across individuals.


;-realSFS 8: genotypecalling (not implemented, int this angsd)
;-doFasta 3: use the base with thie highest effective depth (EBD). This only works for one individual


For the inbreeding version you need to supply a file containing all the inbreeding coefficients. -indF
;-basesPerLine [INT]
Number of bases perline in output fasta file (default is 50)


;-explode [INT]
0 (default) only output chromosomes with data. 1: write out all chromosomes
;-rmTrans [INT]
0 (default) all sites are used. 1: Remove transition. Here transitions are determined using a fasta file such as a reference genome.
;-ref [fileName]
a fasta file used to determine if a site is a transitions (needed when using -rmTrans 1 is used)
;-seed [INT]
Use a seed in order to replicate results ( relevant when using random sample -dofasta 1 )


;-underFlowProtect [INT]  
For filters see [[Filters]]
a very basic underflowprotection


=Output=
Output is a fasta file, a normal looking fast file. Nothing special about this. For -doFasta 1, sometimes its big letters sometime small letters. This is due to the results being copied directly from the sequencing data. So small/big letters correspond to which strand for the original data. For the consensus fasta all letters are capital letters.


==Example==
==Example==
A full example is shown below, here we use GATK genotype likelihoods and hg19.fa as the ancestral. The [[emOptim2]] can be found in the misc subfolder.
Create a fasta file bases from a random samples of bases.


<pre>
<pre>
#first generate .sfs file
./angsd -i bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1
./angsd -bam smallBam.filelist -realSFS 1 -out small -anc  hg19.fa -GL 2 [options]
#now try the EM optimization with 4 threads
./emOptim2 small.sfs 20 -maxIter 100 -P 4 >small.sfs.em.ml
</pre>
We always recommend that you filter out the bad qscore bases and meaningless mapQ reads. eg '''-minMapQ 1 -minQ 20'''.
<pre>
If you have say 10 diploid samples, then you should do -nChr 20
if you have say 12 diploid samples, then you should do -nChr 24.
</pre>
</pre>


=Interpretation of output file=


The outpiles are then called small.em.ml. This will be the SFS in logscale.
This is to be interpreted as:


column1 := probabilty of sampling zero derived alleles
=EBD=
For four bases we have 4 different EBD, each EBD is the product of the mapping quality and scores for the base under consideration.
The EBD is the effective base depth, as defined by [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3638139/]:


column2 := probabilty of sampling one derived allele
<math>
EBD_b = \sum_{i=1}^{N_b} (phred(mapq_i)*phred(qscore_i)),\qquad phred(q) =10^{-q/10} \qquad b \in \{A,C,G,T\}
</math>


column3 := probabilty of sampling two derived allele
where <math>b</math> is a certain base, <math>N_b</math> is the number of reads with base <math>b</math>


column4 := probabilty of sampling three derived allele


etc
=Ancestral fasta using multiple outgroups=
If you have outgroup species map to your reference genome and you want to use them to make a fasta file with ancestral alleles. You can use one or more outgroup individuals e.g. for human you could have a four outgroup bam file from a chimp, a bonobo, a gorrilla and a orangotan. Assuming you want to make a fasta file where the alleles is the same for all outgroup species then you can use a command like


==NB==
The generation of the .sfs file is done on a persite basis, whereas the optimization requires information for a region of the genome. The optimization will therefore use large amounts of memory.
The program defaults to 50megabase regions, and will loop over the genome using 50 megebase chunks. You can increase this adding -nSites 500000000. Which will then use 500megabase regions.
=Folded spectra=
<pre>
<pre>
Below is for version 0.556 and above
./angsd -b fourOutgroup.bamlist -out myFasta -doCounts 1 -snp_pval 0.01 -domaf 1 -domajorminor 1 -gl 2 -rmSNPs 1 -minind 4 -setMinDepthInd 10 -explode 1
</pre>
If you don't have the ancestral state, you can instead estimate the folded SFS. This is done by supplying the -anc with the reference and also supply -fold 1.
Then you should remember to change the second parameter to the emOptim2 function as the number of individuals instead of the number of chromosomes.


The above example would then be
<pre>
#first generate .sfs file
./angsd -bam smallBam.filelist -realSFS 1 -out small -anc  hg19.fa -GL 2 -fold 1 [options]
#now try the EM optimization with 4 threads
./emOptim2 small.sfs 10 -maxIter 100 -P 4 >small.sfs.em.ml
</pre>
</pre>


=Posterior per-site distributions of the sample allele frequency=
'''-b fourOutgroup.bamlist''' contains the bam files for four outgroup individuals.
This is assuming version 0.556 or higher.
If you supply a prior for the -realSFS analysis, the output of the .sfs file will no longer be loglikeratios of the different categories but log posteriors of the different categories.
 
=Format specification of binary .sfs file=
The information in this section is only usefull for people who wants to work with the "multisample GL"/"sample allele frequencies" for the individual sites.
 
Assuming 'n' individuals we have (2n+1) categories for each site, each value is encoded as ctype double which on 'all known' platforms occupies 8 bytes. The (2n+1) values are log scaled like ratios, which means that the most likely category will have a value of 0.


The information for the first site is the first (2n+1)sizeof(double) bytes. The information for the next site is the next (2n+1) bytes etc. The positions are given by the ascii file called '.sfs.pos'
'''-out myFasta''' the output name


'''-doCounts 1''' counts bases accross individuals to determine the concensus allele


<pre>
'''-snp_pval 0.01''' p-value threshold for defining a SNP. A lower threshold will need more evidence to call a SNPs.  
#sample code to read .sfs in c/c++ assuming 10 individuals
FILE *fp = fopen("mySFS.sfs","rb");
int nInd = 10;
double persite[2*nInd+1];
fread(persite,sizeof(double)*(2*nInd+1),1,fp);
for(int i=0;i<2*nind+1;i++)
  fprintf(stderr,"cat: %d = %f\n",i,persite[i]);
</pre>
 
* If the -fold 1 has been set, then the dimension is no longer 2*nInd+1 but nInd+1
* If the -pest parameter has been supplied the output is no longer like ratios to the most likely but log posts of the different categories.
=Theory=
<pre>
We will try to elaborate on the theory behind the methods. Below is only a preliminary version of the theory.
</pre>
This method is described in [[Nielsen2012]].
==SFS definition==
For 'n' diploid samples, the site frequency spectrum '''(SFS)''' is the (2n+1) vector containing the proportion of site carrying 'k'-mutations. This means that the first element in the SFS is the proportion of sites where we don't observe any mutations, The second value is the proportion of sites where we observe 1 mutations. The last value is the proportion of sites we only observe mutations. It follows that the first and last column are the invariable categories and assuming that the SFS contains relative frequencies the variability in the sample can be estimated by:


<math>pvar=1-sfs_0-sfs_{2n}=\sum_{i=1}^{2n-1}sfs_i</math>
''' -domaf 1''' estimate allele frequency (use to call SNPs) with that the major and minor alleles inferred from  data


==Sample allele frequency/Multisample GL==
'''-domajorminor 1''' infer the major and minor allele from data
By supplying the -realSFS 1, flag to angsd. Angsd will calculate the likelihood of the sample allele frequency for each site and dump these into the .sfs file. The likelihood of the sample allele frequency are in this context the likelihood of sampling k-derived alleles. This is estimated on the basis of the 10 possible genotype likelihoods for all individuals by summing over all combinations. This is done using the recursive algorithm described in [[Nielsen2012]]. This we write as <math>p(X^s\mid j)</math> meaning the likelihood of sampling j derived alleles for site s. And we calculate the folded as


<math>
'''-gl 2'''  use genotype likelihoods based on the GATK model
p_{fold}(x^s\mid j) =p(x^s\mid j) + p(x^s\mid 2n- j),\qquad j\in\{0,1,3,\ldots,n-1\},
</math>


<math>
''' -rmSNPs 1''' remove polymorphic sites. instead of keeping sites that are polymorphic then we remove them such that all outgroups have the same allele. 
p_{fold}(x^s\mid j) =2p(x^s\mid j) ,\qquad j=n
</math>


==Likelihood of the SFS==
''' -minind 4''' remove site where you don't have data for all four individuals
The likelihood of the sfs is then given as:
'''-setMinDepthInd 10''' require at least 10 read for each individual


<math>
''' -explode 1''' make the fasta file for the whole genome not just the chromosomes/scaffolds were you have data
p(X|\theta) = \prod_{s=0}^S\sum_{i=0}^{2n} p(X^s\mid i )\theta_i
</math>


Here <math>\theta</math> is our sfs. In the case of the folded sfs, we use n instead of 2n in the summation. We can find the MLE of the SFS by using either an BFGS approach that uses derivatives or by using en EM algorithm. Both is implemented in the emOptim2 program.
The sites that are polymorphic or do not have enough data will be labeled as 'N' in the fasta file

Latest revision as of 18:45, 6 February 2023

This option creates a fasta.gz file from a sequencing data file (BAM file). The function uses genome information in the BAM header to determine the length and chromosome names. For the sites without data an "N" is written.

<classdiagram type="dir:LR">

[Single BAM file{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);Highest EBD (-doFasta 3); write iupac (-doFasta 4)]

[sequence data]->doFasta[fasta file{bg:blue}]

</classdiagram>


<classdiagram type="dir:LR">

[Multiple BAM files{bg:orange}]->[Sequence data|Random base (-doFasta 1);Consensus base (-doFasta 2);write iupac (-doFasta 4)]

[sequence data]->doFasta[fasta file{bg:blue}]

</classdiagram>


This can be used as input for the ANGSD analysis:

  1. Error estimation
  2. ABBA-BABA

The iupac output code was kindly provided by Kristian Ullrich.

Brief Overview

./angsd -dofasta 	-> Tue Sep 26 17:02:07 2017
--------------
abcWriteFasta.cpp:
	-doFasta	0
	1: use a random (non N) base (needs -doCounts 1)
	2: use the most common (non N) base (needs -doCounts 1)
	3: use the base with highest ebd (under development) 
	4: output iupac codes (under development) 
	-basesPerLine	50	(Number of bases perline in output file)
	-explode	0	 print chromosome where we have no data (0:no,1:yes)
	-rmTrans	0	 remove transitions as different from -ref bases (0:no,1:yes)
	-ref	(null)	 reference fasta, only used with -rmTrans 1
	-seed	0	 use non random seed of value 1

This function will dump a fasta file, the full header information from the SAM/BAM file will be used. This means that a fasta will be generated for the entire chromosome even if '-r/-rf -sites' is used.

Options

-doFasta 1
sample a random base at each position. N's or filtered based are ignored. The "-doCounts 1" options for allele counts is needed in order to determine the most common base. If multiple individuals are used the four bases are counted across individuals.
-doFasta 2
use the most common base. In the case of ties a random base is chosen among the bases with the same maximum counts. N's or filtered based are ignored. The "-doCounts 1" options for allele counts is needed in order to determine the most common base. If multiple individuals are used the four bases are counted across individuals.
-doFasta 3
use the base with thie highest effective depth (EBD). This only works for one individual
-basesPerLine [INT]

Number of bases perline in output fasta file (default is 50)

-explode [INT]

0 (default) only output chromosomes with data. 1: write out all chromosomes

-rmTrans [INT]

0 (default) all sites are used. 1: Remove transition. Here transitions are determined using a fasta file such as a reference genome.

-ref [fileName]

a fasta file used to determine if a site is a transitions (needed when using -rmTrans 1 is used)

-seed [INT]

Use a seed in order to replicate results ( relevant when using random sample -dofasta 1 )

For filters see Filters

Output

Output is a fasta file, a normal looking fast file. Nothing special about this. For -doFasta 1, sometimes its big letters sometime small letters. This is due to the results being copied directly from the sequencing data. So small/big letters correspond to which strand for the original data. For the consensus fasta all letters are capital letters.

Example

Create a fasta file bases from a random samples of bases.

./angsd -i bams/smallNA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam -doFasta 1


EBD

For four bases we have 4 different EBD, each EBD is the product of the mapping quality and scores for the base under consideration. The EBD is the effective base depth, as defined by [1]:

where is a certain base, is the number of reads with base


Ancestral fasta using multiple outgroups

If you have outgroup species map to your reference genome and you want to use them to make a fasta file with ancestral alleles. You can use one or more outgroup individuals e.g. for human you could have a four outgroup bam file from a chimp, a bonobo, a gorrilla and a orangotan. Assuming you want to make a fasta file where the alleles is the same for all outgroup species then you can use a command like

./angsd -b fourOutgroup.bamlist -out myFasta -doCounts 1 -snp_pval 0.01 -domaf 1 -domajorminor 1 -gl 2 -rmSNPs 1 -minind 4 -setMinDepthInd 10 -explode 1

-b fourOutgroup.bamlist contains the bam files for four outgroup individuals.

-out myFasta the output name

-doCounts 1 counts bases accross individuals to determine the concensus allele

-snp_pval 0.01 p-value threshold for defining a SNP. A lower threshold will need more evidence to call a SNPs.

-domaf 1 estimate allele frequency (use to call SNPs) with that the major and minor alleles inferred from data

-domajorminor 1 infer the major and minor allele from data

-gl 2 use genotype likelihoods based on the GATK model

-rmSNPs 1 remove polymorphic sites. instead of keeping sites that are polymorphic then we remove them such that all outgroups have the same allele.

-minind 4 remove site where you don't have data for all four individuals

-setMinDepthInd 10 require at least 10 read for each individual

-explode 1 make the fasta file for the whole genome not just the chromosomes/scaffolds were you have data

The sites that are polymorphic or do not have enough data will be labeled as 'N' in the fasta file