ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

MsToGlf: Difference between revisions


Revision as of 23:24, 10 February 2014

For the Korneliussen2013 paper, we simulated sequencing data based on genotypes derived from ms/msms output. For this we used the msToGlf program found in the 'misc/' subfolder of the angsd tarball.

Sorry about the total lack of information for this program.

This program assumes diploid samples.

Brief Overview

 ./msToGlf 
Probs with args, supply -in -out
also -err -depth -depthFile -singleOut -regLen -nind
-in ms/msms outputfilename
-out prefix output filename
-regLen [int] Number of base pairs the ms/msms output is supposed to represent
-singleOut [zero or one] ms/msms can generate multiple replicates of the same scenario '-singleOut 1' will generate a single output file
-depth average sequencing depth
-nind Number of individuals in the ms/msms file
-err errorrate, a value 0.005 corresponds to a 0.5% errorrate.
-depthFile filename, This is useful if you want to force a different mean depth between individuals

Output format

The program writes a compressed binary file. It contains all 10 possible genotype likelihoods for each individual at every site. The genotypes are in the order AA,AC,AG,AT,CC,CG,CT,GG,GT,TT. Each value is stored as a C 'double', so a single site for N individuals takes 'N*10*sizeof(double)' bytes before compression.
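To make the layout concrete, here is a minimal Python sketch of a reader for such a file. It is not part of angsd: the function name read_glf is hypothetical, and the sketch assumes that the file has no header, that values are 8-byte doubles in the machine's native byte order, and that within a site the 10 values of individual 0 come first, then those of individual 1, and so on. Only the genotype order and the per-site size are taken from the description above.

```python
import gzip
import struct

# Genotype order as documented above.
GENOTYPES = ["AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"]

def read_glf(path, nind):
    """Read a .glf.gz file into a list of sites.

    Each site is a list of `nind` tuples holding 10 doubles.
    Assumptions (not confirmed by the documentation): no header,
    native byte order, individuals stored one after another per site.
    """
    site_fmt = struct.Struct("=%dd" % (nind * 10))  # N*10 doubles per site
    sites = []
    with gzip.open(path, "rb") as fh:
        while True:
            chunk = fh.read(site_fmt.size)
            if not chunk:
                break  # clean end of file
            if len(chunk) != site_fmt.size:
                raise ValueError("truncated site record")
            values = site_fmt.unpack(chunk)
            sites.append([values[i * 10:(i + 1) * 10] for i in range(nind)])
    return sites
```

With the examples further down, a call might look like read_glf("msoutputNoInvar.gl.glf.gz", 25). If the number of bytes in the file is not a multiple of nind*10*8, the sketch raises an error, which is a cheap check that -nind matches the simulation.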

Examples

Standard neutral model

This ms/msms command will generate haplotypes assuming human recombination/mutation rates for a 1 Mb region. We will make 50 haplotypes (25 diploids) and generate 14 replicates.

msms -ms 50 14 -t 900 -r 400 -oTPi 0.05 0.05 -oAFS >msoutput
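As a sanity check on '-t 900 -r 400', the scaled rates can be converted back to per-site rates. The sketch below uses the standard ms scaling (theta = 4*Ne*mu*L and rho = 4*Ne*r*L) and assumes an effective population size of Ne = 10000, which is our assumption for this example and is not stated for the neutral command. The helper name is hypothetical.

```python
def implied_per_site_rate(scaled_rate, ne, region_len):
    """Invert scaled_rate = 4 * Ne * rate * L to get the per-site rate."""
    return scaled_rate / (4.0 * ne * region_len)

NE = 10000              # assumed effective population size
REGION_LEN = 1000000    # 1 Mb region

mu = implied_per_site_rate(900, NE, REGION_LEN)   # from -t 900
rec = implied_per_site_rate(400, NE, REGION_LEN)  # from -r 400

print("mutation rate per site per generation: %.3g" % mu)
print("recombination rate per site per generation: %.3g" % rec)
```

Under that assumption the values come out at roughly 2.25e-8 for mutation and 1e-8 for recombination, which is in the range usually quoted for humans.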

Now we will simulate genotype likelihoods assuming an error rate of 1.5% and a sequencing depth of 8x, but only for the variable/informative sites contained in the ms output file.

./msToGlf -in msoutput -out msoutputNoInvar.gl -err 0.015 -depth 8 -nind 25 -singleOut 1

The output is a single, very small file called 'msoutputNoInvar.gl.glf.gz'.
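The model msToGlf uses internally is not described on this page. To illustrate what a genotype likelihood at a given error rate looks like, here is a minimal Python sketch under a simple textbook model, which is an assumption and not necessarily what msToGlf implements: each read samples one of the two alleles of the genotype with probability 1/2, and the base is read correctly with probability 1-err and as each of the three other bases with probability err/3. The function name is hypothetical; the genotype order is the one documented above.

```python
import math

GENOTYPES = ["AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"]

def genotype_log_likelihoods(reads, err):
    """Log-likelihoods of the 10 genotypes given the observed bases.

    `reads` is a string of observed bases at one site for one individual.
    Simple illustrative model only (see assumptions in the text).
    """
    result = []
    for genotype in GENOTYPES:
        loglike = 0.0
        for base in reads:
            p = 0.0
            for allele in genotype:  # each allele is sampled with prob 1/2
                p += 0.5 * ((1.0 - err) if base == allele else err / 3.0)
            loglike += math.log(p)
        result.append(loglike)
    return result

# Eight reads (depth 8) from a heterozygous A/C individual, 1.5% error.
ll = genotype_log_likelihoods("ACACACAC", 0.015)
best = GENOTYPES[max(range(10), key=lambda i: ll[i])]
print(best)
```

At low depth (for example 2x) the likelihoods of different genotypes are much closer together, which is why working with likelihoods rather than called genotypes matters.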

Now let's do a more realistic example, where we don't limit ourselves to the informative sites but also simulate all the invariable sites for our 1 Mb region.

./msToGlf -in msoutput -out msoutputWithInvar.gl -err 0.015 -depth 8 -nind 25 -singleOut 1 -regLen 1000000

These files can be fed into angsd as input using the -sim1 argument:

../angsd -sim1 msoutputNoInvar.gl.glf.gz -nind 25 -doMajorMinor 1 -doMaf 1

If you run a sample allele frequency based analysis ('-doSaf'), the ancestral states are assumed to be 'A'.

With Selection

The command below will generate 100 replicates of a scenario with strong positive selection in the center of a 1 Mb region, assuming 25 diploids.

msms -ms 50 100 -t 900 -r 400 -SAA 1000 -SaA 500 -N 10000 -SF 0 -Sp .5 -oTPi 0.05 0.05 -oAFS >msoutput

Now let's generate genotype likelihoods corresponding to the above command. This will take some time and use a considerable amount of disk space, because it is the full data for 100 replicates of a 1 Mb region (100 Mb in total) for 25 samples. Here we assume 2x data with 0.5% errors.

./msToGlf -in msoutput -out withselection.gl -err 0.005 -depth 2 -nind 25 -singleOut 0 -regLen 1000000
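To get a feeling for the disk space involved, the 'N*10*sizeof(double)' rule from the output format section can be applied to this run. The short Python sketch below assumes sizeof(double) is 8 bytes, which holds on common platforms, and gives the size before compression; the compressed files on disk will be smaller by an amount that depends on the data. The helper name is hypothetical.

```python
def glf_bytes(n_sites, n_ind, n_reps=1, bytes_per_double=8):
    """Uncompressed size in bytes: 10 doubles per individual per site."""
    return n_reps * n_sites * n_ind * 10 * bytes_per_double

per_site = glf_bytes(1, 25)                    # one site, 25 individuals
total = glf_bytes(1000000, 25, n_reps=100)     # 100 replicates of 1 Mb

print("bytes per site:", per_site)             # 2000
print("total uncompressed GB:", total / 1e9)   # 200.0
```

So each replicate is about 2 GB uncompressed, and the 100 replicates together are about 200 GB uncompressed.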

Two populations