software - User contributions [en]

NgsAdmixTutorial

2022-06-06T22:59:21Z

Thorfinn: /* The output */

This tutorial goes through a simple and a more complex example of how to use NGSadmix, including visualization of the results.

Users need to add <code>-printInfo 1</code> in order to get information about the sites retained for analysis.

==Example of NGSadmix - very small data set==

In our first example, we will infer admixture proportions for low depth NGS data using a small dataset from 30 human samples.

===Set paths to software===
Every time you open a new terminal window, set the paths to the required programs and to the data you will use, depending on where you installed them.

Set the path to NGSadmix, for example:

<code>NGSADMIX=~/Software/NGSadmix</code>

Test the path:

<code>ls $NGSADMIX</code>

===Create directories===
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

<code>cd ..</code>

===Set the paths to your local directories===

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the paths:

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>
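The directory creation and path checks above can also be done in one go. The following is a minimal sketch, assuming the same Demo/Data and Demo/Results layout; the checking loop is an illustrative addition and not part of NGSadmix:

```shell
# Create both working directories in one step (-p: no error if they already exist)
mkdir -p Demo/Data Demo/Results

IN_DIR=Demo/Data
OUT_DIR=Demo/Results

# Report whether each path points to an existing directory
for d in "$IN_DIR" "$OUT_DIR"; do
    if [ -d "$d" ]; then
        echo "OK: $d"
    else
        echo "MISSING: $d"
    fi
done
```

Because both variables hold relative paths, they only work from the folder in which Demo was created.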

===The test data===

We will use a heavily reduced data set:

*10 individuals from each population: 10 from Nigeria (YRI), 10 from Japan (JPT) and 10 with European ancestry (CEU).
*A very reduced genome: 30 x 100k random regions across the autosomes.
*Each individual is sequenced at 2-6X.

<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>Europeans (mostly of British ancestry)</td>
</tr>
<tr>
<td>JPT</td>
<td>East Asian - Japanese individuals</td>
</tr>
<tr>
<td>YRI</td>
<td>West African - Nigerian Yoruba individuals</td>
</tr>
</table>

You can either download the beagle input files or create them yourself from BAM files. Choose one of the two following subsections: 'Download the beagle genotype likelihood input file' or 'Create the beagle genotype likelihood input file using ANGSD'.
====Download the beagle genotype likelihood input file====
NGSadmix uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.
A file with the information about the population is also provided.

Download the files and move them to your input folder (for example, $IN_DIR):

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

====Create the beagle genotype likelihood input file using ANGSD====
First set the path to ANGSD (change ~/github/angsd/ to the path on your system):

<code> ANGSD=~/github/angsd/angsd </code>

Test that you have the right path and that ANGSD is installed:

<code> $ANGSD</code>

Download the small BAM files and extract them:

<code> wget http://popgen.dk/software/download/NGSadmix/data/smallerbams.tar </code>

<code> tar xf smallerbams.tar </code>

Make a list of the BAM file names:

<code>find smallerbams/ | grep bam$ > all.files</code>
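An equivalent way to build the list is to let <code>find</code> match on the file name directly. The sketch below uses a temporary directory with empty placeholder files (the names are made up for illustration and are not the tutorial BAM files) to show that only names ending in .bam are listed, so index files such as .bam.bai are skipped:

```shell
# Temporary directory with placeholder files (hypothetical names)
tmp=$(mktemp -d)
mkdir "$tmp/smallerbams"
touch "$tmp/smallerbams/ind1.bam" "$tmp/smallerbams/ind1.bam.bai" "$tmp/smallerbams/ind2.bam"

# List only regular files whose name ends in .bam, sorted for a stable order
find "$tmp/smallerbams" -type f -name '*.bam' | sort > "$tmp/all.files"

# Number of BAM files found
n_bam=$(wc -l < "$tmp/all.files" | tr -d ' ')
echo "$n_bam BAM files listed"
```

The order of the lines in the list matters, since the individuals in the beagle file and in the NGSadmix output follow the order of the BAM list.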

Calculate genotype likelihoods for polymorphic sites using ANGSD (see the ANGSD website for further information):

<code>$ANGSD -bam all.files -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 -minMapQ 30 -minQ 20 -minInd 25 -minMaf 0.05 -doGlf 2 -out $IN_DIR/Demo1input.gz -P 5 </code>

Create the population information file:

<code> paste -d " " <( cut -f 5 -d"." all.files ) <(cut -f 1 -d"." all.files | xargs -n1 basename) > $IN_DIR/Demo1pop.info </code>
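To see what the two <code>cut</code> calls in the command above extract, it can help to apply them to a single path. This is a sketch with a hypothetical file name in the style of the BAM list (the name itself is an assumption, chosen so that the population code is the fifth dot-separated field):

```shell
# Hypothetical path in the style of the BAM list (an assumption for illustration)
f="smallerbams/sampleA.mapped.ILLUMINA.bwa.CEU.low_coverage.bam"

# Field 5 when splitting on "." is the population code
pop=$(echo "$f" | cut -f 5 -d ".")

# Field 1 is the path up to the first "."; basename strips the directory part
id=$(basename "$(echo "$f" | cut -f 1 -d ".")")

echo "$pop $id"
```

The field numbers depend on the file naming scheme: if the directory part of the path contained a ".", the population code would no longer be in field 5.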

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>
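The counting step can be tried out on a small mock file before running it on the real data. The sketch below assumes the same layout as the population information file (population label, a space, then the individual ID); the IDs are invented:

```shell
tmp=$(mktemp -d)

# Mock population info file: population label, then individual ID (made-up IDs)
cat > "$tmp/pop.info" <<'EOF'
CEU ind1
YRI ind2
JPT ind3
CEU ind4
YRI ind5
CEU ind6
EOF

# Count individuals per population
cut -f 1 -d " " "$tmp/pop.info" | sort | uniq -c

# Extract a single count, e.g. the number of CEU individuals
n_ceu=$(cut -f 1 -d " " "$tmp/pop.info" | grep -c '^CEU$')
echo "CEU: $n_ceu"
```

<code>sort</code> is needed before <code>uniq -c</code> because <code>uniq</code> only merges adjacent identical lines.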

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain the marker name and the two alleles present at the locus, allele1 and allele2 (in beagle, A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods, three columns per individual: first the GL for the allele1 homozygote, then the GL for the heterozygote, and then the GL for the allele2 homozygote. Note that the three GL values of an individual sum to one at each site. This is just a normalization of the genotype likelihoods to avoid underflow problems in the beagle software; it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines of the input file. The number of lines is the number of loci for which there are GLs plus one (as the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
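The layout described above also gives the number of individuals, since each individual contributes three columns after the first three. The following sketch works on a small mock file (tab-separated, with invented values) rather than on Demo1input.gz, and counts sites, individuals, and GL triplets that do not sum to one:

```shell
tmp=$(mktemp -d)

# Mock beagle file with 2 individuals and 3 sites (values invented for illustration)
printf 'marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n' > "$tmp/mock.beagle"
printf '1_100\t0\t2\t0.90\t0.09\t0.01\t0.33\t0.33\t0.34\n' >> "$tmp/mock.beagle"
printf '1_200\t1\t3\t0.10\t0.80\t0.10\t0.98\t0.01\t0.01\n' >> "$tmp/mock.beagle"
printf '2_300\t0\t1\t0.25\t0.50\t0.25\t0.05\t0.15\t0.80\n' >> "$tmp/mock.beagle"
gzip "$tmp/mock.beagle"

# Sites = lines minus the header; individuals = (columns - 3) / 3
n_sites=$(gunzip -c "$tmp/mock.beagle.gz" | awk 'END { print NR - 1 }')
n_ind=$(gunzip -c "$tmp/mock.beagle.gz" | awk -F '\t' 'NR == 1 { print (NF - 3) / 3 }')

# Count GL triplets (over all sites and individuals) that do not sum to one
n_bad=$(gunzip -c "$tmp/mock.beagle.gz" | awk -F '\t' '
    NR > 1 {
        for (i = 4; i <= NF; i += 3) {
            s = $i + $(i + 1) + $(i + 2)
            if (s < 0.999 || s > 1.001) bad++
        }
    }
    END { print bad + 0 }')

echo "sites=$n_sites individuals=$n_ind bad_triplets=$n_bad"
```

The tolerance of 0.001 is an arbitrary choice to allow for rounding in the printed values.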

=== Let’s analyze the input file ===
To run an analysis of the GLs with NGSadmix, assuming the number of ancestral populations is K=3, type the following command:

<code>$NGSADMIX -likes $IN_DIR/Demo1input.gz -K 3 -minMaf 0.05 -seed 1 -o $OUT_DIR/Demo1NGSadmix</code>

For a reference on the parameters that can be used to run NGSadmix, please go to [http://www.popgen.dk/software/index.php/NgsAdmixv2#Parameters]

=== The output ===

The analysis performed by NGSadmix produces the following output files:

*'''Log likelihood of the estimates'''
::A .log file that summarizes the run.

::Take a look at the log file to find the log likelihood of the estimates achieved by NGSadmix, which is called “best like” in the file:

::<code>cat $OUT_DIR/Demo1NGSadmix.log</code>

*'''Estimated allele frequency'''

::A gzipped .fopt.gz file that contains an estimate of the allele frequency in each of the 3 assumed ancestral populations (there is a line for each locus).

::We can use this file to obtain the estimated allele frequency of the first 5 SNPs (one per line) of the three assumed ancestral populations, by typing the following command:

::<code>zcat $OUT_DIR/Demo1NGSadmix.fopt.gz | head -n 5</code>

*'''Estimated admixture proportions'''
::A .qopt file that contains an estimate of each individual's ancestry proportion from each of the three assumed ancestral populations (there is a line for each individual).

::To obtain the estimated admixture proportions for the first 5 individuals, type the following command:

::<code>head -n 5 $OUT_DIR/Demo1NGSadmix.qopt</code>
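Since each line of the .qopt file holds the ancestry proportions of one individual, the values on a line should sum to one. The sketch below checks this on a mock file for K=3 (the values are invented, and the space-separated layout is assumed) and also reports, for each individual, which column has the largest proportion:

```shell
tmp=$(mktemp -d)

# Mock .qopt file for K=3: one line per individual (values invented)
cat > "$tmp/mock.qopt" <<'EOF'
0.98 0.01 0.01
0.02 0.97 0.01
0.10 0.30 0.60
0.50 0.25 0.25
EOF

# Count rows whose proportions do not sum to one (within rounding)
n_bad=$(awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i
               if (s < 0.999 || s > 1.001) bad++ }
             END { print bad + 0 }' "$tmp/mock.qopt")

# For each individual, find the column (ancestral population) with the largest proportion
major=$(awk '{ best = 1; for (i = 2; i <= NF; i++) if ($i > $best) best = i
               printf "%s%d", (NR > 1 ? " " : ""), best }
             END { print "" }' "$tmp/mock.qopt")

echo "rows not summing to one: $n_bad"
echo "largest component per individual: $major"
```

The column numbers carry no meaning of their own: which ancestral population ends up in which column is arbitrary and can differ between runs.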

=== Plot the results in R ===

Follow these instructions to make a simple plot of the estimated admixture proportions for all individuals in R:

Make sure you are in the output directory:

<code>cd $OUT_DIR</code>

Type “R” in the terminal, press Enter, and paste the following code into R:

<pre>
# Read the population label of each individual
pop<-scan("poplabel",what="character")

# Read inferred admixture proportions file
q<-read.table("Demo1NGSadmix.qopt")

# Plot them (ordered by population)
ord = order(pop)
par(mar=c(7,4,1,1))
barplot(t(q)[,ord],col=c(2,1,3),names=pop[ord],las=2,ylab="Demo1 Admixture proportions",cex.names=0.75)
</pre>

The y-axis of the plot shows the admixture proportions, and the individuals in the sample are plotted along the x-axis.

Each color represents a different ancestral population.

Within each bar, the share of each color shows the proportion of that individual's ancestry assigned to the corresponding ancestral population.

The plot is sorted by the population of origin of each individual in the sample. It therefore shows blocks in which one color prevails, and that color represents the ancestral population to which the individuals in the block mainly belong.

NB: As you could tell from the number of loci included, the above analysis is based on data from very few loci (we deliberately analyzed only a small part of the genome so that the analysis would run fast).

==Example of NGSadmix with admixed populations==

Now that you know how to make input data for NGSadmix, how to run NGSadmix and what the output looks like, let's look at a more realistically sized dataset. More specifically, let's run NGSadmix on 50000 sites from the 1000 Genomes Project, using the following populations:

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>ASW</td>
<td>HapMap African Americans from SW US</td>
</tr>
<tr>
<td>CEU</td>
<td>European individuals</td>
</tr>
<tr>
<td>CHB</td>
<td>Han Chinese in Beijing</td>
</tr>
<tr>
<td>JPT</td>
<td>Japanese individuals</td>
</tr>
<tr>
<td>YRI</td>
<td>Yoruba individuals from Nigeria</td>
</tr>
<tr>
<td>MXL</td>
<td>Mexican individuals from LA California</td>
</tr>
:</table>

The input file Demo2input.gz, with genotype likelihoods from 100 individuals in .beagle format, and a file with population info are provided.
''Note: please make sure you have created and set up the working directories as indicated in the previous example.''

===Download Files===
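Download the files, then move the input file to your input folder (for example, $IN_DIR) and the population info file to your output folder ($OUT_DIR), where the later commands expect it: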
<pre>
wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info
mv Demo2input.gz $IN_DIR
mv Demo2pop.info $OUT_DIR</pre>

Make sure you are back in your original folder (not the $OUT_DIR folder).

===Take a quick look at the population data===
Make a summary by cutting the first column, sorting and counting:
<code>cut -f 1 -d " " $OUT_DIR/Demo2pop.info | sort | uniq -c</code>

===Run NGSadmix===
Run an analysis of the data with NGSadmix with K=3 (-K 3), using 1 CPU (-P 1) and only SNPs with a minor allele frequency above 0.05 (-minMaf 0.05), with the seed set to 21 (-seed 21) and the prefix of the output files set to Demo2NGSadmixK3 (-o). For a reference on the parameters that can be used to run NGSadmix, please go to [http://www.popgen.dk/software/index.php/NgsAdmixv2#Parameters]

<code>$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 3 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK3</code>

===Plot===
Plot the estimated admixture proportions by running the following code in R:

Make sure you are in the output directory:
<code>cd $OUT_DIR</code>

Type “R” in the terminal, press Enter, and paste the following code into R:

<pre>
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("Demo2NGSadmixK3.qopt")

# order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
# label each population block and draw a line at the end of each block
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)
</pre>

Note that, as in the previous example, the order of the individuals in the plot is not the same as in the .qopt file. Instead, to provide a better overview, the individuals have been ordered according to their population labels.

===Choose a different K===
Try running NGSadmix with K=4 and compare the plots. Since $IN_DIR and $OUT_DIR are relative paths, first quit R and go back to your original folder:

<code>$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 4 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK4</code>
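The same command can be repeated for several values of K with a loop. The sketch below is a dry run that only prints the commands; the paths are the tutorial defaults and the range of K values is an arbitrary choice for illustration:

```shell
# Tutorial default paths (adjust to your own setup)
NGSADMIX="$HOME/Software/NGSadmix"
IN_DIR=Demo/Data
OUT_DIR=Demo/Results

n_cmds=0
for K in 2 3 4 5; do
    # Dry run: remove the leading "echo" to actually run NGSadmix
    echo "$NGSADMIX" -likes "$IN_DIR/Demo2input.gz" -K "$K" -P 1 -minMaf 0.05 -seed 21 -o "$OUT_DIR/Demo2NGSadmixK$K"
    last_prefix="$OUT_DIR/Demo2NGSadmixK$K"
    n_cmds=$((n_cmds + 1))
done
echo "$n_cmds commands printed"
```

Each value of K gets its own output prefix, so the results of one run do not overwrite those of another and the plots can be compared afterwards.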

Go back to the output directory (<code>cd $OUT_DIR</code>), type “R” in the terminal, press Enter, and paste the following code into R:

<pre>
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("Demo2NGSadmixK4.qopt")

# order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=4")
# label each population block and draw a line at the end of each block
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)
</pre>

NgsAdmixTutorial

2022-06-06T22:55:35Z

Thorfinn: /* The test data */

We will go through a simple and more complex example on how to use NGSadmix with visualization of the data.

User need to to add -printInfo 1 in order to get information about sites retained for analysis.

==Example of NGSadmix - very small data set==

In our first example, we will infer admixture proportions for low depth NGS data using a small dataset from 30 human samples.

===Set paths to software===
Every time you open a new terminal window, set directories to all required programs and the data you will use depending on where you installed them

Set the path to NGSadmix, for example:

<code>NGSADMIX=~/Software/NGSadmix</code>

Test the link

<code>ls $NGSADMIX</code>

===Create directories===
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

<code>cd ..</code>

===Set the paths to your local directories===

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

===The test data===

We will use a very reduced data set.

*10 individuals from each population: 10 from Nigeria (YRI), 10 from Japan (JPT) and 10 with European Ancestry (CEU).
*a very reduced genome 30 x 100k random regions across the autosomes
*each individual is sequenced at 2-6X

<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>Europeans (mostly of British ancestry)</td>
</tr>
<tr>
<td>JPT</td>
<td>East Asian - Japanese individuals</td>
</tr>
<tr>
<td>YRI</td>
<td>West African - Nigerian Yoruba individuals</td>
</tr>
</table>

You can either download the beagle input files or create them yourself from bam files. Therefore you should choose either 'Download the beagle genotype likelihood input file' or 'Create the the beagle genotype likelihood input file using ANGSD'.
====Download the beagle genotype likelihood input file====
NGSadmix uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.
A file with the information about the population is also provided.

Download the files and move them to your input folder (for example, $IN_DIR):

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

====Create the the beagle genotype likelihood input file using ANGSD====
first set the path to ANGSD (change ~/github/angsd/ to the path on your system)

<code> ANGSD=~/github/angsd/angsd </code>

test that you have the rigth path and that ANGSD is installed

<code> $ANGSD</code>

Download small BAM files and extract files

<code> wget http://popgen.dk/software/download/NGSadmix/data/smallerbams.tar </code>

<code> tar xf smallerbams.tar </code>

Make a list with the bam file names

<code>find smallerbams/ | grep bam$ > all.files</code>

calculate genotype likelihoods for polymorphic sites using ANGSD (see ANGSD website for further information)

<code>$ANGSD -bam all.files -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 -minMapQ 30 -minQ 20 -minInd 25 -minMaf 0.05 -doGlf 2 -out $IN_DIR/Demo1input.gz -P 5 </code>

create population information file

<code> paste -d " " <( cut -f 5 -d"." all.files ) <(cut -f 1 -d"." all.files | xargs -n1 basename) > $IN_DIR/Demo1pop.info </code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Lets make a population label file and place it in the output directory

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines of the input file. The number of lines, indicates the number of loci for which there are GLs plus one (as the command includes the count of the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

=== Let’s analyze the input file ===
To run an analysis of the GLs with NGSadmix, assuming the number of ancestral populations is K=3, type the following command:

<code>$NGSADMIX -likes $IN_DIR/Demo1input.gz -K 3 -minMaf 0.05 -seed 1 -o $OUT_DIR/Demo1NGSadmix</code>

For a reference on the parameters that can be used to run NGSadmix, please go to [http://www.popgen.dk/software/index.php/NgsAdmixv2#Parameters]

=== The output ===

The analysis performed by NGSadmix produces 4 files:

*'''Log likelihood of the estimates'''
A .log file that summarizes the run.

Let’s take a look at the log file to determine the log likelihood of the estimates achieved by NGSadmix which is called the “best like” in the file:

<code>cat $OUT_DIR/Demo1NGSadmix.log</code>

*'''Estimated allele frequency'''

::A zipped .fopt file, that contains an estimate of the allele frequency in each of the 3 assumed ancestral populations (there is a line for each locus).

::We can use this file to obtain the estimated allele frequency of the first 5 SNPs (one per line) of the three assumed ancestral populations, by typing the following command:

::<code>zcat $OUT_DIR/Demo1NGSadmix.fopt.gz | head -n 5</code>

*'''Estimated admixture proportions'''
::A .qopt file, that contains an estimate of the individual's ancestry proportion from each of the three assumed ancestral populations (there is a line for each individual).

::To obtain the estimated admixture proportions for the first 5 individuals, type the following command:

::<code>head -n 5 Demo1NGSadmix.qopt</code>

*'''DemoNGSadmix.filter''' if the filter was used, it will show the sites that were left out.

::To see the header file, type:

::<code>head -n 5 $OUT_DIR/Demo1NGSadmix.filter</code>

::no lines means that all where used

=== Plot the results in R ===

Follow these instructions to make a simple plot of the estimated admixture proportions for all individuals in R:

Make sure you stand on the output directory

<code>cd $OUT_DIR</code>

Type “R” in the terminal and press enter and paste the following code into R:

<pre>
# Get ID and pop info for each individual
pop<-scan("poplabel",what="theFuck")

# Read inferred admixture proportions file
q<-read.table("Demo1NGSadmix.qopt")

# Plot them (ordered by population)
ord = order(pop)
par(mar=c(7,4,1,1))
barplot(t(q)[,ord],col=c(2,1,3),names=pop[ord],las=2,ylab="Demo1 Admixture proportions",cex.names=0.75)
</pre>

The y-axis of the plot show the admixture proportions and the individuals in the samples are plotted in the x-axis.

Each color represents a different ancestral population.

The proportion of each color shows the different admixture of the individuals for each ancestral population.

The plot is sorted by the population of origin of each individual in the sample, and therefore, it shows blocks with prevalence of a certain color, which represents the population to which each individual belongs.

NB As you could tell from the number of loci included in the analysis, the above analysis is based on data from very few loci (actually we on purpose only analyzed data from a small part of the genome to make sure the analysis ran fast).

==Example of NGSadmix with admixed populations==

Now that you know how to make input data to NGSadmix, how to run NGSadmix and what the output looks like, let's try to look at a more realistic size dataset. More specifically let's try to run NGSadmix on 50000 sites from the 1000 genomes project from the following populations:

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>ASW</td>
<td>HapMap African Americans from SW US</td>
</tr>
<tr>
<td>CEU</td>
<td>European individuals</td>
</tr>
<tr>
<td>CHB</td>
<td>Han Chinese in Beijing</td>
</tr>
<tr>
<td>JPT</td>
<td>Japanese individuals</td>
</tr>
<tr>
<td>YRI</td>
<td>Yoruba individuals from Nigeria</td>
</tr>
<tr>
<td>MXL</td>
<td>Mexican individuals from LA California</td>
</tr>
:</table>

The input file Demo2input.gz, with genotype likelihoods from 100 individuals in .beagle format, and a file with population info are provided.
''Note: please make sure you have created and set up the working directories as indicated in the previous tutorial.''

===Download Files===
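Download the files, then move the input file to your input folder ($IN_DIR) and the population info file to your output folder ($OUT_DIR), where the R code below reads it: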
<pre>
wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info
mv Demo2input.gz $IN_DIR
mv Demo2pop.info $OUT_DIR</pre>

Make sure you are back in your original folder (not the $OUT_DIR folder), because $IN_DIR and $OUT_DIR are relative paths.

===Take a quick look at the population data===
Make a summary by cutting the first column, sorting and counting:
<code>cut -f 1 -d " " $OUT_DIR/Demo2pop.info | sort | uniq -c</code>

===Run NGSadmix===
Run an analysis of the data with NGSadmix with K=3 (-K 3), using 1 CPU (-P 1), using only SNPs with minor allele frequency above 0.05 (-minMaf 0.05), with the seed set to 21 (-seed 21), and with the prefix of the output files set to Demo2NGSadmixK3. For a reference on the parameters that can be used to run NGSadmix, please go to [http://www.popgen.dk/software/index.php/NgsAdmixv2#Parameters]

<code>$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 3 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK3</code>
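When several values of K are of interest, the same command can be generated in a loop. The following is a minimal dry-run sketch: it only prints the command lines and does not run NGSadmix. The fallback values for the three variables are assumptions added so that the sketch runs on its own. In the tutorial they are already set.

```shell
# Hypothetical fallbacks so the sketch runs on its own; normally these are already set
NGSADMIX=${NGSADMIX:-NGSadmix}
IN_DIR=${IN_DIR:-Demo/Data}
OUT_DIR=${OUT_DIR:-Demo/Results}

# Build one command line per value of K (printed only, not executed)
cmds=$(for K in 2 3 4 5; do
    echo "$NGSADMIX -likes $IN_DIR/Demo2input.gz -K $K -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK$K"
done)
echo "$cmds"
```

Each run gets its own output prefix, so the results for different values of K do not overwrite each other.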

===Plot===
Plot the estimated admixture proportions by running the following code in R:

Make sure you are in the output directory:
<code>cd $OUT_DIR</code>

Type “R” in the terminal and press enter and paste the following code into R:

<pre>
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("Demo2NGSadmixK3.qopt")

# order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)
</pre>

Note that, as in the previous example, the order of the individuals in the plot is not the same as in the .qopt file. Instead, to provide a better overview, the individuals have been ordered according to their population labels.

===Choose a different K===
Now run NGSadmix with K=4 from your original folder (not $OUT_DIR) and compare the plots.

<code>$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 4 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK4</code>

Go back to the output directory (<code>cd $OUT_DIR</code>), type “R” in the terminal, press enter and paste the following code into R:

<pre>
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("Demo2NGSadmixK4.qopt")

# order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=4")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)
</pre>

NgsAdmixTutorial

2022-06-06T22:49:51Z

Thorfinn: /* Create directories */

We will go through a simple and a more complex example of how to use NGSadmix, with visualization of the results.

Users need to add -printInfo 1 in order to get information about the sites retained for analysis.

==Example of NGSadmix - very small data set==

In our first example, we will infer admixture proportions for low depth NGS data using a small dataset from 30 human samples.

===Set paths to software===
Every time you open a new terminal window, set the paths to all required programs and to the data you will use, depending on where you installed them.

Set the path to NGSadmix, for example:

<code>NGSADMIX=~/Software/NGSadmix</code>

Test the link

<code>ls $NGSADMIX</code>
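Note that <code>ls</code> also succeeds for a file that cannot be executed. The following is a minimal sketch of a stricter check, assuming a hypothetical helper function <code>check_exe</code> (not part of NGSadmix) and a throwaway script standing in for the real binary:

```shell
# Hypothetical helper: report whether a path points to an executable file
check_exe() {
    if [ -x "$1" ] && [ ! -d "$1" ]; then
        echo "ok: $1"
    else
        echo "not executable: $1"
        return 1
    fi
}

# Demonstration with a throwaway script standing in for NGSadmix
tmp=$(mktemp -d)
printf '#!/bin/sh\nexit 0\n' > "$tmp/fake_NGSadmix"
chmod +x "$tmp/fake_NGSadmix"
found=$(check_exe "$tmp/fake_NGSadmix")
missing=$(check_exe "$tmp/does_not_exist") || true
echo "$found"
echo "$missing"
rm -r "$tmp"
```

In the tutorial setting the call would be <code>check_exe $NGSADMIX</code>.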

===Create directories===
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

<code>cd ..</code>

===Set the paths to your local directories===

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

===The test data===

We will use a very reduced data set.

*10 individuals from each population: 10 from Nigeria (YRI), 10 from Japan (JPT) and 10 with European Ancestry (CEU).
*a very reduced genome 30 x 100k random regions across the autosomes
*each individual is sequenced at 2-6X

<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>Europeans (mostly of British ancestry)</td>
</tr>
<tr>
<td>JPT</td>
<td>East Asian - Japanese individuals</td>
</tr>
<tr>
<td>YRI</td>
<td>West African - Nigerian Yoruba individuals</td>
</tr>
</table>

You can either download the beagle input files or create them yourself from BAM files.
====Download the beagle genotype likelihood input file====
NGSadmix uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.
A file with the information about the population is also provided.

Download the files and move them to your input folder (for example, $IN_DIR):

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

====Create the beagle genotype likelihood input file using ANGSD====
First set the path to ANGSD (change ~/github/angsd/ to the path on your system):

<code> ANGSD=~/github/angsd/angsd </code>

Test that you have the right path and that ANGSD is installed:

<code> $ANGSD</code>

Download small BAM files and extract files

<code> wget http://popgen.dk/software/download/NGSadmix/data/smallerbams.tar </code>

<code> tar xf smallerbams.tar </code>

Make a list with the bam file names

<code>find smallerbams/ | grep bam$ > all.files</code>

Calculate genotype likelihoods for polymorphic sites using ANGSD (see the ANGSD website for further information):

<code>$ANGSD -bam all.files -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 -minMapQ 30 -minQ 20 -minInd 25 -minMaf 0.05 -doGlf 2 -out $IN_DIR/Demo1input.gz -P 5 </code>

Create the population information file:

<code> paste -d " " <( cut -f 5 -d"." all.files ) <(cut -f 1 -d"." all.files | xargs -n1 basename) > $IN_DIR/Demo1pop.info </code>
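To see what the command above does, here is a minimal sketch applied to a single file name. The file name is hypothetical and only follows the pattern the command relies on (population code in the fifth dot-separated field, sample ID at the start). Temporary files replace the <code><( )</code> process substitution so that the sketch also runs in a plain POSIX shell:

```shell
tmp=$(mktemp -d)
# Hypothetical BAM file name (an assumption, not taken from smallerbams.tar)
echo "smallerbams/smallNA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20111114.bam" > "$tmp/all.files"

# Fifth dot-separated field: the population code
cut -f 5 -d "." "$tmp/all.files" > "$tmp/pop"
# First dot-separated field without its directory: the sample ID
cut -f 1 -d "." "$tmp/all.files" | xargs -n1 basename > "$tmp/id"

# One line per individual: population, then sample ID
info=$(paste -d " " "$tmp/pop" "$tmp/id")
echo "$info"   # prints "CEU smallNA12763"
rm -r "$tmp"
```

If your BAM files follow a different naming scheme, the field numbers passed to <code>cut</code> need to be adapted.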

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines is the number of loci for which there are GLs, plus one (the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
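The layout described above (3 marker columns followed by 3 GL columns per individual, each triplet summing to one) can be checked with <code>awk</code>. The following is a minimal sketch on a toy beagle file with made-up values. It is not part of NGSadmix, and the toy file name is an assumption:

```shell
tmp=$(mktemp -d)
# Toy beagle file with 2 individuals and 2 sites (made-up values, tab-separated)
{
    printf 'marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n'
    printf '1_100\t0\t2\t0.333\t0.333\t0.334\t0.9\t0.1\t0.0\n'
    printf '1_200\t1\t3\t0.0\t0.5\t0.5\t0.25\t0.5\t0.25\n'
} | gzip -c > "$tmp/toy.beagle.gz"

# Header line: derive the number of individuals from the column count.
# Data lines: count sites and GL triplets that do not sum to 1.
summary=$(gunzip -c "$tmp/toy.beagle.gz" | awk -F '\t' '
    NR == 1 { n = (NF - 3) / 3; next }
    { sites++
      for (i = 4; i <= NF; i += 3) {
          s = $i + $(i + 1) + $(i + 2)
          if (s < 0.99 || s > 1.01) bad++
      }
    }
    END { print n, sites, bad + 0 }')
echo "$summary"   # individuals, sites, triplets not summing to 1
rm -r "$tmp"
```

For the tutorial data, replacing the toy file with <code>$IN_DIR/Demo1input.gz</code> should report 30 individuals if the file matches the description above.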

=== Let’s analyze the input file ===
To run an analysis of the GLs with NGSadmix, assuming the number of ancestral populations is K=3, type the following command:

<code>$NGSADMIX -likes $IN_DIR/Demo1input.gz -K 3 -minMaf 0.05 -seed 1 -o $OUT_DIR/Demo1NGSadmix</code>

For a reference on the parameters that can be used to run NGSadmix, please go to [http://www.popgen.dk/software/index.php/NgsAdmixv2#Parameters]

=== The output ===

The analysis performed by NGSadmix produces 4 files:

*'''Log likelihood of the estimates'''
::A .log file that summarizes the run.

::Let’s take a look at the log file to find the log likelihood of the estimates achieved by NGSadmix, which is called “best like” in the file:

::<code>cat $OUT_DIR/Demo1NGSadmix.log</code>

*'''Estimated allele frequency'''

::A gzipped .fopt file that contains an estimate of the allele frequency in each of the 3 assumed ancestral populations (one line per locus).

::We can use this file to obtain the estimated allele frequency of the first 5 SNPs (one per line) of the three assumed ancestral populations, by typing the following command:

::<code>zcat $OUT_DIR/Demo1NGSadmix.fopt.gz | head -n 5</code>

*'''Estimated admixture proportions'''
::A .qopt file that contains an estimate of each individual's ancestry proportion from each of the three assumed ancestral populations (one line per individual).

::To obtain the estimated admixture proportions for the first 5 individuals, type the following command:

::<code>head -n 5 $OUT_DIR/Demo1NGSadmix.qopt</code>

*'''Demo1NGSadmix.filter''' If the filter was used, this file shows the sites that were left out.

::To see the first lines of the file, type:

::<code>head -n 5 $OUT_DIR/Demo1NGSadmix.filter</code>

::If the file has no lines, all sites were used.

=== Plot the results in R ===

Follow these instructions to make a simple plot of the estimated admixture proportions for all individuals in R:

Make sure you are in the output directory:

<code>cd $OUT_DIR</code>

Type “R” in the terminal and press enter and paste the following code into R:

<pre>
# Read the population label of each individual
pop<-scan("poplabel",what="character")

# Read inferred admixture proportions file
q<-read.table("Demo1NGSadmix.qopt")

# Plot them (ordered by population)
ord = order(pop)
par(mar=c(7,4,1,1))
barplot(t(q)[,ord],col=c(2,1,3),names=pop[ord],las=2,ylab="Demo1 Admixture proportions",cex.names=0.75)
</pre>

The y-axis of the plot shows the admixture proportions, and the individuals in the sample are plotted along the x-axis.

Each color represents a different ancestral population.

Within each bar, the share of a color is the estimated proportion of that individual's ancestry from the corresponding ancestral population.

The plot is sorted by the population of origin of each individual. It therefore shows blocks dominated by a single color, which corresponds to the population those individuals belong to.

NB: As you could tell from the number of loci included, the above analysis is based on data from very few loci (we deliberately analyzed data from only a small part of the genome so that the analysis would run fast).

==Example of NGSadmix with admixed populations==

Now that you know how to make input data for NGSadmix, how to run NGSadmix and what the output looks like, let's look at a more realistically sized dataset. More specifically, let's run NGSadmix on 50000 sites from the 1000 Genomes Project from the following populations:

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>ASW</td>
<td>HapMap African Americans from SW US</td>
</tr>
<tr>
<td>CEU</td>
<td>European individuals</td>
</tr>
<tr>
<td>CHB</td>
<td>Han Chinese in Beijing</td>
</tr>
<tr>
<td>JPT</td>
<td>Japanese individuals</td>
</tr>
<tr>
<td>YRI</td>
<td>Yoruba individuals from Nigeria</td>
</tr>
<tr>
<td>MXL</td>
<td>Mexican individuals from LA California</td>
</tr>
:</table>

The input file Demo2input.gz, with genotype likelihoods from 100 individuals in .beagle format, and a file with population info are provided.
''Note: please make sure you have created and set up the working directories as indicated in the previous tutorial.''

===Download Files===
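Download the files, then move the input file to your input folder ($IN_DIR) and the population info file to your output folder ($OUT_DIR), where the R code below reads it: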
<pre>
wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info
mv Demo2input.gz $IN_DIR
mv Demo2pop.info $OUT_DIR</pre>

Make sure you are back in your original folder (not the $OUT_DIR folder), because $IN_DIR and $OUT_DIR are relative paths.

===Take a quick look at the population data===
Make a summary by cutting the first column, sorting and counting:
<code>cut -f 1 -d " " $OUT_DIR/Demo2pop.info | sort | uniq -c</code>

===Run NGSadmix===
Run an analysis of the data with NGSadmix with K=3 (-K 3), using 1 CPU (-P 1), using only SNPs with minor allele frequency above 0.05 (-minMaf 0.05), with the seed set to 21 (-seed 21), and with the prefix of the output files set to Demo2NGSadmixK3. For a reference on the parameters that can be used to run NGSadmix, please go to [http://www.popgen.dk/software/index.php/NgsAdmixv2#Parameters]

<code>$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 3 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK3</code>

===Plot===
Plot the estimated admixture proportions by running the following code in R:

Make sure you are in the output directory:
<code>cd $OUT_DIR</code>

Type “R” in the terminal and press enter and paste the following code into R:

<pre>
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("Demo2NGSadmixK3.qopt")

# order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)
</pre>

Note that, as in the previous example, the order of the individuals in the plot is not the same as in the .qopt file. Instead, to provide a better overview, the individuals have been ordered according to their population labels.

===Choose a different K===
Now run NGSadmix with K=4 from your original folder (not $OUT_DIR) and compare the plots.

<code>$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 4 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK4</code>

Go back to the output directory (<code>cd $OUT_DIR</code>), type “R” in the terminal, press enter and paste the following code into R:

<pre>
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("Demo2NGSadmixK4.qopt")

# order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=4")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)
</pre>

NgsAdmixTutorial

2020-05-02T03:18:26Z

Thorfinn:

We will go through a simple and a more complex example of how to use NGSadmix, with visualization of the results.

Users need to add -printInfo 1 in order to get information about the sites retained for analysis.

==Example of NGSadmix - very small data set==

In our first example, we will infer admixture proportions for low depth NGS data using a small dataset from 30 human samples.

===Set paths to software===
Every time you open a new terminal window, set the paths to all required programs and to the data you will use, depending on where you installed them.

Set the path to NGSadmix, for example:

<code>NGSADMIX=~/Software/NGSadmix</code>

Test the link

<code>ls $NGSADMIX</code>

===Create directories===
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

<code>cd ..</code>

===Set the paths to your local directories===

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

===The test data===

We will use a very reduced data set.

*10 individuals from each population: 10 from Nigeria (YRI), 10 from Japan (JPT) and 10 with European Ancestry (CEU).
*a very reduced genome 30 x 100k random regions across the autosomes
*each individual is sequenced at 2-6X

<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>Europeans (mostly of British ancestry)</td>
</tr>
<tr>
<td>JPT</td>
<td>East Asian - Japanese individuals</td>
</tr>
<tr>
<td>YRI</td>
<td>West African - Nigerian Yoruba individuals</td>
</tr>
</table>

You can either download the beagle input files or create them yourself from BAM files.
====Download the beagle genotype likelihood input file====
NGSadmix uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.
A file with the information about the population is also provided.

Download the files and move them to your input folder (for example, $IN_DIR):

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

====Create the beagle genotype likelihood input file using ANGSD====
First set the path to ANGSD (change ~/github/angsd/ to the path on your system):

<code> ANGSD=~/github/angsd/angsd </code>

Test that you have the right path and that ANGSD is installed:

<code> $ANGSD</code>

Download small BAM files and extract files

<code> wget http://popgen.dk/software/download/NGSadmix/data/smallerbams.tar </code>

<code> tar xf smallerbams.tar </code>

Make a list with the bam file names

<code>find smallerbams/ | grep bam$ > all.files</code>

Calculate genotype likelihoods for polymorphic sites using ANGSD (see the ANGSD website for further information):

<code>$ANGSD -bam all.files -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 -minMapQ 30 -minQ 20 -minInd 25 -minMaf 0.05 -doGlf 2 -out $IN_DIR/Demo1input.gz -P 5 </code>

Create the population information file:

<code> paste -d " " <( cut -f 5 -d"." all.files ) <(cut -f 1 -d"." all.files | xargs -n1 basename) > $IN_DIR/Demo1pop.info </code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines is the number of loci for which there are GLs, plus one (the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

=== Let’s analyze the input file ===
To run an analysis of the GLs with NGSadmix, assuming the number of ancestral populations is K=3, type the following command:

<code>$NGSADMIX -likes $IN_DIR/Demo1input.gz -K 3 -minMaf 0.05 -seed 1 -o $OUT_DIR/Demo1NGSadmix</code>

For a reference on the parameters that can be used to run NGSadmix, please go to [http://www.popgen.dk/software/index.php/NgsAdmixv2#Parameters]

=== The output ===

The analysis performed by NGSadmix produces 4 files:

*'''Log likelihood of the estimates'''
::A .log file that summarizes the run.

::Let’s take a look at the log file to find the log likelihood of the estimates achieved by NGSadmix, which is called “best like” in the file:

::<code>cat $OUT_DIR/Demo1NGSadmix.log</code>

*'''Estimated allele frequency'''

::A gzipped .fopt file that contains an estimate of the allele frequency in each of the 3 assumed ancestral populations (one line per locus).

::We can use this file to obtain the estimated allele frequency of the first 5 SNPs (one per line) of the three assumed ancestral populations, by typing the following command:

::<code>zcat $OUT_DIR/Demo1NGSadmix.fopt.gz | head -n 5</code>

*'''Estimated admixture proportions'''
::A .qopt file that contains an estimate of each individual's ancestry proportion from each of the three assumed ancestral populations (one line per individual).

::To obtain the estimated admixture proportions for the first 5 individuals, type the following command:

::<code>head -n 5 $OUT_DIR/Demo1NGSadmix.qopt</code>

*'''Demo1NGSadmix.filter''' If the filter was used, this file shows the sites that were left out.

::To see the first lines of the file, type:

::<code>head -n 5 $OUT_DIR/Demo1NGSadmix.filter</code>

::If the file has no lines, all sites were used.
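Because each line of the .qopt file holds the admixture proportions of one individual, the values on a line should add up to one. The following is a minimal sketch of that check on a toy file with made-up values (not NGSadmix output). It prints the number of individuals, the number of columns (K) and the number of rows that fail the check:

```shell
tmp=$(mktemp -d)
# Toy stand-in for a .qopt file with K=3 (made-up values)
printf '0.90 0.05 0.05\n0.10 0.85 0.05\n0.02 0.03 0.95\n' > "$tmp/toy.qopt"

# Sum each row and count rows whose proportions do not sum to 1 (within rounding)
qcheck=$(awk '{ k = NF; s = 0
               for (i = 1; i <= NF; i++) s += $i
               if (s < 0.999 || s > 1.001) bad++ }
         END { print NR, k, bad + 0 }' "$tmp/toy.qopt")
echo "$qcheck"   # individuals, K, rows not summing to 1
rm -r "$tmp"
```

For the run above, the first number should equal the number of individuals in the input (30) and the second should equal K (3).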

=== Plot the results in R ===

Follow these instructions to make a simple plot of the estimated admixture proportions for all individuals in R:

Make sure you are in the output directory:

<code>cd $OUT_DIR</code>

Type “R” in the terminal and press enter and paste the following code into R:

<pre>
# Read the population label of each individual
pop<-scan("poplabel",what="character")

# Read inferred admixture proportions file
q<-read.table("Demo1NGSadmix.qopt")

# Plot them (ordered by population)
ord = order(pop)
par(mar=c(7,4,1,1))
barplot(t(q)[,ord],col=c(2,1,3),names=pop[ord],las=2,ylab="Demo1 Admixture proportions",cex.names=0.75)
</pre>

The y-axis of the plot shows the admixture proportions, and the individuals in the sample are plotted along the x-axis.

Each color represents a different ancestral population.

Within each bar, the share of a color is the estimated proportion of that individual's ancestry from the corresponding ancestral population.

The plot is sorted by the population of origin of each individual. It therefore shows blocks dominated by a single color, which corresponds to the population those individuals belong to.

NB: As you could tell from the number of loci included, the above analysis is based on data from very few loci (we deliberately analyzed data from only a small part of the genome so that the analysis would run fast).

==Example of NGSadmix with admixed populations==

Now that you know how to make input data for NGSadmix, how to run NGSadmix and what the output looks like, let's look at a more realistically sized dataset. More specifically, let's run NGSadmix on 50000 sites from the 1000 Genomes Project from the following populations:

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>ASW</td>
<td>HapMap African Americans from SW US</td>
</tr>
<tr>
<td>CEU</td>
<td>European individuals</td>
</tr>
<tr>
<td>CHB</td>
<td>Han Chinese in Beijing</td>
</tr>
<tr>
<td>JPT</td>
<td>Japanese individuals</td>
</tr>
<tr>
<td>YRI</td>
<td>Yoruba individuals from Nigeria</td>
</tr>
<tr>
<td>MXL</td>
<td>Mexican individuals from LA California</td>
</tr>
:</table>

The input file Demo2input.gz, with genotype likelihoods from 100 individuals in .beagle format, and a file with population info are provided.
''Note: please make sure you have created and set up the working directories as indicated in the previous tutorial.''

===Download Files===
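Download the files, then move the input file to your input folder ($IN_DIR) and the population info file to your output folder ($OUT_DIR), where the R code below reads it: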
<pre>
wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info
mv Demo2input.gz $IN_DIR
mv Demo2pop.info $OUT_DIR</pre>

Make sure you are back in your original folder (not the $OUT_DIR folder), because $IN_DIR and $OUT_DIR are relative paths.

===Take a quick look at the population data===
Make a summary by cutting the first column, sorting and counting:
<code>cut -f 1 -d " " $OUT_DIR/Demo2pop.info | sort | uniq -c</code>

===Run NGSadmix===
Run an analysis of the data with NGSadmix with K=3 (-K 3), using 1 CPU (-P 1), using only SNPs with minor allele frequency above 0.05 (-minMaf 0.05), with the seed set to 21 (-seed 21), and with the prefix of the output files set to Demo2NGSadmixK3. For a reference on the parameters that can be used to run NGSadmix, please go to [http://www.popgen.dk/software/index.php/NgsAdmixv2#Parameters]

<code>$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 3 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK3</code>

===Plot===
Plot the estimated admixture proportions by running the following code in R:

Make sure you are in the output directory:
<code>cd $OUT_DIR</code>

Type “R” in the terminal and press enter and paste the following code into R:

<pre>
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("Demo2NGSadmixK3.qopt")

# order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)
</pre>

Note that, as in the previous example, the order of the individuals in the plot is not the same as in the .qopt file. Instead, to provide a better overview, the individuals have been ordered according to their population labels.

===Choose a different K===
Now run NGSadmix with K=4 from your original folder (not $OUT_DIR) and compare the plots.

<code>$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 4 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK4</code>

Go back to the output directory (<code>cd $OUT_DIR</code>), type “R” in the terminal, press enter and paste the following code into R:

<pre>
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("Demo2NGSadmixK4.qopt")

# order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=4")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)
</pre>

NgsRelate

2018-10-05T13:59:09Z

Thorfinn:

= NEW VERSION =
For the NEW version of ngsRelate that coestimates relatedness and inbreeding go to this link https://github.com/ANGSD/NgsRelate

= OLD VERSION =
For the old version please use this link: http://www.popgen.dk/software/index.php?title=NgsRelate&oldid=694

NgsRelate

2018-10-05T13:58:33Z

Thorfinn: Replaced content with "For the NEW version of ngsRelate that coestimates relatedness and inbreeding go to this link https://github.com/ANGSD/NgsRelate For the old version please use this link: h..."

For the NEW version of ngsRelate that coestimates relatedness and inbreeding go to this link https://github.com/ANGSD/NgsRelate
For the old version please use this link: http://www.popgen.dk/software/index.php?title=NgsRelate&oldid=694

NgsRelate

2018-10-03T12:48:00Z

Thorfinn:

NgsRelate

2018-10-03T12:46:54Z

Thorfinn: /* Run example 3: using frequencies from 1000genomes vcf files */

NgsRelate

2018-10-03T12:46:26Z

Thorfinn: /* Run examples */

This page refers to the new v2 of ngsRelate, which coestimates relatedness and inbreeding. For the old version, please use this link: http://www.popgen.dk/software/index.php?title=NgsRelate&oldid=694

=Brief description=
This page contains information about the program NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To infer relatedness you need population allele frequencies and genotype likelihoods. Both can be obtained, e.g., with the program ANGSD, as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

The method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on github: https://github.com/ANGSD/NgsRelate. On a Linux or Mac system with git and g++ installed, NgsRelate can be downloaded and installed as follows:
<pre>

git clone https://github.com/SAMtools/htslib

git clone https://github.com/ANGSD/ngsRelate

cd htslib/;make;cd ../ngsRelate;make HTSSRC=../htslib/
</pre>

=Run examples=
Below are examples of how NgsRelate can be used to coestimate relatedness and inbreeding from NGS data.
<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq > newres
</pre>

The output should be a file called newres that contains relatedness estimates for all pairs among the 6 individuals.

== Run example 0: using only VCF/BCF files==

== Run example 1: using only NGS data==
Assume we have a file containing paths to 100 BAM/CRAM files, one line per BAM/CRAM file. Then we can use ANGSD to estimate frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for the NgsRelate program as follows:
<pre>
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq
</pre>
Once we have these files we can use NgsRelate to estimate relatedness between any pairs of individuals. E.g. if we want to estimate relatedness between the first two individuals (numbered from 0, so 0 and 1) we can do it using the following command:
<pre>
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f" and the numbers of the two individuals after the options "-a" and "-b". If -a and -b are not specified, NgsRelate will loop through all pairs of individuals in the input file.
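When all pairs are analysed, the number of output lines grows quadratically with the number of individuals: n individuals give n(n-1)/2 pairs. The following is a minimal sketch of that arithmetic, using a hypothetical helper function <code>n_pairs</code> that is not part of NgsRelate:

```shell
# Hypothetical helper: number of distinct pairs among n individuals, n * (n - 1) / 2
n_pairs() {
    echo $(( $1 * ($1 - 1) / 2 ))
}

n_pairs 6     # prints 15
n_pairs 100   # prints 4950
```

For the 100 individuals in this example, a run without -a and -b would therefore produce estimates for 4950 pairs.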

'''NEW''': Note that you can also input a file with the IDs of the individuals (one ID per line), in the same order as in the file 'filelist' used to make the genotype likelihoods. If you do, the output will also contain these IDs and not just the numbers of the samples (you can actually just use that exact file, although the IDs then tend to be a bit long). This is done with the optional flag -z followed by the filename.

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run ngsRelate using population frequencies from Europe. We will extract the frequencies from the 1000 Genomes Project VCF files.

<pre>
#Assuming that we have per-chromosome files called: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump output in EUR_AF/*.frq
#We only use diallelic sites, we extract CHROM,POS,REF,ALT,EUR_AF tags from the vcf
#We then pulled out the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz

#Then we extract and match the freqs from the reference population with the sites where we had data. The parser expects a header, so make a dummy file

echo "header" |gzip -c >new
cat EUR_AF/ALL.frq.gz >>new
ngsRelate extract_freq new angsdput.glf.pos.gz >myfreq

</pre>

=Output format=
NEW: Example of output of analysis of two samples run without the optional -z:

<pre>
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

And the same analysis run with the optional flag -z followed by name of file with IDs (where the first two IDs are S1 and S42):
<pre>
a b ida idb nSites k0 k1 k2 loglh nIter coverage
0 1 S1 S42 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

Example of output with 6 samples:
<pre>
cat newres
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 2 1121594 0.448790 0.548298 0.002912 -1666189.356801 25 0.808822
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
0 4 1135509 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
0 5 1043719 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
1 2 1118945 0.006249 0.993750 0.000001 -1580989.961356 13 0.806912
1 3 1129152 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
1 4 1132778 1.000000 0.000000 0.000000 -1744055.210286 -1 0.816887
1 5 1041298 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
2 3 1122253 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
2 4 1125729 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
2 5 1035731 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
3 4 1136091 0.566552 0.433054 0.000393 -1743752.158759 36 0.819276
3 5 1046456 0.265831 0.482954 0.251214 -1467343.087558 11 0.754637
4 5 1047977 0.004653 0.995347 0.000000 -1473415.049864 94 0.755734
</pre>

The first two columns identify the two individuals used in the analysis. The third column contains the number of sites used in the analysis. The following three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The seventh column is the log-likelihood of the ML estimate. The eighth column is the number of iterations of the maximization algorithm that was used to find the MLE, and finally the ninth column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space; we therefore test the boundary values explicitly and output them if they have the highest likelihood).
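The three relatedness coefficients are often summarised as a single kinship coefficient, theta = k1/4 + k2/2. This is the standard relationship between k1, k2 and kinship, not a column printed in this output format. A minimal sketch with awk, assuming the new whitespace-separated output run without -z (so k1 and k2 are columns 5 and 6); the function name is hypothetical:

```shell
# Hypothetical helper: print "a b theta" for every pair in an NgsRelate
# result file in the new format (header line, then a b nSites k0 k1 k2 ...).
kinship() {
    # skip the header; theta = k1/4 + k2/2
    awk 'NR > 1 { printf "%s %s %.6f\n", $1, $2, $5 / 4 + $6 / 2 }' "$1"
}

# Usage: kinship newres
```

If the file was produced with -z, the two extra ID columns shift k1 and k2 to columns 7 and 8, so the field numbers would need adjusting.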

For OLD versions of the program (from before June 28 2017):
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column identifies the pair of individuals used in the analysis. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The fifth column is the log-likelihood of the ML estimate. The sixth column is the number of iterations of the maximization algorithm that was used to find the MLE, and finally the seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space; we therefore test the boundary values explicitly and output them if they have the highest likelihood).
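Old-style result files can be made easier to read into R or awk by splitting the "(a,b)" label into two columns. A minimal sketch with sed; the function name is hypothetical, and the result still lacks the nSites column, so it is not identical to the new output format:

```shell
# Hypothetical helper: rewrite old-style "(a,b)" pair labels as two columns
# and rename the "Pair" header to "a b". All other columns are left untouched.
old_to_columns() {
    sed -e '1s/^Pair/a b/' \
        -e 's/^(\([0-9][0-9]*\),\([0-9][0-9]*\))/\1 \2/' "$1"
}

# Usage: old_to_columns res > res.columns
```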

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with allele frequencies for the sites that have genotype likelihoods.
The genotype likelihood file needs to contain a line for each site with 3 values for each individual (one log transformed genotype likelihood for each of the 3 possible genotypes encoded as 'double's) and it needs to be in binary format and gz compressed.
The frequency file needs to contain a line per site with the allele frequency of the site in it.
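Since both files are organised per site, a quick consistency check is to compare the number of sites implied by the genotype likelihood file with the number of lines in the frequency file. A minimal sketch, assuming the uncompressed file is a flat array of 8-byte doubles with 3 values per individual per site and no header; the function name is hypothetical:

```shell
# Hypothetical helper: number of sites implied by a binary glf.gz file,
# assuming 3 doubles (8 bytes each) per individual per site.
glf_sites() {
    glf=$1
    nind=$2
    # size of the uncompressed stream in bytes
    bytes=$(gzip -dc "$glf" | wc -c)
    echo $(( bytes / (nind * 3 * 8) ))
}

# Usage: compare against the number of lines in the frequency file, e.g.
#   [ "$(glf_sites angsdput.glf.gz 6)" -eq "$(wc -l < freq)" ] && echo "site counts match"
```

A mismatch usually means the frequency file still contains a header line or was made from a different set of sites.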

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
Important recent changes:

#We have made -s 1 default (flips all allele frequencies from freq to 1-freq), since this is needed in almost all analyses. If you do not want the frequencies flipped then simply run the program with -s 0
#The output format has been changed to a more R-friendly format (no ":" and no parentheses)
#The option -z has been added so one can get the sample IDs printed in the output (if one runs the program with -z idfilename)
#We have fixed -m 1 so the estimates can no longer be negative
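The flip described for -s 1 is simply freq to 1-freq at every site. As a minimal sketch of the same transformation done by hand on a frequency file (one frequency per line); the function name is hypothetical, and this illustrates the arithmetic rather than how NgsRelate implements it internally:

```shell
# Hypothetical helper: replace every frequency f in a one-column file by 1 - f.
flip_freq() {
    awk '{ printf "%.6f\n", 1 - $1 }' "$1"
}

# Usage: flip_freq freq > freq.flipped
```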

See github for the full change log.

=Bugs/Improvements=
-Make a better output message if files don't exist when using the extract_freq option

NgsRelate

2018-10-03T12:45:56Z

Thorfinn: /* Run examples */

This page refers to the new v2 of ngsRelate, which coestimates relatedness and inbreeding. For the old version please use this link: http://www.popgen.dk/software/index.php?title=NgsRelate&oldid=694

=Brief description=
This page contains information about the program NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. These can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

Method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on github: https://github.com/ANGSD/NgsRelate. On a Linux or Mac system with git and g++ installed, NgsRelate can be downloaded and installed as follows:
<pre>

git clone https://github.com/SAMtools/htslib

git clone https://github.com/ANGSD/ngsRelate

cd htslib/;make;cd ../ngsRelate;make HTSSRC=../htslib/
</pre>

=Run examples=
Below are examples of how NgsRelate can be used to coestimate relatedness and inbreeding from NGS data.
<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq > newres
</pre>

The output should be a file called newres that contains relatedness estimates for all pairs among the 6 individuals. A copy of this file can be found here http://www.popgen.dk/ida/NgsRelateExampleData/web/output/newres.
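With 6 individuals there are 6*5/2 = 15 pairs, so the result file should contain a header line plus 15 rows. A minimal sketch of that sanity check, assuming the new output format with a single header line; the function name is hypothetical:

```shell
# Hypothetical check: does a result file hold one row per pair of n individuals?
check_pairs() {
    res=$1
    n=$2
    expected=$(( n * (n - 1) / 2 ))
    # count data rows, skipping the header line
    found=$(awk 'NR > 1 { c++ } END { print c + 0 }' "$res")
    if [ "$found" -eq "$expected" ]; then
        echo "ok: $expected pairs"
    else
        echo "mismatch: expected $expected pairs, found $found"
    fi
}

# Usage: check_pairs newres 6
```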

== Run example 0: using only VCF/BCF files==

== Run example 1: using only NGS data==
Assume we have a file containing paths to 100 BAM/CRAM files, one line per BAM/CRAM file. Then we can use ANGSD to estimate frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for the NgsRelate program as follows:
<pre>
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq
</pre>
Once we have these files we can use NgsRelate to estimate relatedness between any pairs of individuals. E.g. if we want to estimate relatedness between the first two individuals (numbered from 0, so 0 and 1) we can do it using the following command:
<pre>
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f" and the indices of the two individuals after the options "-a" and "-b". If -a and -b are not specified, NgsRelate will loop through all pairs of individuals in the input file.

'''NEW''': Note that you can also input a file with the IDs of the individuals (one ID per line), in the same order as in the file 'filelist' used to make the genotype likelihoods. If you do, the output will also contain these IDs and not just the numbers of the samples (one can actually just use that exact file; however, the IDs then tend to be a bit long). This is done with the optional flag -z followed by the filename.

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run ngsRelate using population frequencies from Europe. We will extract the frequencies from the 1000 Genomes Project VCF files.

<pre>
#Assuming that we have per-chromosome files called: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump the output in EUR_AF/*.frq (the directory must exist before we write to it)
mkdir -p EUR_AF
#We only use diallelic SNPs and extract the CHROM, POS, REF, ALT and EUR_AF fields from the vcf
#We then keep only the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz

#Then we extract and match the freqs from the reference population with the sites where we had data. The parser expects a header, so make a dummy file

echo "header" |gzip -c >new
cat EUR_AF/ALL.frq.gz >>new
ngsRelate extract_freq new angsdput.glf.pos.gz >myfreq

</pre>

=Output format=
NEW: Example of output of analysis of two samples run without the optional -z:

<pre>
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

And the same analysis run with the optional flag -z followed by name of file with IDs (where the first two IDs are S1 and S42):
<pre>
a b ida idb nSites k0 k1 k2 loglh nIter coverage
0 1 S1 S42 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

Example of output with 6 samples:
<pre>
cat newres
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 2 1121594 0.448790 0.548298 0.002912 -1666189.356801 25 0.808822
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
0 4 1135509 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
0 5 1043719 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
1 2 1118945 0.006249 0.993750 0.000001 -1580989.961356 13 0.806912
1 3 1129152 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
1 4 1132778 1.000000 0.000000 0.000000 -1744055.210286 -1 0.816887
1 5 1041298 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
2 3 1122253 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
2 4 1125729 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
2 5 1035731 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
3 4 1136091 0.566552 0.433054 0.000393 -1743752.158759 36 0.819276
3 5 1046456 0.265831 0.482954 0.251214 -1467343.087558 11 0.754637
4 5 1047977 0.004653 0.995347 0.000000 -1473415.049864 94 0.755734
</pre>

The first two columns identify the two individuals used in the analysis. The third column contains the number of sites used in the analysis. The following three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The seventh column is the log-likelihood of the ML estimate. The eighth column is the number of iterations of the maximization algorithm that was used to find the MLE, and finally the ninth column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space; we therefore test the boundary values explicitly and output them if they have the highest likelihood).
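Pairs whose estimate was taken from the boundary of the parameter space can be listed by filtering on nIter. A minimal sketch with awk, assuming the new output format run without -z (so nIter is column 8); the function name is hypothetical:

```shell
# Hypothetical helper: list the pairs (a b) for which nIter is -1, i.e. where
# a boundary value had a higher likelihood than the EM result.
boundary_pairs() {
    awk 'NR > 1 && $8 == -1 { print $1, $2 }' "$1"
}

# Usage: boundary_pairs newres
```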

For OLD versions of the program (from before June 28 2017):
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column identifies the pair of individuals used in the analysis. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The fifth column is the log-likelihood of the ML estimate. The sixth column is the number of iterations of the maximization algorithm that was used to find the MLE, and finally the seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space; we therefore test the boundary values explicitly and output them if they have the highest likelihood).

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with allele frequencies for the sites that have genotype likelihoods.
The genotype likelihood file needs to contain a line for each site with 3 values for each individual (one log transformed genotype likelihood for each of the 3 possible genotypes encoded as 'double's) and it needs to be in binary format and gz compressed.
The frequency file needs to contain a line per site with the allele frequency of the site in it.

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
Important recent changes:

#We have made -s 1 default (flips all allele frequencies from freq to 1-freq), since this is needed in almost all analyses. If you do not want the frequencies flipped then simply run the program with -s 0
#The output format has been changed to a more R-friendly format (no ":" and no parentheses)
#The option -z has been added so one can get the sample IDs printed in the output (if one runs the program with -z idfilename)
#We have fixed -m 1 so the estimates can no longer be negative

See github for the full change log.

=Bugs/Improvements=
-Make a better output message if files don't exist when using the extract_freq option

NgsRelate

2018-10-03T12:44:24Z

Thorfinn: /* Run example 0: using only BCF files */

This page refers to the new v2 of ngsRelate, which coestimates relatedness and inbreeding. For the old version please use this link: http://www.popgen.dk/software/index.php?title=NgsRelate&oldid=694

=Brief description=
This page contains information about the program NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. These can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

Method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on github: https://github.com/ANGSD/NgsRelate. On a Linux or Mac system with git and g++ installed, NgsRelate can be downloaded and installed as follows:
<pre>

git clone https://github.com/SAMtools/htslib

git clone https://github.com/ANGSD/ngsRelate

cd htslib/;make;cd ../ngsRelate;make HTSSRC=../htslib/
</pre>

=Run examples=
Below are two examples of how NgsRelate can be used to estimate relatedness from NGS data. Note that to be able to run all steps of the examples you need to have the programs ANGSD and PLINK installed and you also need to download large data files from both HapMap3 and 1000 Genomes webpages. Furthermore, the examples take several hours to run all in all. They are therefore just meant as illustrations of how NgsRelate can be run. '''If you want to quickly try out NgsRelate, e.g. to check if your installation works, you can download the final input data for NgsRelate used in the very last command in run example 2 here: http://www.popgen.dk/ida/NgsRelateExampleData/web/input/. Using that data you can try out NgsRelate by running that last command, i.e.'''

<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq > newres
</pre>

The output should be a file called newres that contains relatedness estimates for all pairs among the 6 individuals. A copy of this file can be found here http://www.popgen.dk/ida/NgsRelateExampleData/web/output/newres.

== Run example 0: using only VCF/BCF files==

== Run example 1: using only NGS data==
Assume we have a file containing paths to 100 BAM/CRAM files, one line per BAM/CRAM file. Then we can use ANGSD to estimate frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for the NgsRelate program as follows:
<pre>
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq
</pre>
Once we have these files we can use NgsRelate to estimate relatedness between any pairs of individuals. E.g. if we want to estimate relatedness between the first two individuals (numbered from 0, so 0 and 1) we can do it using the following command:
<pre>
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f" and the indices of the two individuals after the options "-a" and "-b". If -a and -b are not specified, NgsRelate will loop through all pairs of individuals in the input file.

'''NEW''': Note that you can also input a file with the IDs of the individuals (one ID per line), in the same order as in the file 'filelist' used to make the genotype likelihoods. If you do, the output will also contain these IDs and not just the numbers of the samples (one can actually just use that exact file; however, the IDs then tend to be a bit long). This is done with the optional flag -z followed by the filename.

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run ngsRelate using population frequencies from Europe. We will extract the frequencies from the 1000 Genomes Project VCF files.

<pre>
#Assuming that we have per-chromosome files called: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump the output in EUR_AF/*.frq (the directory must exist before we write to it)
mkdir -p EUR_AF
#We only use diallelic SNPs and extract the CHROM, POS, REF, ALT and EUR_AF fields from the vcf
#We then keep only the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz

#Then we extract and match the freqs from the reference population with the sites where we had data. The parser expects a header, so make a dummy file

echo "header" |gzip -c >new
cat EUR_AF/ALL.frq.gz >>new
ngsRelate extract_freq new angsdput.glf.pos.gz >myfreq

</pre>

=Output format=
NEW: Example of output of analysis of two samples run without the optional -z:

<pre>
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

And the same analysis run with the optional flag -z followed by name of file with IDs (where the first two IDs are S1 and S42):
<pre>
a b ida idb nSites k0 k1 k2 loglh nIter coverage
0 1 S1 S42 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

Example of output with 6 samples:
<pre>
cat newres
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 2 1121594 0.448790 0.548298 0.002912 -1666189.356801 25 0.808822
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
0 4 1135509 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
0 5 1043719 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
1 2 1118945 0.006249 0.993750 0.000001 -1580989.961356 13 0.806912
1 3 1129152 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
1 4 1132778 1.000000 0.000000 0.000000 -1744055.210286 -1 0.816887
1 5 1041298 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
2 3 1122253 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
2 4 1125729 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
2 5 1035731 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
3 4 1136091 0.566552 0.433054 0.000393 -1743752.158759 36 0.819276
3 5 1046456 0.265831 0.482954 0.251214 -1467343.087558 11 0.754637
4 5 1047977 0.004653 0.995347 0.000000 -1473415.049864 94 0.755734
</pre>

The first two columns identify the two individuals used in the analysis. The third column contains the number of sites used in the analysis. The following three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The seventh column is the log-likelihood of the ML estimate. The eighth column is the number of iterations of the maximization algorithm that was used to find the MLE, and finally the ninth column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space; we therefore test the boundary values explicitly and output them if they have the highest likelihood).

For OLD versions of the program (from before June 28 2017):
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column identifies the pair of individuals used in the analysis. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The fifth column is the log-likelihood of the ML estimate. The sixth column is the number of iterations of the maximization algorithm that was used to find the MLE, and finally the seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space; we therefore test the boundary values explicitly and output them if they have the highest likelihood).

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with allele frequencies for the sites that have genotype likelihoods.
The genotype likelihood file needs to contain a line for each site with 3 values for each individual (one log transformed genotype likelihood for each of the 3 possible genotypes encoded as 'double's) and it needs to be in binary format and gz compressed.
The frequency file needs to contain a line per site with the allele frequency of the site in it.

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
Important recent changes:

#We have made -s 1 default (flips all allele frequencies from freq to 1-freq), since this is needed in almost all analyses. If you do not want the frequencies flipped then simply run the program with -s 0
#The output format has been changed to a more R-friendly format (no ":" and no parentheses)
#The option -z has been added so one can get the sample IDs printed in the output (if one runs the program with -z idfilename)
#We have fixed -m 1 so the estimates can no longer be negative

See github for the full change log.

=Bugs/Improvements=
-Make a better output message if files don't exist when using the extract_freq option

NgsRelate

2018-10-03T12:43:56Z

Thorfinn:

This page refers to the new v2 of ngsRelate, which coestimates relatedness and inbreeding. For the old version please use this link: http://www.popgen.dk/software/index.php?title=NgsRelate&oldid=694

=Brief description=
This page contains information about the program NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. These can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

Method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on github: https://github.com/ANGSD/NgsRelate. On a Linux or Mac system with git and g++ installed, NgsRelate can be downloaded and installed as follows:
<pre>

git clone https://github.com/SAMtools/htslib

git clone https://github.com/ANGSD/ngsRelate

cd htslib/;make;cd ../ngsRelate;make HTSSRC=../htslib/
</pre>

=Run examples=
Below are examples of how NgsRelate can be used to estimate relatedness from NGS data. Note that to be able to run all steps of the examples you need to have the programs ANGSD and PLINK installed and you also need to download large data files from both the HapMap3 and 1000 Genomes webpages. Furthermore, the examples take several hours to run in total. They are therefore just meant as illustrations of how NgsRelate can be run. '''If you want to quickly try out NgsRelate, e.g. to check if your installation works, you can download the final input data for NgsRelate used in the very last command in run example 2 here: http://www.popgen.dk/ida/NgsRelateExampleData/web/input/. Using that data you can try out NgsRelate by running that last command, i.e.'''

<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq > newres
</pre>

The output should be a file called newres that contains relatedness estimates for all pairs among the 6 individuals. A copy of this file can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/newres.

== Run example 0: using only BCF files==

== Run example 1: using only NGS data==
Assume we have a file containing paths to 100 BAM/CRAM files, one line per BAM/CRAM file. Then we can use ANGSD to estimate frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for the NgsRelate program as follows:
<pre>
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq
</pre>
Once we have these files we can use NgsRelate to estimate relatedness between any pair of individuals. E.g. if we want to estimate relatedness between the first two individuals (numbered from 0, so 0 and 1) we can do it using the following command:
<pre>
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f" and the numbers of the two individuals after the options "-a" and "-b". If -a and -b are not specified, NgsRelate will loop through all pairs of individuals in the input file.
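As a rough illustration of what looping through all pairs means, the sketch below enumerates the 0-based index pairs (a, b) with a < b for n individuals. For n = 6 this gives 6*5/2 = 15 pairs, which matches the 15 rows of the six-sample output in the Output format section. The function name list_pairs is purely illustrative and is not part of NgsRelate.

```shell
#!/bin/sh
# Illustrative helper (not part of NgsRelate): print every pair of
# 0-based sample indices (a, b) with a < b for n individuals.
list_pairs() {
    n=$1
    a=0
    while [ "$a" -lt "$n" ]; do
        b=$((a + 1))
        while [ "$b" -lt "$n" ]; do
            echo "$a $b"
            b=$((b + 1))
        done
        a=$((a + 1))
    done
}

# Count the pairs for 6 individuals
list_pairs 6 | wc -l
```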

'''NEW''': Note that you can also input a file with the IDs of the individuals (one ID per line) in the same order as in the file 'filelist' used to make the genotype likelihoods. If you do, the output will also contain these IDs and not just the numbers of the samples (one can actually just use that exact file, however the IDs then tend to be a bit long). This can be done with the optional flag -z followed by the filename.
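One possible way to get shorter IDs is to strip the directory and file extension from each path in the file list while keeping the line order. The sketch below assumes paths ending in .bam or .cram; the file names toy_filelist and ids.txt and the sample paths are made up for illustration.

```shell
#!/bin/sh
# Hypothetical file list with one BAM/CRAM path per line
# (created here only so the sketch is self-contained).
printf '%s\n' /data/bams/S1.bam /data/bams/S42.bam /data/crams/S7.cram > toy_filelist

# Strip the directory and the .bam/.cram extension, keeping the line order.
sed -e 's#.*/##' -e 's#\.bam$##' -e 's#\.cram$##' toy_filelist > ids.txt

cat ids.txt
```

The resulting file could then be given to NgsRelate with -z ids.txt.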

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run ngsRelate using population frequencies from Europe. We will extract the frequencies from the 1000 Genomes Project VCF files.

<pre>
#Assuming that we have perchr called: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump output in EUR_AF/*.frq
#We only use diallelic sites, we extract CHROM,POS,REF,ALT,EUR_AF tags from the vcf
#We then pulled out the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz

#Then we extract and match the freqs from the reference population with the sites where we had data. The parser expects a header, so make a dummy file

echo "header" |gzip -c >new
cat EUR_AF/ALL.frq.gz >>new
ngsRelate extract_freq new angsdput.glf.pos.gz >myfreq

</pre>
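The dummy-header step in the block above relies on the fact that a file made of several concatenated gzip members decompresses to the concatenation of their contents. A small self-contained check of that behaviour, using made-up toy file names and a made-up frequency line:

```shell
#!/bin/sh
# Toy demonstration: concatenated gzip members decompress as one stream.
printf '1\t100\tA\tG\t0.25\n' | gzip -c > toy_body.gz
echo "header" | gzip -c > toy_new.gz
cat toy_body.gz >> toy_new.gz

# The header line comes first, followed by the original content.
gunzip -c toy_new.gz
```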

=Output format=
NEW: Example of output of analysis of two samples run without the optional -z:

<pre>
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

And the same analysis run with the optional flag -z followed by name of file with IDs (where the first two IDs are S1 and S42):
<pre>
a b ida idb nSites k0 k1 k2 loglh nIter coverage
0 1 S1 S42 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

Example of output with 6 samples:
<pre>
cat newres
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 2 1121594 0.448790 0.548298 0.002912 -1666189.356801 25 0.808822
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
0 4 1135509 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
0 5 1043719 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
1 2 1118945 0.006249 0.993750 0.000001 -1580989.961356 13 0.806912
1 3 1129152 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
1 4 1132778 1.000000 0.000000 0.000000 -1744055.210286 -1 0.816887
1 5 1041298 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
2 3 1122253 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
2 4 1125729 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
2 5 1035731 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
3 4 1136091 0.566552 0.433054 0.000393 -1743752.158759 36 0.819276
3 5 1046456 0.265831 0.482954 0.251214 -1467343.087558 11 0.754637
4 5 1047977 0.004653 0.995347 0.000000 -1473415.049864 94 0.755734
</pre>

The first two columns indicate which two individuals were used for the analysis. The third column contains the number of sites used in the analysis. The following three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The seventh column is the log of the likelihood of the ML estimate. The eighth column is the number of iterations of the maximization algorithm that was used to find the MLE, and finally the ninth column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) is above the threshold (default is 0.05 but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM-algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space, and we therefore test the boundary values explicitly and output these if they have the highest likelihood).
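The three coefficients can be summarised in a single number per pair. Under the standard definitions (given here as background, not taken from this page), the coefficient of relatedness is r = k1/2 + k2 and the kinship coefficient is r/2. A minimal awk sketch, assuming the nine-column output produced without -z and using three rows copied from the six-sample example into a toy file:

```shell
#!/bin/sh
# Toy input: header plus three rows copied from the six-sample example.
cat > toy_newres <<'EOF'
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
3 5 1046456 0.265831 0.482954 0.251214 -1467343.087558 11 0.754637
EOF

# r = k1/2 + k2 (columns 5 and 6 when -z is not used); kinship = r/2.
awk 'NR == 1 { print "a", "b", "r", "kinship"; next }
     { r = $5 / 2 + $6; printf "%s %s %.4f %.4f\n", $1, $2, r, r / 2 }' toy_newres > toy_r.txt

cat toy_r.txt
```

If the output was produced with -z, the two extra ID columns shift k1 and k2 to columns 7 and 8, so the column numbers in the awk program would need to be adjusted.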

For OLD versions of the program (from before June 28 2017):
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column indicates which two individuals were used for the analysis. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The fifth column is the log of the likelihood of the ML estimate. The sixth column is the number of iterations of the maximization algorithm that was used to find the MLE, and finally the seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) is above the threshold (default is 0.05 but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM-algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space, and we therefore test the boundary values explicitly and output these if they have the highest likelihood).

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with frequencies for the sites for which there are genotype likelihoods.
The genotype likelihood file needs to contain a line for each site with 3 values for each individual (one log transformed genotype likelihood for each of the 3 possible genotypes encoded as 'double's) and it needs to be in binary format and gz compressed.
The frequency file needs to contain a line per site with the allele frequency of the site in it.
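Since the two input files must describe the same sites, a quick sanity check can be useful before running the program. The sketch below assumes the layout described above (3 values per individual per site, each an 8-byte double), so the uncompressed size of the genotype likelihood file should be nSites * nInd * 3 * 8 bytes. The toy files and variable names are made up for illustration.

```shell
#!/bin/sh
# Toy data (hypothetical): 2 individuals, 3 sites.
# 3 sites * 2 individuals * 3 genotypes * 8 bytes = 144 bytes of doubles.
head -c 144 /dev/zero | gzip -c > toy.glf.gz
printf '0.10\n0.25\n0.40\n' > toy_freq

n_ind=2
# Number of sites implied by the uncompressed size of the likelihood file
glf_bytes=$(gunzip -c toy.glf.gz | wc -c | tr -d ' ')
n_sites_glf=$((glf_bytes / (n_ind * 3 * 8)))
# Number of sites in the frequency file (one line per site)
n_sites_freq=$(wc -l < toy_freq | tr -d ' ')

if [ "$n_sites_glf" -eq "$n_sites_freq" ]; then
    echo "OK: $n_sites_glf sites in both files"
else
    echo "Mismatch: $n_sites_glf sites in glf, $n_sites_freq in freq"
fi
```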

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
Important recent changes:

#We have made -s 1 default (flips all allele frequencies from freq to 1-freq), since this is needed in almost all analyses. If you do not want the frequencies flipped then simply run the program with -s 0
#The output format has been changed to a more R-friendly format (no ":" and no parentheses)
#The option -z has been added so that the sample IDs can be printed in the output (run the program with -z idfilename)
#We have fixed -m 1 so the estimates can no longer be negative

See github for the full change log.

=Bugs/Improvements=
-Print a better error message if files do not exist when using the extract_freq option

NgsRelate

2018-10-03T12:42:11Z

Thorfinn:

This page refers to the new v2 of ngsRelate, which co-estimates relatedness and inbreeding. For the old version please use this link: http://www.popgen.dk/software/index.php?title=NgsRelate&oldid=694

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. These can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

The method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on github: https://github.com/ANGSD/NgsRelate. On a Linux or Mac system with git and g++ installed, NgsRelate can be downloaded and installed as follows:
<pre>

git clone https://github.com/SAMtools/htslib

git clone https://github.com/ANGSD/ngsRelate

cd htslib/;make;cd ../ngsRelate;make HTSSRC=../htslib/
</pre>

=Run examples=
Below are examples of how NgsRelate can be used to estimate relatedness from NGS data. Note that to be able to run all steps of the examples you need to have the programs ANGSD and PLINK installed and you also need to download large data files from both the HapMap3 and 1000 Genomes webpages. Furthermore, the examples take several hours to run in total. They are therefore just meant as illustrations of how NgsRelate can be run. '''If you want to quickly try out NgsRelate, e.g. to check if your installation works, you can download the final input data for NgsRelate used in the very last command in run example 2 here: http://www.popgen.dk/ida/NgsRelateExampleData/web/input/. Using that data you can try out NgsRelate by running that last command, i.e.'''

<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq > newres
</pre>

The output should be a file called newres that contains relatedness estimates for all pairs among the 6 individuals. A copy of this file can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/newres.

== Run example 0: using only BCF files==

== Run example 1: using only NGS data==
Assume we have a file containing paths to 100 BAM/CRAM files, one line per BAM/CRAM file. Then we can use ANGSD to estimate frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for the NgsRelate program as follows:
<pre>
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq
</pre>
Once we have these files we can use NgsRelate to estimate relatedness between any pair of individuals. E.g. if we want to estimate relatedness between the first two individuals (numbered from 0, so 0 and 1) we can do it using the following command:
<pre>
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f" and the numbers of the two individuals after the options "-a" and "-b". If -a and -b are not specified, NgsRelate will loop through all pairs of individuals in the input file.

'''NEW''': Note that you can also input a file with the IDs of the individuals (one ID per line) in the same order as in the file 'filelist' used to make the genotype likelihoods. If you do, the output will also contain these IDs and not just the numbers of the samples (one can actually just use that exact file, however the IDs then tend to be a bit long). This can be done with the optional flag -z followed by the filename.

== Run example 2: using NGS data with population frequencies estimated from genetic data from PLINK files ==
In this example we show how you can estimate relatedness between a number of individuals which you have NGS data from (in bam files) using genetic data from PLINK files for frequency estimation.
Assume the individuals we want to estimate relatedness from are from the population called LWK and assume we have files with genetic data from individuals from LWK as well as other populations in binary PLINK format (e.g. hapmap3_r2_b36_fwd.consensus.qc.polyHg19.*) and a file, LWK.fam, with the IDs of the LWK individuals in this dataset. Then using PLINK we can produce allele frequency information in a format that NgsRelate can use as follows:
<pre>
### extract individuals from LWK from huge binary plink file
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb

### calculate frequencies for this population
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
</pre>
Afterwards we can use ANGSD to calculate genotype likelihoods for the sites for which we have frequency information, as follows:
<pre>
### extract the chr,pos,major,minor information about the sites we have frequency info from into a file
### (so we can extract data from these sites from the NGS data files)
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt

### index this file for angsd
./angsd sites index forAngsd.txt

### calculate genotype likelihoods for the six individuals for the sites we have frequency info on based on the bam files
### (assuming the paths to the bam files are listed in the file 'list'):
./angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
### this generates the output files angsdput.glf.gz and angsdput.glf.pos.gz.
</pre>
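To make the column selection in the cut step concrete: a PLINK .bim file has six tab-separated columns (chromosome, variant ID, genetic distance, position, allele 1, allele 2), so cut -f1,4-6 keeps the chromosome, the position and the two alleles. A toy sketch with a single made-up variant line and made-up file names:

```shell
#!/bin/sh
# Toy .bim line (hypothetical variant): chr, ID, cM, position, allele1, allele2
printf '1\trs0000001\t0\t752566\tG\tA\n' > toy.bim

# Keep chromosome, position and the two alleles, as in the step above.
cut -f1,4-6 toy.bim > toy_forAngsd.txt

cat toy_forAngsd.txt
```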

Finally we can use NgsRelate to estimate relatedness for the six individuals from which we have NGS data in bam files:
<pre>
### extract the frequencies and sync it to the angsd output
./ngsrelate extract_freq_bim angsdput.glf.pos.gz hapmap3Hg19LWK.bim LWKsub.frq >freq

### run ngsrelate
./ngsrelate -g angsdput.glf.gz -n 6 -f freq >newres
</pre>
The final relatedness estimates will then be available in the file called "newres" which can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/.

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run ngsRelate using population frequencies from Europe. We will extract the frequencies from the 1000 Genomes Project VCF files.

<pre>
#Assuming that we have perchr called: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump output in EUR_AF/*.frq
#We only use diallelic sites, we extract CHROM,POS,REF,ALT,EUR_AF tags from the vcf
#We then pulled out the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz

#Then we extract and match the freqs from the reference population with the sites where we had data. The parser expects a header, so make a dummy file

echo "header" |gzip -c >new
cat EUR_AF/ALL.frq.gz >>new
ngsRelate extract_freq new angsdput.glf.pos.gz >myfreq

</pre>

=Output format=
NEW: Example of output of analysis of two samples run without the optional -z:

<pre>
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

And the same analysis run with the optional flag -z followed by name of file with IDs (where the first two IDs are S1 and S42):
<pre>
a b ida idb nSites k0 k1 k2 loglh nIter coverage
0 1 S1 S42 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

Example of output with 6 samples:
<pre>
cat newres
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 2 1121594 0.448790 0.548298 0.002912 -1666189.356801 25 0.808822
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
0 4 1135509 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
0 5 1043719 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
1 2 1118945 0.006249 0.993750 0.000001 -1580989.961356 13 0.806912
1 3 1129152 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
1 4 1132778 1.000000 0.000000 0.000000 -1744055.210286 -1 0.816887
1 5 1041298 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
2 3 1122253 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
2 4 1125729 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
2 5 1035731 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
3 4 1136091 0.566552 0.433054 0.000393 -1743752.158759 36 0.819276
3 5 1046456 0.265831 0.482954 0.251214 -1467343.087558 11 0.754637
4 5 1047977 0.004653 0.995347 0.000000 -1473415.049864 94 0.755734
</pre>

The first two columns indicate which two individuals were used for the analysis. The third column contains the number of sites used in the analysis. The following three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The seventh column is the log of the likelihood of the ML estimate. The eighth column is the number of iterations of the maximization algorithm that was used to find the MLE, and finally the ninth column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) is above the threshold (default is 0.05 but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM-algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space, and we therefore test the boundary values explicitly and output these if they have the highest likelihood).
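When scanning a large result file it can help to list the pairs for which a boundary solution was reported, i.e. rows where nIter is -1. A minimal awk sketch, assuming the nine-column output without -z (nIter in column 8) and using four rows copied from the six-sample example into a toy file:

```shell
#!/bin/sh
# Toy input: header plus four rows copied from the six-sample example.
cat > toy_newres2 <<'EOF'
a b nSites k0 k1 k2 loglh nIter coverage
0 2 1121594 0.448790 0.548298 0.002912 -1666189.356801 25 0.808822
0 4 1135509 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
1 5 1041298 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
4 5 1047977 0.004653 0.995347 0.000000 -1473415.049864 94 0.755734
EOF

# Print the sample indices of pairs with a boundary solution (nIter == -1).
awk 'NR > 1 && $8 == -1 { print $1, $2 }' toy_newres2 > toy_boundary.txt

cat toy_boundary.txt
```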

For OLD versions of the program (from before June 28 2017):
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column indicates which two individuals were used for the analysis. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The fifth column is the log of the likelihood of the ML estimate. The sixth column is the number of iterations of the maximization algorithm that was used to find the MLE, and finally the seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) is above the threshold (default is 0.05 but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM-algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space, and we therefore test the boundary values explicitly and output these if they have the highest likelihood).

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with frequencies for the sites for which there are genotype likelihoods.
The genotype likelihood file needs to contain a line for each site with 3 values for each individual (one log transformed genotype likelihood for each of the 3 possible genotypes encoded as 'double's) and it needs to be in binary format and gz compressed.
The frequency file needs to contain a line per site with the allele frequency of the site in it.

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
Important recent changes:

#We have made -s 1 default (flips all allele frequencies from freq to 1-freq), since this is needed in almost all analyses. If you do not want the frequencies flipped then simply run the program with -s 0
#The output format has been changed to a more R-friendly format (no ":" and no parentheses)
#The option -z has been added so that the sample IDs can be printed in the output (run the program with -z idfilename)
#We have fixed -m 1 so the estimates can no longer be negative

See github for the full change log.

=Bugs/Improvements=
-Print a better error message if files do not exist when using the extract_freq option

Main Page

2018-10-03T12:39:45Z

Thorfinn:

= [http://www.popgen.dk/angsd ANGSD] =
Analysis of Next Generation Sequencing Data

<classdiagram type="dir:LR">
[sequence data]->[genotype;likelihoods]
[genotype;likelihoods]->[genotype;probabilities]
[sequence files|bam files;SOAP files{bg:orange}]->[sequence data]
[glf files|glfv3;soapSNP{bg:orange}]->[genotype;likelihoods]
[genotype prob|beagle output{bg:orange}]->[genotype;probabilities]
</classdiagram>

= [[NgsAdmix]] =
[[File:NgsAdmix.png|thumb|NGSadmix]]
Infer the ancestry proportions from low depth NGS data. The principle is the same as in other software such as FRAPPE and ADMIXTURE; however, ngsAdmix also works when you have uncertainty in your data. This makes it ideal for medium and low depth sequencing data where many genotypes cannot be called without introducing errors or ascertainment bias.

= [[NgsRelate]] =
A software package for inferring relatedness and inbreeding between pairs of individuals from NGS data. This is a new version.

= [[Relate]]=
[[File:relate.png|thumb|left|Relate. Inferred IBD sharing across a chromosome for a sib pair estimated using affy 500k data]]

This method estimates the probability of sharing alleles identical by descent (IBD) across the genome and can also be used for mapping disease loci using distantly related individuals. To accommodate LD, the method needs SNP data for several individuals in order to estimate the allele frequencies and the pairwise LD. The method returns the posterior probabilities of the IBD states across the genome and the overall IBD sharing.

= [[RelateAdmix]] =
A software package for inferring relatedness between pairs of individuals even if the individuals are admixed.

= [[BAMSE]] =

= [[CATS]] =

[[File:catsFig2.png|thumb|left|joint power of a non-symmetric two stage GWA design]]
R package for power estimation for a two-stage genome-wide association design. This is a modification of the code from Skol et al. 2006, Nat Genet, so that the relative risk, case-control ratios and allele frequencies are allowed to vary between stages.

= [[LDadmix]] =
= [[R scripts]] =

Main Page

2018-10-03T12:38:23Z

Thorfinn: /* fastNGSadmix */

= [http://www.popgen.dk/angsd ANGSD] =
Analysis of Next Generation Sequencing Data

<classdiagram type="dir:LR">
[sequence data]->[genotype;likelihoods]
[genotype;likelihoods]->[genotype;probabilities]
[sequence files|bam files;SOAP files{bg:orange}]->[sequence data]
[glf files|glfv3;soapSNP{bg:orange}]->[genotype;likelihoods]
[genotype prob|beagle output{bg:orange}]->[genotype;probabilities]
</classdiagram>

= [[NgsAdmix]] =
[[File:NgsAdmix.png|thumb|NGSadmix]]
Infer the ancestry proportions from low depth NGS data. The principle is the same as in other software such as FRAPPE and ADMIXTURE; however, ngsAdmix also works when you have uncertainty in your data. This makes it ideal for medium and low depth sequencing data where many genotypes cannot be called without introducing errors or ascertainment bias.

= [[NgsRelate]] =
A software package for inferring relatedness and inbreeding between pairs of individuals from NGS data.

= [[Relate]]=
[[File:relate.png|thumb|left|Relate. Inferred IBD sharing across a chromosome for a sib pair estimated using affy 500k data]]

This method estimates the probability of sharing alleles identical by descent (IBD) across the genome and can also be used for mapping disease loci using distantly related individuals. To accommodate LD, the method needs SNP data for several individuals in order to estimate the allele frequencies and the pairwise LD. The method returns the posterior probabilities of the IBD states across the genome and the overall IBD sharing.

= [[RelateAdmix]] =
A software package for inferring relatedness between pairs of individuals even if the individuals are admixed.

= [[BAMSE]] =

= [[CATS]] =

[[File:catsFig2.png|thumb|left|joint power of a non-symmetric two stage GWA design]]
R package for power estimation for a two-stage genome-wide association design. This is a modification of the code from Skol et al. 2006, Nat Genet, so that the relative risk, case-control ratios and allele frequencies are allowed to vary between stages.

= [[LDadmix]] =
= [[R scripts]] =

Main Page

2018-10-03T12:36:55Z

Thorfinn: /* NgsRelate */

= [http://www.popgen.dk/angsd ANGSD] =
Analysis of Next Generation Sequencing Data

<classdiagram type="dir:LR">
[sequence data]->[genotype;likelihoods]
[genotype;likelihoods]->[genotype;probabilities]
[sequence files|bam files;SOAP files{bg:orange}]->[sequence data]
[glf files|glfv3;soapSNP{bg:orange}]->[genotype;likelihoods]
[genotype prob|beagle output{bg:orange}]->[genotype;probabilities]
</classdiagram>

= [[NgsAdmix]] =
[[File:NgsAdmix.png|thumb|NGSadmix]]
Infer the ancestry proportions from low depth NGS data. The principle is the same as in other software such as FRAPPE and ADMIXTURE; however, ngsAdmix also works when you have uncertainty in your data. This makes it ideal for medium and low depth sequencing data where many genotypes cannot be called without introducing errors or ascertainment bias.

= [[NgsRelate]] =
A software package for inferring relatedness and inbreeding between pairs of individuals from NGS data.

= [[fastNGSadmix]] =
Infers the ancestry proportions from NGS data of a single individual, even with low-depth NGS data. It is based on ngsAdmix; however, it uses a reference panel of population-specific frequencies, and works when you have uncertainty in your data. This makes it ideal for quick and easy analyses of medium and low depth sequencing data samples where many genotypes cannot be called without introducing errors or ascertainment bias. There is also a PCA method for doing PCA of an NGS sample, where population structure is taken into account.

= [[Relate]]=
[[File:relate.png|thumb|left|Relate. Inferred IBD sharing across a chromosome for a sib pair estimated using affy 500k data]]

This method estimates the probability of sharing alleles identical by descent (IBD) across the genome and can also be used for mapping disease loci using distantly related individuals. To accommodate LD, the method needs SNP data for several individuals in order to estimate the allele frequencies and the pairwise LD. The method returns the posterior probabilities of the IBD states across the genome and the overall IBD sharing.

= [[RelateAdmix]] =
A software package for inferring relatedness between pairs of individuals even if the individuals are admixed.

= [[BAMSE]] =

= [[CATS]] =

[[File:catsFig2.png|thumb|left|joint power of a non-symmetric two stage GWA design]]
R package for power estimation for a two-stage genome-wide association design. This is a modification of the code from Skol et al. 2006, Nat Genet, so that the relative risk, case-control ratios and allele frequencies are allowed to vary between stages.

= [[LDadmix]] =
= [[R scripts]] =

Main Page

2018-10-03T12:36:16Z

Thorfinn:

= [http://www.popgen.dk/angsd ANGSD] =
Analysis of Next Generation Sequencing Data

<classdiagram type="dir:LR">
[sequence data]->[genotype;likelihoods]
[genotype;likelihoods]->[genotype;probabilities]
[sequence files|bam files;SOAP files{bg:orange}]->[sequence data]
[glf files|glfv3;soapSNP{bg:orange}]->[genotype;likelihoods]
[genotype prob|beagle output{bg:orange}]->[genotype;probabilities]
</classdiagram>

= [[NgsAdmix]] =
[[File:NgsAdmix.png|thumb|NGSadmix]]
Infer the ancestry proportions from low-depth NGS data. The principle is the same as in other software such as FRAPPE and ADMIXTURE; however, NGSadmix also works when you have uncertainty in your data. This makes it ideal for medium- and low-depth sequencing data where many genotypes cannot be called without introducing errors or ascertainment bias.

= [[NgsRelate]] =
A software package for inferring relatedness between pairs of individuals from NGS data.

= [[fastNGSadmix]] =
Infers the ancestry proportions from NGS data of a single individual, even with low-depth NGS data. It is based on NGSadmix; however, it uses a reference panel of population-specific frequencies, and works when you have uncertainty in your data. This makes it ideal for quick and easy analyses of medium- and low-depth sequencing samples where many genotypes cannot be called without introducing errors or ascertainment bias. There is also a method for performing PCA on an NGS sample, where population structure is taken into account.

= [[Relate]]=
[[File:relate.png|thumb|left|Relate. Inferred IBD sharing across a chromosome for a sib pair estimated using affy 500k data]]

This method estimates the probability of sharing alleles identical by descent (IBD) across the genome and can also be used for mapping disease loci using distantly related individuals. To accommodate LD, the method needs SNP data for several individuals in order to estimate the allele frequencies and the pairwise LD. The method returns the posterior probabilities of the IBD states across the genome and the overall IBD sharing.

= [[RelateAdmix]] =
A software package for inferring relatedness between pairs of individuals even if the individuals are admixed.

= [[BAMSE]] =

= [[CATS]] =

[[File:catsFig2.png|thumb|left|joint power of a non-symmetric two stage GWA design]]
R package for power estimation for a two-stage genome-wide association design. This is a modification of the code from Skol et al. 2006, Nat Genet, so that the relative risk, case-control ratios and allele frequencies are allowed to vary between stages.

= [[LDadmix]] =
= [[R scripts]] =

NgsRelate

2018-07-30T09:39:58Z

Thorfinn: /* Run example 0: using only NGS data */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

Method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on GitHub: https://github.com/ANGSD/NgsRelate. On a Linux or Mac system with git, make and g++ installed, NgsRelate can be downloaded and installed as follows:
<pre>

git clone https://github.com/SAMtools/htslib

git clone https://github.com/ANGSD/ngsRelate

cd htslib/;make;cd ../ngsRelate;make HTSSRC=../htslib/
</pre>

=Run examples=
Below are examples of how NgsRelate can be used to estimate relatedness from NGS data. Note that to be able to run all steps of the examples you need to have the programs ANGSD and PLINK installed and you also need to download large data files from both the HapMap3 and 1000 Genomes webpages. Furthermore, the examples take several hours to run in total. They are therefore just meant as illustrations of how NgsRelate can be run. '''If you want to quickly try out NgsRelate, e.g. to check if your installation works, you can download the final input data for NgsRelate used in the very last command in run example 2 here: http://www.popgen.dk/ida/NgsRelateExampleData/web/input/. Using that data you can try out NgsRelate by running that last command, i.e.'''

<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq > newres
</pre>

The output should be a file called newres that contains relatedness estimates for all pairs among the 6 individuals. A copy of this file can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/newres.

== Run example 0: using only BCF files==

== Run example 1: using only NGS data==
Assume we have a file containing paths to 100 BAM/CRAM files, one line per BAM/CRAM file. Then we can use ANGSD to estimate frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for the NgsRelate program as follows:
<pre>
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq
</pre>
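To see what the extraction step above keeps, it can be tried on a toy file first. This is only a sketch: the header and values below are made up to mimic an ANGSD allele frequency file (treat the exact column names as an assumption), and <code>workdir</code> and <code>toy.mafs</code> are illustrative names.

```shell
# Toy demonstration of the frequency extraction on a made-up mafs file.
# Assumption: the header and values only mimic an ANGSD .mafs.gz file.
workdir=$(mktemp -d)
printf 'chromo\tposition\tmajor\tminor\tknownEM\tpK-EM\tnInd\n' >"$workdir/toy.mafs"
printf '1\t1000\tA\tG\t0.250000\t0.0\t100\n' >>"$workdir/toy.mafs"
printf '1\t2000\tC\tT\t0.100000\t0.0\t98\n' >>"$workdir/toy.mafs"
gzip "$workdir/toy.mafs"

# Same pipeline as above: keep column 5 and drop the header line.
# gunzip -c is used here because zcat behaves differently on some systems.
gunzip -c "$workdir/toy.mafs.gz" | cut -f5 | sed 1d >"$workdir/freq"
cat "$workdir/freq"
```

The resulting file has one frequency per line and no header, which is the layout described under "Input file format".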
Once we have these files we can use NgsRelate to estimate relatedness between any pairs of individuals. E.g. if we want to estimate relatedness between the first two individuals (numbered from 0, so 0 and 1) we can do it using the following command:
<pre>
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f" and the numbers of the two individuals after the options "-a" and "-b". If -a and -b are not specified, NgsRelate will loop through all pairs of individuals in the input file.
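If the pairs should be run as separate jobs, the -a and -b options can be driven from a loop. The following is a minimal sketch that only prints the commands rather than running them; it assumes 6 individuals for brevity and the file names from the example above, and the output names <code>pair_A_B.res</code> are hypothetical.

```shell
# Print one ngsrelate command per pair (a < b) instead of running them.
# Assumption: 6 individuals and the input file names used in the example above;
# the pair_*.res output names are made up for illustration.
n=6
cmds=$(
  a=0
  while [ "$a" -lt $((n - 1)) ]; do
    b=$((a + 1))
    while [ "$b" -lt "$n" ]; do
      echo "./ngsrelate -g angsdput.glf.gz -n $n -f freq -a $a -b $b >pair_${a}_${b}.res"
      b=$((b + 1))
    done
    a=$((a + 1))
  done
)
echo "$cmds"
```

With n individuals this prints n(n-1)/2 commands, which can then be inspected or piped to a job runner such as parallel.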

'''NEW''': Note that you can also input a file with the IDs of the individuals (one ID per line) in the same order as in the file 'filelist' used to make the genotype likelihoods. If you do, the output will also contain these IDs and not just the numbers of the samples (one can actually just use that exact file; however, the IDs then tend to be a bit long). This can be done with the optional flag -z followed by the filename.

== Run example 2: using NGS data with population frequencies estimated from genetic data from PLINK files ==
In this example we show how you can estimate relatedness between a number of individuals which you have NGS data from (in bam files) using genetic data from PLINK files for frequency estimation.
Assume the individuals we want to estimate relatedness from are from the population called LWK and assume we have files with genetic data from individuals from LWK as well as other populations in binary PLINK format (e.g. hapmap3_r2_b36_fwd.consensus.qc.polyHg19.*) and a file, LWK.fam, with the IDs of the LWK individuals in this dataset. Then using PLINK we can produce allele frequency information in a format that NgsRelate can use as follows:
<pre>
### extract individuals from LWK from huge binary plink file
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb

### calculate frequencies for this population
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
</pre>
Afterwards we can use ANGSD to calculate genotype likelihoods for the sites for which we have frequency information as follows:
<pre>
### extract the chr,pos,major,minor information about the sites we have frequency info from into a file
### (so we can extract data from these sites from the NGS data files)
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt

### index this file for angsd
./angsd sites index forAngsd.txt

### calculate genotype likelihoods for the six individuals for the sites we have frequency info on based on the bam files
### (assuming the paths to the bam files are listed in the file 'list'):
./angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
### this generates the output files angsdput.glf.gz and angsdput.glf.pos.gz.
</pre>

Finally we can use NgsRelate to estimate relatedness for the six individuals from which we have NGS data in bam files:
<pre>
### extract the frequencies and sync it to the angsd output
./ngsrelate extract_freq_bim angsdput.glf.pos.gz hapmap3Hg19LWK.bim LWKsub.frq >freq

### run ngsrelate
./ngsrelate -g angsdput.glf.gz -n 6 -f freq >newres
</pre>
The final relatedness estimates will then be available in the file called "newres" which can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/.

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run NgsRelate using population frequencies from Europe. We will extract the frequencies from the 1000 Genomes Project VCF files.

<pre>
#Assuming that we have perchr called: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump output in EUR_AF/*.frq
#We only use diallelic sites, we extract CHROM,POS,REF,ALT,EUR_AF tags from the vcf
#We then pulled out the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz

#Then we extract and match the freqs from the reference population with the sites where we had data. The parser expects a header, so make a dummy file

echo "header" |gzip -c >new
cat EUR_AF/ALL.frq.gz >>new
ngsRelate extract_freq new angsdput.glf.pos.gz >myfreq

</pre>

=Output format=
NEW: Example of output of analysis of two samples run without the optional -z:

<pre>
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

And the same analysis run with the optional flag -z followed by name of file with IDs (where the first two IDs are S1 and S42):
<pre>
a b ida idb nSites k0 k1 k2 loglh nIter coverage
0 1 S1 S42 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

Example of output with 6 samples:
<pre>
cat newres
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 2 1121594 0.448790 0.548298 0.002912 -1666189.356801 25 0.808822
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
0 4 1135509 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
0 5 1043719 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
1 2 1118945 0.006249 0.993750 0.000001 -1580989.961356 13 0.806912
1 3 1129152 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
1 4 1132778 1.000000 0.000000 0.000000 -1744055.210286 -1 0.816887
1 5 1041298 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
2 3 1122253 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
2 4 1125729 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
2 5 1035731 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
3 4 1136091 0.566552 0.433054 0.000393 -1743752.158759 36 0.819276
3 5 1046456 0.265831 0.482954 0.251214 -1467343.087558 11 0.754637
4 5 1047977 0.004653 0.995347 0.000000 -1473415.049864 94 0.755734
</pre>

The first two columns identify the two individuals used in the analysis. The third column is the number of sites used in the analysis. The following three columns are the maximum likelihood (ML) estimates of the relatedness coefficients k0, k1 and k2. The seventh column is the log-likelihood of the ML estimate. The eighth column is the number of iterations of the maximization algorithm that was used to find the MLE, and finally the ninth column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (default is 0.05 but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space, and we therefore test the boundary values explicitly and output these if they have the highest likelihood).
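The three k estimates are often collapsed into a single number per pair. The sketch below uses the standard identity r = k1/2 + k2 for the coefficient of relatedness, which is an addition here and not something the program prints, and it also flags the pairs where nIter is -1. The toy file is a three-line copy of the output shown above; <code>workdir</code> and <code>summary</code> are illustrative names.

```shell
# Toy copy of three lines from the 6-sample output above.
workdir=$(mktemp -d)
cat >"$workdir/newres" <<'EOF'
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
1 2 1118945 0.006249 0.993750 0.000001 -1580989.961356 13 0.806912
EOF

# r = k1/2 + k2 (standard identity, not part of the program output);
# nIter == -1 marks estimates taken from the boundary of the parameter space.
summary=$(awk 'NR > 1 {
  r = $5 / 2 + $6
  printf "%s %s %.6f %s\n", $1, $2, r, ($8 == -1 ? "boundary" : "EM")
}' "$workdir/newres")
echo "$summary"
```

The column numbers assume the output was produced without -z; with -z the two ID columns shift everything after them by two.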

For OLD versions of the program (from before June 28 2017):
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column identifies the two individuals used in the analysis. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients k0, k1 and k2. The fifth column is the log-likelihood of the ML estimate. The sixth column is the number of iterations of the maximization algorithm that was used to find the MLE, and finally the seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (default is 0.05 but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space, and we therefore test the boundary values explicitly and output these if they have the highest likelihood).
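Result files from old versions can be brought closer to the newer layout by splitting the "(i,j)" pair into two columns. A minimal sketch on a toy copy of two of the lines above; note that the old format has no nSites column, so the result is not identical to the new format. The names <code>workdir</code> and <code>converted</code> are illustrative.

```shell
# Toy copy of an old-format result file.
workdir=$(mktemp -d)
cat >"$workdir/res" <<'EOF'
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
EOF

# Replace the "Pair" header by "a b" and "(i,j)" by "i j"; other columns are untouched.
converted=$(sed -e '1s/^Pair/a b/' -e 's/^(\([0-9]*\),\([0-9]*\))/\1 \2/' "$workdir/res")
echo "$converted"
```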

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with frequencies for the sites for which there are genotype likelihoods.
The genotype likelihood file needs to contain a line for each site with 3 values for each individual (one log transformed genotype likelihood for each of the 3 possible genotypes encoded as 'double's) and it needs to be in binary format and gz compressed.
The frequency file needs to contain a line per site with the allele frequency of the site in it.
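Because both files are indexed by site, a quick size check can catch a mismatch before running the program. The sketch below assumes 8-byte doubles, so each site takes 3 x 8 bytes per individual in the uncompressed genotype likelihood file. The toy input is filled with zero bytes purely to get the sizes right, and all file and variable names are illustrative.

```shell
# Toy input: 3 sites, 2 individuals -> 3 * 2 * 3 doubles * 8 bytes = 144 bytes.
# Assumption: the zero bytes stand in for real log likelihoods; only sizes matter here.
workdir=$(mktemp -d)
n=2
head -c 144 /dev/zero | gzip -c >"$workdir/toy.glf.gz"
printf '0.25\n0.10\n0.40\n' >"$workdir/freq"

# Sites in the glf file = uncompressed bytes / (individuals * 3 genotypes * 8 bytes).
bytes=$(gunzip -c "$workdir/toy.glf.gz" | wc -c | tr -d ' ')
glf_sites=$((bytes / (n * 3 * 8)))
freq_sites=$(wc -l <"$workdir/freq" | tr -d ' ')
if [ "$glf_sites" -eq "$freq_sites" ]; then
  echo "OK: $glf_sites sites in both files"
else
  echo "Mismatch: $glf_sites sites in glf, $freq_sites in freq" >&2
fi
```

For real data, replace the toy files with the genotype likelihood and frequency files and set n to the value passed to -n.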

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
Important recent changes:

#We have made -s 1 default (flips all allele frequencies from freq to 1-freq), since this is needed in almost all analyses. If you do not want the frequencies flipped then simply run the program with -s 0
#The output format has been changed to a more R-friendly format (no ":" and no parentheses)
#The option -z has been added so one can get the sample IDs printed in the output (if one runs the program with -z idfilename)
#We have fixed -m 1 so the estimates can no longer be negative
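To make the first point concrete, the flip replaces each frequency f by 1-f. The sketch below only illustrates that arithmetic on a toy frequency file; it is not how the program implements the option, and the file and variable names are made up.

```shell
# Toy frequency file, one frequency per line as described under "Input file format".
workdir=$(mktemp -d)
printf '0.25\n0.10\n' >"$workdir/freq"

# Flip every frequency from f to 1 - f.
flipped=$(awk '{ printf "%.6f\n", 1 - $1 }' "$workdir/freq")
echo "$flipped"
```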

See github for the full change log.

=Bugs/Improvements=
-Give a better error message if files do not exist when using the extract_freq option

NgsRelate

2018-07-30T09:39:41Z

Thorfinn: /* Run example 1: using only NGS data */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

Method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on GitHub: https://github.com/ANGSD/NgsRelate. On a Linux or Mac system with git, make and g++ installed, NgsRelate can be downloaded and installed as follows:
<pre>

git clone https://github.com/SAMtools/htslib

git clone https://github.com/ANGSD/ngsRelate

cd htslib/;make;cd ../ngsRelate;make HTSSRC=../htslib/
</pre>

=Run examples=
Below are examples of how NgsRelate can be used to estimate relatedness from NGS data. Note that to be able to run all steps of the examples you need to have the programs ANGSD and PLINK installed and you also need to download large data files from both the HapMap3 and 1000 Genomes webpages. Furthermore, the examples take several hours to run in total. They are therefore just meant as illustrations of how NgsRelate can be run. '''If you want to quickly try out NgsRelate, e.g. to check if your installation works, you can download the final input data for NgsRelate used in the very last command in run example 2 here: http://www.popgen.dk/ida/NgsRelateExampleData/web/input/. Using that data you can try out NgsRelate by running that last command, i.e.'''

<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq > newres
</pre>

The output should be a file called newres that contains relatedness estimates for all pairs among the 6 individuals. A copy of this file can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/newres.

== Run example 0: using only NGS data==

== Run example 1: using only NGS data==
Assume we have a file containing paths to 100 BAM/CRAM files, one line per BAM/CRAM file. Then we can use ANGSD to estimate frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for the NgsRelate program as follows:
<pre>
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq
</pre>
Once we have these files we can use NgsRelate to estimate relatedness between any pairs of individuals. E.g. if we want to estimate relatedness between the first two individuals (numbered from 0, so 0 and 1) we can do it using the following command:
<pre>
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f" and the numbers of the two individuals after the options "-a" and "-b". If -a and -b are not specified, NgsRelate will loop through all pairs of individuals in the input file.

'''NEW''': Note that you can also input a file with the IDs of the individuals (one ID per line) in the same order as in the file 'filelist' used to make the genotype likelihoods. If you do, the output will also contain these IDs and not just the numbers of the samples (one can actually just use that exact file; however, the IDs then tend to be a bit long). This can be done with the optional flag -z followed by the filename.

== Run example 2: using NGS data with population frequencies estimated from genetic data from PLINK files ==
In this example we show how you can estimate relatedness between a number of individuals which you have NGS data from (in bam files) using genetic data from PLINK files for frequency estimation.
Assume the individuals we want to estimate relatedness from are from the population called LWK and assume we have files with genetic data from individuals from LWK as well as other populations in binary PLINK format (e.g. hapmap3_r2_b36_fwd.consensus.qc.polyHg19.*) and a file, LWK.fam, with the IDs of the LWK individuals in this dataset. Then using PLINK we can produce allele frequency information in a format that NgsRelate can use as follows:
<pre>
### extract individuals from LWK from huge binary plink file
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb

### calculate frequencies for this population
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
</pre>
Afterwards we can use ANGSD to calculate genotype likelihoods for the sites for which we have frequency information as follows:
<pre>
### extract the chr,pos,major,minor information about the sites we have frequency info from into a file
### (so we can extract data from these sites from the NGS data files)
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt

### index this file for angsd
./angsd sites index forAngsd.txt

### calculate genotype likelihoods for the six individuals for the sites we have frequency info on based on the bam files
### (assuming the paths to the bam files are listed in the file 'list'):
./angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
### this generates the output files angsdput.glf.gz and angsdput.glf.pos.gz.
</pre>

Finally we can use NgsRelate to estimate relatedness for the six individuals from which we have NGS data in bam files:
<pre>
### extract the frequencies and sync it to the angsd output
./ngsrelate extract_freq_bim angsdput.glf.pos.gz hapmap3Hg19LWK.bim LWKsub.frq >freq

### run ngsrelate
./ngsrelate -g angsdput.glf.gz -n 6 -f freq >newres
</pre>
The final relatedness estimates will then be available in the file called "newres" which can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/.

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run NgsRelate using population frequencies from Europe. We will extract the frequencies from the 1000 Genomes Project VCF files.

<pre>
#Assuming that we have perchr called: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump output in EUR_AF/*.frq
#We only use diallelic sites, we extract CHROM,POS,REF,ALT,EUR_AF tags from the vcf
#We then pulled out the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz

#Then we extract and match the freqs from the reference population with the sites where we had data. The parser expects a header, so make a dummy file

echo "header" |gzip -c >new
cat EUR_AF/ALL.frq.gz >>new
ngsRelate extract_freq new angsdput.glf.pos.gz >myfreq

</pre>

=Output format=
NEW: Example of output of analysis of two samples run without the optional -z:

<pre>
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

And the same analysis run with the optional flag -z followed by name of file with IDs (where the first two IDs are S1 and S42):
<pre>
a b ida idb nSites k0 k1 k2 loglh nIter coverage
0 1 S1 S42 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

Example of output with 6 samples:
<pre>
cat newres
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 2 1121594 0.448790 0.548298 0.002912 -1666189.356801 25 0.808822
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
0 4 1135509 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
0 5 1043719 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
1 2 1118945 0.006249 0.993750 0.000001 -1580989.961356 13 0.806912
1 3 1129152 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
1 4 1132778 1.000000 0.000000 0.000000 -1744055.210286 -1 0.816887
1 5 1041298 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
2 3 1122253 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
2 4 1125729 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
2 5 1035731 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
3 4 1136091 0.566552 0.433054 0.000393 -1743752.158759 36 0.819276
3 5 1046456 0.265831 0.482954 0.251214 -1467343.087558 11 0.754637
4 5 1047977 0.004653 0.995347 0.000000 -1473415.049864 94 0.755734
</pre>

The first two columns identify the two individuals used in the analysis. The third column is the number of sites used in the analysis. The following three columns are the maximum likelihood (ML) estimates of the relatedness coefficients k0, k1 and k2. The seventh column is the log-likelihood of the ML estimate. The eighth column is the number of iterations of the maximization algorithm that was used to find the MLE, and finally the ninth column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (default is 0.05 but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space, and we therefore test the boundary values explicitly and output these if they have the highest likelihood).

For OLD versions of the program (from before June 28 2017):
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column identifies the two individuals used in the analysis. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients k0, k1 and k2. The fifth column is the log-likelihood of the ML estimate. The sixth column is the number of iterations of the maximization algorithm that was used to find the MLE, and finally the seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (default is 0.05 but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space, and we therefore test the boundary values explicitly and output these if they have the highest likelihood).

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with frequencies for the sites for which there are genotype likelihoods.
The genotype likelihood file needs to contain a line for each site with 3 values for each individual (one log transformed genotype likelihood for each of the 3 possible genotypes encoded as 'double's) and it needs to be in binary format and gz compressed.
The frequency file needs to contain a line per site with the allele frequency of the site in it.

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
Important recent changes:

#We have made -s 1 default (flips all allele frequencies from freq to 1-freq), since this is needed in almost all analyses. If you do not want the frequencies flipped then simply run the program with -s 0
#The output format has been changed to a more R-friendly format (no ":" and no parentheses)
#The option -z has been added so one can get the sample IDs printed in the output (if one runs the program with -z idfilename)
#We have fixed -m 1 so the estimates can no longer be negative

See github for the full change log.

=Bugs/Improvements=
-Give a better error message if files do not exist when using the extract_freq option

NgsRelate

2018-07-30T09:39:17Z

Thorfinn: Undo revision 842 by Thorfinn (talk)

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

Method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on GitHub: https://github.com/ANGSD/NgsRelate. On a Linux or Mac system with git and g++ installed, NgsRelate can be downloaded and installed as follows:
<pre>

git clone https://github.com/SAMtools/htslib

git clone https://github.com/ANGSD/ngsRelate

cd htslib/;make;cd ../ngsRelate;make HTSSRC=../htslib/
</pre>

=Run examples=
Below are examples of how NgsRelate can be used to estimate relatedness from NGS data. Note that to run all steps of the examples you need to have the programs ANGSD and PLINK installed, and you also need to download large data files from both the HapMap3 and 1000 Genomes webpages. Furthermore, the examples take several hours to run in total. They are therefore only meant as illustrations of how NgsRelate can be run. '''If you want to quickly try out NgsRelate, e.g. to check if your installation works, you can download the final input data for NgsRelate used in the very last command in run example 2 here: http://www.popgen.dk/ida/NgsRelateExampleData/web/input/. Using that data you can try out NgsRelate by running that last command, i.e.'''

<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq > newres
</pre>

The output should be a file called newres that contains relatedness estimates for all pairs among the 6 individuals. A copy of this file can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/newres.

== Run example 1: using only NGS data==
Assume we have a file containing paths to 100 BAM/CRAM files, one line per BAM/CRAM file. Then we can use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for NgsRelate as follows:
<pre>
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq
</pre>
Once we have these files we can use NgsRelate to estimate relatedness between any pair of individuals. E.g. if we want to estimate relatedness between the first two individuals (numbered from 0, so 0 and 1) we can use the following command:
<pre>
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f" and the number of the two individuals after the options "-a" and "-b" . If -a and -b are not specified NgsRelate will loop through all pairs of individuals in the input file.
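
When -a and -b are omitted, all pairs are analysed, so n individuals give n(n-1)/2 unordered pairs. The small sketch below can be used to estimate how many output rows to expect; the function name <code>n_pairs</code> is illustrative and not part of NgsRelate:

```shell
# Number of unordered pairs among n individuals: n*(n-1)/2
n_pairs() {
    echo $(( $1 * ($1 - 1) / 2 ))
}

n_pairs 6     # 6 individuals
n_pairs 100   # the 100 individuals in this example
```

For 6 individuals this gives 15 pairs, and for the 100 individuals used here it gives 4950 pairs, so running all pairs can take considerably longer than a single -a/-b comparison.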

'''NEW''': Note that you can also input a file with the IDs of the individuals (one ID per line), in the same order as in the file 'filelist' used to make the genotype likelihoods. If you do, the output will contain these IDs and not just the numbers of the samples (one can actually just use that exact file, although the IDs then tend to be a bit long). This is done with the optional flag -z followed by the filename.

== Run example 2: using NGS data with population frequencies estimated from genetic data from PLINK files ==
In this example we show how you can estimate relatedness between a number of individuals which you have NGS data from (in bam files) using genetic data from PLINK files for frequency estimation.
Assume the individuals we want to estimate relatedness from are from the population called LWK and assume we have files with genetic data from individuals from LWK as well as other populations in binary PLINK format (e.g. hapmap3_r2_b36_fwd.consensus.qc.polyHg19.*) and a file, LWK.fam, with the IDs of the LWK individuals in this dataset. Then using PLINK we can produce allele frequency information in a format that NgsRelate can use as follows:
<pre>
### extract individuals from LWK from huge binary plink file
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb

### calculate frequencies for this population
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
</pre>
Afterwards we can use ANGSD to calculate genotype likelihoods for the sites for which we have frequency information as follows:
<pre>
### extract the chr,pos,major,minor information about the sites we have frequency info from into a file
### (so we can extract data from these sites from the NGS data files)
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt

### index this file for angsd
./angsd sites index forAngsd.txt

### calculate genotype likelihoods for the six individuals for the sites we have frequency info on based on the bam files
### (assuming the paths to the bam files are listed in the file 'list'):
./angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
### this generates the output files angsdput.glf.gz and angsdput.glf.pos.gz.
</pre>

Finally we can use NgsRelate to estimate relatedness for the six individuals from which we have NGS data in bam files:
<pre>
### extract the frequencies and sync it to the angsd output
./ngsrelate extract_freq_bim angsdput.glf.pos.gz hapmap3Hg19LWK.bim LWKsub.frq >freq

### run ngsrelate
./ngsrelate -g angsdput.glf.gz -n 6 -f freq >newres
</pre>
The final relatedness estimates will then be available in the file called "newres" which can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/.

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run NgsRelate using population allele frequencies from Europe. We will extract the frequencies from the 1000 Genomes Project VCF files.

<pre>
#Assuming that we have perchr called: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump output in EUR_AF/*.frq
#We only use diallelic sites, we extract CHROM,POS,REF,ALT,EUR_AF tags from the vcf
#We then pulled out the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz

#Then we extract and match the freqs from the reference population with the sites where we had data. The parser expects a header, so make a dummy file

echo "header" |gzip -c >new
cat EUR_AF/ALL.frq.gz >>new
ngsRelate extract_freq new angsdput.glf.pos.gz >myfreq

</pre>

=Output format=
NEW: Example of output of analysis of two samples run without the optional -z:

<pre>
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

And the same analysis run with the optional flag -z followed by name of file with IDs (where the first two IDs are S1 and S42):
<pre>
a b ida idb nSites k0 k1 k2 loglh nIter coverage
0 1 S1 S42 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

Example of output with 6 samples:
<pre>
cat newres
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 2 1121594 0.448790 0.548298 0.002912 -1666189.356801 25 0.808822
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
0 4 1135509 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
0 5 1043719 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
1 2 1118945 0.006249 0.993750 0.000001 -1580989.961356 13 0.806912
1 3 1129152 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
1 4 1132778 1.000000 0.000000 0.000000 -1744055.210286 -1 0.816887
1 5 1041298 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
2 3 1122253 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
2 4 1125729 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
2 5 1035731 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
3 4 1136091 0.566552 0.433054 0.000393 -1743752.158759 36 0.819276
3 5 1046456 0.265831 0.482954 0.251214 -1467343.087558 11 0.754637
4 5 1047977 0.004653 0.995347 0.000000 -1473415.049864 94 0.755734
</pre>

The first two columns identify the two individuals used in the analysis. The third column is the number of sites used in the analysis. The following three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The seventh column is the log-likelihood of the ML estimate. The eighth column is the number of iterations the maximization algorithm used to find the MLE, and the ninth column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (the default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values reached by the EM algorithm. ML methods sometimes have trouble finding the ML estimate when it lies on the boundary of the parameter space, so we test the boundary values explicitly and output them if they have the highest likelihood.
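
The three relatedness coefficients are often summarized as a single kinship coefficient, commonly defined as theta = k1/4 + k2/2. This summary is a standard definition from the relatedness literature and is not a column of the output shown above. A minimal sketch, assuming the 9-column format produced without -z (the function name <code>add_kinship</code> is illustrative):

```shell
# Append a kinship column (theta = k1/4 + k2/2) to NgsRelate output.
# Assumes the 9-column format shown above (run without -z), where
# k1 is column 5 and k2 is column 6.
add_kinship() {
    awk 'NR == 1 { print $0, "theta"; next }
         { printf "%s %.6f\n", $0, $5 / 4 + $6 / 2 }' "$@"
}

# Example with one row of the 6-sample output above
printf '%s\n' \
    'a b nSites k0 k1 k2 loglh nIter coverage' \
    '1 2 1118945 0.006249 0.993750 0.000001 -1580989.961356 13 0.806912' |
    add_kinship
```

If the program was run with -z, the two extra ID columns shift k1 and k2 to columns 7 and 8, so the column numbers in the sketch would need to be adjusted.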

For OLD versions of the program (from before June 28 2017):
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column identifies the pair of individuals used in the analysis. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The fifth column is the log-likelihood of the ML estimate. The sixth column is the number of iterations the maximization algorithm used to find the MLE, and the seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (the default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values reached by the EM algorithm. ML methods sometimes have trouble finding the ML estimate when it lies on the boundary of the parameter space, so we test the boundary values explicitly and output them if they have the highest likelihood.

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with allele frequencies for the sites that have genotype likelihoods.
The genotype likelihood file must contain, for each site, 3 values per individual (one log-transformed genotype likelihood for each of the 3 possible genotypes, encoded as 'double's), and it must be in binary format and gz compressed.
The frequency file must contain one line per site with the allele frequency of that site.

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
Important recent changes:

#We have made -s 1 default (flips all allele frequencies from freq to 1-freq), since this is needed in almost all analyses. If you do not want the frequencies flipped then simply run the program with -s 0
#The output format has been changed to a more R friendly format (no ":" and parentheses)
#The option -z has been added so one can get the sample IDs printed in the output (if one runs the program with -z idfilename)
#We have fixed -m 1 so the estimates can no longer be negative

See github for the full change log.

=Bugs/Improvements=
-Print a better message if the input files do not exist when using the extract_freq option

NgsRelate

2018-07-30T09:38:55Z

Thorfinn: /* Run example 1: using only NGS data */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

Method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on GitHub: https://github.com/ANGSD/NgsRelate. On a Linux or Mac system with git and g++ installed, NgsRelate can be downloaded and installed as follows:
<pre>

git clone https://github.com/SAMtools/htslib

git clone https://github.com/ANGSD/ngsRelate

cd htslib/;make;cd ../ngsRelate;make HTSSRC=../htslib/
</pre>

=Run examples=
Below are examples of how NgsRelate can be used to estimate relatedness from NGS data. Note that to run all steps of the examples you need to have the programs ANGSD and PLINK installed, and you also need to download large data files from both the HapMap3 and 1000 Genomes webpages. Furthermore, the examples take several hours to run in total. They are therefore only meant as illustrations of how NgsRelate can be run. '''If you want to quickly try out NgsRelate, e.g. to check if your installation works, you can download the final input data for NgsRelate used in the very last command in run example 2 here: http://www.popgen.dk/ida/NgsRelateExampleData/web/input/. Using that data you can try out NgsRelate by running that last command, i.e.'''

<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq > newres
</pre>

The output should be a file called newres that contains relatedness estimates for all pairs among the 6 individuals. A copy of this file can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/newres.

== Run example 0: using BCF files==

== Run example 2: using NGS data with population frequencies estimated from genetic data from PLINK files ==
In this example we show how you can estimate relatedness between a number of individuals which you have NGS data from (in bam files) using genetic data from PLINK files for frequency estimation.
Assume the individuals we want to estimate relatedness from are from the population called LWK and assume we have files with genetic data from individuals from LWK as well as other populations in binary PLINK format (e.g. hapmap3_r2_b36_fwd.consensus.qc.polyHg19.*) and a file, LWK.fam, with the IDs of the LWK individuals in this dataset. Then using PLINK we can produce allele frequency information in a format that NgsRelate can use as follows:
<pre>
### extract individuals from LWK from huge binary plink file
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb

### calculate frequencies for this population
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
</pre>
Afterwards we can use ANGSD to calculate genotype likelihoods for the sites for which we have frequency information as follows:
<pre>
### extract the chr,pos,major,minor information about the sites we have frequency info from into a file
### (so we can extract data from these sites from the NGS data files)
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt

### index this file for angsd
./angsd sites index forAngsd.txt

### calculate genotype likelihoods for the six individuals for the sites we have frequency info on based on the bam files
### (assuming the paths to the bam files are listed in the file 'list'):
./angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
### this generates the output files angsdput.glf.gz and angsdput.glf.pos.gz.
</pre>

Finally we can use NgsRelate to estimate relatedness for the six individuals from which we have NGS data in bam files:
<pre>
### extract the frequencies and sync it to the angsd output
./ngsrelate extract_freq_bim angsdput.glf.pos.gz hapmap3Hg19LWK.bim LWKsub.frq >freq

### run ngsrelate
./ngsrelate -g angsdput.glf.gz -n 6 -f freq >newres
</pre>
The final relatedness estimates will then be available in the file called "newres" which can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/.

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run NgsRelate using population allele frequencies from Europe. We will extract the frequencies from the 1000 Genomes Project VCF files.

<pre>
#Assuming that we have perchr called: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump output in EUR_AF/*.frq
#We only use diallelic sites, we extract CHROM,POS,REF,ALT,EUR_AF tags from the vcf
#We then pulled out the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz

#Then we extract and match the freqs from the reference population with the sites where we had data. The parser expects a header, so make a dummy file

echo "header" |gzip -c >new
cat EUR_AF/ALL.frq.gz >>new
ngsRelate extract_freq new angsdput.glf.pos.gz >myfreq

</pre>

=Output format=
NEW: Example of output of analysis of two samples run without the optional -z:

<pre>
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

And the same analysis run with the optional flag -z followed by name of file with IDs (where the first two IDs are S1 and S42):
<pre>
a b ida idb nSites k0 k1 k2 loglh nIter coverage
0 1 S1 S42 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

Example of output with 6 samples:
<pre>
cat newres
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 2 1121594 0.448790 0.548298 0.002912 -1666189.356801 25 0.808822
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
0 4 1135509 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
0 5 1043719 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
1 2 1118945 0.006249 0.993750 0.000001 -1580989.961356 13 0.806912
1 3 1129152 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
1 4 1132778 1.000000 0.000000 0.000000 -1744055.210286 -1 0.816887
1 5 1041298 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
2 3 1122253 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
2 4 1125729 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
2 5 1035731 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
3 4 1136091 0.566552 0.433054 0.000393 -1743752.158759 36 0.819276
3 5 1046456 0.265831 0.482954 0.251214 -1467343.087558 11 0.754637
4 5 1047977 0.004653 0.995347 0.000000 -1473415.049864 94 0.755734
</pre>

The first two columns identify the two individuals used in the analysis. The third column is the number of sites used in the analysis. The following three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The seventh column is the log-likelihood of the ML estimate. The eighth column is the number of iterations the maximization algorithm used to find the MLE, and the ninth column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (the default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values reached by the EM algorithm. ML methods sometimes have trouble finding the ML estimate when it lies on the boundary of the parameter space, so we test the boundary values explicitly and output them if they have the highest likelihood.
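
Since nIter = -1 marks estimates taken from the explicit boundary test, it can be useful to list those pairs separately. A minimal sketch, assuming the 9-column format produced without -z (the function name <code>boundary_pairs</code> is illustrative):

```shell
# Print the pairs whose estimate came from the boundary test (nIter == -1).
# Assumes the 9-column format shown above (run without -z): nIter is column 8.
boundary_pairs() {
    awk 'NR > 1 && $8 == -1 { print $1, $2 }' "$@"
}

# Example with two rows of the 6-sample output above
printf '%s\n' \
    'a b nSites k0 k1 k2 loglh nIter coverage' \
    '0 2 1121594 0.448790 0.548298 0.002912 -1666189.356801 25 0.808822' \
    '0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266' |
    boundary_pairs
```

Applied to a results file such as newres, the function would be called as <code>boundary_pairs newres</code>.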

For OLD versions of the program (from before June 28 2017):
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column identifies the pair of individuals used in the analysis. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The fifth column is the log-likelihood of the ML estimate. The sixth column is the number of iterations the maximization algorithm used to find the MLE, and the seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (the default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values reached by the EM algorithm. ML methods sometimes have trouble finding the ML estimate when it lies on the boundary of the parameter space, so we test the boundary values explicitly and output them if they have the highest likelihood.
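
Results written by an old version can be brought closer to the newer layout by splitting the "(a,b)" pair into two plain columns. The sketch below is only an illustration (the function name <code>old_to_new</code> is not part of NgsRelate), and the converted file still lacks the nSites column of the new format:

```shell
# Convert old-style output ("(0,1)" in a single Pair column) into
# whitespace-separated a and b columns. The header "Pair" becomes "a b".
old_to_new() {
    sed -E -e '1s/^Pair/a b/' -e 's/^\(([0-9]+),([0-9]+)\)/\1 \2/' "$@"
}

# Example with one row of the old-format output above
printf '%s\n' \
    'Pair k0 k1 k2 loglh nIter coverage' \
    '(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637' |
    old_to_new
```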

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with allele frequencies for the sites that have genotype likelihoods.
The genotype likelihood file must contain, for each site, 3 values per individual (one log-transformed genotype likelihood for each of the 3 possible genotypes, encoded as 'double's), and it must be in binary format and gz compressed.
The frequency file must contain one line per site with the allele frequency of that site.

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
Important recent changes:

#We have made -s 1 default (flips all allele frequencies from freq to 1-freq), since this is needed in almost all analyses. If you do not want the frequencies flipped then simply run the program with -s 0
#The output format has been changed to a more R friendly format (no ":" and parentheses)
#The option -z has been added so one can get the sample IDs printed in the output (if one runs the program with -z idfilename)
#We have fixed -m 1 so the estimates can no longer be negative

See github for the full change log.

=Bugs/Improvements=
-Print a better message if the input files do not exist when using the extract_freq option

NgsRelate

2018-07-30T09:35:53Z

Thorfinn: /* How to download and install */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

Method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on GitHub: https://github.com/ANGSD/NgsRelate. On a Linux or Mac system with git and g++ installed, NgsRelate can be downloaded and installed as follows:
<pre>

git clone https://github.com/SAMtools/htslib

git clone https://github.com/ANGSD/ngsRelate

cd htslib/;make;cd ../ngsRelate;make HTSSRC=../htslib/
</pre>

=Run examples=
Below are examples of how NgsRelate can be used to estimate relatedness from NGS data. Note that to run all steps of the examples you need to have the programs ANGSD and PLINK installed, and you also need to download large data files from both the HapMap3 and 1000 Genomes webpages. Furthermore, the examples take several hours to run in total. They are therefore only meant as illustrations of how NgsRelate can be run. '''If you want to quickly try out NgsRelate, e.g. to check if your installation works, you can download the final input data for NgsRelate used in the very last command in run example 2 here: http://www.popgen.dk/ida/NgsRelateExampleData/web/input/. Using that data you can try out NgsRelate by running that last command, i.e.'''

<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq > newres
</pre>

The output should be a file called newres that contains relatedness estimates for all pairs among the 6 individuals. A copy of this file can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/newres.

== Run example 1: using only NGS data==
Assume we have a file containing paths to 100 BAM/CRAM files, one line per BAM/CRAM file. Then we can use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for NgsRelate as follows:
<pre>
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq
</pre>
Once we have these files we can use NgsRelate to estimate relatedness between any pair of individuals. E.g. if we want to estimate relatedness between the first two individuals (numbered from 0, so 0 and 1) we can use the following command:
<pre>
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f" and the number of the two individuals after the options "-a" and "-b" . If -a and -b are not specified NgsRelate will loop through all pairs of individuals in the input file.

'''NEW''': Note that you can also input a file with the IDs of the individuals (one ID per line), in the same order as in the file 'filelist' used to make the genotype likelihoods. If you do, the output will contain these IDs and not just the numbers of the samples (one can actually just use that exact file, although the IDs then tend to be a bit long). This is done with the optional flag -z followed by the filename.
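
Because full BAM/CRAM paths make long IDs, one option is to derive shorter IDs from the file list itself. The sketch below assumes that each sample's file name, without directory and without a .bam or .cram suffix, is a usable ID; the function name <code>make_ids</code> and the example paths are hypothetical:

```shell
# Turn a list of BAM/CRAM paths (one per line) into short sample IDs by
# dropping the directory part and the .bam/.cram suffix.
# The order of the lines is preserved, as -z requires.
make_ids() {
    sed -E -e 's#.*/##' -e 's/\.(bam|cram)$//' "$@"
}

# Example with two hypothetical paths
printf '%s\n' '/data/bams/S1.bam' '/data/bams/S42.cram' | make_ids
```

With the file list from this example, the IDs could be written with <code>make_ids filelist > ids</code> and then passed to NgsRelate by adding <code>-z ids</code> to the command above.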

== Run example 2: using NGS data with population frequencies estimated from genetic data from PLINK files ==
In this example we show how you can estimate relatedness between a number of individuals which you have NGS data from (in bam files) using genetic data from PLINK files for frequency estimation.
Assume the individuals we want to estimate relatedness from are from the population called LWK and assume we have files with genetic data from individuals from LWK as well as other populations in binary PLINK format (e.g. hapmap3_r2_b36_fwd.consensus.qc.polyHg19.*) and a file, LWK.fam, with the IDs of the LWK individuals in this dataset. Then using PLINK we can produce allele frequency information in a format that NgsRelate can use as follows:
<pre>
### extract individuals from LWK from huge binary plink file
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb

### calculate frequencies for this population
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
</pre>
Afterwards we can use ANGSD to calculate genotype likelihoods for the sites for which we have frequency information as follows:
<pre>
### extract the chr,pos,major,minor information about the sites we have frequency info from into a file
### (so we can extract data from these sites from the NGS data files)
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt

### index this file for angsd
./angsd sites index forAngsd.txt

### calculate genotype likelihoods for the six individuals for the sites we have frequency info on based on the bam files
### (assuming the paths to the bam files are listed in the file 'list'):
./angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
### this generates the output files angsdput.glf.gz and angsdput.glf.pos.gz.
</pre>

Finally we can use NgsRelate to estimate relatedness for the six individuals from which we have NGS data in bam files:
<pre>
### extract the frequencies and sync it to the angsd output
./ngsrelate extract_freq_bim angsdput.glf.pos.gz hapmap3Hg19LWK.bim LWKsub.frq >freq

### run ngsrelate
./ngsrelate -g angsdput.glf.gz -n 6 -f freq >newres
</pre>
The final relatedness estimates will then be available in the file called "newres" which can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/.

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run NgsRelate using population frequencies from Europe. We will extract the frequencies from the 1000 Genomes Project VCF files.

<pre>
#Assuming that we have per-chromosome files called: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump output in EUR_AF/*.frq
#We only use diallelic sites, we extract CHROM,POS,REF,ALT,EUR_AF tags from the vcf
#We then pulled out the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz

#Then we extract and match the freqs from the reference population with the sites where we had data. The parser expects a header, so make a dummy file

echo "header" |gzip -c >new
cat EUR_AF/ALL.frq.gz >>new
ngsRelate extract_freq new angsdput.glf.pos.gz >myfreq

</pre>

=Output format=
NEW: Example of output of analysis of two samples run without the optional -z:

<pre>
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

And the same analysis run with the optional flag -z followed by name of file with IDs (where the first two IDs are S1 and S42):
<pre>
a b ida idb nSites k0 k1 k2 loglh nIter coverage
0 1 S1 S42 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

Example of output with 6 samples:
<pre>
cat newres
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 2 1121594 0.448790 0.548298 0.002912 -1666189.356801 25 0.808822
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
0 4 1135509 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
0 5 1043719 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
1 2 1118945 0.006249 0.993750 0.000001 -1580989.961356 13 0.806912
1 3 1129152 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
1 4 1132778 1.000000 0.000000 0.000000 -1744055.210286 -1 0.816887
1 5 1041298 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
2 3 1122253 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
2 4 1125729 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
2 5 1035731 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
3 4 1136091 0.566552 0.433054 0.000393 -1743752.158759 36 0.819276
3 5 1046456 0.265831 0.482954 0.251214 -1467343.087558 11 0.754637
4 5 1047977 0.004653 0.995347 0.000000 -1473415.049864 94 0.755734
</pre>

The first two columns give the indices of the two individuals that were analysed. The third column is the number of sites used in the analysis. The following three columns are the maximum likelihood (ML) estimates of the relatedness coefficients k0, k1 and k2. The seventh column is the log-likelihood of the ML estimate. The eighth column is the number of iterations the maximization algorithm used to find the MLE, and the ninth column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (the default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values reached by the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it lies on the boundary of the parameter space, so the boundary values are tested explicitly and reported if they have the highest likelihood).
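The k columns described above can be summarised per pair. The following is a minimal sketch, not part of NgsRelate itself: it uses the standard identities r = k1/2 + k2 (coefficient of relatedness) and theta = k1/4 + k2/2 (kinship coefficient), and it assumes the column layout of the output without -z (k1 in column 5, k2 in column 6). The file name newres_demo.txt and the function name summarise_relatedness are made up for the illustration.

```shell
# Toy copy of three rows from the 6-sample output shown above.
cat > newres_demo.txt <<'EOF'
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
3 5 1046456 0.265831 0.482954 0.251214 -1467343.087558 11 0.754637
EOF

# Print r = k1/2 + k2 and theta = k1/4 + k2/2 for every pair.
# Assumption: output produced without -z, so k1 is column 5 and k2 is column 6.
summarise_relatedness() {
  awk 'NR == 1 { print $1, $2, "r", "theta"; next }
       { printf "%s %s %.6f %.6f\n", $1, $2, $5 / 2 + $6, $5 / 4 + $6 / 2 }' "$1"
}

summarise_relatedness newres_demo.txt
```

If the program was run with -z, the two ID columns shift everything two places to the right, so the column numbers in the awk script would need to be adjusted accordingly.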

For OLD versions of the program (from before June 28 2017):
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column shows which pair of individuals was analysed. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients k0, k1 and k2. The fifth column is the log-likelihood of the ML estimate. The sixth column is the number of iterations the maximization algorithm used to find the MLE, and the seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (the default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values reached by the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it lies on the boundary of the parameter space, so the boundary values are tested explicitly and reported if they have the highest likelihood).
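Results in the old layout can be brought closer to the newer column layout with a small text transformation. This is only a sketch under stated assumptions: the file name res_old_demo.txt and the function old_to_new_layout are hypothetical, and the old output has no nSites column, so that column cannot be recovered.

```shell
# Toy copy of two rows from the old-format output shown above.
cat > res_old_demo.txt <<'EOF'
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
EOF

# Replace the "Pair" header with "a b" and split "(i,j)" into two columns.
old_to_new_layout() {
  sed -e '1s/^Pair/a b/' \
      -e '2,$s/^(\([0-9]*\),\([0-9]*\))/\1 \2/' "$1"
}

old_to_new_layout res_old_demo.txt
```

The result can then be read with the same downstream scripts as the newer output, as long as they do not rely on the nSites column.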

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with allele frequencies for the sites that have genotype likelihoods.
The genotype likelihood file needs to contain a line for each site with 3 values for each individual (one log transformed genotype likelihood for each of the 3 possible genotypes encoded as 'double's) and it needs to be in binary format and gz compressed.
The frequency file needs to contain a line per site with the allele frequency of the site in it.
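Because the two input files must describe the same sites, a quick consistency check can be useful before running the program. The sketch below assumes that each site occupies exactly 3 doubles of 8 bytes per individual in the uncompressed genotype likelihood file, as described above; the function name glf_sites and the demo file names are made up, and the demo input is synthetic (all zero bytes), not real data.

```shell
# Number of sites implied by the size of a gzipped binary GLF file.
# $1 = glf.gz file, $2 = number of individuals; assumes 8-byte doubles.
glf_sites() {
  bytes=$(gunzip -c "$1" | wc -c)
  echo $(( bytes / ($2 * 3 * 8) ))
}

# Synthetic stand-in: 2 sites x 6 individuals x 3 doubles x 8 bytes = 288 bytes.
head -c 288 /dev/zero | gzip -c > demo.glf.gz
printf '0.1\n0.2\n' > demo.freq

n_glf=$(glf_sites demo.glf.gz 6)
n_freq=$(( $(wc -l < demo.freq) ))

if [ "$n_glf" -eq "$n_freq" ]; then
  echo "OK: $n_glf sites in both files"
else
  echo "Mismatch: $n_glf sites in GLF, $n_freq lines in frequency file"
fi
```

With real data the same function would be called on angsdput.glf.gz with the value passed to -n, and the count compared with the number of lines in the frequency file.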

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
Important recent changes:

#We have made -s 1 default (flips all allele frequencies from freq to 1-freq), since this is needed in almost all analyses. If you do not want the frequencies flipped then simply run the program with -s 0
#The output format has been changed to a more R friendly format (no ":" and no parentheses)
#The option -z has been added so one can get the sample IDs printed in the output (if one runs the program with -z idfilename)
#We have fixed -m 1 so the estimates can no longer be negative

See github for the full change log.
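To make the effect of the flip concrete, the transformation described for -s 1 (freq becomes 1-freq) can be reproduced on a toy file. This is only an illustration of the arithmetic, with made-up file names and values; it is not meant as a preprocessing step, since applying it in addition to the default -s 1 would flip the frequencies twice.

```shell
# Toy frequency file: one allele frequency per line (values are invented).
printf '0.10\n0.25\n0.95\n' > freq_demo.txt

# Flip every frequency from freq to 1 - freq.
awk '{ printf "%.6f\n", 1 - $1 }' freq_demo.txt > freq_flipped_demo.txt

cat freq_flipped_demo.txt
```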

=Bugs/Improvements=
-Give a better error message if files don't exist when using the extract_freq option

NgsRelate

2018-01-12T10:30:16Z

Thorfinn: /* How to download and install */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

Method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on github: https://github.com/ANGSD/NgsRelate. On a linux or mac system with curl and g++ installed NgsRelate can be downloaded and installed as follows:
<pre>
curl https://raw.githubusercontent.com/ANGSD/NgsRelate/76b03fad2da1db3148759ccdce1ab6508c2e00bd/ngsRelate.cpp >ngsRelate.cpp
g++ ngsRelate.cpp -O3 -lz -o ngsRelate
</pre>

=Run examples=
Below are three examples of how NgsRelate can be used to estimate relatedness from NGS data. Note that to run all steps of the examples you need to have the programs ANGSD and PLINK installed, and you also need to download large data files from both the HapMap3 and 1000 Genomes webpages. Furthermore, the examples take several hours to run in total. They are therefore only meant as illustrations of how NgsRelate can be run. '''If you want to quickly try out NgsRelate, e.g. to check if your installation works, you can download the final input data for NgsRelate used in the very last command in run example 2 here: http://www.popgen.dk/ida/NgsRelateExampleData/web/input/. Using that data you can try out NgsRelate by running that last command, i.e.'''

<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq > newres
</pre>

The output should be a file called newres that contains relatedness estimates for all pairs among the 6 individuals. A copy of this file can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/newres.

== Run example 1: using only NGS data==
Assume we have a file containing paths to 100 BAM/CRAM files, one line per BAM/CRAM file. Then we can use ANGSD to estimate frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for the NgsRelate program as follows:
<pre>
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq
</pre>
Once we have these files we can use NgsRelate to estimate relatedness between any pairs of individuals. E.g. if we want to estimate relatedness between the first two individuals (numbered from 0, so 0 and 1) we can do it using the following command:
<pre>
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f", and the numbers of the two individuals after the options "-a" and "-b". If -a and -b are not specified, NgsRelate will loop through all pairs of individuals in the input file.

'''NEW''': Note that you can also input a file with the IDs of the individuals (one ID per line), in the same order as in the file 'filelist' used to make the genotype likelihoods. If you do, the output will contain these IDs and not just the numbers of the samples (you can actually use that exact file, but the IDs then tend to be a bit long). This is done with the optional flag -z followed by the filename.
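One way to get shorter IDs is to derive them from the paths in the file list. The following is a sketch with hypothetical paths and file names (filelist_demo.txt, ids_demo.txt); it assumes every path ends in .bam and that the file name without the extension is an acceptable ID.

```shell
# Hypothetical file list in the same order as the one given to ANGSD with -b.
printf '/data/bams/S1.bam\n/data/bams/S42.bam\n' > filelist_demo.txt

# Strip the directory part and the .bam extension, keeping the line order.
sed -e 's#.*/##' -e 's#\.bam$##' filelist_demo.txt > ids_demo.txt

cat ids_demo.txt
```

The resulting file has one ID per line in the original order and could then be passed to the program after the -z flag.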

== Run example 2: using NGS data with population frequencies estimated from genetic data from PLINK files ==
In this example we show how you can estimate relatedness between a number of individuals which you have NGS data from (in bam files) using genetic data from PLINK files for frequency estimation.
Assume the individuals we want to estimate relatedness from are from the population called LWK and assume we have files with genetic data from individuals from LWK as well as other populations in binary PLINK format (e.g. hapmap3_r2_b36_fwd.consensus.qc.polyHg19.*) and a file, LWK.fam, with the IDs of the LWK individuals in this dataset. Then using PLINK we can produce allele frequency information in a format that NgsRelate can use as follows:
<pre>
### extract individuals from LWK from huge binary plink file
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb

### calculate frequencies for this population
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
</pre>
Afterwards we can use ANGSD to calculate genotype likelihoods for the sites for which we have frequency information, as follows:
<pre>
### extract the chr,pos,major,minor information about the sites we have frequency info from into a file
### (so we can extract data from these sites from the NGS data files)
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt

### index this file for angsd
./angsd sites index forAngsd.txt

### calculate genotype likelihoods for the six individuals for the sites we have frequency info on based on the bam files
### (assuming the paths to the bam files are listed in the file 'list'):
./angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
### this generates the output files angsdput.glf.gz and angsdput.glf.pos.gz.
</pre>

Finally we can use NgsRelate to estimate relatedness for the six individuals from which we have NGS data in bam files:
<pre>
### extract the frequencies and sync it to the angsd output
./ngsrelate extract_freq_bim angsdput.glf.pos.gz hapmap3Hg19LWK.bim LWKsub.frq >freq

### run ngsrelate
./ngsrelate -g angsdput.glf.gz -n 6 -f freq >newres
</pre>
The final relatedness estimates will then be available in the file called "newres" which can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/.

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run NgsRelate using population frequencies from Europe. We will extract the frequencies from the 1000 Genomes Project VCF files.

<pre>
#Assuming that we have per-chromosome files called: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump output in EUR_AF/*.frq
#We only use diallelic sites, we extract CHROM,POS,REF,ALT,EUR_AF tags from the vcf
#We then pulled out the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz

#Then we extract and match the freqs from the reference population with the sites where we had data. The parser expects a header, so make a dummy file

echo "header" |gzip -c >new
cat EUR_AF/ALL.frq.gz >>new
ngsRelate extract_freq new angsdput.glf.pos.gz >myfreq

</pre>

=Output format=
NEW: Example of output of analysis of two samples run without the optional -z:

<pre>
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

And the same analysis run with the optional flag -z followed by name of file with IDs (where the first two IDs are S1 and S42):
<pre>
a b ida idb nSites k0 k1 k2 loglh nIter coverage
0 1 S1 S42 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

Example of output with 6 samples:
<pre>
cat newres
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 2 1121594 0.448790 0.548298 0.002912 -1666189.356801 25 0.808822
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
0 4 1135509 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
0 5 1043719 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
1 2 1118945 0.006249 0.993750 0.000001 -1580989.961356 13 0.806912
1 3 1129152 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
1 4 1132778 1.000000 0.000000 0.000000 -1744055.210286 -1 0.816887
1 5 1041298 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
2 3 1122253 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
2 4 1125729 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
2 5 1035731 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
3 4 1136091 0.566552 0.433054 0.000393 -1743752.158759 36 0.819276
3 5 1046456 0.265831 0.482954 0.251214 -1467343.087558 11 0.754637
4 5 1047977 0.004653 0.995347 0.000000 -1473415.049864 94 0.755734
</pre>

The first two columns give the indices of the two individuals that were analysed. The third column is the number of sites used in the analysis. The following three columns are the maximum likelihood (ML) estimates of the relatedness coefficients k0, k1 and k2. The seventh column is the log-likelihood of the ML estimate. The eighth column is the number of iterations the maximization algorithm used to find the MLE, and the ninth column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (the default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values reached by the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it lies on the boundary of the parameter space, so the boundary values are tested explicitly and reported if they have the highest likelihood).
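Pairs for which a boundary value was reported can be picked out by looking for nIter equal to -1. A minimal sketch, assuming the output layout without -z (nIter in column 8); the file name newres_boundary_demo.txt and the function boundary_pairs are invented for the example.

```shell
# Toy copy of three rows from the 6-sample output shown above.
cat > newres_boundary_demo.txt <<'EOF'
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
1 4 1132778 1.000000 0.000000 0.000000 -1744055.210286 -1 0.816887
EOF

# Print the indices of pairs whose estimate was taken from the boundary check.
# Assumption: output produced without -z, so nIter is column 8.
boundary_pairs() {
  awk 'NR > 1 && $8 == -1 { print $1, $2 }' "$1"
}

boundary_pairs newres_boundary_demo.txt
```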

For OLD versions of the program (from before June 28 2017):
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column shows which pair of individuals was analysed. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients k0, k1 and k2. The fifth column is the log-likelihood of the ML estimate. The sixth column is the number of iterations the maximization algorithm used to find the MLE, and the seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (the default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values reached by the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it lies on the boundary of the parameter space, so the boundary values are tested explicitly and reported if they have the highest likelihood).

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with allele frequencies for the sites that have genotype likelihoods.
The genotype likelihood file needs to contain a line for each site with 3 values for each individual (one log transformed genotype likelihood for each of the 3 possible genotypes encoded as 'double's) and it needs to be in binary format and gz compressed.
The frequency file needs to contain a line per site with the allele frequency of the site in it.

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
Important recent changes:

#We have made -s 1 default (flips all allele frequencies from freq to 1-freq), since this is needed in almost all analyses. If you do not want the frequencies flipped then simply run the program with -s 0
#The output format has been changed to a more R friendly format (no ":" and no parentheses)
#The option -z has been added so one can get the sample IDs printed in the output (if one runs the program with -z idfilename)
#We have fixed -m 1 so the estimates can no longer be negative

See github for the full change log.

=Bugs/Improvements=
-Give a better error message if files don't exist when using the extract_freq option

NgsRelate

2018-01-08T11:54:04Z

Thorfinn: /* How to download and install */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

Method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on github: https://github.com/ANGSD/NgsRelate. On a linux or mac system with curl and g++ installed NgsRelate can be downloaded and installed as follows:
<pre>
curl https://raw.githubusercontent.com/ANGSD/NgsRelate/76b03fad2da1db3148759ccdce1ab6508c2e00bd/ngsRelate.cpp >ngsRelate.cpp
g++ ngsRelate.cpp -O3 -lz -o ngsRelate
</pre>

=Run examples=
Below are three examples of how NgsRelate can be used to estimate relatedness from NGS data. Note that to run all steps of the examples you need to have the programs ANGSD and PLINK installed, and you also need to download large data files from both the HapMap3 and 1000 Genomes webpages. Furthermore, the examples take several hours to run in total. They are therefore only meant as illustrations of how NgsRelate can be run. '''If you want to quickly try out NgsRelate, e.g. to check if your installation works, you can download the final input data for NgsRelate used in the very last command in run example 2 here: http://www.popgen.dk/ida/NgsRelateExampleData/web/input/. Using that data you can try out NgsRelate by running that last command, i.e.'''

<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq > newres
</pre>

The output should be a file called newres that contains relatedness estimates for all pairs among the 6 individuals. A copy of this file can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/newres.

== Run example 1: using only NGS data==
Assume we have a file containing paths to 100 BAM/CRAM files, one line per BAM/CRAM file. Then we can use ANGSD to estimate frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for the NgsRelate program as follows:
<pre>
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq
</pre>
Once we have these files we can use NgsRelate to estimate relatedness between any pairs of individuals. E.g. if we want to estimate relatedness between the first two individuals (numbered from 0, so 0 and 1) we can do it using the following command:
<pre>
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f", and the numbers of the two individuals after the options "-a" and "-b". If -a and -b are not specified, NgsRelate will loop through all pairs of individuals in the input file.
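To see which values of -a and -b correspond to "all pairs", the pair indices can be enumerated directly. This sketch assumes each unordered pair is visited once with the smaller index first, which is what the 6-sample example output on this page suggests; the function name list_pairs is made up and nothing here calls NgsRelate.

```shell
# Print every unordered pair (a, b) with 0 <= a < b < n, one pair per line.
list_pairs() {
  n=$1
  a=0
  while [ "$a" -lt "$n" ]; do
    b=$((a + 1))
    while [ "$b" -lt "$n" ]; do
      echo "$a $b"
      b=$((b + 1))
    done
    a=$((a + 1))
  done
}

# For 6 individuals this gives 6*5/2 = 15 pairs, from "0 1" to "4 5".
list_pairs 6
```

For the 100 individuals in this example the same enumeration gives 100*99/2 = 4950 pairs, which is why restricting the run with -a and -b can save time when only a few pairs are of interest.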

'''NEW''': Note that you can also input a file with the IDs of the individuals (one ID per line), in the same order as in the file 'filelist' used to make the genotype likelihoods. If you do, the output will contain these IDs and not just the numbers of the samples (you can actually use that exact file, but the IDs then tend to be a bit long). This is done with the optional flag -z followed by the filename.

== Run example 2: using NGS data with population frequencies estimated from genetic data from PLINK files ==
In this example we show how you can estimate relatedness between a number of individuals which you have NGS data from (in bam files) using genetic data from PLINK files for frequency estimation.
Assume the individuals we want to estimate relatedness from are from the population called LWK and assume we have files with genetic data from individuals from LWK as well as other populations in binary PLINK format (e.g. hapmap3_r2_b36_fwd.consensus.qc.polyHg19.*) and a file, LWK.fam, with the IDs of the LWK individuals in this dataset. Then using PLINK we can produce allele frequency information in a format that NgsRelate can use as follows:
<pre>
### extract individuals from LWK from huge binary plink file
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb

### calculate frequencies for this population
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
</pre>
Afterwards we can use ANGSD to calculate genotype likelihoods for the sites for which we have frequency information, as follows:
<pre>
### extract the chr,pos,major,minor information about the sites we have frequency info from into a file
### (so we can extract data from these sites from the NGS data files)
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt

### index this file for angsd
./angsd sites index forAngsd.txt

### calculate genotype likelihoods for the six individuals for the sites we have frequency info on based on the bam files
### (assuming the paths to the bam files are listed in the file 'list'):
./angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
### this generates the output files angsdput.glf.gz and angsdput.glf.pos.gz.
</pre>

Finally we can use NgsRelate to estimate relatedness for the six individuals from which we have NGS data in bam files:
<pre>
### extract the frequencies and sync it to the angsd output
./ngsrelate extract_freq_bim angsdput.glf.pos.gz hapmap3Hg19LWK.bim LWKsub.frq >freq

### run ngsrelate
./ngsrelate -g angsdput.glf.gz -n 6 -f freq >newres
</pre>
The final relatedness estimates will then be available in the file called "newres" which can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/.

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run NgsRelate using population frequencies from Europe. We will extract the frequencies from the 1000 Genomes Project VCF files.

<pre>
#Assuming that we have per-chromosome files called: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump output in EUR_AF/*.frq
#We only use diallelic sites, we extract CHROM,POS,REF,ALT,EUR_AF tags from the vcf
#We then pulled out the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz

#Then we extract and match the freqs from the reference population with the sites where we had data. The parser expects a header, so make a dummy file

echo "header" |gzip -c >new
cat EUR_AF/ALL.frq.gz >>new
ngsRelate extract_freq new angsdput.glf.pos.gz >myfreq

</pre>

=Output format=
NEW: Example of output of analysis of two samples run without the optional -z:

<pre>
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

And the same analysis run with the optional flag -z followed by name of file with IDs (where the first two IDs are S1 and S42):
<pre>
a b ida idb nSites k0 k1 k2 loglh nIter coverage
0 1 S1 S42 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
</pre>

Example of output with 6 samples:
<pre>
cat newres
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 2 1121594 0.448790 0.548298 0.002912 -1666189.356801 25 0.808822
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
0 4 1135509 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
0 5 1043719 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
1 2 1118945 0.006249 0.993750 0.000001 -1580989.961356 13 0.806912
1 3 1129152 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
1 4 1132778 1.000000 0.000000 0.000000 -1744055.210286 -1 0.816887
1 5 1041298 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
2 3 1122253 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
2 4 1125729 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
2 5 1035731 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
3 4 1136091 0.566552 0.433054 0.000393 -1743752.158759 36 0.819276
3 5 1046456 0.265831 0.482954 0.251214 -1467343.087558 11 0.754637
4 5 1047977 0.004653 0.995347 0.000000 -1473415.049864 94 0.755734
</pre>

The first two columns give the indices of the two individuals used in the analysis. The third column gives the number of sites used in the analysis. The following three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The seventh column is the log-likelihood of the ML estimate. The eighth column is the number of iterations of the maximization algorithm used to find the MLE, and the ninth column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values reached by the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it lies on the boundary of the parameter space, so we test the boundary values explicitly and output them if they have the highest likelihood).
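
The three coefficients can be condensed into a single number per pair. The following is a minimal sketch, assuming the new tabular format shown above and assuming that k0, k1 and k2 are the probabilities of sharing 0, 1 and 2 alleles identical by descent, in which case the kinship coefficient is theta = k1/4 + k2/2. The helper name add_theta and the file name example.res are hypothetical and not part of NgsRelate.

```shell
# Hypothetical helper, not part of NgsRelate: append theta = k1/4 + k2/2
# to every row of a result file in the new format (k1 = column 5, k2 = column 6).
add_theta() {
    awk 'NR == 1 { print $0, "theta"; next }
         { printf "%s %.4f\n", $0, $5 / 4 + $6 / 2 }' "$1"
}

# Small example file with three rows copied from the 6-sample output above
cat > example.res <<'EOF'
a b nSites k0 k1 k2 loglh nIter coverage
0 1 1128677 0.673213 0.326774 0.000013 -1710940.769938 19 0.813930
0 3 1131917 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
1 2 1118945 0.006249 0.993750 0.000001 -1580989.961356 13 0.806912
EOF

add_theta example.res
```

Under these assumptions a theta close to 0.25 (pair 1 and 2) would be consistent with a parent-offspring pair, while pair 0 and 3 gets theta 0.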

For OLD versions of the program (from before June 28 2017):
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column gives the indices of the two individuals used in the analysis. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The fifth column is the log-likelihood of the ML estimate. The sixth column is the number of iterations of the maximization algorithm used to find the MLE, and the seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values reached by the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it lies on the boundary of the parameter space, so we test the boundary values explicitly and output them if they have the highest likelihood).
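
Scripts written for the new format can still read old result files after a small conversion. The following is a minimal sketch, assuming the old layout shown above. The helper old_to_new and the file old.res are hypothetical; the helper only rewrites the pair column and does not add the nSites column, which old files lack.

```shell
# Hypothetical helper: turn the old "(a,b)" pair column into two columns "a b"
old_to_new() {
    sed -e '1s/^Pair/a b/' \
        -e 's/^(\([0-9][0-9]*\),\([0-9][0-9]*\))/\1 \2/' "$1"
}

# Two rows copied from the old 6-sample output above
cat > old.res <<'EOF'
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
EOF

old_to_new old.res
```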

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with allele frequencies for the sites that have genotype likelihoods.
The genotype likelihood file needs to contain a line for each site with 3 values for each individual (one log transformed genotype likelihood for each of the 3 possible genotypes encoded as 'double's) and it needs to be in binary format and gz compressed.
The frequency file needs to contain a line per site with the allele frequency of the site in it.
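
Because each site stores 3 values per individual, the uncompressed size of the genotype likelihood file can be used as a rough sanity check. The sketch below assumes 8-byte doubles and no extra header in the file; glf_sites is a hypothetical helper, and the dummy file is filled with zero bytes purely for illustration.

```shell
# Hypothetical helper: number of sites implied by the size of a gzipped binary glf file.
# Usage: glf_sites <file.glf.gz> <number of individuals>
glf_sites() {
    bytes=$(gunzip -c "$1" | wc -c)
    per_site=$(( $2 * 3 * 8 ))      # 3 doubles of 8 bytes per individual
    if [ $(( bytes % per_site )) -ne 0 ]; then
        echo "size does not match $2 individuals" >&2
        return 1
    fi
    echo $(( bytes / per_site ))
}

# Dummy file: 2 sites x 6 individuals x 3 values x 8 bytes = 288 zero bytes
head -c 288 /dev/zero | gzip -c > dummy.glf.gz
glf_sites dummy.glf.gz 6
```

The number printed should match the number of lines in the frequency file (e.g. the output of wc -l on that file), since that file has one line per site.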

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
Important recent changes:

#We have made -s 1 default (flips all allele frequencies from freq to 1-freq), since this is needed in almost all analyses. If you do not want the frequencies flipped then simply run the program with -s 0
#The output format has been changed to a more R friendly format (no ":" and no parentheses)
#The option -z has been added so one can get the sample IDs printed in the output (if one runs the program with -z idfilename)
#We have fixed -m 1 so the estimates can no longer be negative

See github for the full change log.

=Bugs/Improvements=
-Make better output message if files don't exist when using the extract_freq option

NgsRelate

2017-01-30T13:16:26Z

Thorfinn: /* How to download and install */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. Both can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

Method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on github: https://github.com/ANGSD/NgsRelate. On a linux or mac system with curl and g++ installed NgsRelate can be downloaded and installed as follows:
<pre>
curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >ngsRelate.cpp
g++ ngsRelate.cpp -O3 -lz -o ngsrelate
</pre>

=Run examples=
Below are three examples of how NgsRelate can be used to estimate relatedness from NGS data. Note that to be able to run all steps of the examples you need to have the programs ANGSD and PLINK installed and you also need to download large data files from both HapMap3 and 1000 Genomes webpages. Furthermore, the examples take several hours to run all in all. They are therefore just meant as illustrations of how NgsRelate can be run. '''If you want to quickly try out NgsRelate, e.g. to check if your installation works, you can download the final input data for NgsRelate used in the very last command in run example 2 here: http://www.popgen.dk/ida/NgsRelateExampleData/web/input/. Using that data you can try out NgsRelate by running that last command, i.e.'''

<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq -s 1 >res
</pre>

The output should be a file called res that contains relatedness estimates for all pairs between 6 individuals. A copy of this file can be found here http://www.popgen.dk/ida/NgsRelateExampleData/web/output/.

== Run example 1: using only NGS data==
Assume we have a file containing paths to 100 BAM/CRAM files, one line per BAM/CRAM file. Then we can use ANGSD to estimate frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for the NgsRelate program as follows:
<pre>
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq
</pre>
Once we have these files we can use NgsRelate to estimate relatedness between any pairs of individuals. E.g. if we want to estimate relatedness between the first two individuals (0 and 1) we can do it using the following command:
<pre>
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f" and the number of the two individuals after the options "-a" and "-b" . If -a and -b are not specified NgsRelate will loop through all pairs of individuals in the input file.
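
When -a and -b are omitted, the number of result rows should equal the number of unordered pairs, n(n-1)/2, which is a quick way to check that a run finished. A small sketch; npairs is a hypothetical helper and not an NgsRelate option.

```shell
# Hypothetical helper: number of unordered pairs among n individuals
npairs() {
    echo $(( $1 * ($1 - 1) / 2 ))
}

npairs 6      # prints 15, the number of rows in a 6-sample result file
npairs 100    # prints 4950, all pairs among the 100 individuals of this example
```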

== Run example 2: using NGS data with population frequencies estimated from genetic data from PLINK files ==
In this example we show how you can estimate relatedness between a number of individuals which you have NGS data from (in bam files) using genetic data from PLINK files for frequency estimation.
Assume the individuals we want to estimate relatedness from are from the population called LWK and assume we have files with genetic data from individuals from LWK as well as other populations in binary PLINK format (e.g. hapmap3_r2_b36_fwd.consensus.qc.polyHg19.*) and a file, LWK.fam, with the IDs of the LWK individuals in this dataset. Then using PLINK we can produce allele frequency information in a format that NgsRelate can use as follows:
<pre>
### extract individuals from LWK from huge binary plink file
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb

### calculate frequencies for this population
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
</pre>
Afterwards we can use ANGSD to calculate genotype likelihoods for the sites for which we have frequency info as follows:
<pre>
### extract the chr,pos,major,minor information about the sites we have frequency info from into a file
### (so we can extract data from these sites from the NGS data files)
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt

### index this file for angsd
./angsd sites index forAngsd.txt

### calculate genotype likelihoods for the six individuals for the sites we have frequency info on based on the bam files
### (assuming the paths to the bam files are listed in the file 'list'):
./angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
### this generates the output files angsdput.glf.gz and angsdput.glf.pos.gz.
</pre>

Finally we can use NgsRelate to estimate relatedness for the six individuals from which we have NGS data in bam files:
<pre>
### extract the frequencies and sync it to the angsd output
./ngsrelate extract_freq_bim angsdput.glf.pos.gz hapmap3Hg19LWK.bim LWKsub.frq >freq

### run ngsrelate
./ngsrelate -g angsdput.glf.gz -n 6 -f freq -s 1 >res
</pre>
The final relatedness estimates will then be available in the file called "res" which can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/. Note that we here used the option -s 1 to flip the allele frequencies (i.e. set them to 1 minus the frequencies in the freq file).
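
To illustrate what the flip described above does to a frequency file, the same transformation can be written with awk. This only mimics the described behaviour on a toy file and is not how NgsRelate implements -s 1; flip_freq and toy.freq are hypothetical names, and the three values are made up.

```shell
# Hypothetical helper: replace every frequency f by 1 - f (one value per line)
flip_freq() {
    awk '{ printf "%.6f\n", 1 - $1 }' "$1"
}

# Toy frequency file with made-up values, one frequency per site
printf '0.250000\n0.100000\n0.873000\n' > toy.freq
flip_freq toy.freq
```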

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run ngsRelate using population frequencies from Europe. We will extract the frequencies from the 1000 Genomes project VCF files.

<pre>
#Assuming that we have per-chromosome VCF files called: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump output in EUR_AF/*.frq
#We only use diallelic sites, we extract CHROM,POS,REF,ALT,EUR_AF tags from the vcf
#We then pulled out the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz

#Then we extract and match the freqs from the reference population with the sites where we had data. The parser expects a header, so make a dummy file

echo "header" |gzip -c >new
cat EUR_AF/ALL.frq.gz >>new
ngsRelate extract_freq new angsdput.glf.pos.gz >myfreq

</pre>

=Output format=
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column gives the indices of the two individuals used in the analysis. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The fifth column is the log-likelihood of the ML estimate. The sixth column is the number of iterations of the maximization algorithm used to find the MLE, and the seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values reached by the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it lies on the boundary of the parameter space, so we test the boundary values explicitly and output them if they have the highest likelihood).

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with allele frequencies for the sites that have genotype likelihoods.
The genotype likelihood file needs to contain a line for each site with 3 values for each individual (one log transformed genotype likelihood for each of the 3 possible genotypes encoded as 'double's) and it needs to be in binary format and gz compressed.
The frequency file needs to contain a line per site with the allele frequency of the site in it.

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
See github for log

=Bugs/Improvements=
-Make better output message if files don't exist when using the extract_freq option

NgsRelate

2017-01-27T11:10:51Z

Thorfinn: /* How to download and install */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. Both can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

Method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on github: https://github.com/ANGSD/NgsRelate. On a linux or mac system with curl and g++ installed NgsRelate can be downloaded and installed as follows:
<pre>
curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >NgsRelate.cpp
g++ NgsRelate.cpp -O3 -lz -o ngsrelate
</pre>

=Run examples=
Below are three examples of how NgsRelate can be used to estimate relatedness from NGS data. Note that to be able to run all steps of the examples you need to have the programs ANGSD and PLINK installed and you also need to download large data files from both HapMap3 and 1000 Genomes webpages. Furthermore, the examples take several hours to run all in all. They are therefore just meant as illustrations of how NgsRelate can be run. '''If you want to quickly try out NgsRelate, e.g. to check if your installation works, you can download the final input data for NgsRelate used in the very last command in run example 2 here: http://www.popgen.dk/ida/NgsRelateExampleData/web/input/. Using that data you can try out NgsRelate by running that last command, i.e.'''

<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq -s 1 >res
</pre>

The output should be a file called res that contains relatedness estimates for all pairs between 6 individuals. A copy of this file can be found here http://www.popgen.dk/ida/NgsRelateExampleData/web/output/.

== Run example 1: using only NGS data==
Assume we have a file containing paths to 100 BAM/CRAM files, one line per BAM/CRAM file. Then we can use ANGSD to estimate frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for the NgsRelate program as follows:
<pre>
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq
</pre>
Once we have these files we can use NgsRelate to estimate relatedness between any pairs of individuals. E.g. if we want to estimate relatedness between the first two individuals (0 and 1) we can do it using the following command:
<pre>
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f" and the number of the two individuals after the options "-a" and "-b" . If -a and -b are not specified NgsRelate will loop through all pairs of individuals in the input file.

== Run example 2: using NGS data with population frequencies estimated from genetic data from PLINK files ==
In this example we show how you can estimate relatedness between a number of individuals which you have NGS data from (in bam files) using genetic data from PLINK files for frequency estimation.
Assume the individuals we want to estimate relatedness from are from the population called LWK and assume we have files with genetic data from individuals from LWK as well as other populations in binary PLINK format (e.g. hapmap3_r2_b36_fwd.consensus.qc.polyHg19.*) and a file, LWK.fam, with the IDs of the LWK individuals in this dataset. Then using PLINK we can produce allele frequency information in a format that NgsRelate can use as follows:
<pre>
### extract individuals from LWK from huge binary plink file
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb

### calculate frequencies for this population
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
</pre>
Afterwards we can use ANGSD to calculate genotype likelihoods for the sites for which we have frequency info as follows:
<pre>
### extract the chr,pos,major,minor information about the sites we have frequency info from into a file
### (so we can extract data from these sites from the NGS data files)
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt

### index this file for angsd
./angsd sites index forAngsd.txt

### calculate genotype likelihoods for the six individuals for the sites we have frequency info on based on the bam files
### (assuming the paths to the bam files are listed in the file 'list'):
./angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
### this generates the output files angsdput.glf.gz and angsdput.glf.pos.gz.
</pre>

Finally we can use NgsRelate to estimate relatedness for the six individuals from which we have NGS data in bam files:
<pre>
### extract the frequencies and sync it to the angsd output
./ngsrelate extract_freq_bim angsdput.glf.pos.gz hapmap3Hg19LWK.bim LWKsub.frq >freq

### run ngsrelate
./ngsrelate -g angsdput.glf.gz -n 6 -f freq -s 1 >res
</pre>
The final relatedness estimates will then be available in the file called "res" which can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/. Note that we here used the option -s 1 to flip the allele frequencies (i.e. set them to 1 minus the frequencies in the freq file).

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run ngsRelate using population frequencies from Europe. We will extract the frequencies from the 1000 Genomes project VCF files.

<pre>
#Assuming that we have per-chromosome VCF files called: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump output in EUR_AF/*.frq
#We only use diallelic sites, we extract CHROM,POS,REF,ALT,EUR_AF tags from the vcf
#We then pulled out the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz

#Then we extract and match the freqs from the reference population with the sites where we had data. The parser expects a header, so make a dummy file

echo "header" |gzip -c >new
cat EUR_AF/ALL.frq.gz >>new
ngsRelate extract_freq new angsdput.glf.pos.gz >myfreq

</pre>

=Output format=
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column gives the indices of the two individuals used in the analysis. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients. The fifth column is the log-likelihood of the ML estimate. The sixth column is the number of iterations of the maximization algorithm used to find the MLE, and the seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values reached by the EM algorithm (ML methods sometimes have trouble finding the ML estimate when it lies on the boundary of the parameter space, so we test the boundary values explicitly and output them if they have the highest likelihood).
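
Pairs whose estimate was taken from the boundary test can be listed by filtering on the nIter column. The following is a minimal sketch, assuming the old format above where nIter is the sixth column; boundary_pairs and old6.res are hypothetical names, and the rows are copied from the 6-sample example.

```shell
# Hypothetical helper: print the pair labels of rows where nIter is -1 (old format, column 6)
boundary_pairs() {
    awk 'NR > 1 && $6 == -1 { print $1 }' "$1"
}

cat > old6.res <<'EOF'
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
EOF

boundary_pairs old6.res
```

Note that pair (1,4) is not listed: its estimate came from the EM algorithm (nIter is 9), which is why it can show a slightly negative k1.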

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with allele frequencies for the sites that have genotype likelihoods.
The genotype likelihood file needs to contain a line for each site with 3 values for each individual (one log transformed genotype likelihood for each of the 3 possible genotypes encoded as 'double's) and it needs to be in binary format and gz compressed.
The frequency file needs to contain a line per site with the allele frequency of the site in it.

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
See github for log

=Bugs/Improvements=
-Make better output message if files don't exist when using the extract_freq option

NgsRelate

2017-01-16T11:54:54Z

Thorfinn: /* Run example 3: using frequencies from 1000genomes vcf files */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. Both can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

Method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on github: https://github.com/ANGSD/NgsRelate. On a linux or mac system with curl and g++ installed NgsRelate can be downloaded and installed as follows:
<pre>
curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >NgsRelate.cpp
g++ NgsRelate.cpp -O3 -lz -o ngsrelate
</pre>

=Run examples=
Below are three examples of how NgsRelate can be used to estimate relatedness from NGS data. Note that to be able to run all steps of the examples you need to have the programs ANGSD and PLINK installed and you also need to download large data files from both HapMap3 and 1000 Genomes webpages. Furthermore, the examples take several hours to run all in all. They are therefore just meant as illustrations of how NgsRelate can be run. '''If you want to quickly try out NgsRelate, e.g. to check if your installation works, you can download the final input data for NgsRelate used in the very last command in run example 2 here: http://www.popgen.dk/ida/NgsRelateExampleData/web/input/. Using that data you can try out NgsRelate by running that last command, i.e.'''

<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq -s 1 >res
</pre>

The output should be a file called res that contains relatedness estimates for all pairs between 6 individuals. A copy of this file can be found here http://www.popgen.dk/ida/NgsRelateExampleData/web/output/.

== Run example 1: using only NGS data==
Assume we have a file containing paths to 100 BAM/CRAM files, one line per BAM/CRAM file. Then we can use ANGSD to estimate frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for the NgsRelate program as follows:
<pre>
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq
</pre>
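
The extraction step can be tried on a toy file before running it on real data. The sketch below assumes the allele frequency file is tab-separated with a header line and the frequency in the fifth column; the header names and the single data row are made up for illustration, and toy.mafs.gz is a hypothetical file name. gunzip -c is used instead of zcat because zcat behaves differently on some systems.

```shell
# Toy stand-in for angsdput.mafs.gz: one header line and one site (made-up values)
printf 'chromo\tposition\tmajor\tminor\tknownEM\tpK-EM\tnInd\n' > toy.mafs
printf '1\t14000\tG\tA\t0.250000\t0.0\t90\n' >> toy.mafs
gzip -c toy.mafs > toy.mafs.gz

# Same pipeline as above: keep column 5 and drop the header line
gunzip -c toy.mafs.gz | cut -f5 | sed 1d > toy.freq5
cat toy.freq5
```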
Once we have these files we can use NgsRelate to estimate relatedness between any pairs of individuals. E.g. if we want to estimate relatedness between the first two individuals (0 and 1) we can do it using the following command:
<pre>
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f" and the number of the two individuals after the options "-a" and "-b" . If -a and -b are not specified NgsRelate will loop through all pairs of individuals in the input file.

== Run example 2: using NGS data with population frequencies estimated from genetic data from PLINK files ==
In this example we show how you can estimate relatedness between a number of individuals which you have NGS data from (in bam files) using genetic data from PLINK files for frequency estimation.
Assume the individuals we want to estimate relatedness from are from the population called LWK and assume we have files with genetic data from individuals from LWK as well as other populations in binary PLINK format (e.g. hapmap3_r2_b36_fwd.consensus.qc.polyHg19.*) and a file, LWK.fam, with the IDs of the LWK individuals in this dataset. Then using PLINK we can produce allele frequency information in a format that NgsRelate can use as follows:
<pre>
### extract individuals from LWK from huge binary plink file
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb

### calculate frequencies for this population
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
</pre>
Afterwards we can use ANGSD to calculate genotype likelihoods for the sites for which we have frequency info as follows:
<pre>
### extract the chr,pos,major,minor information about the sites we have frequency info from into a file
### (so we can extract data from these sites from the NGS data files)
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt

### index this file for angsd
./angsd sites index forAngsd.txt

### calculate genotype likelihoods for the six individuals for the sites we have frequency info on based on the bam files
### (assuming the paths to the bam files are listed in the file 'list'):
./angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
### this generates the output files angsdput.glf.gz and angsdput.glf.pos.gz.
</pre>

Finally we can use NgsRelate to estimate relatedness for the six individuals from which we have NGS data in bam files:
<pre>
### extract the frequencies and sync it to the angsd output
./ngsrelate extract_freq_bim angsdput.glf.pos.gz hapmap3Hg19LWK.bim LWKsub.frq >freq

### run ngsrelate
./ngsrelate -g angsdput.glf.gz -n 6 -f freq -s 1 >res
</pre>
The final relatedness estimates will then be available in the file called "res" which can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/. Note that we here used the option -s 1 to flip the allele frequencies (i.e. set them to 1 minus the frequencies in the freq file).
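
To make the effect of -s 1 concrete, the following sketch mimics the flip on a toy frequency file with awk. The file names toy_freq and toy_freq_flipped and the three values are made up for illustration. NgsRelate performs the flip internally, so this is not a step you need to run yourself.

```shell
# Toy frequency file: one allele frequency per line (values are made up).
workdir=$(mktemp -d)
printf '0.10\n0.25\n0.50\n' > "$workdir/toy_freq"

# Mimic the flip: replace every frequency f by 1 - f.
awk '{ printf "%.2f\n", 1 - $1 }' "$workdir/toy_freq" > "$workdir/toy_freq_flipped"

# Join the flipped values on one line for display.
flipped=$(paste -s -d ' ' "$workdir/toy_freq_flipped")
echo "$flipped"
```

A frequency of 0.50 is unchanged by the flip, while 0.10 becomes 0.90.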

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run NgsRelate using population frequencies from Europe. We will extract the frequencies from the 1000 Genomes Project VCF files.

<pre>
#Assuming that we have per-chromosome VCF files named: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump output in EUR_AF/*.frq
#We only use diallelic sites, and we extract the CHROM,POS,REF,ALT,EUR_AF fields from the vcf
#We then pull out the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz

#Then we extract and match the freqs from the reference population with the sites where we had data. The parser expects a header, so make a dummy file

echo "header" |gzip -c >new
cat EUR_AF/ALL.frq.gz >>new
ngsRelate extract_freq new angsdput.glf.pos.gz >myfreq

</pre>
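
The filtering and de-duplication step in the loop above can be tried on a small scale. The sketch below applies the same awk filter and the same sort options to a toy table. The file names toy.tsv and toy.frq and all values are made up for illustration and do not come from the 1000 Genomes data.

```shell
workdir=$(mktemp -d)
# Toy table in CHROM, POS, REF, ALT, EUR_AF layout (values are made up).
# It contains one duplicated site (position 100) and one site with frequency 0 (position 200).
printf '1\t100\tA\tG\t0.12\n1\t100\tA\tG\t0.12\n1\t200\tC\tT\t0\n1\t300\tG\tA\t0.40\n' > "$workdir/toy.tsv"

# Drop sites with zero frequency, then keep one line per (CHROM, POS) key.
awk '$5 > 0' "$workdir/toy.tsv" | sort -u -k1,2 > "$workdir/toy.frq"

# Summarise what survived the filter.
n_sites=$(wc -l < "$workdir/toy.frq" | tr -d ' ')
positions=$(cut -f2 "$workdir/toy.frq" | paste -s -d ',' -)
echo "$n_sites sites kept at positions $positions"
```

The duplicated site is collapsed to a single line and the monomorphic site is removed, so two sites remain.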

=Output format=
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column identifies the pair of individuals that was analysed. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients k0, k1 and k2. The fifth column is the log-likelihood of the ML estimate. The sixth column is the number of iterations the maximization algorithm used to find the ML estimate. The seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (the default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values reached by the EM algorithm. ML methods sometimes have trouble finding the ML estimate when it lies on the boundary of the parameter space, so we test the boundary values explicitly and output them if they have the highest likelihood.
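
The output columns can be post-processed with standard tools. As a minimal sketch, the block below takes two rows copied from the six-sample example and computes the coefficient of relatedness r = k1/2 + k2 for each pair. This formula is the standard definition and is not a value printed by NgsRelate in the output shown here. The file names toy_res and toy_r are made up for illustration.

```shell
workdir=$(mktemp -d)
# Header plus two rows copied from the six-sample example output.
cat > "$workdir/toy_res" <<'EOF'
Pair k0 k1 k2 loglh nIter coverage
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
EOF

# Skip the header, then print the pair and r = k1/2 + k2.
awk 'NR > 1 { printf "%s %.6f\n", $1, $3 / 2 + $4 }' "$workdir/toy_res" > "$workdir/toy_r"
cat "$workdir/toy_r"
```

Pair (0,3) gets r = 0, as expected for unrelated individuals. Pair (1,2) gets r close to 0.5 with k1 close to 1, which is the pattern expected for a parent-offspring pair.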

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with frequencies for the sites that have genotype likelihoods.
The genotype likelihood file needs to contain, for each site, 3 values for each individual (one log transformed genotype likelihood for each of the 3 possible genotypes, encoded as 'double's), and it needs to be in binary format and gz compressed.
The frequency file needs to contain one line per site with the allele frequency of that site.
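
Since the two files must describe the same sites, a quick consistency check is to compare the number of sites implied by the size of the genotype likelihood file with the number of lines in the frequency file. The sketch below does this on dummy files. It assumes 8-byte doubles and no header in the binary file, and the file names toy.glf.gz and toy_freq are made up for illustration.

```shell
workdir=$(mktemp -d)
n_ind=6      # number of individuals, as passed to -n
n_sites=2    # number of sites in the toy file

# Dummy stand-in for a glf.gz file: n_sites * n_ind * 3 doubles of 8 bytes each (all zero bytes).
head -c $((n_sites * n_ind * 3 * 8)) /dev/zero | gzip -c > "$workdir/toy.glf.gz"
# Matching toy frequency file with one frequency per site (values are made up).
printf '0.20\n0.35\n' > "$workdir/toy_freq"

# Sites implied by the uncompressed size versus lines in the frequency file.
glf_bytes=$(gzip -dc "$workdir/toy.glf.gz" | wc -c | tr -d ' ')
glf_sites=$((glf_bytes / (n_ind * 3 * 8)))
freq_sites=$(wc -l < "$workdir/toy_freq" | tr -d ' ')
echo "glf sites: $glf_sites, freq lines: $freq_sites"
```

If the two numbers differ for your own files, the frequency file and the genotype likelihood file are probably out of sync.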

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
See github for log

=Bugs/Improvements=
-Make better output message if files don't exist when using the extract_freq option

NgsRelate

2017-01-03T12:49:35Z

Thorfinn: /* Run example 2: using NGS data with population frequencies estimated from genetic data from PLINK files */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To infer relatedness you need population allele frequencies and genotype likelihoods. Both can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

The method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on github: https://github.com/ANGSD/NgsRelate. On a linux or mac system with curl and g++ installed NgsRelate can be downloaded and installed as follows:
<pre>
curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >NgsRelate.cpp
g++ NgsRelate.cpp -O3 -lz -o ngsrelate
</pre>

=Run examples=
Below are three examples of how NgsRelate can be used to estimate relatedness from NGS data. Note that to run all steps of the examples you need to have the programs ANGSD and PLINK installed, and you also need to download large data files from both the HapMap3 and 1000 Genomes webpages. Furthermore, the examples take several hours to run in total. They are therefore meant only as illustrations of how NgsRelate can be run. '''If you want to quickly try out NgsRelate, e.g. to check that your installation works, you can download the final input data for NgsRelate used in the very last command in run example 2 here: http://www.popgen.dk/ida/NgsRelateExampleData/web/input/. Using that data you can try out NgsRelate by running that last command, i.e.'''

<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq -s 1 >res
</pre>

The output should be a file called res that contains relatedness estimates for all pairs between 6 individuals. A copy of this file can be found here http://www.popgen.dk/ida/NgsRelateExampleData/web/output/.

== Run example 1: using only NGS data==
Assume we have a file containing paths to 100 BAM/CRAM files, one line per BAM/CRAM file. Then we can use ANGSD to estimate frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for the NgsRelate program as follows:
<pre>
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq
</pre>
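
To see what the extraction step does, the sketch below runs the same cut and sed pipeline on a mock allele frequency table. The column layout of the mock file (frequency in column 5) is an assumption made for illustration, so check the header of your own angsdput.mafs.gz before relying on it. The file names toy.mafs.gz and toy_freq and all values are made up. The sketch uses gzip -dc instead of zcat because zcat behaves differently on some systems.

```shell
workdir=$(mktemp -d)
# Mock of a gzipped allele frequency table (column names and values are illustrative).
printf 'chromo\tposition\tmajor\tminor\tknownEM\tpK-EM\tnInd\n1\t1000\tA\tG\t0.231\t0\t98\n1\t2000\tC\tT\t0.077\t0\t100\n' | gzip -c > "$workdir/toy.mafs.gz"

# Same idea as the pipeline above: take column 5 and drop the header line.
gzip -dc "$workdir/toy.mafs.gz" | cut -f5 | sed 1d > "$workdir/toy_freq"

# Join the extracted frequencies on one line for display.
toy_freq=$(paste -s -d ' ' "$workdir/toy_freq")
echo "$toy_freq"
```

The result is a plain text file with one frequency per site and no header.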
Once we have these files we can use NgsRelate to estimate relatedness between any pairs of individuals. E.g. if we want to estimate relatedness between the first two individuals (0 and 1) we can do it using the following command:
<pre>
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f" and the indices of the two individuals (counting from 0) after the options "-a" and "-b". If -a and -b are not specified, NgsRelate will loop through all pairs of individuals in the input file.
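
When all pairs are analysed, the number of pairs is n(n-1)/2, so 100 individuals give 4950 pairs and 6 individuals give 15 pairs. The sketch below enumerates the pairs for n = 6 with 0-based indices. It only illustrates the pair numbering and does not call NgsRelate. The variable names are made up for illustration.

```shell
n=6   # number of individuals, as passed to -n
pairs=""
a=0
# Enumerate every unordered pair (a,b) with a < b, using 0-based indices.
while [ "$a" -lt $((n - 1)) ]; do
  b=$((a + 1))
  while [ "$b" -lt "$n" ]; do
    pairs="$pairs($a,$b) "
    b=$((b + 1))
  done
  a=$((a + 1))
done

# Closed-form count of pairs: n(n-1)/2.
n_pairs=$((n * (n - 1) / 2))
echo "$n_pairs pairs: $pairs"
```

The enumeration starts at (0,1) and ends at (4,5).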

== Run example 2: using NGS data with population frequencies estimated from genetic data from PLINK files ==
In this example we show how to estimate relatedness among a number of individuals for which you have NGS data (in bam files), using genetic data from PLINK files to estimate the population allele frequencies.
Assume the individuals we want to estimate relatedness for are from the population called LWK, and assume we have files with genetic data from individuals from LWK as well as other populations in binary PLINK format (e.g. hapmap3_r2_b36_fwd.consensus.qc.polyHg19.*) and a file, LWK.fam, with the IDs of the LWK individuals in this dataset. Then using PLINK we can produce allele frequency information in a format that NgsRelate can use as follows:
<pre>
### extract individuals from LWK from huge binary plink file
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb

### calculate frequencies for this population
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
</pre>
Afterwards we can use ANGSD to calculate genotype likelihoods for the sites for which we have frequency information, as follows:
<pre>
### extract the chr,pos,major,minor information about the sites we have frequency info from into a file
### (so we can extract data from these sites from the NGS data files)
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt

### index this file for angsd
./angsd sites index forAngsd.txt

### calculate genotype likelihoods for the six individuals for the sites we have frequency info on based on the bam files
### (assuming the paths to the bam files are listed in the file 'list'):
./angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
### this generates the output files angsdput.glf.gz and angsdput.glf.pos.gz.
</pre>

Finally we can use NgsRelate to estimate relatedness for the six individuals from which we have NGS data in bam files:
<pre>
### extract the frequencies and sync it to the angsd output
./ngsrelate extract_freq_bim angsdput.glf.pos.gz hapmap3Hg19LWK.bim LWKsub.frq >freq

### run ngsrelate
./ngsrelate -g angsdput.glf.gz -n 6 -f freq -s 1 >res
</pre>
The final relatedness estimates will then be available in the file called "res" which can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/. Note that we here used the option -s 1 to flip the allele frequencies (i.e. set them to 1 minus the frequencies in the freq file).

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run NgsRelate using population frequencies from Europe. We will extract the frequencies from the 1000 Genomes Project VCF files.

<pre>
#Assuming that we have per-chromosome VCF files named: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump output in EUR_AF/*.frq
#We only use diallelic sites, and we extract the CHROM,POS,REF,ALT,EUR_AF fields from the vcf
#We then pull out the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz
</pre>

=Output format=
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column identifies the pair of individuals that was analysed. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients k0, k1 and k2. The fifth column is the log-likelihood of the ML estimate. The sixth column is the number of iterations the maximization algorithm used to find the ML estimate. The seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (the default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values reached by the EM algorithm. ML methods sometimes have trouble finding the ML estimate when it lies on the boundary of the parameter space, so we test the boundary values explicitly and output them if they have the highest likelihood.

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with frequencies for the sites that have genotype likelihoods.
The genotype likelihood file needs to contain, for each site, 3 values for each individual (one log transformed genotype likelihood for each of the 3 possible genotypes, encoded as 'double's), and it needs to be in binary format and gz compressed.
The frequency file needs to contain one line per site with the allele frequency of that site.

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
See github for log

=Bugs/Improvements=
-Make better output message if files don't exist when using the extract_freq option

NgsRelate

2016-12-27T07:33:24Z

Thorfinn: /* Run example 3: using frequencies from 1000genomes vcf files */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To infer relatedness you need population allele frequencies and genotype likelihoods. Both can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

The method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on github: https://github.com/ANGSD/NgsRelate. On a linux or mac system with curl and g++ installed NgsRelate can be downloaded and installed as follows:
<pre>
curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >NgsRelate.cpp
g++ NgsRelate.cpp -O3 -lz -o ngsrelate
</pre>

=Run examples=
Below are three examples of how NgsRelate can be used to estimate relatedness from NGS data. Note that to run all steps of the examples you need to have the programs ANGSD and PLINK installed, and you also need to download large data files from both the HapMap3 and 1000 Genomes webpages. Furthermore, the examples take several hours to run in total. They are therefore meant only as illustrations of how NgsRelate can be run. '''If you want to quickly try out NgsRelate, e.g. to check that your installation works, you can download the final input data for NgsRelate used in the very last command in run example 2 here: http://www.popgen.dk/ida/NgsRelateExampleData/web/input/. Using that data you can try out NgsRelate by running that last command, i.e.'''

<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq -s 1 >res
</pre>

The output should be a file called res that contains relatedness estimates for all pairs between 6 individuals. A copy of this file can be found here http://www.popgen.dk/ida/NgsRelateExampleData/web/output/.

== Run example 1: using only NGS data==
Assume we have a file containing paths to 100 BAM/CRAM files, one line per BAM/CRAM file. Then we can use ANGSD to estimate frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for the NgsRelate program as follows:
<pre>
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq
</pre>
Once we have these files we can use NgsRelate to estimate relatedness between any pairs of individuals. E.g. if we want to estimate relatedness between the first two individuals (0 and 1) we can do it using the following command:
<pre>
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f" and the indices of the two individuals (counting from 0) after the options "-a" and "-b". If -a and -b are not specified, NgsRelate will loop through all pairs of individuals in the input file.

== Run example 2: using NGS data with population frequencies estimated from genetic data from PLINK files ==
In this example we show how to estimate relatedness among a number of individuals for which you have NGS data (in bam files), using genetic data from PLINK files to estimate the population allele frequencies.
Assume the individuals we want to estimate relatedness for are from the population called LWK, and assume we have files with genetic data from individuals from LWK as well as other populations in binary PLINK format (e.g. hapmap3_r2_b36_fwd.consensus.qc.polyHg19.*) and a file, LWK.fam, with the IDs of the LWK individuals in this dataset. Then using PLINK we can produce allele frequency information in a format that NgsRelate can use as follows:
<pre>
### extract individuals from LWK from huge binary plink file
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb

### calculate frequencies for this population
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
</pre>
Afterwards we can use ANGSD to calculate genotype likelihoods for the sites for which we have frequency information, as follows:
<pre>
### extract the chr,pos,major,minor information about the sites we have frequency info from into a file
### (so we can extract data from these sites from the NGS data files)
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt

### index this file for angsd
./angsd sites index forAngsd.txt

### calculate genotype likelihoods for the six individuals for the sites we have frequency info on based on the bam files
### (assuming the paths to the bam files are listed in the file 'list'):
./angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
### this generates the output files angsdput.glf.gz and angsdput.glf.pos.gz.
</pre>

Finally we can use NgsRelate to estimate relatedness for the six individuals from which we have NGS data in bam files:
<pre>
### extract the frequencies and sync it to the angsd output
./ngsrelate extract_freq angsdput.glf.pos.gz hapmap3Hg19LWK.bim LWKsub.frq >freq

### run ngsrelate
./ngsrelate -g angsdput.glf.gz -n 6 -f freq -s 1 >res
</pre>
The final relatedness estimates will then be available in the file called "res" which can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/. Note that we here used the option -s 1 to flip the allele frequencies (i.e. set them to 1 minus the frequencies in the freq file).

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run NgsRelate using population frequencies from Europe. We will extract the frequencies from the 1000 Genomes Project VCF files.

<pre>
#Assuming that we have per-chromosome VCF files named: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump output in EUR_AF/*.frq
#We only use diallelic sites, and we extract the CHROM,POS,REF,ALT,EUR_AF fields from the vcf
#We then pull out the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz
</pre>

=Output format=
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column identifies the pair of individuals that was analysed. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients k0, k1 and k2. The fifth column is the log-likelihood of the ML estimate. The sixth column is the number of iterations the maximization algorithm used to find the ML estimate. The seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (the default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values reached by the EM algorithm. ML methods sometimes have trouble finding the ML estimate when it lies on the boundary of the parameter space, so we test the boundary values explicitly and output them if they have the highest likelihood.

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with frequencies for the sites that have genotype likelihoods.
The genotype likelihood file needs to contain, for each site, 3 values for each individual (one log transformed genotype likelihood for each of the 3 possible genotypes, encoded as 'double's), and it needs to be in binary format and gz compressed.
The frequency file needs to contain one line per site with the allele frequency of that site.

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
See github for log

=Bugs/Improvements=
-Make better output message if files don't exist when using the extract_freq option

NgsRelate

2016-12-27T07:32:39Z

Thorfinn: /* Run example 3: using frequencies from 1000genomes vcf files */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To infer relatedness you need population allele frequencies and genotype likelihoods. Both can be obtained e.g. using the program ANGSD as shown in the examples below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start.

The method is published here: http://bioinformatics.oxfordjournals.org/content/early/2015/08/29/bioinformatics.btv509.abstract

=How to download and install=
The source code for NgsRelate is deposited on github: https://github.com/ANGSD/NgsRelate. On a linux or mac system with curl and g++ installed NgsRelate can be downloaded and installed as follows:
<pre>
curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >NgsRelate.cpp
g++ NgsRelate.cpp -O3 -lz -o ngsrelate
</pre>

=Run examples=
Below are three examples of how NgsRelate can be used to estimate relatedness from NGS data. Note that to run all steps of the examples you need to have the programs ANGSD and PLINK installed, and you also need to download large data files from both the HapMap3 and 1000 Genomes webpages. Furthermore, the examples take several hours to run in total. They are therefore meant only as illustrations of how NgsRelate can be run. '''If you want to quickly try out NgsRelate, e.g. to check that your installation works, you can download the final input data for NgsRelate used in the very last command in run example 2 here: http://www.popgen.dk/ida/NgsRelateExampleData/web/input/. Using that data you can try out NgsRelate by running that last command, i.e.'''

<pre>
./ngsrelate -g angsdput.glf.gz -n 6 -f freq -s 1 >res
</pre>

The output should be a file called res that contains relatedness estimates for all pairs between 6 individuals. A copy of this file can be found here http://www.popgen.dk/ida/NgsRelateExampleData/web/output/.

== Run example 1: using only NGS data==
Assume we have a file containing paths to 100 BAM/CRAM files, one line per BAM/CRAM file. Then we can use ANGSD to estimate frequencies and calculate genotype likelihoods while doing SNP calling, and in the end produce the input files needed for the NgsRelate program as follows:
<pre>
### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq
</pre>
Once we have these files we can use NgsRelate to estimate relatedness between any pairs of individuals. E.g. if we want to estimate relatedness between the first two individuals (0 and 1) we can do it using the following command:
<pre>
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify the name of our file with genotype likelihoods after the option "-g", the number of individuals in the file after the option "-n", the name of the file with allele frequencies after the option "-f" and the indices of the two individuals (counting from 0) after the options "-a" and "-b". If -a and -b are not specified, NgsRelate will loop through all pairs of individuals in the input file.

== Run example 2: using NGS data with population frequencies estimated from genetic data from PLINK files ==
In this example we show how to estimate relatedness among a number of individuals for which you have NGS data (in bam files), using genetic data from PLINK files to estimate the population allele frequencies.
Assume the individuals we want to estimate relatedness for are from the population called LWK, and assume we have files with genetic data from individuals from LWK as well as other populations in binary PLINK format (e.g. hapmap3_r2_b36_fwd.consensus.qc.polyHg19.*) and a file, LWK.fam, with the IDs of the LWK individuals in this dataset. Then using PLINK we can produce allele frequency information in a format that NgsRelate can use as follows:
<pre>
### extract individuals from LWK from huge binary plink file
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb

### calculate frequencies for this population
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
</pre>
Afterwards we can use ANGSD to calculate genotype likelihoods for the sites for which we have frequency information, as follows:
<pre>
### extract the chr,pos,major,minor information about the sites we have frequency info from into a file
### (so we can extract data from these sites from the NGS data files)
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt

### index this file for angsd
./angsd sites index forAngsd.txt

### calculate genotype likelihoods for the six individuals for the sites we have frequency info on based on the bam files
### (assuming the paths to the bam files are listed in the file 'list'):
./angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
### this generates the output files angsdput.glf.gz and angsdput.glf.pos.gz.
</pre>

Finally we can use NgsRelate to estimate relatedness for the six individuals from which we have NGS data in bam files:
<pre>
### extract the frequencies and sync it to the angsd output
./ngsrelate extract_freq angsdput.glf.pos.gz hapmap3Hg19LWK.bim LWKsub.frq >freq

### run ngsrelate
./ngsrelate -g angsdput.glf.gz -n 6 -f freq -s 1 >res
</pre>
The final relatedness estimates will then be available in the file called "res" which can be found here: http://www.popgen.dk/ida/NgsRelateExampleData/web/output/. Note that we here used the option -s 1 to flip the allele frequencies (i.e. set them to 1 minus the frequencies in the freq file).

== Run example 3: using frequencies from 1000genomes vcf files==
We want to run NgsRelate using population frequencies from Europe. We will extract the frequencies from the 1000 Genomes Project VCF files.

<pre>
#Assuming that we have per-chromosome vcf files called: ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
#We dump output in EUR_AF/*.frq
#We only use diallelic sites; we extract the CHROM,POS,REF,ALT,EUR_AF fields from the vcf
#We then pull out the unique sites.
for f in `seq 1 22`
do
IF=/storage/data_shared/callsets/1000genomes/phase3/vcf/ALL.chr${f}.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
echo "bcftools view -m2 -M2 -v snps ${IF} | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%EUR_AF\n' |awk '{if(\$5>0) print \$0 }'|sort -S 50% -u -k1,2 >EUR_AF/${f}.frq"
done|parallel

##We merge into one file
cat EUR_AF/1.frq >EUR_AF/ALL.frq
for i in `seq 2 22`
do
cat EUR_AF/${i}.frq >>EUR_AF/ALL.frq
done
gzip EUR_AF/ALL.frq

#we extract the first 4 columns, which is the sites input for angsd
gunzip -c EUR_AF/ALL.frq.gz |cut -f1-4 |gzip -c >EUR_AF/sites.txt.gz

./angsd/angsd sites index EUR_AF/sites.txt.gz
./angsd/angsd -b list -gl 1 -domajorminor 3 -C 50 -ref /storage/data_shared/reference_genomes/hs37d5/hs37d5.fa -doglf 3 -minmapq 30 -minq 20 -sites EUR_AF/sites.txt.gz
</pre>

=Output format=
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat res
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column indicates which pair of individuals was analysed. The next three columns are the maximum likelihood (ML) estimates of the relatedness coefficients k0, k1 and k2. The fifth column is the log-likelihood of the ML estimate. The sixth column is the number of iterations the maximization algorithm used to find the MLE. The seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and where the minor allele frequency (MAF) was above the threshold (default is 0.05, but the user may specify a different threshold). Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM algorithm. ML methods sometimes have trouble finding the ML estimate when it lies on the boundary of the parameter space, so we test the boundary values explicitly and output them if they have the highest likelihood.
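
As a rough illustration of how these columns can be used downstream, the sketch below computes a kinship coefficient for each pair as k1/4 + k2/2. This is the standard textbook relation between the k coefficients and kinship, not something NgsRelate prints in this output format. The function name kinship_from_res is hypothetical, and the input rows are copied from the example output.

```shell
# Hypothetical helper: for each result row print the pair and k1/4 + k2/2.
# Column 1 is the pair, columns 3 and 4 are k1 and k2; the header is skipped.
kinship_from_res() {
    awk 'NR > 1 { printf "%s\t%.6f\n", $1, $3 / 4 + $4 / 2 }' "$@"
}

# Header plus two rows taken from the example output
printf '%s\n' \
  'Pair k0 k1 k2 loglh nIter coverage' \
  '(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930' \
  '(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266' |
  kinship_from_res
```

On a real run the function can be given the result file directly, e.g. kinship_from_res res.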

= Input file format =
NgsRelate takes two files as input: a file with genotype likelihoods and a file with allele frequencies for the sites that have genotype likelihoods.
The genotype likelihood file needs to contain, for each site, 3 values for each individual (one log-transformed genotype likelihood for each of the 3 possible genotypes, encoded as 'double's), and it needs to be in binary format and gz compressed.
The frequency file needs to contain one line per site with the allele frequency of that site.
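
Given that layout, the number of sites in a genotype likelihood file can be sanity-checked from its uncompressed size. The sketch below assumes 8-byte doubles and no header, so that each site occupies 3 * 8 * n bytes for n individuals. The function name glf_num_sites and the demo file are hypothetical.

```shell
# Hypothetical helper: number of sites = uncompressed bytes / (3 * 8 * n).
# $1 is the gz compressed genotype likelihood file, $2 the number of individuals.
glf_num_sites() {
    bytes=$(gunzip -c < "$1" | wc -c | tr -d ' ')
    echo $(( bytes / (24 * $2) ))
}

# Demo: 144 zero bytes stand in for 3 sites of 2 individuals (3 * 2 * 3 doubles)
demo=$(mktemp)
head -c 144 /dev/zero | gzip -c > "$demo"
glf_num_sites "$demo" 2
rm -f "$demo"
```

The count should match the number of lines in the frequency file; a mismatch suggests the two inputs are not in sync.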

= Help and additional options =
To get help and a list of all options simply type

<pre>
./ngsrelate
</pre>

= Citing and references =

= Changelog =
See github for log

=Bugs/Improvements=
-Make better output message if files don't exist when using the extract_freq option

NgsRelate

2016-12-27T07:27:46Z

Thorfinn: /* Run example 3: using frequencies from 1000genomes vcf files */

NgsRelate

2016-12-27T07:25:26Z

Thorfinn: /* Run example 2: using NGS data with population frequencies estimated from genetic data from PLINK files */

NgsRelate

2015-08-31T08:06:41Z

Thorfinn: /* Brief description */

NgsRelate

2015-08-04T17:39:15Z

Thorfinn:

NgsRelate

2015-06-23T17:38:50Z

Thorfinn: /* Run example using NGS data with plink population frequencies */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals for low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in the example below.

=Download and Installation=
Primary repository is github. https://github.com/ANGSD/NgsRelate
<pre>
curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >NgsRelate.cpp
g++ NgsRelate.cpp -O3 -lz -o ngsrelate
</pre>

= Run example using only NGS data=
Assume we have a file containing paths to 100 BAM/CRAM files. We can then use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling, and to dump the input files needed for the NgsRelate program:
<pre>
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3
#this generates angsdput.mafs.gz and angsdput.glf.gz.
#we need to extract the frequency column from the gz compressed mafs file and remove the header
gunzip -c angsdput.mafs.gz |cut -f5 |sed 1d >freq
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify that our binary genotype likelihood file contains 100 samples (-n 100), and that we want to run the analysis for the first two samples (-a 0 -b 1).
If -a and -b are not specified, it will loop through all pairs.
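
For n samples there are n(n-1)/2 distinct pairs. The sketch below lists them as (a,b) with a < b and zero-based indices, matching the (0,1) style labels in the output. The function name list_pairs is hypothetical, and the order in which NgsRelate itself visits the pairs is an assumption.

```shell
# Hypothetical helper: print all pairs (a,b) with 0 <= a < b < n.
list_pairs() {
    n=$1
    a=0
    while [ "$a" -lt "$n" ]; do
        b=$((a + 1))
        while [ "$b" -lt "$n" ]; do
            echo "($a,$b)"
            b=$((b + 1))
        done
        a=$((a + 1))
    done
}

# With 6 samples this prints 15 pairs, from (0,1) to (4,5)
list_pairs 6
```

With 100 samples the same loop yields 4950 pairs, which is why -a and -b are convenient for testing a single pair first.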

= Run example using NGS data with plink population frequencies=
Assume we have population genotype data in binary PLINK format and a file ('list') containing paths to BAM/CRAM files. We can then use PLINK to calculate population frequencies and ANGSD to calculate genotype likelihoods at the same sites, producing the input files needed for the NgsRelate program:
<pre>
### extract population from huge binary plink file
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb

### calculate frequencies for the subpopulation
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub

### we are only interested in the output from the snpsites with the seq data. so we extract the chr,pos,major,minor
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt

### index the file for angsd
angsd sites index forAngsd.txt

##assuming 'list' contains path to bams
angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
#this generates an angsdput.glf.gz and a angsdput.glf.pos.gz.

#extract the frequencies and sync it to the angsd output
./NgsRelate/a.out extract_freq angsdput.glf.pos.gz files/hapmap3Hg19LWK.bim files/LWKsub.frq >freq

#run ngsrelate
ngsrelate -g angsdput.glf.gz -n 6 -f freq >resi
</pre>
Here we specify that our binary genotype likelihood file contains 6 samples (-n 6).
Since no -a and -b are specified, it will loop through all pairs.

=Output=
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat resi
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column contains the pair of individuals that was used for the analysis. The next three columns are the estimated relatedness coefficients. The fifth column is the log-likelihood of the MLE. The sixth column is the number of iterations required to find the MLE. The seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and the minor allele frequency (MAF) was above the threshold (default is 0.05, but it may also be user specified).

= Input file format =
The genotype likelihood input file is binary and gz compressed; it contains log-likelihood ratios encoded as doubles, 3 values per sample.
The freq file is allowed to be gz compressed.

= Citing and references =

= Changelog =
See github for log

NgsRelate

2015-06-23T17:34:24Z

Thorfinn: /* Run example using NGS data with plink population frequencies */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals for low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in the example below.

=Download and Installation=
Primary repository is github. https://github.com/ANGSD/NgsRelate
<pre>
curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >NgsRelate.cpp
g++ NgsRelate.cpp -O3 -lz -o ngsrelate
</pre>

= Run example using only NGS data=
Assume we have a file containing paths to 100 BAM/CRAM files. We can then use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling, and to dump the input files needed for the NgsRelate program:
<pre>
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3
#this generates angsdput.mafs.gz and angsdput.glf.gz.
#we need to extract the frequency column from the gz compressed mafs file and remove the header
gunzip -c angsdput.mafs.gz |cut -f5 |sed 1d >freq
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify that our binary genotype likelihood file contains 100 samples (-n 100), and that we want to run the analysis for the first two samples (-a 0 -b 1).
If -a and -b are not specified, it will loop through all pairs.

= Run example using NGS data with plink population frequencies=
Assume we have population genotype data in binary PLINK format and a file ('list') containing paths to BAM/CRAM files. We can then use PLINK to calculate population frequencies and ANGSD to calculate genotype likelihoods at the same sites, producing the input files needed for the NgsRelate program:
<pre>
### extract plink
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb
### calculate frequencies
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
### find sites from plink files
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt
### index the file
angsd sites index forAngsd.txt
##assuming 'list' contains path to bams
angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
#this generates an angsdput.glf.gz and a angsdput.glf.pos.gz.

#extract the frequencies and sync it to the angsd output
./NgsRelate/a.out extract_freq angsdput.glf.pos.gz files/hapmap3Hg19LWK.bim files/LWKsub.frq >freq

#run ngsrelate
ngsrelate -g angsdput.glf.gz -n 6 -f freq >resi
</pre>
Here we specify that our binary genotype likelihood file contains 6 samples (-n 6).
Since no -a and -b are specified, it will loop through all pairs.

=Output=
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat resi
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column contains the pair of individuals that was used for the analysis. The next three columns are the estimated relatedness coefficients. The fifth column is the log-likelihood of the MLE. The sixth column is the number of iterations required to find the MLE. The seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and the minor allele frequency (MAF) was above the threshold (default is 0.05, but it may also be user specified).

= Input file format =
The genotype likelihood input file is binary and gz compressed; it contains log-likelihood ratios encoded as doubles, 3 values per sample.
The freq file is allowed to be gz compressed.

= Citing and references =

= Changelog =
See github for log

NgsRelate

2015-06-23T17:33:02Z

Thorfinn: /* Output */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals for low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in the example below.

=Download and Installation=
Primary repository is github. https://github.com/ANGSD/NgsRelate
<pre>
curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >NgsRelate.cpp
g++ NgsRelate.cpp -O3 -lz -o ngsrelate
</pre>

= Run example using only NGS data=
Assume we have a file containing paths to 100 BAM/CRAM files. We can then use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling, and to dump the input files needed for the NgsRelate program:
<pre>
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3
#this generates angsdput.mafs.gz and angsdput.glf.gz.
#we need to extract the frequency column from the gz compressed mafs file and remove the header
gunzip -c angsdput.mafs.gz |cut -f5 |sed 1d >freq
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify that our binary genotype likelihood file contains 100 samples (-n 100), and that we want to run the analysis for the first two samples (-a 0 -b 1).
If -a and -b are not specified, it will loop through all pairs.

= Run example using NGS data with plink population frequencies=
Assume we have population genotype data in binary PLINK format and a file ('list') containing paths to BAM/CRAM files. We can then use PLINK to calculate population frequencies and ANGSD to calculate genotype likelihoods at the same sites, producing the input files needed for the NgsRelate program:
<pre>
### extract plink
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb
### calculate frequencies
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
### find sites from plink files
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt
### index the file
angsd sites index forAngsd.txt
##assuming 'list' contains path to bams
angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
#this generates an angsdput.glf.gz and a angsdput.glf.pos.gz.

./NgsRelate/a.out extract_freq angsdput.glf.pos.gz files/hapmap3Hg19LWK.bim files/LWKsub.frq >freq

</pre>
Here we specify that our binary genotype likelihood file contains 100 samples, and that we want to run the analysis for the first two samples -a 0 -b 1.
If no -a and -b are specified it will loop through all pairs

=Output=
Example of output with two samples:
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

Example of output with 6 samples:
<pre>
cat resi
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930
(0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822
(0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266
(0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856
(0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663
(1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912
(1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272
(1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887
(1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917
(2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297
(2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804
(2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903
(3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276
(3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637
(4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734
</pre>

The first column contains the pair of individuals that was used for the analysis. The next three columns are the estimated relatedness coefficients. The fifth column is the log-likelihood of the MLE. The sixth column is the number of iterations required to find the MLE. The seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and the minor allele frequency (MAF) was above the threshold (default is 0.05, but it may also be user specified).

= Input file format =
The genotype likelihood input file is binary and gz compressed; it contains log-likelihood ratios encoded as doubles, 3 values per sample.
The freq file is allowed to be gz compressed.

= Citing and references =

= Changelog =
See github for log

NgsRelate

2015-06-23T15:42:33Z

Thorfinn: /* Run example using NGS data with plink population frequencies */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals for low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in the example below.

=Download and Installation=
Primary repository is github. https://github.com/ANGSD/NgsRelate
<pre>
curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >NgsRelate.cpp
g++ NgsRelate.cpp -O3 -lz -o ngsrelate
</pre>

= Run example using only NGS data=
Assume we have a file containing paths to 100 BAM/CRAM files. We can then use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling, and to dump the input files needed for the NgsRelate program:
<pre>
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3
#this generates angsdput.mafs.gz and angsdput.glf.gz.
#we need to extract the frequency column from the gz compressed mafs file and remove the header
gunzip -c angsdput.mafs.gz |cut -f5 |sed 1d >freq
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify that our binary genotype likelihood file contains 100 samples (-n 100), and that we want to run the analysis for the first two samples (-a 0 -b 1).
If -a and -b are not specified, it will loop through all pairs.

= Run example using NGS data with plink population frequencies=
Assume we have population genotype data in binary PLINK format and a file ('list') containing paths to BAM/CRAM files. We can then use PLINK to calculate population frequencies and ANGSD to calculate genotype likelihoods at the same sites, producing the input files needed for the NgsRelate program:
<pre>
### extract plink
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb
### calculate frequencies
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
### find sites from plink files
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt
### index the file
angsd sites index forAngsd.txt
##assuming 'list' contains path to bams
angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20
#this generates an angsdput.glf.gz and a angsdput.glf.pos.gz.

./NgsRelate/a.out extract_freq angsdput.glf.pos.gz files/hapmap3Hg19LWK.bim files/LWKsub.frq >freq

</pre>
Here we specify that our binary genotype likelihood file contains 100 samples, and that we want to run the analysis for the first two samples -a 0 -b 1.
If no -a and -b are specified it will loop through all pairs

=Output=
Example of output
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

The first column contains the pair of individuals that was used for the analysis. The next three columns are the estimated relatedness coefficients. The fifth column is the log-likelihood of the MLE. The sixth column is the number of iterations required to find the MLE. The seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and the minor allele frequency (MAF) was above the threshold (default is 0.05, but it may also be user specified).

= Input file format =
The genotype likelihood input file is binary and gz compressed; it contains log-likelihood ratios encoded as doubles, 3 values per sample.
The freq file is allowed to be gz compressed.

= Citing and references =

= Changelog =
See github for log

NgsRelate

2015-06-23T15:26:50Z

Thorfinn: /* Run example using NGS data with plink population frequencies */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals for low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in the example below.

=Download and Installation=
Primary repository is github. https://github.com/ANGSD/NgsRelate
<pre>
curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >NgsRelate.cpp
g++ NgsRelate.cpp -O3 -lz -o ngsrelate
</pre>

= Run example using only NGS data=
Assume we have a file containing paths to 100 BAM/CRAM files. We can then use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling, and to dump the input files needed for the NgsRelate program:
<pre>
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3
#this generates angsdput.mafs.gz and angsdput.glf.gz.
#we need to extract the frequency column from the gz compressed mafs file and remove the header
gunzip -c angsdput.mafs.gz |cut -f5 |sed 1d >freq
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify that our binary genotype likelihood file contains 100 samples (-n 100), and that we want to run the analysis for the first two samples (-a 0 -b 1).
If -a and -b are not specified, it will loop through all pairs.

= Run example using NGS data with plink population frequencies=
Assume we have a file containing paths to 100 BAM/CRAM files. We can then use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling, and to dump the input files needed for the NgsRelate program:
<pre>
### extract plink
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb
### calculate frequencies
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
### find sites from plink files
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt
### index the file
angsd sites index forAngsd.txt
##assuming 'list' contains path to bams
angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 10 -minMapQ 30 -minQ 20
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3
#this generates angsdput.mafs.gz and angsdput.glf.gz.
#we need to extract the frequency column from the gz compressed mafs file and remove the header
gunzip -c angsdput.mafs.gz |cut -f5 |sed 1d >freq
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify that our binary genotype likelihood file contains 100 samples, and that we want to run the analysis for the first two samples -a 0 -b 1.
If no -a and -b are specified it will loop through all pairs

=Output=
Example of output
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

The first column contains the pair of individuals that was used for the analysis. The next three columns are the estimated relatedness coefficients. The fifth column is the log-likelihood of the MLE. The sixth column is the number of iterations required to find the MLE. The seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and the minor allele frequency (MAF) was above the threshold (default is 0.05, but it may also be user specified).

= Input file format =
The genotype likelihood input file is binary and gz compressed; it contains log-likelihood ratios encoded as doubles, 3 values per sample.
The freq file is allowed to be gz compressed.

= Citing and references =

= Changelog =
See github for log

NgsRelate

2015-06-23T12:32:00Z

Thorfinn: /* Run example using NGS data with plink population frequencies */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals for low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in the example below.

=Download and Installation=
Primary repository is github. https://github.com/ANGSD/NgsRelate
<pre>
curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >NgsRelate.cpp
g++ NgsRelate.cpp -O3 -lz -o ngsrelate
</pre>

= Run example using only NGS data=
Assume we have a file containing paths to 100 BAM/CRAM files. We can then use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling, and to dump the input files needed for the NgsRelate program:
<pre>
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3
#this generates angsdput.mafs.gz and angsdput.glf.gz.
#we need to extract the frequency column from the gz compressed mafs file and remove the header
gunzip -c angsdput.mafs.gz |cut -f5 |sed 1d >freq
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify that our binary genotype likelihood file contains 100 samples (-n 100), and that we want to run the analysis for the first two samples (-a 0 -b 1).
If -a and -b are not specified, it will loop through all pairs.

= Run example using NGS data with plink population frequencies=
Assume we have a file containing paths to 100 BAM/CRAM files. We can then use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling, and to dump the input files needed for the NgsRelate program:
<pre>
### extract plink
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb
### calculate frequencies
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
### find sites from plink files
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt
### index the file
angsd sites index forAngsd.txt
##assuming
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3
#this generates angsdput.mafs.gz and angsdput.glf.gz.
#we need to extract the frequency column from the gz compressed mafs file and remove the header
gunzip -c angsdput.mafs.gz |cut -f5 |sed 1d >freq
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify that our binary genotype likelihood file contains 100 samples, and that we want to run the analysis for the first two samples -a 0 -b 1.
If no -a and -b are specified it will loop through all pairs

=Output=
Example of output
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

The first column contains the pair of individuals that was used for the analysis. The next three columns are the estimated relatedness coefficients. The fifth column is the log-likelihood of the MLE. The sixth column is the number of iterations required to find the MLE. The seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and the minor allele frequency (MAF) was above the threshold (default is 0.05, but it may also be user specified).

= Input file format =
The genotype likelihood input file is binary and gz compressed; it contains log-likelihood ratios encoded as doubles, 3 values per sample.
The freq file is allowed to be gz compressed.

= Citing and references =

= Changelog =
See github for log

NgsRelate

2015-06-23T12:30:39Z

Thorfinn: /* Run example using NGS data with plink population frequencies */

=Brief description=
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals for low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in the example below.

=Download and Installation=
Primary repository is github. https://github.com/ANGSD/NgsRelate
<pre>
curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >NgsRelate.cpp
g++ NgsRelate.cpp -O3 -lz -o ngsrelate
</pre>

= Run example using only NGS data=
Assume we have a file containing paths to 100 BAM/CRAM files. We can then use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling, and to dump the input files needed for the NgsRelate program:
<pre>
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3
#this generates angsdput.mafs.gz and angsdput.glf.gz.
#we need to extract the frequency column from the gz compressed mafs file and remove the header
gunzip -c angsdput.mafs.gz |cut -f5 |sed 1d >freq
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify that our binary genotype likelihood file contains 100 samples (-n 100), and that we want to run the analysis for the first two samples (-a 0 -b 1).
If -a and -b are not specified, it will loop through all pairs.

= Run example using NGS data with plink population frequencies=
Assume we have a file containing paths to 100 BAM/CRAM files. We can then use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling, and to dump the input files needed for the NgsRelate program:
<pre>
### extract plink
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb
### calculate frequencies
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub
### find sites from plink files
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt
##assuming
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3
#this generates an angsdput.mafs.gz and an angsdput.glf.gz.
#we will need to extract the frequency column from the mafs file and remove the header
#the mafs file is gz compressed, so decompress it before cutting out the column
zcat angsdput.mafs.gz |cut -f5 |sed 1d >freq
./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
</pre>
Here we specify that our binary genotype likelihood file contains 100 samples (-n 100), and that we want to run the analysis for the first two samples (-a 0 -b 1).
If -a and -b are not specified, the program loops through all pairs.
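
To see what the <code>cut -f1,4-6</code> step in the block above keeps from a plink .bim file, here is a small sketch on a made-up file. The file names <code>toy.bim</code> and <code>toy_sites.txt</code> and the variant IDs are invented for illustration; the column layout assumed is the standard tab-separated .bim layout (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2).

```shell
# A made-up two-line .bim file (tab separated): chromosome, variant ID,
# genetic distance, base-pair position, allele 1, allele 2.
printf '1\trs0001\t0\t1000\tA\tG\n1\trs0002\t0\t2500\tC\tT\n' >toy.bim

# Keep chromosome, position and the two alleles (columns 1, 4, 5 and 6).
cut -f1,4-6 toy.bim >toy_sites.txt
cat toy_sites.txt
```

Each output line then holds the chromosome, the position and the two alleles of one site.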

=Output=
Example of output
<pre>
Pair k0 k1 k2 loglh nIter coverage
(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
</pre>

The first column contains the pair of individuals used for the analysis. The next three columns are the estimated relatedness coefficients (k0, k1 and k2). The fifth column is the log-likelihood at the MLE. The sixth column is the number of iterations required to find the MLE. The seventh column is the fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals and the minor allele frequency (MAF) was above the threshold (the default is 0.05, but it can also be specified by the user).
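
The three coefficients can be combined into a single summary per pair. The sketch below is not something NgsRelate does itself: it uses the standard definitions of the coefficient of relatedness, r = k1/2 + k2, and the kinship coefficient, theta = k1/4 + k2/2. The function name <code>summarise_pairs</code> and the file name <code>toy.res</code> are invented for illustration, and the output layout is assumed to be the whitespace-separated one shown above.

```shell
# Recreate the example output above (header plus one pair) in a toy file.
printf 'Pair k0 k1 k2 loglh nIter coverage\n(0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658\n' >toy.res

# Skip the header, then print the pair, r = k1/2 + k2 and theta = k1/4 + k2/2.
summarise_pairs() {
    awk 'NR > 1 { printf "%s %.6f %.6f\n", $1, $3 / 2 + $4, $3 / 4 + $4 / 2 }' "$1"
}

summarise_pairs toy.res
```

For the real run the same function could be applied to gl.res.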

= Input file format =
The input files are gz-compressed binary files containing log-likelihood ratios encoded as doubles, with 3 values per sample.
The freq file is allowed to be gz compressed.
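
A quick consistency check follows from this layout. Assuming 8-byte doubles and 3 values per sample per site, the uncompressed size of the genotype likelihood file determines the number of sites. The sketch below uses a made-up helper <code>count_sites</code> and a toy file <code>toy.glf.gz</code>; neither is part of NgsRelate or ANGSD.

```shell
# Number of sites implied by a gz-compressed binary file holding
# 3 doubles (8 bytes each) per sample per site.
count_sites() {
    file=$1
    nind=$2
    nbytes=$(gzip -dc "$file" | wc -c)
    echo $((nbytes / (8 * 3 * nind)))
}

# Toy check: 5 sites for 2 samples is 5 * 2 * 3 * 8 = 240 bytes.
head -c 240 /dev/zero | gzip -c >toy.glf.gz
count_sites toy.glf.gz 2
```

For the examples above, <code>count_sites angsdput.glf.gz 100</code> would be expected to match the number of lines in freq (<code>wc -l <freq</code>), provided the file really follows the layout described here.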

= Citing and references =

= Changelog =
See GitHub for the changelog.

NgsRelate

2015-06-23T12:27:04Z

Thorfinn: /* Run example using NGS data with plink population frequencies */


NgsRelate

2015-06-23T12:15:15Z

Thorfinn: /* Run example using only NGS data */

NgsRelate

2015-06-23T10:55:12Z

Thorfinn: /* Run example using only NGS data */

NgsRelate

2015-06-20T15:05:01Z

Thorfinn: /* Download and Installation */

MediaWiki:Sidebar

2015-06-20T07:07:03Z

Thorfinn:

MediaWiki:Sidebar

2015-06-20T07:05:51Z

Thorfinn:

MediaWiki:Sidebar

2015-06-20T07:04:17Z

Thorfinn:

MediaWiki:Sidebar

2015-06-19T22:05:03Z

Thorfinn: