NgsRelate


Brief description

This page describes NgsRelate, a program for inferring relatedness coefficients for pairs of individuals from low-coverage next-generation sequencing (NGS) data using genotype likelihoods. To infer relatedness you need the population allele frequencies and the genotype likelihoods. Both can be obtained with, for example, the program ANGSD, as shown in the example below.

Download and Installation

The primary repository is on GitHub: https://github.com/ANGSD/NgsRelate

curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >NgsRelate.cpp
g++ NgsRelate.cpp -O3 -lz -o NgsRelate

Run example using only NGS data

Assume we have a file containing paths to 100 BAM/CRAM files. We can then use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling, and to dump the input files needed by the NgsRelate program:

./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3
#this generates an angsdput.mafs.gz and an angsdput.glf.gz.
#we will need to extract the frequency column from the mafs file and remove the header
#the mafs file is gzip compressed, so it must be decompressed before cut can read it
zcat angsdput.mafs.gz |cut -f5 |sed 1d >freq
./NgsRelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res

Here we specify that our binary genotype likelihood file contains 100 samples, and that we want to run the analysis for the first two samples (-a 0 -b 1). If -a and -b are not specified, the program loops through all pairs.
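
To get a feel for the size of an all-pairs run, the number of pairs can be computed up front. This is a minimal sketch, assuming each unordered pair of samples is analysed once; the function name n_pairs is hypothetical and not part of NgsRelate.

```shell
# Hypothetical helper (not part of NgsRelate): number of unordered sample pairs.
# Assumes every pair (a,b) with a < b is analysed exactly once.
n_pairs() {
    n=$1
    echo $(( n * (n - 1) / 2 ))
}
```

Under that assumption, `n_pairs 100` gives 4950, which is the number of result lines to expect for the 100 samples in the example.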

Output

Example of output

Pair	k0	k1	k2	loglh	nIter	coverage
(0,1)	0.673213	0.326774	0.000013	-1710940.769941	19	0.814658


The first column contains the indices of the two individuals used for the analysis. The next three columns are the estimated relatedness coefficients (k0, k1, k2). These are followed by the log-likelihood at the MLE, the number of iterations required to find the optimum, and finally the fraction of non-missing sites, i.e. the fraction of sites where we have data for both samples and a MAF above the default threshold.
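
The three coefficients can be condensed into single summary numbers. The sketch below is not part of NgsRelate. It assumes that k1 and k2 are the proportions of the genome where the pair shares one and two alleles identical by descent, and that the result file is tab-separated with the columns shown above. Under those assumptions the coefficient of relatedness is r = k1/2 + k2 and the kinship coefficient is theta = k1/4 + k2/2. The function name relatedness_summary is hypothetical.

```shell
# Hypothetical helper (not part of NgsRelate): summarise a result file such as gl.res.
# Assumes a tab-separated layout with a header line and columns Pair, k0, k1, k2, ...
relatedness_summary() {
    awk -F'\t' '
        NR == 1 { print "Pair\tr\ttheta"; next }   # replace the header line
        NF >= 4 {
            r = $3 / 2 + $4            # coefficient of relatedness: k1/2 + k2
            theta = $3 / 4 + $4 / 2    # kinship coefficient: k1/4 + k2/2
            printf "%s\t%.4f\t%.4f\n", $1, r, theta
        }
    ' "$1"
}
```

Running `relatedness_summary gl.res` on the example line above gives r = 0.1634 and theta = 0.0817 for pair (0,1).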

Input file format

The genotype likelihood input file is binary and gzip compressed. It contains log-likelihood ratios encoded as doubles, with 3 values per sample at each site. The freq file is allowed to be gzip compressed.
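
The layout described above allows a quick consistency check between the two input files before running the analysis. This is a minimal sketch, assuming 8-byte doubles, 3 values per sample at each site, and one frequency per line in the freq file; the function name glf_sites is hypothetical and not part of NgsRelate or ANGSD.

```shell
# Hypothetical helper: number of sites implied by a gzip compressed glf file.
# Assumes 3 doubles of 8 bytes each per sample at every site.
glf_sites() {
    glf=$1
    nsamples=$2
    # Size of the decompressed stream in bytes
    nbytes=$(gzip -dc "$glf" | wc -c)
    echo $(( nbytes / (3 * 8 * nsamples) ))
}
```

For the example above, the output of `glf_sites angsdput.glf.gz 100` should equal the number of lines in the freq file (`wc -l < freq`). A mismatch would suggest that the two files do not describe the same set of sites.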

Citing and references

Changelog

See the GitHub repository for the changelog.