NgsRelate

Revision as of 20:30, 19 June 2015

Brief description

This page describes NgsRelate, a program for inferring relatedness coefficients for pairs of individuals from low-coverage NGS data using genotype likelihoods. To infer relatedness you need population allele frequencies and genotype likelihoods. Both can be obtained with, for example, the program ANGSD, as shown in the example below.

Installation

The primary repository is on GitHub.

Download and installation of the C++ program

curl https://raw.githubusercontent.com/ANGSD/fastlate/master/fastlate.cpp >fastlate.cpp
g++ fastlate.cpp -O3 -lz -o fastlate

Run example using C

Assume we have a file containing paths to 100 BAM/CRAM files. We can then use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling, and to dump the input files needed by NgsRelate.

./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3
# this generates angsdput.mafs.gz and angsdput.glf.gz
# the mafs file is gzipped, so decompress it, extract the frequency column and remove the header
zcat angsdput.mafs.gz | cut -f5 | sed 1d >freq
./ngsrelate -g angsdput.glf.gz -n 100 -f freq >gl.res
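
Before running the relatedness step it can help to confirm that the freq file really contains one frequency per line, for instance that the header was removed. The following is only a sketch: the function name check_freq is hypothetical and not part of NgsRelate or ANGSD, and it assumes frequencies are written as plain decimals between 0 and 1.

```shell
# Hypothetical helper (not part of NgsRelate or ANGSD).
# Reads a frequency file on stdin, reports lines that are not a single
# number in [0, 1], prints the number of lines read, and returns a
# non-zero status if any line was flagged.
check_freq() {
  awk '
    BEGIN { bad = 0 }
    NF != 1 || $1 !~ /^[0-9.]+([eE][-+]?[0-9]+)?$/ || $1 < 0 || $1 > 1 {
      printf "line %d looks wrong: %s\n", NR, $0
      bad = 1
    }
    END { printf "%d sites read\n", NR; exit bad }
  '
}
```

It could be called as check_freq < freq. The reported number of sites should match the number of sites in the glf file.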

Output file format

Example of output

ind1    ind2    k0      k1      k2      nIter
0       1       0.999941        0.000038        0.000021        26
0       2       0.999979        0.000010        0.000011        29
0       3       0.999953        0.000029        0.000018        26
0       4       0.999952        0.000023        0.000025        26
0       5       0.999972        0.000020        0.000007        26
0       6       0.999995        0.000003        0.000002        26
0       7       0.999995        0.000003        0.000002        26
0       8       0.999894        0.000069        0.000038        32
0       9       0.999894        0.000069        0.000038        32
0       10      0.999903        0.000071        0.000026        26
0       11      0.999903        0.000071        0.000026        26


The first two columns are the zero-based indices of the two individuals in the pair. The next three columns are the estimated relatedness coefficients (k0, k1 and k2), and the last column is the number of iterations used.
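
The three coefficients can be condensed into a single number per pair. As a minimal sketch, assuming the whitespace-separated layout with a one-line header shown above, the helper below computes the kinship coefficient as k1/4 + k2/2. This is a standard identity, not something NgsRelate prints in this output, and the function name k_to_kinship is hypothetical.

```shell
# Hypothetical helper (not part of NgsRelate).
# Reads NgsRelate-style output on stdin, skips the header line, and prints
# ind1, ind2 and the kinship coefficient k1/4 + k2/2 for each pair.
# Columns assumed: ind1 ind2 k0 k1 k2 nIter
k_to_kinship() {
  awk 'NR > 1 && NF >= 5 { printf "%s\t%s\t%.6f\n", $1, $2, $4 / 4 + $5 / 2 }'
}
```

It could be called as k_to_kinship < gl.res. Under this identity, parent-offspring pairs and full siblings are both expected near 0.25, and unrelated pairs near 0.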


Input file format

The input consists of three files describing the genotype data, a file with admixture proportions for each individual, and a file with allele frequencies for each SNP in each source population. The genotype data files are PLINK bed/bim/fam files, and the remaining two files are in the output format of the program ADMIXTURE:

Example of the content of an admixture proportion file (for 3 populations)

0.531631 0.468359 0.000010
0.564461 0.435529 0.000010
0.850660 0.149330 0.000010
0.630527 0.369463 0.000010
0.747429 0.219346 0.033225
0.999980 0.000010 0.000010
0.999980 0.000010 0.000010
0.682072 0.317918 0.000010
0.000010 0.999980 0.000010
0.793133 0.206857 0.000010

Each row is an individual and each column is a population. The admixture proportions for each individual must sum to 1.
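
The sum-to-1 requirement can be checked before running the program. This is a sketch only: the function name check_admix_rows is hypothetical, and the default tolerance of 0.0001 is an arbitrary choice to allow for rounding in the six-decimal values.

```shell
# Hypothetical helper (not part of any of the programs mentioned here).
# Reads an admixture proportion file on stdin and reports every row whose
# proportions do not sum to 1 within a tolerance (first argument,
# default 0.0001). Returns a non-zero status if any row was reported.
check_admix_rows() {
  awk -v tol="${1:-0.0001}" '
    BEGIN { bad = 0 }
    {
      sum = 0
      for (i = 1; i <= NF; i++) sum += $i
      diff = sum - 1
      if (diff < 0) diff = -diff
      if (diff > tol) { printf "row %d sums to %.6f\n", NR, sum; bad = 1 }
    }
    END { exit bad }
  '
}
```

It reads the proportion file from standard input and prints nothing when every row passes.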

Example of the allele frequency file (for 3 populations)

0.312722 0.208605 0.999990
0.881352 0.999990 0.966966
0.708206 0.838869 0.932119
0.427789 0.620694 0.532966
0.411998 0.622253 0.534072
0.427789 0.620694 0.532966
0.440817 0.581630 0.618751
0.733733 0.985281 0.953523
0.724083 0.451452 0.784607
0.811161 0.578612 0.787782

Each row is a SNP and each column is a population. When using PLINK files, the allele frequency is the MAJOR allele frequency.
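
Because rows are individuals in one file and SNPs in the other, the line counts must agree with the PLINK .fam and .bim files, which hold one line per individual and one line per SNP respectively. The sketch below compares those counts. The function name check_dims and its argument order are hypothetical.

```shell
# Hypothetical helper (not part of any of the programs mentioned here).
# Arguments: PLINK prefix, admixture proportion file, allele frequency file.
# Compares line counts: .fam vs. proportion file (individuals) and
# .bim vs. frequency file (SNPs). Returns non-zero on any mismatch.
check_dims() {
  prefix=$1; qfile=$2; ffile=$3
  n_ind=$(awk 'END { print NR }' "$prefix.fam")
  n_snp=$(awk 'END { print NR }' "$prefix.bim")
  n_q=$(awk 'END { print NR }' "$qfile")
  n_f=$(awk 'END { print NR }' "$ffile")
  status=0
  if [ "$n_ind" -ne "$n_q" ]; then
    echo "individuals: $n_ind in $prefix.fam, $n_q in $qfile" >&2
    status=1
  fi
  if [ "$n_snp" -ne "$n_f" ]; then
    echo "SNPs: $n_snp in $prefix.bim, $n_f in $ffile" >&2
    status=1
  fi
  return "$status"
}
```

The function prints nothing when the counts agree and writes one message per mismatch to standard error otherwise.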

Citing and references

relateAdmix

Moltke, I., Albrechtsen, A. (2013). RelateAdmix: a software tool for estimating relatedness between admixed individuals. Bioinformatics.

ADMIXTURE

D.H. Alexander, J. Novembre, and K. Lange. Fast model-based estimation of ancestry in unrelated individuals. Genome Research, 19:1655–1664, 2009.


Change log

  • 0.14 made more MAC usable (I think). Thanks to Paul Lott for reporting it and for suggestions and Thorfinn Sand for changing it
  • 0.13 added extra check for file exists to give instant errors + changes all printf to fprintf(stderr,
  • 0.11 changed threading to a fixed pool of threads
  • 0.10 optimized code
  • 0.09 added error for when the number of sites and individuals does not match between files
  • 0.08 fixed a bug that would sometimes print an extra line when running with multiple threads
  • 0.07 fixed a small leak