NgsRelate: Difference between revisions
Line 24: | Line 24: | ||
Assume we have file containing paths to 100 BAM/CRAM files, then we can use ANGSD to estimate frequencies calculate genotype likelihoods while doing SNP calling and dumping the input files needed for the NgsRelate program | Assume we have file containing paths to 100 BAM/CRAM files, then we can use ANGSD to estimate frequencies calculate genotype likelihoods while doing SNP calling and dumping the input files needed for the NgsRelate program | ||
<pre> | <pre> | ||
### extract plink | ### extract population from huge binary plink file | ||
plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb | plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb | ||
### calculate frequencies | |||
### calculate frequencies for the subpopulation | |||
plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub | plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub | ||
### | |||
### we are only interested in the output from the snpsites with the seq data. so we extract the chr,pos,major,minor | |||
cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt | cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt | ||
### index the file | |||
### index the file for angsd | |||
angsd sites index forAngsd.txt | angsd sites index forAngsd.txt | ||
##assuming 'list' contains path to bams | ##assuming 'list' contains path to bams | ||
angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20 | angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20 |
Revision as of 18:38, 23 June 2015
Brief description
This page contains information about the program called NgsRelate, which can be used to infer relatedness coefficients for pairs of individuals for low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods. To be able to infer the relatedness you will need to know the population frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in the example below.
Download and Installation
Primary repository is github. https://github.com/ANGSD/NgsRelate
curl https://raw.githubusercontent.com/ANGSD/NgsRelate/master/NgsRelate.cpp >NgsRelate.cpp g++ NgsRelate.cpp -O3 -lz -o ngsrelate
Run example using only NGS data
Assume we have file containing paths to 100 BAM/CRAM files, then we can use ANGSD to estimate frequencies calculate genotype likelihoods while doing SNP calling and dumping the input files needed for the NgsRelate program
./angsd -b filelist -gl 1 -domajorminor 1 -snp_pval 1e-6 - domaf 1 -minmaf 0.05 -doGlf 3 #this generates an angsdput.mafs.gz and a angsdput.glf.gz. #we will need to extract the frequency column from the mafs file and remove the header cut -f5 angsdput.mafs.gz |sed 1d >freq ./ngsrelate -g angsdput.glf.gz -n 100 -f freq -a 0 -b 1 >gl.res
Here we specify that our binary genotype likelihood file contains 100 samples, and that we want to run the analysis for the first two samples -a 0 -b 1. If no -a and -b are specified it will loop through all pairs
Run example using NGS data with plink population frequencies
Assume we have file containing paths to 100 BAM/CRAM files, then we can use ANGSD to estimate frequencies calculate genotype likelihoods while doing SNP calling and dumping the input files needed for the NgsRelate program
### extract population from huge binary plink file plink --bfile hapmap3_r2_b36_fwd.consensus.qc.polyHg19 --keep LWK.fam --make-bed --out hapmap3Hg19LWK --noweb ### calculate frequencies for the subpopulation plink --bfile hapmap3Hg19LWK --freq --noweb --out LWKsub ### we are only interested in the output from the snpsites with the seq data. so we extract the chr,pos,major,minor cut -f1,4-6 hapmap3Hg19LWK.bim >forAngsd.txt ### index the file for angsd angsd sites index forAngsd.txt ##assuming 'list' contains path to bams angsd -gl 1 -doglf 3 -sites forAngsd.txt -b list -domajorminor 3 -P 2 -minMapQ 30 -minQ 20 #this generates an angsdput.glf.gz and a angsdput.glf.pos.gz. #extract the frequencies and sync it to the angsd output ./NgsRelate/a.out extract_freq angsdput.glf.pos.gz files/hapmap3Hg19LWK.bim files/LWKsub.frq >freq #run ngsrelate ngsrelate -g angsdput.glf.gz -n 6 -f freq >resi
Here we specify that our binary genotype likelihood file contains 100 samples, and that we want to run the analysis for the first two samples -a 0 -b 1. If no -a and -b are specified it will loop through all pairs
Output
Example of output of with two samples
Pair k0 k1 k2 loglh nIter coverage (0,1) 0.673213 0.326774 0.000013 -1710940.769941 19 0.814658
Example of output with 6 samples:
cat resi Pair k0 k1 k2 loglh nIter coverage (0,1) 0.675337 0.322079 0.002584 -1710946.832375 10 0.813930 (0,2) 0.458841 0.526377 0.014782 -1666215.528333 10 0.808822 (0,3) 1.000000 0.000000 0.000000 -1743992.363193 -1 0.816266 (0,4) 1.000000 0.000000 0.000000 -1759202.971213 -1 0.818856 (0,5) 1.000000 0.000000 0.000000 -1550475.615322 -1 0.752663 (1,2) 0.007111 0.991020 0.001868 -1580995.130867 10 0.806912 (1,3) 1.000000 0.000000 0.000000 -1728859.988212 -1 0.814272 (1,4) 1.000001 -0.000001 0.000000 -1744055.203870 9 0.816887 (1,5) 1.000000 0.000000 0.000000 -1536858.187440 -1 0.750917 (2,3) 1.000000 0.000000 0.000000 -1705157.832621 -1 0.809297 (2,4) 1.000000 0.000000 0.000000 -1719681.338365 -1 0.811804 (2,5) 1.000000 0.000000 0.000000 -1517388.260612 -1 0.746903 (3,4) 0.547602 0.439423 0.012975 -1743899.789842 10 0.819276 (3,5) 0.265819 0.482953 0.251228 -1467343.087647 10 0.754637 (4,5) 0.004655 0.995345 -0.000000 -1473415.049411 8 0.755734The first column contain the individuals that was used for the analysis . The next three columns are the estimated relatedness coefficient. The fifth column is the log of the likelihood of the MLE. The sixth column is the number of iterations required to find the MLE, and finally the seventh column is fraction of non-missing sites, i.e. the fraction of sites where data was available for both individuals, and the minor allele frequency (MAF) above the threshold (default is 0.05 but may also user specified).Input file format
The input files are binary gz compressed, log like ratios encoded as double. 3 values per sample. The freq file is allowed to be gz compressed.Citing and references
Changelog
See github for log