IBSrelate
This page contains information about the method IBSrelate, a method to identify pairs of related individuals without requiring population allele frequencies.
On this page we will show you how to estimate the R0, R1 and KING-robust kinship statistics for a pair (or more!) of individuals from aligned sequencing data. These statistics are informative about relatedness, but can also be useful for quality-control (QC).
For further details, including the interpretation of R0, R1 and KING-robust kinship, please see our paper in Molecular Ecology at: https://doi.org/10.1111/mec.14954
Calculating statistics from the output of IBS and realSFS
IBS and realSFS are two methods implemented in ANGSD [1] that can be used to estimate the allele sharing genotype distribution for a pair of individuals. The paper describes and examines the differences between the two methods, but we expect they both will perform comparably well for most applications. Below are links to two R scripts that can be used to load the output of IBS and realSFS and produce estimates of R0, R1 and KING-robust kinship.
https://github.com/rwaples/freqfree_suppl/blob/master/read_IBS.R
https://github.com/rwaples/freqfree_suppl/blob/master/read_realSFS.R
Application to the ANGSD example data
The shell commands given below are available in a text file here: https://github.com/rwaples/freqfree_suppl/blob/master/example_data.sh .
They are available in a Jupyter notebook here (slightly older version): https://nbviewer.jupyter.org/github/rwaples/freqfree_suppl/blob/master/example_data.ipynb .
Setup
You will need installations and both ANGSD and samtools, as well as Rscript (part of R).
The commands below will download files to your current directory, as well as create a few sub-directories. In total the analysis will download and generate files with a total size of about 1.5 GB.
Set up shell variables
# set paths to the analysis programs, may need to be replaced your local installation(s) ANGSD="$HOME/programs/angsd/angsd" realSFS="$HOME/programs/angsd/misc/realSFS" IBS="$HOME/programs/angsd/misc/ibs" SAMTOOLS="samtools"
Get the example data
The example data set has small bam files from ten individuals.
# download the example data wget http://popgen.dk/software/download/angsd/bams.tar.gz # unzip/untar and index the bam files tar xf bams.tar.gz for i in bams/*.bam;do samtools index $i;done
realSFS method
Setup
# make a directory for the results mkdir results_realsfs # get the R script to parse the realSFS output wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_realSFS.R
Specify an allele at each site
For the realSFS method, one of the alleles at each site must be specified. Here we will use an ancestral state file (fasta format).
# download and index the ancestral state fasta file wget http://popgen.dk/software/download/angsd/hg19ancNoChr.fa.gz $SAMTOOLS faidx hg19ancNoChr.fa.gz
Generate a saf (site allele frequency likelihood) file for each individual
In the commands below, I apply minMapQ (-minMapQ 30) and minQ (-minQ 20) filters, as well as specify a specific genotype likelihood model (-GL 2). These values worked well for this data set and seem to be reasonable defaults, but the best values may vary by data set. I also generate summaries of sequencing depth (-doDepth) and allele counts (-doCounts). The output of these are not evaluated here, but they should be examined be part of a general QC process for NGS data.
# make a separate bam filelist for each individual # also create a SAMPLES array for use below BAMS=./bams/*.bam SAMPLES=() for b in $BAMS; do # parse out the sample name base="$(basename -- $b)" sample="${base%%.mapped.*}" SAMPLES+=("$sample") echo $sample echo $b > ${sample}.filelist.ind done # run doSAF on each individual for s in "${SAMPLES[@]}"; do $ANGSD -b ${s}.filelist.ind \ -anc hg19ancNoChr.fa.gz \ -minMapQ 30 -minQ 20 -GL 2 \ -doSaf 1 -doDepth 1 -doCounts 1 \ -out ${s} done
run realSFS on each pair of indiviudals
Here we have 10 individuals, and want to consider each pair just once. We index the SAMPLES array created above.
for i in {0..9}; do for j in {0..9}; do if (( i < j)); then sample1=${SAMPLES[i]} sample2=${SAMPLES[j]} $realSFS ${sample1}.saf.idx ${sample2}.saf.idx > ./results_realsfs/${sample1}_${sample2}.2dsfs fi done done
Parse the results for a single pair of individuals
Below shows how to use the read_realSFS() function in R to parse the output 2dsfs file generated by realSFS. It will be easier to open an interactive R session, source read_realSFS.R, and use the function read_realSFS() to parse the file into data frame to then explore, plot, or export.
Rscript \ -e "source('./read_realSFS.R')" \ -e "res = read_realSFS('results_realsfs/smallNA06985_smallNA11830.2dsfs')" \ -e "res['sample1'] = 'smallNA06985'; res['sample2'] = 'smallNA11830'" \ -e "print(res[,c('sample1', 'sample2', 'nSites', 'Kin', 'R0', 'R1') ])"
IBS Method
Setup
# make a directory for the results mkdir results_IBS # get the R script to parse the IBS output wget https://raw.githubusercontent.com/rwaples/freqfree_suppl/master/read_IBS.R ## make a file with paths to the bam(s) for all individuals ls bams/*.bam > all.filelist
make a genotype likelihood file (glf) containing all individuals
see here for a description of the file format produced by -doGLF 1.
$ANGSD -b all.filelist \ -minMapQ 30 -minQ 20 -GL 2 \ -doGlf 1 \ -out example
run IBS, this will analyse each pair of individuals
Here we have specified "model 0", so the command will generate expected values for each of the 100 possible pairwise 2-allele genotypes for each pair of individuals (10 unique genotypes per individual). man page for misc/ibs here.
$IBS -glf example.glf.gz \ -model 0 \ -nInd 10 -allpairs 1 \ -outFileName results_IBS/ibs.model0.results
Examine the results
Rscript \ -e "source('./read_IBS.R')" \ -e "res = do_derived_stats(read_ibspair_model0('results_IBS/ibs.model0.results.ibspair'))" \ -e "print(res[6,c('ind1', 'ind2', 'nSites', 'Kin', 'R0', 'R1') ])" # the IBS method in ANGSD indexes individuals as they appear in the filelist # (zero-indexed) cat all.filelist
Frequently asked Questions
I run out of RAM running IBS or realSFS on my entire data set, what should I do?
You can split the data set up by chromosome (or contig), run the IBS or SFS method on each chromosome, and then combine afterwards by summing the values ['A' - 'I'] across chromosomes.
There a few possible concerns with this approach. 1) The optimization routines work best with more data, so be careful of breaking up your data into too many small groups. This is especially true as R0, R1, and KING-robust kinship are all ratios and can be sensitive to small absolute errors. 2) Also, without chromosomes, it becomes difficult to generate meaningful confidence intervals around R0, R1, and KING-robust kinship, as it important to account for correlations between nearby sites when generating the confidence intervals with something like a jackknife. You can jackknife by leaving one contig out each time, but if the correlations (i.e. the identity-by-descent (IBD) blocks present between relatives) extend beyond the contigs, this procedure will produce confidence intervals that are smaller than they should be.
How can I estimate confidence intervals around my R0, R1, or KING-robust kinship estimates?
Can I estimate R0, R1, or KING-robust kinship even if my reference genome is not assembled into chromosomes?
How can reference genome assembly errors affect R0, R1, or KING-robust kinship estimates?
How will inbreeding affects the estimates produces here?
Citation
Waples, R. K., Albrechtsen, A. and Moltke, I. (2018), Allele frequency‐free inference of close familial relationships from genotypes or low depth sequencing data. Mol Ecol. doi:10.1111/mec.14954
Bibtex
@article{doi:10.1111/mec.14954, author = {Waples, Ryan K and Albrechtsen, Anders and Moltke, Ida}, title = {Allele frequency-free inference of close familial relationships from genotypes or low depth sequencing data}, journal = {Molecular Ecology}, volume = {0}, number = {ja}, pages = {}, doi = {10.1111/mec.14954}, url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/mec.14954}, eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/mec.14954}, }