PCAngsdv2: Difference between revisions

From software
Jump to navigation Jump to search
 
(23 intermediate revisions by 2 users not shown)
Line 11: Line 11:




==Overview==  
=Overview=


Based on population structure inference, PCAngsd is able to detect the number of significant principal components which is then used to estimate individual allele frequencies using genotype dosages in a SVD model. These individual allele frequencies can be used in various population genetic methods for heterogeneous populations, such that PCAngsd can perform PCA (estimate covariance matrix), call genotypes, estimate individual admixture proportions, estimate inbreeding coefficients (per-individual and per-site) and perform a genome selection scan using principal components
Based on population structure inference, PCAngsd is able to detect the number of significant principal components which is then used to estimate individual allele frequencies using genotype dosages in a SVD model. These individual allele frequencies can be used in various population genetic methods for heterogeneous populations, such that PCAngsd can perform PCA (estimate covariance matrix), call genotypes, estimate individual admixture proportions, estimate inbreeding coefficients (per-individual and per-site) and perform a genome selection scan using principal components.
The estimated individual allele frequencies and principal components can be used as prior knowledge in other probabilistic methods based on a same Bayesian principle. PCAngsd can perform the following analyses:
The estimated individual allele frequencies and principal components can be used as prior knowledge in other probabilistic methods based on a same Bayesian principle. PCAngsd can perform the following analyses:
*Covariance matrix
*Covariance matrix
Line 26: Line 26:
The older version, based on the Numba library (only working with Python 2.7) is still available in version 0.973 and can be found here [https://github.com/Rosemeis/pcangsd/releases/tag/0.973].
The older version, based on the Numba library (only working with Python 2.7) is still available in version 0.973 and can be found here [https://github.com/Rosemeis/pcangsd/releases/tag/0.973].


==Download and Installation==
=Download and Installation=


PCAngsd should work on all platforms meeting the requirements but server-side usage is highly recommended.  
PCAngsd should work on all platforms meeting the requirements but server-side usage is highly recommended.  
Line 32: Line 32:


It is assumed that OpenMP is installed [https://www.openmp.org/].
It is assumed that OpenMP is installed [https://www.openmp.org/].
1. Login to your server using ssh on your terminal window.
2. Create the directory where you will install your software and enter it, such as:
<pre>
mkdir ~/Software
cd ~/Software
</pre>
3. Download the source code:


<pre>
<pre>
git clone https://github.com/Rosemeis/pcangsd.git
git clone https://github.com/Rosemeis/pcangsd.git
</pre>
4. Configure, Compile and Install:
<pre>
cd pcangsd/
cd pcangsd/
python setup.py build_ext --inplace
python setup.py build_ext --inplace
</pre>
</pre>


Install dependencies
5. Install dependencies:


The required set of Python packages are easily installed using the pip command and the requirements.txt file included in the pcangsd folder.
The required set of Python packages are easily installed using the pip command and the requirements.txt file included in the pcangsd folder.
Line 45: Line 61:
<code>pip install --user -r requirements.txt</code>
<code>pip install --user -r requirements.txt</code>


==Quick start==
=Quick start=
 
<pre>
<pre>
# See all options in PCAngsd
 
python pcangsd.py -h
# Download the input beagle file with genotype likelihoods
wget popgen.dk/software/download/NGSadmix/data/input.gz
 


# Only estimate covariance matrix using 10 threads
# Only estimate covariance matrix using 10 threads
python pcangsd.py -beagle data.beagle.gz -o test -threads 10
python pcangsd.py -beagle input.gz -o test1 -threads 10


# Estimate covariance matrix and individual admixture proportions
# Estimate covariance matrix and individual admixture proportions
python pcangsd.py -beagle data.beagle.gz -admix -o test -threads 10
python pcangsd.py -beagle input.gz -admix -o test2 -threads 10


# Estimate covariance matrix and inbreeding coefficients
# Estimate covariance matrix and inbreeding coefficients
python pcangsd.py -beagle data.beagle.gz -inbreed 2 -o test -threads 10
python pcangsd.py -beagle input.gz -inbreed 2 -o test3 -threads 10


# Estimate covariance matrix and perform selection scan
# Estimate covariance matrix and perform selection scan
python pcangsd.py -beagle data.beagle.gz -selection 1 -o test -threads 10
python pcangsd.py -beagle input.gz -selection -o test4 -threads 10
</pre>
</pre>
==Detailed Examples and Tutorial==
Please refer to the tutorial's page [http://www.popgen.dk/software/index.php/PCAngsdTutorial]


=Input=
=Input=
Line 73: Line 96:


See [http://popgen.dk/angsd ANGSD] for more information on how to compute the genotype likelihoods and call SNPs.
See [http://popgen.dk/angsd ANGSD] for more information on how to compute the genotype likelihoods and call SNPs.
=Output=
Since version 0.98, PCAngsd's output is only in binary Numpy format (.npy) except for the covariance matrix.
In order to read files in python:
<pre>
import numpy as np
C = np.genfromtxt("output.cov") # Reads in estimated covariance matrix
S = np.load("output.selection.npy") # Reads results from selection scan
</pre>
R can also read Numpy matrices using the "RcppCNPy" library:
<pre>
library(RcppCNPy)
C <- as.matrix(read.table("output.cov")) # Reads in estimated covariance matrix
S <- npyLoad("output.selection.npy") # Reads results from selection scan
</pre>


=Using PCAngsd=
=Using PCAngsd=


All the different options in PCAngsd are listed here. PCAngsd will always compute the covariance matrix, where it uses principal components to estimate individual allele frequencies in an iterative procedure. The estimated individual allele frequencies will then be used in any of the other specified options of PCAngsd.
All the different options in PCAngsd are listed here. PCAngsd will always compute the covariance matrix, where it uses principal components to estimate individual allele frequencies in an iterative procedure. The estimated individual allele frequencies will then be used in any of the other specified options of PCAngsd.´
 
<pre>
# See all options in PCAngsd
python pcangsd.py -h
</pre>


==Estimation of individual allele frequencies==
==Estimation of individual allele frequencies==
Line 145: Line 193:
; -inbreed 1
; -inbreed 1
Simple estimator computed by an EM algorithm. Allows for F-values between -1 and 1. Based on [http://genome.cshlp.org/content/23/11/1852.full ngsF].
Simple estimator computed by an EM algorithm. Allows for F-values between -1 and 1. Based on [http://genome.cshlp.org/content/23/11/1852.full ngsF].
; -inbreed 2
; -inbreed 2
A maximum likelihood estimator also computed by an EM algorithm. Only allows for F-values between 0 and 1. Based on [https://www.cambridge.org/core/journals/genetics-research/article/maximum-likelihood-estimation-of-individual-inbreeding-coefficients-and-null-allele-frequencies/2DEBA0C0C2B92DF0EE89BD27DFCAD3FB].
A maximum likelihood estimator also computed by an EM algorithm. Only allows for F-values between 0 and 1. Based on [https://www.cambridge.org/core/journals/genetics-research/article/maximum-likelihood-estimation-of-individual-inbreeding-coefficients-and-null-allele-frequencies/2DEBA0C0C2B92DF0EE89BD27DFCAD3FB].
; -inbreed 3
; -inbreed 3
Estimator using an estimated kinship matrix. Allows for F-values between -1 and 1. Based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate].   
Estimator using an estimated kinship matrix. Allows for F-values between -1 and 1. Based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate].   
; -inbreed_iter [int]
; -inbreed_iter [int]
Maximum number of iterations for the EM algorithm methods. (Default: 200)
Maximum number of iterations for the EM algorithm methods. (Default: 200)
; -inbreed_tole [float]
; -inbreed_tole [float]
Tolerance value for the EM algorithms for inbreeding coefficients estimation. (Default: 1e-4)
Tolerance value for the EM algorithms for inbreeding coefficients estimation. (Default: 1e-4)
Per-site inbreeding coefficients incorporating population structure alongside likehood ratio tests for HWE can be computed as follows:
Per-site inbreeding coefficients incorporating population structure alongside likehood ratio tests for HWE can be computed as follows:


; -inbreedSites
; -inbreedSites
Use likelihood ratio tests (.lrt.sites.gz) generated from '''-inbreedSites''' to filter out variable sites using a given threshold for HWE test p-value:
Use likelihood ratio tests (.lrt.sites.gz) generated from '''-inbreedSites''' to filter out variable sites using a given threshold for HWE test p-value:


; -hwe [LRT filename]
; -hwe [LRT filename]
; -hwe_tole [float]
; -hwe_tole [float]
Tolerance value for HWE test. (Default: 1e-6)
Tolerance value for HWE test. (Default: 1e-6)


==Selection==
==Selection==
A genome selection scan can be computed using two different methods based on posterior expectations of the genotypes (genotype dosages):
A genome selection scan can be computed based on posterior expectations of the genotypes (genotype dosages):


; -selection 1
; -selection
Using an extended model of [http://www.cell.com/ajhg/abstract/S0002-9297(16)00003-3 FastPCA]. Performs a genome selection scan along all significant PCs. Outputs the selection statistics and must be converted to p-values by user. Each column reflect the selection statistics along a tested PC and they are χ²-distributed with 1 degree of freedom.
Using an extended model of [http://www.cell.com/ajhg/abstract/S0002-9297(16)00003-3 FastPCA]. Performs a genome selection scan along all significant PCs. Outputs the selection statistics and must be converted to p-values by user. Each column reflect the selection statistics along a tested PC and they are χ²-distributed with 1 degree of freedom.
; -selection 2
Using an extended model of [http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12592/abstract PCAdapt]. Outputs the selection statistics and must be converted to p-values by user. Selection statistics are χ²-distributed with '''-e''' degrees of freedom (number of significant eigenvectors).


==Relatedness==
==Relatedness==
Line 187: Line 235:
Our methods for inferring population structure have been published in GENETICS:
Our methods for inferring population structure have been published in GENETICS:


Population structure: [http://www.genetics.org/content/early/2018/08/21/genetics.118.301336 Inferring Population Structure and Admixture Proportions in Low Depth NGS Data]
[http://www.genetics.org/content/early/2018/08/21/genetics.118.301336 Inferring Population Structure and Admixture Proportions in Low Depth NGS Data]
 
 
Our method for testing for HWE in structured populations has been published in Molecular Ecology Resources:


HWE test: [https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019 Testing for Hardy‐Weinberg Equilibrium in Structured Populations using Genotype or Low‐Depth NGS Data]
[https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019 Testing for Hardy‐Weinberg Equilibrium in Structured Populations using Genotype or Low‐Depth NGS Data]

Latest revision as of 12:34, 16 August 2019


PCAngsd is a program that estimates the covariance matrix for low depth next-generation sequencing (NGS) data in structured/heterogeneous populations using principal component analysis (PCA) to perform multiple population genetic analyses using an iterative procedure based on genotype likelihoods.

Since version 0.98, PCAngsd was re-written to be based on Cython for computational bottlenecks and parallelization and is now compatible with any newer Python version.

The method was published in 2018 and can be found here: [1]

Simulated low depth NGS data of 3 populations


Overview

Based on population structure inference, PCAngsd is able to detect the number of significant principal components which is then used to estimate individual allele frequencies using genotype dosages in a SVD model. These individual allele frequencies can be used in various population genetic methods for heterogeneous populations, such that PCAngsd can perform PCA (estimate covariance matrix), call genotypes, estimate individual admixture proportions, estimate inbreeding coefficients (per-individual and per-site) and perform a genome selection scan using principal components. The estimated individual allele frequencies and principal components can be used as prior knowledge in other probabilistic methods based on a same Bayesian principle. PCAngsd can perform the following analyses:

  • Covariance matrix
  • Genotype calling
  • Admixture
  • Inbreeding coefficients (both per-individual and per-site)
  • HWE test
  • Genome selection scan
  • Kinship matrix


The older version, based on the Numba library (only working with Python 2.7) is still available in version 0.973 and can be found here [2].

Download and Installation

PCAngsd should work on all platforms meeting the requirements but server-side usage is highly recommended. Installation has only been tested on Linux systems.

It is assumed that OpenMP is installed [3].

1. Login to your server using ssh on your terminal window.

2. Create the directory where you will install your software and enter it, such as:

mkdir ~/Software
cd ~/Software

3. Download the source code:

git clone https://github.com/Rosemeis/pcangsd.git

4. Configure, Compile and Install:

cd pcangsd/
python setup.py build_ext --inplace

5. Install dependencies:

The required set of Python packages are easily installed using the pip command and the requirements.txt file included in the pcangsd folder.

pip install --user -r requirements.txt

Quick start


# Download the input beagle file with genotype likelihoods
wget popgen.dk/software/download/NGSadmix/data/input.gz 


# Only estimate covariance matrix using 10 threads
python pcangsd.py -beagle input.gz -o test1 -threads 10

# Estimate covariance matrix and individual admixture proportions
python pcangsd.py -beagle input.gz -admix -o test2 -threads 10

# Estimate covariance matrix and inbreeding coefficients
python pcangsd.py -beagle input.gz -inbreed 2 -o test3 -threads 10

# Estimate covariance matrix and perform selection scan
python pcangsd.py -beagle input.gz -selection -o test4 -threads 10

Detailed Examples and Tutorial

Please refer to the tutorial's page [4]

Input

The only input PCAngsd needs and accepts are genotype likelihoods in Beagle format. New functionality for using PLINK files has been added (version 0.9). Genotypes are automatically converted into a genotype likelihood matrix where the user can incorporate an error model.

ANGSD can be easily be used to compute genotype likelihoods and output them in the required Beagle format.

./angsd -GL 1 -out data -nThreads 10 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist

See ANGSD for more information on how to compute the genotype likelihoods and call SNPs.

Output

Since version 0.98, PCAngsd's output is only in binary Numpy format (.npy) except for the covariance matrix.

In order to read files in python:

import numpy as np
C = np.genfromtxt("output.cov") # Reads in estimated covariance matrix
S = np.load("output.selection.npy") # Reads results from selection scan


R can also read Numpy matrices using the "RcppCNPy" library:

library(RcppCNPy)
C <- as.matrix(read.table("output.cov")) # Reads in estimated covariance matrix
S <- npyLoad("output.selection.npy") # Reads results from selection scan

Using PCAngsd

All the different options in PCAngsd are listed here. PCAngsd will always compute the covariance matrix, where it uses principal components to estimate individual allele frequencies in an iterative procedure. The estimated individual allele frequencies will then be used in any of the other specified options of PCAngsd.´

# See all options in PCAngsd
python pcangsd.py -h

Estimation of individual allele frequencies

-beagle [Beagle filename]

Input file of genotype likelihoods in Beagle format (.beagle.gz).

-plink [Prefix for binary PLINK files]

Path to PLINK files using their prefix (.bed, .bim, .fam).

-plink_error [float]

Incorporate error model for PLINK genotypes.

-minMaf [float]

Minimum minor allele frequency threshold. (Default: 0.05)

-iter [int]

Maximum number of iterations for estimation of individual allele frequencies (Default: 100).

-tole [float]

Tolerance value for update in estimation of individual allele frequencies (Default: 1e-5).

-maf_iter [int]

Maximum number of EM iterations for computing the population allele frequencies (Default: 100).

-maf_tole [float]

Tolerance value in EM algorithm for population allele frequencies estimation (Default: 1e-4).

-e [int]

Manually select the number of eigenvalues to use in the modelling of individual allele frequencies (Default: Automatically tested using MAP test).

-o [prefix]

Set the prefix for all output files created by PCAngsd (Default: "pcangsd").

-indf_save

Choose to save estimated individual allele frequencies (Binary). Numpy format (.npy).

-dosage_save

Choose to save estimated genotype dosages (Binary). Numpy format (.npy).

-sites_save

Choose to save the marker IDs after performing filtering using population allele frequencies. Useful for especially selection scans and per-site inbreeding coefficients.

-post_save

Choose to save the posterior genotype probabilities. Beagle format (.beagle).

-threads [int]

Specify the number of thread(s) to use (Default: 1).

Call genotypes

Genotypes can be called from posterior genotype probabilities incorporating the individual allele frequencies as prior information.

-geno [float]

Call genotypes with defined threshold.

-genoInbreed [float]

Call genotypes with defined threshold also taking inbreeding into account. -inbreed [int] is required, since individual inbreeding coefficients must have been estimated prior to calling genotypes using that information.

Admixture

Individual admixture proportions and population-specific allele frequencies can be estimated based on assuming K ancestral populations using an accelerated mini-batch NMF method.

-admix

Toggles admixture estimations. Individual ancestry proportions are saved (Binary). Numpy format (.npy).

-admix_alpha [float-list]

Specify alpha (sparseness regularization parameter). Can be specified as a sequence to try several alpha's in a single run (Default: 0).

-admix_auto [float]

Enable automatic search for optimal alpha using likelihood measure, by giving soft upper search bound of alpha.

-admix_seed [int-list]

Specify seed for random initializations of factor matrices in admixture estimations. Can be specified as a sequence to try several different seeds in a single run.

-admix_K [int]

Not recommended. Manually specify the number of ancestral populations to use in admixture estimations (overrides number chosen from -e). Structure explained by individual allele frequencies may therefore not reflect the manually chosen K. It is recommended to adjust -e instead of -admix_K.

-admix_iter [int]

Maximum number of iterations for admixture estimations using NMF. (Default: 200)

-admix_tole [float]

Tolerance value for update in admixture estimations using NMF. (Default: 1e-5)

-admix_batch [int]

Specify the number of batches to use in NMF method. (Default: 10)

-admix_save

Choose to save the population-specific allele frequencies (Binary). Numpy format (.npy).

Inbreeding

Per-individual inbreeding coefficients incorporating population structure can be computed using three different methods. However, -inbreed 1 is recommended for low depth cases.

-inbreed 1

Simple estimator computed by an EM algorithm. Allows for F-values between -1 and 1. Based on ngsF.

-inbreed 2

A maximum likelihood estimator also computed by an EM algorithm. Only allows for F-values between 0 and 1. Based on [5].

-inbreed 3

Estimator using an estimated kinship matrix. Allows for F-values between -1 and 1. Based on PC-Relate.

-inbreed_iter [int]

Maximum number of iterations for the EM algorithm methods. (Default: 200)

-inbreed_tole [float]

Tolerance value for the EM algorithms for inbreeding coefficients estimation. (Default: 1e-4) Per-site inbreeding coefficients incorporating population structure alongside likehood ratio tests for HWE can be computed as follows:

-inbreedSites

Use likelihood ratio tests (.lrt.sites.gz) generated from -inbreedSites to filter out variable sites using a given threshold for HWE test p-value:

-hwe [LRT filename]
-hwe_tole [float]

Tolerance value for HWE test. (Default: 1e-6)

Selection

A genome selection scan can be computed based on posterior expectations of the genotypes (genotype dosages):

-selection

Using an extended model of FastPCA. Performs a genome selection scan along all significant PCs. Outputs the selection statistics and must be converted to p-values by user. Each column reflect the selection statistics along a tested PC and they are χ²-distributed with 1 degree of freedom.

Relatedness

Estimate kinship matrix based on method Based on PC-Relate:

-kinship

Automatically estimated if -inbreed 3 has been selected.

Remove related individuals based on kinhsip matrix of previous run:

-relate [Kinship filename]
-relate_tole [float]

Threshold for kinship coefficients for removing individuals (Default: 0.0625).

Citation

Our methods for inferring population structure have been published in GENETICS:

Inferring Population Structure and Admixture Proportions in Low Depth NGS Data


Our method for testing for HWE in structured populations has been published in Molecular Ecology Resources:

Testing for Hardy‐Weinberg Equilibrium in Structured Populations using Genotype or Low‐Depth NGS Data