PCAngsd: Difference between revisions

Line 16: Line 16:
numpy and pandas.
numpy and pandas.


PCAngsd should work on all platforms meeting the requirements but server-use is recommended.
PCAngsd should work on all platforms meeting the requirements but server-side usage is recommended.




Line 28: Line 28:


# Estimate inbreeding coefficients
# Estimate inbreeding coefficients
python pcangsd.py -beagle test.beagle.gz -inbreed 2 -o test
python pcangsd.py -beagle test.beagle.gz -inbreed 1 -o test


# Perform selection scan
# Perform selection scan
Line 35: Line 35:


=Input=
=Input=
The only input PCAngsd needs and accepts are genotype likelihoods in  [http://faculty.washington.edu/browning/beagle/beagle.html BEAGLE] format. [http://popgen.dk/angsd ANGSD] can be easily be used to compute the genotype likelihoods and output them in the required BEAGLE format.
The only input PCAngsd needs and accepts are genotype likelihoods in  [http://faculty.washington.edu/browning/beagle/beagle.html BEAGLE] format. [http://popgen.dk/angsd ANGSD] can be easily be used to compute genotype likelihoods and output them in the required BEAGLE format.


<pre>
<pre>
Line 45: Line 45:
=Using PCAngsd=
=Using PCAngsd=


All the different options in PCAngsd is listed here.
All the different options in PCAngsd are listed here. All required analyses must be run in the same command! (*For now...*)


==Covariance matrix==
PCAngsd will always compute the covariance matrix. It uses the computed principal components to estimate the individual allele frequencies in an iterative procedure. This procedure is performed until the individual allele frequencies have converged.
PCAngsd will compute the covariance matrix in all available analyses. It uses the principal components to model the individual allele frequencies such that they can be used to estimate another more accurate covariance matrix. This procedure is iterated until convergence for the individual allele frequencies.


; beagle [BEAGLE file path]
; beagle [BEAGLE filename]
Positional argument for the path of the genotype likelihoods in BEAGLE format.
Positional argument for the path of the genotype likelihoods in BEAGLE format.
; beaglelist [filelist]
Parse a filelist with multiple BEAGLE files, e.g. if the genotype likelihoods have been computed separately for each chromosome.
; -M [int]
; -M [int]
Maximum number of iterations for covariance estimation. Only needed in rare cases. (Default: 100)
Maximum number of iterations for covariance estimation. Only needed in rare cases. (Default: 100)
Line 63: Line 64:
Manually select the number of eigenvalues to use in modelling of individual allele frequencies. (Default: Automatically selected)
Manually select the number of eigenvalues to use in modelling of individual allele frequencies. (Default: Automatically selected)
; -reg
; -reg
Toogle to use Tikhonov regularization in modelling of individual allele frequencies to penalize lesser important PCs. May also help on convergence.
(Not fully tested!) Toogle to use Tikhonov regularization in modelling of individual allele frequencies to penalize lesser important PCs. May also help on convergence.
; -o [filename]
; -o [filename]
Set the prefix for all output files created by PCAngsd.
Set the prefix for all output files created by PCAngsd.
Line 74: Line 75:


==Inbreeding==
==Inbreeding==
Per-individual inbreeding coefficients can be computed using three different methods:
Per-individual inbreeding coefficients incorporating population structure can be computed using three different methods:


; -inbreed 1
; -inbreed 1
A maximum likelihood estimator computed by an EM algorithm. Only allows F-values between 0 and 1.
A maximum likelihood estimator computed by an EM algorithm. Only allows for F-values between 0 and 1. Based on [https://www.cambridge.org/core/journals/genetics-research/article/maximum-likelihood-estimation-of-individual-inbreeding-coefficients-and-null-allele-frequencies/2DEBA0C0C2B92DF0EE89BD27DFCAD3FB],[http://genome.cshlp.org/content/23/11/1852.full].
; -inbreed 2
; -inbreed 2
Simple estimator also computed by an EM algorithm described in [http://genome.cshlp.org/content/23/11/1852.full].
Simple estimator also computed by an EM algorithm. Based on [http://genome.cshlp.org/content/23/11/1852.full].
; -inbreed 3
; -inbreed 3
Moment estimator for the allele frequencies based on the model in [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate]. Sensitive to low-depth data.  
(Not recommended for low depth NGS data!) Estimator using the kinship matrix. Based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate].  
; -inbreed_iter [int]
; -inbreed_iter [int]
Maximum number of iterations for the EM algorithm methods. (Default: 200)
Maximum number of iterations for the EM algorithm methods. (Default: 200)
; -inbreed_tole [float]
; -inbreed_tole [float]
Tolerance value for the EM algorithms for inbreeding coefficients estimation. (Default: 1e-4)
Tolerance value for the EM algorithms for inbreeding coefficients estimation. (Default: 1e-4)
Per-site inbreeding coefficients incorporating population structure can be computed


==Selection==
==Selection==
Line 91: Line 95:


; -selection 1
; -selection 1
Using the model described in [http://www.cell.com/ajhg/abstract/S0002-9297(16)00003-3 FastPCA]. Produces a genome-wide selection scan for all significant PCs.
Using an extended model of [http://www.cell.com/ajhg/abstract/S0002-9297(16)00003-3 FastPCA]. Produces a genome selection scan for all significant PCs.
; -selection 2
; -selection 2
Using the model described in [http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12592/abstract PCAdapt].
(Not fully tested!) Using an extended model of [http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12592/abstract PCAdapt].  
 


LD can also be taken into account when performing selection scans. LD regression has been implemented in PCAngsd but the functionality is not fully tested.
LD can also be taken into account when performing selection scans. LD regression has been implemented in PCAngsd but the functionality is not fully tested.
Line 100: Line 105:


==Relatedness==
==Relatedness==
Relatedness will also be touched upon in future updates.
Work in progress.
 
Estimate kinship matrix:
 
; -kinship
 
==Example==
 
 
==Citation==

Revision as of 12:32, 10 August 2017

This page contains information about the program PCAngsd, which estimates the covariance matrix for low-depth NGS data in an iterative procedure based on genotype likelihoods. Based on the inferred population structure, PCAngsd is able to estimate individual allele frequencies. By incorporating these allele frequencies in Empirical Bayes approaches, PCAngsd can perform PCA (estimate the covariance matrix), call genotypes, estimate inbreeding coefficients (per-individual and per-site) and perform a genome-wide selection scan using principal components in structured populations. The entire program is written in Python 2.7.


Download

The program can be downloaded from Github: https://github.com/Rosemeis/pcangsd

git clone https://github.com/Rosemeis/pcangsd.git;
cd pcangsd/

The following Python packages are needed to run PCAngsd (found in all popular distributions): numpy and pandas.

PCAngsd should work on all platforms meeting the requirements but server-side usage is recommended.


Quick start

# See all options in PCAngsd
python pcangsd.py -h

# Estimate covariance matrix 
python pcangsd.py -beagle test.beagle.gz -o test

# Estimate inbreeding coefficients
python pcangsd.py -beagle test.beagle.gz -inbreed 1 -o test

# Perform selection scan
python pcangsd.py -beagle test.beagle.gz -selection 1 -o test

Input

The only input PCAngsd needs and accepts is genotype likelihoods in BEAGLE format. ANGSD can easily be used to compute genotype likelihoods and output them in the required BEAGLE format.

./angsd -GL 1 -out genoLikes -nThreads 10 -doGlf 2 -doMajorMinor 1  -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist

See ANGSD for more info on how to compute the genotype likelihoods and call SNPs.
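To check what such a file contains before running PCAngsd, the BEAGLE genotype likelihoods can be parsed with a few lines of Python. The following is a minimal sketch, not part of PCAngsd. It assumes the usual BEAGLE layout (a header line, then one line per site with a marker column, two allele columns and three likelihood columns per individual, ordered major/major, major/minor, minor/minor). The function names `read_beagle` and `read_beagle_gz` are hypothetical.

```python
import gzip
import io


def read_beagle(handle):
    """Parse BEAGLE genotype-likelihood text into (markers, likelihoods).

    likelihoods[s][i] is a 3-tuple for site s and individual i.
    Assumed column order: marker, allele1, allele2, then 3 values per individual.
    """
    header = handle.readline().split()
    n_ind = (len(header) - 3) // 3  # three likelihood columns per individual
    markers, likelihoods = [], []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        markers.append(fields[0])
        values = [float(x) for x in fields[3:]]
        likelihoods.append(
            [tuple(values[3 * i:3 * i + 3]) for i in range(n_ind)]
        )
    return markers, likelihoods


def read_beagle_gz(path):
    """Open a gzipped BEAGLE file and parse it (hypothetical helper)."""
    with gzip.open(path, "rt") as handle:
        return read_beagle(handle)


# Tiny in-memory example with one site and two individuals.
example = io.StringIO(
    u"marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n"
    u"chr1_100\t0\t2\t0.9\t0.1\t0.0\t0.333\t0.333\t0.333\n"
)
markers, likelihoods = read_beagle(example)
print(markers, likelihoods)
```

For a real file, `read_beagle_gz` would be called with the path of the gzipped BEAGLE file produced by the ANGSD command above.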

Using PCAngsd

All the different options in PCAngsd are listed here. For now, all required analyses must be run in the same command.
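Because all analyses have to be requested in one invocation, it can be convenient to assemble the command programmatically. The sketch below only combines flags that appear on this page (`-beagle`, `-inbreed`, `-selection`, `-callGeno`, `-o`). The helper `build_command` is hypothetical and not part of PCAngsd.

```python
def build_command(beagle, prefix, inbreed=None, selection=None, call_geno=False):
    """Build a single PCAngsd command line as an argument list."""
    cmd = ["python", "pcangsd.py", "-beagle", beagle]
    if inbreed is not None:
        cmd += ["-inbreed", str(inbreed)]
    if selection is not None:
        cmd += ["-selection", str(selection)]
    if call_geno:
        cmd.append("-callGeno")
    cmd += ["-o", prefix]
    return cmd


# Inbreeding coefficients and a selection scan in the same run.
command = build_command("test.beagle.gz", "test", inbreed=1, selection=1)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.call` from the directory containing `pcangsd.py`.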

PCAngsd will always compute the covariance matrix. It uses the computed principal components to estimate the individual allele frequencies in an iterative procedure. This procedure is performed until the individual allele frequencies have converged.

beagle [BEAGLE filename]

Positional argument for the path of the genotype likelihoods in BEAGLE format.

beaglelist [filelist]

Parse a filelist with multiple BEAGLE files, e.g. if the genotype likelihoods have been computed separately for each chromosome.

-M [int]

Maximum number of iterations for covariance estimation. Only needed in rare cases. (Default: 100)

-M_tole [float]

Tolerance value for the iterative covariance matrix estimation. (Default: 1e-4)

-EM [int]

Maximum number of EM iterations for computing the population allele frequencies. (Default: 200)

-EM_tole [float]

Tolerance value in EM algorithm for population allele frequencies estimation. (Default: 1e-4)

-e [int]

Manually select the number of eigenvalues to use in modelling of individual allele frequencies. (Default: Automatically selected)

-reg

(Not fully tested!) Toggle to use Tikhonov regularization in modelling of individual allele frequencies to penalize less important PCs. May also help convergence.

-o [filename]

Set the prefix for all output files created by PCAngsd.

Call genotypes

Genotypes can be called very easily using the individual allele frequencies as prior.

-callGeno

Toggle to call genotypes.
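The idea of using the individual allele frequency as a prior can be illustrated for a single individual at a single site. This is a minimal sketch under stated assumptions, not PCAngsd's actual code: the prior is taken to be Hardy-Weinberg proportions at the individual allele frequency, the genotype with the highest posterior is called, and the function name `call_genotype` is hypothetical.

```python
def call_genotype(likelihoods, freq):
    """Call a genotype from genotype likelihoods and an allele frequency.

    likelihoods: (L0, L1, L2) for 0, 1 and 2 copies of the minor allele.
    freq: individual allele frequency of the minor allele (assumed prior input).
    Returns (called genotype, list of posterior probabilities).
    """
    # Hardy-Weinberg prior at the individual allele frequency (assumption).
    prior = ((1.0 - freq) ** 2, 2.0 * freq * (1.0 - freq), freq ** 2)
    joint = [l * p for l, p in zip(likelihoods, prior)]
    total = sum(joint)
    if total == 0.0:
        return None, [0.0, 0.0, 0.0]  # no information at this site
    posterior = [j / total for j in joint]
    best = max(range(3), key=lambda g: posterior[g])
    return best, posterior


genotype, posterior = call_genotype((0.1, 0.6, 0.3), 0.5)
print(genotype, posterior)
```

With uninformative likelihoods the call is driven by the prior alone, which is why accurate individual allele frequencies matter for low-depth data.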

Inbreeding

Per-individual inbreeding coefficients incorporating population structure can be computed using three different methods:

-inbreed 1

A maximum likelihood estimator computed by an EM algorithm. Only allows for F-values between 0 and 1. Based on [1],[2].

-inbreed 2

Simple estimator also computed by an EM algorithm. Based on [3].

-inbreed 3

(Not recommended for low depth NGS data!) Estimator using the kinship matrix. Based on PC-Relate.

-inbreed_iter [int]

Maximum number of iterations for the EM algorithm methods. (Default: 200)

-inbreed_tole [float]

Tolerance value for the EM algorithms for inbreeding coefficients estimation. (Default: 1e-4)
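To show how an EM algorithm with an iteration limit and a tolerance can estimate a per-individual inbreeding coefficient, here is a generic sketch. It is not necessarily the estimator behind `-inbreed 1` or `-inbreed 2`. It assumes a simple mixture model in which each site is either identical by descent with probability F (only homozygous genotypes) or follows Hardy-Weinberg proportions, and it takes the individual allele frequencies as known. The name `estimate_inbreeding` and the starting value are hypothetical; the defaults mirror the values listed above.

```python
def estimate_inbreeding(likelihoods, freqs, max_iter=200, tole=1e-4):
    """EM estimate of F for one individual (illustrative mixture model).

    likelihoods: list of (L0, L1, L2) per site.
    freqs: list of individual allele frequencies per site, each in (0, 1).
    """
    F = 0.25  # arbitrary starting value inside (0, 1)
    for _ in range(max_iter):
        total = 0.0
        for (l0, l1, l2), f in zip(likelihoods, freqs):
            # Likelihood of the data if the site is identical by descent.
            ibd = l0 * (1.0 - f) + l2 * f
            # Likelihood of the data under Hardy-Weinberg proportions.
            hwe = l0 * (1.0 - f) ** 2 + l1 * 2.0 * f * (1.0 - f) + l2 * f ** 2
            denom = F * ibd + (1.0 - F) * hwe
            if denom > 0.0:
                total += F * ibd / denom  # E-step: posterior of IBD
        new_F = total / len(freqs)  # M-step: average posterior
        if abs(new_F - F) < tole:
            return new_F
        F = new_F
    return F


homozygous = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)] * 5
heterozygous = [(0.0, 1.0, 0.0)] * 10
F_hom = estimate_inbreeding(homozygous, [0.5] * 10)
F_het = estimate_inbreeding(heterozygous, [0.5] * 10)
print(F_hom, F_het)
```

Because F is a mixture weight in this model, the estimate stays between 0 and 1 by construction.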


Per-site inbreeding coefficients incorporating population structure can be computed

Selection

A genome-wide selection scan can be computed using two different methods:

-selection 1

Using an extended model of FastPCA. Produces a genome selection scan for all significant PCs.

-selection 2

(Not fully tested!) Using an extended model of PCAdapt.


LD can also be taken into account when performing selection scans. LD regression has been implemented in PCAngsd but the functionality is not fully tested.

-LD [int]

Select the window (in bases) of preceding sites to use in regression.
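What "a window of preceding sites" means can be sketched as follows. This illustration assumes positions from a single chromosome sorted in increasing order and treats sites exactly `window` bases away as inside the window. How PCAngsd handles these details is not stated here, and the function name `preceding_sites` is hypothetical.

```python
from bisect import bisect_left


def preceding_sites(positions, window):
    """For each site, list the indices of earlier sites within `window` bases."""
    result = []
    for i, pos in enumerate(positions):
        # First earlier site whose position is >= pos - window.
        start = bisect_left(positions, pos - window, 0, i)
        result.append(list(range(start, i)))
    return result


positions = [100, 150, 400, 420, 1000]
neighbours = preceding_sites(positions, 100)
print(neighbours)
```

Each inner list holds the sites that would act as predictors for the corresponding site in a regression of this kind.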

Relatedness

Work in progress.

Estimate kinship matrix:

-kinship

Example

Citation