File:Pcangsd admix.gif

2017-09-19T14:34:28Z

Jonas2: Jonas2 uploaded a new version of File:Pcangsd admix.gif

File:Pcangsd admix.gif

2017-09-19T14:30:21Z

Jonas2: Jonas2 uploaded a new version of File:Pcangsd admix.gif

PCAngsd

2017-09-19T14:26:03Z

Jonas2:

This page contains information about the program PCAngsd, which estimates the covariance matrix for low-depth NGS data in an iterative procedure based on genotype likelihoods. Based on the inferred population structure, PCAngsd is able to estimate individual allele frequencies. By incorporating these allele frequencies in Empirical Bayes approaches, PCAngsd can perform PCA (estimate the covariance matrix), call genotypes, estimate inbreeding coefficients (per-individual and per-site) and perform a genome selection scan using principal components in structured populations. The entire program is written in Python 2.7.

[[File:Pcangsd_pca.png|thumb|400px]]

=Download=

The program can be downloaded from GitHub:
https://github.com/Rosemeis/pcangsd

Latest release of PCAngsd: 0.3

<pre>
git clone https://github.com/Rosemeis/pcangsd.git;
cd pcangsd/
</pre>

The following Python packages are needed to run PCAngsd (found in all popular distributions):
'''numpy''' and '''pandas'''.

PCAngsd should work on all platforms that meet the requirements, but server-side usage is recommended.

==Quick start==
<pre>
# See all options in PCAngsd
python pcangsd.py -h

# Only estimate covariance matrix
python pcangsd.py -beagle test.beagle.gz -o test

# Estimate covariance matrix and inbreeding coefficients
python pcangsd.py -beagle test.beagle.gz -inbreed 1 -o test

# Estimate covariance matrix and perform selection scan
python pcangsd.py -beagle test.beagle.gz -selection 1 -o test
</pre>

=Input=
The only input PCAngsd needs and accepts is genotype likelihoods in [http://faculty.washington.edu/browning/beagle/beagle.html Beagle] format. [http://popgen.dk/angsd ANGSD] can easily be used to compute genotype likelihoods and output them in the required Beagle format.

<pre>
./angsd -GL 1 -out genoLikes -nThreads 10 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist
</pre>

See [http://popgen.dk/angsd ANGSD] for more info on how to compute the genotype likelihoods and call SNPs.
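
For orientation, the Beagle file written by ANGSD with ''-doGlf 2'' is a whitespace-separated text file with one header line, three leading columns (marker, allele1, allele2) and three genotype likelihood columns per individual. The sketch below is not part of PCAngsd; it only illustrates that layout using the Python standard library, and the function names <code>read_beagle</code> and <code>read_beagle_gz</code> are hypothetical.

```python
import gzip
import io


def read_beagle(handle):
    """Parse Beagle genotype likelihoods from an open text handle.

    Returns (markers, likes) where likes[s][i] is a tuple of the three
    genotype likelihoods of individual i at site s.
    """
    header = handle.readline().split()
    n_cols = len(header) - 3  # columns after marker, allele1, allele2
    if n_cols <= 0 or n_cols % 3 != 0:
        raise ValueError("expected three likelihood columns per individual")
    n_ind = n_cols // 3

    markers, likes = [], []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3 + 3 * n_ind:
            raise ValueError("unexpected number of columns at " + fields[0])
        values = [float(x) for x in fields[3:]]
        markers.append(fields[0])
        likes.append([tuple(values[3 * i:3 * i + 3]) for i in range(n_ind)])
    return markers, likes


def read_beagle_gz(path):
    """Convenience wrapper for gzip-compressed files such as test.beagle.gz."""
    with gzip.open(path, "rt") as handle:
        return read_beagle(handle)


# Tiny in-memory example with two individuals and two sites.
example = io.StringIO(
    "marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n"
    "chr1_100\t0\t1\t0.9\t0.1\t0.0\t0.2\t0.6\t0.2\n"
    "chr1_250\t2\t3\t0.0\t0.5\t0.5\t1.0\t0.0\t0.0\n"
)
markers, likes = read_beagle(example)
print(len(markers), "sites,", len(likes[0]), "individuals")
```

A quick check like this can help confirm that a file has the expected number of individuals before starting a long PCAngsd run.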

=Using PCAngsd=

All the different options in PCAngsd are listed here. Usually, all the desired analyses must be run in the same command. However, PCAngsd can also be run in chunk-mode, where per-site estimations are performed on one chunk of the data at a time using a pre-estimated covariance matrix. More information on chunk-mode estimations can be found [[#Chunk-mode estimations|here]].

PCAngsd will always compute the covariance matrix (unless it is run in chunk-mode). It uses the computed principal components to estimate individual allele frequencies in an iterative procedure, which is repeated until the individual allele frequencies have converged.

; -beagle [Beagle filename]
Path to file of the genotype likelihoods in Beagle format.
; -beaglelist [filelist]
Parse a file with a list of multiple Beagle files, e.g. if the genotype likelihoods have been computed separately for each chromosome.
; -M [int]
Maximum number of iterations for covariance estimation. Only needed in rare cases. (Default: 100)
; -M_tole [float]
Tolerance value for the iterative covariance matrix estimation. (Default: 1e-4)
; -EM [int]
Maximum number of EM iterations for computing the population allele frequencies. (Default: 200)
; -EM_tole [float]
Tolerance value in EM algorithm for population allele frequencies estimation. (Default: 1e-4)
; -e [int]
Manually select the number of eigenvalues to use in the modelling of individual allele frequencies. (Default: Automatically tested)
; -reg [float]
Add a regularization term to the modelling of individual allele frequencies to perform ridge regression. May help convergence of the individual allele frequencies. Must be used when scaling principal components prior to the modelling of individual allele frequencies.
; -scaled
Scale significant principal components in relation to the top principal component using their corresponding eigenvalues prior to modelling individual allele frequencies.
; -o [prefix]
Set the prefix for all output files created by PCAngsd (Default: "pcangsd").
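
To make the role of ''-EM'' and ''-EM_tole'' more concrete, the following is a minimal sketch of a standard EM update for the population allele frequency at a single site, assuming Hardy-Weinberg proportions as the genotype prior. It is an illustration only and not PCAngsd's own code; the function name, the starting value and the convergence check are assumptions, while the default iteration limit and tolerance mirror the defaults listed above.

```python
def em_allele_frequency(site_likes, max_iter=200, tole=1e-4, start=0.25):
    """Estimate the minor allele frequency at one site by EM.

    site_likes: list of (L0, L1, L2) genotype likelihoods per individual,
    indexed by the number of minor alleles. Assumes Hardy-Weinberg
    proportions as the genotype prior.
    """
    f = start
    for iteration in range(1, max_iter + 1):
        expected = 0.0
        for l0, l1, l2 in site_likes:
            # E-step: posterior genotype weights given the current f.
            w0 = l0 * (1.0 - f) ** 2
            w1 = l1 * 2.0 * f * (1.0 - f)
            w2 = l2 * f ** 2
            total = w0 + w1 + w2
            if total > 0.0:
                expected += (w1 + 2.0 * w2) / total
        # M-step: expected minor allele count divided by 2N chromosomes.
        f_new = expected / (2.0 * len(site_likes))
        if abs(f_new - f) < tole:
            return f_new, iteration
        f = f_new
    return f, max_iter


# Four individuals with certain genotypes 0, 1, 2 and 0 minor alleles.
site = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
freq, n_iter = em_allele_frequency(site)
print(freq, n_iter)
```

With real low-depth data the likelihoods are far less decisive than in this toy input, which is when the iteration limit and tolerance start to matter.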

==Call genotypes==
Genotypes can be called from posterior genotype probabilities that incorporate the individual allele frequencies as a prior.

; -geno [float]
Call genotypes with defined threshold.
; -genoInbreed [float]
Call genotypes with defined threshold also taking inbreeding into account. ''-inbreed'' is required.
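
The idea behind threshold-based calling can be sketched as follows. This is a hedged illustration rather than PCAngsd's implementation: it assumes the prior takes the usual Hardy-Weinberg form, optionally adjusted by an inbreeding coefficient F, and that a genotype is reported as missing (coded here as -1, an arbitrary choice) when no posterior probability reaches the threshold. The function name <code>call_genotype</code> is hypothetical.

```python
def call_genotype(likes, freq, threshold, inbreeding=0.0):
    """Return the called genotype (0, 1 or 2 minor alleles) or -1 if missing.

    likes: (L0, L1, L2) genotype likelihoods for one individual at one site.
    freq: individual allele frequency of the minor allele at that site.
    threshold: minimum posterior probability required to make a call.
    inbreeding: per-individual inbreeding coefficient F (0 gives HWE).
    """
    p, q, f = freq, 1.0 - freq, inbreeding
    prior = (
        q * q + p * q * f,        # homozygous major
        2.0 * p * q * (1.0 - f),  # heterozygous
        p * p + p * q * f,        # homozygous minor
    )
    weighted = [l * pr for l, pr in zip(likes, prior)]
    total = sum(weighted)
    if total <= 0.0:
        return -1  # no information at this site
    posterior = [w / total for w in weighted]
    best = max(range(3), key=lambda g: posterior[g])
    return best if posterior[best] >= threshold else -1


# The same likelihoods give a call or a missing value depending on threshold.
likes = (0.2, 0.6, 0.2)
print(call_genotype(likes, freq=0.5, threshold=0.7))
print(call_genotype(likes, freq=0.5, threshold=0.9))
```

A higher threshold yields fewer but more confident calls, which is the trade-off to weigh when choosing the value passed to ''-geno''.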

==Inbreeding==
Per-individual inbreeding coefficients incorporating population structure can be computed using three different methods:

; -inbreed 1
A maximum likelihood estimator computed by an EM algorithm. Only allows for F-values between 0 and 1. Based on [https://www.cambridge.org/core/journals/genetics-research/article/maximum-likelihood-estimation-of-individual-inbreeding-coefficients-and-null-allele-frequencies/2DEBA0C0C2B92DF0EE89BD27DFCAD3FB].
; -inbreed 2
Simple estimator also computed by an EM algorithm. Based on [http://genome.cshlp.org/content/23/11/1852.full ngsF].
; -inbreed 3
(Not recommended for low depth NGS data!) Estimator using the kinship matrix. Based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate].
; -inbreed_iter [int]
Maximum number of iterations for the EM algorithm methods. (Default: 200)
; -inbreed_tole [float]
Tolerance value for the EM algorithms for inbreeding coefficients estimation. (Default: 1e-4)

Per-site inbreeding coefficients incorporating population structure, alongside likelihood ratio tests for HWE, can be computed as follows:

; -inbreedSites

==Selection==
A genome selection scan can be computed using two different methods:

; -selection 1
Using an extended model of [http://www.cell.com/ajhg/abstract/S0002-9297(16)00003-3 FastPCA]. Performs a genome selection scan along all significant PCs.
; -selection 2
(Not fully tested!) Using an extended model of [http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12592/abstract PCAdapt].
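
In the FastPCA framework, the selection statistic for each site and principal component follows a chi-square distribution with 1 degree of freedom under the null. Assuming this also holds for the statistics produced by ''-selection 1'', they can be converted to p-values with the standard library alone, as in the sketch below. The function names and the input values are hypothetical, and the layout of PCAngsd's output file is not assumed here.

```python
import math


def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    if stat < 0.0:
        raise ValueError("test statistic must be non-negative")
    # If Z ~ N(0, 1) then Z**2 ~ chi2(1), so P(Z**2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def selection_pvalues(stats_per_site):
    """Convert rows of per-PC statistics into rows of p-values."""
    return [[chi2_sf_1df(s) for s in row] for row in stats_per_site]


# Hypothetical statistics for three sites along two PCs.
stats = [[0.5, 1.2], [3.84, 0.1], [25.0, 2.0]]
for row in selection_pvalues(stats):
    print(" ".join("%.3g" % p for p in row))
```

When many sites are tested, the resulting p-values would normally still need a multiple-testing correction.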

LD can also be taken into account when performing selection scans. LD regression has been implemented in PCAngsd.
; -LD [int]
(Not fully tested!) Select the window (in bases) of preceding sites to use in regression.

==Relatedness==
'''Work in progress...'''

Estimate the kinship matrix using a method based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate]:

; -kinship
Automatically estimated if ''-inbreed 3'' has been selected.

==Chunk-mode estimations==
PCAngsd can also be run in chunk-mode, where a chunk of the data is processed at a time. This means that estimations on very large data sets are feasible for per-site parameters. In order to run chunk-mode, a pre-estimated covariance matrix must be provided. It can be estimated from a representative subset of the data set that is small enough for covariance estimation to be feasible. Chunk-mode estimations are enabled by specifying the number of sites to evaluate at a time:

; -chunksize [int]
Number of sites to read in at a time for chunk-mode estimations.
; -cov [file]
Covariance matrix file needed in order to perform chunk-mode estimations.

The following estimations can be performed in chunk-mode (individual allele frequencies are estimated and saved for all sites automatically):
; -selection 1
; -selection 2
; -inbreedSites
; -geno [float]

Note: Genotypes can also be called incorporating both individual allele frequencies and inbreeding coefficients. In that case, pre-estimated per-individual inbreeding coefficients must also be provided, in the same way as the covariance matrix:

; -F [file]
; -genoInbreed [float]
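
As a rough illustration of how the chunk-mode options combine, the helper below assembles a command line from the flags documented in this section. It is a sketch under assumptions: the helper name <code>chunk_mode_command</code> and all file names are placeholders, and only options listed on this page are used.

```python
def chunk_mode_command(beagle, cov, chunksize, prefix, selection=None,
                       inbreed_sites=False, geno=None,
                       inbreeding_file=None, geno_inbreed=None):
    """Build a PCAngsd chunk-mode command line as an argument list."""
    if geno_inbreed is not None and inbreeding_file is None:
        raise ValueError("-genoInbreed in chunk-mode needs -F")
    cmd = ["python", "pcangsd.py", "-beagle", beagle,
           "-cov", cov, "-chunksize", str(chunksize), "-o", prefix]
    if selection is not None:
        cmd += ["-selection", str(selection)]
    if inbreed_sites:
        cmd += ["-inbreedSites"]
    if geno is not None:
        cmd += ["-geno", str(geno)]
    if inbreeding_file is not None:
        cmd += ["-F", inbreeding_file]
    if geno_inbreed is not None:
        cmd += ["-genoInbreed", str(geno_inbreed)]
    return cmd


# File names below are placeholders, not files shipped with PCAngsd.
cmd = chunk_mode_command("all.beagle.gz", "subset.cov", 100000, "chunked",
                         selection=1, geno=0.9)
print(" ".join(cmd))
# The list can be passed to subprocess.run(cmd, check=True) to execute it.
```

Building the command as a list avoids shell quoting problems when file paths contain spaces.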

=Citation=

PCAngsd

2017-09-19T14:25:26Z

Jonas2:


2017-09-19T14:09:55Z

Jonas2:

PCAngsd

2017-09-19T13:35:47Z

Jonas2:
