software - User contributions [en]

PCAngsdTutorial

2019-08-07T10:53:51Z

Sonia4: /* Plot the results */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

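Run these from inside the Demo directory (after the <code>cd Demo</code> above). <code>$PWD</code> expands to an absolute path, so the variables stay valid after you change directory:

<code>IN_DIR=$PWD/Data</code>

<code>OUT_DIR=$PWD/Results</code>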

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example performs a PCA on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

Download the files, then move the beagle file to your input folder ($IN_DIR) and the population information file to your output folder ($OUT_DIR), where the plotting is done:


<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $OUT_DIR</code>


==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the lines in the input file. The line count is the number of loci with GLs plus one (the header line is included in the count):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
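To make the column layout described above concrete, here is a minimal Python sketch that splits a beagle-style line into the marker, the two alleles and one GL triplet per individual, and checks that each triplet sums to one. The three lines of data and the helper name <code>gl_triplets</code> are invented for illustration; they are not taken from Demo1input.gz.

```python
# Hypothetical beagle-style lines (tab separated); all values are invented.
lines = [
    "marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1",
    "chr1_100\t0\t2\t0.90\t0.09\t0.01\t0.333333\t0.333333\t0.333334",
    "chr1_250\t1\t3\t0.00\t0.25\t0.75\t0.10\t0.80\t0.10",
]


def gl_triplets(line):
    """Split one data line into marker, allele1, allele2 and a list of GL triplets."""
    fields = line.split("\t")
    marker, allele1, allele2 = fields[:3]
    gls = [float(x) for x in fields[3:]]
    # Three consecutive columns per individual:
    # homozygote allele1, heterozygote, homozygote allele2.
    triplets = [tuple(gls[i:i + 3]) for i in range(0, len(gls), 3)]
    return marker, allele1, allele2, triplets


header, *data = lines
n_ind = (len(header.split("\t")) - 3) // 3  # individuals = (columns - 3) / 3
n_loci = len(data)                          # loci = lines minus the header

for line in data:
    marker, allele1, allele2, triplets = gl_triplets(line)
    for ind, triplet in enumerate(triplets):
        # The GLs are normalised, so each individual's triplet sums to one.
        assert abs(sum(triplet) - 1.0) < 1e-4, (marker, ind)
    print(marker, allele1, allele2, triplets)

print("individuals:", n_ind, "loci:", n_loci)
```

The same arithmetic explains the line count above: the number of loci is the number of lines minus the header.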

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $OUT_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code>cut -f1 -d" " $OUT_DIR/Demo1pop.info > $OUT_DIR/poplabel</code>
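For readers who prefer to do the same summary outside the shell, the following Python sketch mirrors the cut, sort and count steps. The file content and the labels <code>POP_A</code>, <code>POP_B</code> and <code>POP_C</code> are invented; the only assumption carried over from the tutorial is that the population label is the first space-separated column.

```python
from collections import Counter

# Hypothetical population information: "label individualID" per line.
pop_info = """POP_A ind01
POP_A ind02
POP_B ind03
POP_C ind04
POP_B ind05
POP_A ind06
"""

# Equivalent of `cut -f 1 -d " "`: keep the first space-separated field.
labels = [line.split(" ")[0] for line in pop_info.splitlines() if line]

# Equivalent of `sort | uniq -c`: count each label, then print in sorted order.
counts = Counter(labels)
for label in sorted(counts):
    print(counts[label], label)
```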

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

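<pre>
# open R in the output directory ($OUT_DIR), where Demo1pop.info and the .cov file are located
# stringsAsFactors=TRUE keeps the population column a factor (needed for col= and levels() in R >= 4.0)
pop <- read.table("Demo1pop.info", stringsAsFactors=TRUE)
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
# one legend entry per population, coloured like the points
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))
dev.off()
## close R
</pre>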

To view the plot, type:

<code>evince PCAngsd1.pdf</code>
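The plot uses the first two eigenvectors of the covariance matrix as PC1 and PC2. As a rough illustration of what an eigenvector of a covariance matrix is, the Python sketch below finds the leading eigenpair of a small symmetric matrix by power iteration and reports the share of the total variance it carries. The 3x3 matrix is invented, and power iteration is chosen only because it fits in a few lines of plain Python; R's <code>eigen</code> relies on LAPACK routines rather than this method.

```python
import math

# Invented 3x3 symmetric "covariance" matrix for three individuals.
C = [
    [0.50, 0.20, -0.30],
    [0.20, 0.40, -0.25],
    [-0.30, -0.25, 0.60],
]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def leading_eigenpair(m, n_iter=500):
    """Power iteration: repeatedly apply m and renormalise."""
    v = normalize([1.0] * len(m))
    for _ in range(n_iter):
        v = normalize(matvec(m, v))
    # Rayleigh quotient v' m v gives the eigenvalue for a unit vector v.
    value = sum(a * b for a, b in zip(v, matvec(m, v)))
    return value, v


value, pc1 = leading_eigenpair(C)
# The eigenvalues sum to the trace, so this is the variance share of PC1.
share = value / sum(C[i][i] for i in range(len(C)))
print("leading eigenvalue:", round(value, 4))
print("PC1 loadings:", [round(x, 4) for x in pc1])
print("share of total variance:", round(share, 3))
```

The sign of an eigenvector is arbitrary, so a PCA plot may appear mirrored between programs without the result being different.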

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

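<pre>
# open R in the output directory ($OUT_DIR), where Demo1pop.info and the .cov file are located
# stringsAsFactors=TRUE keeps the population column a factor (needed for col= and levels() in R >= 4.0)
pop <- read.table("Demo1pop.info", stringsAsFactors=TRUE)
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
# one legend entry per population, coloured like the points
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))
dev.off()
## close R
</pre>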

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
##Requires previous installation of the library RcppCNPy

library(RcppCNPy) # Numpy library for R

pop<-read.table("Demo1pop.info",as.is=T)

q <- npyLoad("Demo2PCANGSD_3.admix.Q.npy")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

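The <code>.npy</code> output is a binary NumPy file, so convert it to text first. Then join names and results, sort the values and look at the results:

<code>python -c "import numpy as np; np.savetxt('Demo1PCANGSD_4.inbreed.txt', np.load('Demo1PCANGSD_4.inbreed.npy'))"</code>

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed.txt | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (negative values are allowed).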
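To see why an inbreeding estimate can be negative, the sketch below uses the textbook definition F = 1 - H_obs / H_exp at a single locus with called genotypes, where H_exp = 2p(1-p) is the heterozygosity expected under Hardy-Weinberg proportions. This is only an illustration of the sign: PCAngsd works on genotype likelihoods and reports one value per individual, so this is not its estimator, and the genotype counts are invented.

```python
def inbreeding_coefficient(n_AA, n_Aa, n_aa):
    """Textbook single-locus F from genotype counts: 1 - H_obs / H_exp."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    h_exp = 2 * p * (1 - p)          # expected heterozygosity under HWE
    h_obs = n_Aa / n                 # observed heterozygosity
    return 1 - h_obs / h_exp


# Invented genotype counts, each with allele frequency 0.5.
f_hwe = inbreeding_coefficient(25, 50, 25)      # Hardy-Weinberg proportions
f_deficit = inbreeding_coefficient(40, 20, 40)  # too few heterozygotes
f_excess = inbreeding_coefficient(10, 80, 10)   # too many heterozygotes

print("HWE:", round(f_hwe, 3))
print("heterozygote deficit:", round(f_deficit, 3))
print("heterozygote excess:", round(f_excess, 3))
```

A deficit of heterozygotes gives a positive value, and an excess gives a negative one.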

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

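Convert the binary <code>.npy</code> output to text, then join names and results, sort the values and look at the results:

<code>python -c "import numpy as np; np.savetxt('Demo1PCANGSD_5.inbreed.txt', np.load('Demo1PCANGSD_5.inbreed.npy'))"</code>

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.txt | LC_ALL=C sort -k3g</code>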

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italy</td>
</tr>
:</table>

==== Download the input and sample information files ====

A file with the positions and sample information, and a beagle file are provided:

Download the files, then move the beagle file to your input folder ($IN_DIR) and the sample information file to your output folder ($OUT_DIR), where the plotting is done:


<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $OUT_DIR</code>


==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

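<pre>
## Open R in the output directory ($OUT_DIR)

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
## stringsAsFactors=TRUE keeps POP a factor (needed for col= and levels() in R >= 4.0)
ID<-read.table("Demo2sample.info",header=TRUE,stringsAsFactors=TRUE)

pdf("PCAngsdDemo2_1.pdf")

plot(e$vectors[,1:2],col=ID$POP)

## one legend entry per population, coloured like the points
legend("topleft",fill=seq_along(levels(ID$POP)),legend=levels(ID$POP))

## close the PDF device so the file is written
dev.off()
</pre>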

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components to infer selection along the genome.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save</code>

==== Plot the results ====

<pre>

library(RcppCNPy) # Numpy library for R

## function for QQ plot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-npyLoad("Demo2PCANGSD_2.selection.npy")

## make QQ plot to QC the test statistics
qqchi(s)

## convert test statistic to p-value (upper tail; avoids 1-pchisq rounding to 0 for large statistics)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>
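The R code above turns chi-square statistics with one degree of freedom into p-values and summarises their inflation with lambda, the median statistic divided by the median of the chi-square distribution. The Python sketch below reproduces those two steps with the standard library only, using the identity that the upper tail of a chi-square with one degree of freedom equals erfc(sqrt(x/2)). The statistics are invented and the function name <code>chi2_sf_1df</code> is a label chosen for this example.

```python
import math
from statistics import median


def chi2_sf_1df(x):
    """Upper-tail probability P(X >= x) for a chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Median of the chi-square distribution with 1 df (qchisq(0.5, 1) in R).
CHI2_MEDIAN_1DF = 0.4549364

# Invented test statistics; the last one mimics a strong selection signal.
stats = [0.02, 0.1, 0.45, 0.46, 1.3, 2.7, 3.84, 40.0]

# Inflation factor: values near 1 suggest the statistics follow the null.
lam = median(stats) / CHI2_MEDIAN_1DF

pvals = [chi2_sf_1df(s) for s in stats]

# Indices of sites passing the same threshold as the zoom step (p < 1e-7).
hits = [i for i, p in enumerate(pvals) if p < 1e-7]

print("lambda:", round(lam, 2))
print("p-values:", ["%.3g" % p for p in pvals])
print("significant sites:", hits)
```

Computing the upper tail directly keeps very small p-values from rounding to zero, which matters when they are plotted on a -log10 scale.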

PCAngsdTutorial

2019-08-07T10:52:35Z

Sonia4: /* Infer selection along the genome */


PCAngsdTutorial

2019-08-07T10:50:26Z

Sonia4: /* Download the input and sample information files */


PCAngsdTutorial

2019-08-07T10:47:16Z

Sonia4: /* Plot the results */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCAngsd
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $OUT_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines of the input file. The number of lines, indicates the number of loci for which there are GLs plus one (as the command includes the count of the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Lets make a population label file and place it in the output directory

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
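## read the population labels as a factor so they can be used for colours and the legend
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))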
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
##Requires previous installation of the library RcppCNPy

library(RcppCNPy) # Numpy library for R

pop<-read.table("Demo1pop.info",as.is=T)

q <- npyLoad("Demo2PCANGSD_3.admix.Q.npy")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient (negative values are allowed).
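The join-and-sort step can be tried on toy data first. In the sketch below the file names, population labels and coefficients are all made up, and the estimates are assumed to be available as plain text with one value per line. If the PCAngsd output is a binary NumPy file, as the .npy extension suggests, it would need to be converted to text before paste can be used.

```shell
# Toy inputs: two-column population file and one estimate per line (made-up values)
tmp=$(mktemp -d)
printf 'YRI ind1\nCEU ind2\nCHB ind3\n' > "$tmp/toy.pop.info"
printf '0.12\n-0.03\n0.004\n' > "$tmp/toy.inbreed.txt"

# paste joins the files line by line; sort -k3g orders by the third field as a
# general number, so negative values and decimals are ranked correctly
sorted=$(paste "$tmp/toy.pop.info" "$tmp/toy.inbreed.txt" | LC_ALL=C sort -k3g)
echo "$sorted"

# Population label of the individual with the lowest estimate
lowest=$(echo "$sorted" | head -n 1 | awk '{ print $1 }')
echo "lowest estimate: $lowest"
rm -r "$tmp"
```

The -g key is used instead of -n because it also handles values written in scientific notation.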

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $OUT_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
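## read the sample information with POP as a factor for colours and the legend
ID<-read.table("Demo2sample.info",header=TRUE,stringsAsFactors=TRUE)

pdf("PCAngsdDemo2_1.pdf")

plot(e$vectors[,1:2],col=ID$POP,xlab="PC1",ylab="PC2")

legend("topleft",fill=seq_along(levels(ID$POP)),legend=levels(ID$POP))
dev.off()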
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components to scan for selection along the genome.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("Demo2PCANGSD_2.selection.gz")

## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, so very small p-values are not rounded to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

pdf("PCAngsdDemo2_2_1.pdf")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")
dev.off()

pdf("PCAngsdDemo2_2_2.pdf")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")
dev.off()

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>
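The R code above reads the sites file with sep="_", which suggests one marker per line in the form chromosome_position. Under that assumption, the sketch below splits a made-up sites file (toy.sites, with invented positions) and counts the sites per chromosome, which can be a quick check before plotting.

```shell
# Toy sites file in the assumed chromosome_position layout (made-up positions)
tmp=$(mktemp -d)
printf 'chr1_1000\nchr1_2500\nchr2_135000000\nchr2_135800000\nchr2_136000000\n' > "$tmp/toy.sites"

# Split each line on "_" and count the sites per chromosome
per_chr=$(awk -F'_' '{ n[$1]++ } END { for (c in n) print c, n[c] }' "$tmp/toy.sites" \
          | LC_ALL=C sort)
echo "$per_chr"

# Number of sites on chr2 only
chr2_count=$(echo "$per_chr" | awk '$1 == "chr2" { print $2 }')
echo "chr2 sites: $chr2_count"
rm -r "$tmp"
```

Chromosome names that themselves contain an underscore would break this simple split, as they would for the R call.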

PCAngsdTutorial

2019-08-07T10:45:55Z

Sonia4: /* Plot the results in R */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=$PWD/Data</code>

<code>OUT_DIR=$PWD/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $OUT_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines equals the number of loci with GLs plus one, because the count includes the header line:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $OUT_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code>cut -f 1 -d " " $OUT_DIR/Demo1pop.info > $OUT_DIR/poplabel</code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
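## read the population labels as a factor so they can be used for colours and the legend
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))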
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
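## read the population labels as a factor so they can be used for colours and the legend
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))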
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
##Requires previous installation of the library RcppCNPy

library(RcppCNPy) # Numpy library for R

pop<-read.table("Demo1pop.info",as.is=T)

q <- npyLoad("Demo2PCANGSD_3.admix.Q.npy")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $OUT_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
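## read the sample information with POP as a factor for colours and the legend
ID<-read.table("Demo2sample.info",header=TRUE,stringsAsFactors=TRUE)

pdf("PCAngsdDemo2_1.pdf")

plot(e$vectors[,1:2],col=ID$POP,xlab="PC1",ylab="PC2")

legend("topleft",fill=seq_along(levels(ID$POP)),legend=levels(ID$POP))
dev.off()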
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components to scan for selection along the genome.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("Demo2PCANGSD_2.selection.gz")

## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, so very small p-values are not rounded to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T10:44:19Z

Sonia4: /* Download the input and population information files */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=$PWD/Data</code>

<code>OUT_DIR=$PWD/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $OUT_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines equals the number of loci with GLs plus one, because the count includes the header line:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $OUT_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code>cut -f 1 -d " " $OUT_DIR/Demo1pop.info > $OUT_DIR/poplabel</code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
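## read the population labels as a factor so they can be used for colours and the legend
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))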
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
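## read the population labels as a factor so they can be used for colours and the legend
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))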
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
##Requires previous installation of the library RcppCNPy

library(RcppCNPy) # Numpy library for R

pop<-read.table("Demo1pop.info",as.is=T)

q <- npyLoad("Demo2PCANGSD_3.admix.Q.npy")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $OUT_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
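## read the sample information with POP as a factor for colours and the legend
ID<-read.table("Demo2sample.info",header=TRUE,stringsAsFactors=TRUE)
plot(e$vectors[,1:2],col=ID$POP,xlab="PC1",ylab="PC2")

legend("topleft",fill=seq_along(levels(ID$POP)),legend=levels(ID$POP))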
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components to scan for selection along the genome.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("Demo2PCANGSD_2.selection.gz")

## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, so very small p-values are not rounded to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T10:44:00Z

Sonia4: /* Download the input and sample information files */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=$PWD/Data</code>

<code>OUT_DIR=$PWD/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines equals the number of loci with GLs plus one, because the count includes the header line:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
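## read the population labels as a factor so they can be used for colours and the legend
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))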
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
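## read the population labels as a factor so they can be used for colours and the legend
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))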
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
##Requires previous installation of the library RcppCNPy

library(RcppCNPy) # Numpy library for R

pop<-read.table("Demo1pop.info",as.is=T)

q <- npyLoad("Demo2PCANGSD_3.admix.Q.npy")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $OUT_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
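## read the sample information with POP as a factor for colours and the legend
ID<-read.table("Demo2sample.info",header=TRUE,stringsAsFactors=TRUE)
plot(e$vectors[,1:2],col=ID$POP,xlab="PC1",ylab="PC2")

legend("topleft",fill=seq_along(levels(ID$POP)),legend=levels(ID$POP))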
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components from the PCA to scan for selection along the genome.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("Demo2PCANGSD_2.selection.gz")

## make QQ plot to QC the test statistics
qqchi(s)

## convert test statistic to p-value (upper tail avoids rounding tiny p-values to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>
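
The lookup of the top site can also be sketched on the command line. This assumes, as the R code above does, that the statistics file is gzipped text with one value per line and that the sites file lists chr_pos names in the same order. The file names and values below are toy data invented for illustration, not PCAngsd output.

```shell
WORK=$(mktemp -d)

# Toy site names (chromosome and position joined by "_") and toy statistics.
printf '%s\n' '2_1000' '2_2000' '2_3000' > "$WORK/toy.sites"
printf '%s\n' '0.8' '31.5' '2.1' | gzip > "$WORK/toy.selection.gz"

# Pair each site with its statistic, split chr and pos on "_",
# sort numerically on the statistic and keep the largest row.
# Note: tr would also split chromosome names that contain "_".
TOP=$(gunzip -c "$WORK/toy.selection.gz" \
  | paste "$WORK/toy.sites" - \
  | tr '_' '\t' \
  | LC_ALL=C sort -k3,3g \
  | tail -n 1)

TOP_CHR=$(echo "$TOP" | cut -f 1)
TOP_POS=$(echo "$TOP" | cut -f 2)
echo "top site: chr$TOP_CHR position $TOP_POS"
```

Sorting on the raw statistic gives the same ranking as sorting on the p-value, because a larger chi-square statistic always corresponds to a smaller p-value.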

PCAngsdTutorial

2019-08-07T10:32:43Z

Sonia4: /* Run PCAngsd */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>
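
Because the variables above hold relative paths, they only resolve from the directory that contains Demo. A minimal sketch of the same setup with absolute paths follows. DEMO_ROOT is a hypothetical name, and a temporary directory stands in for a location of your choice.

```shell
# Hypothetical root for the tutorial; replace with your own location.
DEMO_ROOT=$(mktemp -d)

# -p creates parent directories as needed and does not fail if they exist.
mkdir -p "$DEMO_ROOT/Demo/Data" "$DEMO_ROOT/Demo/Results"

IN_DIR=$DEMO_ROOT/Demo/Data
OUT_DIR=$DEMO_ROOT/Demo/Results

# Report each directory, or complain if it is missing.
for d in "$IN_DIR" "$OUT_DIR"; do
  if [ -d "$d" ]; then echo "ok: $d"; else echo "missing: $d" >&2; fi
done
```

With absolute paths the later commands can be run from any working directory.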

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The line count equals the number of loci with GLs plus one, because the header line is counted too:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
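
To make the column layout concrete, here is a minimal sketch that writes a toy file in the layout described above, then derives the number of individuals and checks the per-individual normalization. The file name toy.beagle.gz, the marker names and the likelihood values are invented for illustration.

```shell
# Toy beagle file: 3 leading columns, then 3 GL columns per individual.
TOY=$(mktemp -d)/toy.beagle.gz
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  marker allele1 allele2 Ind0 Ind0 Ind0 Ind1 Ind1 Ind1 \
  chr1_100 0 2 0.90 0.09 0.01 0.25 0.50 0.25 \
  chr1_200 1 3 0.10 0.30 0.60 0.98 0.01 0.01 | gzip > "$TOY"

# Number of individuals = (columns - 3) / 3, read from the header line.
N_IND=$(gunzip -c "$TOY" | head -n 1 | awk -F'\t' '{ print (NF - 3) / 3 }')

# Number of sites = lines minus the header.
N_SITES=$(gunzip -c "$TOY" | awk 'NR > 1' | wc -l | tr -d ' ')

# Count individual-by-site triplets whose three GLs do not sum to one.
BAD=$(gunzip -c "$TOY" | awk -F'\t' 'NR > 1 {
  for (i = 4; i <= NF; i += 3) {
    s = $i + $(i + 1) + $(i + 2)
    if (s < 0.999 || s > 1.001) bad++
  }
} END { print bad + 0 }')

echo "individuals=$N_IND sites=$N_SITES bad_triplets=$BAD"
```

Replacing "$TOY" with $IN_DIR/Demo1input.gz should run the same checks on the real input, assuming it follows this layout.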

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # labels as a factor, needed for col= and levels()
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # labels as a factor, needed for col= and levels()
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
##Requires previous installation of the library RcppCNPy

library(RcppCNPy) # Numpy library for R

pop<-read.table("Demo1pop.info",as.is=T)

q <- npyLoad("Demo2PCANGSD_3.admix.Q.npy")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (negative values are allowed).
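
Files with the .npy extension are normally binary NumPy arrays, so paste only yields readable columns once the values are available as plain text. A minimal sketch of the join-and-sort step follows, assuming the estimates have already been exported to a text file with one value per line. The file names and values are invented for illustration.

```shell
WORK=$(mktemp -d)

# Toy population file: population label and sample name, space separated.
printf '%s\n' 'POP1 ind1' 'POP1 ind2' 'POP2 ind3' > "$WORK/toy.pop.info"

# Hypothetical plain-text export of the inbreeding estimates, one per line.
printf '%s\n' '0.12' '-0.03' '0.45' > "$WORK/toy.inbreed.txt"

# paste joins the files line by line; sort -k3g orders by the third
# whitespace-separated field as a general numeric value, so negative
# values and scientific notation sort correctly.
SORTED=$(paste "$WORK/toy.pop.info" "$WORK/toy.inbreed.txt" | LC_ALL=C sort -k3g)
echo "$SORTED"

# The sample with the largest estimate ends up on the last line.
LOWEST=$(echo "$SORTED" | head -n 1 | awk '{ print $2 }')
HIGHEST=$(echo "$SORTED" | tail -n 1 | awk '{ print $2 }')
echo "lowest: $LOWEST highest: $HIGHEST"
```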

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italy</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''
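
Before running the analysis it can help to check how many individuals each population contributes. The sketch below assumes only what the R code in this demo implies: the sample file has a header line and a column named POP. The other column, the rows and the file name are invented for illustration.

```shell
WORK=$(mktemp -d)

# Toy sample file with a header line; only the POP column name comes
# from the tutorial, everything else is made up.
printf '%s\n' 'ID POP' 's1 CEU' 's2 CEU' 's3 TSI' > "$WORK/toy.sample.info"

# Locate the POP column from the header, then count rows per population.
COUNTS=$(awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "POP") c = i; next }
              { n[$c]++ }
              END { for (p in n) print p, n[p] }' "$WORK/toy.sample.info" | LC_ALL=C sort)
echo "$COUNTS"

# Total number of individuals = lines minus the header.
N=$(awk 'NR > 1' "$WORK/toy.sample.info" | wc -l | tr -d ' ')
echo "N=$N"
```

Looking the column up by name avoids hard-coding its position, which is not stated in the tutorial.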

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",header=TRUE,stringsAsFactors=TRUE) # POP must be a factor for col= and levels()
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components from the PCA to scan for selection along the genome.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("Demo2PCANGSD_2.selection.gz")

## make QQ plot to QC the test statistics
qqchi(s)

## convert test statistic to p-value (upper tail avoids rounding tiny p-values to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T10:22:21Z

Sonia4: /* Admixture based on 1st two PC */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The line count equals the number of loci with GLs plus one, because the header line is counted too:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # labels as a factor, needed for col= and levels()
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # labels as a factor, needed for col= and levels()
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
##Requires previous installation of the library RcppCNPy

library(RcppCNPy) # Numpy library for R

pop<-read.table("Demo1pop.info",as.is=T)

q <- npyLoad("Demo2PCANGSD_3.admix.Q.npy")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italy</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code> wc $IN_DIR/Demo2sample.info </code>

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",header=TRUE,stringsAsFactors=TRUE) # POP must be a factor for col= and levels()
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components from the PCA to scan for selection along the genome.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("Demo2PCANGSD_2.selection.gz")

## make QQ plot to QC the test statistics
qqchi(s)

## convert test statistic to p-value (upper tail avoids rounding tiny p-values to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T10:22:09Z

Sonia4: /* Admixture based on 1st two PC */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The line count equals the number of loci with GLs plus one, because the header line is counted too:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # labels as a factor, needed for col= and levels()
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # labels as a factor, needed for col= and levels()
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
##Requires previous installation of the library RcppCNPy

library(RcppCNPy) # Numpy library for R

pop<-read.table("Demo1pop.info",as.is=T)

q <- npyLoad("Demo2PCANGSD_3.admix.Q.npy")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italy</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code> wc $IN_DIR/Demo2sample.info </code>

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",header=TRUE,stringsAsFactors=TRUE) # POP must be a factor for col= and levels()
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components from the PCA to scan for selection along the genome.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("Demo2PCANGSD_2.selection.gz")

## make QQ plot to QC the test statistics
qqchi(s)

## convert test statistic to p-value (upper tail avoids rounding tiny p-values to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T10:18:33Z

Sonia4: /* Inbreeding in the admixed individuals */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The line count equals the number of loci with GLs plus one, because the header line is counted too:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # factor labels are needed for col= and levels() in R >= 4.0
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # factor labels are needed for col= and levels() in R >= 4.0
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",as.is=T)

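## .npy is a binary NumPy file, so read.table cannot parse it; npyLoad from the RcppCNPy package can
library(RcppCNPy)
q<-npyLoad("Demo2PCANGSD_3.admix.Q.npy")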

#old NAME: "Demo2PCANGSD_3.K3.a50.0.qopt"

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>
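The ordering step in the R code above (<code>order</code>, then <code>cumsum</code> for the separator lines) can be followed more easily on a tiny example. The Python sketch below uses invented labels and admixture proportions for five individuals and K = 2; none of these values come from the demo data.

```python
# Hypothetical population labels and admixture proportions (one row per individual).
labels = ["popB", "popA", "popB", "popA", "popC"]
q = [[0.10, 0.90], [0.95, 0.05], [0.20, 0.80], [0.85, 0.15], [0.50, 0.50]]

# Admixture proportions of each individual should sum to one.
for row in q:
    assert abs(sum(row) - 1.0) < 1e-9

# Like order(pop[,1]) in R: indices that sort the individuals by label.
order = sorted(range(len(labels)), key=lambda i: labels[i])
ordered_labels = [labels[i] for i in order]
ordered_q = [q[i] for i in order]

# Right-hand edge of each population block, like cumsum() of the group sizes.
boundaries = []
for position, label in enumerate(ordered_labels, start=1):
    is_last = position == len(ordered_labels)
    if is_last or ordered_labels[position] != label:
        boundaries.append(position)

print(ordered_labels)
print(boundaries)
```

The values in <code>boundaries</code> are the x positions at which the bar plot draws a vertical line between populations.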

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed.npy | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient, which is allowed to be negative.
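The join-and-sort step can be sketched in Python as follows. The rows and estimates below are invented placeholders. Note also that a <code>.npy</code> suffix normally indicates a binary NumPy file; reading one would need NumPy's <code>numpy.load</code>, which is left out here so that the sketch runs with the standard library alone.

```python
# Hypothetical stand-ins for the rows of Demo1pop.info and the per-individual estimates.
info_rows = [("popA", "ind1"), ("popA", "ind2"), ("popB", "ind3"), ("popB", "ind4")]
inbreeding = [0.02, -0.01, 0.31, 0.007]

if len(info_rows) != len(inbreeding):
    raise ValueError("one inbreeding estimate is expected per individual")

# Like paste: put the estimate next to the matching label row.
joined = [(pop, sample, f) for (pop, sample), f in zip(info_rows, inbreeding)]

# Like sort -k3g: numeric sort on the third column, negative values first.
joined.sort(key=lambda row: row[2])

for pop, sample, f in joined:
    print(pop, sample, f"{f:.3f}", sep="\t")
```

Individuals at the bottom of the sorted output are those with the highest estimated inbreeding coefficient.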

==== Inbreeding with individual allele frequencies ====

Now let's estimate the inbreeding coefficient of the samples using the individual allele frequencies predicted by the PCA:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code> wc $IN_DIR/Demo2sample.info </code>

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",header=TRUE,stringsAsFactors=TRUE) # POP must be a factor for col= and levels()
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components to scan for selection along the genome:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("Demo2PCANGSD_2.selection.gz")

## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail; avoids 1-p rounding to zero for very large statistics)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>
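The two quantities used in the R code above, the inflation factor lambda and the p-value of a chi-square statistic with one degree of freedom, can be written out explicitly. The Python sketch below uses only the standard library; the function names and the toy statistics are invented for illustration. It relies on two standard identities: the upper tail of a chi-square(1) variable equals <code>erfc(sqrt(x/2))</code>, and its median is the square of the 0.75 quantile of the standard normal distribution.

```python
import math
from statistics import NormalDist, median


def chi2_1df_pvalue(stat):
    """Upper-tail probability of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def inflation_factor(stats):
    """Median of the observed statistics divided by the chi-square(1) median."""
    expected_median = NormalDist().inv_cdf(0.75) ** 2  # same value as qchisq(0.5, 1) in R
    return median(stats) / expected_median


# Toy test statistics (hypothetical), one per site.
toy_stats = [0.10, 0.45, 0.50, 1.20, 3.84, 35.0]
# Site identifiers of the form "chr_pos", split on "_" like sep="_" in read.table.
toy_sites = ["2_1000", "2_1500", "2_2100", "2_2600", "2_3300", "2_4000"]

pvals = [chi2_1df_pvalue(s) for s in toy_stats]
lam = round(inflation_factor(toy_stats), 2)
positions = [int(site.split("_")[1]) for site in toy_sites]

# Position of the most significant site, like p$pos[which.max(s)] in R.
top = max(range(len(toy_stats)), key=lambda i: toy_stats[i])
print("lambda =", lam)
print("top site position:", positions[top], "p-value:", pvals[top])
```

A lambda close to 1 suggests that the bulk of the test statistics follows the expected chi-square distribution, which is what the QQ plot checks visually.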

PCAngsdTutorial

2019-08-07T10:18:21Z

Sonia4: /* Inbreeding with individual allele frequencies */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The line count equals the number of loci with GLs plus one, because the header line is also counted:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",as.is=T)

q<-read.table("Demo2PCANGSD_3.admix.Q.npy")

#old NAME: "Demo2PCANGSD_3.K3.a50.0.qopt"

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient, which is allowed to be negative.

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed.npy | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code> wc $IN_DIR/Demo2sample.info </code>

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components to scan for selection along the genome:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("Demo2PCANGSD_2.selection.gz")

## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value
pval<-1-pchisq(s,1)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colC=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T10:11:16Z

Sonia4: /* Infer selection along the genome */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The line count equals the number of loci with GLs plus one, because the header line is also counted:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",as.is=T)

q<-read.table("Demo2PCANGSD_3.admix.Q.npy")

#old NAME: "Demo2PCANGSD_3.K3.a50.0.qopt"

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient, which is allowed to be negative.

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code> wc $IN_DIR/Demo2sample.info </code>

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components to scan for selection along the genome:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_2 -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("Demo2PCANGSD_2.selection.gz")

## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value
pval<-1-pchisq(s,1)

## read positions (hg38)
p<-read.table("Demo2PCANGSD_2.sites",colC=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T10:09:51Z

Sonia4: /* Plot the results in R */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The line count equals the number of loci with GLs plus one, because the header line is also counted:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",as.is=T)

q<-read.table("Demo2PCANGSD_3.admix.Q.npy")

#old NAME: "Demo2PCANGSD_3.K3.a50.0.qopt"

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient, which is allowed to be negative.

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code> wc $IN_DIR/Demo2sample.info </code>

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components from the PCA to infer selection along the genome.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, which avoids rounding to 0 for large statistics)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>
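The strongest signal can also be located from the shell. This is a rough sketch rather than a PCAngsd feature: `top_site` is a hypothetical helper, and it assumes that the sites file holds one `chr_pos` entry per line and that the gzipped statistics file holds one value per line in the same order.

```shell
# Hypothetical helper, not part of PCAngsd.
top_site() {
    # $1: sites file, one "chr_pos" entry per line
    # $2: gzipped test statistics, one value per line, same order as $1
    gunzip -c "$2" | paste "$1" - | LC_ALL=C sort -k2,2gr | head -n 1
}

# Example call with the output prefix used above:
# top_site EUsmall.sites EUsmall.selection.gz
```

The function prints the site name and its test statistic, separated by a tab.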

PCAngsdTutorial

2019-08-07T10:09:22Z

Sonia4: /* Inbreeding with individual allele frequencies */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=$PWD/Data</code>

<code>OUT_DIR=$PWD/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the lines of the input file. The line count equals the number of loci with GLs plus one, because the header line is also counted:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
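Given the layout described above (three leading columns, then three GL columns per individual), both dimensions of the file can be derived in one pass. A minimal sketch, with the hypothetical helper name `beagle_dims`:

```shell
# Hypothetical helper, not part of PCAngsd.
beagle_dims() {
    # $1: gzipped Beagle genotype likelihood file with one header line
    # prints "<number of sites> <number of individuals>"
    gunzip -c "$1" |
        awk 'NR == 1 { ind = (NF - 3) / 3 } END { print NR - 1, ind }'
}

# Example call:
# beagle_dims $IN_DIR/Demo1input.gz
```

The number of individuals should match the number of lines in the population information file.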

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",as.is=T)

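## .npy is a binary NumPy file, so read.table cannot parse it; npyLoad comes from the RcppCNPy package
library(RcppCNPy)
q<-npyLoad("Demo2PCANGSD_3.admix.Q.npy")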

#old NAME: "Demo2PCANGSD_3.K3.a50.0.qopt"

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient; negative values are allowed.

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

To detect very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code> wc $IN_DIR/Demo2sample.info </code>

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components from the PCA to infer selection along the genome.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, which avoids rounding to 0 for large statistics)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T10:09:06Z

Sonia4: /* Inbreeding with individual allele frequencies */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=$PWD/Data</code>

<code>OUT_DIR=$PWD/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>
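If `ls` reports an error, one of the variables is probably unset or points to the wrong place. As a small sketch (the helper name `check_path` is hypothetical), each path can be tested explicitly:

```shell
# Hypothetical helper for testing the variables set above.
check_path() {
    # $1: path to test; prints a status line and returns non-zero when it is missing
    if [ -e "$1" ]; then
        echo "ok: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Example calls:
# check_path "$PCANGSD/pcangsd.py"
# check_path "$IN_DIR"
# check_path "$OUT_DIR"
```

An unset variable expands to an empty string, which the function reports as missing.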

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the lines of the input file. The line count equals the number of loci with GLs plus one, because the header line is also counted:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",as.is=T)

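## .npy is a binary NumPy file, so read.table cannot parse it; npyLoad comes from the RcppCNPy package
library(RcppCNPy)
q<-npyLoad("Demo2PCANGSD_3.admix.Q.npy")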

#old NAME: "Demo2PCANGSD_3.K3.a50.0.qopt"

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient; negative values are allowed.

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_5 -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_5.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

To detect very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code> wc $IN_DIR/Demo2sample.info </code>

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components from the PCA to infer selection along the genome.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, which avoids rounding to 0 for large statistics)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T10:07:48Z

Sonia4: /* Run PCAngsd */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=$PWD/Data</code>

<code>OUT_DIR=$PWD/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the lines of the input file. The line count equals the number of loci with GLs plus one, because the header line is also counted:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",as.is=T)

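## .npy is a binary NumPy file, so read.table cannot parse it; npyLoad comes from the RcppCNPy package
library(RcppCNPy)
q<-npyLoad("Demo2PCANGSD_3.admix.Q.npy")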

#old NAME: "Demo2PCANGSD_3.K3.a50.0.qopt"

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo1PCANGSD_4.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient; negative values are allowed.

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

To detect very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code> wc $IN_DIR/Demo2sample.info </code>

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o $OUT_DIR/Demo2PCANGSD_1 -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("Demo2PCANGSD_1.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components from the PCA to infer selection along the genome.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, which avoids rounding to 0 for large statistics)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T10:06:52Z

Sonia4: /* Inbreeding in the admixed individuals */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=$PWD/Data</code>

<code>OUT_DIR=$PWD/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the lines of the input file. The line count equals the number of loci with GLs plus one, because the header line is also counted:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
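The column layout described above can be checked with a few lines of shell. The following is a minimal sketch, not part of PCAngsd: it builds a tiny made-up beagle file (the marker names, sample names and values are invented for illustration), derives the number of individuals from the header, and counts genotype likelihood triplets that do not sum to one.

```shell
# Build a tiny, made-up beagle file (2 individuals, 2 sites) for illustration.
tmp_beagle=$(mktemp)
printf 'marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n' > "$tmp_beagle"
printf 'chr1_100\t0\t2\t0.90\t0.09\t0.01\t0.33\t0.33\t0.34\n' >> "$tmp_beagle"
printf 'chr1_200\t1\t3\t0.05\t0.15\t0.80\t0.60\t0.30\t0.10\n' >> "$tmp_beagle"

# Number of individuals = (number of columns - 3) / 3, read from the header.
n_ind=$(head -n 1 "$tmp_beagle" | awk -F'\t' '{ print (NF - 3) / 3 }')

# Count GL triplets whose sum deviates from 1 by more than a small tolerance.
bad_triplets=$(awk -F'\t' 'NR > 1 {
    for (i = 4; i <= NF; i += 3) {
        s = $i + $(i + 1) + $(i + 2)
        if (s < 0.999 || s > 1.001) bad++
    }
} END { print bad + 0 }' "$tmp_beagle")

echo "individuals: $n_ind, triplets not summing to one: $bad_triplets"
rm -f "$tmp_beagle"
```

To run the same check on the real input, replace the temporary file with the output of <code>gunzip -c $IN_DIR/Demo1input.gz</code> piped into the two awk commands. This assumes the file is tab-separated, as the <code>cut -f</code> command above implies.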

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>
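Before plotting, it can be worth confirming that the covariance matrix has one row and one column per individual and is symmetric. The sketch below assumes the <code>.cov</code> file is plain whitespace-separated text (which is what reading it with <code>read.table</code> implies). It uses a made-up 3 x 3 matrix in place of a real output file.

```shell
# Made-up 3 x 3 matrix standing in for a real .cov file.
tmp_cov=$(mktemp)
printf '0.50 -0.20 -0.30\n-0.20 0.40 -0.20\n-0.30 -0.20 0.50\n' > "$tmp_cov"

# Store every entry, then test for a square shape and for symmetry.
cov_check=$(awk '{
    if (NR == 1) ncol = NF
    if (NF != ncol) ragged = 1
    for (j = 1; j <= NF; j++) m[NR, j] = $j
} END {
    if (ragged || NR != ncol) { print "not square"; exit }
    for (i = 1; i <= NR; i++)
        for (j = i + 1; j <= NR; j++) {
            d = m[i, j] - m[j, i]
            if (d < -1e-8 || d > 1e-8) { print "not symmetric"; exit }
        }
    print "ok " NR
}' "$tmp_cov")

echo "$cov_check"
rm -f "$tmp_cov"
```

For the real output, point the awk command at <code>$OUT_DIR/Demo1PCANGSD_1.cov</code>. The number after "ok" should match the number of individuals in the beagle file.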

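Plot the results in R (start R from the <code>Demo</code> directory so that the relative paths below resolve):

<pre>
#open R
## read population labels as a factor so they can be used for colours and the legend
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Results/Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=as.integer(pop[,1]),xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))
dev.off()
## close R
</pre>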

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_2 -iter 0</code>

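Plot the results in R (start R from the <code>Demo</code> directory so that the relative paths below resolve):

<pre>
#open R
## read population labels as a factor so they can be used for colours and the legend
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Results/Demo1PCANGSD_2.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=as.integer(pop[,1]),xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))
dev.off()
## close R
</pre>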

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD_3 -admix -admix_alpha 50</code>

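Plot the results in R (start R from the <code>Demo</code> directory). The admixture proportions are saved as a binary NumPy file, which <code>read.table</code> cannot parse; the RcppCNPy package is one way to load it:

<pre>
#open R
library(RcppCNPy) # provides npyLoad(); install.packages("RcppCNPy") if missing
pop<-read.table("Data/Demo1pop.info",as.is=T)

q<-npyLoad("Results/Demo2PCANGSD_3.admix.Q.npy")

#old NAME: "Demo2PCANGSD_3.K3.a50.0.qopt"

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>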

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_4 -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results:

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo1PCANGSD_4.inbreed | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (negative values are allowed).
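The <code>-k3g</code> option sorts on the third field as a general numeric value, so negative estimates and values in scientific notation are ordered correctly. The sketch below illustrates this with made-up rows (the population labels, sample names and values are invented, not taken from the demo data).

```shell
# Made-up rows: population, sample ID, inbreeding estimate (tab-separated,
# as produced by paste).
sorted=$(printf 'popC\tind3\t0.12\npopA\tind1\t-0.03\npopB\tind2\t1e-05\n' |
    LC_ALL=C sort -k3g)

# First and last sample after sorting from lowest to highest estimate.
lowest=$(printf '%s\n' "$sorted" | head -n 1 | cut -f 2)
highest=$(printf '%s\n' "$sorted" | tail -n 1 | cut -f 2)

echo "$sorted"
echo "lowest: $lowest, highest: $highest"
```

A plain <code>sort -k3n</code> would not interpret <code>1e-05</code> as a small positive number, which is why the general numeric flag is used. Note that <code>-g</code> is an extension available in GNU and BSD sort rather than a POSIX option.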

==== Inbreeding with individual allele frequencies ====

Now let's estimate the inbreeding coefficient of the samples using the individual allele frequencies predicted by the PCA:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results:

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

To detect very recent selection, we can look within a group of closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italy</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code> wc $IN_DIR/Demo2sample.info </code>

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4</code>

==== Plot the results in R ====

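<pre>
## Open R (from the directory where EUsmall.cov was written)

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
## adjust the path if Demo2sample.info is not in Data/ under the current directory
ID<-read.table("Data/Demo2sample.info",head=T,stringsAsFactors=TRUE)
plot(e$vectors[,1:2],col=as.integer(ID$POP),xlab="PC1",ylab="PC2")

legend("topleft",fill=seq_along(levels(ID$POP)),legend=levels(ID$POP))
</pre>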

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components from the PCA to scan for selection along the genome:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, avoids rounding of 1-pchisq to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>
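The <code>lambda</code> value printed by the QQ plot function above is the median of the test statistics divided by the median of a chi-square distribution with one degree of freedom (about 0.4549364). As a rough sketch, the same number can be computed on the command line. The statistics below are made up for illustration and do not come from the demo data.

```shell
# Made-up test statistics, one per line (stand-ins for per-site values).
stats='0.10
2.30
0.45
0.02
1.20'

# Median of the statistics divided by the chi-square (1 df) median, 0.4549364,
# which is the value of qchisq(0.5, 1) used in the R function above.
lambda=$(printf '%s\n' "$stats" | LC_ALL=C sort -g | LC_ALL=C awk '
    { x[NR] = $1 }
    END {
        med = (NR % 2) ? x[(NR + 1) / 2] : (x[NR / 2] + x[NR / 2 + 1]) / 2
        printf "%.2f\n", med / 0.4549364
    }')

echo "lambda = $lambda"
```

A value close to 1 suggests the statistics follow the expected distribution at most sites. Assuming the selection output is a gzipped text file with one statistic per line (as the <code>scan</code> call implies), the real data can be fed in with <code>gunzip -c EUsmall.selection.gz</code> in place of the <code>printf</code>.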

PCAngsdTutorial

2019-08-07T10:05:49Z

Sonia4: /* Admixture based on 1st two PC */


PCAngsdTutorial

2019-08-07T10:05:19Z

Sonia4: /* Without Estimating Individual Allele Frequencies */


PCAngsdTutorial

2019-08-07T10:04:59Z

Sonia4: /* Estimating Individual Allele Frequencies */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCAngsd
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines of the input file. The number of lines, indicates the number of loci for which there are GLs plus one (as the command includes the count of the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Lets make a population label file and place it in the output directory

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD_1</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD_1.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",as.is=T)

q<-read.table("Demo3PCANGSD.admix.Q.npy")

#old NAME: "Demo3PCANGSD.K3.a50.0.qopt"

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (allowing for negative)

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very resent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italien</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code> wc $IN_DIR/Demo2sample.info </code>

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components from the PCA to scan for selection along the genome:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value
pval<-1-pchisq(s,1)

## read positions (hg38)
p<-read.table("EUsmall.sites",colC=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T10:00:12Z

Sonia4: /* Admixture based on 1st two PC */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines is the number of loci for which there are GLs plus one (the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
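As a quick sanity check of the layout described above, the following sketch derives the number of individuals from the header and counts GL triplets that do not sum to roughly one. The function name `beagle_check` and the 0.01 tolerance are illustrative assumptions, not part of PCAngsd; the sketch assumes a whitespace-delimited beagle file with a single header line.

```shell
# Hypothetical helper (not part of PCAngsd): summarise a gzipped beagle file.
beagle_check() {
  gunzip -c "$1" | LC_ALL=C awk '
    NR == 1 { n = (NF - 3) / 3; next }   # header: 3 marker columns + 3 GL columns per individual
    {
      sites++
      for (i = 0; i < n; i++) {
        s = $(4 + 3*i) + $(5 + 3*i) + $(6 + 3*i)
        if (s < 0.99 || s > 1.01) bad++  # 0.01 tolerance is an arbitrary choice
      }
    }
    END { printf "individuals=%d sites=%d bad_triplets=%d\n", n, sites, bad }'
}
```

A possible call: `beagle_check $IN_DIR/Demo1input.gz`. The reported number of sites should equal the line count from the command above minus one.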

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",as.is=T)

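## the .npy output is a binary NumPy array, so read.table cannot parse it;
## npyLoad from the RcppCNPy package (must be installed) reads it directly
library(RcppCNPy)
q<-npyLoad("Demo3PCANGSD.admix.Q.npy")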

#old NAME: "Demo3PCANGSD.K3.a50.0.qopt"

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient; negative values are allowed.

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code> wc $IN_DIR/Demo2sample.info </code>

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components from the PCA to scan for selection along the genome:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value
pval<-1-pchisq(s,1)

## read positions (hg38)
p<-read.table("EUsmall.sites",colC=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T09:57:55Z

Sonia4: /* Admixture based on 1st two PC */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines is the number of loci for which there are GLs plus one (the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>
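The labels are later matched to the PCAngsd output by row order, so it can be worth confirming that the label file has one line per individual in the beagle file. The sketch below is one way to do that; `check_sample_count` is a hypothetical helper name, and it assumes a beagle file with three leading marker columns followed by three GL columns per individual.

```shell
# Hypothetical helper (not part of PCAngsd): compare the number of individuals
# in a gzipped beagle file ($1) with the number of lines in a label file ($2).
check_sample_count() {
  n_beagle=$(gunzip -c "$1" | awk 'NR == 1 { print (NF - 3) / 3 }')
  n_labels=$(awk 'END { print NR }' "$2")
  if [ "$n_beagle" -eq "$n_labels" ]; then
    echo "match: $n_labels individuals"
  else
    echo "mismatch: beagle=$n_beagle labels=$n_labels"
    return 1
  fi
}
```

A possible call: `check_sample_count $IN_DIR/Demo1input.gz $OUT_DIR/poplabel`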

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",as.is=T)

q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient; negative values are allowed.

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code> wc $IN_DIR/Demo2sample.info </code>

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components from the PCA to scan for selection along the genome:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value
pval<-1-pchisq(s,1)

## read positions (hg38)
p<-read.table("EUsmall.sites",colC=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>
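The inflation factor that the QQ-plot function prints (the median statistic divided by the median of a chi-square distribution with 1 degree of freedom) can also be computed without opening R. The sketch below assumes the statistics file contains only whitespace-separated numbers; `inflation_lambda` is a hypothetical helper name, and 0.4549364 is the value of `qchisq(0.5,1)`.

```shell
# Hypothetical helper (not part of PCAngsd): genomic inflation factor from
# chi-square (1 df) statistics read from stdin.
inflation_lambda() {
  tr -s ' \t' '\n' | grep -v '^$' | LC_ALL=C sort -g | LC_ALL=C awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      med = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
      printf "%.2f\n", med / 0.4549364   # 0.4549364 = qchisq(0.5, 1)
    }'
}
```

A possible call: `gunzip -c EUsmall.selection.gz | inflation_lambda`. A value close to 1 suggests the test statistics follow the expected null distribution.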

PCAngsdTutorial

2019-08-07T09:55:58Z

Sonia4: /* Without Estimating Individual Allele Frequencies */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines is the number of loci for which there are GLs plus one (the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",as.is=T)

###########CHECK NAME OF FILE###################
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient; negative values are allowed.

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code> wc $IN_DIR/Demo2sample.info </code>

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components from the PCA to scan for selection along the genome:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value
pval<-1-pchisq(s,1)

## read positions (hg38)
p<-read.table("EUsmall.sites",colC=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>
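To list the strongest signals outside R, the site names can be pasted next to the statistics and sorted. This is only a sketch: it assumes the sites file and the statistics file each hold one entry per line in the same order (as the R code above also assumes), and `top_hits` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of PCAngsd): print the sites with the largest
# test statistics. $1 = sites file, $2 = gzipped statistics, $3 = number of hits.
top_hits() {
  gunzip -c "$2" | paste "$1" - |
    LC_ALL=C sort -t "$(printf '\t')" -k2,2gr | head -n "${3:-10}"
}
```

A possible call: `top_hits EUsmall.sites EUsmall.selection.gz 5`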

PCAngsdTutorial

2019-08-07T09:50:11Z

Sonia4: /* Estimating Individual Allele Frequencies */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines is the number of loci for which there are GLs plus one (the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
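The layout described above can be checked on a small example. The sketch below builds a toy tab-separated beagle-style file (the values, marker names and the file name <code>toy.beagle</code> are made up for illustration), then derives the number of individuals and sites and verifies that each individual's three GLs sum to one:

```shell
# Toy beagle-style file (hypothetical values): 3 leading columns + 3 GLs per individual.
tmp=$(mktemp -d)
printf 'marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n' > "$tmp/toy.beagle"
printf 'chr1_100\t0\t2\t0.80\t0.15\t0.05\t0.10\t0.60\t0.30\n' >> "$tmp/toy.beagle"
printf 'chr1_200\t1\t3\t0.25\t0.50\t0.25\t0.90\t0.09\t0.01\n' >> "$tmp/toy.beagle"
gzip "$tmp/toy.beagle"

# Number of individuals = (columns - 3) / 3, read from the header line.
n_ind=$(gunzip -c "$tmp/toy.beagle.gz" | awk -F'\t' 'NR==1 {print (NF-3)/3}')

# Number of sites = number of lines minus the header.
n_sites=$(gunzip -c "$tmp/toy.beagle.gz" | awk 'END {print NR-1}')

# Count GL triples whose sum deviates from 1 by more than a small tolerance.
bad=$(gunzip -c "$tmp/toy.beagle.gz" | awk -F'\t' 'NR>1 {
  for (i = 4; i <= NF; i += 3) {
    s = $i + $(i+1) + $(i+2)
    if (s < 0.999 || s > 1.001) n++
  }
} END {print n+0}')

echo "individuals=$n_ind sites=$n_sites bad_triples=$bad"
```

The same three pipelines should work on <code>$IN_DIR/Demo1input.gz</code>, assuming that file is tab-separated like the toy example.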

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>
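To see what the two commands above produce, here is a small sketch on a made-up, space-separated population file. The file name <code>toy.info</code>, the labels and the sample IDs are hypothetical, and the real <code>Demo1pop.info</code> may contain different columns:

```shell
tmp=$(mktemp -d)

# Hypothetical population file: population label, then sample ID.
printf 'YRI smpl1\nCEU smpl2\nYRI smpl3\nCHB smpl4\nCEU smpl5\nYRI smpl6\n' > "$tmp/toy.info"

# Count samples per population label (first column) and print as label=count.
counts=$(cut -f 1 -d " " "$tmp/toy.info" | sort | uniq -c | awk '{print $2 "=" $1}' | paste -sd ' ' -)
echo "$counts"

# One label per line, in the original sample order, for use when plotting.
cut -f 1 -d " " "$tmp/toy.info" > "$tmp/poplabel"
n_labels=$(wc -l < "$tmp/poplabel" | tr -d ' ')
echo "labels written: $n_labels"
```

The label file keeps the sample order of the input, so its line count should match the number of individuals in the beagle file.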

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # factors are needed for col= and levels() in R >= 4.0
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",stringsAsFactors=TRUE) # factors are needed for col= and levels() in R >= 4.0
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Demo1pop.info",as.is=T)

###########CHECK NAME OF FILE###################
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's estimate the inbreeding coefficient of the samples using the individual allele frequencies predicted by the PCA:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste Demo1pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>
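The join-and-sort step can be tried on toy data first. This sketch assumes a two-column population file and an inbreeding file with one value per individual in the same order. Both toy files and their values are made up, and the actual layout of the <code>.inbreed</code> output should be checked before relying on it:

```shell
tmp=$(mktemp -d)

# Hypothetical inputs: population label and sample ID, plus one estimate per line.
printf 'CEU smpl1\nYRI smpl2\nCHB smpl3\n' > "$tmp/toy.info"
printf '0.12\n-0.03\n0.05\n' > "$tmp/toy.inbreed"

# paste joins the files line by line; sort -k3g orders by the third field
# as a general number, so negative values sort first.
sorted=$(paste "$tmp/toy.info" "$tmp/toy.inbreed" | LC_ALL=C sort -k3g)
printf '%s\n' "$sorted"

# Sample with the lowest estimate (first line after sorting).
lowest=$(printf '%s\n' "$sorted" | head -n 1 | awk '{print $2}')
echo "lowest estimate: $lowest"
```

Adding <code>-r</code> to the sort options puts the most inbred samples at the top.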

=== Demo 2: Selection ===

For very recent selection, we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code> wc $IN_DIR/Demo2sample.info </code>

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T,stringsAsFactors=TRUE) # factors are needed for col= and levels() in R >= 4.0
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components to infer selection along the genome:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value
pval<-1-pchisq(s,1)

## read positions (hg38)
p<-read.table("EUsmall.sites",colC=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T09:48:18Z

Sonia4: /* Infer selection along the genome */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The line count equals the number of loci with GLs plus one, because the header line is also counted:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info",as.is=T)

###########CHECK NAME OF FILE###################
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection, we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code> wc $IN_DIR/Demo2sample.info </code>

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components to infer selection along the genome:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value
pval<-1-pchisq(s,1)

## read positions (hg38)
p<-read.table("EUsmall.sites",colC=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T09:47:47Z

Sonia4: /* Run PCAngsd */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The line count equals the number of loci with GLs plus one, because the header line is also counted:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info",as.is=T)

###########CHECK NAME OF FILE###################
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection, we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

==== Run PCAngsd ====

The objective is to show the differences among individuals.

<code> wc $IN_DIR/Demo2sample.info </code>

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4</code>

==== Plot the results in R ====

<pre>
## Open R

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

=== Infer selection along the genome ===

Now let's use the principal components to infer selection along the genome:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value
pval<-1-pchisq(s,1)

## read positions (hg38)
p<-read.table("EUsmall.sites",colC=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T09:43:50Z

Sonia4: /* Inbreeding with individual allele frequencies */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The line count equals the number of loci with GLs plus one, because the header line is also counted:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info",as.is=T)

###########CHECK NAME OF FILE###################
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection, we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

=== Run PCAngsd ===

The objective is to show the differences among individuals.

<pre>
wc Demo2sample.info
N=424 #one line for header

#run PCANGSD
python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>
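Rather than typing the sample count by hand, it can be derived from the sample file. This is a minimal sketch, assuming the file has exactly one header line followed by one line per individual; the toy file and its <code>ID</code> column name are hypothetical.

```shell
tmp=$(mktemp -d)
# Hypothetical sample file: one header line, then one line per individual.
printf 'ID POP\nind1 CEU\nind2 GBR\nind3 TSI\n' > "$tmp/sample.info"

# Subtract the header line from the line count to get the number of individuals.
N=$(( $(wc -l < "$tmp/sample.info") - 1 ))
echo "N=$N"

rm -r "$tmp"
```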

Plot the results in R

<pre>
## Open R

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T,stringsAsFactors=TRUE) # POP must be a factor for col= and levels()
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

==== Infer selection along the genome ====

Now let's use the principal components to scan for selection along the genome:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail directly, so large statistics do not round to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>
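The R code above pairs each test statistic with a site by row number, so it can help to check on the command line that both files have the same number of rows. The following is a minimal sketch on toy files, assuming one statistic per line and site IDs of the form <code>chr_pos</code>; the file names and values are hypothetical.

```shell
tmp=$(mktemp -d)
# Hypothetical site IDs in the chr_pos form that the R code splits on "_".
printf '1_1000\n2_2000\n2_3000\n' > "$tmp/toy.sites"
# Hypothetical test statistics, one per site, gzip-compressed.
printf '0.5\n31.2\n3.1\n' | gzip > "$tmp/toy.selection.gz"

# The two files are matched by row, so their line counts must agree.
n_sites=$(( $(wc -l < "$tmp/toy.sites") ))
n_stats=$(( $(gunzip -c "$tmp/toy.selection.gz" | wc -l) ))
if [ "$n_sites" -eq "$n_stats" ]; then
  echo "counts match: $n_sites"
fi

# Put chromosome, position and statistic side by side, then keep the row
# with the largest statistic (general numeric sort, descending).
top=$(gunzip -c "$tmp/toy.selection.gz" | paste "$tmp/toy.sites" - \
      | tr '_' '\t' | LC_ALL=C sort -k3,3gr | head -n 1)
printf '%s\n' "$top"

rm -r "$tmp"
```

Note that <code>tr</code> replaces every underscore, so this shortcut only works when chromosome names contain none.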

PCAngsdTutorial

2019-08-07T09:43:39Z

Sonia4: /* Inbreeding in the admixed individuals */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the lines of the input file. The line count equals the number of loci with GLs plus one, because the header line is also counted:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
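The statement that the three GL values of each individual sum to one can be checked with <code>awk</code>. This is a minimal sketch on a toy file, assuming tab-separated columns laid out as described above; the file name, individual names, values and the tolerance of 0.001 are hypothetical choices.

```shell
tmp=$(mktemp -d)
# Toy beagle-style file: header, then marker, allele1, allele2 and three GLs per individual.
printf 'marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n' > "$tmp/toy.beagle"
printf '1_100\t0\t2\t0.90\t0.10\t0.00\t0.333\t0.333\t0.334\n' >> "$tmp/toy.beagle"
printf '1_200\t1\t3\t0.25\t0.50\t0.25\t0.00\t0.02\t0.98\n' >> "$tmp/toy.beagle"
gzip "$tmp/toy.beagle"

# Skip the header, walk over each individual's triplet (columns 4-6, 7-9, ...)
# and count the triplets whose sum is not within 0.001 of one.
bad=$(gunzip -c "$tmp/toy.beagle.gz" | awk '
  NR > 1 {
    for (i = 4; i <= NF; i += 3) {
      s = $i + $(i + 1) + $(i + 2)
      if (s < 0.999 || s > 1.001) bad++
    }
  }
  END { print bad + 0 }')
echo "triplets not summing to one: $bad"

rm -r "$tmp"
```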

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info",as.is=T)

###########CHECK NAME OF FILE###################
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join the names and the results, sort by the estimate, and inspect the output:

<code>paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient; negative values are allowed.

==== Inbreeding with individual allele frequencies ====

Now let's estimate the inbreeding coefficient of the samples using the individual allele frequencies predicted by the PCA:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2</code>

Join the names and the results, sort by the estimate, and inspect the output:

<code>paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

To detect very recent selection, we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italy</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

=== Run PCAngsd ===

The objective is to show the differences among individuals.

<pre>
wc Demo2sample.info
N=424 #one line for header

#run PCANGSD
python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>

Plot the results in R

<pre>
## Open R

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T,stringsAsFactors=TRUE) # POP must be a factor for col= and levels()
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

==== Infer selection along the genome ====

Now let's use the principal components to scan for selection along the genome:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail directly, so large statistics do not round to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T09:43:27Z

Sonia4: /* Admixture based on 1st two PC */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the lines of the input file. The line count equals the number of loci with GLs plus one, because the header line is also counted:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
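Because every individual occupies three columns after the first three, the header line also gives the number of individuals. This is a minimal sketch on a toy file, assuming tab-separated columns; the file name and contents are hypothetical.

```shell
tmp=$(mktemp -d)
# Toy beagle-style file with one header line, two individuals and one site.
printf 'marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n1_100\t0\t2\t0.90\t0.10\t0.00\t0.333\t0.333\t0.334\n' \
  | gzip > "$tmp/toy.beagle.gz"

# Individuals: (number of header columns - 3 fixed columns) / 3 GLs each.
n_ind=$(gunzip -c "$tmp/toy.beagle.gz" | head -n 1 | awk '{ print (NF - 3) / 3 }')
# Sites: total line count minus the header line.
n_sites=$(( $(gunzip -c "$tmp/toy.beagle.gz" | wc -l) - 1 ))
echo "individuals: $n_ind, sites: $n_sites"

rm -r "$tmp"
```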

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info",as.is=T)

###########CHECK NAME OF FILE###################
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join the names and the results, sort by the estimate, and inspect the output:

<code>paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient; negative values are allowed.

==== Inbreeding with individual allele frequencies ====

Now let's estimate the inbreeding coefficient of the samples using the individual allele frequencies predicted by the PCA:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2</code>

Join the names and the results, sort by the estimate, and inspect the output:

<code>paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

To detect very recent selection, we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italy</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

=== Run PCAngsd ===

The objective is to show the differences among individuals.

<pre>
wc Demo2sample.info
N=424 #one line for header

#run PCANGSD
python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>

Plot the results in R

<pre>
## Open R

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T,stringsAsFactors=TRUE) # POP must be a factor for col= and levels()
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

==== Infer selection along the genome ====

Now let's use the principal components to scan for selection along the genome:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail directly, so large statistics do not round to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T09:43:16Z

Sonia4: /* Without Estimating Individual Allele Frequencies */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the lines of the input file. The line count equals the number of loci with GLs plus one, because the header line is also counted:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info",as.is=T)

###########CHECK NAME OF FILE###################
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join the names and the results, sort by the estimate, and inspect the output:

<code>paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient; negative values are allowed.

==== Inbreeding with individual allele frequencies ====

Now let's estimate the inbreeding coefficient of the samples using the individual allele frequencies predicted by the PCA:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2</code>

Join the names and the results, sort by the estimate, and inspect the output:

<code>paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

To detect very recent selection, we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italy</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

=== Run PCAngsd ===

The objective is to show the differences among individuals.

<pre>
wc Demo2sample.info
N=424 #one line for header

#run PCANGSD
python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>

Plot the results in R

<pre>
## Open R

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T,stringsAsFactors=TRUE) # POP must be a factor for col= and levels()
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

==== Infer selection along the genome ====

Now let's use the principal components to scan for selection along the genome:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail directly, so large statistics do not round to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T09:42:21Z

Sonia4: /* Run PCAngsd */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the lines of the input file. The line count equals the number of loci with GLs plus one, because the header line is also counted:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>
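To see what the summary looks like without downloading anything, the sketch below runs the same <code>cut | sort | uniq -c</code> pipeline on a made-up file. The two-column layout (population label, then sample ID) and all values are assumptions for illustration.

```shell
# Toy population file: population label, then sample ID (made-up values).
POPFILE=$(mktemp)
printf 'YRI NA0001\nCEU NA0002\nYRI NA0003\nCEU NA0004\nCEU NA0005\n' > "$POPFILE"

# Count individuals per population label and print them as label=count.
summary=$(cut -f 1 -d " " "$POPFILE" | sort | uniq -c \
          | awk '{ print $2 "=" $1 }' | paste -s -d ' ' -)
echo "$summary"
rm -f "$POPFILE"
```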

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

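<pre>
#open R (paths assume R was started in the Demo folder)
## population labels are read as factors so they can be used as plot colours
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Results/Demo1PCANGSD.cov"))
## the eigenvectors of the covariance matrix are the principal components
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),levels(pop[,1]))
dev.off()
## close R
</pre>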

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

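<pre>
#open R (paths assume R was started in the Demo folder)
## population labels are read as factors so they can be used as plot colours
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Results/Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),levels(pop[,1]))
dev.off()
## close R
</pre>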

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on the first two PCs ====
Let's use the PCA to infer admixture proportions based on the first two principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R (paths assume R was started in the Demo folder)
pop<-read.table("Data/Demo1pop.info",as.is=TRUE)

## check the name of the admixture output file in the Results folder
q<-read.table("Results/Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
## population names under the bars, black lines at the population boundaries
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=TRUE)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples usually cannot be estimated with standard approaches. Let's estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join the names and the results, sort by the estimates, and look at the output:

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimate of the inbreeding coefficient (negative values are allowed).
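The <code>-k3g</code> option matters because the estimates can be negative or written in scientific notation, which a plain text sort would misorder. The sketch below shows the same <code>paste | sort</code> pipeline on made-up stand-ins for the two files (file names, sample IDs, and values are invented for illustration).

```shell
# Toy stand-ins for the population file and the .inbreed output (made-up values).
TMP=$(mktemp -d)
printf 'CEU NA0001\nYRI NA0002\nJPT NA0003\n' > "$TMP/pop.info"
printf '0.12\n-0.03\n2.5e-02\n' > "$TMP/toy.inbreed"

# paste joins the files with a tab, giving three blank-separated fields;
# -k3g sorts on field 3 as a general number (sign and exponent understood).
sorted=$(paste "$TMP/pop.info" "$TMP/toy.inbreed" | LC_ALL=C sort -k3g)
echo "$sorted"

# Sample with the lowest estimate (first line after sorting).
lowest=$(echo "$sorted" | head -n 1 | awk '{ print $2 }')
echo "lowest estimate: $lowest"
rm -rf "$TMP"
```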

==== Inbreeding with individual allele frequencies ====

Now let's estimate the inbreeding coefficient of the samples using the individual allele frequencies predicted by the PCA:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2</code>

Join the names and the results, sort by the estimates, and look at the output:

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian (Tuscany)</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

=== Run PCAngsd ===

The objective is to show the differences among individuals.

<pre>
# count the lines in the sample information file
wc -l $IN_DIR/Demo2sample.info
N=424 #one line for header

#run PCANGSD
python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>
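Rather than typing the count by hand, the number of individuals can be derived from the file itself. A minimal sketch on a made-up sample file with one header line (the column name <code>ID</code> and all values are assumptions for illustration):

```shell
# Toy sample file with a header line (IDs and labels are made up).
INFO=$(mktemp)
printf 'ID POP\nS1 CEU\nS2 GBR\nS3 TSI\n' > "$INFO"

# Individuals = total lines minus the one header line.
N=$(( $(wc -l < "$INFO") - 1 ))
echo "N=$N"
rm -f "$INFO"
```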

Plot the results in R

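<pre>
## Open R (paths assume R was started in the Demo folder)

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
## population labels are read as factors so they can be used as plot colours
ID<-read.table("Data/Demo2sample.info",head=TRUE,stringsAsFactors=TRUE)
plot(e$vectors[,1:2],col=ID$POP,xlab="PC1",ylab="PC2")

legend("topleft",fill=seq_along(levels(ID$POP)),levels(ID$POP))
</pre>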

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

==== Infer selection along the genome ====

Now let's use the principal components to infer selection along the genome:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save</code>

==== Plot the results ====

<pre>
## function for QQ plot of chi-square (1 df) test statistics
qqchi<-function(x,...){
  ## genomic inflation factor: observed median over the expected median
  lambda<-round(median(x)/qchisq(0.5,1),2)
  qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...)
  abline(0,1,col=2,lwd=2)
  legend("topleft",paste("lambda=",lambda))
}

## read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

## convert test statistic to p-value (upper tail, 1 degree of freedom)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38); site names have the form chr_pos
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")
names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region (window clipped to the range of available sites)
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-max(1,w[1]):min(length(pval),w[2])
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]
</pre>
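The <code>lambda</code> printed on the QQ plot is the median of the observed statistics divided by the median of a chi-square distribution with one degree of freedom (about 0.4549). Values near 1 suggest the statistics are well calibrated. The sketch below reproduces that arithmetic on five made-up values. It illustrates the formula only and is not a substitute for the R function.

```shell
# Toy test statistics, one per line (made-up values).
STATS=$(mktemp)
printf '0.10\n2.30\n0.45\n0.91\n0.02\n' > "$STATS"

# lambda = median(statistic) / median of a chi-square with 1 df (0.4549364).
lambda=$(LC_ALL=C sort -g "$STATS" | LC_ALL=C awk '
    { v[NR] = $1 }
    END {
        med = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
        printf "%.2f\n", med / 0.4549364
    }')
echo "lambda=$lambda"
rm -f "$STATS"
```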

PCAngsdTutorial

2019-08-07T09:41:39Z

Sonia4: /* Set the path to PCAngsd */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=$PWD/Data</code>

<code>OUT_DIR=$PWD/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the lines in the input file. The count equals the number of loci with GLs plus one, because the header line is included:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on the first two PCs ====
Let's use the PCA to infer admixture proportions based on the first two principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info",as.is=T)

###########CHECK NAME OF FILE###################
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples usually cannot be estimated with standard approaches. Let's estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join the names and the results, sort by the estimates, and look at the output:

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimate of the inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's estimate the inbreeding coefficient of the samples using the individual allele frequencies predicted by the PCA:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2</code>

Join the names and the results, sort by the estimates, and look at the output:

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian (Tuscany)</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

=== Run PCAngsd ===

The objective is to show the differences among individuals.

<pre>
# count the lines in the sample information file
wc -l $IN_DIR/Demo2sample.info
N=424 #one line for header

#run PCANGSD
python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>

Plot the results in R

<pre>
## Open R

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

==== Infer selection along the genome ====

Now let's use the principal components to infer selection along the genome:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save</code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, 1 degree of freedom)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T09:40:50Z

Sonia4: /* Set the path to PCAngsd */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD="python $HOME/Software/pcangsd/pcangsd.py"
</code>

Test the link

<code>
ls ~/Software/pcangsd/pcangsd.py
</code>
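When a variable holds a command together with an argument, the whole value must be quoted in the assignment, and the variable must be expanded without quotes so that the shell splits it back into words. The sketch below illustrates both points in a POSIX shell such as bash; <code>echo</code> is a hypothetical stand-in for the real <code>python</code> call, and the variable names are invented.

```shell
# Hypothetical stand-in: "echo running" plays the role of "python .../pcangsd.py".
TOOL="echo running"

# Unquoted expansion is split on spaces into a command plus its first argument.
out=$($TOOL -beagle input.gz)
echo "$out"

# Without quotes, VAR=word is only a temporary assignment for the command
# that follows it (here: true), so VAR is still unset afterwards.
unset DEMO
DEMO=echo true
echo "DEMO is '${DEMO:-unset}'"
```

Note that <code>~</code> is not expanded inside double quotes, which is why a quoted value needs <code>$HOME</code> instead.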

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=$PWD/Data</code>

<code>OUT_DIR=$PWD/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the lines in the input file. The count equals the number of loci with GLs plus one, because the header line is included:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on the first two PCs ====
Let's use the PCA to infer admixture proportions based on the first two principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info",as.is=T)

###########CHECK NAME OF FILE###################
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples usually cannot be estimated with standard approaches. Let's estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join the names and the results, sort by the estimates, and look at the output:

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimate of the inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's estimate the inbreeding coefficient of the samples using the individual allele frequencies predicted by the PCA:

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2</code>

Join the names and the results, sort by the estimates, and look at the output:

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian (Tuscany)</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

=== Run PCAngsd ===

The objective is to show the differences among individuals.

<pre>
# count the lines in the sample information file
wc -l $IN_DIR/Demo2sample.info
N=424 #one line for header

#run PCANGSD
$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>

Plot the results in R

<pre>
## Open R

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

==== Infer selection along the genome ====

Now let's use the principal components to infer selection along the genome:

<code>$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save</code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, 1 degree of freedom)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T09:40:20Z

Sonia4: /* Estimating Individual Allele Frequencies */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=$PWD/Data</code>

<code>OUT_DIR=$PWD/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the lines in the input file. The count equals the number of loci with GLs plus one, because the header line is included:

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on the first two PCs ====
Let's use the PCA to infer admixture proportions based on the first two principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info",as.is=T)

###########CHECK NAME OF FILE###################
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples usually cannot be estimated with standard approaches. Let's estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join the names and the results, sort by the estimates, and look at the output:

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimate of the inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's estimate the inbreeding coefficient of the samples using the individual allele frequencies predicted by the PCA:

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2</code>

Join the names and the results, sort by the estimates, and look at the output:

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian (Tuscany)</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

=== Run PCAngsd ===

The objective is to show the differences among individuals.

<pre>
# count the lines in the sample information file
wc -l $IN_DIR/Demo2sample.info
N=424 #one line for header

#run PCANGSD
python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>

Plot the results in R

<pre>
## Open R

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
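ID<-read.table("Demo2sample.info",head=T)
## make sure POP is a factor so that it can be used for colours and levels()
ID$POP<-as.factor(ID$POP)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))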
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

==== Infer selection along the genome ====

Now let's use the principal components from the PCA to infer selection along the genome.

<code>$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, so very large statistics do not round to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-07T09:39:54Z

Sonia4: /* Estimating Individual Allele Frequencies */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines is the number of loci for which there are GLs plus one (the header line is included in the count):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
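The column layout described above can also be checked programmatically. The following is a minimal Python sketch, not part of PCAngsd: the function names <code>open_beagle</code> and <code>read_beagle</code> and the toy data are made up for illustration, and it assumes a whitespace-separated file with one header line.

```python
import gzip
import io


def open_beagle(path):
    """Open a gzipped beagle file as text (hypothetical helper)."""
    return gzip.open(path, "rt")


def read_beagle(handle, max_sites=None):
    """Yield (marker, allele1, allele2, gls) for each site.

    gls is a list with one (hom. allele1, het., hom. allele2) tuple
    per individual, following the column layout described above.
    """
    header = handle.readline().split()
    n_ind = (len(header) - 3) // 3  # three GL columns per individual
    for i, line in enumerate(handle):
        if max_sites is not None and i >= max_sites:
            break
        fields = line.split()
        values = [float(x) for x in fields[3:]]
        gls = [tuple(values[3 * j:3 * j + 3]) for j in range(n_ind)]
        yield fields[0], fields[1], fields[2], gls


# Toy in-memory example with two individuals and one site (made-up values)
example = (
    "marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n"
    "1_14000\t1\t0\t0.333\t0.333\t0.333\t0.9\t0.1\t0.0\n"
)
sites = list(read_beagle(io.StringIO(example)))
for marker, a1, a2, gls in sites:
    # the three GLs of each individual should sum to (approximately) one
    print(marker, [round(sum(g), 3) for g in gls])
```

For the real input, something like <code>read_beagle(open_beagle("Demo/Data/Demo1input.gz"), max_sites=10)</code> would read the first 10 sites, assuming the file is at that path.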

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.
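As a rough illustration of how a covariance matrix turns into principal components, here is a minimal Python sketch using NumPy. It mirrors the <code>eigen()</code> call used in the R snippets of this tutorial; the function name <code>pca_from_cov</code> and the toy matrix are assumptions made for this example and are not PCAngsd output.

```python
import numpy as np


def pca_from_cov(C, n_pcs=2):
    """Return the top eigenvectors of a symmetric covariance matrix
    and the fraction of the eigenvalue sum that each one accounts for."""
    C = np.asarray(C, dtype=float)
    vals, vecs = np.linalg.eigh(C)      # eigh returns eigenvalues in ascending order
    order = np.argsort(vals)[::-1]      # reorder to descending, like R's eigen()
    vals, vecs = vals[order], vecs[:, order]
    explained = vals[:n_pcs] / vals.sum()
    return vecs[:, :n_pcs], explained


# Toy 4x4 covariance matrix: two groups of two individuals (made-up values)
C = np.array([
    [1.0, 0.8, -0.5, -0.5],
    [0.8, 1.0, -0.5, -0.5],
    [-0.5, -0.5, 1.0, 0.8],
    [-0.5, -0.5, 0.8, 1.0],
])
pcs, explained = pca_from_cov(C)
print(pcs[:, 0])   # PC1 separates individuals 1-2 from individuals 3-4
print(explained)
```

With a real run, <code>C</code> would instead be read with <code>np.genfromtxt</code> from the <code>.cov</code> file written by PCAngsd.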

==== Estimating Individual Allele Frequencies ====

<code>python $PCANGSD/pcangsd.py -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info",as.is=T)

###########CHECK NAME OF FILE###################
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

=== Run PCAngsd ===

The objective is to show the differences among individuals.

<pre>
wc Demo2sample.info
N=424 #one line for header

#run PCANGSD
$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>

Plot the results in R

<pre>
## Open R

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
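ID<-read.table("Demo2sample.info",head=T)
## make sure POP is a factor so that it can be used for colours and levels()
ID$POP<-as.factor(ID$POP)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))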
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

==== Infer selection along the genome ====

Now let's use the principal components from the PCA to infer selection along the genome.

<code>$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, so very large statistics do not round to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>
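The <code>lambda</code> value printed by the QQ-plot function above is the genomic inflation factor: the median of the observed statistics divided by the median of a χ² distribution with 1 degree of freedom. A minimal Python sketch of the same calculation, using only the standard library, could look like this. The function name <code>genomic_inflation</code> and the toy statistics are made up for illustration.

```python
from statistics import NormalDist, median


def genomic_inflation(stats):
    """Median of the observed statistics divided by the chi-square (1 df) median."""
    # A chi-square variable with 1 df is a squared standard normal,
    # so its median is the squared 75% quantile of the standard normal (about 0.455).
    chi2_median = NormalDist().inv_cdf(0.75) ** 2
    return median(stats) / chi2_median


# Toy test statistics (made-up values, not PCAngsd output)
toy_stats = [0.1, 0.3, 0.45, 0.9, 2.5]
lam = genomic_inflation(toy_stats)
print(round(lam, 2))
```

Values close to 1 suggest that the bulk of the test statistics follow the expected null distribution.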

PCAngsdv2

2019-08-07T09:35:29Z

Sonia4: /* Inbreeding */

PCAngsd is a program that estimates the covariance matrix for low-depth next-generation sequencing (NGS) data in structured/heterogeneous populations. It uses principal component analysis (PCA) in an iterative procedure based on genotype likelihoods, and the results can be used to perform multiple population genetic analyses.

Since version 0.98, PCAngsd has been rewritten to use Cython for its computational bottlenecks and parallelization, and it is now compatible with newer Python versions.

The method was published in 2018 and can be found here: [https://www.genetics.org/content/210/2/719]
[[File:Pcangsd_admix.gif|frame]]

[[File:Pcangsd_pca.png|thumb|400px|Simulated low depth NGS data of 3 populations]]

=Overview=

Based on population structure inference, PCAngsd is able to detect the number of significant principal components, which are then used to estimate individual allele frequencies using genotype dosages in an SVD model. These individual allele frequencies can be used in various population genetic methods for heterogeneous populations, such that PCAngsd can perform PCA (estimate covariance matrix), call genotypes, estimate individual admixture proportions, estimate inbreeding coefficients (per-individual and per-site) and perform a genome selection scan using principal components.
The estimated individual allele frequencies and principal components can be used as prior knowledge in other probabilistic methods based on the same Bayesian principle. PCAngsd can perform the following analyses:
*Covariance matrix
*Genotype calling
*Admixture
*Inbreeding coefficients (both per-individual and per-site)
*HWE test
*Genome selection scan
*Kinship matrix

The older version, based on the Numba library (only working with Python 2.7), is still available as version 0.973 and can be found here [https://github.com/Rosemeis/pcangsd/releases/tag/0.973].

=Download and Installation=

PCAngsd should work on all platforms meeting the requirements but server-side usage is highly recommended.
Installation has only been tested on Linux systems.

It is assumed that OpenMP is installed [https://www.openmp.org/].

1. Log in to your server using ssh in a terminal window.

2. Create the directory where you will install your software and enter it, such as:

<pre>
mkdir ~/Software
cd ~/Software
</pre>

3. Download the source code:

<pre>
git clone https://github.com/Rosemeis/pcangsd.git
</pre>

4. Configure, Compile and Install:

<pre>
cd pcangsd/
python setup.py build_ext --inplace
</pre>

5. Install dependencies:

The required set of Python packages is easily installed using the pip command and the requirements.txt file included in the pcangsd folder.

<code>pip install --user -r requirements.txt</code>

=Quick start=

<pre>

# Download the input file
# *ANDERS*: maybe change the download path to
# popgen.dk/software/download/PCAngsd/data/input.gz
wget popgen.dk/software/download/NGSadmix/data/input.gz

# Only estimate covariance matrix using 10 threads
python pcangsd.py -beagle input.gz -o test1 -threads 10

# Estimate covariance matrix and individual admixture proportions
python pcangsd.py -beagle input.gz -admix -o test2 -threads 10

# Estimate covariance matrix and inbreeding coefficients
python pcangsd.py -beagle input.gz -inbreed 2 -o test3 -threads 10

# Estimate covariance matrix and perform selection scan
python pcangsd.py -beagle input.gz -selection -o test4 -threads 10
</pre>

==Detailed Examples and Tutorial==

Please refer to the tutorial's page [http://www.popgen.dk/software/index.php/PCAngsdTutorial]

=Input=
PCAngsd takes genotype likelihoods in [http://faculty.washington.edu/browning/beagle/beagle.html Beagle] format as input. Since version 0.9, binary PLINK files can also be used: the genotypes are automatically converted into a genotype likelihood matrix, where the user can incorporate an error model.

[http://popgen.dk/angsd ANGSD] can easily be used to compute genotype likelihoods and output them in the required Beagle format.

<pre>
./angsd -GL 1 -out data -nThreads 10 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist
</pre>

See [http://popgen.dk/angsd ANGSD] for more information on how to compute the genotype likelihoods and call SNPs.

=Output=

Since version 0.98, PCAngsd's output is only in binary Numpy format (.npy) except for the covariance matrix.

To read the files in Python:

<pre>
import numpy as np
C = np.genfromtxt("output.cov") # Reads in estimated covariance matrix
S = np.load("output.selection.npy") # Reads results from selection scan
</pre>

R can also read Numpy matrices using the "RcppCNPy" library:
<pre>
library(RcppCNPy)
C <- as.matrix(read.table("output.cov")) # Reads in estimated covariance matrix
S <- npyLoad("output.selection.npy") # Reads results from selection scan
</pre>

=Using PCAngsd=

All the different options in PCAngsd are listed here. PCAngsd will always compute the covariance matrix, where it uses principal components to estimate individual allele frequencies in an iterative procedure. The estimated individual allele frequencies will then be used in any of the other specified options of PCAngsd.

<pre>
# See all options in PCAngsd
python pcangsd.py -h
</pre>

==Estimation of individual allele frequencies==
; -beagle [Beagle filename]
Input file of genotype likelihoods in Beagle format (.beagle.gz).
; -plink [Prefix for binary PLINK files]
Path to PLINK files using their prefix (.bed, .bim, .fam).
; -plink_error [float]
Incorporate error model for PLINK genotypes.
; -minMaf [float]
Minimum minor allele frequency threshold. (Default: 0.05)
; -iter [int]
Maximum number of iterations for estimation of individual allele frequencies (Default: 100).
; -tole [float]
Tolerance value for update in estimation of individual allele frequencies (Default: 1e-5).
; -maf_iter [int]
Maximum number of EM iterations for computing the population allele frequencies (Default: 100).
; -maf_tole [float]
Tolerance value in EM algorithm for population allele frequencies estimation (Default: 1e-4).
; -e [int]
Manually select the number of eigenvalues to use in the modelling of individual allele frequencies (Default: Automatically tested using MAP test).
; -o [prefix]
Set the prefix for all output files created by PCAngsd (Default: "pcangsd").
; -indf_save
Choose to save estimated individual allele frequencies (Binary). Numpy format (.npy).
; -dosage_save
Choose to save estimated genotype dosages (Binary). Numpy format (.npy).
; -sites_save
Choose to save the marker IDs after performing filtering using population allele frequencies. Useful for especially selection scans and per-site inbreeding coefficients.
; -post_save
Choose to save the posterior genotype probabilities. Beagle format (.beagle).
; -threads [int]
Specify the number of thread(s) to use (Default: 1).

==Call genotypes==
Genotypes can be called from posterior genotype probabilities incorporating the individual allele frequencies as prior information.

; -geno [float]
Call genotypes with defined threshold.
; -genoInbreed [float]
Call genotypes with defined threshold also taking inbreeding into account. '''-inbreed [int]''' is required, since individual inbreeding coefficients must have been estimated prior to calling genotypes using that information.
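To make the idea of calling genotypes from posterior probabilities concrete, here is a minimal Python sketch for a single individual at a single site. It is not PCAngsd's implementation. It assumes a Hardy-Weinberg prior built from the individual allele frequency, that the frequency refers to allele2, and that the genotype is coded as the number of allele2 copies. The function name <code>call_genotype</code> and the missing code <code>-9</code> are hypothetical.

```python
def call_genotype(gl, freq, threshold=0.9, missing=-9):
    """Call a genotype (0, 1 or 2 copies of allele2) from genotype likelihoods.

    gl:    (hom. allele1, het., hom. allele2) genotype likelihoods
    freq:  individual allele frequency of allele2 (assumption of this sketch)
    Returns (call, posterior); the call is `missing` when no genotype
    reaches the posterior threshold.
    """
    # Hardy-Weinberg prior from the individual allele frequency
    prior = ((1.0 - freq) ** 2, 2.0 * freq * (1.0 - freq), freq ** 2)
    unnorm = [l * p for l, p in zip(gl, prior)]
    total = sum(unnorm)
    if total <= 0.0:
        return missing, None
    post = [u / total for u in unnorm]
    best = max(range(3), key=lambda g: post[g])
    return (best if post[best] >= threshold else missing), post


# Informative likelihoods: the heterozygote is called
g1, post1 = call_genotype((0.05, 0.90, 0.05), freq=0.5)
# Uninformative likelihoods: the posterior equals the prior and stays below the threshold
g2, post2 = call_genotype((1 / 3, 1 / 3, 1 / 3), freq=0.1)
print(g1, g2)
```

The second call shows why a threshold is useful at low depth: when the data carry no information, the call is left missing instead of defaulting to the most common genotype.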

==Admixture==
Individual admixture proportions and population-specific allele frequencies can be estimated based on assuming K ancestral populations using an accelerated mini-batch NMF method.

; -admix
Toggles admixture estimations. Individual ancestry proportions are saved (Binary). Numpy format (.npy).
; -admix_alpha [float-list]
Specify alpha (sparseness regularization parameter). Can be specified as a sequence to try several alpha's in a single run (Default: 0).
; -admix_auto [float]
Enable automatic search for optimal alpha using likelihood measure, by giving soft upper search bound of alpha.
; -admix_seed [int-list]
Specify seed for random initializations of factor matrices in admixture estimations. Can be specified as a sequence to try several different seeds in a single run.
; -admix_K [int]
Not recommended. Manually specify the number of ancestral populations to use in admixture estimations (overrides number chosen from '''-e'''). Structure explained by individual allele frequencies may therefore not reflect the manually chosen K. It is recommended to adjust '''-e''' instead of '''-admix_K'''.
; -admix_iter [int]
Maximum number of iterations for admixture estimations using NMF. (Default: 200)
; -admix_tole [float]
Tolerance value for update in admixture estimations using NMF. (Default: 1e-5)
; -admix_batch [int]
Specify the number of batches to use in NMF method. (Default: 10)
; -admix_save
Choose to save the population-specific allele frequencies (Binary). Numpy format (.npy).

==Inbreeding==
Per-individual inbreeding coefficients incorporating population structure can be computed using three different methods. However, -inbreed 1 is recommended for low depth cases.

; -inbreed 1
Simple estimator computed by an EM algorithm. Allows for F-values between -1 and 1. Based on [http://genome.cshlp.org/content/23/11/1852.full ngsF].

; -inbreed 2
A maximum likelihood estimator also computed by an EM algorithm. Only allows for F-values between 0 and 1. Based on [https://www.cambridge.org/core/journals/genetics-research/article/maximum-likelihood-estimation-of-individual-inbreeding-coefficients-and-null-allele-frequencies/2DEBA0C0C2B92DF0EE89BD27DFCAD3FB].

; -inbreed 3
Estimator using an estimated kinship matrix. Allows for F-values between -1 and 1. Based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate].

; -inbreed_iter [int]
Maximum number of iterations for the EM algorithm methods. (Default: 200)

; -inbreed_tole [float]
Tolerance value for the EM algorithms for inbreeding coefficients estimation. (Default: 1e-4)

Per-site inbreeding coefficients incorporating population structure alongside likelihood ratio tests for HWE can be computed as follows:

; -inbreedSites

Use likelihood ratio tests (.lrt.sites.gz) generated from '''-inbreedSites''' to filter out variable sites using a given threshold for HWE test p-value:

; -hwe [LRT filename]

; -hwe_tole [float]
Tolerance value for HWE test. (Default: 1e-6)

==Selection==
A genome selection scan can be computed based on posterior expectations of the genotypes (genotype dosages):

; -selection
Using an extended model of [http://www.cell.com/ajhg/abstract/S0002-9297(16)00003-3 FastPCA]. Performs a genome selection scan along all significant PCs. Outputs the selection statistics, which must be converted to p-values by the user. Each column reflects the selection statistics along a tested PC, and they are χ²-distributed with 1 degree of freedom.
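Since the conversion to p-values is left to the user, a minimal Python sketch of it is shown here. It relies on the fact that a χ² variable with 1 degree of freedom is a squared standard normal, so the upper-tail probability can be written with the complementary error function from the standard library. The function name <code>chi2_1df_pvalue</code> is made up for this example.

```python
import math


def chi2_1df_pvalue(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # P(Z^2 > x) = 2 * (1 - Phi(sqrt(x))) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))


# The usual 5% critical value of the chi-square distribution with 1 df
print(chi2_1df_pvalue(3.841458820694124))
# A very large statistic still gives a non-zero p-value, so -log10 stays finite
print(-math.log10(chi2_1df_pvalue(100.0)))
```

Computing the upper tail directly, instead of one minus the lower tail, keeps very small p-values from rounding to zero. For a whole column of statistics, the function can be applied element by element to the array loaded from the <code>.selection.npy</code> file.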

==Relatedness==
Estimate the kinship matrix using a method based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate]:

; -kinship
Automatically estimated if '''-inbreed 3''' has been selected.

Remove related individuals based on the kinship matrix of a previous run:
; -relate [Kinship filename]
; -relate_tole [float]
Threshold for kinship coefficients for removing individuals (Default: 0.0625).

=Citation=
Our methods for inferring population structure have been published in GENETICS:

Population structure: [http://www.genetics.org/content/early/2018/08/21/genetics.118.301336 Inferring Population Structure and Admixture Proportions in Low Depth NGS Data]

HWE test: [https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019 Testing for Hardy‐Weinberg Equilibrium in Structured Populations using Genotype or Low‐Depth NGS Data]

PCAngsdv2

2019-08-07T09:31:39Z

Sonia4: /* Output */

PCAngsd is a program that estimates the covariance matrix for low-depth next-generation sequencing (NGS) data in structured/heterogeneous populations. It uses principal component analysis (PCA) in an iterative procedure based on genotype likelihoods, and the results can be used to perform multiple population genetic analyses.

Since version 0.98, PCAngsd has been rewritten to use Cython for its computational bottlenecks and parallelization, and it is now compatible with newer Python versions.

The method was published in 2018 and can be found here: [https://www.genetics.org/content/210/2/719]
[[File:Pcangsd_admix.gif|frame]]

[[File:Pcangsd_pca.png|thumb|400px|Simulated low depth NGS data of 3 populations]]

=Overview=

Based on population structure inference, PCAngsd is able to detect the number of significant principal components, which are then used to estimate individual allele frequencies using genotype dosages in an SVD model. These individual allele frequencies can be used in various population genetic methods for heterogeneous populations, such that PCAngsd can perform PCA (estimate covariance matrix), call genotypes, estimate individual admixture proportions, estimate inbreeding coefficients (per-individual and per-site) and perform a genome selection scan using principal components.
The estimated individual allele frequencies and principal components can be used as prior knowledge in other probabilistic methods based on the same Bayesian principle. PCAngsd can perform the following analyses:
*Covariance matrix
*Genotype calling
*Admixture
*Inbreeding coefficients (both per-individual and per-site)
*HWE test
*Genome selection scan
*Kinship matrix

The older version, based on the Numba library (only working with Python 2.7), is still available as version 0.973 and can be found here [https://github.com/Rosemeis/pcangsd/releases/tag/0.973].

=Download and Installation=

PCAngsd should work on all platforms meeting the requirements but server-side usage is highly recommended.
Installation has only been tested on Linux systems.

It is assumed that OpenMP is installed [https://www.openmp.org/].

1. Log in to your server using ssh in a terminal window.

2. Create the directory where you will install your software and enter it, such as:

<pre>
mkdir ~/Software
cd ~/Software
</pre>

3. Download the source code:

<pre>
git clone https://github.com/Rosemeis/pcangsd.git
</pre>

4. Configure, Compile and Install:

<pre>
cd pcangsd/
python setup.py build_ext --inplace
</pre>

5. Install dependencies:

The required set of Python packages is easily installed using the pip command and the requirements.txt file included in the pcangsd folder.

<code>pip install --user -r requirements.txt</code>

=Quick start=

<pre>

# Download the input file
# *ANDERS*: maybe change the download path to
# popgen.dk/software/download/PCAngsd/data/input.gz
wget popgen.dk/software/download/NGSadmix/data/input.gz

# Only estimate covariance matrix using 10 threads
python pcangsd.py -beagle input.gz -o test1 -threads 10

# Estimate covariance matrix and individual admixture proportions
python pcangsd.py -beagle input.gz -admix -o test2 -threads 10

# Estimate covariance matrix and inbreeding coefficients
python pcangsd.py -beagle input.gz -inbreed 2 -o test3 -threads 10

# Estimate covariance matrix and perform selection scan
python pcangsd.py -beagle input.gz -selection -o test4 -threads 10
</pre>

==Detailed Examples and Tutorial==

Please refer to the tutorial's page [http://www.popgen.dk/software/index.php/PCAngsdTutorial]

=Input=
PCAngsd takes genotype likelihoods in [http://faculty.washington.edu/browning/beagle/beagle.html Beagle] format as input. Since version 0.9, binary PLINK files can also be used: the genotypes are automatically converted into a genotype likelihood matrix, where the user can incorporate an error model.

[http://popgen.dk/angsd ANGSD] can easily be used to compute genotype likelihoods and output them in the required Beagle format.

<pre>
./angsd -GL 1 -out data -nThreads 10 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -bam bam.filelist
</pre>

See [http://popgen.dk/angsd ANGSD] for more information on how to compute the genotype likelihoods and call SNPs.

=Output=

Since version 0.98, PCAngsd's output is only in binary Numpy format (.npy) except for the covariance matrix.

To read the files in Python:

<pre>
import numpy as np
C = np.genfromtxt("output.cov") # Reads in estimated covariance matrix
S = np.load("output.selection.npy") # Reads results from selection scan
</pre>

R can also read Numpy matrices using the "RcppCNPy" library:
<pre>
library(RcppCNPy)
C <- as.matrix(read.table("output.cov")) # Reads in estimated covariance matrix
S <- npyLoad("output.selection.npy") # Reads results from selection scan
</pre>

=Using PCAngsd=

All the different options in PCAngsd are listed here. PCAngsd will always compute the covariance matrix, where it uses principal components to estimate individual allele frequencies in an iterative procedure. The estimated individual allele frequencies will then be used in any of the other specified options of PCAngsd.

<pre>
# See all options in PCAngsd
python pcangsd.py -h
</pre>

==Estimation of individual allele frequencies==
; -beagle [Beagle filename]
Input file of genotype likelihoods in Beagle format (.beagle.gz).
; -plink [Prefix for binary PLINK files]
Path to PLINK files using their prefix (.bed, .bim, .fam).
; -plink_error [float]
Incorporate error model for PLINK genotypes.
; -minMaf [float]
Minimum minor allele frequency threshold. (Default: 0.05)
; -iter [int]
Maximum number of iterations for estimation of individual allele frequencies (Default: 100).
; -tole [float]
Tolerance value for update in estimation of individual allele frequencies (Default: 1e-5).
; -maf_iter [int]
Maximum number of EM iterations for computing the population allele frequencies (Default: 100).
; -maf_tole [float]
Tolerance value in EM algorithm for population allele frequencies estimation (Default: 1e-4).
; -e [int]
Manually select the number of eigenvalues to use in the modelling of individual allele frequencies (Default: Automatically tested using MAP test).
; -o [prefix]
Set the prefix for all output files created by PCAngsd (Default: "pcangsd").
; -indf_save
Choose to save estimated individual allele frequencies (Binary). Numpy format (.npy).
; -dosage_save
Choose to save estimated genotype dosages (Binary). Numpy format (.npy).
; -sites_save
Choose to save the marker IDs after performing filtering using population allele frequencies. Useful for especially selection scans and per-site inbreeding coefficients.
; -post_save
Choose to save the posterior genotype probabilities. Beagle format (.beagle).
; -threads [int]
Specify the number of thread(s) to use (Default: 1).

==Call genotypes==
Genotypes can be called from posterior genotype probabilities incorporating the individual allele frequencies as prior information.

; -geno [float]
Call genotypes with defined threshold.
; -genoInbreed [float]
Call genotypes with defined threshold also taking inbreeding into account. '''-inbreed [int]''' is required, since individual inbreeding coefficients must have been estimated prior to calling genotypes using that information.

==Admixture==
Individual admixture proportions and population-specific allele frequencies can be estimated based on assuming K ancestral populations using an accelerated mini-batch NMF method.

; -admix
Toggles admixture estimations. Individual ancestry proportions are saved (Binary). Numpy format (.npy).
; -admix_alpha [float-list]
Specify alpha (sparseness regularization parameter). Can be specified as a sequence to try several alpha's in a single run (Default: 0).
; -admix_auto [float]
Enable automatic search for optimal alpha using likelihood measure, by giving soft upper search bound of alpha.
; -admix_seed [int-list]
Specify seed for random initializations of factor matrices in admixture estimations. Can be specified as a sequence to try several different seeds in a single run.
; -admix_K [int]
Not recommended. Manually specify the number of ancestral populations to use in admixture estimations (overrides number chosen from '''-e'''). Structure explained by individual allele frequencies may therefore not reflect the manually chosen K. It is recommended to adjust '''-e''' instead of '''-admix_K'''.
; -admix_iter [int]
Maximum number of iterations for admixture estimations using NMF. (Default: 200)
; -admix_tole [float]
Tolerance value for update in admixture estimations using NMF. (Default: 1e-5)
; -admix_batch [int]
Specify the number of batches to use in NMF method. (Default: 10)
; -admix_save
Choose to save the population-specific allele frequencies (Binary). Numpy format (.npy).

==Inbreeding==
Per-individual inbreeding coefficients incorporating population structure can be computed using three different methods. However, -inbreed 1 is recommended for low depth cases.

; -inbreed 1
Simple estimator computed by an EM algorithm. Allows for F-values between -1 and 1. Based on [http://genome.cshlp.org/content/23/11/1852.full ngsF].
; -inbreed 2
A maximum likelihood estimator also computed by an EM algorithm. Only allows for F-values between 0 and 1. Based on [https://www.cambridge.org/core/journals/genetics-research/article/maximum-likelihood-estimation-of-individual-inbreeding-coefficients-and-null-allele-frequencies/2DEBA0C0C2B92DF0EE89BD27DFCAD3FB].
; -inbreed 3
Estimator using an estimated kinship matrix. Allows for F-values between -1 and 1. Based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate].
; -inbreed_iter [int]
Maximum number of iterations for the EM algorithm methods. (Default: 200)
; -inbreed_tole [float]
Tolerance value for the EM algorithms for inbreeding coefficients estimation. (Default: 1e-4)

Per-site inbreeding coefficients incorporating population structure, alongside likelihood ratio tests for HWE, can be computed as follows:

; -inbreedSites

Use likelihood ratio tests (.lrt.sites.gz) generated from '''-inbreedSites''' to filter out variable sites using a given threshold for HWE test p-value:

; -hwe [LRT filename]
; -hwe_tole [float]
Tolerance value for HWE test. (Default: 1e-6)
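
The two options above imply a two-run workflow. A minimal dry-run sketch follows; it only builds and prints the commands. The input file name, the output prefixes and the way PCAngsd is launched are assumptions for illustration, as is the idea that the LRT file is named after the '''-o''' prefix.

```shell
# Dry-run sketch: builds and prints the two commands instead of executing them.
PCANGSD="${PCANGSD:-pcangsd}"    # assumption: however PCAngsd is launched on your system
BEAGLE="input.beagle.gz"         # hypothetical input file
PREFIX="hwe_step1"               # hypothetical output prefix

# Step 1: per-site inbreeding coefficients and likelihood ratio tests for HWE
STEP1="$PCANGSD -beagle $BEAGLE -o $PREFIX -inbreedSites"

# Step 2: rerun, filtering sites with the LRT file written in step 1
# (assumes the file is named <prefix>.lrt.sites.gz)
STEP2="$PCANGSD -beagle $BEAGLE -o hwe_step2 -hwe $PREFIX.lrt.sites.gz -hwe_tole 1e-6"

echo "$STEP1"
echo "$STEP2"
```

Once the printed commands look right, they can be run as shown.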

==Selection==
A genome selection scan can be computed based on posterior expectations of the genotypes (genotype dosages):

; -selection
Uses an extended model of [http://www.cell.com/ajhg/abstract/S0002-9297(16)00003-3 FastPCA]. Performs a genome selection scan along all significant PCs. Outputs the selection statistics, which the user must convert to p-values. Each column reflects the selection statistics along one tested PC, and they are χ²-distributed with 1 degree of freedom.
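
For reference, the conversion from a selection statistic ''s'' to a p-value under the stated χ² distribution with 1 degree of freedom can be written as below. The symbol ''F'' for the cumulative distribution function is notation introduced here, not PCAngsd's.

```latex
p \;=\; \Pr\!\left(\chi^2_1 \ge s\right) \;=\; 1 - F_{\chi^2_1}(s) \;=\; \operatorname{erfc}\!\left(\sqrt{s/2}\right),
\qquad \text{e.g. } s = 3.841 \;\Rightarrow\; p \approx 0.05 .
```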

==Relatedness==
Estimate the kinship matrix using a method based on [http://www.cell.com/ajhg/abstract/S0002-9297(15)00493-0 PC-Relate]:

; -kinship
Automatically estimated if '''-inbreed 3''' has been selected.

Remove related individuals based on the kinship matrix of a previous run:
; -relate [Kinship filename]
; -relate_tole [float]
Threshold for kinship coefficients for removing individuals (Default: 0.0625).
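
The default threshold of 0.0625 (1/16) corresponds to the kinship expected between first cousins. The sketch below only illustrates what exceeding the threshold means, using a made-up plain-text matrix; the real kinship file may be in a different (binary) format, and PCAngsd's own removal procedure is not reproduced here.

```shell
# Toy symmetric kinship matrix for 4 individuals (values are made up).
TOY_KIN=$(mktemp)
cat > "$TOY_KIN" <<'EOF'
0.5000 0.2400 0.0100 0.0000
0.2400 0.5000 0.0300 0.0700
0.0100 0.0300 0.5000 0.0200
0.0000 0.0700 0.0200 0.5000
EOF

THRESH=0.0625   # same value as the -relate_tole default

# Walk the upper triangle and report pairs whose kinship exceeds the threshold
PAIRS=$(awk -v t="$THRESH" '{ for (j = NR + 1; j <= NF; j++) if ($j > t) print NR "-" j ": " $j }' "$TOY_KIN")
N_PAIRS=$(printf '%s\n' "$PAIRS" | wc -l)

echo "$PAIRS"
rm -f "$TOY_KIN"
```

In this toy example two pairs are flagged, so one individual from each pair would need to be dropped to obtain an unrelated set.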

=Citation=
Our methods for inferring population structure have been published in GENETICS:

Population structure: [http://www.genetics.org/content/early/2018/08/21/genetics.118.301336 Inferring Population Structure and Admixture Proportions in Low Depth NGS Data]

HWE test: [https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13019 Testing for Hardy‐Weinberg Equilibrium in Structured Populations using Genotype or Low‐Depth NGS Data]

PCAngsdTutorial

2019-08-06T23:02:14Z

Sonia4: /* Demo 2: Selection */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Data</code>

<code>OUT_DIR=Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>
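
Relative paths only resolve from one particular working directory. As an alternative, here is a minimal sketch that uses absolute paths so the variables stay valid after any later <code>cd</code>. The variable <code>DEMO_ROOT</code> is a hypothetical name introduced for this example.

```shell
# Sketch: absolute paths stay valid no matter which directory you are in.
# DEMO_ROOT is a hypothetical variable, not part of the original tutorial.
DEMO_ROOT="${DEMO_ROOT:-$PWD/Demo}"
mkdir -p "$DEMO_ROOT/Data" "$DEMO_ROOT/Results"   # -p: no error if they already exist

IN_DIR="$DEMO_ROOT/Data"
OUT_DIR="$DEMO_ROOT/Results"

# Test the links: both commands should succeed
ls "$IN_DIR"
ls "$OUT_DIR"
```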

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines is the number of loci for which there are GLs plus one (the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
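
The layout described above can be checked on a tiny hand-made file. The following sketch assumes a tab-delimited file with a header line; all marker names and likelihood values are made up for illustration.

```shell
# Toy beagle-style file: 2 sites, 2 individuals (all values are made up).
TOY=$(mktemp)
printf 'marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n' > "$TOY"
printf 'chr1_100\t0\t2\t0.90\t0.09\t0.01\t0.25\t0.50\t0.25\n' >> "$TOY"
printf 'chr1_200\t1\t3\t0.10\t0.30\t0.60\t0.98\t0.01\t0.01\n' >> "$TOY"

# Sites = lines minus the header; individuals = (columns - 3) / 3
N_SITES=$(( $(wc -l < "$TOY") - 1 ))
N_IND=$(awk -F'\t' 'NR == 1 { print (NF - 3) / 3 }' "$TOY")

# Count (site, individual) triplets whose three GLs do not sum to one
BAD=$(awk -F'\t' 'NR > 1 { for (i = 4; i <= NF; i += 3) { s = $i + $(i+1) + $(i+2); if (s < 0.999 || s > 1.001) bad++ } } END { print bad + 0 }' "$TOY")

echo "sites=$N_SITES individuals=$N_IND bad_triplets=$BAD"
rm -f "$TOY"
```

The same two <code>awk</code> commands can be applied to the real input by piping <code>gunzip -c $IN_DIR/Demo1input.gz</code> into them.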

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>
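
To see what the summary pipeline produces, here is a small sketch on a made-up file. It assumes the same layout as the commands above (space-delimited, population label in the first column); the labels and sample IDs are invented.

```shell
# Toy population file: population label, then sample ID (made-up values)
TOY_POP=$(mktemp)
printf 'YRI s1\nCEU s2\nYRI s3\nCHB s4\nCEU s5\nYRI s6\n' > "$TOY_POP"

# Same pipeline as above: first column, sorted, counted;
# awk and paste only reshape the counts onto one line
SUMMARY=$(cut -f 1 -d " " "$TOY_POP" | sort | uniq -c | awk '{ print $2 "=" $1 }' | paste -sd ' ' -)

echo "$SUMMARY"
rm -f "$TOY_POP"
```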

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE) # paths are relative to the Demo directory
C <- as.matrix(read.table("Results/Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE) # paths are relative to the Demo directory
C <- as.matrix(read.table("Results/Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Data/Demo1pop.info",as.is=T) # paths are relative to the Demo directory

###########CHECK NAME OF FILE###################
q<-read.table("Results/Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

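The third column is the estimated inbreeding coefficient. Note that '''-inbreed 2''' restricts estimates to values between 0 and 1; use '''-inbreed 1''' if negative values should be allowed.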
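
The join-and-sort step can be tried on made-up data first. This sketch assumes a two-column population file and a results file with one value per line; the file names and numbers are invented. The <code>g</code> flag makes <code>sort</code> compare general numeric values, so entries in scientific notation are ordered correctly.

```shell
# Toy inputs: population/sample labels and one estimate per line (made-up values)
TOY_DIR=$(mktemp -d)
printf 'CEU s1\nYRI s2\nCHB s3\nCEU s4\n' > "$TOY_DIR/pop.info"
printf '0.12\n3e-03\n0.015\n0.4\n' > "$TOY_DIR/toy.inbreed"

# paste adds the estimates as a third column; sort -k3g orders by that column
SORTED=$(paste "$TOY_DIR/pop.info" "$TOY_DIR/toy.inbreed" | LC_ALL=C sort -k3g)
echo "$SORTED"

# Sample IDs with the lowest and highest estimates
LOWEST=$(echo "$SORTED" | head -n 1 | awk '{ print $2 }')
HIGHEST=$(echo "$SORTED" | tail -n 1 | awk '{ print $2 }')
rm -rf "$TOY_DIR"
```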

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

=== Run PCAngsd ===

The objective is to show the differences among individuals.

<pre>
wc -l $IN_DIR/Demo2sample.info
N=424 #one line for header

#run PCANGSD
$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>

Plot the results in R

<pre>
## Open R

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Data/Demo2sample.info",head=T,stringsAsFactors=TRUE) # path relative to the Demo directory
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=seq_along(levels(ID$POP)),legend=levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCAngsd offers a good description of the differences among individuals without having to define disjoint groups.

==== Infer selection along the genome ====

Now let's use the principal components to infer selection along the genome:

<code>$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail avoids rounding tiny p-values to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>
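
The <code>lambda</code> value printed in the QQ plot legend by <code>qqchi</code> is the ratio of the observed median statistic to the median of a χ² distribution with 1 degree of freedom, often called the genomic inflation factor. Written out, with ''F'' denoting the cumulative distribution function (notation introduced here):

```latex
\lambda \;=\; \frac{\operatorname{median}(s)}{F^{-1}_{\chi^2_1}(0.5)} \;\approx\; \frac{\operatorname{median}(s)}{0.455}
```

Values close to 1 suggest that the bulk of the statistics follows the expected null distribution.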

PCAngsdTutorial

2019-08-06T22:59:15Z

Sonia4: /* Download the input and sample information files */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Data</code>

<code>OUT_DIR=Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines is the number of loci for which there are GLs plus one (the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE) # paths are relative to the Demo directory
C <- as.matrix(read.table("Results/Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE) # paths are relative to the Demo directory
C <- as.matrix(read.table("Results/Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Data/Demo1pop.info",as.is=T) # paths are relative to the Demo directory

###########CHECK NAME OF FILE###################
q<-read.table("Results/Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

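The third column is the estimated inbreeding coefficient. Note that '''-inbreed 2''' restricts estimates to values between 0 and 1; use '''-inbreed 1''' if negative values should be allowed.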

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

<pre>
wc -l $IN_DIR/Demo2sample.info
N=424 #one line for header

#run PCANGSD
$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>

Plot the results in R

<pre>
## Open R

cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Data/Demo2sample.info",head=T,stringsAsFactors=TRUE) # path relative to the Demo directory
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=seq_along(levels(ID$POP)),legend=levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCA offers a good description of the differences between individuals without having to define disjoint groups.

==== Infer selection along the genome ====

Now let's use the principal components to infer selection along the genome:

<code>$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

==== Plot the results ====

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail avoids rounding tiny p-values to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-06T22:56:31Z

Sonia4: /* Download the input and sample information files */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Data</code>

<code>OUT_DIR=Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines is the number of loci for which there are GLs plus one (the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE) # paths are relative to the Demo directory
C <- as.matrix(read.table("Results/Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE) # paths are relative to the Demo directory
C <- as.matrix(read.table("Results/Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("Data/Demo1pop.info",as.is=T) # paths are relative to the Demo directory

###########CHECK NAME OF FILE###################
q<-read.table("Results/Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

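The third column is the estimated inbreeding coefficient. Note that '''-inbreed 2''' restricts estimates to values between 0 and 1; use '''-inbreed 1''' if negative values should be allowed.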

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.

(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file

(EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

<pre>
wc -l $IN_DIR/Demo2sample.info
N=424 #one line for header

#run PCANGSD
$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>

Plot the results in R

<pre>
## R
cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Data/Demo2sample.info",head=T,stringsAsFactors=TRUE) # path relative to the Demo directory
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=seq_along(levels(ID$POP)),legend=levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple homogeneous disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCA offers a good description of the differences between individuals without having to define disjoint groups.

Now let's use the principal components to infer selection along the genome:

<code>$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

Plot the results:

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail avoids rounding tiny p-values to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-06T22:56:03Z

Sonia4: /* Download the beagle genotype likelihood input file */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Data</code>

<code>OUT_DIR=Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

To download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines is the number of loci for which there are GLs plus one (the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

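<pre>
# open R (paths assume R was started in the Demo directory; adjust if needed)
# stringsAsFactors=TRUE keeps the population labels as a factor (needed for colouring in R >= 4.0)
pop <- read.table("Data/Demo1pop.info", stringsAsFactors=TRUE)
C <- as.matrix(read.table("Results/Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))
dev.off()
## close R
</pre>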

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

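<pre>
# open R (paths assume R was started in the Demo directory; adjust if needed)
# stringsAsFactors=TRUE keeps the population labels as a factor (needed for colouring in R >= 4.0)
pop <- read.table("Data/Demo1pop.info", stringsAsFactors=TRUE)
C <- as.matrix(read.table("Results/Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),legend=levels(pop[,1]))
dev.off()
## close R
</pre>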

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on the first two PCs ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
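#open R (paths assume R was started in the Demo directory; adjust if needed)
pop<-read.table("Data/Demo1pop.info",as.is=T)

## check the name of the .qopt file that PCAngsd actually wrote
q<-read.table("Results/Demo3PCANGSD.K3.a50.0.qopt")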

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's estimate the inbreeding coefficient of the samples using the individual allele frequencies predicted by the PCA.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>
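
The join-and-sort step used in the two inbreeding sections can be tried on toy data first. This is a minimal sketch with hypothetical file names and values (toy.info, toy.inbreed). It assumes the label file has two space-separated columns, so that the pasted estimate becomes the third field.

```shell
tmp=$(mktemp -d)
# Hypothetical population and sample labels (two columns, space separated)
printf 'YRI ind1\nCEU ind2\nCHB ind3\n' > "$tmp/toy.info"
# Hypothetical inbreeding estimates, one per individual, in the same order
printf '0.12\n-0.03\n0.004\n' > "$tmp/toy.inbreed"

# paste joins the files line by line; -k3g sorts on the third field as a general number,
# and LC_ALL=C makes sure "." is read as the decimal point
sorted=$(paste "$tmp/toy.info" "$tmp/toy.inbreed" | LC_ALL=C sort -k3g)
echo "$sorted"

# Individuals with the lowest and highest estimate
lowest=$(echo "$sorted" | head -n 1 | awk '{print $2}')
highest=$(echo "$sorted" | tail -n 1 | awk '{print $2}')
echo "lowest=$lowest highest=$highest"
```

The general-numeric sort (g) is used rather than a plain numeric sort because it also handles values written in scientific notation.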

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the input and sample information files ====

There is a file with the positions and sample information.
(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file (EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

<pre>
wc Demo2sample.info
N=424 #one line for header

#run PCANGSD
$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>
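
Rather than typing the number of individuals by hand, it can be derived from the sample file. The following is a minimal sketch on a hypothetical file (toy.sample.info). The column layout, one header line followed by an ID column and a POP column separated by spaces, is an assumption and may differ from the real Demo2sample.info.

```shell
tmp=$(mktemp -d)
# Hypothetical sample file: one header line followed by one line per individual
printf 'ID POP\nind1 CEU\nind2 GBR\nind3 IBS\nind4 TSI\n' > "$tmp/toy.sample.info"

# Individuals = number of lines minus the header line
N=$(( $(wc -l < "$tmp/toy.sample.info") - 1 ))

# Individuals per population (second column, skipping the header)
per_pop=$(tail -n +2 "$tmp/toy.sample.info" | cut -d' ' -f2 | sort | uniq -c \
  | awk '{print $2"="$1}' | paste -sd, -)

echo "N=$N $per_pop"
```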

Plot the results in R

<pre>
## R
cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
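# stringsAsFactors=TRUE keeps POP as a factor (needed for colouring in R >= 4.0)
ID<-read.table("Demo2sample.info",head=T,stringsAsFactors=TRUE)
plot(e$vectors[,1:2],col=ID$POP,xlab="PC1",ylab="PC2")

legend("topleft",fill=seq_along(levels(ID$POP)),legend=levels(ID$POP))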
</pre>

Since the European individuals in 1000G are not simple, homogeneous, disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCA offers a good description of the differences between individuals without having to define disjoint groups.

Now let's use the principal components to infer selection along the genome:

<code>$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

Plot the results in R

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, 1 degree of freedom)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>
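
The lambda value printed by the QQ-plot function is the median of the observed statistics divided by the median of a chi-square distribution with one degree of freedom (about 0.4549364). As a quick check outside R, the following minimal sketch computes it on a hypothetical list of statistics (toy.selection, with made-up values), assuming one value per line.

```shell
tmp=$(mktemp -d)
# Hypothetical test statistics, one per site
printf '0.10\n0.45\n2.30\n0.02\n1.10\n' > "$tmp/toy.selection"

# Sort numerically, take the median, and divide by the chi-square (1 df) median
lambda=$(LC_ALL=C sort -g "$tmp/toy.selection" | LC_ALL=C awk '{x[NR]=$1} END {
  m = (NR % 2) ? x[(NR+1)/2] : (x[NR/2] + x[NR/2+1]) / 2
  printf "%.2f", m / 0.4549364
}')

echo "lambda=$lambda"
```

A value close to 1 suggests that the bulk of the statistics follows the expected distribution. Values well above 1 point to inflation.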

PCAngsdTutorial

2019-08-06T22:55:11Z

Sonia4: /* Download the input and population information files */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Data</code>

<code>OUT_DIR=Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines equals the number of loci for which there are GLs plus one (the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info",as.is=T)

###########CHECK NAME OF FILE###################
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (allowing for negative)

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the beagle genotype likelihood input file ====

There is a file with the positions and sample information.
(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file (EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

<pre>
wc Demo2sample.info
N=424 #one line for header

#run PCANGSD
$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>

Plot the results in R

<pre>
## R
cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple, homogeneous, disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCA offers a good description of the differences between individuals without having to define disjoint groups.

Now let's try to use the PC to infer selection along the genome based on the PCA

<code>$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

plot the results

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, 1 degree of freedom)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-06T22:54:24Z

Sonia4: /* Demo 2: Selection */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Data</code>

<code>OUT_DIR=Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines equals the number of loci for which there are GLs plus one (the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info",as.is=T)

###########CHECK NAME OF FILE###################
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (allowing for negative)

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the beagle genotype likelihood input file ====

There is a file with the positions and sample information.
(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file (EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

<pre>
wc Demo2sample.info
N=424 #one line for header

#run PCANGSD
$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>

Plot the results in R

<pre>
## R
cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple, homogeneous, disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCA offers a good description of the differences between individuals without having to define disjoint groups.

Now let's try to use the PC to infer selection along the genome based on the PCA

<code>$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

plot the results

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, 1 degree of freedom)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-06T22:54:09Z

Sonia4: /* Demo 2: Selection */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Data</code>

<code>OUT_DIR=Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines in the input file. The number of lines equals the number of loci for which there are GLs plus one (the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info")
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info",as.is=T)

###########CHECK NAME OF FILE###################
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is an estimate of the inbreeding coefficient (allowing for negative)

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the beagle genotype likelihood input file ====

There is a file with the positions and sample information.
(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file (EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

<pre>
wc Demo2sample.info
N=424 #one line for header

#run PCANGSD
$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>

Plot the results in R

<pre>
## R
cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
ID<-read.table("Demo2sample.info",head=T)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple, homogeneous, disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCA offers a good description of the differences between individuals without having to define disjoint groups.

Now let's try to use the PC to infer selection along the genome based on the PCA

<code>$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

plot the results

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, avoids rounding tiny p-values to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region (window clamped to the valid index range)
w<-range(which(pval<1e-7)) + c(-100,100)
w<-pmin(pmax(w,1),length(pval))
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-06T22:53:56Z

Sonia4: /* Demo 2: Selection */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=$PWD/Data</code>

<code>OUT_DIR=$PWD/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>
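
The setup steps above can also be collected into one re-runnable snippet. This is a minimal sketch, assuming a bash-like shell; the variable <code>DEMO_ROOT</code> is an illustrative name that does not appear elsewhere in the tutorial. It uses <code>mkdir -p</code> so that running it twice is harmless, and it stores absolute paths so that <code>$IN_DIR</code> and <code>$OUT_DIR</code> stay valid after changing directory.

```shell
# Hypothetical helper variable: DEMO_ROOT is an illustrative name.
DEMO_ROOT="${DEMO_ROOT:-$PWD/Demo}"

# -p creates missing parents and does not fail if the directory already exists.
mkdir -p "$DEMO_ROOT/Data" "$DEMO_ROOT/Results"

# Absolute paths stay valid regardless of the current working directory.
IN_DIR="$DEMO_ROOT/Data"
OUT_DIR="$DEMO_ROOT/Results"

# Test the links.
ls -d "$IN_DIR" "$OUT_DIR"
```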

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines of the input file. The number of lines equals the number of loci for which there are GLs plus one (the header line is counted too):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
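
The layout described above can be checked with a few lines of awk. This is a sketch on a made-up file with two sites and two individuals (<code>toy2.beagle</code> is a hypothetical name). For the real data, the same awk program could read from <code>gunzip -c $IN_DIR/Demo1input.gz</code>, assuming the file is tab-separated.

```shell
# Toy Beagle file: header line, then one line per site (values are made up).
{
  printf 'marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n'
  printf 'chr1_100\t0\t2\t0.90\t0.10\t0.00\t0.25\t0.50\t0.25\n'
  printf 'chr1_200\t1\t3\t0.00\t0.20\t0.80\t0.33\t0.33\t0.34\n'
} > toy2.beagle

# Skip the header; for each site, sum every triple of GL columns and
# count the triples whose sum differs from 1 by more than a small tolerance.
summary=$(awk -F'\t' 'NR > 1 {
    sites++
    for (i = 4; i + 2 <= NF; i += 3) {
      total = $i + $(i + 1) + $(i + 2)
      if (total < 0.999 || total > 1.001) bad++
    }
  }
  END { printf "sites=%d unnormalized_triples=%d", sites, bad }' toy2.beagle)

echo "$summary"
```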

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
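## paths assume R was started in the Demo directory; factors are needed for colouring
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Results/Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),levels(pop[,1]))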
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
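## paths assume R was started in the Demo directory; factors are needed for colouring
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Results/Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),levels(pop[,1]))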
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
## paths assume R was started in the Demo directory
pop<-read.table("Data/Demo1pop.info",as.is=TRUE)

## check the exact name of the .qopt file written to the Results directory
q<-read.table("Results/Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join the names and the results, sort by the estimate, and look at the results:

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimate of the inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join the names and the results, sort by the estimate, and look at the results:

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italy (Tuscany)</td>
</tr>
:</table>

==== Download the beagle genotype likelihood input file ====

There is a file with the positions and sample information.
(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file (EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

<pre>
wc -l $IN_DIR/Demo2sample.info
N=424 #one line for header

#run PCANGSD
$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>

Plot the results in R

<pre>
## R
cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
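## path assumes R was started in the Demo directory; factors are needed for colouring
ID<-read.table("Data/Demo2sample.info",head=TRUE,stringsAsFactors=TRUE)
plot(e$vectors[,1:2],col=ID$POP,xlab="PC1",ylab="PC2")

legend("topleft",fill=seq_along(levels(ID$POP)),levels(ID$POP))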
</pre>

Since the European individuals in 1000G are not simple, homogeneous, disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCA offers a good description of the differences between individuals without having to define disjoint groups.

Now let's use the principal components to scan for selection along the genome:

<code>$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

Plot the results in R

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, avoids rounding tiny p-values to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region (window clamped to the valid index range)
w<-range(which(pval<1e-7)) + c(-100,100)
w<-pmin(pmax(w,1),length(pval))
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>
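
The inflation factor printed by <code>qqchi</code> is the median of the observed statistics divided by the median of a chi-square distribution with one degree of freedom, <code>qchisq(0.5,1)</code>, which is about 0.4549. As a rough cross-check outside R, the same ratio can be computed in the shell. The sketch below uses made-up statistics in a hypothetical file <code>toy.selection</code>. For the real output, the <code>cat</code> could be replaced with <code>gunzip -c EUsmall.selection.gz</code>, assuming one statistic per line.

```shell
# Made-up test statistics, one per line (illustration only).
printf '0.10\n0.45\n2.30\n0.02\n1.10\n' > toy.selection

# Sort numerically, take the median, divide by the chi-square(1) median.
lambda=$(cat toy.selection | LC_ALL=C sort -g | LC_ALL=C awk '
  { x[NR] = $1 }
  END {
    if (NR % 2) med = x[(NR + 1) / 2]
    else        med = (x[NR / 2] + x[NR / 2 + 1]) / 2
    printf "%.2f", med / 0.4549364
  }')

echo "lambda = $lambda"
```

A value close to 1 suggests that the bulk of the test statistics follows the expected null distribution.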

PCAngsdTutorial

2019-08-06T22:53:19Z

Sonia4: /* Inbreeding with individual allele frequencies */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=$PWD/Data</code>

<code>OUT_DIR=$PWD/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines of the input file. The number of lines equals the number of loci for which there are GLs plus one (the header line is counted too):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
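## paths assume R was started in the Demo directory; factors are needed for colouring
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Results/Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),levels(pop[,1]))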
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
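## paths assume R was started in the Demo directory; factors are needed for colouring
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Results/Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),levels(pop[,1]))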
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
## paths assume R was started in the Demo directory
pop<-read.table("Data/Demo1pop.info",as.is=TRUE)

## check the exact name of the .qopt file written to the Results directory
q<-read.table("Results/Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join the names and the results, sort by the estimate, and look at the results:

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimate of the inbreeding coefficient (negative values are allowed).
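
To see what the <code>paste</code> and <code>sort</code> step does, here is a small sketch on made-up data. The file names and values are hypothetical, and it assumes the population file has two space-separated columns, so that the pasted estimate becomes the third whitespace-separated field.

```shell
# Hypothetical two-column population file and one estimate per individual.
printf 'popA ind1\npopB ind2\npopC ind3\n' > toy.pop.info
printf '0.031\n-0.012\n0.250\n' > toy.inbreed

# paste joins the files line by line (tab-separated); sort -k3g orders the
# rows by the third field using general numeric comparison, so negative
# values sort first.
sorted=$(paste toy.pop.info toy.inbreed | LC_ALL=C sort -k3g)

echo "$sorted"
```

Individuals with the highest estimates end up at the bottom of the listing.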

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join the names and the results, sort by the estimate, and look at the results:

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italy (Tuscany)</td>
</tr>
:</table>

==== Download the beagle genotype likelihood input file ====

There is a file with the positions and sample information.
(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file (EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

<pre>
wc -l $IN_DIR/Demo2sample.info
N=424 #one line for header

#run PCANGSD
$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>

Plot the results in R

<pre>
## R
cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
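## path assumes R was started in the Demo directory; factors are needed for colouring
ID<-read.table("Data/Demo2sample.info",head=TRUE,stringsAsFactors=TRUE)
plot(e$vectors[,1:2],col=ID$POP,xlab="PC1",ylab="PC2")

legend("topleft",fill=seq_along(levels(ID$POP)),levels(ID$POP))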
</pre>

Since the European individuals in 1000G are not simple, homogeneous, disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCA offers a good description of the differences between individuals without having to define disjoint groups.

Now let's use the principal components to scan for selection along the genome:

<code>$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

Plot the results in R

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, avoids rounding tiny p-values to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region (window clamped to the valid index range)
w<-range(which(pval<1e-7)) + c(-100,100)
w<-pmin(pmax(w,1),length(pval))
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>

PCAngsdTutorial

2019-08-06T22:52:08Z

Sonia4: /* Without Estimating Individual Allele Frequencies */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=$PWD/Data</code>

<code>OUT_DIR=$PWD/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines of the input file. The number of lines equals the number of loci for which there are GLs plus one (the header line is counted too):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
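## paths assume R was started in the Demo directory; factors are needed for colouring
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Results/Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),levels(pop[,1]))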
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
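## paths assume R was started in the Demo directory; factors are needed for colouring
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Results/Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),levels(pop[,1]))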
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
## paths assume R was started in the Demo directory
pop<-read.table("Data/Demo1pop.info",as.is=TRUE)

## check the exact name of the .qopt file written to the Results directory
q<-read.table("Results/Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join the names and the results, sort by the estimate, and look at the results:

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimate of the inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's try to estimate the inbreeding coefficient of the samples by using the individual allele frequencies predicted by the PCA

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join the names and the results, sort by the estimate, and look at the results:

<code>paste $IN_DIR/Demo1pop.info $OUT_DIR/Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

For very recent selection we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italy (Tuscany)</td>
</tr>
:</table>

==== Download the beagle genotype likelihood input file ====

There is a file with the positions and sample information.
(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file (EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

<pre>
wc -l $IN_DIR/Demo2sample.info
N=424 #one line for header

#run PCANGSD
$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>

Plot the results in R

<pre>
## R
cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
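## path assumes R was started in the Demo directory; factors are needed for colouring
ID<-read.table("Data/Demo2sample.info",head=TRUE,stringsAsFactors=TRUE)
plot(e$vectors[,1:2],col=ID$POP,xlab="PC1",ylab="PC2")

legend("topleft",fill=seq_along(levels(ID$POP)),levels(ID$POP))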
</pre>

Since the European individuals in 1000G are not simple, homogeneous, disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCA offers a good description of the differences between individuals without having to define disjoint groups.

Now let's use the principal components to scan for selection along the genome:

<code>$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

Plot the results in R

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, avoids rounding tiny p-values to 0)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region (window clamped to the valid index range)
w<-range(which(pval<1e-7)) + c(-100,100)
w<-pmin(pmax(w,1),length(pval))
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>
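
The most significant site can also be looked up without R. The following is a sketch on made-up data: <code>toy.sites</code> and <code>toy.stats</code> are hypothetical stand-ins for <code>EUsmall.sites</code> and the decompressed <code>EUsmall.selection.gz</code>. It assumes both files have one line per site in the same order, with site names of the form <code>chr_pos</code>.

```shell
# Made-up site names and test statistics (illustration only).
printf '2_1000\n2_2000\n2_3000\n' > toy.sites
printf '1.5\n42.0\n0.3\n' > toy.stats

# Pair each site with its statistic, sort by the statistic in descending
# order, and keep the top row.
top=$(paste toy.sites toy.stats | LC_ALL=C sort -t "$(printf '\t')" -k2,2gr | head -n 1)

# Split the site name at the underscore to recover chromosome and position.
top_chr=$(printf '%s\n' "$top" | cut -f 1 | cut -d _ -f 1)
top_pos=$(printf '%s\n' "$top" | cut -f 1 | cut -d _ -f 2)

echo "top site: chr $top_chr, position $top_pos"
```

Like the R code, this splitting would not handle chromosome names that themselves contain an underscore.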

PCAngsdTutorial

2019-08-06T22:51:50Z

Sonia4: /* Estimating Individual Allele Frequencies */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCANGSD
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=$PWD/Data</code>

<code>OUT_DIR=$PWD/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines of the input file. The number of lines equals the number of loci for which there are GLs plus one (the header line is counted too):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
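## paths assume R was started in the Demo directory; factors are needed for colouring
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Results/Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),levels(pop[,1]))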
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
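## paths assume R was started in the Demo directory; factors are needed for colouring
pop<-read.table("Data/Demo1pop.info",stringsAsFactors=TRUE)
C <- as.matrix(read.table("Results/Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=seq_along(levels(pop[,1])),levels(pop[,1]))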
dev.off()
## close R
</pre>

To view the plot:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info",as.is=T)

###########CHECK NAME OF FILE###################
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient (negative values are allowed).
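To illustrate what the join-and-sort step does, here is a minimal sketch on made-up data. The labels and coefficients are invented, and it assumes a two-column pop.info and one estimate per line in the .inbreed file, which is what makes the estimate the third column after <code>paste</code>.

```shell
tmp=$(mktemp -d)
# Hypothetical stand-ins for pop.info (two columns) and the .inbreed output (one value per line).
printf 'YRI ind1\nCEU ind2\nCHB ind3\n' > "$tmp/pop.info"
printf '0.12\n-0.03\n4e-2\n' > "$tmp/toy.inbreed"

# Side-by-side join, then general-numeric sort on column 3 (handles negatives and exponents).
sorted=$(paste "$tmp/pop.info" "$tmp/toy.inbreed" | LC_ALL=C sort -k3g)
echo "$sorted"

# The individual with the highest estimate ends up on the last line.
top=$(echo "$sorted" | tail -n 1 | awk '{ print $2 }')
echo "highest estimate: $top"
rm -r "$tmp"
```

The <code>g</code> modifier matters here: a plain numeric sort would not interpret values written in exponent notation such as 4e-2.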

==== Inbreeding with individual allele frequencies ====

Now let's estimate the inbreeding coefficient of the samples using the individual allele frequencies predicted by the PCA:

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results:

<code>paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

To detect very recent selection, we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the beagle genotype likelihood input file ====

There is a file with the positions and sample information.
(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file (EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

<pre>
wc Demo2sample.info
N=424 #one line for header

#run PCANGSD
$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>

Plot the results in R

<pre>
## R
cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
# read POP as a factor so it can be used for colours and legend levels
ID<-read.table("Demo2sample.info",head=T,stringsAsFactors=TRUE)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple, homogeneous, disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCA offers a good description of the differences between individuals without having to define disjoint groups.

Now let's use the principal components to infer selection along the genome:

<code>$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

Plot the results in R

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, 1 degree of freedom)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>
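Outside R, the same look-up of the strongest signal can be done on the command line. This is a minimal sketch, assuming (as the R code above does) one statistic per line in the gzipped selection file and one chr_pos site name per line in the sites file, in the same order. The toy file names and values are invented.

```shell
tmp=$(mktemp -d)
# Hypothetical stand-ins for EUsmall.sites and EUsmall.selection.gz.
printf '2_1000\n2_2000\n6_3000\n' > "$tmp/toy.sites"
printf '1.7\n42.5\n3.2\n' | gzip > "$tmp/toy.selection.gz"

# Pair each site with its statistic and sort by the statistic, largest first.
gunzip -c "$tmp/toy.selection.gz" > "$tmp/toy.selection"
ranked=$(paste "$tmp/toy.sites" "$tmp/toy.selection" | LC_ALL=C sort -k2,2gr)
echo "$ranked"

# The first line now holds the site with the largest test statistic.
top_site=$(echo "$ranked" | head -n 1 | cut -f 1)
echo "most significant site: $top_site"
rm -r "$tmp"
```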

PCAngsdTutorial

2019-08-06T22:51:41Z

Sonia4: /* Estimating Individual Allele Frequencies */

We will go through some examples on how to use PCAngsd with visualization of the data.

==== Set the path to PCAngsd ====

Every time you open a new terminal window, set the paths to the program and the input file.

<code>
PCANGSD=~/Software/pcangsd
</code>

Test the link

<code>
ls $PCAngsd
</code>

==== Create directories ====
Create the directories that will be used for working:

<code>mkdir Demo</code>

<code>cd Demo</code>

<code>mkdir Data</code>

<code>mkdir Results</code>

==== Set the paths to your local directories ====

<code>IN_DIR=Demo/Data</code>

<code>OUT_DIR=Demo/Results</code>

Test the links

<code>ls $IN_DIR</code>

<code>ls $OUT_DIR</code>

=== Demo 1: Allele Frequencies ===

This example will perform a PCA analysis on 1000 genotype likelihoods.

==== Download the input and population information files ====

PCAngsd uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.

The population information file is also provided.

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>

<code>mv Demo1input.gz $IN_DIR</code>

<code>mv Demo1pop.info $IN_DIR</code>

'''*ANDERS*'''

==== View the genotype likelihood beagle file ====

*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.

*In order to see the first 10 columns and 10 lines of the input file, type:

:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>

*Use this command to count the number of lines of the input file. The number of lines indicates the number of loci for which there are GLs, plus one (the count includes the header line):

:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
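The normalization described above can be checked directly. This is a minimal sketch on a made-up two-individual beagle file (the values are invented; only the layout follows the description): for every site, each individual's three GLs should sum to one, up to rounding.

```shell
tmp=$(mktemp -d)
# Toy beagle file: header, then two sites with GL triples for two individuals.
printf 'marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n' > "$tmp/toy.beagle"
printf 'chr1_100\t0\t2\t0.90\t0.09\t0.01\t0.33\t0.34\t0.33\n' >> "$tmp/toy.beagle"
printf 'chr1_200\t1\t3\t0.05\t0.15\t0.80\t0.98\t0.01\t0.01\n' >> "$tmp/toy.beagle"
gzip "$tmp/toy.beagle"

# Skip the header, walk the GL columns in triples, and count triples whose sum deviates from 1.
bad=$(gunzip -c "$tmp/toy.beagle.gz" | awk -F'\t' '
  NR > 1 {
    for (i = 4; i + 2 <= NF; i += 3) {
      s = $i + $(i + 1) + $(i + 2)
      if (s < 0.999 || s > 1.001) bad++
    }
  }
  END { print bad + 0 }')
echo "triples not summing to one: $bad"
rm -r "$tmp"
```

The tolerance of 0.001 is an arbitrary choice for this sketch; real files may need a looser bound depending on how many decimals were written.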

==== View population information file ====

To view a summary of the population information file, cut the first column, sort and count:

<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>

Let's make a population label file and place it in the output directory:

<code> cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel </code>

=== Run PCAngsd ===

The program estimates the covariance matrix that can be used for PCA.

==== Estimating Individual Allele Frequencies ====

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo1PCANGSD</code>

Plot the results in R

<pre>
#open R
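# read the labels as factors so they can be used for colours and legend levels
pop<-read.table("pop.info",stringsAsFactors=TRUE)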
C <- as.matrix(read.table("Demo1PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd1.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2", main="individual allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot, type:

<code>evince PCAngsd1.pdf</code>

==== Without Estimating Individual Allele Frequencies ====

Try the same analysis but without estimating individual allele frequencies.
This is the same as using the first iteration of the algorithm.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo2PCANGSD -iter 0</code>

Plot the results in R

<pre>
#open R
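# read the labels as factors so they can be used for colours and legend levels
pop<-read.table("pop.info",stringsAsFactors=TRUE)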
C <- as.matrix(read.table("Demo2PCANGSD.cov"))
e <- eigen(C)

pdf("PCAngsd2.pdf")
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2",main="joint allele frequency")
legend("top",fill=1:5,levels(pop[,1]))
dev.off()
## close R
</pre>

To view the plot:

<code>evince PCAngsd2.pdf</code>

==== Admixture based on 1st two PC ====
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha).

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -admix -admix_alpha 50</code>

Plot the results in R

<pre>
#open R
pop<-read.table("pop.info",as.is=T)

###########CHECK NAME OF FILE###################
q<-read.table("Demo3PCANGSD.K3.a50.0.qopt")

## order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## close R
</pre>

==== Inbreeding in the admixed individuals ====
Inbreeding in admixed samples is usually not possible to estimate using standard approaches. Let's try to estimate the inbreeding coefficient of the samples using the average allele frequency.

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo3PCANGSD -inbreed 2 -iter 0</code>

Join names and results, sort the values and look at the results

<code>paste pop.info Demo3PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

The third column is the estimated inbreeding coefficient (negative values are allowed).

==== Inbreeding with individual allele frequencies ====

Now let's estimate the inbreeding coefficient of the samples using the individual allele frequencies predicted by the PCA:

<code>$PCANGSD -beagle $IN_DIR/Demo1input.gz -o $OUT_DIR/Demo4PCANGSD -inbreed 2 </code>

Join names and results, sort the values and look at the results:

<code>paste pop.info Demo4PCANGSD.inbreed | LC_ALL=C sort -k3g</code>

=== Demo 2: Selection ===

To detect very recent selection, we can look within closely related individuals, for example within Europeans.

The objective is to use PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.

Data file:

*Genotype likelihoods in Beagle format
*~150k random SNPs with maf > 5%
*Four EU populations with ~100 individuals in each

:<table class="muse-table" border="2" cellpadding="5">
<tr>
<td>CEU</td>
<td>European (mostly British)</td>
</tr>
<tr>
<td>GBR</td>
<td>Great Britain</td>
</tr>
<tr>
<td>IBS</td>
<td>Iberian/Spain</td>
</tr>
<tr>
<td>TSI</td>
<td>Italian</td>
</tr>
:</table>

==== Download the beagle genotype likelihood input file ====

There is a file with the positions and sample information.
(cp $ThePath/PCangsd/data/eu1000g.sample.Info)

And a beagle file (EU1000=$ThePath/PCangsd/data/eu1000g.small.beagle.gz)

Download the files and move them to your input folder (for example, $IN_DIR):

'''*ANDERS*'''

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz</code>

<code>wget popgen.dk/software/download/NGSadmix/data/Demo2sample.info</code>

<code>mv Demo2input.gz $IN_DIR</code>

<code>mv Demo2sample.info $IN_DIR</code>

'''*ANDERS*'''

<pre>
wc Demo2sample.info
N=424 #one line for header

#run PCANGSD
$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -threads 4
#-n $N -threads 20 # from previous year...
</pre>
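Before a long run, it can be worth confirming that the sample file and the beagle file describe the same number of individuals. This is a minimal sketch on made-up files, assuming one header line in the sample file (as noted above) and three GL columns per individual in the beagle file. The column name ID and all values are hypothetical.

```shell
tmp=$(mktemp -d)
# Hypothetical sample file: header plus one line per individual.
printf 'ID POP\nind1 CEU\nind2 TSI\n' > "$tmp/toy.sample.info"
# Hypothetical beagle file for the same two individuals, with a single site.
printf 'marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n' > "$tmp/toy.beagle"
printf '1_100\t0\t2\t0.90\t0.09\t0.01\t0.33\t0.34\t0.33\n' >> "$tmp/toy.beagle"
gzip "$tmp/toy.beagle"

# Individuals according to the sample file: lines minus the header.
n_info=$(awk 'END { print NR - 1 }' "$tmp/toy.sample.info")
# Individuals according to the beagle header: (columns - 3) / 3.
n_beagle=$(gunzip -c "$tmp/toy.beagle.gz" | awk -F'\t' 'NR == 1 { print (NF - 3) / 3 }')

if [ "$n_info" -eq "$n_beagle" ]; then
  status="match"
else
  status="mismatch"
fi
echo "sample file: $n_info, beagle file: $n_beagle ($status)"
rm -r "$tmp"
```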

Plot the results in R

<pre>
## R
cov <- as.matrix(read.table("EUsmall.cov"))

e<-eigen(cov)
# read POP as a factor so it can be used for colours and legend levels
ID<-read.table("Demo2sample.info",head=T,stringsAsFactors=TRUE)
plot(e$vectors[,1:2],col=ID$POP)

legend("topleft",fill=1:4,levels(ID$POP))
</pre>

Since the European individuals in 1000G are not simple, homogeneous, disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences. However, PCA offers a good description of the differences between individuals without having to define disjoint groups.

Now let's use the principal components to infer selection along the genome:

<code>$PCANGSD -beagle $IN_DIR/Demo2input.gz -o EUsmall -selection -sites_save #-n $N </code>

Plot the results in R

<pre>
## function for QQplot
qqchi<-function(x,...){
lambda<-round(median(x)/qchisq(0.5,1),2)
qqplot(qchisq((1:length(x)-0.5)/(length(x)),1),x,ylab="Observed",xlab="Expected",...);abline(0,1,col=2,lwd=2)
legend("topleft",paste("lambda=",lambda))
}

### read in selection statistics (chi2 distributed)
s<-scan("EUsmall.selection.gz")
## make QQ plot to QC the test statistics
qqchi(s)

# convert test statistic to p-value (upper tail, 1 degree of freedom)
pval<-pchisq(s,1,lower.tail=FALSE)

## read positions (hg38)
p<-read.table("EUsmall.sites",colClasses=c("factor","integer"),sep="_")

names(p)<-c("chr","pos")

## make Manhattan plot
plot(-log10(pval),col=p$chr,xlab="Chromosomes",main="Manhattan plot")

## zoom into region
w<-range(which(pval<1e-7)) + c(-100,100)
keep<-w[1]:w[2]
plot(p$pos[keep],-log10(pval[keep]),col=p$chr[keep],xlab="HG38 Position chr2")

## see the position of the most significant SNP
p$pos[which.max(s)]

</pre>