Revision as of 14:24, 5 January 2017

This page contains information about the program called FastNGSadmixPCA, which is a very fast tool for finding admixture proportions from NGS data of a single individual to incorporate into PCA of NGS data. It is based on genotype likelihoods. The program is written in R.

Installation

wget http://popgen.dk/albrecht/kristian/tool_download.zip
unzip tool_download.zip
OR simply use SHINY:
http://popgen.dk:443/kristian/admixpca_human/

Run example

tool.zip contains all files needed to execute FASTNGSAdmixPCA. The sample is from the HAPMAP project. In need of more samples, one can find a couple more samples in http://popgen.dk/albrecht/kristian/ The Rscript below executes the tool. all output is directed to a output_folder that is created in the process. To see the preset: Rscript FastNGSAdmixPCA.r

Rscript FastNGSAdmixPCA.r infile=NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz

All arguments can be altered. To alter the reference populations, one need to write comma separated populations to the refpops argument as shown below

Rscript FastNGSAdmixPCA.r infile=NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz refpops=YRI,JPT,CHB,CEU

To get an overview of available reference populations, one can make a dry run

Rscript FastNGSAdmixPCA.r infile=TRUE dryrun=TRUE

Input Files

Input files are contains genotype likelihoods in genotype likelihood beagle input file format [1]. We recommend [ANGSD] for easy transformation of Next-generation sequencing data to beagle format.

The example below show how to make a beagle file of genotype likelihood using ANGSD.

HOME$ ./angsd0.594/angsd -i 'pathtoindi.bam' -GL 2 -sites 'SNP.sites' -doGlf 2 -doMajorMinor 3 -minMapQ 30 -minQ 20 -doDepth 1 -doCounts 1 -out indi_genotypelikelihood

Example of a beagle genotype likelihood input file for 3 individuals.

marker       allele1  allele2   Ind0      Ind0    Ind0
1_14000023      1       0       0.941    0.058    0.000
1_14000072      2       3       0.709    0.177    0.112
1_14000113      0       2       0.855    0.106    0.037
1_14000202      2       0       0.835    0.104    0.060
...

version 2

Input files are genotype likelihoods in the genotype likelihood beagle input file format [2]. Or called genotypes in the binary plink files (*.bed) format [3] We recommend [ANGSD] for easy transformation of Next-generation sequencing data to beagle format and plink2 for handling plink files.

The example below show how to make a beagle file of genotype likelihood using ANGSD.

HOME$ ./angsd0.594/angsd -i 'pathtoindi.bam' -GL 2 -sites 'SNP.sites' -doGlf 2 -doMajorMinor 3 -minMapQ 30 -minQ 20 -doDepth 1 -doCounts 1 -out indi_genotypelikelihood

Example of a beagle genotype likelihood input file for 1 individual.

marker       allele1  allele2   Ind0      Ind0    Ind0
1_14000023      1       0       0.941    0.058    0.000
1_14000072      2       3       0.709    0.177    0.112
1_14000113      0       2       0.855    0.106    0.037
1_14000202      2       0       0.835    0.104    0.060
...

A provided SNP.sites file has been included.

The program also needs frequencies of a reference panel with the populations for which admixture proportions should be estimated, for instance from 1000 G or HGDP, or another custom made reference panel. The program also needs a file telling the size of each reference panel population. There is an R script called plinkToRefV2.R, that can convert a plink file to a reference panel and size of reference panel populations.

fastNGSadmix already comes with a premade reference panel, made from Lazaridis et al. (2014) where the curated dataset was selected.

I lifted the dataset hg19 using the program liftOver, I then translated snpNames to rs names, using 1000G data, generating a unique name for each site via "chr-pos-A1-A2" (where A1 and A2 are alphabetically sorted).

Furthermore I removed sites with more than 5 % missing and a MAF below 5 %, and only autosomal sites. I selected 5 populations French, Han, Karitiana, Papuan and Yoruba to have representation for most of the world. Furthermore I made sure that I only used unadmixed individuals within each population.

An example of a command running fastNGSadmix with a beagle file and a chosen K of 3:

./fastNGSadmix -likes indi_genotypelikelihood.beagle -fname refPanel.txt -Nname nInd.txt -outfiles indi_genotypelikelihood.beagle -K 3

It then produces two files indi_genotypelikelihood.beagle.qopt with the admixture proportions and indi_genotypelikelihood.beagle.log.

Or with a plink file:

./fastNGSadmix -plink plinkFile -fname refPanel.txt -Nname nInd.txt -outfiles plinkFile -K 3

OPTIONS OPTIONS OPTIONS!??

Custom refpanel can be supplied, has to look like this, where the 5 first columns have to be, then populations frequencies:

chr,pos,name,A0,A1

The frequencies have to be of the A0 allele.

Basically the files have to look like this: bgl rs1 A B GL(AA) GL(AB) GL(BB) Then ref 1 1 rs1 B A f(B)

Then solution is this: bgl rs1 A B GL(AA) GL(AB) GL(BB) Then ref 1 1 rs1 B A 1-f(B)

Then prepFreqs.R will take care of preparing the files properly.

Example of running prepFreqs.R:

Rscript prepFreqs.R indi_genotypelikelihood.bgl

Can also specify other populations than the default 5 ones in the reference panel. Also a custom made reference panel can be supplied (has to be .Rdata file).

Rscript prepFreqs.R indi_genotypelikelihood.bgl Pop1,Pop2,Pop3,Pop4 customRefPanel

indi_genotypelikelihood

Then run prepFreqs.R to get the proper beagle, refpanel and nInd files for the analysis.

Then run fastNGSadmix.

All the awesome options with the program.

So bgl rs1 A B GL(AA) GL(AB) GL(BB) Then ref 1 1 rs1 A B f(A)

(So if the 3 columns with genotype likelihoods in the beagle file is coded like this AA AB BB, then the frequencies should be of the A allele.)

Furthermore a file with the number of individuals in each reference population should be supplied.

Then a lot of different options and filters can be specified:

(TO BE CONTINUED...)

FastNgsAdmixOld: Difference between revisions

Revision as of 14:24, 5 January 2017

Contents

Installation

Run example

Input Files

version 2

Navigation menu

@@ Line 48: / Line 48: @@
 =version 2=
-Input files are contains genotype likelihoods in genotype likelihood beagle input file format [http://faculty.washington.edu/browning/beagle/beagle.html]. We recommend [ANGSD] for easy transformation of Next-generation sequencing data to beagle format.
+Input files are genotype likelihoods in the genotype likelihood beagle input file format [http://faculty.washington.edu/browning/beagle/beagle.html]. Or called genotypes in the binary plink files (*.bed) format [https://www.cog-genomics.org/plink2]
+We recommend [ANGSD] for easy transformation of Next-generation sequencing data to beagle format and plink2 for handling plink files.
 The example below show how to make a beagle file of genotype likelihood using ANGSD.
@@ Line 55: / Line 56: @@
 </pre>
-Example of a beagle genotype likelihood input file for 3 individuals.
+Example of a beagle genotype likelihood input file for 1 individual.
 <pre>
 marker       allele1  allele2   Ind0      Ind0    Ind0
@@ Line 69: / Line 70: @@
 The program also needs frequencies of a reference panel with the populations for which admixture proportions should be estimated,
-for instance from 1000 G or HGDP, or a custom made reference panel.
+for instance from 1000 G or HGDP, or another custom made reference panel. The program also needs a file telling the size of each reference panel population.
+There is an R script called plinkToRefV2.R, that can convert a plink file to a reference panel and size of reference panel populations.
-However it comes with a premade reference panel:
+fastNGSadmix already comes with a premade reference panel, made from Lazaridis et al. (2014) where the curated dataset was selected.
-Reference panel and genotypes are taken from Lazaridis et al. (2014) where the curated dataset was selected.
 I lifted the dataset hg19 using the program liftOver, I then translated snpNames to rs names, using 1000G data, generating a unique name for each site via "chr-pos-A1-A2" (where A1 and A2 are alphabetically sorted).
@@ Line 80: / Line 80: @@
 I selected 5 populations French, Han, Karitiana, Papuan and Yoruba to have representation for most of the world.
 Furthermore I made sure that I only used unadmixed individuals within each population.
+An example of a command running fastNGSadmix with a beagle file and a chosen K of 3:
+<pre>
+./fastNGSadmix -likes indi_genotypelikelihood.beagle -fname refPanel.txt -Nname nInd.txt -outfiles indi_genotypelikelihood.beagle -K 3
+</pre>
+It then produces two files indi_genotypelikelihood.beagle.qopt with the admixture proportions and indi_genotypelikelihood.beagle.log.
+Or with a plink file:
+<pre>
+./fastNGSadmix -plink plinkFile -fname refPanel.txt -Nname nInd.txt -outfiles plinkFile -K 3
+</pre>
+OPTIONS OPTIONS OPTIONS!??
@@ Line 129: / Line 149: @@
 Furthermore a file with the number of individuals in each reference population should be supplied.
-An example of a command:
-<pre>
-./fastNGSadmix -likes Yoruba10Japanese65Han25_3000000_d10_N10_GL.txt -fname Yoruba10Japanese65Han25_3000000_d10_N10_Ref.txt -Nname sYoruba10Japanese65Han25_3000000_d10_N10_nInd.txt -outfiles Yoruba10Japanese65Han25_3000000_d10_N10
-</pre>
 Then a lot of different options and filters can be specified:
 (TO BE CONTINUED...)