FastNgsAdmixOld: Difference between revisions

Revision as of 14:24, 5 January 2017

This page contains information about the program called FastNGSadmixPCA, which is a very fast tool for finding admixture proportions from NGS data of a single individual to incorporate into PCA of NGS data. It is based on genotype likelihoods. The program is written in R.

Installation

wget http://popgen.dk/albrecht/kristian/tool_download.zip
unzip tool_download.zip
Or simply use the Shiny web app:
http://popgen.dk:443/kristian/admixpca_human/

Run example

The downloaded archive contains all files needed to run FastNGSadmixPCA. The sample is from the HapMap project. If more samples are needed, a few more can be found at http://popgen.dk/albrecht/kristian/ The Rscript command below runs the tool. All output is written to an output_folder that is created in the process. To see the preset values, run: Rscript FastNGSAdmixPCA.r

Rscript FastNGSAdmixPCA.r infile=NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz

All arguments can be altered. To alter the reference populations, pass a comma-separated list of populations to the refpops argument, as shown below:

Rscript FastNGSAdmixPCA.r infile=NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz refpops=YRI,JPT,CHB,CEU

To get an overview of the available reference populations, make a dry run:

Rscript FastNGSAdmixPCA.r infile=TRUE dryrun=TRUE


Input Files

Input files contain genotype likelihoods in the beagle genotype likelihood input file format [1]. We recommend [ANGSD] for easy conversion of next-generation sequencing data to beagle format.

The example below shows how to make a beagle file of genotype likelihoods using ANGSD.

HOME$ ./angsd0.594/angsd -i 'pathtoindi.bam' -GL 2 -sites 'SNP.sites' -doGlf 2 -doMajorMinor 3 -minMapQ 30 -minQ 20 -doDepth 1 -doCounts 1 -out indi_genotypelikelihood

Example of a beagle genotype likelihood input file for 1 individual (three genotype likelihood columns per individual).

marker       allele1  allele2   Ind0      Ind0    Ind0
1_14000023      1       0       0.941    0.058    0.000
1_14000072      2       3       0.709    0.177    0.112
1_14000113      0       2       0.855    0.106    0.037
1_14000202      2       0       0.835    0.104    0.060
...


version 2

Input files are either genotype likelihoods in the beagle genotype likelihood input file format [2], or called genotypes in the binary plink (*.bed) format [3]. We recommend [ANGSD] for easy conversion of next-generation sequencing data to beagle format, and plink2 for handling plink files.

The example below shows how to make a beagle file of genotype likelihoods using ANGSD.

HOME$ ./angsd0.594/angsd -i 'pathtoindi.bam' -GL 2 -sites 'SNP.sites' -doGlf 2 -doMajorMinor 3 -minMapQ 30 -minQ 20 -doDepth 1 -doCounts 1 -out indi_genotypelikelihood

Example of a beagle genotype likelihood input file for 1 individual.

marker       allele1  allele2   Ind0      Ind0    Ind0
1_14000023      1       0       0.941    0.058    0.000
1_14000072      2       3       0.709    0.177    0.112
1_14000113      0       2       0.855    0.106    0.037
1_14000202      2       0       0.835    0.104    0.060
...
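
As a rough illustration of the layout above, the following Python sketch parses rows of a single-individual beagle genotype likelihood file and reports which of the three genotype columns has the highest likelihood. The function names parse_beagle_row and most_likely_genotype are hypothetical helpers, not part of fastNGSadmix or ANGSD, and the sketch assumes whitespace-separated columns exactly as in the example.

```python
# Minimal sketch, assuming one individual and whitespace-separated columns:
# marker, allele1, allele2, followed by three genotype likelihoods.

def parse_beagle_row(line):
    """Split one data row into (marker, allele1, allele2, [gl0, gl1, gl2])."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 columns for a single individual")
    marker, allele1, allele2 = fields[:3]
    likes = [float(x) for x in fields[3:]]
    return marker, allele1, allele2, likes


def most_likely_genotype(likes):
    """Return the index (0, 1 or 2) of the largest genotype likelihood."""
    return max(range(3), key=lambda i: likes[i])


# Two data rows copied from the example file above (header line skipped).
rows = [
    "1_14000023      1       0       0.941    0.058    0.000",
    "1_14000072      2       3       0.709    0.177    0.112",
]

for row in rows:
    marker, a1, a2, likes = parse_beagle_row(row)
    # The three likelihoods of a site should sum to roughly 1.
    print(marker, most_likely_genotype(likes), round(sum(likes), 3))
```

A file with more individuals would carry three further columns per individual, so the fixed column count used here only fits the single-individual case.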

A SNP.sites file is included with the program.


The program also needs the allele frequencies of a reference panel containing the populations for which admixture proportions should be estimated, for instance from 1000G or HGDP, or another custom-made reference panel. It also needs a file giving the number of individuals in each reference panel population. An R script called plinkToRefV2.R can convert a plink file into a reference panel and a file with the sizes of the reference panel populations.

fastNGSadmix already comes with a premade reference panel, built from the curated dataset of Lazaridis et al. (2014).

I lifted the dataset to hg19 using the program liftOver. I then translated the SNP names to rs names using 1000G data, generating a unique name for each site via "chr-pos-A1-A2" (where A1 and A2 are alphabetically sorted).

Furthermore I removed sites with more than 5 % missing data or a MAF below 5 %, and kept only autosomal sites. I selected 5 populations (French, Han, Karitiana, Papuan and Yoruba) to have representation for most of the world. Finally, I made sure that I only used unadmixed individuals within each population.
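
The "chr-pos-A1-A2" naming scheme can be sketched in a few lines of Python. This is only an illustration of the idea; the function name site_id and the allele letters in the usage line are hypothetical and are not taken from the scripts used to build the panel.

```python
def site_id(chrom, pos, allele_a, allele_b):
    """Build a 'chr-pos-A1-A2' name with the two alleles sorted alphabetically."""
    a1, a2 = sorted([allele_a, allele_b])
    return f"{chrom}-{pos}-{a1}-{a2}"


# The same site gets the same name regardless of the allele order in the input.
print(site_id("1", 14000023, "T", "C"))
print(site_id("1", 14000023, "C", "T"))
```

Sorting the alleles is what makes the name independent of which allele a given dataset happens to list first, so the same site can be matched across datasets.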
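
The site filters can be illustrated with a small Python sketch. It assumes genotypes coded as 0, 1 or 2 copies of one allele with None for missing calls, and human autosomes labelled 1 to 22. Both are assumptions made for this example, and keep_site is a hypothetical helper rather than the code that was used to build the panel.

```python
# Assumption: human autosomes are labelled "1" to "22".
AUTOSOMES = {str(c) for c in range(1, 23)}


def keep_site(chrom, genotypes, max_missing=0.05, min_maf=0.05):
    """Return True if a site passes the autosome, missingness and MAF filters.

    genotypes: list with 0/1/2 allele counts per individual, None if missing.
    """
    if str(chrom) not in AUTOSOMES:
        return False
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    # Fraction of individuals without a genotype call.
    missing = 1 - len(called) / len(genotypes)
    if missing > max_missing:
        return False
    # Allele frequency among called genotypes (two alleles per individual).
    freq = sum(called) / (2 * len(called))
    maf = min(freq, 1 - freq)
    return maf >= min_maf


print(keep_site("1", [0, 1, 2, 1] * 10))   # common variant on an autosome
print(keep_site("X", [0, 1, 2, 1] * 10))   # not an autosome
```

The thresholds are keyword arguments so that the 5 % cut-offs named in the text can be changed without touching the function body.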

An example of a command running fastNGSadmix with a beagle file and a chosen K of 3:

./fastNGSadmix -likes indi_genotypelikelihood.beagle -fname refPanel.txt -Nname nInd.txt -outfiles indi_genotypelikelihood.beagle -K 3

This produces two files: indi_genotypelikelihood.beagle.qopt, with the admixture proportions, and indi_genotypelikelihood.beagle.log.

Or with a plink file:

./fastNGSadmix -plink plinkFile -fname refPanel.txt -Nname nInd.txt -outfiles plinkFile -K 3

(Options still to be documented.)





A custom reference panel can be supplied. Its first 5 columns have to be the ones listed below, followed by the population frequencies:

chr,pos,name,A0,A1

The frequencies have to be of the A0 allele.
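
A quick way to check a custom reference panel against this layout is sketched below in Python. The function population_columns is a hypothetical helper, and the sketch takes the column names as an already split list because the delimiter of the panel file is not specified here.

```python
# The five leading columns required by the layout described above.
REQUIRED = ["chr", "pos", "name", "A0", "A1"]


def population_columns(columns):
    """Check the five leading columns and return the population columns."""
    if list(columns[:5]) != REQUIRED:
        raise ValueError("first 5 columns must be chr, pos, name, A0, A1")
    populations = list(columns[5:])
    if not populations:
        raise ValueError("no population frequency columns found")
    return populations


header = ["chr", "pos", "name", "A0", "A1", "French", "Han", "Yoruba"]
print(population_columns(header))
```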

Basically the files can look like this, with the alleles listed in opposite order in the beagle file (bgl) and the reference panel (ref):

bgl: rs1 A B GL(AA) GL(AB) GL(BB)
ref: 1 1 rs1 B A f(B)

The solution is then this:

bgl: rs1 A B GL(AA) GL(AB) GL(BB)
ref: 1 1 rs1 B A 1-f(B)

Then prepFreqs.R will take care of preparing the files properly.

Example of running prepFreqs.R:

Rscript prepFreqs.R indi_genotypelikelihood.bgl

Populations other than the default 5 in the reference panel can also be specified, and a custom-made reference panel can be supplied (it has to be an .Rdata file).

Rscript prepFreqs.R indi_genotypelikelihood.bgl Pop1,Pop2,Pop3,Pop4 customRefPanel


indi_genotypelikelihood

Then run prepFreqs.R to get the proper beagle, refpanel and nInd files for the analysis.

Then run fastNGSadmix.

All the awesome options with the program.


So the files should look like this:

bgl: rs1 A B GL(AA) GL(AB) GL(BB)
ref: 1 1 rs1 A B f(A)

(So if the 3 columns with genotype likelihoods in the beagle file are coded AA, AB, BB, then the frequencies should be of the A allele.)
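
The allele convention above can be sketched in Python as follows. This is a minimal illustration under the stated convention and not the actual logic of prepFreqs.R; align_frequency is a hypothetical name, and sites whose alleles do not match in either order are simply rejected here.

```python
def align_frequency(bgl_alleles, ref_alleles, ref_freq):
    """Return the frequency of the first beagle allele.

    bgl_alleles: (allele1, allele2) as listed in the beagle file.
    ref_alleles: (A0, A1) as listed in the reference panel.
    ref_freq:    frequency of A0 in the reference panel.
    """
    bgl_a, bgl_b = bgl_alleles
    if tuple(ref_alleles) == (bgl_a, bgl_b):
        # Same order in both files: the frequency can be used as it is.
        return ref_freq
    if tuple(ref_alleles) == (bgl_b, bgl_a):
        # Alleles swapped: f(A) = 1 - f(B).
        return 1.0 - ref_freq
    raise ValueError("alleles differ between beagle file and reference panel")


print(align_frequency(("A", "B"), ("A", "B"), 0.2))
print(align_frequency(("A", "B"), ("B", "A"), 0.2))
```

Raising an error for mismatching alleles is a design choice of this sketch; a real pipeline might instead drop such sites.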

Furthermore a file with the number of individuals in each reference population should be supplied.


Then a lot of different options and filters can be specified:

(TO BE CONTINUED...)