FastNGSadmix: Difference between revisions

From software
Jump to navigation Jump to search
Line 375: Line 375:
paste -d" " <(cut -f1 -d" " humanOrigins_7worldPops.2.Q | paste -sd+ | bc) <(cut -f2 -d" " humanOrigins_7worldPops.2.Q | paste -sd+ | bc) >> nInd_humanOrigins_7worldPops.2.Q.txt
paste -d" " <(cut -f1 -d" " humanOrigins_7worldPops.2.Q | paste -sd+ | bc) <(cut -f2 -d" " humanOrigins_7worldPops.2.Q | paste -sd+ | bc) >> nInd_humanOrigins_7worldPops.2.Q.txt
</pre>
</pre>
Then we can run fastNGSadmix using this reference panel
<pre>
#genotype likelihoods from sequencing file
GL=example/NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz


./fastNGSadmix -likes $GL -fname refPanel_humanOrigins_7worldPops.2.P.txt -Nname nInd_humanOrigins_7worldPops.2.Q.txt -out NA12763_CEU_K2
./fastNGSadmix -likes $GL -fname refPanel_humanOrigins_7worldPops.2.P.txt -Nname nInd_humanOrigins_7worldPops.2.Q.txt -out NA12763_CEU_K2
</pre>


=Human reference panels=
=Human reference panels=

Revision as of 16:54, 7 March 2017

This page contains information about the program fastNGSadmix, a very fast tool for finding admixture proportions from NGS data of a single individual and a method for doing PCA of NGS data, using the estimated admixture proportions. It is based on genotype likelihoods. It also read plink files. The admixture estimation part is written in C++ and the PCA part is written in R.

Download

The program can be downloaded from github:

https://github.com/e-jorsboe/fastNGSadmix

git clone https://github.com/e-jorsboe/fastNGSadmix.git;
cd fastNGSadmix 
make

So far it has only been tested on Linux systems.

Data for the analyses as well as example data can be downloaded from:

wget popgen.dk/software/download/fastNGSadmix/data.tar.gz
wget popgen.dk/software/download/fastNGSadmix/example.tar.gz

Use curl if you are on a MAC

They can be unpacked thus:

tar -xzf data.tar.gz
tar -xzf example.tar.gz

The data folder both has an already made reference panel (more on this in the Making a reference panel section) for the admixture estimation and the genotypes of the reference individuals in plink format for the PCA analysis.


Quick start

#file with genotype likelihoods
GL=example/NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz

#plink file with genotypes for a single individual
PLINKFILE=example/NA20502_TSI

#frequencies from reference panel
REF=data/refPanel_humanOrigins_7worldPops.txt

#number of individuals in each population of reference panel
NIND=data/nInd_humanOrigins_7worldPops.txt

#Estimate admixture proportions
./fastNGSadmix -likes  $GL -fname $REF -Nname $NIND -out NA12763_CEU -whichPops French,Han,Yoruba

#perform PCA
Rscript R/fastNGSadmixPCA.R likes=$GL  qopt=NA12763_CEU.qopt out=NA12763_CEU  geno=data/humanOrigins_7worldPops

#Estimate admixture proportions with plink File
./fastNGSadmix -plink  $PLINKFILE -fname $REF -Nname $NIND -out NA20502_TSI -whichPops French,Han,Yoruba

#perform PCA with plink file
Rscript R/fastNGSadmixPCA.R plinkFile=$PLINKFILE  qopt=NA20502_TSI.qopt out=NA20502_TSI  geno=data/humanOrigins_7worldPops

Input Files

Input files are genotype likelihoods in the genotype likelihood beagle input file format [1]. Or called genotypes in the binary plink files (*.bed) format [2] We recommend ANGSD for easy transformation of Next-generation sequencing data to the beagle format and plink2 for handling plink files. See the ANGSD wiki for installing and other documentation

The example below shows how to make a beagle file of genotype likelihoods using ANGSD.


#BAM/CRAM file
BAM=example/smallNA12874.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam 

#sites in the referene panel including major and minor allele
SITES=data/humanOrigins_7worldPops.sites

#run ANGSD
./angsd -i $BAM -GL 2 -sites $SITES -doGlf 2 -doMajorMinor 3 -minMapQ 30 -minQ 20 -doDepth 1 -doCounts 1 -out outName

Example of a beagle genotype likelihood input file for 1 individual.

marker       allele1  allele2   Ind0      Ind0    Ind0
1_14000023      1       0       0.941    0.058    0.000
1_14000072      2       3       0.709    0.177    0.112
1_14000113      0       2       0.855    0.106    0.037
1_14000202      2       0       0.835    0.104    0.060
...

Where the marker column SHOULD be chr_pos, instead of rs ID.

The reference panel looks like this:

id chr pos name A0_freq A1 French Han Chukchi Karitiana Papuan Sindhi Yoruba
1_752566 1 752566 rs3094315 G A 0.166 0.061 0.37 0.0833 0.071 0.306 0.671
1_842013 1 842013 rs7419119 G T 0.22 0.106 0.565 0.083 0.036 0.361 0.143
1_891021 1 891021 rs13302957 G A 0.060 0.197 0.326 0.75 0.893 0.056 0.264
1_903426 1 903426 rs6696609 C T 0.62 0.636 0.705 0.25 0.179 0.583 0.257
...

Where again we have the id column, which is being used for detect the overlap with the input, the frequencies have to be in direction of the A0_freq allele.

The number of individuals in each reference population looks like this:

French Han Chukchi Karitiana Papuan Sindhi Yoruba
25 33 23 12 14 18 70

Where these number are also allowed to be floats, the names have to match the reference panel!

A provided .sites file has been included, in the data folder.

It looks like this:

1 752566 G A
1 842013 G T
1 891021 G A
1 903426 C T
1 949654 A G
...

Running fastNGSadmix

An example of running a command with fastNGSadmix with a beagle file:

#genotype likelihoods from sequencing file
GL=example/NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz

#frequencies from reference panel
REF=data/refPanel_humanOrigins_7worldPops.txt

#number of individuals in each population of reference panel
NIND=data/nInd_humanOrigins_7worldPops.txt
./fastNGSadmix -likes $GL -fname $REF -Nname $NIND -out NA12763_CEU

By default fastNGSadmix just includes all populations in the reference panel. You however can pick which populations should be analyzed via the "-whichPops" option, where you write the names of the population comma seperated, K has to be same number as selected populations. It should also be noted that the program quits if there are duplicate sites (based on chr_pos ID) in the input or the reference panel.

It then produces two files NA12763_CEU.qopt with the admixture proportions and a log file NA12763_CEU.log.

Or with a plink file, with also the -whichPops option:

PLINKFILE=example/NA20502_TSI

./fastNGSadmix -plink $PLINKFILE -fname $REF -Nname $NIND -out NA20502_TSI -whichPops French,Han,Yoruba

A whole list of options can be explored by running fastNGSadmix without any input:

./fastNGSadmix
-likes [char*]

Beagle likelihood file input of one individual.

-plink [chr*]

Plink file input of one individual in the binary bed format.

-Nname [char*]

Number of individuals in each reference populations, supplied with names of the populations.

-fname [char*]

Ancestral population frequencies with population names. By default all populations in the reference panel are included use "-whichPops" to specify which population to include.

-out [char*]

Prefix for the output files .qopt and .log.

-printFreq [int]

This option prints the admixture adjusted allele frequencies of reference panel + input individual. Disabled per default can be enabled by setting this to 1.

-whichPops [char*]

Which populations from the reference panel to include in analysis, must be comma separated (pop1,pop2,..) if not specified all populations in the reference will be included.

-doAdjust [int]

By default the method adjusting the frequencies is used (see more in the paper), to use the unadjusted approach set "-doAdjust 0".

-seed [int]

Set seed for initial guesses in EM and for bootstrap.

-method [int]

Enable acceleration of EM algorithm, enabled per default, set to 0 for unaccelerated EM.

-Qconv [int]

For faster inference of admixture proportions the "-Qconv" option can be set to 1, this bases the converge criteria on change in the admixture proportions values. The threshold of this can be set with "-Qtol".

-Qtol [float]

By default the "-Qtol" threshold is 0.0000001, it should not be put lower than this and generally this is only for when wanting a fast overview of the data, as it is less precise than the likelihood based convergence.

-tol [float]

Tolerance for convergence based on likelihoods, per default 0.00001.

-maxiter [int]

aximum number of EM iterations per default it is set to 2000.

-conv [int]

Specifies the number of convergence runs, with a new random starting point for each run. This is useful to test for convergence. The program will execute a maximum of 10 times.

-boot [int]

Specifies the number of bootstrap runs, where random sites are sampled for each run. This is useful for generating a confidence interval of your estimate, for example doing the empirical 0.025 and 0.975 quantiles. The maximum number of bootstraps is 10000.


Outputs

fastNGSadmix produces two files, a .qopt file, with the estimated admixture proportions, with names of which population they are inferred for. If n number of bootstraps are run, there will be n rows with boostrapping estimates, the first row of the .qopt file has the names of the populations analyzed and the second row has the converged upon estimates:

cat NA20502_TSI.qopt

French Han Yoruba 
1.0000 0.0000 0.0000

And if doing 10 bootstraps:

./fastNGSadmix -plink $PLINKFILE -fname $REF -Nname $NIND -out NA20502_TSI_boot -whichPops French,Han,Yoruba -boot 10


cat  NA20502_TSI_boot.qopt

French Han Yoruba 
1.0000 0.0000 0.0000 
1.0000 0.0000 0.0000 
1.0000 0.0000 0.0000 
1.0000 0.0000 0.0000 
1.0000 0.0000 0.0000 
1.0000 0.0000 0.0000 
1.0000 0.0000 0.0000 
1.0000 0.0000 0.0000 
1.0000 0.0000 0.0000 
1.0000 0.0000 0.0000 
1.0000 0.0000 0.0000 

Where the first row is the converged upon estimates, and the rest are the bootstrapping runs.

It also produces a .log file with all the information about the run, as well as the converged upon estimate:

cat NA20502_TSI_boot.log 

Input: likes=(null) plink=example/NA20502_TSI Nname=data/nInd_humanOrigins_7worldPops.txt fname=data/refPanel_humanOrigins_7worldPops.txt out=NA20502_TSI_boot
Setup: seed=1488883039 method=1
The accelerated EM has been chosen
The adjusted method has been chosen
Convergence: maxIter=2000 tol=0.00000010
The following number of bootstraps have been chosen: 10
Input has this many sites 441695
Ref has this many sites 442770
Overlap: of 441695 sites between input and ref

nPop=3

Opening nInd file: data/refPanel_humanOrigins_7worldPops.txt with nPop=3
Chosen pop French
N = 25.000000
Chosen pop Han
N = 33.000000
Chosen pop Yoruba
N = 70.000000

This many iterations 33 for run 0

At this bootstrapping: 1 out of: 10
At this bootstrapping: 2 out of: 10
At this bootstrapping: 3 out of: 10
At this bootstrapping: 4 out of: 10
At this bootstrapping: 5 out of: 10
At this bootstrapping: 6 out of: 10
At this bootstrapping: 7 out of: 10
At this bootstrapping: 8 out of: 10
At this bootstrapping: 9 out of: 10
At this bootstrapping: 10 out of: 10
best like -304264.732114 after 0!
Q 0.999980 Q 0.000010 Q 0.000010  after 0!

Estimated  Q = 0.999980 0.000010 0.000010 
best like -304264.732114 after 0 runs!
FIRST row of .qopt file is BEST estimated Q, rest are nBoot bootstrapping Qs
	[ALL done] cpu-time used =  11.14 sec
	[ALL done] walltime used =  11.00 sec

PCA

After having run the admixture estimation, a PCA analysis can be run using the estimated admixture proportions to account for population structure in the PCA. The PCA method is implemented in R and is run using the script fastNGSAdmixPCA.R.

The method needs the estimated admixture proportions, the analyzed beagle or plink file as well as the genotypes of the reference panel used (as binary plink files).

GL=example/NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz
Rscript R/fastNGSadmixPCA.R likes=$GL  qopt=NA12763_CEU.qopt out=NA12763_CEU  geno=data/humanOrigins_7worldPops

Or using analyzed plink files.

Rscript R/fastNGSadmixPCA.R plinkFile=$PLINKFILE qopt=NA20502_TSI.qopt out=NA20502_TSI  geno=data/humanOrigins_7worldPops

By default the populations of the analyzed *.qopt file are used. It can be specified which PCs should be plotted (default PC1 and PC2), by for example giving PCs=2,3 as an argument (if PC2 and PC3), the PCA plot is plotted as *_PCAplot.pdf. The script also generates barplots (*_quantile_admixBarplot.png) of the admixture proportions supplied with confidence intervals (in the case of bootstraps), as well as generating files of the covariance matrix (*_covar.txt), eigenvectors (*_eigenvecs.txt) and eigenvalues (*_eigenvals.txt) used for the PCA plot.

The covariance matrix and eigenvectors have the ids of the individuals, the input individual has the id "SAMPLE". An *_indi.txt file is also created with the individual id and population/groupd id of each individual.

To find out which options there are run wihtout any input:

Rscript R/fastNGSadmixPCA.R

The snpStats R package, is needed for running this script.

Making a reference panel

The program needs a reference panel of population specific frequencies, populations for which admixture proportions should be estimated for. This reference panel can be derived from 1000 Genomes or HGDP, or other data. The program also needs a file telling the size of each reference panel population. There is an R script plinkToRef.R in the R folder, that can convert a plink file to a reference panel and size of reference panel populations. Example:

Rscript R/plinkToRef.R data/humanOrigins_7worldPops

This generates 3 files, a reference panel named data/refPanel_humanOrigins_7worldPops.txt, a number of individuals file called data/nInd_humanOrigins_7worldPops.txt and a data/humanOrigins_7worldPops.sites file with chr pos and minor major alleles, for being used with ANGSD. The provided .fam file must have the group or population specified as FID, meaning first column and then the individual ID as the second column:

French F1 0 0 1 1
French F2 0 0 1 1
Yoruba Y1 0 0 2 1
Yoruba Y2 0 0 1 1
Yoruba Y3 0 0 2 1
...

A second argument of a MAF threshold can be supplied, meaning all sites where the MAF is below this in any population is removed.

Rscript R/plinkToRef.R data/humanOrigins_7worldPops 0.05

The snpStats R package, is also needed for running the plinkToRef.R script.

It should be noted that reading in plink files especially big ones might require a lot of RAM, why doing it on a server might be preferable!

Making a reference panel with ADMIXTURE

A clustered analysis can be done, where instead of inferring admixture proportions for populations, we instead estimate admixture proportions for clusters of populations.

For instance if we want to detect the african and non-african component on an individual, we might first do an analysis of the reference panel with ADMIXTURE using K=2:

admixture data/humanOrigins_7worldPops.bed 2

It might be necessary to run ADMIXTURE more than once in order to make sure that it has converged.

We can then create a reference panel from the .P file, with two populations:

echo "id chr pos name A0_freq A1 K1 K2" > refPanel_humanOrigins_7worldPops.2.P.txt
paste -d" " <( awk -F " " ' { print $1"_"$4, $1, $4, $2, $6, $5 } ' data/humanOrigins_7worldPops.bim ) humanOrigins_7worldPops.2.P >> refPanel_humanOrigins_7worldPops.2.P.txt

And then we can create a number of individuals in each reference population file, summing each column of the .Q file:

echo "K1 K2" > nInd_humanOrigins_7worldPops.2.Q.txt
paste -d" " <(cut -f1 -d" " humanOrigins_7worldPops.2.Q | paste -sd+ | bc) <(cut -f2 -d" " humanOrigins_7worldPops.2.Q | paste -sd+ | bc) >> nInd_humanOrigins_7worldPops.2.Q.txt

Then we can run fastNGSadmix using this reference panel

#genotype likelihoods from sequencing file
GL=example/NA12763.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.bam.beagle.gz

./fastNGSadmix -likes $GL -fname refPanel_humanOrigins_7worldPops.2.P.txt -Nname nInd_humanOrigins_7worldPops.2.Q.txt -out NA12763_CEU_K2

Human reference panels

fastNGSadmix already comes with a premade reference panel.

refPanel_humanOrigins_7worldPops

This reference panel is made from Lazaridis et al. (2014) where the curated Human Origins dataset was selected.

The dataset was lifted to hg19 using the program liftOver, The SNPs then got rs IDs, using 1000 Genomes data.

Furthermore sites with more than 5 % missing and a MAF below 5 % were removed, only autosomal sites were kept. 7 populations were selected French, Han, Chukchi, Karitiana, Papuan, Sindhi and Yoruba to have representation for most of the world. Furthermore it was made sure that there were only unadmixed individuals within each population.

Creating other reference panels

Other reference panels can easily be made from the Human Origins data:


wget popgen.dk/software/download/fastNGSadmix/humanOrigins_ALL.tar.gz
tar -xzf humanOrigins_ALL.tar.gz

Where I have updated the ids with the group/population-id for the FID and individual-id as the IID as well as the snp-ids as chr_pos. Then I would recommend to keep only autosomal sites and do a MAF and geno filter (one can play around with these filters!):


plink --bfile humanOrigins_ALL/humanOrigins_ALL --make-bed --out humanOrigins_ALL/humanOrigins_ALLV2 --maf 0.05 --geno 0.05 --chr 1-22

And then the desired individuals/populations can be extracted from the humanOrigins_ALL/humanOrigins_ALLV2.* plink files using plink, using the --keep option. And it can then easily be turned into a reference panel using plinkToRef.R.


Rscript R/plinkToRef.R humanOrigins_ALL/humanOrigins_ALLV2

Other reference panels can be created for instance based on 1000 Genomes or own data, first convert the data into plink files and then use the script plinkToRef.R.

Citation