NgsAdmixTutorial: Difference between revisions

From software
Jump to navigation Jump to search
No edit summary
 
(120 intermediate revisions by 3 users not shown)
Line 1: Line 1:
NGSadmix Tutorial
We will go through a simple and more complex example on how to use NGSadmix with visualization of the data.


Example NGSadmix - Simple
User need to to add -printInfo 1 in order to get information about sites retained for analysis.
In our first example, we will be using a small dataset of .bam files with low depth NGS data from 30 human samples: 10 from Nigeria (YRI), 10 from Japan (JPT) and 10 with European Ancestry (CEU).  


Set directories to all required programs and the data you will use depending on where you installed them every time you open a new terminal window
==Example of NGSadmix - very small data set==


# Set path to NGSadmix
In our first example, we will infer admixture proportions for low depth NGS data using a small dataset from 30 human samples.
NGSADMIX=~/Software/NGSadmix
#test the link
ls $NGSADMIX


# Set path to the directory with the input files
===Set paths to software===
AA=/../home/albrecht/PhDCourse
Every time you open a new terminal window, set directories to all required programs and the data you will use depending on where you installed them
#test the link
ls $AA
To create the data files, please go the the following link:
https://github.com/mfumagalli/ngsTools/blob/master/Scripts/data.sh


Set the path to NGSadmix, for example:
<code>NGSADMIX=~/Software/NGSadmix</code>
Test the link
<code>ls $NGSADMIX</code>
===Create directories===
Create the directories that will be used for working:
Create the directories that will be used for working:
mkdir Demo
cd Demo
mkdir Data
mkdir Results


#Set paths to local directories
<code>mkdir Demo</code>
IN_DIR=~/Demo/Data
 
OUT_DIR=~/Demo/Results
<code>cd Demo</code>
 
<code>mkdir Data</code>


#Test the links
<code>mkdir Results</code>
ls $IN_DIR
ls $OUT_DIR


NGSadmix uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you. A file with labels that indicate which population the individuals are sampled from is also provided as the population information. Copy the files to your input folder ($IN_DIR):
<code>cd ..</code>
cp xxxx $IN_DIR
cp $AA/admixture/data/pop.info $IN_DIR


We need to prepare the population information file by cuttin the first column, sorting and counting:
===Set the paths to your local directories===
cut -f 1 -d " " $IN_DIR/pop.info | sort | uniq -c


Now let’s analyze the input file. In order to see the first 10 columns and 10 lines of the input file, type:
<code>IN_DIR=Demo/Data</code>
gunzip -c $IN_DIR/Demo1GLs | head -n 10 | cut -fl-10 | column -t


Use this command to count the number of lines. The number of lines, indicates the number of loci for which there are GLs plus one (as the command includes the count of the header line):
<code>OUT_DIR=Demo/Results</code>
gunzip -c $IN_DIR/Demo1GLs | wc -l


Test the links
<code>ls $IN_DIR</code>
<code>ls $OUT_DIR</code>
===The test data===
We will use a very reduced data set.
*10 individuals from each population: 10 from Nigeria (YRI), 10 from Japan (JPT) and 10 with European Ancestry (CEU).
*a very reduced genome 30 x 100k random regions across the autosomes
*each individual is sequenced at 2-6X
<table class="muse-table" border="2" cellpadding="5">
    <tr>
      <td>CEU</td>
      <td>Europeans (mostly of British ancestry)</td>
    </tr>
    <tr>
      <td>JPT</td>
      <td>East Asian - Japanese individuals</td>
    </tr>
    <tr>
      <td>YRI</td>
      <td>West African - Nigerian Yoruba individuals</td>
    </tr>
</table>
You can either download the beagle input files or create them yourself from bam files. Therefore you should choose either 'Download the beagle genotype likelihood input file' or 'Create the the beagle genotype likelihood input file using ANGSD'.
====Download the beagle genotype likelihood input file====
NGSadmix uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you.
A file with the information about the population is also provided.
Download the files and move them to your input folder (for example, $IN_DIR):
<code>wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz</code>
<code>wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info</code>
<code>mv Demo1input.gz $IN_DIR</code>
<code>mv Demo1pop.info $IN_DIR</code>
====Create the the beagle genotype likelihood input file using ANGSD====
first set the path to ANGSD  (change ~/github/angsd/ to the path on your system)
<code> ANGSD=~/github/angsd/angsd </code>
test that you have the rigth path and that ANGSD is installed
<code> $ANGSD</code>
Download small BAM files and extract files
<code> wget http://popgen.dk/software/download/NGSadmix/data/smallerbams.tar </code>
<code> tar xf smallerbams.tar </code>
Make a list with the bam file names
<code>find smallerbams/ |  grep bam$ > all.files</code>
calculate genotype likelihoods for polymorphic sites using ANGSD (see ANGSD website for further information)
<code>$ANGSD -bam all.files -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 -minMapQ 30 -minQ 20 -minInd 25 -minMaf 0.05 -doGlf 2 -out $IN_DIR/Demo1input.gz -P 5 </code>
create population information file
<code> paste -d " " <( cut -f 5 -d"." all.files ) <(cut -f 1 -d"." all.files | xargs -n1 basename) > $IN_DIR/Demo1pop.info </code>
==== View population information file ====
To view a summary of the population information file, cut the first column, sort and count:
<code>cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c</code>
Lets make a population label file and place it in the output directory
<code> cut -f1 -d" " $IN_DIR/Demo1pop.info  > $OUT_DIR/poplabel </code>
==== View the genotype likelihood beagle file ====
*In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.
*In order to see the first 10 columns and 10 lines of the input file, type:
:<code>gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t</code>
*Use this command to count the number of lines of the input file. The number of lines, indicates the number of loci for which there are GLs plus one (as the command includes the count of the header line):
:<code>gunzip -c $IN_DIR/Demo1input.gz | wc -l</code>
=== Let’s analyze the input file ===
To run an analysis of the GLs with NGSadmix, assuming the number of ancestral populations is K=3, type the following command:
To run an analysis of the GLs with NGSadmix, assuming the number of ancestral populations is K=3, type the following command:
$NGSadmix -likes $IN_DIR/Demo1GLs.beagle.gz -K 3 -minMaf 0.05 -seed 1 -o $OUT_DIR/Demo1NGSadmix


The output:
<code>$NGSADMIX -likes $IN_DIR/Demo1input.gz -K 3 -minMaf 0.05 -seed 1 -o $OUT_DIR/Demo1NGSadmix</code>
 
For a reference on the parameters that can be used to run NGSadmix, please go to [http://www.popgen.dk/software/index.php/NgsAdmixv2#Parameters]
 
=== The output ===


The analysis performed by NGSadmix produces 4 files:
The analysis performed by NGSadmix produces 4 files:


Log likelihood of the estimates
*'''Log likelihood of the estimates'''
A .log file that summarizes the run.
A .log file that summarizes the run.
Let’s take a look at the log file to determine the log likelihood of the estimates achieved by NGSadmix which is called the “best like” in the file:
Let’s take a look at the log file to determine the log likelihood of the estimates achieved by NGSadmix which is called the “best like” in the file:
cat Demo1NGSadmix.log


Estimated allele frequency
<code>cat $OUT_DIR/Demo1NGSadmix.log</code>
A zipped .fopt file, that contains an estimate of the allele frequency in each of the 3 assumed ancestral populations (there is a line for each locus).
We can use this file to obtain the estimated allele frequency of the first 5 SNPs (one per line) of the three assumed ancestral populations, by typing the following command:
zcat Demo1NGSadmix.fopt.gz | head -n 5


Estimated admixture proportions
*'''Estimated allele frequency'''
A .qopt file, that contains an estimate of the individual's ancestry proportion from each of the three assumed ancestral populations (there is a line for each individual).
To obtain the estimated admixture proportions for the first 5 individuals, type the following command:
head -n 5 Demo1NGSadmix.qopt


DemoNGSadmix.filter: is an empty file???? 
::A zipped .fopt file, that contains an estimate of the allele frequency in each of the 3 assumed ancestral populations (there is a line for each locus).
If the filter is used, it will show the sites that were left out or in…
CHECK


::We can use this file to obtain the estimated allele frequency of the first 5 SNPs (one per line) of the three assumed ancestral populations, by typing the following command:


::<code>zcat $OUT_DIR/Demo1NGSadmix.fopt.gz | head -n 5</code>
*'''Estimated admixture proportions'''
::A .qopt file, that contains an estimate of the individual's ancestry proportion from each of the three assumed ancestral populations (there is a line for each individual).
::To obtain the estimated admixture proportions for the first 5 individuals, type the following command:
::<code>head -n 5 Demo1NGSadmix.qopt</code>
=== Plot the results in R ===


Follow these instructions to make a simple plot of the estimated admixture proportions for all individuals in R:
Follow these instructions to make a simple plot of the estimated admixture proportions for all individuals in R:


Type “R” in the terminal and press enter.
Make sure you stand on the output directory
Paste the following code into R:
 
#set up the working directories
<code>cd $OUT_DIR</code>
 
Type “R” in the terminal and press enter and paste the following code into R:


# Fill up a table with the IDs of the population information for each individual
<pre>
pop<-read.table("pop.info", as.is=T)
# Get ID and pop info for each individual
pop<-scan("poplabel",what="theFuck")


# Read inferred admixture proportions
# Read inferred admixture proportions file
q<-read.table("Demo1NGSadmix.qopt")
q<-read.table("Demo1NGSadmix.qopt")


Line 88: Line 185:
ord = order(pop)
ord = order(pop)
par(mar=c(7,4,1,1))
par(mar=c(7,4,1,1))
barplot(t(q)[,ord],col=c(2,1,3),names=pop[ord],las=2,ylab="Admixture proportions",cex.names=0.75)
barplot(t(q)[,ord],col=c(2,1,3),names=pop[ord],las=2,ylab="Demo1 Admixture proportions",cex.names=0.75)
</pre>


The y-axis of the plot show the admixture proportions and the individuals in the samples are plotted in the x-axis.  
The y-axis of the plot show the admixture proportions and the individuals in the samples are plotted in the x-axis.  
Each color represents a different ancestral population.
Each color represents a different ancestral population.
The proportion of each color shows the different admixture of the individuals for each ancestral population.  
The proportion of each color shows the different admixture of the individuals for each ancestral population.  
The plot is sorted by the population of origin of each individual in the sample, and therefore, it shows blocks with prevalence of a certain color, which represents the population to which each individual belongs.
The plot is sorted by the population of origin of each individual in the sample, and therefore, it shows blocks with prevalence of a certain color, which represents the population to which each individual belongs.


NB As you could tell from the number of loci included in the analysis, the above analysis is based on data from very few loci (actually we on purpose only analyzed data from a small part of the genome to make sure the analysis ran fast). Results from an analysis of data from the entire genome can be seen here.
NB As you could tell from the number of loci included in the analysis, the above analysis is based on data from very few loci (actually we on purpose only analyzed data from a small part of the genome to make sure the analysis ran fast).


==Example of NGSadmix with admixed populations==


Example NGSadmix - 1000 Genome Project Data
Now that you know how to make input data to NGSadmix, how to run NGSadmix and what the output looks like, let's try to look at a more realistic size dataset. More specifically let's try to run NGSadmix on 50000 sites from the 1000 genomes project from the following populations:


Now you know how to run NGSadmix and what the output looks like. Let's try to look at a more realistic size dataset. More specifically let's try to run NGSadmix on data from the 1000 genomes project from the following populations. The input file for you for this dataset and a file with population information are given as in the previous example. The input is a file with genotype likelihoods from 100 individuals in .beagle format.
:<table class="muse-table" border="2" cellpadding="5">
    <tr>
      <td>ASW</td>
      <td>HapMap African Americans from SW US</td>
    </tr>
    <tr>
      <td>CEU</td>
      <td>European individuals</td>
    </tr>
    <tr>
      <td>CHB</td>
      <td>Han Chinese in Beijing</td>
    </tr>
    <tr>
      <td>JPT</td>
      <td>Japanese individuals</td>
    </tr>
    <tr>
      <td>YRI</td>
      <td>Yoruba individuals from Nigeria</td>
    </tr>
    <tr>
      <td>MXL</td>
      <td>Mexican individuals from LA California</td>
    </tr>
:</table>


ASW
HapMap African ancestry individuals from SW US
CEU
European individuals
CHB
Han Chinese in Beijing
JPT
Japanese individuals
YRI
Yoruba individuals from Nigeria
MXL
Mexican individuals from LA California


Set directories to all required programs and the data you will use depending on where you installed them every time you open a new terminal window
The input file Demo2input.gz with genotype likelihoods from 100 individuals in .beagle format, and a file with population info are given.
''Note: please make sure you have created and set up the working directories as indicated in the previous tutorial.''


# Set path to NGSadmix
===Download Files===
NGSADMIX=~/Software/NGSadmix
Download and copy the files to your input folder (for example, $IN_DIR)
<pre>
wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info
mv Demo2input.gz $IN_DIR
mv Demo2pop.info $OUT_DIR</pre>


# Set path to the directory with the input files
make sure you are back in your original folder (not the $OUT_DIR folder)
AA=/../home/albrecht/PhDCourse


#Set paths to local directories
===Take a quick look at the population data===
IN_DIR=~/Demo/Data
make a summary by cutting the first column, sorting and counting
OUT_DIR=~/Demo/Results
<code>cut -f 1 -d " " $OUT_DIR/Demo2pop.info | sort | uniq -c</code>


Take a quick look at the data
===Run NGSadmix===
First try to get an overview of the dataset by copying the information file and making a summary using the following:
Run an analysis of the data with NGSadmix with K=3 (-K 3), using 1 cpu (-P 1), using only SNPs with minor allele frequency above 0.05 (-minMaf 0.05), set the seed set to 21 (-seed 21), and set the prefix of the output files to Demo2NGSadmixK3. For a reference on the parameters that can be used to run NGSadmix, please go to [http://www.popgen.dk/software/index.php/NgsAdmixv2#Parameters]
#copy to folder
cp $AA/admixture/data/pop.info .


## cut first column | sort | count
<code>$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 3 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK3</code>
cut -f 1 -d " " pop.info | sort | uniq -c


Which countries are the samples from and how many samples from each?
===Plot===
Now look at the genotype file input.gz. It is in the same format at the file we looked at in the previous example:
Plot the estimated admixture proportions by running the following code in R:
Use the same approach as in the previous example to find out how many loci there are genotype likelihoods from
Run an analysis of the data with NGSadmix
Try to start an analysis of the data with NGSadmix with K=3 (-K 3), using 1 cpu (-P 1), using only SNPs with minor allele frequency above 0.05 (-minMaf 0.05), set the seed set to 21 (-seed 21) and set the prefix of the output files to myownoutfilesK3 (-o myownoutfilesK3).
$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 3 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK3


$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 4 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK4
Make sure you stand on the output directory
<code>cd $OUT_DIR</code>


Type “R” in the terminal and press enter and paste the following code into R:


Next, plot the estimated admixture proportions by running the following code in R :
<pre>
## open R
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("Demo2NGSadmixK3.qopt")


## read population labels and estimated admixture proportions
# order according to population
pop<-read.table("pop.info",as.is=T)
q<-read.table("myownoutfilesK3.qopt")
 
## order according to population
ord<-order(pop[,1])
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)
</pre>


Note that - like in the previous example - the order of the individuals in the plot are not the same as in the qopt file. Instead, to provide a better overview, the individuals have been ordered according to their population labels.
Note that like in the previous example, the order of the individuals in the plot is not the same as in the .qopt file. Instead, to provide a better overview, the individuals have been ordered according to their population labels.
Try to explain what the plot shows (what is on the axes, what do the colors mean and so on)
What does the plot suggest in terms of population structure and admixture?
Other K values (if you have time)
Try to run NGSadmix with K=4 instead.
Plot the output (if you have trouble plotting it here is the plot I got: K4 plot.
Based on all the results what can you say about the Mexican samples (MXL)?


===Choose a differnt K===
Try to run NGSadmix with K=4 and compare the plots.


Example using ANGSD to create input file
<code>$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 4 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK4</code>
set directories to all required programs depending on where you installed them, for instance these are my paths:
NGSTOOLS=~/Software/ngsTools
ANGSD=~/Software/angsd
NGSADMIX=~/Software/NGSadmix/NGSadmix


SAMTOOLS=~/Software/samtools-1.4.1/samtools
Type “R” in the terminal and press enter and paste the following code into R:
HTSLIB=~/Software/htslib-1.4.1
FASTME=~/Software/fastme-2.1.5/src/fastme
MYDIR=~/Demo/Data
NGSadmix uses Genotype Likelihoods (GLs) as input, and therefore we are going to calculate the GLs from the .bam files to provide an input file in .beagle format:  


Create a file that contains a list with the path to each one of the .bam files:
<pre>
find $BAMFOLDER | grep bam$ > ~/Demo/Data/all.bamfiles
# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("Demo2NGSadmixK4.qopt")


Type in this command to very the content of the created file:
# order according to population
cat all.bamfiles
ord<-order(pop[,1])
 
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=4")
$ANGSD -bam $IN_DIR/all.bamfiles -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 -minMapQ 30 -minQ 20 -minInd 25 -minMaf 0.05 -doGlf 2 -out $IN_DIR/DemoGLs -P 5
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
 
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)
 
</pre>
wssd
 
 
 
asdf
You can see the ID of the first individual by getting the first line of the file you created with all your original bam files in the beginning:
head -n1 all.files
 
Based on that ID, which population does the individual come from?
What does this suggest about what column to look for the frequencies for that population in the qopt file?
Based on this and the frequency estimates for the first locus that you looked at earlier, what does NGSadmix estimate the allele frequency to be at the first locus in that population?
 
 
Plot results[edit]
Plot in the order of the input file.
admix<-t(as.matrix(read.table("myoutfiles.qopt")))
barplot(admix,col=1:3,space=0,border=NA,xlab="Individuals",ylab="admixture")
 
 
Plot using a population label file.
pop<-read.table("pop.info",as.is=T)
admix<-t(as.matrix(read.table("myoutfiles.qopt")))
admix<-admix[,order(pop[,1])]
pop<-pop[order(pop[,1]),]
h<-barplot(admix,col=1:3,space=0,border=NA,xlab="Individuals",ylab="admixture")
text(tapply(1:nrow(pop),pop[,1],mean),-0.05,unique(pop[,1]),xpd=T)

Latest revision as of 00:59, 7 June 2022

We will go through a simple and more complex example on how to use NGSadmix with visualization of the data.

User need to to add -printInfo 1 in order to get information about sites retained for analysis.

Example of NGSadmix - very small data set

In our first example, we will infer admixture proportions for low depth NGS data using a small dataset from 30 human samples.

Set paths to software

Every time you open a new terminal window, set directories to all required programs and the data you will use depending on where you installed them

Set the path to NGSadmix, for example:

NGSADMIX=~/Software/NGSadmix

Test the link

ls $NGSADMIX

Create directories

Create the directories that will be used for working:

mkdir Demo

cd Demo

mkdir Data

mkdir Results

cd ..

Set the paths to your local directories

IN_DIR=Demo/Data

OUT_DIR=Demo/Results

Test the links

ls $IN_DIR

ls $OUT_DIR

The test data

We will use a very reduced data set.

  • 10 individuals from each population: 10 from Nigeria (YRI), 10 from Japan (JPT) and 10 with European Ancestry (CEU).
  • a very reduced genome 30 x 100k random regions across the autosomes
  • each individual is sequenced at 2-6X
CEU Europeans (mostly of British ancestry)
JPT East Asian - Japanese individuals
YRI West African - Nigerian Yoruba individuals


You can either download the beagle input files or create them yourself from bam files. Therefore you should choose either 'Download the beagle genotype likelihood input file' or 'Create the the beagle genotype likelihood input file using ANGSD'.

Download the beagle genotype likelihood input file

NGSadmix uses Genotype Likelihoods (GLs) in .beagle format as input. The input file has been created for you. A file with the information about the population is also provided.

Download the files and move them to your input folder (for example, $IN_DIR):

wget popgen.dk/software/download/NGSadmix/data/Demo1input.gz

wget popgen.dk/software/download/NGSadmix/data/Demo1pop.info

mv Demo1input.gz $IN_DIR

mv Demo1pop.info $IN_DIR

Create the the beagle genotype likelihood input file using ANGSD

first set the path to ANGSD (change ~/github/angsd/ to the path on your system)

ANGSD=~/github/angsd/angsd

test that you have the rigth path and that ANGSD is installed

$ANGSD

Download small BAM files and extract files

wget http://popgen.dk/software/download/NGSadmix/data/smallerbams.tar

tar xf smallerbams.tar

Make a list with the bam file names

find smallerbams/ | grep bam$ > all.files

calculate genotype likelihoods for polymorphic sites using ANGSD (see ANGSD website for further information)

$ANGSD -bam all.files -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 -minMapQ 30 -minQ 20 -minInd 25 -minMaf 0.05 -doGlf 2 -out $IN_DIR/Demo1input.gz -P 5

create population information file

paste -d " " <( cut -f 5 -d"." all.files ) <(cut -f 1 -d"." all.files | xargs -n1 basename) > $IN_DIR/Demo1pop.info

View population information file

To view a summary of the population information file, cut the first column, sort and count:

cut -f 1 -d " " $IN_DIR/Demo1pop.info | sort | uniq -c

Lets make a population label file and place it in the output directory

cut -f1 -d" " $IN_DIR/Demo1pop.info > $OUT_DIR/poplabel

View the genotype likelihood beagle file

  • In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1, then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software, but it does not mean that they are genotype probabilities.
  • In order to see the first 10 columns and 10 lines of the input file, type:
gunzip -c $IN_DIR/Demo1input.gz | head -n 10 | cut -f 1-10 | column -t
  • Use this command to count the number of lines of the input file. The number of lines, indicates the number of loci for which there are GLs plus one (as the command includes the count of the header line):
gunzip -c $IN_DIR/Demo1input.gz | wc -l

Let’s analyze the input file

To run an analysis of the GLs with NGSadmix, assuming the number of ancestral populations is K=3, type the following command:

$NGSADMIX -likes $IN_DIR/Demo1input.gz -K 3 -minMaf 0.05 -seed 1 -o $OUT_DIR/Demo1NGSadmix

For a reference on the parameters that can be used to run NGSadmix, please go to [1]

The output

The analysis performed by NGSadmix produces 4 files:

  • Log likelihood of the estimates

A .log file that summarizes the run.

Let’s take a look at the log file to determine the log likelihood of the estimates achieved by NGSadmix which is called the “best like” in the file:

cat $OUT_DIR/Demo1NGSadmix.log

  • Estimated allele frequency
A zipped .fopt file, that contains an estimate of the allele frequency in each of the 3 assumed ancestral populations (there is a line for each locus).
We can use this file to obtain the estimated allele frequency of the first 5 SNPs (one per line) of the three assumed ancestral populations, by typing the following command:
zcat $OUT_DIR/Demo1NGSadmix.fopt.gz | head -n 5
  • Estimated admixture proportions
A .qopt file, that contains an estimate of the individual's ancestry proportion from each of the three assumed ancestral populations (there is a line for each individual).
To obtain the estimated admixture proportions for the first 5 individuals, type the following command:
head -n 5 Demo1NGSadmix.qopt

Plot the results in R

Follow these instructions to make a simple plot of the estimated admixture proportions for all individuals in R:

Make sure you stand on the output directory

cd $OUT_DIR

Type “R” in the terminal and press enter and paste the following code into R:

# Get ID and pop info for each individual
pop<-scan("poplabel",what="theFuck")

# Read inferred admixture proportions file
q<-read.table("Demo1NGSadmix.qopt")

# Plot them (ordered by population)
ord = order(pop)
par(mar=c(7,4,1,1))
barplot(t(q)[,ord],col=c(2,1,3),names=pop[ord],las=2,ylab="Demo1 Admixture proportions",cex.names=0.75)

The y-axis of the plot show the admixture proportions and the individuals in the samples are plotted in the x-axis.

Each color represents a different ancestral population.

The proportion of each color shows the different admixture of the individuals for each ancestral population.

The plot is sorted by the population of origin of each individual in the sample, and therefore, it shows blocks with prevalence of a certain color, which represents the population to which each individual belongs.

NB As you could tell from the number of loci included in the analysis, the above analysis is based on data from very few loci (actually we on purpose only analyzed data from a small part of the genome to make sure the analysis ran fast).

Example of NGSadmix with admixed populations

Now that you know how to make input data to NGSadmix, how to run NGSadmix and what the output looks like, let's try to look at a more realistic size dataset. More specifically let's try to run NGSadmix on 50000 sites from the 1000 genomes project from the following populations:

ASW HapMap African Americans from SW US
CEU European individuals
CHB Han Chinese in Beijing
JPT Japanese individuals
YRI Yoruba individuals from Nigeria
MXL Mexican individuals from LA California

The input file Demo2input.gz with genotype likelihoods from 100 individuals in .beagle format, and a file with population info are given. Note: please make sure you have created and set up the working directories as indicated in the previous tutorial.

Download Files

Download and copy the files to your input folder (for example, $IN_DIR)

wget popgen.dk/software/download/NGSadmix/data/Demo2input.gz
wget popgen.dk/software/download/NGSadmix/data/Demo2pop.info
mv Demo2input.gz $IN_DIR
mv Demo2pop.info $OUT_DIR

make sure you are back in your original folder (not the $OUT_DIR folder)

Take a quick look at the population data

make a summary by cutting the first column, sorting and counting cut -f 1 -d " " $OUT_DIR/Demo2pop.info | sort | uniq -c

Run NGSadmix

Run an analysis of the data with NGSadmix with K=3 (-K 3), using 1 cpu (-P 1), using only SNPs with minor allele frequency above 0.05 (-minMaf 0.05), set the seed set to 21 (-seed 21), and set the prefix of the output files to Demo2NGSadmixK3. For a reference on the parameters that can be used to run NGSadmix, please go to [2]

$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 3 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK3

Plot

Plot the estimated admixture proportions by running the following code in R:

Make sure you stand on the output directory cd $OUT_DIR

Type “R” in the terminal and press enter and paste the following code into R:

# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("Demo2NGSadmixK3.qopt")

# order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=3")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

Note that like in the previous example, the order of the individuals in the plot is not the same as in the .qopt file. Instead, to provide a better overview, the individuals have been ordered according to their population labels.

Choose a differnt K

Try to run NGSadmix with K=4 and compare the plots.

$NGSADMIX -likes $IN_DIR/Demo2input.gz -K 4 -P 1 -minMaf 0.05 -seed 21 -o $OUT_DIR/Demo2NGSadmixK4

Type “R” in the terminal and press enter and paste the following code into R:

# read population labels and estimated admixture proportions
pop<-read.table("Demo2pop.info",as.is=T)
q<-read.table("Demo2NGSadmixK4.qopt")

# order according to population
ord<-order(pop[,1])
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Demo2 Admixture proportions for K=4")
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)