ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Association: Difference between revisions

From angsd
Jump to navigation Jump to search
Line 86: Line 86:
</pre>
</pre>
</div>
</div>
;-yQuant [file]
;-yQuant [Filename]
a file containing the phenotype values.-999 being missing phenotypes. The file should contain a single phenotype entry per line.
a file containing the phenotype values.-999 being missing phenotypes. The file should contain a single phenotype entry per line.
<div class="toccolours mw-collapsible mw-collapsed">
<div class="toccolours mw-collapsible mw-collapsed">

Revision as of 17:27, 28 June 2013

Association can be performed using two approaches.

  1. Based on testing differences in allele frequencies between cases and controls, using genotype likelihoods
  2. Based on a generalized linear framework which also allows for quantitative traits and for including additional covariates, using genotype posteriors.

We recommend that users don't perform association analysis on all sites, but limit the analysis to informative sites, and in the case of alignement data (BAM), we advise that users filter out the low mapping quality reads and the low qscore bases.

The filtering of the alignment data is described in Input, and filtering based on frequencies/polymorphic sites are described here.

This can be done easily at the command line by adding the below commands

-minQ 20 -minMapQ 30 -minLRT 24 #Use polymorphic sites with a p-value of 10^-6
-minQ 20 -minMapQ 30 -minMaf 0.05 #Use sites with a MAF >0.05

Case control association using allele frequencies

To test for differences in the allele frequencies, genotype likelihood needs to be provided or estimated. The test is an implimentation of the likelihoods ratio test for differences between cases and controls described in details in Kim2011.

-doAsso [int]

1: The test is performed assuming the minor allele is known.
3: The test is performed summing over all possible minor alleles.

-yBin [file]

A file containing the case control status. 0 being the controls, 1 being the cases and -999 being missing phenotypes. The file should contain a single phenotype entry per line.

Example of cases control phenotype file

1
0
0
0
1
1
1
1
0
-999
1
0
0
0
0
1

Example

./angsd -yBin pheno.ybin -doAsso 1 -GL 1 -out out -doMajorMinor 1 -minLRT 24 -doMaf 2 -doSNP 1 -bam bam.filelist 

Dependency Chain

The method is based on estimating frequencies from genotype likelihoods. If alignment data has been supplied you need to specify the following.

  1. Genotype likelihood model (-GL).
  2. Determine Major/Minor (-doMajorMinor).
  3. Maf estimator (-doMaf).

If you have supplied genotype likelihood files as input for angsd you can skip 1.

Score statistic

To perform the test in a generalized linear framework posterior genotype probabilities must be provided or estimated. The approach is published here skotte2012.

-doAsso 2
-yBin [Filename]

A file containing the case control status. 0 being the controls, 1 being the cases and -999 being missing phenotypes.

Example of cases control phenotype file

1
0
0
0
1
1
1
1
0
-999
1
0
0
0
0
1
-yQuant [Filename]

a file containing the phenotype values.-999 being missing phenotypes. The file should contain a single phenotype entry per line.

Example of quantitative phenotype file

-999
2.06164722761138
-0.091935218675602
-0.287527686061831
-999
-999
-1.20996664036026
0.0188541092307412
-2.1122713873334
-999
-1.32920529536579
-1.10582299663753
-0.391773417823766
-0.501400984567535
-999
1.06014677976046
-1.10582299663753
-999
0.223156127557052
-0.189660869820135
-cov [file]

a files containing additional covariates in the analysis. Each lines should contain the additional covariates for a single individuals. Thus the number of lines should match the number of individuals and the number of coloums should match the number of additional covariates.

Example of covariate file

1 0 0 1 
1 0.1 0 0 
2 0 1 0 
2 0 1 0 
2 0.1 0 1 
1 0 0 1 
1 0.3 0 0 
2 0 0 0 
1 0 0 0 
2 0.2 0 1 
1 0 1 0 
1 0 0 0 
1 0.1 0 0 
1 0 0 0 
2 0 0 1 
2 0 0 0 
2 0 0 0 
1 0 0 1 
1 0.5 0 0 
2 0 0 0
-minHigh [int]

default = 10
This approach needs a certain amount of variability in the genotype probabilities. minHigh filters out sites that does not have at least [int] number of heterozygoes and homogoes genotype with at least 0.9 probability. This filter avoids the scenario where all individuals are heterozygoes with a high probability.

-minCount [int]

default = 10
The minimum expected minor alleles in the sample. This is the frequency multiplied by to times the number of individuals. Performing association on extremely low minor allele frequencies does not make sence.

-model [int]

1 (default): additive/log-additive for linear/logistic regression

2: dominant

3: recessive

Example

For cases control data for polymorphic sites (LRT>24)

./angsd -yBin pheno.ybin -doAsso 2 -GL 1 -doPost 1 -out out -doMajorMinor 1 -minLRT 24 -doMaf 2 -doSNP 1 -bam bam.filelist 


For quantitative traits (normal distributed errors) for polymorphic sites (LRT>24) and additional covariates

./angsd -yQuant pheno.y -doAsso 2 -cov cov.file -GL 1 -doPost 1 -out out -doMajorMinor 1 -minLRT 24 -doMaf 2 -doSNP 1 -bam bam.filelist 

Dependency Chain

The method is based on genotype probabilities. If alignment data has been supplied you need to specify the following.

  1. Genotype likelihood model (-GL).
  2. Determine Major/Minor (-doMajorMinor).
  3. Maf estimator (-doMaf).
  4. Calculate posterior (-doPost).

If you have supplied genotype likelihoods for angsd, then you should skip 1.

If you have supplied genotype probabilities (as beagle output format), you can skip all

output

Score statistic (prefix lrt*)

Chromosome Position major minor Frequency N LRT highHe highHo
  • Chromosome

The Chromosome

  • Position

The physical Position

  • major

Often the most common allele. If beagle input is used it might be the minor (see frequency)

  • major

Often the minorallele. If beagle input is used it might be the major (see frequency)

  • Frequency

The frequency estimate. The choice of estimation is determined by the *doMaf option.

  • N

The number of individuals with non-missing data. That is the individuals who have both some sequencing data for the given site and have phenotype data

  • LRT

The likelihood ratio statistic. This statistic is chi square distributed with one degree of freedom. Sites that fails one of the filters are given the value -999.000000

  • highHe

Number of sites with a heterozygoes posterior probability above 0.9

  • highHe

Number of sites with a homozygote posterior probability above 0.9


example:

chr     position        major   minor   frequency       Nind    LRT     highHe  highHo
1       13999950        G       T       0.208311        330     -999.000000     0       20
1       14000023        C       A       0.032045        330     4.429480        20      310
1       14000072        G       T       0.015167        330     0.262688        10      320
1       14000113        A       G       0.015194        330     0.404170        10      320
1       14000202        G       A       0.245345        330     0.562326        90      60
1       14000375        T       C       0.015211        330     0.263012        10      320
1       14000851        T       C       0.015177        330     0.478610        10      320
1       14000873        G       A       0.303030        330     1.032727        100     230
1       14001008        T       C       0.015158        330     3.831819        10      320
1       14001018        T       C       0.300947        330     0.859651        80      240