ANGSD: Analysis of next generation Sequencing Data

Latest tar.gz version is (0.938/0.939 on github), see Change_log for changes, and download it here.

Contamination

ANGSD can estimate contamination, but only for chromosomes that exist in a single copy (e.g. chrX in males). The method requires a list of HapMap sites along with their frequencies, and we also recommend discarding regions with low mappability.

We have included mappability and HapMap files for chrX; these are found in the RES subfolder of the ANGSD source package. So if you are working with humans and your sample is male, you can estimate the contamination with the following two commands.

  • First we generate a binary count file for chrX for a single BAM file (ANGSD C program)
  • Then we do a Fisher's exact test to obtain a p-value, and a jackknife to get an estimate of contamination (R program)


An example is shown below:

./angsd -i my.bam -r X: -doCounts 1  -iCounts 1 -minMapQ 30 -minQ 20
Rscript contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24

The contamination.R program is found in the R/ subfolder, and the resource files are found in the RES folder. The jackknife procedure can be quite slow, so we allocate 24 cores for this analysis with mc.cores=24.
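
To illustrate why the jackknife step can be slow, here is a minimal Python sketch of a generic delete-one jackknife standard error. It is not the code in contamination.R: the function names, the `mismatch_rate` estimator and the `toy_sites` counts are all made up for illustration. The estimator is recomputed once per left-out item, so the cost grows with the number of items, and those independent re-estimations are the kind of work that can be spread over several cores.

```python
import math


def jackknife_se(items, estimator):
    """Delete-one jackknife standard error of estimator(items)."""
    n = len(items)
    # Re-estimate n times, each time leaving out exactly one item.
    leave_one_out = [estimator(items[:i] + items[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    # Standard jackknife variance: (n - 1) / n * sum of squared deviations.
    return math.sqrt((n - 1) / n * sum((t - center) ** 2 for t in leave_one_out))


def mismatch_rate(sites):
    """Pooled fraction of minor-allele reads over (minor, depth) pairs."""
    return sum(minor for minor, _ in sites) / sum(depth for _, depth in sites)


# Hypothetical per-site (minor reads, read depth) pairs, made up for illustration.
toy_sites = [(0, 12), (1, 15), (0, 9), (2, 20), (0, 11), (1, 14)]

estimate = mismatch_rate(toy_sites)
se = jackknife_se(toy_sites, mismatch_rate)
print(f"rate={estimate:.4f} jackknife SE={se:.4f}")
```

For a plain mean, this procedure gives the usual standard error of the mean, which is a quick way to sanity-check the function.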

Output

The output from the above command is shown below

Rscript ../R/contamination.R mapFile="map100.chrX.bz2" hapFile="hapMapCeuXlift.map.bz2" countFile="angsdput.icnts.gz" mc.cores=24
Loading required package: multicore

-----------------------
Doing Fisher exact test for Method1:
      [,1]   [,2]
[1,]   246    157
[2,] 17700 143407

        Fisher's Exact Test for Count Data

data:  mat
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 10.34000 15.61672
sample estimates:
odds ratio 
   12.6959 


-----------------------
Doing Fisher exact test for Method2:
     [,1]  [,2]
[1,]   91    55
[2,] 7355 59513

        Fisher's Exact Test for Count Data

data:  mat2
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  9.466476 19.085589
sample estimates:
odds ratio 
  13.38675 

----------------------
Running jackknife for Method1 (could be slow)
Running jackknife for Method2 (could be slow)
$est
              Method1     Method2    
Contamination 0.03837625  0.03380983 
llh           1034.078    483.5145   
SE            0.002630455 0.003900376

$err
[1] 0.01370779

$c
[1] 0.001093589

$est
              Method1     Method2    
Contamination 0.03837625  0.03380983 
llh           1034.078    483.5145   
SE            0.002630455 0.003900376

$err
[1] 0.01370779

$c
[1] 0.001093589

Interpretation of output files

Both methods show a highly significant p-value (< 2.2e-16) and estimate the level of contamination at approximately 3 to 4% (0.038 for Method1 and 0.034 for Method2).
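
The numbers in the output can be partly cross-checked by hand. The Python sketch below takes the Method1 table as printed above and computes, for each column, the top-row count divided by the column total, plus the simple cross-product odds ratio. The two column fractions agree with the printed `$err` and `$c` values, which suggests that these are per-column rates of that table. The page itself does not define `$err` and `$c`, so treat that reading as an assumption. The variable and function names are illustrative only.

```python
# Method1 2x2 table, copied from the R output above (rows as printed).
method1 = [[246, 157],
           [17700, 143407]]


def column_fraction(table, col):
    """Top-row count divided by the total of that column."""
    return table[0][col] / (table[0][col] + table[1][col])


err = column_fraction(method1, 0)  # compare with $err in the output
c = column_fraction(method1, 1)    # compare with $c in the output

# Sample (cross-product) odds ratio: (a * d) / (b * c).
cross_product_or = (method1[0][0] * method1[1][1]) / (method1[0][1] * method1[1][0])

print(f"err={err:.8f} c={c:.9f} odds ratio={cross_product_or:.3f}")
```

The cross-product odds ratio is close to, but not identical with, the 12.6959 reported in the output, because R's fisher.test reports a conditional maximum-likelihood estimate rather than the sample odds ratio.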