Latest revision as of 15:55, 20 March 2014

Small fast cprogram to extract bases from a fasta file. Download here [1]

Program can either work as a standalone program, or allow for easy retrieval of reference bases by using the API.

Install

wget http://popgen.dk/software/download/refFinder/refFinder.tar.gz
tar xf refFinder.tar.gz
cd refFinder/
make
cd ..

Stand alone

Example

Generate samtools chr pos ref doing

samtools mpileup -b smallBam.filelist -f /space/genomes/refgenomes/hg19/merged/hg19NoChr.fa |cut -f1-3 >small.sam

Use refFinder to find the bases for each position in small.sam

cut -f1-2 ../angsd/test/small.sam |./refFinder /space/genomes/refgenomes/hg19/merged/hg19NoChr.fa full >tst
cmp tst ../angsd/test/small.sam

possible options are

inputIsZero
full

These are flags, so examples are

cut -f1-2 ../angsd/test/small.sam |./refFinder /space/genomes/refgenomes/hg19/merged/hg19NoChr.fa |head
a
g
c
t
a
c
t
c
g
g

Or if we want the chr position also

cut -f1-2 ../angsd/test/small.sam |./refFinder /space/genomes/refgenomes/hg19/merged/hg19NoChr.fa full |head
1	13999902	  a
1	13999903	  g
1	13999904	  c
1	13999905	  t
1	13999906	  a
1	13999907	  c
1	13999908	  t
1	13999909	  c
1	13999910	  g
1	13999911	  g

Or if the positions are zero index as opposed to one indexed:

cut -f1-2 ../angsd/test/small.sam |./refFinder /space/genomes/refgenomes/hg19/merged/hg19NoChr.fa full inputIsZero |head
1	13999902	 g
1	13999903	 c
1	13999904	 t
1	13999905	 a
1	13999906	 c
1	13999907	 t
1	13999908	 c
1	13999909	 g
1	13999910	 g
1	13999911	 g

API


#include "refFinder.h"
perFasta *pf = init("hg19.fa");
char refbase = getchar("chr20",130224101,pf)
//refbase now contains the reference base for chr20 at position 130,224,101
destroy(pf);

Remember to link with refFinder.o and -lz

g++ sampleProg.cpp refFinder.o -lz

bugs

do check if reference file doesn't exist.

RefFinder: Difference between revisions

Latest revision as of 15:55, 20 March 2014

Contents

Install

Stand alone

Example

API

bugs

Navigation menu

@@ Line 13: / Line 13: @@
 =Stand alone=
-This sofware is for internal use. If you found this page den congrats. Feel free you use the methods but they might not work, give wrong results or other nasty stuff. If and then they fuck up don't contact us or complain.
+==Example==
-=Secret software=
+Generate samtools chr pos ref doing
-==[[Grepper]]==
-Formerly known as getliner. Used for grep'ing with in columns
-==[[Sortdep]]==
-Fast counting of integers
-==[[RefFinder]]==
-Extract bases from a fasta file
-=Other stuff=
-==chisquare==
-A small library build using numerical recipies third edition for calculating stuff relating to the chisqdistribution
-==bambi/bammer==
-(Thorfinn, april20 2012)
-===Download===
-[[bambi.tar.gz]]
-===Usage===
-Program is invoked with either:
-<pre>
-./bammer view     #samtools view/ single sample only
-./bammer uppile   #This does emulates the samtools mpileup
-</pre>
-Inputfiles are supplied with either
-<pre>
-tmp.bam                #single bamfile
--b bam.list            #filelist containing bamfiles
--b bam.list -nInd INT  #only select the first INT samples from bam.list
-</pre>
-Regions of interest can be supplied by defining a region in the following way
-<pre>
--r chr:a-b  #select chromsome chr from a, to b
--r chr:-b   #select chromosome chr from beginning to be
--r chr:b-   #select chromosome chr from b to end of chromosome
-</pre>
-If we want to dump different regions in the same run, this can be done with
-<pre>
--rf regions.file  #every region is on a newline, has same format as above
-</pre>
-Program can also estimate genotypelikelihoods if uppile is selected
-<pre>
--GL 0    #SOAPsnp
--GL 1    #samtools
--GL 2    #gatk
-</pre>
-Output is hardcode dumped in glfs.glfs, these is a binary double file. Information for first site is the first 10*sizeof(double)*nInd bytes. This is the samtools ordering AA,AC,AG,...
-Information about the position is dumped in mafs.mafs. This is NOT the MAF, but simply a 2 column file with chr tab.
-Below are some examples for running the program
 <pre>
-#view commands below
+samtools mpileup -b smallBam.filelist -f /space/genomes/refgenomes/hg19/merged/hg19NoChr.fa |cut -f1-3 >small.sam
-./bammer view new2.bam
-./bammer view bwape_sort_rmdup.bam chr21:9483252-9672320
-./bammer view bwape_sort_rmdup.bam chrY:-1000000
-#mpileup below
-./bammer uppile -b lucampHighBam.filelist -nInd 2 -r chr1:-376920
-#samtools estimation of GL
-./bammer uppile -b lucampHighBam.filelist -nInd 2 -r chr1:-376920 -GL 1
 </pre>
-** Technics
+Use refFinder to find the bases for each position in '''small.sam'''
-Program spawns a single thread, when a chunk has been read, and the thread is not finished it will wait.
-* ieatgor
-Extract lines from file in "gor" format. The files can be found in this [[ieatgor/][folder]]
-to compile
 <pre>
-g++ -O3 -o ieatgor ieatgor.cpp -lz
+cut -f1-2 ../angsd/test/small.sam |./refFinder /space/genomes/refgenomes/hg19/merged/hg19NoChr.fa full >tst
+cmp tst ../angsd/test/small.sam
 </pre>
-The target file must be sorted. example of target file
+possible options are
-<pre>
+;inputIsZero
-chr1:22323:232322
+;full
-chr10:10000:10000000
-chr3:10000:10000000
- </pre>
-A "gor" file is a file with the first coloum being the chromosome and the second coloum being the positions. The rest of the coloums are not important. The gor file must be sorted based on the positions and chr. gz gor files are also accepted.
+These are flags, so examples are
-example of a gor file
- <pre>
-chr1    1033267 2       3       know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
-chr1    1033268 0       2       know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
-chr1    1033279 3       1       know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
-chr1    1033286 0       2       know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
-chr1    1033297 0       2       know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
-chr1    1033304 0       1       know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9
-...
- </pre>
-run example
 <pre>
-./ieatgor file.target file.gor
+cut -f1-2 ../angsd/test/small.sam |./refFinder /space/genomes/refgenomes/hg19/merged/hg19NoChr.fa |head
+a
+g
+c
+t
+a
+c
+t
+c
+g
+g
 </pre>
-the extracted line are returned to standard out.
-* getliner
-NB HUGE BUG, must +1 when using -l and zipped files. Sorry for inconveincace. (tsk)
-This is a general tool for extract lines from newline seperated textfiles (can be gz compressed). Getliners4 allows for complement grep, use '-v 1'. Default is '-v 0'. Program can also print out basic statistics for hit/nonhit if -i filename is supplied. getliners6.cpp tokinzes on \r, so should work on windows files.
+Or if we want the chr position also
-Download here [[getliners6.cpp]] .
 <pre>
-g++ getliners5.cpp -lz -O3 -o getliners
+cut -f1-2 ../angsd/test/small.sam |./refFinder /space/genomes/refgenomes/hg19/merged/hg19NoChr.fa full |head
+	13999902	  a
+	13999903	  g
+	13999904	  c
+	13999905	  t
+	13999906	  a
+	13999907	  c
+	13999908	  t
+	13999909	  c
+	13999910	  g
+	13999911	  g
 </pre>
+Or if the positions are zero index as opposed to one indexed:
-Program can either extract specific lines, or "keys". The linenumbers to extract must be 1-indexed (first line i 1). When using keys, you must supply which column to 'grep' for. The program builds an associative array of the keys. And therefore only a single pass of the datafile is required.
-Fields in the datafile can be seperated by a delimiter that can be specified at runtine with the "-d " argument. Default is space/tab. If delimter is an escape sequence it will not work and the code should be modified.
-* calcABBABABA
-This is program that calculates ABBA-BABA related stats for a given chromosome.
-It can be downloaded here [[abbababatest.cpp]] and can be compiled with:
-<pre>
-g++ -O3 abbababatest.cpp -o calcABBABABA -lz
-</pre>
-To get all options just type
-<pre>
- ./calcABBABABA
-</pre>
-An example run cmd
 <pre>
-../calcABBABABA -file /space/genomes/refgenomes/ancestal/hg19/shortForm/chr20.ans -bam  testbamfilelistbotoV2chr20.txt -outfileprefix testbotoV2Chr20plus1x2 -blocksz 5
+cut -f1-2 ../angsd/test/small.sam |./refFinder /space/genomes/refgenomes/hg19/merged/hg19NoChr.fa full inputIsZero |head
+	13999902	 g
+	13999903	 c
+	13999904	 t
+	13999905	 a
+	13999906	 c
+	13999907	 t
+	13999908	 c
+	13999909	 g
+	13999910	 g
+	13999911	 g
 </pre>
-The program outputs three files: an abbababa file (with only abbababa stats), a count file (with overall counts for the chromosome) and a summary file where the compile time, run time and the cmds and the input files used when the program was run are all listed.
-* plot Plink
-[[Rfun/][Folder]]
-<pre>
- Rscript /home/software/public/software/intern/Rfun/plink.plot.R <plink output file>
-</pre>
-* ngsAdmix
-*** download
-[[ngsAdmix/][Download folder]]
-Download path
-<pre>
-/home/software/public/software/intern/ngsAdmix/
-</pre>
-*** compile
-compile with icpc or g++ (icpc much better)
-<pre>
-icpc -O3 -o ngsAdmix ngsAdmix12.cpp -lz -pthread
-g++ -O3 -o ngsAdmix ngsAdmix12.cpp -lz -pthread
-</pre>
-*** options
-options are shown with
-<pre>
-./ngsAdmix
-</pre>
-All options also have a short form (first letter)
- - default stopping criteria is either tolerence 0.0001 for emSq
- - default stopping criteria is difference in loglikelihoods of 0.1 over 50 iterations for both em and emSq
- - default optimaziation is emSq
-*** run example
-<pre>
-./ngsAdmix -l dat.beagle -i 5000 -K 3 -P 5 -m 1 -o datK3
-</pre>
-* liftover a plink file
-You might have the change the chainFile path and the liftover path if you are not on pontus
-Note that if two SNP map to the same position after liftover one of them is removed.
-<pre>
-#!/bin/bash
-#make liftover file
-Files=$1
-liftoverProg=/home/albrecht/bin/prog/liftover/liftOver
-chainFile=/home/albrecht/bin/prog/liftover/chainFiles/hg18ToHg19.over.chain
-plink=/home/albrecht/bin/prog/plink-1.07-x86_64/plink
-if [ "$#" -eq 0 ]; then
-    echo "supply a plink prefix"
-    exit
-fi
-if [ -f $Files.map ]
-then
-    echo using file $Files.map
-else
-    echo the file ${Files}.map does not exists \(use plink --recode\)
-fi
-if [ -f $Files.bed ]
-then
-    echo using file $Files.bed
-else
-    echo the file $Files.bed does not exists  \(use plink --make-bed\)
-fi
-#make bed file for liftover
-Rscript -e "options(scipen=10);map<-read.table('$Files.map',hea=F,as.is=T);res<-cbind(paste('chr',map[,1],sep=''),map[,4]-1,map[,4],map[,2],0,'+');write.table(res,file='$Files.bedfile',col=F,row=F,qu=F,sep='\t')"
-#liftover
-$liftoverProg $Files.bedfile $chainFile ${Files}Hg19.bedfile ${Files}Hg19.unmapped
-# rm unmapped
-grep -v \# ${Files}Hg19.unmapped  | cut -f4 > rmSNPs.txt
-#rm duplicates
-cat ${Files}Hg19.bedfile | sort -k1 | rev | uniq -d -f3 | rev | cut -f4 > dubSNP.txt
-echo removing `wc -l dubSNP.txt`
-cat dubSNP.txt >> rmSNPs.txt
-$plink --noweb --bfile $Files --exclude rmSNPs.txt --recode --out ${Files}Red
-#change pos
-Rscript -e "liftmap<-read.table('${Files}Hg19.bedfile',hea=F,colC=c('character','integer')[c(1,2,2,1,2,1)]);liftmap<-read.table('${Files}Hg19.bedfile',hea=F,as.is=T);map<-read.table('${Files}Red.map',hea=F,as.is=T); liftmap<-liftmap[liftmap[,4]%in%map[,2],];rownames(map)<-map[,2];map[liftmap[,4],4]<-liftmap[,3];map[liftmap[,4],1]<-sub('chr','',liftmap[,1]);write.table(map,file='hg19.map',col=F,row=F,qu=F)"
-echo The new plink file is called  ${Files}Hg19 and are on the same strand as the original file
-#flip strands
-cat ${Files}Hg19.bedfile | grep \- | cut -f4 > flipSNPs.txt
-$plink  --noweb --ped ${Files}Red.ped --map hg19.map --flip flipSNPs.txt --recode --make-bed --out ${Files}Hg19
-$plink  --noweb --bfile ${Files}Hg19 --recode --out ${Files}Hg19
-</pre>
-*** run pre
-<pre>
-./liftover.sh myPlinkFile
-</pre>
-* angsd
-[[angsd/][Download folder]]
-* plot admix results
-*** Download
-[[Rfun/plotAdmix_anders.R][Get the R script]]
-or the more fancy one
-[[Rfun/plotAdmix2.R][fancier plotter (ida)]]
-to get the options type
-<pre>
-Rscript plotAdmix_anders.R
-Rscript plotAdmix2.R
-</pre>
-* plot LD region from plink file
-plot the LD in a region. You need SNPmatrix working. If you cannot get SNPmatrix to work use Relate instead (ldsnp3 function) or if on pontus use
-<pre>
-library(snpMatrix,lib="/home/albrecht/R/x86_64-pc-linux-gnu-library/2.15/")
-</pre>
-R function:
-<include file="andersLink/ldplot.R" markup="example">
-*** example
-<pre>
-library(snpMatrix)
-#snp info from plink files
-pl<-read.plink("/space/anders/greenland/data/plink/datGRDKplus")
-plB<-read.table("/space/anders/greenland/data/plink/datGRDKplus.bim")
-#choose region
-min.pos<-70e6
-max.pos<-81e6
-chr=1
-minMaf<-0.05
-minCallrate<-0.95
-depth<-400 #depth of ld calculations (window size)
-# calculatate maf and callrate
-info<-col.summary(pl)
-#chose SNPs
-table(keep <- plB[,1]==chr&plB[,4]>min.pos&plB[,4]<max.pos & info$MAF>minMaf & !is.na(info$MAF) & info$Call.rate > minCallrate)
-#estimate LD
-ldinfo <- ld.snp(pl[,keep],depth=depth)
-#plot LD (D')
-ldplot(ldinfo$dprime,pos=plB[keep,4],main="LD (D')")
-#plot R squared
-ldplot(ldinfo$rsq2,pos=plB[keep,4],main="LD (r2)")
-#change color palette
-colR=colorRampPalette(c("white","grey","black"))(40)
-ldplot(ldinfo$rsq2,pos=plB[keep,4],main="LD (r2)",colR=colR)
-</pre>
-This example produces the plots: [[andersLink/ldpic1.pdf][pdf]]
 =API=