Intern: Difference between revisions
No edit summary |
|||
(15 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
This sofware is for internal use. If you found this page | This sofware is for internal use. If you found this page then congrats. Feel free you use the methods but they might not work, give wrong results or other nasty stuff. If and when they fuck up don't contact us or complain. | ||
=Secret software= | =Secret software= | ||
Line 14: | Line 14: | ||
Extract bases from a fasta file | Extract bases from a fasta file | ||
==[[Getinsertsize]]== | |||
Get an estimate of the insert sizes, and read lengths. | |||
=Other stuff= | =Other stuff= | ||
Line 22: | Line 26: | ||
==bambi/bammer== | ==bambi/bammer== | ||
(Thorfinn, april20 2012) | (Thorfinn, april20 2012) | ||
This was a precursor to angsd. But it should be useful for playing around with single bam files. | |||
===Download=== | |||
[[http://www.popgen.dk/software/download/bambi.tar.gz]] | |||
===Usage=== | ===Usage=== | ||
Program is invoked with either: | Program is invoked with either: | ||
Line 83: | Line 88: | ||
</pre> | </pre> | ||
===Technics=== | |||
Program spawns a single thread, when a chunk has been read, and the thread is not finished it will wait. | Program spawns a single thread, when a chunk has been read, and the thread is not finished it will wait. | ||
==ieatgor== | |||
Extract lines from file in "gor" format. The files can be found /home/software/public/software/intern/ieatgor/ieatgor on pontus | |||
to compile see header of most recent version (V3) | |||
<pre> | <pre> | ||
./ieatgor | |||
RUN: ieatgor <targetFile> <file> OPTIONS | |||
target file format: chrName:start-end | |||
OPTIONS: | |||
-offset INT offset the target positions by INT | |||
-skip INT number of lines to skip (will be printet) | |||
-num INT are the chromosome in lexical (default 0) or numeric (1) order | |||
</pre> | </pre> | ||
The target file must be sorted. example of target file | |||
The target file must be sorted in the same order as the file you want to extract from (gor file). example of target file | |||
<pre> | <pre> | ||
chr1:22323:232322 | chr1:22323:232322 | ||
Line 103: | Line 119: | ||
</pre> | </pre> | ||
A "gor" file is a file with the first | |||
A "gor" file is a file with the first column being the chromosome and the second column being the positions. The rest of the columns are not important. The gor file must be sorted based on the positions and chr. gz gor files are also accepted. The sorting must be lexical or numeric | |||
example of a gor file | example of a gor file | ||
<pre> | <pre> | ||
Line 119: | Line 136: | ||
./ieatgor file.target file.gor | ./ieatgor file.target file.gor | ||
</pre> | </pre> | ||
the extracted line are returned to standard out. | the extracted line are returned to standard out. | ||
==getliner== | |||
NB HUGE BUG, must +1 when using -l and zipped files. Sorry for inconveincace. (tsk) | NB HUGE BUG, must +1 when using -l and zipped files. Sorry for inconveincace. (tsk) | ||
Line 141: | Line 158: | ||
==calcABBABABA== | |||
This is program that calculates ABBA-BABA related stats for a given chromosome. | This is program that calculates ABBA-BABA related stats for a given chromosome. | ||
Line 159: | Line 176: | ||
The program outputs three files: an abbababa file (with only abbababa stats), a count file (with overall counts for the chromosome) and a summary file where the compile time, run time and the cmds and the input files used when the program was run are all listed. | The program outputs three files: an abbababa file (with only abbababa stats), a count file (with overall counts for the chromosome) and a summary file where the compile time, run time and the cmds and the input files used when the program was run are all listed. | ||
==plot Plink== | |||
[[Rfun/][Folder]] | [[Rfun/][Folder]] | ||
Line 168: | Line 185: | ||
==liftover a plink file== | |||
Line 209: | Line 192: | ||
Note that if two SNP map to the same position after liftover one of them is removed. | Note that if two SNP map to the same position after liftover one of them is removed. | ||
<pre> | <pre> | ||
#!/bin/bash | #!/bin/bash | ||
#make liftover file | #make liftover file | ||
Line 219: | Line 203: | ||
echo "supply a plink prefix" | echo "supply a plink prefix" | ||
exit | exit | ||
fi | fi | ||
Line 234: | Line 211: | ||
echo the file $Files.bed does not exists \(use plink --make-bed\) | echo the file $Files.bed does not exists \(use plink --make-bed\) | ||
fi | fi | ||
#make bed file for liftover | #make bed file for liftover | ||
Rscript -e "options(scipen=10);map<-read.table('$Files. | Rscript -e "options(scipen=10);map<-read.table('$Files.bim',hea=F,as.is=T);res<-cbind(paste('chr',map[,1],sep=''),map[,4]-1,map[,4],map[,2],0,'+');write.table(res,file='$Files.bedfile',col=F,row=F,qu=F,sep='\t')" | ||
#liftover | #liftover | ||
Line 254: | Line 232: | ||
echo The new plink file is called ${Files}Hg19 and are on the same strand as the original file | echo The new plink file is called ${Files}Hg19 and are on the same strand as the original file | ||
#flip strands | #flip strands | ||
cat ${Files}Hg19.bedfile | grep \- | cut -f4 > flipSNPs.txt | cat ${Files}Hg19.bedfile | grep -w \- | cut -f4 > flipSNPs.txt | ||
$plink --noweb --ped ${Files}Red.ped --map hg19.map --flip flipSNPs.txt --recode --make-bed --out ${Files}Hg19 | $plink --noweb --ped ${Files}Red.ped --map hg19.map --flip flipSNPs.txt --recode --make-bed --out ${Files}Hg19 | ||
$plink --noweb --bfile ${Files}Hg19 --recode --out ${Files}Hg19 | #$plink --noweb --bfile ${Files}Hg19 --recode --out ${Files}Hg19 | ||
</pre> | </pre> | ||
Line 268: | Line 246: | ||
[[angsd/][Download folder]] | [[angsd/][Download folder]] | ||
==plot admix results== | |||
===Download=== | |||
[[Rfun/plotAdmix_anders.R][Get the R script]] | [[Rfun/plotAdmix_anders.R][Get the R script]] | ||
Line 288: | Line 265: | ||
=plot LD region from plink file= | |||
plot the LD in a region. You need SNPmatrix working. If you cannot get SNPmatrix to work use Relate instead (ldsnp3 function) | plot the LD in a region. You need SNPmatrix working. If you cannot get SNPmatrix to work use Relate instead (ldsnp3 function) | ||
===R function=== | |||
<pre> | <pre> | ||
ldplot<-function(x,pos,colR=colorRampPalette(c("white","yellow","red"))(40),minLod=3,xlab="Mb",lod,bg.col="lightblue",posScale=1e6,legend=FALSE,...){ | |||
if(missing(pos)) | |||
pos<-1:(nrow(x)+1) | |||
meanPos<-pos[-1]-diff(pos)/2 | |||
x[x==-1]<-NA | |||
if(!missing(lod)) | |||
x[lod<minLod]<-NA | |||
matBG<-matrix(NA,nrow=nrow(x)*2,ncol=ncol(x)) | |||
nr<-nrow(x) | |||
for(c in 1:ncol(matBG)) | |||
matBG[,c]<- c(rep(NA,c-1),rep(1,length(1:(nr-c+1))*2),rep(NA,c-1)) | |||
mat<-matrix(NA,nrow=nrow(x)*2,ncol=ncol(x)) | |||
nr<-nrow(x) | |||
for(c in 1:ncol(mat)) | |||
mat[,c]<- c(rep(NA,c-1),rep(x[1:(nr-c+1),c],each=2),rep(NA,c-1)) | |||
image(1:nrow(mat),1:ncol(x),matBG,col=bg.col,ylab="",axes=F,xlab=xlab,...) | |||
image(1:nrow(mat),1:ncol(x),mat,col= colR,add=TRUE,...) | |||
ticks<-unique(floor(meanPos/posScale))[-1] | |||
if(length(ticks)>0){ | |||
at<-sapply(1:length(ticks),function(x) which.min(abs(meanPos-ticks[x]*posScale))) | |||
axis(1,at=at*2,label=ticks) | |||
} | |||
if(legend) | |||
points(rep(ncol(mat)/10,1000),1000:1/1000*ncol(mat)/1.3+ncol(mat)*0.15,pch="-",col=colR[ceiling(1:1000/25)],cex=2) | |||
} | |||
</pre> | </pre> | ||
===example=== | |||
<pre> | <pre> | ||
library(snpMatrix) | library(snpMatrix) |
Latest revision as of 14:02, 24 September 2014
This sofware is for internal use. If you found this page then congrats. Feel free you use the methods but they might not work, give wrong results or other nasty stuff. If and when they fuck up don't contact us or complain.
Secret software
Grepper
Formerly known as getliner. Used for grep'ing with in columns
Sortdep
Fast counting of integers
RefFinder
Extract bases from a fasta file
Getinsertsize
Get an estimate of the insert sizes, and read lengths.
Other stuff
chisquare
A small library build using numerical recipies third edition for calculating stuff relating to the chisqdistribution
bambi/bammer
(Thorfinn, april20 2012) This was a precursor to angsd. But it should be useful for playing around with single bam files.
Download
[[1]]
Usage
Program is invoked with either:
./bammer view #samtools view/ single sample only ./bammer uppile #This does emulates the samtools mpileup
Inputfiles are supplied with either
tmp.bam #single bamfile -b bam.list #filelist containing bamfiles -b bam.list -nInd INT #only select the first INT samples from bam.list
Regions of interest can be supplied by defining a region in the following way
-r chr:a-b #select chromsome chr from a, to b -r chr:-b #select chromosome chr from beginning to be -r chr:b- #select chromosome chr from b to end of chromosome
If we want to dump different regions in the same run, this can be done with
-rf regions.file #every region is on a newline, has same format as above
Program can also estimate genotypelikelihoods if uppile is selected
-GL 0 #SOAPsnp -GL 1 #samtools -GL 2 #gatk
Output is hardcode dumped in glfs.glfs, these is a binary double file. Information for first site is the first 10*sizeof(double)*nInd bytes. This is the samtools ordering AA,AC,AG,...
Information about the position is dumped in mafs.mafs. This is NOT the MAF, but simply a 2 column file with chr tab.
Below are some examples for running the program
#view commands below ./bammer view new2.bam ./bammer view bwape_sort_rmdup.bam chr21:9483252-9672320 ./bammer view bwape_sort_rmdup.bam chrY:-1000000 #mpileup below ./bammer uppile -b lucampHighBam.filelist -nInd 2 -r chr1:-376920 #samtools estimation of GL ./bammer uppile -b lucampHighBam.filelist -nInd 2 -r chr1:-376920 -GL 1
Technics
Program spawns a single thread, when a chunk has been read, and the thread is not finished it will wait.
ieatgor
Extract lines from file in "gor" format. The files can be found /home/software/public/software/intern/ieatgor/ieatgor on pontus to compile see header of most recent version (V3)
./ieatgor RUN: ieatgor <targetFile> <file> OPTIONS target file format: chrName:start-end OPTIONS: -offset INT offset the target positions by INT -skip INT number of lines to skip (will be printet) -num INT are the chromosome in lexical (default 0) or numeric (1) order
The target file must be sorted in the same order as the file you want to extract from (gor file). example of target file
chr1:22323:232322 chr10:10000:10000000 chr3:10000:10000000
A "gor" file is a file with the first column being the chromosome and the second column being the positions. The rest of the columns are not important. The gor file must be sorted based on the positions and chr. gz gor files are also accepted. The sorting must be lexical or numeric
example of a gor file
chr1 1033267 2 3 know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9 chr1 1033268 0 2 know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9 chr1 1033279 3 1 know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9 chr1 1033286 0 2 know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9 chr1 1033297 0 2 know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9 chr1 1033304 0 1 know 0.000000 EM 0.000000 unk 0.000000 EMunk 0.000000 Nind 9 ...
run example
./ieatgor file.target file.gor
the extracted line are returned to standard out.
getliner
NB HUGE BUG, must +1 when using -l and zipped files. Sorry for inconveincace. (tsk)
This is a general tool for extract lines from newline seperated textfiles (can be gz compressed). Getliners4 allows for complement grep, use '-v 1'. Default is '-v 0'. Program can also print out basic statistics for hit/nonhit if -i filename is supplied. getliners6.cpp tokinzes on \r, so should work on windows files.
Download here getliners6.cpp .
g++ getliners5.cpp -lz -O3 -o getliners
Program can either extract specific lines, or "keys". The linenumbers to extract must be 1-indexed (first line i 1). When using keys, you must supply which column to 'grep' for. The program builds an associative array of the keys. And therefore only a single pass of the datafile is required.
Fields in the datafile can be seperated by a delimiter that can be specified at runtine with the "-d " argument. Default is space/tab. If delimter is an escape sequence it will not work and the code should be modified.
calcABBABABA
This is program that calculates ABBA-BABA related stats for a given chromosome.
It can be downloaded here abbababatest.cpp and can be compiled with:
g++ -O3 abbababatest.cpp -o calcABBABABA -lz
To get all options just type
./calcABBABABA
An example run cmd
../calcABBABABA -file /space/genomes/refgenomes/ancestal/hg19/shortForm/chr20.ans -bam testbamfilelistbotoV2chr20.txt -outfileprefix testbotoV2Chr20plus1x2 -blocksz 5
The program outputs three files: an abbababa file (with only abbababa stats), a count file (with overall counts for the chromosome) and a summary file where the compile time, run time and the cmds and the input files used when the program was run are all listed.
plot Plink
[[Rfun/][Folder]]
Rscript /home/software/public/software/intern/Rfun/plink.plot.R <plink output file>
liftover a plink file
You might have the change the chainFile path and the liftover path if you are not on pontus Note that if two SNP map to the same position after liftover one of them is removed.
#!/bin/bash #make liftover file Files=$1 liftoverProg=/home/albrecht/bin/prog/liftover/liftOver chainFile=/home/albrecht/bin/prog/liftover/chainFiles/hg18ToHg19.over.chain plink=/home/albrecht/bin/prog/plink-1.07-x86_64/plink if [ "$#" -eq 0 ]; then echo "supply a plink prefix" exit fi if [ -f $Files.bed ] then echo using file $Files.bed else echo the file $Files.bed does not exists \(use plink --make-bed\) fi #make bed file for liftover Rscript -e "options(scipen=10);map<-read.table('$Files.bim',hea=F,as.is=T);res<-cbind(paste('chr',map[,1],sep=''),map[,4]-1,map[,4],map[,2],0,'+');write.table(res,file='$Files.bedfile',col=F,row=F,qu=F,sep='\t')" #liftover $liftoverProg $Files.bedfile $chainFile ${Files}Hg19.bedfile ${Files}Hg19.unmapped # rm unmapped grep -v \# ${Files}Hg19.unmapped | cut -f4 > rmSNPs.txt #rm duplicates cat ${Files}Hg19.bedfile | sort -k1 | rev | uniq -d -f3 | rev | cut -f4 > dubSNP.txt echo removing `wc -l dubSNP.txt` cat dubSNP.txt >> rmSNPs.txt $plink --noweb --bfile $Files --exclude rmSNPs.txt --recode --out ${Files}Red #change pos Rscript -e "liftmap<-read.table('${Files}Hg19.bedfile',hea=F,colC=c('character','integer')[c(1,2,2,1,2,1)]);liftmap<-read.table('${Files}Hg19.bedfile',hea=F,as.is=T);map<-read.table('${Files}Red.map',hea=F,as.is=T); liftmap<-liftmap[liftmap[,4]%in%map[,2],];rownames(map)<-map[,2];map[liftmap[,4],4]<-liftmap[,3];map[liftmap[,4],1]<-sub('chr','',liftmap[,1]);write.table(map,file='hg19.map',col=F,row=F,qu=F)" echo The new plink file is called ${Files}Hg19 and are on the same strand as the original file #flip strands cat ${Files}Hg19.bedfile | grep -w \- | cut -f4 > flipSNPs.txt $plink --noweb --ped ${Files}Red.ped --map hg19.map --flip flipSNPs.txt --recode --make-bed --out ${Files}Hg19 #$plink --noweb --bfile ${Files}Hg19 --recode --out ${Files}Hg19
- run pre
./liftover.sh myPlinkFile
- angsd
[[angsd/][Download folder]]
plot admix results
Download
[[Rfun/plotAdmix_anders.R][Get the R script]]
or the more fancy one
[[Rfun/plotAdmix2.R][fancier plotter (ida)]]
to get the options type
Rscript plotAdmix_anders.R Rscript plotAdmix2.R
plot LD region from plink file
plot the LD in a region. You need SNPmatrix working. If you cannot get SNPmatrix to work use Relate instead (ldsnp3 function)
R function
ldplot<-function(x,pos,colR=colorRampPalette(c("white","yellow","red"))(40),minLod=3,xlab="Mb",lod,bg.col="lightblue",posScale=1e6,legend=FALSE,...){ if(missing(pos)) pos<-1:(nrow(x)+1) meanPos<-pos[-1]-diff(pos)/2 x[x==-1]<-NA if(!missing(lod)) x[lod<minLod]<-NA matBG<-matrix(NA,nrow=nrow(x)*2,ncol=ncol(x)) nr<-nrow(x) for(c in 1:ncol(matBG)) matBG[,c]<- c(rep(NA,c-1),rep(1,length(1:(nr-c+1))*2),rep(NA,c-1)) mat<-matrix(NA,nrow=nrow(x)*2,ncol=ncol(x)) nr<-nrow(x) for(c in 1:ncol(mat)) mat[,c]<- c(rep(NA,c-1),rep(x[1:(nr-c+1),c],each=2),rep(NA,c-1)) image(1:nrow(mat),1:ncol(x),matBG,col=bg.col,ylab="",axes=F,xlab=xlab,...) image(1:nrow(mat),1:ncol(x),mat,col= colR,add=TRUE,...) ticks<-unique(floor(meanPos/posScale))[-1] if(length(ticks)>0){ at<-sapply(1:length(ticks),function(x) which.min(abs(meanPos-ticks[x]*posScale))) axis(1,at=at*2,label=ticks) } if(legend) points(rep(ncol(mat)/10,1000),1000:1/1000*ncol(mat)/1.3+ncol(mat)*0.15,pch="-",col=colR[ceiling(1:1000/25)],cex=2) }
example
library(snpMatrix) #snp info from plink files pl<-read.plink("/space/anders/greenland/data/plink/datGRDKplus") plB<-read.table("/space/anders/greenland/data/plink/datGRDKplus.bim") #choose region min.pos<-70e6 max.pos<-81e6 chr=1 minMaf<-0.05 minCallrate<-0.95 depth<-400 #depth of ld calculations (window size) # calculatate maf and callrate info<-col.summary(pl) #chose SNPs table(keep <- plB[,1]==chr&plB[,4]>min.pos&plB[,4]<max.pos & info$MAF>minMaf & !is.na(info$MAF) & info$Call.rate > minCallrate) #estimate LD ldinfo <- ld.snp(pl[,keep],depth=depth) #plot LD (D') ldplot(ldinfo$dprime,pos=plB[keep,4],main="LD (D')") #plot R squared ldplot(ldinfo$rsq2,pos=plB[keep,4],main="LD (r2)") #change color palette colR=colorRampPalette(c("white","grey","black"))(40) ldplot(ldinfo$rsq2,pos=plB[keep,4],main="LD (r2)",colR=colR)
This example produces the plots: [[andersLink/ldpic1.pdf][pdf]]