Grepper

From software
Jump to navigation Jump to search

Small usefull command line program called grepper which is a usefull to complement POSIX grep.

The idea is that you have a file that can be 'whatever' delimited and then you want to extract entire rows, but you only want to do the search in specific columns.

  1. You supply a file containing all the keys that will be searched for in the datafile. The program will only do a single pass of the datafile.
  2. You specify which column to use for the search.

The program supports the following options which should make it somewhat similar to grep

  • -i don't care about cases (capital letter vs small letter)
  • -w seach for whole words
  • -v do complement

Brief overview

usage: grepper [OPTION] -k keyfile datafile.gz
usage: gunzip -c datafile.gz | grepper [OPTION] -k keyfile
options:
	-c [int]: which column to use for grepping (1 indexed)
	-d [char] delimitor for the datafile
	-w search for whole words (similar to grep -w option)
	-v complement grep (similar to grep -v option)
	-i ignore case (similar to grep -i option)

Installation

Program is a single c++ file that can be downloaded [1]

wget http://popgen.dk/software/download/grepper.cpp
g++ grepper.cpp -O3 -o grepper -lz

Examples

Assuming the keys file:

"Setosa"
"Virginica"

And the datafile

"Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
"1" 5.1 3.5 1.4 0.2 "setosa"
"2" 4.9 3 1.4 0.2 "Setosa"
"3" 4.7 3.2 1.3 0.2 "setosa"
"4" 4.6 3.1 1.5 0.2 "setosa"
"5" 5 3.6 1.4 0.2 "setosa"
"6" 5.4 3.9 1.7 0.4 "setosa"
"7" 4.6 3.4 1.4 0.3 "setosa"
"8" 5 3.4 1.5 0.2 "setosa"
"9" 4.4 2.9 1.4 0.2 "setosa"
"101" 6.3 3.3 6 2.5 "virginica"
"102" 5.8 2.7 5.1 1.9 "virginica"
"103" 7.1 3 5.9 2.1 "virginica"
"104" 6.3 2.9 5.6 1.8 "virginica"
"105" 6.5 3 5.8 2.2 "virginica"
"106" 7.6 3 6.6 2.1 "virginica"
"107" 4.9 2.5 4.5 1.7 "virginica"
"108" 7.3 2.9 6.3 1.8 "virginica"
"109" 6.7 2.5 5.8 1.8 "Virginica"
"110" 7.2 3.6 6.1 2.5 "virginica"
"51" 7 3.2 4.7 1.4 "versicolor"
"52" 6.4 3.2 4.5 1.5 "versicolor"
"53" 6.9 3.1 4.9 1.5 "versicolor"
"54" 5.5 2.3 4 1.3 "versicolor"
"55" 6.5 2.8 4.6 1.5 "versicolor"
"56" 5.7 2.8 4.5 1.3 "versicolor"
"57" 6.3 3.3 4.7 1.6 "versicolor"
"58" 4.9 2.4 3.3 1 "versicolor"
"59" 6.6 2.9 4.6 1.3 "versicolor"
"60" 5.2 2.7 3.9 1.4 "versicolor"

1.

We have single space delimited datafile and we want to grep from column6

./grepper -k key datafile -d ' ' -c 6
"2" 4.9 3 1.4 0.2 "Setosa"
"109" 6.7 2.5 5.8 1.8 "Virginica"

2.

If we don't care about the cases add -i

./grepper -k key datafile -d ' ' -c 6 -i
"1" 5.1 3.5 1.4 0.2 "setosa"
"2" 4.9 3 1.4 0.2 "Setosa"
"3" 4.7 3.2 1.3 0.2 "setosa"
"4" 4.6 3.1 1.5 0.2 "setosa"
"5" 5 3.6 1.4 0.2 "setosa"
"6" 5.4 3.9 1.7 0.4 "setosa"
"7" 4.6 3.4 1.4 0.3 "setosa"
"8" 5 3.4 1.5 0.2 "setosa"
"9" 4.4 2.9 1.4 0.2 "setosa"
"101" 6.3 3.3 6 2.5 "virginica"
"102" 5.8 2.7 5.1 1.9 "virginica"
"103" 7.1 3 5.9 2.1 "virginica"
"104" 6.3 2.9 5.6 1.8 "virginica"
"105" 6.5 3 5.8 2.2 "virginica"
"106" 7.6 3 6.6 2.1 "virginica"
"107" 4.9 2.5 4.5 1.7 "virginica"
"108" 7.3 2.9 6.3 1.8 "virginica"
"109" 6.7 2.5 5.8 1.8 "Virginica"
"110" 7.2 3.6 6.1 2.5 "virginica"

3.

say we want the complement of example 1.

./grepper -k key datafile -d ' ' -c 6 -v
"Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
"1" 5.1 3.5 1.4 0.2 "setosa"
"3" 4.7 3.2 1.3 0.2 "setosa"
"4" 4.6 3.1 1.5 0.2 "setosa"
"5" 5 3.6 1.4 0.2 "setosa"
"6" 5.4 3.9 1.7 0.4 "setosa"
"7" 4.6 3.4 1.4 0.3 "setosa"
"8" 5 3.4 1.5 0.2 "setosa"
"9" 4.4 2.9 1.4 0.2 "setosa"
"101" 6.3 3.3 6 2.5 "virginica"
"102" 5.8 2.7 5.1 1.9 "virginica"
"103" 7.1 3 5.9 2.1 "virginica"
"104" 6.3 2.9 5.6 1.8 "virginica"
"105" 6.5 3 5.8 2.2 "virginica"
"106" 7.6 3 6.6 2.1 "virginica"
"107" 4.9 2.5 4.5 1.7 "virginica"
"108" 7.3 2.9 6.3 1.8 "virginica"
"110" 7.2 3.6 6.1 2.5 "virginica"
"51" 7 3.2 4.7 1.4 "versicolor"
"52" 6.4 3.2 4.5 1.5 "versicolor"
"53" 6.9 3.1 4.9 1.5 "versicolor"
"54" 5.5 2.3 4 1.3 "versicolor"
"55" 6.5 2.8 4.6 1.5 "versicolor"
"56" 5.7 2.8 4.5 1.3 "versicolor"
"57" 6.3 3.3 4.7 1.6 "versicolor"
"58" 4.9 2.4 3.3 1 "versicolor"
"59" 6.6 2.9 4.6 1.3 "versicolor"
"60" 5.2 2.7 3.9 1.4 "versicolor"