Linux Terminal: How to do fuzzy search with tre-agrep

Dec 302013

Probably everyone that use a terminal know the command grep, from its man page:

grep searches the named input FILEs (or standard input if no files are named, or if a single hyphen-minus (-) is given as file name) for lines containing a match to the given PATTERN. By default, grep prints the matching lines.

So this is the best tool to search in big file for a specific pattern, or a specific process in the complete list of running processes, but it has a small limit, it searches for the exact string that you ask, and sometime it could be useful to do an “approximate” or “fuzzy” search.

For this goal the program agrep was firstly developed, from wikipedia we can see some detail of this software:

agrep (approximate grep) is a proprietary approximate string matching program, developed by Udi Manber and Sun Wu between 1988 and 1991, for use with the Unix operating system. It was later ported to OS/2, DOS, and Windows.

It selects the best-suited algorithm for the current query from a variety of the known fastest (built-in) string searching algorithms, including Manber and Wu’s bitap algorithm based on Levenshtein distances.

agrep is also the search engine in the indexer program GLIMPSE. agrep is free for private and non-commercial use only, and belongs to the University of Arizona.

So it’s closed source, but luckily there is an open source source alternative: tre-agrep

Tre Library

TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.

The matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression. In other words, the time complexity of the algorithm is O(M^2N), where M is the length of the regular expression and N is the length of the text. The used space is also quadratic on the length of the regex, but does not depend on the searched string. This quadratic behaviour occurs only on pathological cases which are probably very rare in practice.

Approximate matching

Approximate pattern matching allows matches to be approximate, that is, allows the matches to be close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure (also known as the Levenshtein distance) where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match. Each insertion, deletion, or substitution adds the distance, or cost, of the match. TRE can report the matches which have a cost lower than some given threshold value. TRE can also be used to search for matches with the lowest cost.

Installation

Tre-agrep it’s usually not installed by default by any distribution but it’s available in the repository so you can easily install it with the package manager of your distribution for Debian/Ubuntu and Mint you can use the command:

apt-get install tre-agrep

Basic Usage

The usage it’s best shown with some simple example of this powerfull command, given the file example.txt that contains:

Résumé
RÉSUMÉ
resume
Resümee
rèsümê
Resume
linuxaria

These are the output of the command tre-agrep with some different option:

 mint-desktop tmp # tre-agrep resume example.txt
resume
 
mint-desktop tmp # tre-agrep -i resume example.txt
resume
Resume
 
mint-desktop tmp # tre-agrep -1 -i resume example.txt
resume
Resümee
Resume
 
mint-desktop tmp # tre-agrep -2 -i resume example.txt
Résumé
RÉSUMÉ
resume
Resümee
Resume

As you can see without any option it got the same result as a normal grep, the -i option it’s used to Ignore case distinctions, and the interesting options are the -1 and -2, these are the distances allowed in the search, so the bigger the number the more result you’ll get, as you allow a greater “distance” from the original pattern.

To see the distance of each match you can use the option -s that print each match cost:

mint-desktop tmp # tre-agrep -5 -s -i resume example.txt
2:Résumé
2:RÉSUMÉ
0:resume
1:Resümee
3:rèsümê
0:Resume
5:linuxaria

So in this example the string Resume has a cost of 0, while linuxaria has a cost of 5.

Another interesting options are those to assign a cost for the different operations:

-D NUM, –delete-cost=NUM Set cost of missing characters to NUM.
-I NUM, –insert-cost=NUM Set cost of extra characters to NUM.
-S NUM, –substitute-cost=NUM Set cost of incorrect characters to NUM. Note that a deletion (a missing character) and an insertion (an extra character) together constitute a substituted character, but the cost will be the that of a deletion and an insertion added together.

Conclusions

The command tre-agrep it’s yet another small tool that can save your day if you work a lot with terminals and bash scripts, all you have to do it’s to remember that it exists at the right moment.

Linuxaria

Linux Terminal: How to do fuzzy search with tre-agrep

Installation

Basic Usage

Conclusions

Popular Posts:

Leave a Reply Cancel reply