November 7, 2012
Title: Using Rabin-Karp Fingerprints and LevelDB for Faster Searches
Speaker: Richard Alan Deighton, Computer Science MSc student, Ontario Tech University
Abstract: This thesis presents the results of a study on using fingerprints, generated in a manner similar to the Rabin-Karp algorithm (Karp & Rabin, 1987), together with the LevelDB database (Google, 2011) to achieve text-search times faster than GREP (GNU, 2009), a standard command-line UNIX text-search tool. We discuss the following topics.
Text Search is a family of algorithms that find a string of characters, called a Search Pattern, within a much larger string of characters in a document we call a text file.
The Rabin-Karp algorithm (Karp & Rabin, 1987) iterates through a text file, converting character strings into fingerprints at each location. A fingerprint numerically represents the window-length string of characters to the left of its location. The algorithm then compares each fingerprint to the Search Pattern's fingerprint. When two fingerprints are unequal, the corresponding strings are guaranteed not to match; when they are equal, the strings likely match. A verification step confirms each candidate match by comparing the respective characters.
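The scan described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the base and modulus are arbitrary example choices, and the rolling update that slides the window by one character is the standard Rabin-Karp recurrence.

```python
# Illustrative parameters (not the thesis's): one "digit" per byte,
# with a prime modulus to keep fingerprints small.
BASE = 256
MOD = 1_000_003

def fingerprint(s: str) -> int:
    """Fingerprint of a string: its value as a base-BASE number, mod MOD."""
    fp = 0
    for ch in s:
        fp = (fp * BASE + ord(ch)) % MOD
    return fp

def rabin_karp(text: str, pattern: str) -> list[int]:
    """Return the starting offsets of every occurrence of pattern in text."""
    w = len(pattern)
    if w == 0 or w > len(text):
        return []
    target = fingerprint(pattern)
    fp = fingerprint(text[:w])
    high = pow(BASE, w - 1, MOD)  # weight of the window's leftmost character
    matches = []
    for i in range(len(text) - w + 1):
        # Equal fingerprints only suggest a match; verify character by character.
        if fp == target and text[i:i + w] == pattern:
            matches.append(i)
        if i + w < len(text):
            # Slide the window right: drop text[i], append text[i + w].
            fp = (fp - ord(text[i]) * high) % MOD
            fp = (fp * BASE + ord(text[i + w])) % MOD
    return matches
```

Note that unequal fingerprints let the scan skip the character-by-character comparison entirely, which is where the algorithm saves time over naive search.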
Our application makes three major changes to Rabin-Karp that are unique to us. First, we employ a two-step technique rather than one: in step 1, the preprocessing step, we calculate the fingerprints for the whole text file and store them in a LevelDB database we call the Index Database. Second, in step 2, the matching step, we look up the Search Pattern's fingerprint in the Index Database and gather its set of locations, rather than scanning the text. Our key change is how we arrive at that set of locations: we allow the pattern to be any length relative to the window length, whereas Rabin-Karp requires the two to be equal. We also derived an equation to check whether the difference in length is too large for the fingerprint's number system base.
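The two-step structure can be sketched as below. This is a hedged sketch, not the thesis's code: an in-memory dict stands in for LevelDB (a real implementation would use a LevelDB binding), the window length `W` and fingerprint parameters are illustrative, and for simplicity the lookup assumes the pattern is at least `W` characters long, whereas the thesis handles arbitrary pattern lengths.

```python
# Sketch of the two-step index approach. A dict stands in for the LevelDB
# Index Database; W, BASE, and MOD are illustrative choices, not the thesis's.
BASE = 256
MOD = 1_000_003
W = 4  # fingerprint window length

def fingerprint(s: str) -> int:
    fp = 0
    for ch in s:
        fp = (fp * BASE + ord(ch)) % MOD
    return fp

def build_index(text: str) -> dict[int, list[int]]:
    """Step 1 (preprocessing): map each window fingerprint to its locations."""
    index: dict[int, list[int]] = {}
    fp = fingerprint(text[:W])
    high = pow(BASE, W - 1, MOD)
    for i in range(len(text) - W + 1):
        index.setdefault(fp, []).append(i)
        if i + W < len(text):
            # Standard rolling update to the next window's fingerprint.
            fp = ((fp - ord(text[i]) * high) * BASE + ord(text[i + W])) % MOD
    return index

def search(text: str, index: dict[int, list[int]], pattern: str) -> list[int]:
    """Step 2 (matching): look up candidate locations, then verify characters.

    Simplifying assumption: len(pattern) >= W, so the pattern's first W
    characters serve as the lookup key.
    """
    key = fingerprint(pattern[:W])
    return [i for i in index.get(key, [])
            if text[i:i + len(pattern)] == pattern]
```

The preprocessing cost is paid once per file; every subsequent query is a single key lookup plus verification of the (typically few) candidate locations, which is what makes sub-GREP query times plausible.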
To evaluate performance, we built our application and tested it against GREP over a wide range of parameters. We conclude that although we currently outperform GREP in only about half the cases, we identify promising modifications to parts of our application that could allow it to outperform GREP in all instances.