« CNN elects “Magic Wall”
» Show us The Stuff Already

Programming, Research

Inventing the Wheel of Search

02.08.08 | Comment?

This morning when I woke up I noticed it was still very early, so I stayed in bed a little longer and thought through some of the challenges I face in my current projects.

One such problem is high speed string-matching, or more simply put, checking if a piece of text in one list is also present in another list, without needing to check the entire list as a worst-case scenario (i.e. checking every single piece of text only to discover it is not there). I turned to hashtables for the time being, but I kept thinking there could be better ways to handle this.

I mused a bit about how our brains handle these types of searches, and quite suddenly a solution presented itself in my mind; an elegant way of eliminating most of the non-matches in an efficient manner.

I made some quick sketches of the idea so I wouldn’t forget and got out of bed.

At the office I sketched some more diagrams and added some optimizations. Then I figured that the chances of me being the first to invent something so elegant were pretty small, so I opened up Google and typed:

high performance string matching

in the search bar, where I found some interesting papers about Network Intrusion Detection Systems (NIDS). These systems constantly monitor packets of data to check for known patterns. If a suspicious pattern is detected the datapacket can be quarantined and defensive action taken.

NIDS have to operate at “line speed” - checking the data while it passes along without slowing it down. Since dataspeeds keep increasing, it also becomes increasingly hard to monitor datastreams, which is a security hazard. There is simply not enough time to check everything thoroughly, like when thousands of people attend a concert or rave, some will probably succeed in smuggling drugs or alcohol inside because security checks are mostly superficial and deterring in nature.

For NIDS this is unacceptable, and it has become a powerful innovation driver for string matching algorithms. When I investigated the leading Open Source NID System, Snort, I learned it uses a modified Aho-Corasick algorithm. Thank God for Wikipedia, as it details how it works.

There I read that it constructs a trie… (click)

And there I saw the exact same diagram I had drawn this morning. Yep. Yet again I succeeded in inventing something that has existed for almost 60 years…

Trie example

As the Wikipedia article notes, the term trie comes from the word retrieval, and is pronounced as tree. Trie-based search algorithms rank among the most efficient algorithms in the world and are probably used by most internet search engines (although all vendors tend to be very secretive about their implementations).

The algorithm takes care of the first stage of a search: matching a set of keywords to a list of texts that contain the keywords. Google’s genius lies mostly in the stage after this, the ordering of the results. Still, they depend on clever indexing and high speed matching to deliver the break-neck speeds that they do.

Word completion

It seemed very obvious to me that it can be used as an ideal word-completion algorithm as well. T9 (the text-completion algorithm used in many cell-phones) probably works the same way, breaking up words not in single characters but in groups of characters (’abc’, ‘def’ in stead of ‘a’, ‘b’, etc.)

Fuzzy search and Automatic spelling correction

Automatically correcting spelling-errors is also relatively easy, as you could use a Wu-Manber algorithm to find words in your wordlist (which is finite) that are almost identical in spelling. Many misspellings usually have much less associated texts than their correctly spelled equivalents, so the software can make an educated guess of which is the right one without any human intervention.

I’m thinking of writing an Open Source search engine for PHP-based websites that employs these algorithms, as I am not aware of one existing at present. If I could just find some spare time…

have your say

You can skip to the end and leave a response. Pinging is currently not allowed.

Be nice. Keep it clean. Stay on topic. No spam.

You can use these tags:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

:

:


« CNN elects “Magic Wall”
» Show us The Stuff Already