B
bpontius
The GES Algorithm
A Surprisingly Simple Algorithm for Parallel Pattern Matching
"Partially because the best algorithms presented in the literature
are difficult to understand and to implement, knowledge of fast and
practical algorithms is not commonplace."
Hume and Sunday, "Fast String Searching", Software - Practice
and Experience, Vol. 21 # 11, pp 1221-48
In other disciplines, such as mathematics, the solution to a complex
problem is frequently startling in its simplicity. In the world of
software, this is usually not the case. At its worst, a software
solution may simply be a road map of the programmer's confusion as he
or she attempts to grasp the problem. At its best, this usually
reveals the programmer's pride in being able to reproduce and
proliferate complexity, rather than to think things through until the
problem's innate simplicity shines through.
In 1985, one of the tasks that I was asked to accomplish at my first
programming job, was that of creating a user configurable terminal
emulation program. As it happened, Dr. Dobb's Journal had just
published an article called "Parallel Pattern Matching and Fgrep", by a
gentleman named Ian E. Ashdown. The article presented an
implementation, in C, of the Aho-Corasick algorithm. I read the
article and, difficult as it was, adapted the code they published to
create my terminal emulator.
A few years ago, I was about to look for a job, after having spent
several months relocating, and not programming. I decided to polish my
C++ skills. Having used the Aho-Corasick Algorithm previously, my goal
was to implement it, in a clean and efficient manner, using C++.
Having seen several implementations in C, I consulted a respected
book, Practical Algorithms in C++ by Bryan Flamig, (1999 John Wiley &
Sons) for his implementation in C++ of the Aho-Corasick system. After
reading this, I concluded that the algorithm did not lend itself to a
clean or efficient implementation in C++. Flaming makes the following
comment in his book: "The logic behind the ConstructDFA() function
is rather difficult to explain (the author has trouble understanding it
himself), so we leave the explanation to the original Aho-Corasick
paper." Upon seeing this, I read their paper.
Never having learned FORTRAN, it was only when I read their original
paper, and discovered that their algorithm was thinking and speaking
FORTRAN, that it all made sense to me. I was immediately reminded of
an old programming joke (the humor of which I have never understood),
"The determined programmer can write a FORTRAN program in any
language". Having no desire to do this, I asked myself, "Just how
complicated can the creation of a finite state automaton for parallel
pattern matching be?"
As I discovered, the problem itself is quite simple. While the
Aho-Corasick system has been, for thirty years, regarded as the only
way to create parallel pattern matching machines, its complexity far
surpasses the complexity of the problem, and it is in no way
object-oriented.
The introduction of object-oriented programming languages has been a
major step forward for the software development world, and it allows us
to model our simple solution to the problem with some elegance. While
our system lends itself to a straightforward and efficient
implementation in most current programming languages, it is
particularly well suited to object-oriented languages, such as C++ and
Java. (As a matter of fact, our solution is simple enough that it can
be implemented using three-by-five index cards)!
I immediately formed a company, Great East Software, and applied for a
patent for our GES algorithm. The algorithm is now patent-pending.
The simplicity of our algorithm allows for array-based state
transitions, which means that a variety of efficient implementations
may easily be created. And, implemented in standard C++, the
prototypes which we have created greatly outperform any implementations
of the Aho-Corasick algorithm, or anything else against which we have
been able to test.
My question is this - does anyone know of a large software company who
might be interested in my algorithm?
A Surprisingly Simple Algorithm for Parallel Pattern Matching
"Partially because the best algorithms presented in the literature
are difficult to understand and to implement, knowledge of fast and
practical algorithms is not commonplace."
Hume and Sunday, "Fast String Searching", Software - Practice
and Experience, Vol. 21 # 11, pp 1221-48
In other disciplines, such as mathematics, the solution to a complex
problem is frequently startling in its simplicity. In the world of
software, this is usually not the case. At its worst, a software
solution may simply be a road map of the programmer's confusion as he
or she attempts to grasp the problem. At its best, this usually
reveals the programmer's pride in being able to reproduce and
proliferate complexity, rather than to think things through until the
problem's innate simplicity shines through.
In 1985, one of the tasks that I was asked to accomplish at my first
programming job, was that of creating a user configurable terminal
emulation program. As it happened, Dr. Dobb's Journal had just
published an article called "Parallel Pattern Matching and Fgrep", by a
gentleman named Ian E. Ashdown. The article presented an
implementation, in C, of the Aho-Corasick algorithm. I read the
article and, difficult as it was, adapted the code they published to
create my terminal emulator.
A few years ago, I was about to look for a job, after having spent
several months relocating, and not programming. I decided to polish my
C++ skills. Having used the Aho-Corasick Algorithm previously, my goal
was to implement it, in a clean and efficient manner, using C++.
Having seen several implementations in C, I consulted a respected
book, Practical Algorithms in C++ by Bryan Flamig, (1999 John Wiley &
Sons) for his implementation in C++ of the Aho-Corasick system. After
reading this, I concluded that the algorithm did not lend itself to a
clean or efficient implementation in C++. Flaming makes the following
comment in his book: "The logic behind the ConstructDFA() function
is rather difficult to explain (the author has trouble understanding it
himself), so we leave the explanation to the original Aho-Corasick
paper." Upon seeing this, I read their paper.
Never having learned FORTRAN, it was only when I read their original
paper, and discovered that their algorithm was thinking and speaking
FORTRAN, that it all made sense to me. I was immediately reminded of
an old programming joke (the humor of which I have never understood),
"The determined programmer can write a FORTRAN program in any
language". Having no desire to do this, I asked myself, "Just how
complicated can the creation of a finite state automaton for parallel
pattern matching be?"
As I discovered, the problem itself is quite simple. While the
Aho-Corasick system has been, for thirty years, regarded as the only
way to create parallel pattern matching machines, its complexity far
surpasses the complexity of the problem, and it is in no way
object-oriented.
The introduction of object-oriented programming languages has been a
major step forward for the software development world, and it allows us
to model our simple solution to the problem with some elegance. While
our system lends itself to a straightforward and efficient
implementation in most current programming languages, it is
particularly well suited to object-oriented languages, such as C++ and
Java. (As a matter of fact, our solution is simple enough that it can
be implemented using three-by-five index cards)!
I immediately formed a company, Great East Software, and applied for a
patent for our GES algorithm. The algorithm is now patent-pending.
The simplicity of our algorithm allows for array-based state
transitions, which means that a variety of efficient implementations
may easily be created. And, implemented in standard C++, the
prototypes which we have created greatly outperform any implementations
of the Aho-Corasick algorithm, or anything else against which we have
been able to test.
My question is this - does anyone know of a large software company who
might be interested in my algorithm?