I have a similar question.
What I want: a tokeniser generator which can take a lex-style grammar (not
necessarily lex syntax, but a set of token specifications defined by
REs, BNF, or whatever), generate a DFA, then run the DFA on sequences of
bytes. It must allow the syntax to be defined at run-time.
What I don't want: anything written by someone who doesn't understand the
field (i.e. anything which doesn't use a DFA).
lepl will do this, but it's integrated with the rest of the parser
(which is recursive descent).
for example:
float = Token(Float())
word = Token(Word(Lower())
punctuation = ~Token(r'[\.,]')
line = (float | word)[:, punctuation]
parser = line.string_parser()
will generate a lexer with three tokens. here two are specified using
lepl's matchers and one using a regexp, but in all three cases they
are converted to dfas internally.
then a parser is generated that will match a sequence of floats and
words, separated by punctuation. spaces are discarded by the lexer by
default, but that can be changed through the configuration (which
would be passed to the string_parser method).
it's also possible to specify everything using matchers and then get
lepl to compile "as much as possible" of the matcher graph to nfas
before matching (nfas rather than dfas because they are implemented
with a stack to preserve the backtracking abilities of the recursive
descent parser they replace). the problem here is that not all
matchers can be converted (matchers can contain arbitrary python
functions, while my nfa+dfa implementations cannot, and also my
"compiler" isn't very smart), while using tokens explicitly gives you
an error if the automatic compilation fails (in which case the simple
fix is to just give the regexp).
(also, you say "sequence of bytes" rather than strings - lepl will
parse the byte[] type in python3 and even has support for matching
binary values).
disclaimer: newish library, python 2.6+ only, and while i have quite a
few users (or, at least, downloads), i doubt that many use these more
advanced features, and everything is pure python with little
performance tuning so far.
andrew