Named regexp variables, an extension proposal.

Paddy

Proposal: Named RE variables
======================

The problem I have is that I am writing a 'good-enough' verilog tag
extractor as a long regular expression (with the 'x' flag for
readability), and find myself both
1) Repeating sections of the RE, and
2) Wanting to add '(?P<some_clarifier>...) ' around sections
because I know what the section does but don't really want
the group.

If I could write:
(?P/verilog_name/ [A-Za-z_][A-Za-z_0-9\$\.]* | \\\S+ )

....and have the RE parser extract the section of RE after the second
'/' and store it associated with its name that appears between the
first two '/'. The RE should NOT try and match against anything between
the outer '(' ')' pair at this point, just store.

Then the following code appearing later in the RE:
(?P=verilog_name)

....should retrieve the RE snippet named and insert it into the RE
instead of the '(?P=...)' group before interpreting the RE 'as normal'

Instead of writing the following to search for event declarations:
vlog_extract = r'''(?smx)
# Verilog event definition extraction
(?: event \s+ [A-Za-z_][A-Za-z_0-9\$\.]* \s* (?: , \s*
[A-Za-z_][A-Za-z_0-9\$\.]*)* )
'''
I could write the following RE, which I think is clearer:
vlog_extract = r'''(?smx)
# Verilog identifier definition
(?P/IDENT/ [A-Za-z_][A-Za-z_0-9\$\.]* (?!\.) )
# Verilog event definition extraction
(?: event \s+ (?P=IDENT) \s* (?: , \s* (?P=IDENT))* )
'''
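
The proposed behaviour can be approximated today with a small pre-processing pass over the pattern string. What follows is only a sketch under stated assumptions: `expand_named_snippets` and `_find_close` are hypothetical helper names, not part of the `re` module; the scanner assumes no unescaped parentheses inside verbose-mode comments and no character class that begins with `]`; and each reference is wrapped in a non-capturing group so that alternation inside a snippet stays contained.

```python
import re

# Hypothetical syntax from the proposal: (?P/NAME/ body ) defines a snippet.
DEF_START = re.compile(r"\(\?P/(\w+)/")


def _find_close(pattern, start):
    """Return the index of the ')' closing a group whose body starts at start."""
    depth = 1
    in_class = False
    i = start
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":          # skip the escaped character
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    raise ValueError("unbalanced parentheses in snippet definition")


def expand_named_snippets(pattern):
    """Strip (?P/NAME/ ...) definitions and expand (?P=NAME) references."""
    snippets = {}
    kept = []
    pos = 0
    while True:
        m = DEF_START.search(pattern, pos)
        if m is None:
            kept.append(pattern[pos:])
            break
        kept.append(pattern[pos:m.start()])
        close = _find_close(pattern, m.end())
        snippets[m.group(1)] = pattern[m.end():close].strip()
        pos = close + 1

    def substitute(m):
        name = m.group(1)
        if name in snippets:
            return "(?:" + snippets[name] + ")"
        return m.group(0)       # an ordinary named backreference: leave it alone

    return re.sub(r"\(\?P=(\w+)\)", substitute, "".join(kept))


vlog_extract = r'''(?smx)
# Verilog identifier definition
(?P/IDENT/ [A-Za-z_][A-Za-z_0-9\$\.]* (?!\.) )
# Verilog event definition extraction
(?: event \s+ (?P=IDENT) \s* (?: , \s* (?P=IDENT))* )
'''
pattern = re.compile(expand_named_snippets(vlog_extract))
print(pattern.search("event a, b_2 ;").group(0))   # prints: event a, b_2
```

Because the expansion happens before `re.compile` sees the string, the definition itself never takes part in matching, which is the "just store" behaviour asked for above.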

Extension; named RE variables, with arguments
===================================
In this extension, group references in the body of a variable
definition (such as \1 or (?P=subject)) refer to the literal contents
of the groups that appear after the variable name, inside the variable
reference, at the point where the variable is referenced.

So an RE variable definition like:
defs = r'(?smx) (?P/GO/ go \s for \s \1 )'

Used like:
rgexp = defs + r"""
(?P=GO (it) )
\s+
(?P=\GO (broke) )
"""
Would match the string:
"go for it go for broke"

As would:
defs2 = r'(?smx) (?P/GO/ go \s for \s (?P=subject) )'
rgexp = defs2 + r"""
(?P=GO (?P<subject> it) )
\s+
(?P=\GO (?P<subject> broke) )
"""

The above would allow me to factor out sections of REs and define
named, reusable RE snippets.


Please comment :)

- Paddy.
 
John Machin

On 13/05/2006 7:39 PM, Paddy wrote:
[snip]
Extension; named RE variables, with arguments
===================================
In this, all group definitions in the body of the variable definition
reference the literal contents of groups appearing after the variable
name, (but within the variable reference), when the variable is
referenced

So an RE variable definition like:
defs = r'(?smx) (?P/GO/ go \s for \s \1 )'

Used like:
rgexp = defs + r"""
(?P=GO (it) )
\s+
(?P=\GO (broke) )
"""
Would match the string:
"go for it go for broke"

As would:
defs2 = r'(?smx) (?P/GO/ go \s for \s (?P=subject) )'
rgexp = defs2 + r"""
(?P=GO (?P<subject> it) )
\s+
(?P=\GO (?P<subject> broke) )
"""

The above would allow me to factor out sections of REs and define
named, reusable RE snippets.


Please comment :)


1. Regex syntax is over-rich already.
2. You may be better off with a parser for this application instead of
using regexes.
3. "\\" is overloaded to the point of collapse already. Using it as an
argument marker could make the universe implode.
4. You could always use Python to roll your own macro expansion gadget,
like this:

C:\junk>type paddy_rx.py
import re
flags = r'(?smx)'
GO = r'go \s for \s &1 &2'
WS = r'\s+'

ARGMARK = "&"

# Can the comments about the style of
# this code; I've just translated it from
# a now-dead language with max 6 chars in variable names :)
def macsub(tmplt, *infils):
    wstr = tmplt
    ostr = ""
    while wstr:
        lpos = wstr.find(ARGMARK)
        if lpos < 0:
            return ostr + wstr
        ostr = ostr + wstr[:lpos]
        nch = wstr[lpos+1:lpos+2]
        if "1" <= nch <= "9":
            x = ord(nch)-ord("1")
            if x < len(infils):
                ostr = ostr + infils[x]
        elif nch == ARGMARK: # double & (or whatever)
            ostr = ostr + ARGMARK
        else:
            ostr = ostr + ARGMARK + nch
        wstr = wstr[lpos+2:]
    return ostr

regexp = " ".join([
    flags,
    macsub(GO, 'it,\s', 'Paddy'),
    WS,
    macsub(GO, 'broke'),
    ])
print regexp
text = "go for it, Paddy go for broke"
m = re.match(regexp, text)
print len(text), m.end()

C:\junk>paddy_rx.py
(?smx) go \s for \s it,\s Paddy \s+ go \s for \s broke
30 30
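
For comparison, the same expansion can be written more compactly on top of `re.sub` with a callback. This is a sketch rather than a drop-in replacement: `macsub_re` is a hypothetical name, and it assumes the same conventions as `macsub` above (`&1` to `&9` select arguments, `&&` yields a literal `&`, and a missing argument expands to nothing).

```python
import re

ARGMARK = "&"
# Matches the marker followed by a digit 1-9 or by a second marker.
_ARG = re.compile(re.escape(ARGMARK) + "([1-9]|" + re.escape(ARGMARK) + ")")


def macsub_re(template, *args):
    """Expand &1..&9 in template from args; '&&' becomes a literal '&'."""
    def repl(m):
        token = m.group(1)
        if token == ARGMARK:
            return ARGMARK
        index = int(token) - 1
        # A reference to an argument that was not supplied expands to nothing.
        return args[index] if index < len(args) else ""
    return _ARG.sub(repl, template)


GO = r"go \s for \s &1 &2"
regexp = " ".join([
    "(?smx)",
    macsub_re(GO, r"it,\s", "Paddy"),
    r"\s+",
    macsub_re(GO, "broke"),
])
text = "go for it, Paddy go for broke"
m = re.match(regexp, text)
print(regexp)
print(m is not None and m.end() == len(text))
```

A function passed to `re.sub` has its return value inserted literally, so backslashes in the arguments need no extra escaping.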



Cheers,
John
 
Paul McGuire

Paddy said:
Proposal: Named RE variables
======================

The problem I have is that I am writing a 'good-enough' verilog tag
extractor as a long regular expression (with the 'x' flag for
readability), and find myself both
1) Repeating sections of the RE, and
2) Wanting to add '(?P<some_clarifier>...) ' around sections
because I know what the section does but don't really want
the group.

If I could write:
(?P/verilog_name/ [A-Za-z_][A-Za-z_0-9\$\.]* | \\\S+ )

...and have the RE parser extract the section of RE after the second
'/' and store it associated with its name that appears between the
first two '/'. The RE should NOT try and match against anything between
the outer '(' ')' pair at this point, just store.

Then the following code appearing later in the RE:
(?P=verilog_name)

...should retrieve the RE snippet named and insert it into the RE
instead of the '(?P=...)' group before interpreting the RE 'as normal'

Instead of writing the following to search for event declarations:
vlog_extract = r'''(?smx)
# Verilog event definition extraction
(?: event \s+ [A-Za-z_][A-Za-z_0-9\$\.]* \s* (?: , \s*
[A-Za-z_][A-Za-z_0-9\$\.]*)* )
'''
I could write the following RE, which I think is clearer:
vlog_extract = r'''(?smx)
# Verilog identifier definition
(?P/IDENT/ [A-Za-z_][A-Za-z_0-9\$\.]* (?!\.) )
# Verilog event definition extraction
(?: event \s+ (?P=IDENT) \s* (?: , \s* (?P=IDENT))* )
'''

By contrast, the event declaration expression in the pyparsing Verilog
parser is:

identLead = alphas+"$_"
identBody = alphanums+"$_"
#~ identifier = Combine( Optional(".") +
#~     delimitedList( Word(identLead, identBody), ".",
#~         combine=True ) ).setName("baseIdent")
# replace pyparsing composition with Regex - improves performance ~10% for
# this construct
identifier = Regex(
    r"\.?["+identLead+"]["+identBody+"]*(\.["+identLead+"]["+identBody+"]*)*"
    ).setName("baseIdent")

eventDecl = Group( "event" + delimitedList( identifier ) + semi )


But why do you need an update to RE to compose snippets? Especially
snippets that you can only use in the same RE? Just do string interp:
I could write the following RE, which I think is clearer:
vlog_extract = r'''(?smx)
# Verilog identifier definition
(?P/IDENT/ [A-Za-z_][A-Za-z_0-9\$\.]* (?!\.) )
# Verilog event definition extraction
(?: event \s+ (?P=IDENT) \s* (?: , \s* (?P=IDENT))* )
'''
IDENT = "[A-Za-z_][A-Za-z_0-9\$\.]* (?!\.)"
vlog_extract = r'''(?smx)
# Verilog event definition extraction
(?: event \s+ %(IDENT)s \s* (?: , \s* %(IDENT)s)* )
''' % locals()

Yuk, this is a mess - which '%' signs are part of RE and which are for
string interp? Maybe just plain old string concat is better:

IDENT = "[A-Za-z_][A-Za-z_0-9\$\.]* (?!\.)"
vlog_extract = r'''(?smx)
# Verilog event definition extraction
(?: event \s+ ''' + IDENT + ''' \s* (?: , \s* ''' + IDENT + ''')* )'''

By the way, your IDENT is not totally accurate - it does not permit a
leading ".", and it does permit leading digits in identifier elements after
the first ".". So ".goForIt" would not be matched as a valid identifier
when it should, and "go.4it" would be matched as valid when it shouldn't (at
least as far as I read the Verilog grammar).
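
The two counterexamples can be checked directly with the standard `re` module. This is a minimal sketch: `PADDY_IDENT` and `DOTTED_IDENT` are illustrative names, and the dotted pattern simply spells out the character sets from the Regex quoted from the pyparsing grammar earlier in this post; it is not a claim about the full Verilog identifier rules.

```python
import re

# IDENT as posted (verbose mode, so the embedded space is ignored).
PADDY_IDENT = re.compile(r"[A-Za-z_][A-Za-z_0-9\$\.]* (?!\.)", re.X)

# Dotted form: optional leading ".", then dot-separated elements that
# each start with a letter, "$" or "_".
LEAD = r"[A-Za-z$_]"
BODY = r"[A-Za-z0-9$_]"
DOTTED_IDENT = re.compile(
    r"\.?" + LEAD + BODY + r"*(?:\." + LEAD + BODY + r"*)*"
)

for name in (".goForIt", "go.4it", "top.sub.sig"):
    print(name,
          bool(PADDY_IDENT.fullmatch(name)),
          bool(DOTTED_IDENT.fullmatch(name)))
```

Only the dotted form accepts ".goForIt", only the posted IDENT accepts "go.4it", and both accept "top.sub.sig".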

(Pyparsing (http://sourceforge.net/projects/pyparsing/) is open source under
the MIT license. The Verilog grammar is not distributed with pyparsing, and
is only available free for noncommercial use.)

-- Paul
 
Paddy

John said:
On 13/05/2006 7:39 PM, Paddy wrote:
[snip]
Extension; named RE variables, with arguments
===================================
In this, all group definitions in the body of the variable definition
reference the literal contents of groups appearing after the variable
name, (but within the variable reference), when the variable is
referenced

So an RE variable definition like:
defs = r'(?smx) (?P/GO/ go \s for \s \1 )'

Used like:
rgexp = defs + r"""
(?P=GO (it) )
\s+
(?P=\GO (broke) )
"""
Would match the string:
"go for it go for broke"

As would:
defs2 = r'(?smx) (?P/GO/ go \s for \s (?P=subject) )'
rgexp = defs2 + r"""
(?P=GO (?P<subject> it) )
\s+
(?P=\GO (?P<subject> broke) )
"""

The above would allow me to factor out sections of REs and define
named, reusable RE snippets.


Please comment :)


1. Regex syntax is over-rich already.

First, thanks for the reply John.

Yep, regex syntax is rich, but one of the reasons I went ahead with my
post was that it might add a new way to organize regexps into more
manageable chunks, rather like functions do.

2. You may be better off with a parser for this application instead of
using regexes.
Unfortunately my experience counts against me going for parser
solutions rather than regexps. Although, being a Python user, I always
think again before using a regexp and remember to consider whether
there might be a clearer string-method solution to a task, I am not
comfortable with parsers/parser generators.

The reason I used to dismiss parsers this time is that I have only
ever seen parsers for complete languages. I don't want to write a
complete parser for Verilog; I want to take an easier 'good enough'
route that I have used with success since my AWK days. (Don't laugh, my
exposure to AWK after years of C was just as profound as the feelings
of release from Java that more recent speakers have blogged about after
exposure to new dynamic languages - all hail AWK, not completely put
out to stud :)

I intend to write a large regexp that picks out the things that I want
from a Verilog file, skipping the bits I am uninterested in. With a
regular expression, if I don't write something to match, say, always
blocks, then signal definitions (which I am interested in) that someone
wrote inside a task would be picked up along with module-level signal
definitions, but that would be 'good enough' for my app.

All the parser examples I see don't 'skip things'.

- Hell, despite writing my own interpreted, recursive descent, language
many (many..), years ago in C; too much early lex & yacc'ing about left
me with a grudge!

3. "\\" is overloaded to the point of collapse already. Using it as an
argument marker could make the universe implode.

Did I truly write '=\GO' ? Twice!
Sorry, the example should have used '=GO' to refer to RE variables. I
made, then copied the error.
Note: I also tried to cut down on extra syntax by re-using the syntax
for referring to named groups (Or I would have if my proof reading were
better).
4. You could always use Python to roll your own macro expansion gadget,
like this:

Thanks for going to the trouble of writing the expander. I too had
thought of that, but that would lead to 'my little RE syntax' that
would be harder to maintain and others might reinvent the solution but
with their own mini macro syntax.
Cheers,
John

- Paddy.
 
Paddy

Hi Paul, please also refer to my reply to John.
By contrast, the event declaration expression in the pyparsing Verilog
parser is:

identLead = alphas+"$_"
identBody = alphanums+"$_"
#~ identifier = Combine( Optional(".") +
#~     delimitedList( Word(identLead, identBody), ".",
#~         combine=True ) ).setName("baseIdent")
# replace pyparsing composition with Regex - improves performance ~10% for
# this construct
identifier = Regex(
    r"\.?["+identLead+"]["+identBody+"]*(\.["+identLead+"]["+identBody+"]*)*"
    ).setName("baseIdent")

eventDecl = Group( "event" + delimitedList( identifier ) + semi )

I have had years of success by writing REs to extract what I am
interested in, not react to what I'm not interested in, and maybe make
slight mods down the line as examples crop up that break the program. I
do rely on what examples I get to test my extractors, but I find
examples a lot easier to come by than the funds/time for a language
parser. Since I tend to stay in a job for a number of years, I know
that the method works, and gives quick results that rapidly become
dependable as I am there to catch any flak ;-).

It's difficult for me to switch to parsers even though examples like
pyparsing seem readable. I do want to skip what I am not interested in
rather than having to write a parser for everything. But conversely,
when something skipped does bite me, I want to be able to easily add
it in.

Are there any examples of this kind of working with parsers?

But why do you need an update to RE to compose snippets? Especially
snippets that you can only use in the same RE? Just do string interp:
I could write the following RE, which I think is clearer:
vlog_extract = r'''(?smx)
# Verilog identifier definition
(?P/IDENT/ [A-Za-z_][A-Za-z_0-9\$\.]* (?!\.) )
# Verilog event definition extraction
(?: event \s+ (?P=IDENT) \s* (?: , \s* (?P=IDENT))* )
'''
IDENT = "[A-Za-z_][A-Za-z_0-9\$\.]* (?!\.)"
vlog_extract = r'''(?smx)
# Verilog event definition extraction
(?: event \s+ %(IDENT)s \s* (?: , \s* %(IDENT)s)* )
''' % locals()

Yuk, this is a mess - which '%' signs are part of RE and which are for
string interp? Maybe just plain old string concat is better:

Yeah, I too thought that the % thing was ugly when used on an RE.
IDENT = "[A-Za-z_][A-Za-z_0-9\$\.]* (?!\.)"
vlog_extract = r'''(?smx)
# Verilog event definition extraction
(?: event \s+ ''' + IDENT + ''' \s* (?: , \s* ''' + IDENT + ''')* )'''

.... And the string concats broke up the visual flow of my multi-line
RE.
By the way, your IDENT is not totally accurate - it does not permit a
leading ".", and it does permit leading digits in identifier elements after
the first ".". So ".goForIt" would not be matched as a valid identifier
when it should, and "go.4it" would be matched as valid when it shouldn't (at
least as far as I read the Verilog grammar).

Thanks for the info on IDENT. I am not working with the grammar spec in
front of me, and I know I will have to revisit my RE. You've saved me
some time!

(Pyparsing (http://sourceforge.net/projects/pyparsing/) is open source under
the MIT license. The Verilog grammar is not distributed with pyparsing, and
is only available free for noncommercial use.)

-- Paul

- Paddy.
 
Paul McGuire

Paddy said:
It's difficult to switch to parsers for me even though examples like
pyparsing seem readable, I do want to skip what I am not interested in
rather than having to write a parser for everything. But conversely,
when something skipped does bite me - I want to be able to easily add
it in.

Are there any examples of this kind of working with parsers?

pyparsing offers several flavors of skipping over uninteresting text. The
most obvious is scanString. scanString is a generator function that scans
through the input text looking for pattern matches (multiple patterns can be
OR'ed together) - when a match is found, the matching tokens, start, and end
locations are yielded. Here's a short example that ships with pyparsing:

from pyparsing import (Word, alphas, alphanums, Literal, restOfLine,
    OneOrMore, Empty)

# simulate some C++ code
testData = """
#define MAX_LOCS=100
#define USERNAME = "floyd"
#define PASSWORD = "swordfish"

a = MAX_LOCS;
CORBA::initORB("xyzzy", USERNAME, PASSWORD );

"""

#################
print "Example of an extractor"
print "----------------------"

# simple grammar to match #define's
ident = Word(alphas, alphanums+"_")
macroDef = ( Literal("#define") + ident.setResultsName("name") + "=" +
             restOfLine.setResultsName("value") )
for t,s,e in macroDef.scanString( testData ):
    print t.name,":", t.value

# or a quick way to make a dictionary of the names and values
macros = dict([(t.name,t.value) for t,s,e in macroDef.scanString(testData)])
print "macros =", macros
print

--------------------
prints:
Example of an extractor
----------------------
MAX_LOCS : 100
USERNAME : "floyd"
PASSWORD : "swordfish"
macros = {'USERNAME': '"floyd"', 'PASSWORD': '"swordfish"', 'MAX_LOCS': '100'}


Note that scanString worked only with the expressions we defined, and
ignored pretty much everything else.
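
For readers coming from the regex side of this thread, roughly the same scan-and-skip extraction can be sketched with `re.finditer` from the standard library. This is an assumption-laden comparison, not the pyparsing implementation: `MACRO_DEF` and `test_data` are illustrative names, and the pattern only handles the simple `#define NAME = value` lines used in the sample.

```python
import re

# Same sample text as the pyparsing example above.
test_data = """
#define MAX_LOCS=100
#define USERNAME = "floyd"
#define PASSWORD = "swordfish"

a = MAX_LOCS;
CORBA::initORB("xyzzy", USERNAME, PASSWORD );

"""

# '.' does not match a newline by default, so 'value' stops at end of line.
MACRO_DEF = re.compile(r"#define\s+(?P<name>\w+)\s*=\s*(?P<value>.*)")

# finditer yields only the matches; everything else is skipped silently.
macros = {m.group("name"): m.group("value")
          for m in MACRO_DEF.finditer(test_data)}
print(macros)
```

As with scanString, the lines that do not look like a `#define` are never examined by any other rule.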

scanString has a companion method, transformString. transformString calls
scanString internally - the purpose is to apply any parse actions or
suppressions on the matched tokens, substitute them back in for the original
text, and then return the transformed string. Here are two transformer
examples, one uses the macros dictionary we just created, and does simple
macro substitution; the other converts C++-namespaced references to
C-compatible global symbols (something we had to do in the early days of
CORBA):

#################
print "Examples of a transformer"
print "----------------------"

# convert C++ namespaces to mangled C-compatible names
scopedIdent = ident + OneOrMore( Literal("::").suppress() + ident )
scopedIdent.setParseAction(lambda s,l,t: "_".join(t))

print "(replace namespace-scoped names with C-compatible names)"
print scopedIdent.transformString( testData )


# or a crude pre-processor (use parse actions to replace matching text)
def substituteMacro(s,l,t):
    if t[0] in macros:
        return macros[t[0]]
ident.setParseAction( substituteMacro )
ident.ignore(macroDef)

print "(simulate #define pre-processor)"
print ident.transformString( testData )

--------------------------
prints:
Examples of a transformer
----------------------
(replace namespace-scoped names with C-compatible names)

#define MAX_LOCS=100
#define USERNAME = "floyd"
#define PASSWORD = "swordfish"

a = MAX_LOCS;
CORBA_initORB("xyzzy", USERNAME, PASSWORD );


(simulate #define pre-processor)

#define MAX_LOCS=100
#define USERNAME = "floyd"
#define PASSWORD = "swordfish"

a = 100;
CORBA::initORB("xyzzy", "floyd", "swordfish" );
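
The namespace-mangling transform has a rough standard-library counterpart in `re.sub` with a callback. The sketch below is a comparison only: `SCOPED_IDENT` and `mangle` are hypothetical names, and the pattern is an approximation of the `scopedIdent` expression above rather than an exact translation.

```python
import re

# An identifier followed by one or more "::identifier" parts.
SCOPED_IDENT = re.compile(r"[A-Za-z]\w*(?:::[A-Za-z]\w*)+")


def mangle(text):
    """Replace '::' with '_' inside scoped names; leave other text alone."""
    return SCOPED_IDENT.sub(lambda m: m.group(0).replace("::", "_"), text)


line = 'CORBA::initORB("xyzzy", USERNAME, PASSWORD );'
print(mangle(line))   # prints: CORBA_initORB("xyzzy", USERNAME, PASSWORD );
```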


I'd say it took me about 8 weeks to develop a complete Verilog parser using
pyparsing, so I can sympathize that you wouldn't want to write a complete
parser for it. But the individual elements are pretty straightforward, and
can map to pyparsing expressions without much difficulty.

Lastly, pyparsing is not as fast as RE's. But early performance problems
can often be improved through some judicious grammar tuning. And for many
parsing applications, pyparsing is plenty fast enough.

Regards,
-- Paul
 
Paddy

I have another use case.
If you want to match a comma separated list of words you end up writing
what constitutes a word twice, i.e:
r"\w+[,\w+]"
As what constitutes a word gets longer, you have to repeat a longer RE
fragment so the fact that it is a match of a comma separated list is
lost, e.g:
r"[a-zA-Z_]\w+[,[a-zA-Z_]\w+]"

- Paddy.
 
Paul McGuire

Paddy said:
I have another use case.
If you want to match a comma separated list of words you end up writing
what constitutes a word twice, i.e:
r"\w+[,\w+]"
As what constitutes a word gets longer, you have to repeat a longer RE
fragment so the fact that it is a match of a comma separated list is
lost, e.g:
r"[a-zA-Z_]\w+[,[a-zA-Z_]\w+]"

- Paddy.
Write a short function to return a comma separated list RE. This has the
added advantage of DRY, too. Adding an optional delim argument allows you
to generalize to lists delimited by dots, dashes, etc.

(Note - your posted re requires 2-letter words - I think you meant
"[A-Za-z_]\w*", not "[A-Za-z_]\w+".)
-- Paul


import re

def commaSeparatedList(regex, delim=","):
    # (?:...) groups delim+regex; [...] would be a character class
    return "%s(?:%s%s)*" % (regex, delim, regex)

listOfWords = re.compile( commaSeparatedList(r"\w+") )
listOfIdents = re.compile( commaSeparatedList(r"[A-Za-z_]\w*") )

# might be more robust - people put whitespace in the darndest places!
def whitespaceTolerantCommaSeparatedList(regex, delim=","):
    return r"%s(?:\s*%s\s*%s)*" % (regex, delim, regex)


# (BTW, delimitedList in pyparsing does this too - the default delimiter is
# a comma, but other expressions can be used too)
from pyparsing import Word, delimitedList, alphas, alphanums

listOfWords = delimitedList( Word(alphas) )
listOfIdents = delimitedList( Word(alphas+"_", alphanums+"_") )


-- Paul
 
Edward Elliott

Paddy said:
I have another use case.
If you want to match a comma separated list of words you end up writing
what constitutes a word twice, i.e:
r"\w+[,\w+]"

That matches one or more alphanum characters followed by exactly one comma,
plus, or alphanum. I think you meant
r'\w+(,\w+)*'

or if you don't care where or how many commas there are
r'[\w,]*'

or if previous but has to start with alphanum
r'\w[\w,]*'
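
The difference between the character class and the group is easy to see by running both forms against a few strings. A short sketch, where `as_posted` and `intended` are just illustrative names:

```python
import re

as_posted = re.compile(r"\w+[,\w+]")    # [...] is a class: one of ',', '+', \w
intended = re.compile(r"\w+(,\w+)*")    # (...)* repeats ",word"

for text in ("abc,", "abc+", "a,b,c"):
    print(repr(text),
          bool(as_posted.fullmatch(text)),
          bool(intended.fullmatch(text)))
```

The posted form accepts "abc," and "abc+" but rejects "a,b,c"; the grouped form does the opposite.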

As what constitutes a word gets longer, you have to repeat a longer RE
fragment so the fact that it is a match of a comma separated list is
lost, e.g:
r"[a-zA-Z_]\w+[,[a-zA-Z_]\w+]"

That's why god invented % interpolation.
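
A minimal sketch of what that looks like for the list case, assuming the corrected word pattern `[a-zA-Z_]\w*` suggested earlier in the thread; `WORD` and `WORD_LIST` are illustrative names only:

```python
import re

WORD = r"[a-zA-Z_]\w*"

# The word fragment is written once and interpolated twice.
WORD_LIST = re.compile(r"%(word)s(?:,%(word)s)*" % {"word": WORD})

for text in ("alpha,beta_2,_c", "alpha,,beta", "alpha,2beta"):
    print(repr(text), bool(WORD_LIST.fullmatch(text)))
```

With the fragment named, the list structure `word(?:,word)*` stays visible however long the word pattern grows.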
 
