Using Regular Expressions to Parse SQL

C

ct60

Hello again -

I do not seem to be able to get a handle on non-greedy pattern
matching.

I am trying to parse the following - note that there are no line
breaks in the string:

" FROM ((qry_Scores_Lookup1 INNER JOIN CSS_Rpt1 ON
(qry_Scores_Lookup1.desc = CSS_Rpt1.desc) AND
(qry_Scores_Lookup1.lastcdu = CSS_Rpt1.lastcdu))"

What I would like to do is be able to split on the "AND" and the "="
to come up with the following:
qry_Scores_Lookup1.desc
CSS_Rpt1.desc
qry_Scores_Lookup1.lastcdu
CSS_Rpt1.lastcdu

The following is one of my many attempts to do this:

import re

s= " FROM ((qry_Scores_Lookup1 INNER JOIN CSS_Rpt1 ON
(qry_Scores_Lookup1.desc = CSS_Rpt1.desc) AND
(qry_Scores_Lookup1.lastcdu = CSS_Rpt1.lastcdu))"

pat = " FROM .+ (?:INNER|LEFT|RIGHT) JOIN .+ ON (?:AND)*?((.+)=(.+))"

m = re.match(pat, s)

if m is None:
print "No Match"
else:
for mat in m.groups():
print mat

My pattern does not even come close.

Any help would be greatly appreciated. My goal is to analyse a large
number of SQL querys to try to identify the join field and see where
indexing might make sense.

While I am mostly interested in understanding regular expressions, I
would also be interested in knowing about any Python SQL parsers out
there.

Thanks in advance.

Chris ([email protected])
 
P

Paul McGuire

Hello again -

I do not seem to be able to get a handle on non-greedy pattern
matching.

Regexps wont cut it when you have to parse nested ()'s in a logical
expression.

Here is a pyparsing solution. For your specific application, you will
need to sift through WHERE clauses too, you should be able to use this
as a starting point for creating a WHERE clause parser.

-- Paul


from pyparsing import *

# define terminal items
identifier = Word(alphas,alphanums+'_')
column_name = Combine( identifier + ZeroOrMore('.' + identifier) )
integer = Word(nums)
value = integer | sglQuotedString

relationalOperator = oneOf("< = > >= <= != <>")
simple_comparison = Group(column_name + relationalOperator +
(column_name | value))

AND_cl = CaselessLiteral("AND")
OR_cl = CaselessLiteral("OR")
NOT_cl = CaselessLiteral("NOT")

logical_expression = operatorPrecedence( simple_comparison,
[
(NOT_cl, 1, opAssoc.RIGHT),
(OR_cl, 2, opAssoc.LEFT),
(AND_cl, 2, opAssoc.LEFT),
])

# syntax for FROM ... join expression
FROM_cl = CaselessLiteral("FROM")
ON_cl = CaselessLiteral("ON")
INNER_cl = CaselessLiteral("INNER")
JOIN_cl = CaselessLiteral("JOIN")

join_expr = identifier('table1') + INNER_cl + JOIN_cl + \
identifier('table2') + ON_cl + \
logical_expression('join_condition')
from_clause = FROM_cl + '(' + join_expr + ')'


test = """ FROM (qry_Scores_Lookup1 INNER JOIN CSS_Rpt1 ON
(qry_Scores_Lookup1.desc = CSS_Rpt1.desc) AND
(qry_Scores_Lookup1.lastcdu = CSS_Rpt1.lastcdu))"""

# parsing the FROM clause normally
join_fields = from_clause.parseString(test)
print "\nParsed tokens, plus any named fields"
print join_fields.dump()
print "\nAccessing some named fields directly"
print "table1 =", join_fields.table1
print "join_condition =", join_fields.join_condition

# create parse action to tally up column names as they are found
# during the parsing process
def tally_names(tokens):
column_names.append(tokens[0])
column_name.setParseAction(tally_names)

# parse the FROM clause, but just let the parse action do the work
print "\nExtracting the column names"
column_names = []
from_clause.parseString(test)
print column_names


Prints:

Parsed tokens, plus any named fields
['FROM', '(', 'qry_Scores_Lookup1', 'INNER', 'JOIN', 'CSS_Rpt1', 'ON',
[['qry_Scores_Lookup1.desc', '=', 'CSS_Rpt1.desc'], 'AND',
['qry_Scores_Lookup1.lastcdu', '=', 'CSS_Rpt1.lastcdu']], ')']
- join_condition: [['qry_Scores_Lookup1.desc', '=', 'CSS_Rpt1.desc'],
'AND', ['qry_Scores_Lookup1.lastcdu', '=', 'CSS_Rpt1.lastcdu']]
- table1: qry_Scores_Lookup1
- table2: CSS_Rpt1

Accessing some named fields directly
table1 = qry_Scores_Lookup1
join_condition = [['qry_Scores_Lookup1.desc', '=', 'CSS_Rpt1.desc'],
'AND', ['qry_Scores_Lookup1.lastcdu', '=', 'CSS_Rpt1.lastcdu']]

Extracting the column names
['qry_Scores_Lookup1.desc', 'CSS_Rpt1.desc',
'qry_Scores_Lookup1.lastcdu', 'CSS_Rpt1.lastcdu']
 
R

Reedick, Andrew

-----Original Message-----
From: [email protected] [mailto:python-
[email protected]] On Behalf Of (e-mail address removed)
Sent: Tuesday, February 05, 2008 9:31 AM
To: (e-mail address removed)
Subject: Using Regular Expressions to Parse SQL


My pattern does not even come close.

Any help would be greatly appreciated. My goal is to analyse a large
number of SQL querys to try to identify the join field and see where
indexing might make sense.

While I am mostly interested in understanding regular expressions, I
would also be interested in knowing about any Python SQL parsers out
there.


Python's regexes are a tad odd compared to Perl (or Perl's regexes are
odd. Relativity.) You need re.DOTALL to handle newlines in the sql:
DOTALL
Make the "." special character match any character at all,
including a newline; without this flag, "." will match anything except a
newline.



import re

s= ''' FROM ((qry_Scores_Lookup1 INNER JOIN CSS_Rpt1 ON
(qry_Scores_Lookup1.desc = CSS_Rpt1.desc) AND
(qry_Scores_Lookup1.lastcdu = CSS_Rpt1.lastcdu))'''

pat =
r"(^|\s+)FROM\s.+\s(?:INNER|LEFT|RIGHT)\s+JOIN\s.+\sON\s+((\s*(?:\S+)\s*
=\s*(?:\S+))(?:\s+|\s+AND\s+|$))+"
m = re.compile(pat, re.DOTALL).match(s)
if m:
print m


('(qry_Scores_Lookup1.desc = CSS_Rpt1.desc) ', '(qry_Scores_Lookup1.desc
= CSS_Rpt1.desc)')


You should be able to tweak the regex to get the exact matches you want.

An alternative to writing a lengthy regex might be to just grab
everything after the ON and then string split on "AND".



*****

The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential, proprietary, and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from all computers. GA623
 
C

ct60

Firstly, thanks to those who posted.

I just do not understand how the non-greedy operator works.

Using the following code:

import re

s = "qry_Lookup.desc = CSS_Rpt1.desc AND qry_Lookup.lcdu1 =
CSS_Rpt1.lcdu"

pat = "(.+=)+?(.+)"

m = re.match(pat, s)

if m is None:
print "No Match"
else:
for mat in m.groups():
print mat

I would expect that the first group would match one or more times with
the fewest amount of text. However the result is:
qry_Lookup.desc = CSS_Rpt1.desc AND qry_Lookup.lcdu1 =
CSS_Rpt1.lcdu

The first match of the "=" char is still greedy. I would have
expected:
qry_Lookup.desc =
CSS_Rpt1.desc AND qry_Lookup.lcdu1
=
CSS_Rpt1.lcdu

I'm obviously missing something because the non-greedy match seems to
not be acting as expected.

Any insights would be greatly appreciated.

Thanks in advance,

Chris ([email protected])
 
M

MRAB

Firstly, thanks to those who posted.

I just do not understand how the non-greedy operator works.

Using the following code:

import re

s = "qry_Lookup.desc = CSS_Rpt1.desc AND qry_Lookup.lcdu1 =
CSS_Rpt1.lcdu"

pat = "(.+=)+?(.+)"
^^
| This is your problem.

The ".+" is greedy and is matching up to the last "=". Try "(.+?=)(.
+)" instead.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,983
Messages
2,570,187
Members
46,747
Latest member
jojoBizaroo

Latest Threads

Top