regex question

mathieu

I do not understand what is wrong with the following regular expression.
I clearly mark that the separator between group 3 and group 4
should contain at least 2 whitespace characters, but group 3 is actually
capturing the text of groups 3 and 4 together.

Thanks
-Mathieu

import re

line = "      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings
Auto Window Width          SL   1 "
patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_
-]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*
$")
m = patt.match(line)
if m:
    print(m.group(3))
    print(m.group(4))
 
Wanja Chresta

Hey Mathieu

Due to word wrap I'm not sure what you want to do. What result do you
expect? I get:
('0021', 'xx0A', 'Siemens: Thorax/Multix FD Lab Settings Auto Window Width ', ' ', 'SL', '1')
But only when I insert a space in the 3rd char group (I'm not sure if
your original pattern has a space there or not). So the third group is:
([A-Za-z0-9./:_ -]+). If I do not insert the space, the pattern does not
match the line.

I also can't tell what the format of your line is. If it is like this:
line = "...Siemens: Thorax/Multix FD Lab Settings Auto Window Width..."
where "Auto Window Width" should be the 4th group, you have to mark the
+ in the 3rd group as non-greedy (it's done with a "?"):
http://docs.python.org/lib/re-syntax.html
([A-Za-z0-9./:_ -]+?)
With that I get:
('0021', 'xx0A', 'Siemens: Thorax/Multix FD Lab Settings', 'Auto Window Width ', 'SL', '1')
Which is probably what you want. You can also add the non-greedy marker
in the fourth group to get rid of the trailing spaces.
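
A small sketch of the greedy versus non-greedy difference described above. The spacing inside the sample line is an assumption (the thread only says there are at least two spaces between the two text fields), and build() is a helper invented for this illustration:

```python
import re

# Sample line. The run lengths of spaces are assumed; the thread only
# guarantees at least two between the description and the window name.
line = ("      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  "
        "Auto Window Width          SL   1 ")


def build(group3_quantifier):
    """Compile the thread's pattern with the given quantifier on group 3."""
    return re.compile(
        r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+"
        r"([A-Za-z0-9./:_ -]" + group3_quantifier + r")\s\s+"
        r"([A-Za-z0-9 ()._,/#>-]+)\s+"
        r"([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$"
    )


greedy = build("+").match(line)   # group 3 grabs as much as it can
lazy = build("+?").match(line)    # group 3 stops at the first valid split

print(repr(greedy.group(3)), repr(greedy.group(4)))
print(repr(lazy.group(3)), repr(lazy.group(4)))
```

Because a space is itself allowed inside group 3, the greedy version is free to swallow the two-space separator and everything after it, leaving group 4 with only whitespace.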

HTH
Wanja
 
bearophileHUGS

mathieu, stop writing complex REs like obfuscated toys, use the
re.VERBOSE flag and split that RE into several commented and
*indented* lines (indented just like Python code), the indentation
level has to be used to denote nesting. With that you may be able to
solve the problem by yourself. If not, you can offer us a much more
readable thing to fix.
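
One way the pattern from the question might look with re.VERBOSE, as a rough sketch. The comments describing each group are guesses, the spacing in the sample line is assumed, and groups 3 and 4 are written non-greedy here. Note that in verbose mode a space or a "#" inside a character class is still taken literally:

```python
import re

# Spacing in this sample line is assumed; the thread only guarantees at
# least two spaces between the description and the window name.
line = ("      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  "
        "Auto Window Width          SL   1 ")

patt = re.compile(r"""
    ^\s*
    \(
        ([0-9A-Z]+) ,           # group 1: first half of the tag
        ([0-9A-Zx]+)            # group 2: second half, may contain 'x'
    \)
    \s+
    ([A-Za-z0-9./:_ -]+?)       # group 3: description, non-greedy
    \s\s+                       # separator: at least two whitespace chars
    ([A-Za-z0-9 ()._,/#>-]+?)   # group 4: name, non-greedy
    \s+
    ([A-Z][A-Z]_?O?W?)          # group 5: two-letter type code
    \s+
    ([0-9n-]+)                  # group 6: trailing number
    \s*$
""", re.VERBOSE)

m = patt.match(line)
print(m.groups())
```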

Bye,
bearophile
 
grflanagan

I don't know if it solves your problem, but if you want to match a
dash (-), then it must be either escaped or be the first or last
element in a character class.
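
For reference, a quick check of how Python's re module treats a dash in a character class. The helper name chars_matched is made up for this sketch. The classes in the question put the dash last, which also makes it literal:

```python
import re


def chars_matched(char_class, text):
    """Return the characters of text that the given class matches."""
    return re.findall(char_class, text)


first = chars_matched(r"[-a]", "a-b")      # dash first: literal
last = chars_matched(r"[a-]", "a-b")       # dash last: literal
escaped = chars_matched(r"[a\-b]", "a-b")  # escaped dash: literal
ranged = chars_matched(r"[a-c]", "a-b")    # dash between chars: a range

# The last class from the question, with its trailing dash.
from_thread = re.findall(r"[0-9n-]+", "n-1 x")

print(first, last, escaped, ranged, from_thread)
```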

Gerard
 
Paul McGuire

I love the smell of regex'es in the morning!

For more legible posting (and general maintainability), try breaking
up your quoted strings like this:

line = \
    "      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  " \
    "Auto Window Width          SL   1 "

patt = re.compile(
    r"^\s*"
    r"\("
    r"([0-9A-Z]+),"
    r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")


Of course, the problem is that you have a greedy match in the part of
the regex that is supposed to stop between "Settings" and "Auto".
Change patt to:

patt = re.compile(
    r"^\s*"
    r"\("
    r"([0-9A-Z]+),"
    r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+?)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")

or if you prefer:

patt = re.compile(r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ -]+?)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")

It looks like you wrote this regex to process this specific input
string - it has a fragile feel to it, as if you will have to go back
and tweak it to handle other data that might come along, such as

(xx42,xx0A) Honeywell: Inverse Flitznoid (Kelvin)
80 SL 1
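
To make the fragility concrete, here is a sketch that runs the corrected pattern against a line like the one above. The spacing in the sample is assumed, and the relaxed pattern is only a hypothetical tweak for illustration, not something proposed in the thread:

```python
import re

# Hypothetical second input; the runs of spaces are assumed.
other = "(xx42,xx0A)   Honeywell: Inverse Flitznoid (Kelvin)  80          SL   1"

# The corrected pattern from this post (group 3 non-greedy).
strict = re.compile(
    r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+"
    r"([A-Za-z0-9./:_ -]+?)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$"
)

# A loosened variant: 'x' allowed in group 1, parentheses allowed in
# group 3, and group 4 made non-greedy.
relaxed = re.compile(
    r"^\s*\(([0-9A-Zx]+),([0-9A-Zx]+)\)\s+"
    r"([A-Za-z0-9./:_ ()-]+?)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+?)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$"
)

strict_result = strict.match(other)    # fails: 'x' is not allowed in group 1
relaxed_result = relaxed.match(other)

print(strict_result)
print(relaxed_result.groups())
```

Each new input shape needs another tweak like this, which is the maintenance cost being pointed out.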


Just out of curiosity, I wondered what a pyparsing version of this
would look like. See below:

from pyparsing import Word, hexnums, delimitedList, printables, \
    White, Regex, nums

line = \
    "      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  " \
    "Auto Window Width          SL   1 "

# define fields
hexint = Word(hexnums+"x")
# words separated by exactly one space are combined into one text field
text = delimitedList(Word(printables),
                     delim=White(" ",exact=1), combine=True)
type_label = Regex("[A-Z][A-Z]_?O?W?")
int_label = Word(nums+"n-")

# define line structure - give each field a name
line_defn = "(" + hexint("x") + "," + hexint("y") + ")" + \
    text("desc") + text("window") + type_label("type") + \
    int_label("int")

line_parts = line_defn.parseString(line)
print(line_parts.dump())
print(line_parts.desc)

Prints:
['(', '0021', ',', 'xx0A', ')', 'Siemens: Thorax/Multix FD Lab Settings', 'Auto Window Width', 'SL', '1']
- desc: Siemens: Thorax/Multix FD Lab Settings
- int: 1
- type: SL
- window: Auto Window Width
- x: 0021
- y: xx0A
Siemens: Thorax/Multix FD Lab Settings

I was just guessing on the field names, but you can see where they are
defined and change them to the appropriate values.
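
For comparison, the re module can also name its fields, using (?P<name>...) groups. This is only a sketch: the names are borrowed from the guesses above, the spacing in the sample line is assumed, and groups 3 and 4 are written non-greedy:

```python
import re

# Spacing in this sample line is assumed.
line = ("      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  "
        "Auto Window Width          SL   1 ")

# Same structure as the regex earlier in the thread, with named groups.
patt = re.compile(
    r"^\s*\((?P<x>[0-9A-Z]+),(?P<y>[0-9A-Zx]+)\)\s+"
    r"(?P<desc>[A-Za-z0-9./:_ -]+?)\s\s+"
    r"(?P<window>[A-Za-z0-9 ()._,/#>-]+?)\s+"
    r"(?P<type>[A-Z][A-Z]_?O?W?)\s+"
    r"(?P<int>[0-9n-]+)\s*$"
)

# groupdict() returns a dict mapping each group name to its matched text.
fields = patt.match(line).groupdict()
print(fields["desc"])
print(fields["window"])
```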

-- Paul
 
