regex question

mathieu · Feb 13, 2008

I do not understand what is wrong with the following regular expression.
I clearly mark that the separator between group 3 and group 4
should contain at least two whitespace characters, but group 3 is actually
capturing the text of groups 3 + 4.

Thanks
-Mathieu

import re

line = " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings  Auto Window Width     SL 1 "
patt = re.compile(r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ -]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")
m = patt.match(line)
if m:
    print(m.group(3))
    print(m.group(4))

Wanja Chresta · Feb 13, 2008

Hey Mathieu

Due to word wrap I'm not sure what you want to do. What result do you
expect? I get

('0021', 'xx0A', 'Siemens: Thorax/Multix FD Lab Settings Auto Window Width ', ' ', 'SL', '1')
But only when I insert a space in the 3rd char group (I'm not sure if
your original pattern has a space there or not). So the third group is:
([A-Za-z0-9./:_ -]+). If I do not insert the space, the pattern does not
match the line.

I also can't tell what the format of your line is. If it is like this:
line = "...Siemens: Thorax/Multix FD Lab Settings Auto Window Width..."
where "Auto Window Width" should be the 4th group, you have to mark the
+ in the 3rd group as non-greedy (it's done with a "?"):
http://docs.python.org/lib/re-syntax.html
([A-Za-z0-9./:_ -]+?)
With that I get

('0021', 'xx0A', 'Siemens: Thorax/Multix FD Lab Settings', 'Auto Window Width ', 'SL', '1')
Which is probably what you want. You can also add the non-greedy marker
in the fourth group, to get rid of the trailing spaces.
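
To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The column spacing of the sample line is an assumption (two spaces after "Settings", five after "Width"), since word wrap hid the original layout; TEMPLATE is just a hypothetical helper for swapping the quantifiers of groups 3 and 4.

```python
import re

# Assumed spacing: the exact layout of the original line is not known.
line = " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings  Auto Window Width     SL 1 "

# Hypothetical helper: the two %s slots are the quantifiers of groups 3 and 4.
TEMPLATE = (
    r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+"
    r"([A-Za-z0-9./:_ -]%s)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]%s)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$"
)

greedy = re.compile(TEMPLATE % ("+", "+"))      # the pattern as posted
lazy3 = re.compile(TEMPLATE % ("+?", "+"))      # non-greedy group 3 only
lazy34 = re.compile(TEMPLATE % ("+?", "+?"))    # non-greedy groups 3 and 4

greedy_groups = greedy.match(line).groups()
lazy3_groups = lazy3.match(line).groups()
lazy34_groups = lazy34.match(line).groups()

# repr() makes any captured trailing spaces visible.
for name, groups in (("greedy", greedy_groups),
                     ("lazy3", lazy3_groups),
                     ("lazy34", lazy34_groups)):
    print(name, repr(groups[2]), repr(groups[3]))
```

With the greedy pattern, group 3 runs on through "Auto Window Width" because every character there, spaces included, is in its character class. The non-greedy version stops at the first run of two or more whitespace characters that lets the rest of the pattern match.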

HTH
Wanja

bearophileHUGS · Feb 13, 2008

mathieu, stop writing complex REs like obfuscated toys. Use the
re.VERBOSE flag and split that RE into several commented and
*indented* lines (indented just like Python code), using the indentation
level to denote nesting. With that you may be able to
solve the problem by yourself. If not, you can offer us a much more
readable thing to fix.
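
For illustration, here is roughly what that could look like with re.VERBOSE. This is only a sketch: the comments describing each group are guesses, the sample line's spacing is assumed, and groups 3 and 4 are written non-greedy. In verbose mode, whitespace and "#" inside a character class are still taken literally, so the classes can stay as they were.

```python
import re

patt = re.compile(r"""
    ^\s*
    \(
        ([0-9A-Z]+)                 # group 1: first half of the tag
        ,
        ([0-9A-Zx]+)                # group 2: second half of the tag
    \)
    \s+
    ([A-Za-z0-9./:_ -]+?)           # group 3: first text field (non-greedy)
    \s\s+                           # separator: two or more whitespace characters
    ([A-Za-z0-9 ()._,/#>-]+?)       # group 4: second text field (non-greedy)
    \s+
    ([A-Z][A-Z]_?O?W?)              # group 5: two-letter type code
    \s+
    ([0-9n-]+)                      # group 6: count
    \s*$
""", re.VERBOSE)

# Assumed spacing: the exact layout of the original line is not known.
line = " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings  Auto Window Width     SL 1 "

m = patt.match(line)
if m:
    print(repr(m.group(3)))
    print(repr(m.group(4)))
```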

Bye,
bearophile

grflanagan · Feb 13, 2008

I don't know if it solves your problem, but if you want to match a
literal dash (-) inside a character class, it must be escaped or placed
first or last in the class; anywhere else it denotes a range.
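
A quick way to check how Python's re treats a hyphen in each position is sketched below. As far as I can tell, a hyphen at the very end of a class (the position used in the pattern in this thread) is taken literally, just like one at the start or an escaped one. The sample string and variable names are made up for the demonstration.

```python
import re

sample = "ab-cd"

hyphen_last = re.findall(r"[a-c-]+", sample)     # hyphen at the end: literal
hyphen_first = re.findall(r"[-a-c]+", sample)    # hyphen at the start: literal
hyphen_escaped = re.findall(r"[a\-c]+", sample)  # escaped: class is just 'a', '-', 'c'
hyphen_range = re.findall(r"[a-c]+", sample)     # between two characters: range a..c

print(hyphen_last, hyphen_first, hyphen_escaped, hyphen_range)
```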

Gerard

Paul McGuire · Feb 13, 2008

I love the smell of regex'es in the morning!

For more legible posting (and general maintainability), try breaking
up your quoted strings like this:

line = \
    " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings  " \
    "Auto Window Width     SL 1 "

patt = re.compile(
    r"^\s*"
    r"\("
    r"([0-9A-Z]+),"
    r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")

Of course, the problem is that you have a greedy match in the part of
the regex that is supposed to stop between "Settings" and "Auto".
Change patt to:

patt = re.compile(
    r"^\s*"
    r"\("
    r"([0-9A-Z]+),"
    r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+?)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")

or if you prefer:

patt = re.compile(r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ -]+?)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")

It looks like you wrote this regex to process this specific input
string - it has a fragile feel to it, as if you will have to go back
and tweak it to handle other data that might come along, such as

(xx42,xx0A) Honeywell: Inverse Flitznoid (Kelvin)
80 SL 1
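
As a rough check of that fragility, the sketch below runs the non-greedy pattern against both lines. The spacing of both sample lines is assumed. As written, the pattern rejects the second line: group 1 does not allow a lowercase "x", and the group 3 class has no parentheses. The `loose` pattern is a hypothetical variant, not something proposed in the thread, that admits both and makes group 4 non-greedy.

```python
import re

# Both sample lines use assumed spacing (two or more spaces between text fields).
siemens = " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings  Auto Window Width     SL 1 "
honeywell = " (xx42,xx0A) Honeywell: Inverse Flitznoid (Kelvin)  80  SL 1 "

# The pattern from the thread, with the non-greedy third group.
strict = re.compile(
    r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+"
    r"([A-Za-z0-9./:_ -]+?)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")

# Hypothetical loosened variant: 'x' allowed in group 1, parentheses allowed
# in group 3, and a non-greedy group 4 so trailing spaces are not captured.
loose = re.compile(
    r"^\s*\(([0-9A-Zx]+),([0-9A-Zx]+)\)\s+"
    r"([A-Za-z0-9./:_ ()-]+?)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+?)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")

strict_hits = [strict.match(s) is not None for s in (siemens, honeywell)]
loose_fields = [loose.match(s).groups() for s in (siemens, honeywell)]

print(strict_hits)
for fields in loose_fields:
    print(fields)
```

Every new kind of input may need another tweak like this, which is the maintenance cost being pointed out.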

Just out of curiosity, I wondered what a pyparsing version of this
would look like. See below:

from pyparsing import Word, hexnums, delimitedList, printables, \
    White, Regex, nums

line = \
    " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings  " \
    "Auto Window Width     SL 1 "

# define fields
hexint = Word(hexnums+"x")
# words joined by exactly one space form a single text field
text = delimitedList(Word(printables),
                     delim=White(" ", exact=1), combine=True)
type_label = Regex("[A-Z][A-Z]_?O?W?")
int_label = Word(nums+"n-")

# define line structure - give each field a name
line_defn = "(" + hexint("x") + "," + hexint("y") + ")" + \
    text("desc") + text("window") + type_label("type") + \
    int_label("int")

line_parts = line_defn.parseString(line)
print(line_parts.dump())
print(line_parts.desc)
print line_parts.desc

Prints:
['(', '0021', ',', 'xx0A', ')', 'Siemens: Thorax/Multix FD Lab Settings', 'Auto Window Width', 'SL', '1']
- desc: Siemens: Thorax/Multix FD Lab Settings
- int: 1
- type: SL
- window: Auto Window Width
- x: 0021
- y: xx0A
Siemens: Thorax/Multix FD Lab Settings

I was just guessing on the field names, but you can see where they are
defined and change them to the appropriate values.

-- Paul
