I do not understand what is wrong with the following regular
expression. I explicitly require the separator between group 3 and
group 4 to contain at least two whitespace characters, but group 3
actually captures the text of fields 3 and 4 together.
Thanks
-Mathieu
import re
line = " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings Auto Window Width SL 1 "
patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ -]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")
<snip>
I love the smell of regex'es in the morning!
For more legible posting (and general maintainability), try breaking
up your quoted strings like this:
line = \
" (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings " \
"Auto Window Width SL 1 "
patt = re.compile(
"^\s*"
"\("
"([0-9A-Z]+),"
"([0-9A-Zx]+)"
"\)\s+"
"([A-Za-z0-9./:_ -]+)\s\s+"
"([A-Za-z0-9 ()._,/#>-]+)\s+"
"([A-Z][A-Z]_?O?W?)\s+"
"([0-9n-]+)\s*$")
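Of course, the problem is that the third group is a greedy match, and
its character class includes a space. Instead of stopping between
"Settings" and "Auto", it keeps going and swallows as much of the line
as it can while still letting the rest of the pattern match. Make that
group non-greedy by adding "?" after its "+", changing patt to: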
patt = re.compile(
"^\s*"
"\("
"([0-9A-Z]+),"
"([0-9A-Zx]+)"
"\)\s+"
"([A-Za-z0-9./:_ -]+?)\s\s+"
"([A-Za-z0-9 ()._,/#>-]+)\s+"
"([A-Z][A-Z]_?O?W?)\s+"
"([0-9n-]+)\s*$")
or if you prefer:
patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ -]+?)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")
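To see the difference between the two quantifiers, here is a small
sketch that builds both versions of the pattern and prints what group 3
captures. The build() helper and the spacing of the sample line (two
spaces after "Settings", seven before "SL") are assumptions made for
illustration; the column widths of the real data may differ. Raw
strings are used only to keep the backslashes literal.

```python
import re

def build(group3_quantifier):
    # Hypothetical helper: the pattern from this thread, with only the
    # quantifier on group 3 varied ("+" greedy, "+?" non-greedy).
    return re.compile(
        r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+"
        r"([A-Za-z0-9./:_ -]" + group3_quantifier + r")\s\s+"
        r"([A-Za-z0-9 ()._,/#>-]+)\s+"
        r"([A-Z][A-Z]_?O?W?)\s+"
        r"([0-9n-]+)\s*$"
    )

# Assumed spacing: two spaces after "Settings", seven before "SL".
line = (" (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings"
        "  Auto Window Width"
        "       SL   1 ")

greedy = build("+").match(line)
lazy = build("+?").match(line)

print(repr(greedy.group(3)))  # description and window text run together
print(repr(lazy.group(3)))    # 'Siemens: Thorax/Multix FD Lab Settings'
print(repr(lazy.group(4)))    # window text plus some trailing padding
```

With this spacing the greedy version still finds a match, because
group 4 is allowed to match a single space; that is how group 3 ends up
holding both fields. Group 4 itself is still greedy in the corrected
pattern, so it may carry trailing spaces that need a strip().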
It looks like you wrote this regex to process this specific input
string - it has a fragile feel to it, as if you will have to go back
and tweak it to handle other data that might come along, such as
(xx42,xx0A) Honeywell: Inverse Flitznoid (Kelvin) 80 SL 1
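As a rough check of that fragility: the corrected pattern rejects the
line above, since group 1 does not allow a lowercase "x" and group 3
does not allow parentheses. One looser alternative, which is not what
the original poster used, is to match only the tag with a regex and
then split the rest on runs of two or more whitespace characters. The
sketch below assumes every field is separated by at least two spaces
and never contains two spaces in a row; parse_line, TAG, and the
spacing of both sample lines are made up for illustration.

```python
import re

# Corrected (non-greedy) pattern from this thread.
patt = re.compile(
    r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+"
    r"([A-Za-z0-9./:_ -]+?)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$"
)

# Hypothetical looser approach: match only the "(group,element)" tag,
# then split the remainder on runs of two or more whitespace characters.
TAG = re.compile(r"^\s*\(([0-9A-Za-z]+),([0-9A-Za-z]+)\)\s+(.*?)\s*$")

def parse_line(line):
    """Return (group, element, desc, window, type, value) or None."""
    m = TAG.match(line)
    if m is None:
        return None
    group, element, rest = m.groups()
    fields = re.split(r"\s{2,}", rest)
    if len(fields) != 4:
        return None  # unexpected number of columns
    return (group, element) + tuple(fields)

# Spacing in both samples is assumed.
siemens = (" (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings"
           "  Auto Window Width       SL   1 ")
other = " (xx42,xx0A) Honeywell: Inverse Flitznoid (Kelvin)  80  SL  1 "

print(patt.match(other))    # None
print(parse_line(other))
print(parse_line(siemens))
```

The trade-off is that this version no longer validates the contents of
each field; it only relies on the column separators.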
Just out of curiosity, I wondered what a pyparsing version of this
would look like. See below:
from pyparsing import Word,hexnums,delimitedList,printables,\
White,Regex,nums
line = \
" (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings " \
"Auto Window Width SL 1 "
# define fields
hexint = Word(hexnums+"x")
text = delimitedList(Word(printables),
delim=White(" ",exact=1), combine=True)
type_label = Regex("[A-Z][A-Z]_?O?W?")
int_label = Word(nums+"n-")
# define line structure - give each field a name
line_defn = "(" + hexint("x") + "," + hexint("y") + ")" + \
text("desc") + text("window") + type_label("type") + \
int_label("int")
line_parts = line_defn.parseString(line)
print(line_parts.dump())
print(line_parts.desc)
Prints:
['(', '0021', ',', 'xx0A', ')', 'Siemens: Thorax/Multix FD Lab Settings', 'Auto Window Width', 'SL', '1']
- desc: Siemens: Thorax/Multix FD Lab Settings
- int: 1
- type: SL
- window: Auto Window Width
- x: 0021
- y: xx0A
Siemens: Thorax/Multix FD Lab Settings
I was just guessing on the field names, but you can see where they are
defined and change them to the appropriate values.
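For comparison, the standard re module can name its fields too, using
(?P<name>...) groups, and the re.VERBOSE flag lets the pattern be laid
out with whitespace and comments. The sketch below reuses the guessed
field names from the pyparsing version. Two things in it are my own
assumptions rather than part of the original regex: the window group is
made non-greedy so that it does not pick up trailing padding, and the
spacing of the sample line is invented.

```python
import re

# re.VERBOSE ignores whitespace and "#" comments outside character
# classes, so the space and "#" inside the [...] sets stay literal.
patt = re.compile(r"""
    ^\s*
    \( (?P<x>[0-9A-Z]+) , (?P<y>[0-9A-Zx]+) \) \s+   # tag such as (0021,xx0A)
    (?P<desc>[A-Za-z0-9./:_ -]+?) \s\s+              # description, non-greedy
    (?P<window>[A-Za-z0-9 ()._,/#>-]+?) \s+          # window text, non-greedy
    (?P<type>[A-Z][A-Z]_?O?W?) \s+                   # type label
    (?P<int>[0-9n-]+) \s* $                          # trailing value
""", re.VERBOSE)

# Assumed spacing: two spaces after "Settings", seven before "SL".
line = (" (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings"
        "  Auto Window Width       SL   1 ")

fields = patt.match(line).groupdict()
print(fields["desc"])    # Siemens: Thorax/Multix FD Lab Settings
print(fields["window"])  # Auto Window Width
```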
-- Paul