david.karr
My code is in Java, but my problem is a complicated regexp.
Ironically, I think I'm more likely to get a better response in here
than elsewhere. It's too bad there's no "regular expressions"
newsgroup (that I can find).
My sample data is the following (abstracted from real data):
--------------
*XXXlkjsflkw34lkjsfd
2XXXlkjsdfojsfjoimf344
3XXXabcdef9999999
4XXX9f9f9f9f9f9f9f9f
5XXXg8g8g8g8g8g8g8g
6XXXe6e6e6e6e6e6e6e6e
YYY=D/23333333
-xxxxxxxxxxxx
-yyyyyyyyyyyy
ZZZ=gggggggggggg
AAA=hhhhhhhhhh
-jjjjjjjjjjj
-kkkkkkkkkkk
/XXX 2
--------------
The important elements are "XXX", "YYY", "ZZZ", and "AAA". "YYY",
"ZZZ", and "AAA" could appear in any order, some could be
missing, and others like them could be added. What I'd like to build is a
regexp that can group each of "YYY", "ZZZ", and "AAA" along with their
"associated data", up to either the next "[A-Z]{3}=", or the ending
"/XXX". If I can get the "associated data" into group values, I can
use other regexps for the detail in those group values.
The regexp that I've built so far comes close to solving this, but not
quite. This is what I have so far (translated from Java string syntax
to Perl):
--------------
"(?sm)\\*.{3}.*\n" .
"2.{3}.*\n" .
"3.{3}.*\n" .
"4.{3}.*\n" .
"5.{3}.*\n" .
"6.{3}.*\n" .
" ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3})" .
" ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3})" .
" ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3})" .
"/[A-Z]{3}.*"
--------------
You can ignore for now the fact that I'm not verifying that every
place that requires "XXX" actually contains "XXX". The problem area is the
"[A-Z]{3}=" groups. This regexp works for my sample data, but I wasn't
able to collapse those three repeated lines into a single expression
that would handle any number of them. I tried the following to
replace those three lines:
"( ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*"
but that didn't seem to work, and I'm not sure why.
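My best guess at the cause, which I have not confirmed against the full pattern: in java.util.regex, a capturing group that sits inside a repeated group keeps only the text from its last iteration, so the earlier "YYY=" and "ZZZ=" captures would be overwritten by "AAA=". A reduced sketch of that behaviour follows. The class name, the method name, and the toy ";"-separated input are made up for illustration and are not part of my real program:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {
    // Returns whatever group(1) holds after the quantified group has
    // matched as many times as it can, or null if there is no match
    // (or the group never participated).
    static String lastCapture(String input) {
        // Hypothetical toy pattern: any number of "KEY=;" items in a row.
        Matcher m = Pattern.compile("(?:([A-Z]{3})=;)*").matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Three iterations match, but only the last capture survives.
        System.out.println(lastCapture("YYY=;ZZZ=;AAA=;"));
    }
}
```

If that is what is happening, the single repeated expression may be matching fine and simply not exposing the earlier groups.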
The following is the output from my Java program, using the working
regexp, where it iterated through the found groups. I provide this
just as another view of what I'm trying to capture:
--------------
group[YYY=]
group[D/23333333
-xxxxxxxxxxxx
-yyyyyyyyyyyy
]
group[ZZZ=]
group[gggggggggggg
]
group[AAA=]
group[hhhhhhhhhh
-jjjjjjjjjjj
-kkkkkkkkkkk
]
--------------
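One direction that might avoid the repetition altogether is to drop the enclosing pattern and let Matcher.find() walk over the sections one at a time. This is only a sketch under assumptions: SectionScanner, SECTION, SAMPLE, and sections() are hypothetical names, each key is assumed to start a line (optionally after spaces or tabs), and nothing here checks the numbered header lines or that the terminator really is "/XXX".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionScanner {
    // One "KEY=" section: key in group 1, data in group 2. The lazy group 2
    // stops just before the next line that begins with "KEY=" or "/KEY".
    private static final Pattern SECTION = Pattern.compile(
        "(?sm)^[ \\t]*([A-Z]{3}=)(.*?)(?=^[ \\t]*(?:[A-Z]{3}=|/[A-Z]{3}))");

    // The sample data from this post, with "\n" line endings assumed.
    static final String SAMPLE =
          "*XXXlkjsflkw34lkjsfd\n"
        + "2XXXlkjsdfojsfjoimf344\n"
        + "3XXXabcdef9999999\n"
        + "4XXX9f9f9f9f9f9f9f9f\n"
        + "5XXXg8g8g8g8g8g8g8g\n"
        + "6XXXe6e6e6e6e6e6e6e6e\n"
        + "YYY=D/23333333\n"
        + "-xxxxxxxxxxxx\n"
        + "-yyyyyyyyyyyy\n"
        + "ZZZ=gggggggggggg\n"
        + "AAA=hhhhhhhhhh\n"
        + "-jjjjjjjjjjj\n"
        + "-kkkkkkkkkkk\n"
        + "/XXX 2\n";

    // Collects key -> associated data in the order the keys appear.
    // A key that occurs twice would overwrite its earlier entry.
    static Map<String, String> sections(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = SECTION.matcher(input);
        while (m.find()) {               // one section per find() call
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        sections(SAMPLE).forEach((key, data) ->
            System.out.println("group[" + key + "]\ngroup[" + data + "]"));
    }
}
```

Because each section is matched by its own find() call, the number of "[A-Z]{3}=" entries no longer has to be known in advance. The header lines and the closing "/XXX" line could still be validated separately with the original pattern.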