Help simplify complex regexp needing positive lookahead and reluctant quantifiers

david.karr

I'm trying to build a regexp to handle somewhat complex data.

My sample data is the following (abstracted from real data):
--------------
*XXXlkjsflkw34lkjsfd
2XXXlkjsdfojsfjoimf344
3XXXabcdef9999999
4XXX9f9f9f9f9f9f9f9f
5XXXg8g8g8g8g8g8g8g
6XXXe6e6e6e6e6e6e6e6e
 YYY=D/23333333
 -xxxxxxxxxxxx
 -yyyyyyyyyyyy
 ZZZ=gggggggggggg
 AAA=hhhhhhhhhh
 -jjjjjjjjjjj
 -kkkkkkkkkkk
/XXX 2
--------------

The important elements are "XXX", "YYY", "ZZZ", and "AAA". Each of
"YYY", "ZZZ", and "AAA" could be in any order, and some could be
missing, or others like it could be added. What I'd like to build is a
regexp that can group each of "YYY", "ZZZ", and "AAA" along with their
"associated data", up to either the next "[A-Z]{3}=", or the ending
"/XXX". If I can get the "associated data" into group values, I can
use other regexps for the detail in those group values.

The regexp that I've built so far comes close to solving this, but not
quite. This is what I have so far:

--------------
"(?sm)\\*.{3}.*\n" +
"2.{3}.*\n" +
"3.{3}.*\n" +
"4.{3}.*\n" +
"5.{3}.*\n" +
"6.{3}.*\n" +
" ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3})" +
" ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3})" +
" ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3})" +
"/[A-Z]{3}.*";
--------------

You can ignore for now the fact that I'm not verifying that all the
places that require "XXX" are all "XXX". The problem area is the
"[A-Z]{3}=" groups. This regexp works for my sample data, but I wasn't
able to simplify those three repeated lines into a single expression,
which would handle any number of those. I tried the following, to
replace those three lines:

"( ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*"

but that didn't seem to work, and I'm not sure why.

The following is the output from my Java program, using the working
regexp, where it iterated through the found groups. I provide this
just as another view of what I'm trying to capture:

--------------
group[YYY=]
group[D/23333333
-xxxxxxxxxxxx
-yyyyyyyyyyyy
]
group[ZZZ=]
group[gggggggggggg
]
group[AAA=]
group[hhhhhhhhhh
-jjjjjjjjjjj
-kkkkkkkkkkk
]
--------------
 
Lisa

Did you consider using a simpler expression and passing over the
data in two or more passes, as Unix folks like to do?

grep "pat1" filename | grep "pat2" | grep "pat3"
 
Alan Moore

The "(?sm)" at the beginning puts the whole regex in DOTALL and
MULTILINE mode. The 'm' is having no effect, since you aren't using
any line anchors; the 's' is what's causing your problem. Each ".*"
initially gobbles up the whole rest of the input, then backs off as
far as necessary to permit the next part of the regex to match. That
works as intended until the line starting with '6' is reached. After
the dot-star there wolfs everything down, it starts regurgitating as
usual. When it reaches the '/' at the beginning of the last line, the
rest of the regex is able to match, because your combined
subexpression is optional. The dot-star in the '6' line ends up
keeping all the text the subexpression was supposed to match.
Changing the "*" that controls the subexpression to a "+" won't
help--it will only force the subexpression to match once, letting the
dot-star keep anything else.
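
To see that effect on a smaller input, here is a minimal sketch. The two-line header, the shortened sample, and the names GreedyDotAllDemo, STAR, PLUS and outerGroup are made up for this illustration; only the subexpression comes from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDotAllDemo {
    // Shortened header (two lines instead of six), everything under (?sm).
    static final String HEADER = "(?sm)\\*.{3}.*\n" + "2.{3}.*\n";
    static final String SUBRECORD = "( ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))";
    static final String TRAILER = "/[A-Z]{3}.*";

    // The subexpression quantified with "*" and with "+".
    static final String STAR = HEADER + SUBRECORD + "*" + TRAILER;
    static final String PLUS = HEADER + SUBRECORD + "+" + TRAILER;

    // Hypothetical shortened sample with two subrecords.
    static final String INPUT =
          "*XXXaaa\n"
        + "2XXXbbb\n"
        + " YYY=111\n"
        + " ZZZ=222\n"
        + "/XXX 2";

    // Returns group(1) after find(); null means the group never took part.
    static String outerGroup(String regex) {
        Matcher m = Pattern.compile(regex).matcher(INPUT);
        return m.find() ? m.group(1) : "<no match>";
    }

    public static void main(String[] args) {
        // With "*": the match succeeds, but the greedy ".*" on the "2" line
        // has swallowed both subrecords, so the group is null.
        System.out.println("star: " + outerGroup(STAR));
        // With "+": only the last subrecord is left for the group.
        System.out.println("plus: [" + outerGroup(PLUS) + "]");
    }
}
```

In both cases find() returns true, which is what makes the problem easy to miss: the overall match is fine, it is only the group contents that are wrong.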

You could fix that by making all the dot-stars reluctant, but a better
way (more efficient, less error-prone) would be to remove the "(?sm)"
and add "(?s)" to the subexpression, since that's the only place you
actually need DOTALL mode:

--------------
"\\*.{3}.*\n" +
"2.{3}.*\n" +
"3.{3}.*\n" +
"4.{3}.*\n" +
"5.{3}.*\n" +
"6.{3}.*\n" +
"((?s: ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*)" +
"/[A-Z]{3}.*";
--------------

Note that I also changed the subexpression's enclosing group to
non-capturing, and put the capturing group around it and its
quantifier. That way, all the YYY|ZZZ|AAA entries with their
associated data are captured in group(1). The way you had it, only
the last entry would have been retained.
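
For comparison, a small sketch of both fixes side by side on a shortened sample. The two-line header, the sample string, and the names ScopedDotAllDemo, SCOPED, RELUCTANT and outerGroup are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScopedDotAllDemo {
    // Fix 1: no global flags; DOTALL applies only inside the (?s:...) group.
    static final String SCOPED =
          "\\*.{3}.*\n"
        + "2.{3}.*\n"
        + "((?s: ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*)"
        + "/[A-Z]{3}.*";

    // Fix 2: keep DOTALL everywhere, but make the header dot-stars reluctant.
    static final String RELUCTANT =
          "(?s)\\*.{3}.*?\n"
        + "2.{3}.*?\n"
        + "((?: ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*)"
        + "/[A-Z]{3}.*";

    // Hypothetical shortened sample with two subrecords.
    static final String INPUT =
          "*XXXaaa\n"
        + "2XXXbbb\n"
        + " YYY=111\n"
        + " ZZZ=222\n"
        + "/XXX 2";

    // Returns the text captured by the outer group, i.e. all subrecords.
    static String outerGroup(String regex) {
        Matcher m = Pattern.compile(regex).matcher(INPUT);
        return m.find() ? m.group(1) : "<no match>";
    }

    public static void main(String[] args) {
        System.out.print(outerGroup(SCOPED));     // both subrecords
        System.out.print(outerGroup(RELUCTANT));  // same text
    }
}
```

Both variants capture the same text here. The scoped version does less backtracking, because each header ".*" stops at the end of its own line instead of probing across line breaks.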
 
david.karr

Ok, this looks very promising, but it doesn't quite work yet. I'll
provide both the regexp I'm using and a sample string, so you could
validate what I see, if you can.

I'm also wondering whether you meant to enter "?s:", or "(?s)" instead.
I tried both variations, with the same result.

The regexp I'm now using is this:
---------------
"\\*.{3}.*\n" +
"2.{3}.*\n" +
"3.{3}.*\n" +
"4.{3}.*\n" +
"5.{3}.*\n" +
"6.{3}.*\n" +
"((?s: ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*)" +
"/[A-Z]{3}.*";
---------------

My sample data is this:
---------------
*XXXlkjsflkw34lkjsfd
2XXXlkjsdfojsfjoimf344
3XXXabcdef9999999
4XXX9f9f9f9f9f9f9f9f
5XXXg8g8g8g8g8g8g8g
6XXXe6e6e6e6e6e6e6e6e
 YYY=D/23333333
 -xxxxxxxxxxxx
 -yyyyyyyyyyyy
 ZZZ=gggggggggggg
 AAA=hhhhhhhhhh
 -jjjjjjjjjjj
 -kkkkkkkkkkk
/XXX 2
---------------

My code is roughly this:
---------------
Pattern pattern = Pattern.compile(patternMask);
Matcher matcher = pattern.matcher(readSample);
System.out.println("groupCount[" + matcher.groupCount() + "]");
boolean found = matcher.find();
System.out.println("found[" + found + "]");
---------------

Where "patternMask" and "readSample" correspond to my regexp and the
sample data.

With this regexp and sample data, the "groupCount" prints out as "3",
and "found" is false.
 
Alan Moore

Ok, this looks very promising, but it doesn't quite work yet. I'll
provide both the regexp I'm using and a sample string, so you could
validate what I see, if you can.

That looks like what I'm doing; here's my test code:

//==== code ========================================================

import java.util.regex.*;

public class Test
{
  public static void main(String[] args)
  {
    // Header lines run without DOTALL, so each ".*" stops at end of line.
    // Only the subrecord group uses (?s:...) to span line breaks.
    String regex = "\\*.{3}.*\n"
                 + "2.{3}.*\n"
                 + "3.{3}.*\n"
                 + "4.{3}.*\n"
                 + "5.{3}.*\n"
                 + "6.{3}.*\n"
                 + "((?s: ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*)"
                 + "/[A-Z]{3}.*";

    // Sample data; subrecord and continuation lines start with a space.
    String input = "*XXXlkjsflkw34lkjsfd\n"
                 + "2XXXlkjsdfojsfjoimf344\n"
                 + "3XXXabcdef9999999\n"
                 + "4XXX9f9f9f9f9f9f9f9f\n"
                 + "5XXXg8g8g8g8g8g8g8g\n"
                 + "6XXXe6e6e6e6e6e6e6e6e\n"
                 + " YYY=D/23333333\n"
                 + " -xxxxxxxxxxxx\n"
                 + " -yyyyyyyyyyyy\n"
                 + " ZZZ=gggggggggggg\n"
                 + " AAA=hhhhhhhhhh\n"
                 + " -jjjjjjjjjjj\n"
                 + " -kkkkkkkkkkk\n"
                 + "/XXX 2";

    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(input);
    if (m.find())
    {
      // group(1) holds all subrecords as one block of text
      System.out.println(m.group(1));
    }
  }
}

//==================================================================

This prints:

 YYY=D/23333333
 -xxxxxxxxxxxx
 -yyyyyyyyyyyy
 ZZZ=gggggggggggg
 AAA=hhhhhhhhhh
 -jjjjjjjjjjj
 -kkkkkkkkkkk

I'm also wondering whether you meant to enter "?s:", or "(?s)" instead.
I tried both variations, with the same result.

"(?s)" sets the DOTALL flag for the rest of the regex, or
until you cancel it with "(?-s)". "(?s:<expr>)" both creates a
non-capturing group and sets the flag, but the flag is in effect only
within that group.
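
A minimal sketch of the difference in scope, using toy patterns that are not from the thread (the class name FlagScopeDemo and the helper matches are hypothetical):

```java
import java.util.regex.Pattern;

public class FlagScopeDemo {
    // True if the regex matches the entire input.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String twoBreaks = "a\nb\nc";

        // No flag: '.' never matches '\n'.
        System.out.println(matches("a.b.c", twoBreaks));           // false
        // (?s) stays on for the rest of the regex: both dots match '\n'.
        System.out.println(matches("(?s)a.b.c", twoBreaks));       // true
        // (?s:...) ends with its group: the second dot is back to default.
        System.out.println(matches("(?s:a.b).c", twoBreaks));      // false
        System.out.println(matches("(?s:a.b).c", "a\nbxc"));       // true
        // (?-s) cancels the flag part-way through.
        System.out.println(matches("(?s)a.b(?-s).c", twoBreaks));  // false
    }
}
```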
 
david.karr

Ok, the difference between our two was that my sample has "\r\n" for
eols. Once I changed my pattern to check for that explicitly, I get
similar output. I tried some variations with "$" and "(?m)", but it
only got past this if I specifically used "\r\n".
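
A minimal sketch of why the line endings matter, assuming Java's default mode in which '.' matches neither '\r' nor '\n'. The two-line samples and the names LineEndingDemo and finds are made up for this illustration, and "\R" needs Java 8 or later.

```java
import java.util.regex.Pattern;

public class LineEndingDemo {
    // True if the regex is found anywhere in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String unix = "5XXXaaa\n6XXXbbb\n";
        String dos  = "5XXXaaa\r\n6XXXbbb\r\n";

        String lfOnly   = "5.{3}.*\n6.{3}.*\n";
        String optional = "5.{3}.*\r?\n6.{3}.*\r?\n";
        String anyBreak = "5.{3}.*\\R6.{3}.*\\R";  // \R = any line break (Java 8+)

        System.out.println(finds(lfOnly, unix));    // true
        // ".*" stops in front of '\r', and "\n" cannot match '\r'.
        System.out.println(finds(lfOnly, dos));     // false
        System.out.println(finds(optional, unix));  // true
        System.out.println(finds(optional, dos));   // true
        System.out.println(finds(anyBreak, dos));   // true
    }
}
```

Writing "\r?\n" keeps one pattern usable for both kinds of input, so the sample does not have to be normalized first.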

However, now I have to go deeper into this, and the current expression
doesn't quite do what I need.

What I really need to capture in individual groups would be the
following (each group surrounded by brackets):

[YYY=]
[D/23333333
xxxxxxxxxxxx
yyyyyyyyyyyy]
[ZZZ=]
[gggggggggggg]
[AAA=]
[hhhhhhhhhh
jjjjjjjjjjj
kkkkkkkkkkk]

Note that I've removed the initial spaces and dashes. That's my end
state, but I can work to that step by step.

When my code steps through all the groups it found, it finds this:

---------------
group[ YYY=D/23333333
-xxxxxxxxxxxx
-yyyyyyyyyyyy
ZZZ=gggggggggggg
AAA=hhhhhhhhhh
-jjjjjjjjjjj
-kkkkkkkkkkk
]
group[AAA=]
group[hhhhhhhhhh
-jjjjjjjjjjj
-kkkkkkkkkkk
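]
---------------

I don't care about the first group, because that surrounds all of the
subrecords. I would have hoped that the next group would be "YYY=",
followed by the group with its associated data, and so on.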
 
Alan Moore

Ok, the difference between our two was that my sample has "\r\n" for
eols. Once I changed my pattern to check for that explicitly, I get
similar output. I tried some variations with "$" and "(?m)", but it
only got past this if I specifically used "\r\n".

However, now I have to go deeper into this, and the current expression
doesn't quite do what I need.

What I really need to capture in individual groups would be the
following (each group surrounded by brackets):

[YYY=]
[D/23333333
xxxxxxxxxxxx
yyyyyyyyyyyy]
[ZZZ=]
[gggggggggggg]
[AAA=]
[hhhhhhhhhh
jjjjjjjjjjj
kkkkkkkkkkk]

Note that I've removed the initial spaces and dashes. That's my end
state, but I can work to that step by step.

When my code steps through all the groups it found, it finds this:

---------------
group[ YYY=D/23333333
-xxxxxxxxxxxx
-yyyyyyyyyyyy
ZZZ=gggggggggggg
AAA=hhhhhhhhhh
-jjjjjjjjjjj
-kkkkkkkkkkk
]
group[AAA=]
group[hhhhhhhhhh
-jjjjjjjjjjj
-kkkkkkkkkkk
]
---------------

I don't care about the first group, because that surrounds all of the
subrecords. I would have hoped that the next group would be "YYY=",
followed by the group with its associated data, and so on.

When you have a capturing group that's controlled by a quantifier, the
only thing you can retrieve after a successful match is the *last*
thing that was matched by that group. Remember that the groupCount()
method only tells you how many capturing groups there are in the
Matcher's parent Pattern; it doesn't say anything about what was
actually matched.
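
A minimal sketch of that behaviour, using a simplified "KEY=digits;" format that is an assumption for illustration, not the real data (LastIterationDemo, ENTRIES and describe are hypothetical names):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastIterationDemo {
    // Two capturing groups inside a non-capturing group that repeats.
    static final Pattern ENTRIES = Pattern.compile("(?:([A-Z]{3})=(\\d+);)+");

    // Returns "groupCount:group1:group2" after matching the whole input.
    static String describe(String input) {
        Matcher m = ENTRIES.matcher(input);
        if (!m.matches()) {
            return "no match";
        }
        return m.groupCount() + ":" + m.group(1) + ":" + m.group(2);
    }

    public static void main(String[] args) {
        // Three entries are matched, but groupCount() is still 2 and the
        // groups hold only the last iteration.
        System.out.println(describe("YYY=1;ZZZ=2;AAA=3;"));  // 2:AAA:3
    }
}
```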

You initially changed your regex to match all the subrecords with a
quantified subexpression because you didn't know how many subrecords
there would be. When you did that, you gave up the ability to break
out the individual subrecords in a single pass. What you have to do
now is take the substring containing the subrecords and process it
separately to break them out. In the following code, I went ahead and
added a third layer of processing to get rid of those initial spaces
and dashes as well.

//==== code ========================================================

import java.util.regex.*;

public class Test
{
  public static void main(String[] args)
  {
    // Pass 1: match the whole record; group(1) is the block of subrecords.
    // "\r?\n" accepts both Unix and Windows line endings.
    String regex1 = "\\*.{3}.*\r?\n"
                  + "2.{3}.*\r?\n"
                  + "3.{3}.*\r?\n"
                  + "4.{3}.*\r?\n"
                  + "5.{3}.*\r?\n"
                  + "6.{3}.*\r?\n"
                  + "((?s: [A-Z]{3}=.*?(?=[ /][A-Z]{3}))*)"
                  + "/[A-Z]{3}.*";
    Pattern p1 = Pattern.compile(regex1);

    // Pass 2: split the block into key (group 1) and data (group 2).
    String regex2 = "(?s) ([A-Z]{3}=)(.*?)(?=\r?\n [A-Z]{3}|$)";
    Pattern p2 = Pattern.compile(regex2);

    // Pass 3: one data line at a time, minus the leading " -" if present.
    String regex3 = "(?: -)?(.+)";
    Pattern p3 = Pattern.compile(regex3);

    String input = "*XXXlkjsflkw34lkjsfd\n"
                 + "2XXXlkjsdfojsfjoimf344\n"
                 + "3XXXabcdef9999999\n"
                 + "4XXX9f9f9f9f9f9f9f9f\n"
                 + "5XXXg8g8g8g8g8g8g8g\n"
                 + "6XXXe6e6e6e6e6e6e6e6e\n"
                 + " YYY=D/23333333\n"
                 + " -xxxxxxxxxxxx\n"
                 + " -yyyyyyyyyyyy\n"
                 + " ZZZ=gggggggggggg\n"
                 + " AAA=hhhhhhhhhh\n"
                 + " -jjjjjjjjjjj\n"
                 + " -kkkkkkkkkkk\n"
                 + "/XXX 2";

    Matcher m1 = p1.matcher(input);
    if (m1.find())
    {
      String sub = m1.group(1);
      Matcher m2 = p2.matcher(sub);
      while (m2.find())
      {
        System.out.println("[" + m2.group(1) + "]");
        String subsub = m2.group(2);
        System.out.print("[");
        Matcher m3 = p3.matcher(subsub);
        while (m3.find())
        {
          System.out.println(m3.group(1));
        }
        System.out.println("]");
      }
    }
  }
}

//==================================================================

result:

[YYY=]
[D/23333333
xxxxxxxxxxxx
yyyyyyyyyyyy
]
[ZZZ=]
[gggggggggggg
]
[AAA=]
[hhhhhhhhhh
jjjjjjjjjjj
kkkkkkkkkkk
]
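
If the pieces are needed as values rather than printed output, the second and third passes could be wrapped in a helper along these lines. This is only a sketch: SubrecordParser and parse are hypothetical names, the key is stored without its "=", the lookahead is tightened slightly to require the "=", and a repeated key would overwrite the earlier entry.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubrecordParser {
    // One subrecord: key in group 1, raw data (possibly multi-line) in group 2.
    private static final Pattern SUBRECORD =
        Pattern.compile("(?s) ([A-Z]{3})=(.*?)(?=\r?\n [A-Z]{3}=|$)");
    // One data line, minus the leading " -" on continuation lines.
    private static final Pattern LINE = Pattern.compile("(?: -)?(.+)");

    // Takes the subrecord block (group(1) of the first pass) and returns
    // the keys in input order, each with its cleaned-up data lines.
    static Map<String, List<String>> parse(String block) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        Matcher sub = SUBRECORD.matcher(block);
        while (sub.find()) {
            List<String> lines = new ArrayList<>();
            Matcher line = LINE.matcher(sub.group(2));
            while (line.find()) {
                lines.add(line.group(1));
            }
            result.put(sub.group(1), lines);
        }
        return result;
    }

    public static void main(String[] args) {
        String block = " YYY=D/23333333\n"
                     + " -xxxxxxxxxxxx\n"
                     + " -yyyyyyyyyyyy\n"
                     + " ZZZ=gggggggggggg\n"
                     + " AAA=hhhhhhhhhh\n"
                     + " -jjjjjjjjjjj\n"
                     + " -kkkkkkkkkkk\n";
        System.out.println(parse(block));
    }
}
```

A LinkedHashMap is used so the subrecords come back in the order they appeared, since the thread says they can occur in any order.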
 
david.karr

Excellent. Thanks for the thorough detail. This could have been a
whole chapter in "Regular Expression Recipes" :) .
 
