Java Regex Problem

stevengarcia · Mar 27, 2006

I want to extract all the content between HTML <li> tags. I'm using
regular expressions and I'm not capturing every match with my regex.
What I have is:

String regex = "<li>(.*)</li>";
String content = "<html><li>aaa</li><li>bbb</li></html>";

Pattern p = Pattern.compile(regex);
Matcher matcher = p.matcher(content);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}

The result of this is "aaa</li><li>bbb" and that is not what I want. I
instead want to just print "aaa" and "bbb". What am I doing wrong?

Thanks for your help.

Lars-Åke Aspelin · Mar 27, 2006

If you add a '?' after the '*', you prevent the greedy behaviour of the
pattern matching and get the expected result.

String regex = "<li>(.*?)</li>";
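For reference, here is a minimal runnable sketch of the loop from the original post with the reluctant quantifier in place. The class name `ReluctantListItems` and the helper `extract` are made up for illustration; the pattern and the sample input come from the posts above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantListItems {
    // ".*?" is reluctant: it expands only as far as the first "</li>"
    // that lets the overall match succeed.
    private static final Pattern ITEM = Pattern.compile("<li>(.*?)</li>");

    // Hypothetical helper: collects group 1 of every match.
    static List<String> extract(String content) {
        List<String> items = new ArrayList<>();
        Matcher matcher = ITEM.matcher(content);
        while (matcher.find()) {
            items.add(matcher.group(1));
        }
        return items;
    }

    public static void main(String[] args) {
        String content = "<html><li>aaa</li><li>bbb</li></html>";
        for (String item : extract(content)) {
            System.out.println(item); // prints "aaa", then "bbb"
        }
    }
}
```

Note that '.' does not match line terminators by default, so an item that spans several lines would also need Pattern.DOTALL.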

Hope this helps

Lars-Åke

Oliver Wong · Mar 27, 2006

In general, regular expressions are not sufficient to solve this
problem, since list-items in HTML can be nested, e.g.

<exampleHtmlSnippet>
<ul>
<li>
<ol>
<li>Foo</li>
<li>Bar</li>
</ol>
</li>
<li>Buntz</li>
</ul>
</exampleHtmlSnippet>
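To see why nesting defeats a regex here, the sketch below runs the reluctant pattern `<li>(.*?)</li>` over the snippet, assuming the inner list and the outer item are properly closed and the whitespace is removed. The class name `NestedListItems` and the helper `extract` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedListItems {
    private static final Pattern ITEM = Pattern.compile("<li>(.*?)</li>");

    // Hypothetical helper: collects group 1 of every match.
    static List<String> extract(String content) {
        List<String> items = new ArrayList<>();
        Matcher matcher = ITEM.matcher(content);
        while (matcher.find()) {
            items.add(matcher.group(1));
        }
        return items;
    }

    public static void main(String[] args) {
        // The nested snippet, written on one line (an assumption made
        // so that '.' does not have to cross line terminators).
        String nested =
            "<ul><li><ol><li>Foo</li><li>Bar</li></ol></li><li>Buntz</li></ul>";
        // The outer <li> is paired with the first inner </li>, so the
        // first captured item is the fragment "<ol><li>Foo".
        System.out.println(extract(nested));
        // prints [<ol><li>Foo, Bar, Buntz]
    }
}
```

The pattern has no way to count open tags, so it cannot tell which `</li>` belongs to which `<li>`.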

To solve the problem in general, you might look into an XML parser (if
your HTML is valid XML).
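As a rough sketch of the parser route, the JDK's built-in XML parser and XPath support can pull out the list items, assuming the page really is well-formed XML. The class name `ListItemXPath` and the helper `listItems` are made up for illustration; on markup that is not well-formed the parse step fails with an exception.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ListItemXPath {
    // Hypothetical helper: returns the text content of every <li> element.
    static List<String> listItems(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // "//li" selects every li element at any depth.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//li", doc, XPathConstants.NODESET);
            List<String> items = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                items.add(nodes.item(i).getTextContent());
            }
            return items;
        } catch (Exception e) {
            // Not well-formed XML, or a parser configuration problem.
            throw new IllegalStateException("could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String content = "<html><li>aaa</li><li>bbb</li></html>";
        System.out.println(listItems(content)); // prints [aaa, bbb]
    }
}
```

A narrower XPath expression than `//li` would be needed to select only the top-level items of a nested list.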

If you somehow "know" that you'll never get nested list-items, then the
problem is that your regular expression is behaving greedily; i.e. it's
matching as-much-as-possible, as opposed to as-little-as-possible.

- Oliver

stevengarcia · Mar 28, 2006

Oliver said:
In general, regular expressions are not sufficient to solve this
problem, since list-items in HTML can be nested, e.g.

<exampleHtmlSnippet>
<ul>
<li>
<ol>
<li>Foo</li>
<li>Bar</li>
</ol>
</li>
<li>Buntz</li>
</ul>
</exampleHtmlSnippet>

Generally yes I would use an XML parser but I don't think the HTML will
ever change in my case. And using regex I think is easier than using
an XML parser and trying to locate particular nodes. I guess XPath
would help in that case but I'm confident this will work.

Oliver said:
To solve the problem in general, you might look into an XML parser (if
your HTML is valid XML).

I'm not sure if it's valid or not. I guess I could parse it and find
out.

Oliver said:
If you somehow "know" that you'll never get nested list-items, then the
problem is that your regular expression is behaving greedily; i.e. it's
matching as-much-as-possible, as opposed to as-little-as-possible.

I'm looking for something quick and easy, this is not for some big
company project. Thanks for your time.

-- Steve

Oliver Wong · Mar 28, 2006

stevengarcia said:
Generally yes I would use an XML parser but I don't think the HTML will
ever change in my case. And using regex I think is easier than using
an XML parser and trying to locate particular nodes. I guess XPath
would help in that case but I'm confident this will work.

Not sure I understand; my objection is not that the HTML might change
during the program execution, but that list-items can be nested, as per the
example above (notice that the first <li> you encounter contains further
<li> elements).

If you honestly mean that the HTML will never ever change at all, why
not just hard-code the return result into your function?

stevengarcia said:
I'm not sure if it's valid or not. I guess I could parse it and find
out.

stevengarcia said:
I'm looking for something quick and easy, this is not for some big
company project. Thanks for your time.

Are you writing some sort of throw-away program which you'll run once,
and then throw away afterwards? I guess you're trying to do some analysis on
one particular HTML file. You should "describe the goal, not the step". See
http://www.catb.org/~esr/faqs/smart-questions.html#goal

- Oliver

stevengarcia · Mar 28, 2006

Oliver said:
Not sure I understand; my objection is not that the HTML might change
during the program execution, but that list-items can be nested, as per the
example above (notice that the first <li> you encounter contains further
<li> elements).

For the HTML I'm parsing, there won't be any nested list items.

Oliver said:
If you honestly mean that the HTML will never ever change at all, why
not just hard-code the return result into your function?

What's between the list items can be variable, but it will always be
text and not embedded HTML. So I don't think hardcoding the return
result would work (I'm not sure what I would hard-code anyway.)
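Under that constraint (plain text only, no tags inside an item), one alternative to the reluctant quantifier is a negated character class that refuses to cross a '<'. This is a different technique from the one suggested earlier in the thread, shown only as a sketch; the class name `PlainTextListItems` and the helper `extract` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlainTextListItems {
    // "[^<]*" matches any run of characters that are not '<', so the
    // capture can never run past the start of the next tag.
    private static final Pattern ITEM = Pattern.compile("<li>([^<]*)</li>");

    // Hypothetical helper: collects group 1 of every match.
    static List<String> extract(String content) {
        List<String> items = new ArrayList<>();
        Matcher matcher = ITEM.matcher(content);
        while (matcher.find()) {
            items.add(matcher.group(1));
        }
        return items;
    }

    public static void main(String[] args) {
        String content = "<html><li>aaa</li><li>bbb</li></html>";
        System.out.println(extract(content)); // prints [aaa, bbb]
    }
}
```

An item that does contain markup is skipped entirely rather than captured as a fragment, which may or may not be what you want.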

Oliver said:
Are you writing some sort of throw-away program which you'll run once,
and then throw away afterwards? I guess you're trying to do some analysis on
one particular HTML file. You should "describe the goal, not the step". See
http://www.catb.org/~esr/faqs/smart-questions.html#goal

I'm writing something for my own personal use - it's a program that
will screen scrape a website for information that I want. Because I do
not expect this code to work in perpetuity, I'm looking for a quick and
easy way to reliably extract information from an HTML page. It's kind
of like a prototype of sorts.

As for not stating the goal, I have actually abstracted more from you
(and everyone else) because I ran into a problem that was not inherent
to my task. I recognized that regular expressions can be greedy or
reluctant, and I did some research on those, but I didn't get enough
information to help me. So the problem is really not whether I'm
finding the "most right" solution for parsing HTML. I am confident
that the program, when I'm finished, will satisfactorily accomplish my
task, despite the real risks you identified (which, BTW, I've already
determined to be low enough risk not to warrant another solution, like
XML parsing.)

The problem I wanted to state to the group was how do I prevent my
regular expression from grouping too much information? I happen to use
HTML as my example (which has caused the confusion) but could have made
up some other example as well.

Oliver Wong · Mar 28, 2006

Oliver said:
Are you writing some sort of throw-away program which you'll run once,
and then throw away afterwards? I guess you're trying to do some analysis on
one particular HTML file. You should "describe the goal, not the step". See
http://www.catb.org/~esr/faqs/smart-questions.html#goal

[...]

stevengarcia said:
As for not stating the goal, I have actually abstracted more from you
(and everyone else) because I ran into a problem that was not inherent
to my task. I recognized that regular expressions can be greedy or
reluctant, and I did some research on those, but I didn't get enough
information to help me. So the problem is really not whether I'm
finding the "most right" solution for parsing HTML. I am confident
that the program, when I'm finished, will satisfactorily accomplish my
task, despite the real risks you identified (which, BTW, I've already
determined to be low enough risk not to warrant another solution, like
XML parsing.)

The problem I wanted to state to the group was how do I prevent my
regular expression from grouping too much information? I happen to use
HTML as my example (which has caused the confusion) but could have made
up some other example as well.

Okay, fair enough. You saw Lars' post, right? Use '?' to disable greedy
matching:

<quote>
"<li>(.*?)</li>";
</quote>

- Oliver

Roedy Green · Mar 28, 2006

stevengarcia said:
it's a program that
will screen scrape a website for information that I want.

You can use plain old indexOf to find the stuff surrounding what you
want and substring to extract it. It is fast and impervious to all
kinds of non-grammatical stuff in there.
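A minimal sketch of that indexOf/substring approach, assuming the delimiters are the literal strings "<li>" and "</li>" from the original post. The class name `IndexOfListItems` and the helper `extract` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexOfListItems {
    // Hypothetical helper: returns the text between each "<li>" and the
    // next "</li>" that follows it.
    static List<String> extract(String content) {
        final String open = "<li>";
        final String close = "</li>";
        List<String> items = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = content.indexOf(open, from);
            if (start < 0) {
                break; // no further opening tag
            }
            start += open.length();
            int end = content.indexOf(close, start);
            if (end < 0) {
                break; // opening tag without a closing tag
            }
            items.add(content.substring(start, end));
            from = end + close.length();
        }
        return items;
    }

    public static void main(String[] args) {
        String content = "<html><li>aaa</li><li>bbb</li></html>";
        System.out.println(extract(content)); // prints [aaa, bbb]
    }
}
```

The search is case-sensitive and literal, so `<LI>` or `<li class="x">` would need their own handling.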
