Java Regex Problem

stevengarcia

I want to extract all the content between HTML <li> tags. I'm using
regular expressions and I'm not capturing every match with my regex.
What I have is:

String regex = "<li>(.*)</li>";
String content = "<html><li>aaa</li><li>bbb</li></html>";

Pattern p = Pattern.compile(regex);
Matcher matcher = p.matcher(content);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}

The result of this is "aaa</li><li>bbb" and that is not what I want. I
instead want to just print "aaa" and "bbb". What am I doing wrong?

Thanks for your help.
 
Lars-Åke Aspelin

If you add a '?' after the '*', the quantifier becomes reluctant instead
of greedy, which gives you the expected result.

String regex = "<li>(.*?)</li>";
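
To make the difference concrete, here is a minimal sketch that runs both
patterns against the sample string from the question. The class name
GreedyVsReluctantDemo and the helper extract() are made up for this
illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctantDemo {
    // Hypothetical helper: collects group 1 of every match of regex in content.
    static List<String> extract(String regex, String content) {
        List<String> items = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(content);
        while (matcher.find()) {
            items.add(matcher.group(1));
        }
        return items;
    }

    public static void main(String[] args) {
        String content = "<html><li>aaa</li><li>bbb</li></html>";
        // Greedy: ".*" runs to the end, then backs up to the LAST "</li>".
        System.out.println(extract("<li>(.*)</li>", content));  // [aaa</li><li>bbb]
        // Reluctant: ".*?" grows one character at a time, stopping at the FIRST "</li>".
        System.out.println(extract("<li>(.*?)</li>", content)); // [aaa, bbb]
    }
}
```

Note that '.' does not match line terminators by default, so if a list
item can span several lines you would probably also need
Pattern.DOTALL.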

Hope this helps

Lars-Åke
 
Oliver Wong

In general, regular expressions are not sufficient to solve this
problem, since list-items in HTML can be nested, e.g.

<exampleHtmlSnippet>
<ul>
  <li>
    <ol>
      <li>Foo</li>
      <li>Bar</li>
    </ol>
  </li>
  <li>Buntz</li>
</ul>
</exampleHtmlSnippet>

To solve the problem in general, you might look into an XML parser (if
your HTML is valid XML).
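
As a rough sketch of the parser route, assuming the input really is
well-formed XML, the standard JAXP DOM API can pull out the list items.
The class name XmlLiDemo and the method extractItems() are invented for
this example; the sample string is the one from the original question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlLiDemo {
    // Hypothetical helper: returns the text content of every <li> element.
    // Throws if the input is not well-formed XML.
    static List<String> extractItems(String xhtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // All <li> descendants, in document order (nested ones included).
        NodeList nodes = doc.getElementsByTagName("li");
        List<String> items = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            items.add(nodes.item(i).getTextContent());
        }
        return items;
    }

    public static void main(String[] args) throws Exception {
        String content = "<html><li>aaa</li><li>bbb</li></html>";
        System.out.println(extractItems(content)); // [aaa, bbb]
    }
}
```

With nested lists, getElementsByTagName returns the outer and the inner
items alike, and the outer item's text content includes the text of its
children, so some extra filtering would be needed there.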

If you somehow "know" that you'll never get nested list-items, then the
problem is that your regular expression is behaving greedily; i.e. it's
matching as-much-as-possible, as opposed to as-little-as-possible.

- Oliver
 
stevengarcia

In general, regular expressions are not sufficient to solve this
problem, since list-items in HTML can be nested, e.g.

<exampleHtmlSnippet>
<ul>
  <li>
    <ol>
      <li>Foo</li>
      <li>Bar</li>
    </ol>
  </li>
  <li>Buntz</li>
</ul>
</exampleHtmlSnippet>

Generally yes I would use an XML parser but I don't think the HTML will
ever change in my case. And using regex I think is easier than using
an XML parser and trying to locate particular nodes. I guess XPath
would help in that case but I'm confident this will work.

To solve the problem in general, you might look into an XML parser (if
your HTML is valid XML).

I'm not sure if it's valid or not. I guess I could parse it and find
out. :)

If you somehow "know" that you'll never get nested list-items, then the
problem is that your regular expression is behaving greedily; i.e. it's
matching as-much-as-possible, as opposed to as-little-as-possible.

I'm looking for something quick and easy, this is not for some big
company project. Thanks for your time.

-- Steve
 
Oliver Wong

Generally yes I would use an XML parser but I don't think the HTML will
ever change in my case. And using regex I think is easier than using
an XML parser and trying to locate particular nodes. I guess XPath
would help in that case but I'm confident this will work.

Not sure I understand; my objection is not that the HTML might change
during the program execution, but that list-items can be nested, as per the
example above (notice that the first <li> you encounter contains further
<li> elements).

If you honestly mean that the HTML will never ever change at all, why
not just hard-code the return result into your function?

I'm not sure if it's valid or not. I guess I could parse it and find
out. :)


I'm looking for something quick and easy, this is not for some big
company project. Thanks for your time.

Are you writing some sort of throw-away program which you'll run once,
and then throw away afterwards? I guess you're trying to do some analysis on
one particular HTML file. You should "describe the goal, not the step". See
http://www.catb.org/~esr/faqs/smart-questions.html#goal

- Oliver
 
stevengarcia

Oliver said:
Not sure I understand; my objection is not that the HTML might change
during the program execution, but that list-items can be nested, as per the
example above (notice that the first <li> you encounter contains further
<li> elements).

For the HTML I'm parsing, there won't be any nested list items.

If you honestly mean that the HTML will never ever change at all, why
not just hard-code the return result into your function?

What's between the list items can be variable, but it will always be
text and not embedded HTML. So I don't think hardcoding the return
result would work (I'm not sure what I would hard-code anyway.)

Are you writing some sort of throw-away program which you'll run once,
and then throw away afterwards? I guess you're trying to do some analysis on
one particular HTML file. You should "describe the goal, not the step". See
http://www.catb.org/~esr/faqs/smart-questions.html#goal

I'm writing something for my own personal use - it's a program that
will screen scrape a website for information that I want. Because I do
not expect this code to work in perpetuity, I'm looking for a quick and
easy way to reliably extract information from an HTML page. It's kind
of like a prototype of sorts.

As for not stating the goal, I have actually abstracted more from you
(and everyone else) because I ran into a problem that was not inherent
to my task. I recognized that regular expressions can be greedy or
reluctant, and I did some research on those, but I didn't get enough
information to help me. So the problem is really not whether I'm
finding the "most right" solution for parsing HTML. I am confident
that the program, when I'm finished, will satisfactorily accomplish my
task, despite the real risks you identified (which, BTW, I've already
determined to be low enough risk not to warrant another solution, like
XML parsing.)

The problem I wanted to state to the group was how do I prevent my
regular expression from grouping too much information? I happen to use
HTML as my example (which has caused the confusion) but could have made
up some other example as well.
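
One common way to keep a group from grabbing too much, besides a
reluctant quantifier, is a negated character class. The sketch below
relies on the assumption stated earlier in this post, that the text
between the tags never contains embedded HTML (so no '<' character).
The class name NegatedClassDemo and the method extractItems() are
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {
    // "[^<]*" matches any run of characters except '<', so the group
    // cannot run past the next tag. Assumes items hold plain text only.
    private static final Pattern LI = Pattern.compile("<li>([^<]*)</li>");

    // Hypothetical helper: collects the text of each non-nested list item.
    static List<String> extractItems(String content) {
        List<String> items = new ArrayList<>();
        Matcher matcher = LI.matcher(content);
        while (matcher.find()) {
            items.add(matcher.group(1));
        }
        return items;
    }

    public static void main(String[] args) {
        String content = "<html><li>aaa</li><li>bbb</li></html>";
        System.out.println(extractItems(content)); // [aaa, bbb]
    }
}
```

Unlike '.', the class [^<] also matches line terminators, so items that
span several lines are matched without any extra flag.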
 
Oliver Wong

Oliver said:
Are you writing some sort of throw-away program which you'll run once,
and then throw away afterwards? I guess you're trying to do some analysis on
one particular HTML file. You should "describe the goal, not the step". See
http://www.catb.org/~esr/faqs/smart-questions.html#goal
[...]

As for not stating the goal, I have actually abstracted more from you
(and everyone else) because I ran into a problem that was not inherent
to my task. I recognized that regular expressions can be greedy or
reluctant, and I did some research on those, but I didn't get enough
information to help me. So the problem is really not whether I'm
finding the "most right" solution for parsing HTML. I am confident
that the program, when I'm finished, will satisfactorily accomplish my
task, despite the real risks you identified (which, BTW, I've already
determined to be low enough risk not to warrant another solution, like
XML parsing.)

The problem I wanted to state to the group was how do I prevent my
regular expression from grouping too much information? I happen to use
HTML as my example (which has caused the confusion) but could have made
up some other example as well.

Okay, fair enough. You saw Lars' post, right? Use '?' to disable greedy
matching:

<quote>
"<li>(.*?)</li>";
</quote>

- Oliver
 
Roedy Green

it's a program that
will screen scrape a website for information that I want.

You can use plain old indexOf to find the stuff surrounding what you
want and substring to extract it. It is fast and impervious to all
kinds of non-grammatical stuff in there.
 
