Regular expression to extract all links in a page

smartestdesign

I am trying to extract all URLs from a particular page, but without
success.

java.util.regex.Pattern p = Pattern.compile("<a href=\"http://(.*)\">", Pattern.MULTILINE);
java.util.regex.Matcher m = p.matcher(strhtmpage);
while (m.find()) {
    System.out.println("LINKS: " + m.group(1));
}
 
lordy

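Your ".*" is greedy by default, so it runs on to the last "> on the
line instead of stopping at the first one. You want a reluctant
quantifier (".*?"). Or use something like [^"]* instead, which will be
more efficient.

Read the Javadoc for java.util.regex.Pattern, or perlre, to understand
greedy regexps and all will become clear.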
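
To make the difference concrete, here is a minimal sketch comparing the three quantifiers on the same input. The class name LinkExtractor, the method extractLinks, and the sample HTML are made up for illustration. It assumes the page is already held in a String and that every link is written exactly as <a href="http://...">, as in the original pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
class LinkExtractor {
    // Greedy: .* grabs as much as it can, then backtracks to the LAST "> on the line.
    static final Pattern GREEDY = Pattern.compile("<a href=\"http://(.*)\">");
    // Reluctant: .*? grows one character at a time and stops at the FIRST ">.
    static final Pattern RELUCTANT = Pattern.compile("<a href=\"http://(.*?)\">");
    // Negated class: [^"]* can never cross a double quote, so no backtracking is needed.
    static final Pattern NEGATED = Pattern.compile("<a href=\"http://([^\"]*)\">");

    // Collects group(1) of every match of the given pattern in html.
    static List<String> extractLinks(Pattern pattern, String html) {
        List<String> links = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample input (an assumption): two links on one line.
        String html = "<a href=\"http://a.example/\">A</a> <a href=\"http://b.example/x\">B</a>";
        System.out.println("greedy:    " + extractLinks(GREEDY, html));
        System.out.println("reluctant: " + extractLinks(RELUCTANT, html));
        System.out.println("negated:   " + extractLinks(NEGATED, html));
    }
}
```

With two links on one line, the greedy pattern returns a single match that swallows everything between the first http:// and the last ">, while the other two return one entry per link. Pattern.MULTILINE is left out here because it only changes how ^ and $ behave, and the pattern uses neither.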

Lordy
 
