Extract links from HTML


Noel

Hello,

I have a string containing the HTML of a web search engine results
page. The page contains links that are of interest to my application,
and the goal is to extract those links.

Does anyone know of any Java method (or package or something similar)
that is able to retrieve the URLs from a given block of HTML? I'd like
something simple, like a method that takes a string argument
(containing the HTML text) and returns an array or vector of URLs.

Thanks

N
 

softwarepearls_com

These days, you've got Java's regular expressions support to help you.
See package java.util.regex.
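
As a rough illustration of the regular-expression route, here is a minimal sketch. The class name RegexLinkExtractor, the method name extractLinks and the pattern itself are my own assumptions, not a standard API. A regex does not understand HTML, so this will also report links that sit inside comments or inside other attribute values, and it skips unquoted href values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLinkExtractor {

    // Hypothetical pattern: an <a ...> tag with a quoted href attribute.
    // Group 1 is the opening quote character, group 2 is the URL.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns every quoted href value found in an <a> tag, in document order.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"alpha\">beta</a> "
                + "<a class='x' href='eta'>theta</a></p>";
        System.out.println(extractLinks(html)); // prints [alpha, eta]
    }
}
```

This is probably good enough for a quick one-off on pages whose markup you know. For arbitrary pages, a real HTML parser is the safer choice.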
 

Stefan Ram

Noel said:
Does anyone know of any Java method (or package or something similar)
that is able to retrieve the URLs from a given block of HTML?

(If answering to this post, please do not quote all of it,
but only the parts you directly refer to.)

public class Main
{ public final static void main( final java.lang.String[] args )
  { try
    { /* Sample input: only "alpha" and "eta" are real links. "gamma" is
         inside a comment, "epsilon" is inside an attribute value. */
      java.io.Reader reader = new java.io.StringReader
      ( "<html><head><title></title></head>" +
        "<body><p>" +
        "<a href=\"alpha\">beta</a>" +
        "<!-- <a href=\"gamma\">delta</a> -->" +
        "<i class='<a href=\"epsilon\">zeta</a>'></i>" +
        "<a href=\"eta\">theta</a>" +
        "</p></body>" );
      final javax.swing.text.html.parser.ParserDelegator parserDelegator =
        new javax.swing.text.html.parser.ParserDelegator();
      final javax.swing.text.html.HTMLEditorKit.ParserCallback
        parserCallback =
        new javax.swing.text.html.HTMLEditorKit.ParserCallback()
        { public void handleText( final char[] data, final int pos ){}
          /* Called for every start tag; print the HREF of each A tag. */
          public void handleStartTag
          ( final javax.swing.text.html.HTML.Tag tag,
            final javax.swing.text.MutableAttributeSet attribute,
            final int pos )
          { if( tag == javax.swing.text.html.HTML.Tag.A )
            { final java.lang.String address =( java.lang.String )
                attribute.getAttribute
                ( javax.swing.text.html.HTML.Attribute.HREF );
              java.lang.System.out.println( address ); }}
          public void handleEndTag
          ( final javax.swing.text.html.HTML.Tag t, final int pos ){}
          public void handleSimpleTag
          ( final javax.swing.text.html.HTML.Tag t,
            final javax.swing.text.MutableAttributeSet a, final int pos ){}
          public void handleComment
          ( final char[] data, final int pos ){}
          public void handleError
          ( final java.lang.String errMsg, final int pos ){} };
      parserDelegator.parse( reader, parserCallback, false );
      java.lang.System.out.println(); }
    catch( final java.io.IOException iOException )
    { java.lang.System.err.println( iOException ); }}}

/* prints:
alpha
eta
*/
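
Since the original question asked for a method that takes a string and returns the URLs, the callback approach above could be wrapped roughly as follows. This is only a sketch: the class name LinkExtractor and the method name extractLinks are hypothetical. It also passes true as the last argument to parse so that the parser ignores any charset declared inside the document, which I assume is what you want when the HTML is already a decoded String.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkExtractor {

    // Hypothetical helper: collects the href of every <a> start tag.
    public static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag,
                                       MutableAttributeSet attributes,
                                       int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attributes.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {   // skip anchors without an href
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            // true: ignore charset declarations found in the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // A StringReader should not fail; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>"
                + "<a href=\"alpha\">beta</a>"
                + "<a href=\"eta\">theta</a>"
                + "</p></body></html>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

The returned values are exactly what the href attributes contain, so relative links stay relative.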
 
