HTML parsing


Damo_Suzuki

Hi,
I'm new to this HTML parsing lark. I want to parse a search engine
results page (HTML) and extract the title, summary and URL of every result.
I've made an attempt at it with the following code:

    // Uses javax.swing.text.*, javax.swing.text.html.*,
    // javax.swing.text.html.parser.ParserDelegator and java.util.ArrayList.
    HTMLEditorKit htmlKit = new HTMLEditorKit();
    HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
    HTMLEditorKit.Parser parser = new ParserDelegator();
    HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
    parser.parse(buffer, callback, true);
    // Push any parsed content still buffered by the reader into htmlDoc
    callback.flush();
    StringBuffer text = new StringBuffer();    // all titles
    StringBuffer snippet = new StringBuffer(); // everything else

    ElementIterator iterator = new ElementIterator(htmlDoc);
    Element element;
    while ((element = iterator.next()) != null)
    {
        AttributeSet attributes = element.getAttributes();
        Object name = attributes.getAttribute(StyleConstants.NameAttribute);

        if ((name instanceof HTML.Tag) && (name == HTML.Tag.H2))
        {
            // Build up content text as it may be within multiple elements
            //StringBuffer text = new StringBuffer();
            int count = element.getElementCount();
            for (int i = 0; i < count; i++)
            {
                Element child = element.getElement(i);
                AttributeSet childAttributes = child.getAttributes();
                if (childAttributes.getAttribute(StyleConstants.NameAttribute) == HTML.Tag.CONTENT)
                {
                    int startOffset = child.getStartOffset();
                    int endOffset = child.getEndOffset();
                    int length = endOffset - startOffset;
                    text.append(htmlDoc.getText(startOffset, length));
                }
            }
        }

        // Note: this test is always false (HTML.Tag.TD is itself an HTML.Tag),
        // so the else branch below runs for every element.
        if (!(name instanceof HTML.Tag) && (name == HTML.Tag.TD))
        {
            element = iterator.next();
        }
        else
        {
            // Build up content text as it may be within multiple elements
            int count = element.getElementCount();
            for (int i = 0; i < count; i++)
            {
                Element child = element.getElement(i);
                AttributeSet childAttributes = child.getAttributes();
                if (childAttributes.getAttribute(StyleConstants.NameAttribute) == HTML.Tag.CONTENT)
                {
                    int startOffset = child.getStartOffset();
                    int endOffset = child.getEndOffset();
                    int length = endOffset - startOffset;
                    snippet.append(htmlDoc.getText(startOffset, length));
                }
            }
        }
    }

    ArrayList result = new ArrayList();
    result.add(text);
    result.add(snippet);
    in.close();
    return result;
}

Currently it returns an ArrayList holding two long strings: one made up
of all the titles and one made up of everything else. The problem is
that the summary and the URLs are in the same table, so when I extract
the summary I get the URL along with it.

The HTML of one result looks like this:
<h2 class=r>
<a class=l href="http://www.java.com/" onmousedown="return clk(this.href,'','','res','1','')">
<b>java</b>.com: Hot Games, Cool Apps</a></h2>

<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td class=j><font size=-1>
Get the latest <b>Java</b> Software and explore how <b>Java</b> technology provides a better digital experience.<br>
<span class=a>www.<b>java</b>.com/ - 16k - </span><nobr>
<a class=fl href="http://66.102.9.104/search?q=cache:gzY4gL02EzEJ:www.java.com/+java&hl=en&gl=ie&ct=clnk&cd=1">Cached</a> -
<a class=fl href="/search?hl=en&lr=&q=related:www.java.com/">Similar pages</a></nobr></font>
</td>
</tr>
</table>
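
In this markup the result URL is the href of the anchor inside the <h2>, so it may be possible to read it straight from the element tree instead of from the table text. The following is only a sketch under that assumption: TitleLinkReader and titleLinks are made-up names, and it relies on HTMLDocument storing an anchor's attributes on the enclosed text leaves under the key HTML.Tag.A.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper for illustration; not part of the original code.
class TitleLinkReader {

    /** Returns the href of the first anchor found inside each <h2>. */
    static List<String> titleLinks(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // The input is already decoded text, so skip any <meta> charset switch.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            kit.read(new StringReader(html), doc, 0); // parses and flushes into doc
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("could not parse HTML", e);
        }

        List<String> links = new ArrayList<>();
        ElementIterator iterator = new ElementIterator(doc);
        Element element;
        while ((element = iterator.next()) != null) {
            Object name = element.getAttributes().getAttribute(StyleConstants.NameAttribute);
            if (name != HTML.Tag.H2) {
                continue;
            }
            for (int i = 0; i < element.getElementCount(); i++) {
                AttributeSet leaf = element.getElement(i).getAttributes();
                // Text inside <a ...> carries the anchor's own attributes here.
                Object anchor = leaf.getAttribute(HTML.Tag.A);
                if (anchor instanceof AttributeSet) {
                    Object href = ((AttributeSet) anchor).getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                        break; // one URL per title is enough
                    }
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><h2 class=r><a class=l href=\"http://www.java.com/\">"
                + "<b>java</b>.com: Hot Games, Cool Apps</a></h2></body></html>";
        System.out.println(titleLinks(html));
    }
}
```

Because the href comes from the title's anchor, the "Cached" and "Similar pages" links in the table are never touched.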

Does anyone know a better way of doing this, or know how to separate
the URL from the summary?
Any help would be greatly appreciated.
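
One possible alternative, sketched here rather than offered as the definitive way, is to skip the HTMLDocument and give the parser a custom ParserCallback that builds one record per result. ResultCollector and Result are hypothetical names. The sketch assumes the markup shown above: the title and link sit in an <h2>, the summary sits in <td class=j>, and the summary ends at the first <br> (everything after it, including the displayed URL, is ignored).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical names for illustration; not part of the original code.
class ResultCollector extends HTMLEditorKit.ParserCallback {

    static final class Result {
        final String title, url, summary;
        Result(String title, String url, String summary) {
            this.title = title;
            this.url = url;
            this.summary = summary;
        }
    }

    final List<Result> results = new ArrayList<>();
    private final StringBuilder title = new StringBuilder();
    private final StringBuilder summary = new StringBuilder();
    private String url;
    private boolean inTitle, inCell, inSummary;

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.H2) {                       // a new result starts
            inTitle = true;
            title.setLength(0);
            url = null;
        } else if (t == HTML.Tag.A && inTitle && url == null) {
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) url = href.toString();
        } else if (t == HTML.Tag.TD && "j".equals(a.getAttribute(HTML.Attribute.CLASS))) {
            inCell = true;                            // assumed summary cell
            inSummary = true;
            summary.setLength(0);
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.BR) inSummary = false;      // summary stops at first <br>
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.H2) {
            inTitle = false;
        } else if (t == HTML.Tag.TD && inCell) {      // cell closed: emit the record
            inCell = false;
            inSummary = false;
            results.add(new Result(clean(title), url, clean(summary)));
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inTitle) title.append(data);
        else if (inSummary) summary.append(data);
    }

    // Collapse runs of whitespace left over from joining text chunks.
    private static String clean(CharSequence s) {
        return s.toString().replaceAll("\\s+", " ").trim();
    }

    static List<Result> parse(String html) {
        ResultCollector collector = new ResultCollector();
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.results;
    }

    public static void main(String[] args) {
        String html = "<html><body><h2 class=r><a class=l href=\"http://www.java.com/\">java.com</a></h2>"
                + "<table><tr><td class=j>Get the latest Java Software.<br>www.java.com/ - 16k</td></tr></table>"
                + "</body></html>";
        for (Result r : parse(html)) {
            System.out.println(r.title + " | " + r.url + " | " + r.summary);
        }
    }
}
```

Each result keeps its own title, URL and summary, so nothing has to be split apart afterwards. If the page layout differs from the sample (for example a summary that itself contains a <br>), the stop condition would need adjusting.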
 
