html parsing

Damo_Suzuki · Dec 2, 2006

Hi,
I'm new to this HTML parsing lark. I want to parse a search engine
result HTML page to extract the title, summary and URL of every result.
I've made an attempt at it with the following code:

HTMLEditorKit htmlKit = new HTMLEditorKit();
HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
HTMLEditorKit.Parser parser = new ParserDelegator();
HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
parser.parse(buffer, callback, true);
StringBuffer text = new StringBuffer();
StringBuffer snippet = new StringBuffer();

ElementIterator iterator = new ElementIterator(htmlDoc);
Element element;
while ((element = iterator.next()) != null)
{
    AttributeSet attributes = element.getAttributes();
    Object name = attributes.getAttribute(StyleConstants.NameAttribute);

    if ((name instanceof HTML.Tag) && (name == HTML.Tag.H2))
    {
        // Build up content text as it may be within multiple elements
        //StringBuffer text = new StringBuffer();
        int count = element.getElementCount();
        for (int i = 0; i < count; i++)
        {
            Element child = element.getElement(i);
            AttributeSet childAttributes = child.getAttributes();
            if (childAttributes.getAttribute(StyleConstants.NameAttribute) == HTML.Tag.CONTENT)
            {
                int startOffset = child.getStartOffset();
                int endOffset = child.getEndOffset();
                int length = endOffset - startOffset;
                text.append(htmlDoc.getText(startOffset, length));
            }
        }

    }

    // NOTE: this condition can never be true (HTML.Tag.TD is itself an
    // HTML.Tag), so the else branch below runs for every element.
    if (!(name instanceof HTML.Tag) && (name == HTML.Tag.TD))
    {
        element = iterator.next();
    }
    else
    {
        // Build up content text as it may be within multiple elements
        int count = element.getElementCount();
        for (int i = 0; i < count; i++)
        {
            Element child = element.getElement(i);
            AttributeSet childAttributes = child.getAttributes();
            if (childAttributes.getAttribute(StyleConstants.NameAttribute) == HTML.Tag.CONTENT)
            {
                int startOffset = child.getStartOffset();
                int endOffset = child.getEndOffset();
                int length = endOffset - startOffset;
                snippet.append(htmlDoc.getText(startOffset, length));
            }
        }
    }

}

ArrayList result = new ArrayList();
result.add(text);
result.add(snippet);
in.close();
return result;
}

Currently it returns an ArrayList with two long strings in it: one
made of all the titles and one made up of all the rest. The problem
is that the summary and the URLs are in the same table cell, so when I
get the summary I also get the URL together with it.

The HTML of one result looks like this:
<h2 class=r>
<a class=l href="http://www.java.com/" onmousedown="return clk(this.href,'','','res','1','')">
<b>java</b>.com: Hot Games, Cool Apps</a></h2>

<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td class=j><font size=-1>
Get the latest <b>Java</b> Software and explore how <b>Java
</b> technology provides a better digital experience.<br>
<span class=a>www.<b>java</b>.com/ - 16k - </span><nobr>
<a class=fl href="http://66.102.9.104/search?q=cache:gzY4gL02EzEJ:www.java.com/+java&hl=en&gl=ie&ct=clnk&cd=1">Cached</a> -
<a class=fl href="/search?hl=en&lr=&q=related:www.java.com/">Similar pages</a></nobr></font>
</td>
</tr>
</table>

Does anyone know a better way of doing this, or know how to separate
the URL from the summary?
Any help would be greatly appreciated.
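
One direction that might work, offered as a rough sketch and not tested against the live results page: skip the HTMLDocument and feed the parser a streaming HTMLEditorKit.ParserCallback. The sketch assumes the markup always matches the sample above, i.e. the title sits in <h2 class=r>, the result URL is the href of the link inside that heading, and the summary is the text in <td class=j> up to the first <br>. The names SearchResultExtractor, Result, Callback, SAMPLE and extract are made up for illustration, and SAMPLE is an abbreviated copy of the markup above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper; class names "r" and "j" are taken from the sample markup.
class SearchResultExtractor {

    // Abbreviated copy of one result from the sample above.
    static final String SAMPLE =
        "<h2 class=r><a class=l href=\"http://www.java.com/\">"
        + "<b>java</b>.com: Hot Games, Cool Apps</a></h2>"
        + "<table><tr><td class=j><font size=-1>"
        + "Get the latest <b>Java</b> Software.<br>"
        + "<span class=a>www.<b>java</b>.com/ - 16k - </span>"
        + "<nobr><a class=fl href=\"/search?q=related:www.java.com/\">"
        + "Similar pages</a></nobr></font></td></tr></table>";

    /** One search hit: URL, title and summary kept apart. */
    static class Result {
        String url = "";
        final StringBuilder title = new StringBuilder();
        final StringBuilder summary = new StringBuilder();
    }

    /** Starts a new Result at every <h2 class=r> and fills it as tags stream by. */
    static class Callback extends HTMLEditorKit.ParserCallback {
        final List<Result> results = new ArrayList<Result>();
        private Result current;
        private boolean inTitle;
        private boolean inSummary;

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            Object cssClass = attrs.getAttribute(HTML.Attribute.CLASS);
            if (tag == HTML.Tag.H2 && "r".equals(cssClass)) {
                current = new Result();
                results.add(current);
                inTitle = true;
            } else if (tag == HTML.Tag.A && inTitle) {
                // The link inside the heading carries the result URL.
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    current.url = href.toString();
                }
            } else if (tag == HTML.Tag.TD && current != null && "j".equals(cssClass)) {
                inSummary = true;
            }
        }

        @Override
        public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            // Everything after the first <br> (display URL, Cached, Similar pages) is skipped.
            if (tag == HTML.Tag.BR) {
                inSummary = false;
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (tag == HTML.Tag.H2) {
                inTitle = false;
            } else if (tag == HTML.Tag.TD) {
                inSummary = false;
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (inTitle) {
                current.title.append(data);
            } else if (inSummary) {
                current.summary.append(data);
            }
        }
    }

    /** Parses the page and returns one Result per heading found. */
    static List<Result> extract(Reader html) {
        Callback callback = new Callback();
        try {
            new ParserDelegator().parse(html, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.results;
    }

    public static void main(String[] args) {
        for (Result r : extract(new StringReader(SAMPLE))) {
            System.out.println("Title:   " + r.title);
            System.out.println("URL:     " + r.url);
            System.out.println("Summary: " + r.summary);
        }
    }
}
```

Because each result gets its own object, the titles, URLs and summaries no longer end up in two long strings. If the page layout differs from the sample (for example a summary with a <br> in the middle), the cut-off rule would need adjusting.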
