HTML Parser Help Please

ZOCOR · Sep 30, 2004

Hi

I am using the HTMLEditorKit.Parser class to parse an HTML file. However, I
have found this Swing HTML parser extremely difficult to use.

I am trying to parse an HTML file and extract specific information from it
into a table. Consider this snippet of my HTML and the table I would like to
generate from it:

HTML source:

<HTML>
<TITLE></TITLE>
<BODY>
<PRE>
Identifer: ABCDEFG
</PRE>
data: 123456
<PRE>
</PRE>
</BODY>
</HTML>

TABLE:

ABCDEFG 123456

Here is the code I have so far:

import javax.swing.text.*;
import javax.swing.text.html.*;
import java.io.*;

public class HTMLParser extends HTMLEditorKit
{
    // HTMLEditorKit.getParser() is protected; widen it so main() can call it.
    @Override
    public HTMLEditorKit.Parser getParser()
    {
        return super.getParser();
    }

    public static void main(String[] args)
    {
        // try-with-resources closes the reader when parsing is done
        try (Reader r = new FileReader("html_file.html"))
        {
            HTMLEditorKit.Parser parser = new HTMLParser().getParser();
            HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback()
            {
                @Override
                public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
                {
                    if (t == HTML.Tag.PRE)
                    {
                        // print what's between the PRE tags
                    }
                }

                @Override
                public void handleText(char[] data, int pos)
                {
                    // print what's between the PRE tags
                }
            };

            parser.parse(r, cb, true);
        }
        catch (IOException e)
        {
            System.out.println(e);
        }
    }
}

I would appreciate it very much if someone could solve this problem for me.
I tried the Sun tutorial, but the examples aren't clear enough for me.

Thanks

ZOCOR

Nathan Zumwalt · Sep 30, 2004

I've never used this HTML Parser before, but I've done similar things
when scraping HTML off websites. My general solution is to:

1. Get the HTML as text (which you already have).
2. Run it through an HTML to XHTML cleanser (I like JTidy).
3. Parse the XHTML using Java's XML parsers.
4. Use XPath statements to get the values I want.

This probably isn't very efficient for getting small bits of data, but
it works.

//Nathan
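
Steps 3 and 4 above can be sketched with the XML APIs bundled in the JDK. This is only a minimal sketch under stated assumptions, not a definitive implementation: it assumes step 2 has already produced well-formed XHTML with lowercase tags, an added head element, no DOCTYPE, and no xmlns declaration (a real cleanser's output may carry both, which would call for namespace-aware XPath). The cleansing call itself is left out because JTidy is a third-party library. The names XhtmlXPathSketch, SAMPLE, and extractRow are hypothetical and exist only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XhtmlXPathSketch {
    // Assumed result of step 2: the HTML from the question after cleansing.
    static final String SAMPLE =
            "<html><head><title></title></head><body>\n"
            + "<pre>\nIdentifer: ABCDEFG\n</pre>\n"
            + "data: 123456\n"
            + "<pre>\n</pre>\n"
            + "</body></html>";

    // Steps 3 and 4: parse the XHTML, then pull both values out with XPath.
    static String extractRow(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Text of the first <pre>, with everything up to the colon dropped.
            String id = xpath.evaluate(
                    "normalize-space(substring-after(/html/body/pre[1], ':'))", doc);

            // First non-blank text node directly under <body>, same treatment.
            String data = xpath.evaluate(
                    "normalize-space(substring-after("
                    + "/html/body/text()[normalize-space()][1], ':'))", doc);

            return id + " " + data;
        } catch (Exception e) {
            // Parser configuration, SAX, I/O, and XPath errors all end up here.
            throw new IllegalStateException("could not extract row", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractRow(SAMPLE));
    }
}
```

The XPath expressions are tied to the layout of this one sample; a page with a different structure would need different paths.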

Paul Lutus · Sep 30, 2004

ZOCOR said:
Hi

I am using the HTMLEditorKit.Parser class to parse an HTML file. However, I
have found this Swing HTML parser extremely difficult to use.

Problem: "difficult".

I am trying to parse an HTML file and extract specific information from
it into a table.

Problem: "trying".

Consider this snippet of my HTML and the table I would like to
generate from it:

You left out the table, the final goal of your program.

/ ...

I would appreciate it very much if someone could solve this problem for
me.

Which problem, "difficult" or "trying"? Children are both difficult and
trying, but this is not a specific complaint. Neither is yours.

Tell us what you wanted, what you got, and how they differ.

I tried the Sun tutorial, but the examples aren't clear enough
for me.

Clear enough to do what?

John K · Sep 30, 2004

TagSoup [http://mercury.ccil.org/~cowan/XML/tagsoup/] might fit the
bill.

-John K
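
For the Swing parser that the original question asks about, the goal described there (one row holding ABCDEFG and 123456) could be approached roughly as follows. This is a hedged sketch, not the one right way to do it. It uses javax.swing.text.html.parser.ParserDelegator directly, which is the parser HTMLEditorKit hands out by default, instead of subclassing HTMLEditorKit as the question does. It also assumes every wanted value appears on a line of the form "label: value", so it looks at all text the parser reports and does not track the PRE tags. The names LabelledTextExtractor, SAMPLE, and extractRow are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LabelledTextExtractor {
    // The HTML snippet from the question, inlined so the sketch is self-contained.
    static final String SAMPLE =
            "<HTML>\n<TITLE></TITLE>\n<BODY>\n"
            + "<PRE>\nIdentifer: ABCDEFG\n</PRE>\n"
            + "data: 123456\n"
            + "<PRE>\n</PRE>\n"
            + "</BODY>\n</HTML>\n";

    // Parses the HTML and joins the values of all "label: value" lines with spaces.
    static String extractRow(Reader html) {
        // Insertion order is kept so the row follows document order.
        Map<String, String> fields = new LinkedHashMap<>();

        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // A text chunk may span several lines, so check each one.
                for (String line : new String(data).split("\\R")) {
                    int colon = line.indexOf(':');
                    if (colon > 0) {
                        fields.put(line.substring(0, colon).trim(),
                                   line.substring(colon + 1).trim());
                    }
                }
            }
        };

        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(html, cb, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return String.join(" ", fields.values());
    }

    public static void main(String[] args) {
        System.out.println(extractRow(new StringReader(SAMPLE)));
    }
}
```

To restrict extraction to text inside PRE, the callback could also override handleStartTag and handleEndTag and flip a boolean when it sees HTML.Tag.PRE. Note that in the sample the "data" line sits outside the PRE tags, so such a filter would skip it.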
