How to parse "real-life" HTML?

Bru, Pierre

Hello,

I want to load pages from the web, but some (most?) of them are
ill-formed, or at least do not conform to HTML 4: for example, a missing
</p>, a missing <tbody> in a <table>, an invalid <tr> or <td>, missing
double quotes, deprecated or private tags, etc.

Is there a class that can parse such HTML and, if possible, make a
best effort to fix those problems (while keeping any potentially private
tags)?

TIA,
Pierre.
 
Christophe Vanfleteren

http://jtidy.sourceforge.net/

I haven't tried it yet, but it might do the trick.
 
mromarkhan

Peace be unto you.
JTidy


--- TestCase.java follows


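<code>
import org.w3c.tidy.Tidy;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class TestCase
{
    public static void main(String[] args) throws IOException
    {
        Tidy tidy = new Tidy();
        // Emit the repaired document as XHTML rather than HTML.
        tidy.setXHTML(true);
        // try-with-resources (Java 7+) closes both files even if parsing fails.
        try (InputStream in = new FileInputStream("GetIP.html");
             OutputStream out = new FileOutputStream("GetIPX.html"))
        {
            // Parse the ill-formed input and write the cleaned-up markup.
            tidy.parse(in, out);
        }
    }
}
</code>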


--- Build


C:\downloads\jtidy-04aug2000r7-dev\jtidy-04aug2000r7-dev>javac -classpath "C:\downloads\jtidy-04aug2000r7-dev\jtidy-04aug2000r7-dev\build\Tidy.jar;." TestCase.java

C:\downloads\jtidy-04aug2000r7-dev\jtidy-04aug2000r7-dev>java -classpath "C:\downloads\jtidy-04aug2000r7-dev\jtidy-04aug2000r7-dev\build\Tidy.jar;." TestCase

Tidy (vers 4th August 2000) Parsing "InputStream"
line 4 column 5 - Warning: <script> lacks "type" attribute
line 27 column 1 - Warning: <script> lacks "type" attribute
line 32 column 2 - Warning: missing </b> before <li>
line 32 column 2 - Warning: <li> isn't allowed in <body> elements
line 32 column 2 - Warning: inserting implicit <ul>
line 32 column 5 - Warning: inserting implicit <b>
line 34 column 1 - Warning: missing </ul> before </body>

InputStream: Document content looks like HTML 3.2
7 warnings/errors were found!


C:\downloads\jtidy-04aug2000r7-dev\jtidy-04aug2000r7-dev>
C:\downloads\jtidy-04aug2000r7-dev\jtidy-04aug2000r7-dev>
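
The report above is plain text, so if you want to act on it from code (say, count the warnings per page), one option is to match the lines yourself. This is only a minimal sketch, assuming the "line N column M - Warning: ..." layout seen in the paste above; TidyReport, Entry, and parse are hypothetical names of mine and not part of JTidy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TidyReport {
    /** One diagnostic line from the report (hypothetical holder type). */
    static final class Entry {
        final int line;
        final int column;
        final String level;
        final String message;

        Entry(int line, int column, String level, String message) {
            this.line = line;
            this.column = column;
            this.level = level;
            this.message = message;
        }
    }

    // Assumed layout, taken from the paste: "line 4 column 5 - Warning: text".
    private static final Pattern DIAGNOSTIC =
        Pattern.compile("^line (\\d+) column (\\d+) - (Warning|Error): (.*)$");

    /** Returns the diagnostics found in the report; other lines are skipped. */
    static List<Entry> parse(String report) {
        List<Entry> entries = new ArrayList<>();
        for (String raw : report.split("\\R")) {
            Matcher m = DIAGNOSTIC.matcher(raw.trim());
            if (m.matches()) {
                entries.add(new Entry(
                    Integer.parseInt(m.group(1)),
                    Integer.parseInt(m.group(2)),
                    m.group(3),
                    m.group(4)));
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        String report =
            "line 4 column 5 - Warning: <script> lacks \"type\" attribute\n"
            + "line 32 column 2 - Warning: inserting implicit <ul>\n"
            + "\n"
            + "7 warnings/errors were found!\n";
        for (Entry e : parse(report)) {
            System.out.println(e.line + ":" + e.column + " " + e.level + " " + e.message);
        }
    }
}
```

Header and summary lines do not match the pattern, so they are simply ignored.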


--- Original


<html>
<head>
<title>Is He Canadian</title>
<script>
function getGeoIP()
{
var site ='http://www.showmyip.com/xml/';
var XMLHTTP = new ActiveXObject( 'Msxml2.XMLHTTP' );
XMLHTTP.Open("GET", site, false);
XMLHTTP.Send(null);
var xmlDoc;
var root;
//alert(XMLHTTP.responseText);
xmlDoc = new ActiveXObject("Microsoft.XMLDOM");
xmlDoc.async = false;
xmlDoc.loadXML(XMLHTTP.responseText);
root = xmlDoc.documentElement;
var countryList = xmlDoc.getElementsByTagName("country");
var country = countryList.item(0).firstChild.nodeValue;
return country;
}
</script>
</head>
<body>
Omar Khan lives in
<b>
<script>
document.write(getGeoIP());
</script>


<li>Intentional mistakes</li>
</body>

</html>


--- JTidy results


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<title>Is He Canadian</title>
<script type="text/javascript">
function getGeoIP()
{
var site ='http://www.showmyip.com/xml/';
var XMLHTTP = new ActiveXObject( 'Msxml2.XMLHTTP' );
XMLHTTP.Open("GET", site, false);
XMLHTTP.Send(null);
var xmlDoc;
var root;
//alert(XMLHTTP.responseText);
xmlDoc = new ActiveXObject("Microsoft.XMLDOM");
xmlDoc.async = false;
xmlDoc.loadXML(XMLHTTP.responseText);
root = xmlDoc.documentElement;
var countryList = xmlDoc.getElementsByTagName("country");
var country = countryList.item(0).firstChild.nodeValue;
return country;
}

</script>
</head>
<body>
Omar Khan lives in <b>
<script type="text/javascript">
document.write(getGeoIP());
</script>
</b>
<ul class="noindent">
<li><b>Intentional mistakes</b></li>
</ul>
</body>
</html>
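
A quick way to sanity-check output like the above: XHTML is XML, so the repaired page should get through an XML parser while the original should not. Below is a rough sketch using only the JDK. WellFormedCheck and isWellFormed are illustrative names, and the two strings are cut-down versions of the pages above, not the full files. The empty entity resolver is my own assumption, there to stop the parser from fetching the DTD over the network; as a side effect, pages that use named entities such as &nbsp; may be rejected.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    /** Returns true if the markup parses as well-formed XML. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Assumption: resolve every external entity (the DTD) to an empty
            // document so that no network access is attempted.
            builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input: not well-formed.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Cut-down version of the original page: <b> is never closed.
        String before = "<html><body>Omar Khan lives in <b>"
            + "<li>Intentional mistakes</li></body></html>";
        // Cut-down version of the JTidy result.
        String after = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
            + "Omar Khan lives in <b></b><ul class=\"noindent\">"
            + "<li><b>Intentional mistakes</b></li></ul></body></html>";
        System.out.println("before: " + isWellFormed(before));
        System.out.println("after:  " + isWellFormed(after));
    }
}
```

This checks well-formedness only; it does not validate against the XHTML DTD.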
 
