Robert Maas, see http://tinyurl.com/uh3t
For years I've had needs for parsing HTML, but avoided writing a
full HTML parser because I thought it'd be too much work. So
instead I wrote various hacks that gleaned particular data from
special formats of HTML files (such as Yahoo! Mail folders and
individual messages) while ignoring the bulk of the HTML file.
But since I have a whole bunch of current needs for parsing various
kinds of HTML files, and I don't want to have to write a separate
hack for each format, all flaky and bug-ridden, I finally decided to
<cliche>bite the bullet</cliche> and write a genuine HTML parser.
Yesterday (Wednesday) I started work on the tokenizer, using one of
my small Web pages from years ago as the test data:
<http://www.rawbw.com/~rem/WAP.html>
As I was using TDD (Test-Driven Development) I discovered that the
file was still using the *wrong* syntax <p /> to make blank lines
between parts of the text, so I changed those to valid markup. With
that fixed, my HTML tokenizer handled the whole file; I finished
that much last night.
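For illustration, here's roughly what that tokenizing step does,
sketched in Python rather than my actual CMUCL code (the regex and
the token shapes here are just assumptions for the sketch):

```python
import re

# Illustrative sketch only -- not the actual CMUCL tokenizer.
# Split HTML into (kind, payload) tokens: TAG, ENDTAG, TEXT.
TOKEN_RE = re.compile(r'<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)([^>]*)>')

def tokenize(html):
    tokens = []
    pos = 0
    for m in TOKEN_RE.finditer(html):
        if m.start() > pos:                       # text between tags
            text = html[pos:m.start()]
            if text.strip():
                tokens.append(('TEXT', text))
        kind = 'ENDTAG' if m.group(1) else 'TAG'
        tokens.append((kind, m.group(2).lower()))  # case-insensitive names
        pos = m.end()
    if pos < len(html) and html[pos:].strip():     # trailing text
        tokens.append(('TEXT', html[pos:]))
    return tokens
```

So `<b>hello</b>` comes out as a TAG, a TEXT, and an ENDTAG token;
attributes would ride along in the third regex group, elided here.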
Then I switched to using the Google-Group Advanced-Search Web page
as test data, and finally got the tokenizer working for it after a
few more hours' work today (Thursday).
Then I wrote the routine to take the list of tokens and find all
matching pairs of open tag and closing tag, replacing them with a
single container cell that included everything between the tags.
For example (TAG "font" ...) (TEXT "hello") (INPUT ...) (ENDTAG "font")
would be replaced by (CONTAIN "font" (...) (TEXT "hello") (INPUT ...)).
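Here's a sketch of that collapse step in Python (illustrative only,
not my Lisp; the tuples stand in for the token cells, and the exact
handling of unmatched tags is an assumption of the sketch):

```python
# Illustrative sketch of the pair-matching collapse, not the actual Lisp.
# Tokens are tuples like ('TAG','font'), ('TEXT','hello'), ('ENDTAG','font').
# Matched open/close pairs become ('CONTAIN', name, [children...]);
# an open tag that never closes stays behind as a bare ('TAG', name) --
# the :TAG leftovers visible in the parse dump.
def build_tree(tokens):
    stack = [('CONTAIN', None, [])]            # sentinel root frame
    for tok in tokens:
        kind = tok[0]
        if kind == 'TAG':
            stack.append(('CONTAIN', tok[1], []))
        elif kind == 'ENDTAG':
            if any(frame[1] == tok[1] for frame in stack[1:]):
                # demote any inner opens that never closed, then collapse
                while stack[-1][1] != tok[1]:
                    unmatched = stack.pop()
                    stack[-1][2].append(('TAG', unmatched[1]))
                    stack[-1][2].extend(unmatched[2])
                done = stack.pop()
                stack[-1][2].append(done)
            # a stray close tag with no matching open is simply dropped
        else:
            stack[-1][2].append(tok)           # TEXT and friends
    # anything still open at end-of-file becomes a bare TAG marker
    while len(stack) > 1:
        unmatched = stack.pop()
        stack[-1][2].append(('TAG', unmatched[1]))
        stack[-1][2].extend(unmatched[2])
    return stack[0][2]
```

Run on a well-matched pair it yields one CONTAIN node; run on an
unclosed <b> it leaves the bare TAG marker behind instead of failing.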
I single-stepped it at the level of full collapses, all the way to
the end of the test file, so I could watch it and get a feel for
what was happening. It worked perfectly the first time, but I saw
an awful lot of bad HTML in the Google-Groups Advanced-Search page,
such as many <b> and <font> elements that were opened but never
closed, lots of <p> <p> <p> that weren't closed either, and even
some unclosed table elements.
Anyway, after spending an hour single-stepping it all, and finding
it working perfectly, I had a DOM (Document Object Model)
structure, i.e. the parse tree, for the HTML file, inside CMUCL, so
then of course I prettyprinted it to disk. Have a look if you're
curious:
<http://www.rawbw.com/~rem/NewPub/parsed-ggadv.dat.txt>
Any place you see a :TAG that means an opening tag without any
matching close tag. For <br>, and for the various <option> inside a
<select>, that's perfectly correct. But for the other stuff I
mentioned, such as <b> and <font>, that isn't valid HTML and never
was, right? I wonder what the W3C validator says about the HTML?
<http://validator.w3.org/check?uri=http://www.google.com/advanced_group_search?hl=en>
Result: Failed validation, 707 errors
No kidding!!! Over seven hundred mistakes in a one-page document!!!
It's amazing my parser actually parses it successfully!!
Actually, to be fair, many of the errors arise because the doctype
declaration claims the page is XHTML Transitional, which requires
lower-case tags, but in fact most tags are upper case. (And my
parser is case-insensitive, and *only* parses, doesn't validate at
all.) I wonder, if all the tags were changed to lower case, how many
fewer errors the W3C validator would report? Modified GG page:
<http://www.rawbw.com/~rem/NewPub/tmp-ggadv.html>
<http://validator.w3.org/check?uri=http://www.rawbw.com/~rem/NewPub/tmp-ggadv.html>
Result: Failed validation, 693 errors
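The lowercasing rewrite itself is a small job; here's a Python
sketch of the sort of thing involved (not the exact script I used,
and it naively assumes no ">" inside quoted attribute values):

```python
import re

# Illustrative sketch: lower-case the tag names only, leaving
# attributes and text content untouched. Not the exact script used.
def lowercase_tags(html):
    def fix(m):
        slash, name, rest = m.group(1), m.group(2), m.group(3)
        return '<' + slash + name.lower() + rest + '>'
    return re.sub(r'<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>', fix, html)
```

So `<FONT size=2>x</FONT>` becomes `<font size=2>x</font>`, which is
enough to retest against the validator's case complaints.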
Hmmm, this validation error concerns me:
145. Error Line 174 column 49: end tag for "br" omitted, but OMITTAG
NO was specified.
My guess is some smartypants at Google thought it'd make good P.R.
to declare the document as XHTML instead of HTML, without realizing
that the document wasn't valid XHTML at all and that the DTD used
was totally inappropriate for it. Does anybody know, from eyeballing
the entire Web-page source, which DOCTYPE/DTD declaration would be
appropriate to make it almost pass validation? I bet, with the
correct DOCTYPE declaration, there'd be only fifty or a hundred
validation errors, mostly the kind I mentioned earlier, which I
discovered when testing my new parser.