Unexpected behaviour with HTMLParser...


Just Another Victim of the Ambient Morality

HTMLParser is behaving in, what I find to be, strange ways and I would
like to better understand what it is doing and why.

First, it doesn't appear to translate HTML escape characters. I don't
know the actual terminology but things like &amp; don't get translated into
& as one would like. Furthermore, not only does HTMLParser not translate it
properly, it seems to omit it altogether! This prevents me from even doing
the translation myself, so I can't even work around the issue.
Why is it doing this? Is there some mode I need to set? Can anyone
else duplicate this behaviour? Is it a bug?

Secondly, HTMLParser often calls handle_data() consecutively, without
any calls to handle_starttag() in between. I did not expect this. In HTML,
you either have text or you have tags. Why split up my text into successive
handle_data() calls? This makes no sense to me. At the very least, it does
this in response to text with &amp;-like escape sequences (or whatever
they're called), so that it may successively avoid those translations.
Again, why is it doing this? Is there some mode I need to set? Can
anyone else duplicate this behaviour? Is it a bug?

These are serious problems for me and I would greatly appreciate a
deeper understanding of these issues.
Thank you...
 

Diez B. Roggisch

Just said:
HTMLParser is behaving in, what I find to be, strange ways and I would
like to better understand what it is doing and why.

First, it doesn't appear to translate HTML escape characters. I don't
know the actual terminology but things like &amp; don't get translated into
& as one would like. Furthermore, not only does HTMLParser not translate it
properly, it seems to omit it altogether! This prevents me from even doing
the translation myself, so I can't even work around the issue.
Why is it doing this? Is there some mode I need to set? Can anyone
else duplicate this behaviour? Is it a bug?

Without code, that's hard to determine. But you are aware of e.g.

handle_entityref(name)
handle_charref(ref)

?
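
For anyone following along on Python 3 (where the module is html.parser rather than HTMLParser), here is a minimal sketch of where those two callbacks sit in the event stream. The EventLogger class is a hypothetical helper, not part of the library, and convert_charrefs=False is passed on the assumption that you want the behaviour discussed in this thread, where references are reported separately instead of being decoded into the text.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each callback in the order it fires."""

    def __init__(self):
        # convert_charrefs=False keeps entity and character references
        # as separate events instead of folding them into handle_data().
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Called for named references such as &amp; (name == "amp").
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        # Called for numeric references such as &#169; (name == "169").
        self.events.append(("charref", name))


logger = EventLogger()
logger.feed("<p>fish &amp; chips</p>")
logger.close()
for event in logger.events:
    print(event)
```

The &amp; is not dropped: it arrives through handle_entityref() between two handle_data() calls, which is also why the text looks split.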
Secondly, HTMLParser often calls handle_data() consecutively, without
any calls to handle_starttag() in between. I did not expect this. In HTML,
you either have text or you have tags. Why split up my text into successive
handle_data() calls? This makes no sense to me. At the very least, it does
this in response to text with &amp;-like escape sequences (or whatever
they're called), so that it may successively avoid those translations.

That's the way XML/HTML is defined - there is no guarantee that you get
text as whole. If you must, you can collect the snippets yourself, and
on the next end-tag deliver them as whole.
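
Collecting the snippets could look roughly like the sketch below. It is written for Python 3's html.parser, and the class name TextRunCollector, its runs attribute and the _flush() helper are made up for illustration. It assumes that any start or end tag ends the current run of text, and it uses html.unescape() to decode the references.

```python
from html import unescape
from html.parser import HTMLParser


class TextRunCollector(HTMLParser):
    """Hypothetical helper: joins consecutive text pieces into one run."""

    def __init__(self):
        # Report references separately so they can be decoded here.
        super().__init__(convert_charrefs=False)
        self._pieces = []   # pieces of the run currently being built
        self.runs = []      # completed runs of text

    def _flush(self):
        # Deliver the buffered pieces as one string, then start over.
        if self._pieces:
            self.runs.append("".join(self._pieces))
            self._pieces = []

    def handle_data(self, data):
        self._pieces.append(data)

    def handle_entityref(self, name):
        self._pieces.append(unescape("&%s;" % name))

    def handle_charref(self, name):
        self._pieces.append(unescape("&#%s;" % name))

    def handle_starttag(self, tag, attrs):
        self._flush()

    def handle_endtag(self, tag):
        self._flush()

    def close(self):
        super().close()
        self._flush()   # deliver any text left after the last tag


parser = TextRunCollector()
parser.feed("<p>fish &amp; chips &#169; 2007</p><p>second</p>")
parser.close()
print(parser.runs)
```

The buffer only ever holds the text between two tags, so memory use stays bounded by the longest text run rather than by the whole document.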

Again, why is it doing this? Is there some mode I need to set? Can
anyone else duplicate this behaviour? Is it a bug?

No. It's the way it is, because it would require buffering with
unlimited capacity to ensure this property.
These are serious problems for me and I would greatly appreciate a
deeper understanding of these issues.

HTH, and read the docs.

Diez
 

Just Another Victim of the Ambient Morality

Diez B. Roggisch said:
Without code, that's hard to determine. But you are aware of e.g.

handle_entityref(name)
handle_charref(ref)

?

Actually, I am not aware of these methods but I will certainly look into
them!
I was hoping that the issue would be known or simple before I committed
to posting code, something that is, to my chagrin, not easily done with my
news client...

That's the way XML/HTML is defined - there is no guarantee that you get
text as whole. If you must, you can collect the snippets yourself, and on
the next end-tag deliver them as whole.

I think there's some miscommunication, here.
You can't mean "That's the way XML/HTML is defined" because those format
specifications say nothing about how the format must be parsed. As far as I
can tell, you either meant to say that that's the way HTMLParser is
specified or you're referring to how text in XML/HTML can be broken up by
tags, in which case I've already addressed that in my post. I expected to
see handle_starttag() calls in between calls to handle_data().
Unless I'm missing something, it simply makes no sense to break up
contiguous text into multiple handle_data() calls...

No. It's the way it is, because it would require buffering with unlimited
capacity to ensure this property.

It depends on what you mean by "unlimited capacity." Is it so bad to
buffer with as much memory as you have? ...or, at least, have a setting for
such operation? Moreover, you know that you'll never have to buffer more
than there is HTML, so you hardly need "unlimited capacity..." For
instance, I believe Xerces does this translation for you 'cause, really, why
wouldn't you want it to?

HTH, and read the docs.

This does help, thank you. I have obviously read the docs, since I can
use HTMLParser enough to find this behaviour. I don't find the docs to be
very explanatory (perhaps I'm reading the wrong docs) and I think they
assume you already know a lot about HTML and parsing, which may be necessary
assumptions but are not necessarily true...
 

Diez B. Roggisch

Just said:
Actually, I am not aware of these methods but I will certainly look into
them!
I was hoping that the issue would be known or simple before I committed
to posting code, something that is, to my chagrin, not easily done with my
news client...



I think there's some miscommunication, here.
You can't mean "That's the way XML/HTML is defined" because those format
specifications say nothing about how the format must be parsed. As far as I
can tell, you either meant to say that that's the way HTMLParser is
specified or you're referring to how text in XML/HTML can be broken up by
tags, in which case I've already addressed that in my post. I expected to
see handle_starttag() calls in between calls to handle_data().
Unless I'm missing something, it simply makes no sense to break up
contiguous text into multiple handle_data() calls...


I meant that's the way XML/HTML-parsing is defined, yes.
It depends on what you mean by "unlimited capacity." Is it so bad to
buffer with as much memory as you have? ...or, at least, have a setting for
such operation? Moreover, you know that you'll never have to buffer more
than there is HTML, so you hardly need "unlimited capacity..." For
instance, I believe Xerces does this translation for you 'cause, really, why
wouldn't you want it to?

I've been dealing with XML files that are several gigabytes in size and
never fit into physical memory. So buffering would severely impact the
whole system if it was the default of the parser.

And you are wrong - xerces (the SAX-parser, which is the equivalent to
HTMLParser) explicitly does not do that. It is not guaranteed that the
character-data is passed in one chunk.
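
The same pattern can be tried from Python with the standard xml.sax module, which is used here only as a stand-in for Xerces' SAX interface (an assumption for illustration; the two are separate implementations). The sketch feeds a document to the parser in small pieces, the way a very large file would be read, and records each characters() call. TextHandler and the 8-byte chunk size are made up for the example.

```python
import xml.sax


class TextHandler(xml.sax.ContentHandler):
    """Hypothetical handler: records every characters() call it receives."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # May be called several times for one contiguous run of text.
        self.chunks.append(content)


document = b"<menu><item>fish &amp; chips</item></menu>"

handler = TextHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# Feed the document a few bytes at a time instead of all at once,
# so the whole input never has to be held in memory by the caller.
for start in range(0, len(document), 8):
    parser.feed(document[start:start + 8])
parser.close()

print(handler.chunks)            # how the text was split up
print("".join(handler.chunks))   # the text after joining the pieces
```

How the text is split depends on the parser and on where the input chunks happen to end, so only the joined result should be relied on.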

DOM is an entirely different subject, it _has_ to be fully parsed. But
then, it's often problematic because of that.

This does help, thank you. I have obviously read the docs, since I can
use HTMLParser enough to find this behaviour. I don't find the docs to be
very explanatory (perhaps I'm reading the wrong docs) and I think they
assume you already know a lot about HTML and parsing, which may be necessary
assumptions but are not necessarily true...

Well, you at least overlooked the methods I mentioned.

Diez
 

Stefan Behnel

Just said:
HTMLParser is behaving in, what I find to be, strange ways and I would
like to better understand what it is doing and why.

In case you also want an HTML library that is easy to use (and powerful and
flexible and...), look at lxml.html.

http://codespeak.net/lxml/dev/lxmlhtml.html

It's part of lxml 2.0, which is currently in alpha status (which does not mean
it's unstable or something, just not as complete as its authors want it to be).

http://codespeak.net/lxml/dev/

Stefan
 

Andrew Durdin

Actually, I am not aware of these methods but I will certainly look into
them!
I was hoping that the issue would be known or simple before I committed
to posting code, something that is, to my chagrin, not easily done with my
news client...

For example, here's something simple/simplistic you can do to handle
character and entity references:

from htmlentitydefs import name2codepoint

....

def handle_charref(self, ref):
    try:
        if ref.startswith('x'):
            char = unichr(int(ref[1:], 16))
        else:
            char = unichr(int(ref))
    except (TypeError, ValueError):
        char = ' '
    # Do something with char

def handle_entityref(self, ref):
    try:
        char = unichr(name2codepoint[ref])
    except (KeyError, ValueError):
        char = ' '
    # Do something with char
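
A note for readers on Python 3, where htmlentitydefs became html.entities and unichr() became chr(): from Python 3.5 on, html.parser.HTMLParser takes a convert_charrefs argument that defaults to True, and in that mode the references are decoded before handle_data() is called, so handlers like the ones above are not needed. A minimal sketch, with DataLogger as a made-up class name:

```python
from html.parser import HTMLParser


class DataLogger(HTMLParser):
    """Hypothetical helper: records the text passed to handle_data()."""

    def __init__(self):
        # True is already the default on Python 3.5+; spelled out for clarity.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # References such as &amp; and &#169; arrive here already decoded.
        self.chunks.append(data)


parser = DataLogger()
parser.feed("<p>fish &amp; chips &#169; &#xA9;</p>")
parser.close()
print("".join(parser.chunks))
```

With convert_charrefs=True, handle_entityref() and handle_charref() are not called at all, so any code that overrides them should pass convert_charrefs=False instead.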


A.
 
