ioscas
Hi, I am looking for an HTML parser that can parse a given page into
a DOM tree and can reconstruct the exact original HTML source.
Strictly speaking, I should be able to retrieve the original
source at each internal node of the DOM tree.
I have tried Beautiful Soup, which is really nice when dealing with
those god-damned ill-formed documents, but it's a pity to find
that it cannot return the original source, because it tidies the
markup so thoroughly.
Since Beautiful Soup, like most of the other HTML parsers in
Python, is a subclass of sgmllib.SGMLParser to some extent, I have
looked through the source code of sgmllib.SGMLParser to see if there is
anything I can do to tell Beautiful Soup where to find every tag
segment in the HTML source, but this will be a time-consuming job.
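To illustrate what I mean by finding every tag segment, here is a minimal sketch. It is only an assumption about how this could be approached: it uses html.parser.HTMLParser from the Python 3 standard library rather than sgmllib or Beautiful Soup, and the class name SourceTrackingParser and its methods _mark and segments are made up for the example. It records the position reported by getpos() at each parser event and slices the untouched input between consecutive events.

```python
from html.parser import HTMLParser


class SourceTrackingParser(HTMLParser):
    """Hypothetical helper: remember where each parse event starts in the source."""

    def __init__(self, source):
        # convert_charrefs=False reports entities like &amp; as their own
        # events instead of folding them, decoded, into the text.
        super().__init__(convert_charrefs=False)
        self.source = source
        # Offset of the first character of every line, so the (line, column)
        # pair from getpos() can be turned into an index into `source`.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.events = []  # list of (kind, start_offset)
        self.feed(source)
        self.close()

    def _mark(self, kind):
        line, col = self.getpos()  # line is 1-based, col is 0-based
        self.events.append((kind, self.line_starts[line - 1] + col))

    def handle_starttag(self, tag, attrs):
        self._mark("start:" + tag)

    def handle_startendtag(self, tag, attrs):
        self._mark("startend:" + tag)

    def handle_endtag(self, tag):
        self._mark("end:" + tag)

    def handle_data(self, data):
        self._mark("data")

    def handle_entityref(self, name):
        self._mark("entityref")

    def handle_charref(self, name):
        self._mark("charref")

    def handle_comment(self, data):
        self._mark("comment")

    def handle_decl(self, decl):
        self._mark("decl")

    def handle_pi(self, data):
        self._mark("pi")

    def unknown_decl(self, data):
        self._mark("unknown_decl")

    def segments(self):
        """Pair each event with the untouched source text up to the next event."""
        starts = [offset for _, offset in self.events]
        ends = starts[1:] + [len(self.source)]
        return [(kind, self.source[a:b])
                for (kind, _), a, b in zip(self.events, starts, ends)]


# An ill-formed snippet: upper-case tag, unquoted attribute, unclosed <b>.
source = '<P CLASS=x>Hello <b>world &amp; all<br/>\n<!-- note --></p>'
segs = SourceTrackingParser(source).segments()
# The pieces are slices of the input, so joining them gives the input back.
assert "".join(text for _, text in segs) == source
```

This only yields a flat list of events with their original text. Attaching those offsets to the nodes of a DOM tree would be a further step that the sketch does not attempt.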
So... any ideas?
Cheers
kai liu