Henry
Folks:
Here's a problem I encountered and what I'm doing about it. I'd welcome
suggestions about how best to solve issues like this. Please do _not_ spend
time solving the specific problem -- just help me understand the best
approaches to such issues.
The problem: I'm sniffing the text in some HTML using Perl. The only way
(believe me!) to do the job is scan from line breaks, so I need to preserve
these. It looks like everything I need to sniff is preformatted (between
<pre> and </pre> tags) with \r (carriage returns) at just the right places.
Thus far, I've been prototyping using an executable binary filter called
"html2text". I can certainly continue using this utility, but this
approach..... hmmm, lacks a certain elegance. Let's do this _all_ in Perl!
In short, I need to extract text from HTML, preserving line breaks.
1. First trip to CPAN. Using HTML::TreeBuilder followed by HTML::FormatText
as documented at
http://search.cpan.org/~sburke/HTML-Format-2.03/lib/HTML/FormatText.pm
seems to do the trick, _except_ that all the output is one long byte stream
-- no breaks at all. I fooled around with the HTML::FormatText parameters
"leftmargin" and "rightmargin". No effect.
I briefly looked at the source for clues. No other options, and it seems
capable of outputting newlines ("\n") -- I think -- but my Perl skills are
not sufficient to be sure, and under what conditions.
Hack a CPAN module? Not at my skill level.
2. Second trip to CPAN. HTML::Parser is different, but it looks
significantly more complex/advanced than HTML::Parser. I'm not good at OO
technology and I can't even tell if it will do the trick.
3. Googling, I found a classic Tom Christiansen script called "striphtml" at
http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz
But it seems to be designed in early 1996 for HTML 2.0. Instructive, for
sure, but doesn't look like a good bet.
4. My colleague, A.D., suggests a real hack: preprocess the html files,
jamming in a unique tag (maybe "xyzzy"?) at line breaks. Easy to
reconstitute at a later pass. Sure, this will work, but .... lacks a
certain elegance, and doesn't teach me much about Perl.
5. Googled up a 2000/07/02 post to this group that fully quotes a Perl
Journal article on HTML::Parse. If I understand this article correctly, not
only can I do what I need, but I'm going to understand more perl subtleties
in the bargain. Thanks to the author, Ken MacFarlane!
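For reference, the marker idea from item 4 can be sketched in core Perl roughly as follows. This is only an illustration, not a recommended solution: the marker string, the sub names, and the crude tag-stripping step (a stand-in for html2text or a CPAN formatter) are all assumptions made up for the example.

```perl
use strict;
use warnings;

# Hypothetical marker: any string that cannot occur in the real input will do.
my $MARK = 'xyzzy-linebreak';

# Pass 1: replace every line break inside <pre>...</pre> with the marker,
# so a later whitespace-collapsing step cannot lose it.
sub mark_breaks {
    my ($html) = @_;
    $html =~ s{(<pre\b[^>]*>)(.*?)(</pre>)}{
        my ($open, $body, $close) = ($1, $2, $3);
        $body =~ s/\r\n|\r|\n/ $MARK /g;
        "$open$body$close";
    }gsei;
    return $html;
}

# Pass 2: stand-in for the real HTML-to-text step. It drops tags and
# collapses all whitespace, which is exactly what destroys line breaks.
sub crude_html_to_text {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;
    $html =~ s/\s+/ /g;
    $html =~ s/^ | $//g;
    return $html;
}

# Pass 3: turn each marker (plus its padding spaces) back into a newline.
sub restore_breaks {
    my ($text) = @_;
    $text =~ s/ ?\Q$MARK\E ?/\n/g;
    return $text;
}

my $html = "<html><body><pre>line one\r\nline two\nline three</pre></body></html>";
my $text = restore_breaks(crude_html_to_text(mark_breaks($html)));
print "$text\n";
```

The three passes are kept separate on purpose, so the middle one can be swapped for whatever converter ends up doing the real work.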
As it stands, HTML::Parse seems my best bet. Comments?
Is it fairly typical to load a couple of different modules, trying them on
for size until the best fit is found?
Any particular penalty besides disk space used for leaving unused modules
lying around?
Thanks,
Henry
(e-mail address removed) remove 'zzz'