html-->text, keep line breaks, best strategy is?

Henry

Folks:

Here's a problem I encountered and what I'm doing about it. I'd welcome
suggestions about how best to solve issues like this. Please do _not_ spend
time solving the specific problem -- just help me understand the best
approaches to such issues.

The problem: I'm sniffing the text in some HTML using Perl. The only way
(believe me!) to do the job is to scan from line breaks, so I need to preserve
these. It looks like everything I need to sniff is preformatted (between
<pre> and </pre> tags) with \r (carriage returns) at just the right places.

Thus far, I've been prototyping using an executable binary filter called
"html2text". I can certainly continue using this utility, but this
approach..... hmmm, lacks a certain elegance. Let's do this _all_ in Perl!

In short, I need to extract text from HTML, preserving breaks.

1. First trip to CPAN. Using HTML::TreeBuilder followed by HTML::FormatText
as documented at

http://search.cpan.org/~sburke/HTML-Format-2.03/lib/HTML/FormatText.pm

seems to do the trick, _except_ that all the output is one long byte stream
-- no breaks at all. I fooled around with the HTML::FormatText parameters
"leftmargin" and "rightmargin". No effect.

I briefly looked at the source for clues. I found no other options. It seems
capable of outputting newlines ("\n") -- I think -- but my Perl skills are
not sufficient to be sure, or to tell under what conditions.

Hack a CPAN module? Not at my skill level.

2. Second trip to CPAN. HTML::Parser is different, but it looks
significantly more complex/advanced than HTML::FormatText. I'm not good at OO
technology and I can't even tell if it will do the trick.

3. Googling, I found a classic Tom Christiansen script called "striphtml" at

http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz

But it seems to have been designed in early 1996 for HTML 2.0. Instructive,
for sure, but it doesn't look like a good bet.

4. My colleague, A.D., suggests a real hack: preprocess the html files,
jamming in a unique tag (maybe "xyzzy"?) at line breaks. Easy to
reconstitute at a later pass. Sure, this will work, but .... lacks a
certain elegance, and doesn't teach me much about Perl.

5. Googled up a 2000/07/02 post to this group that fully quotes a Perl
Journal article on HTML::Parse. If I understand this article correctly, not
only can I do what I need, but I'm going to understand more Perl subtleties
in the bargain. Thanks to the author, Ken MacFarlane!

As it stands, HTML::Parse seems my best bet. Comments?

Is it fairly typical to load a couple of different modules, trying them on
for size until the best fit is found?

Any particular penalty besides disk space used for leaving unused modules
lying around?

Thanks,

Henry

(e-mail address removed) remove 'zzz'
 
Henry

Folks:

Here's an addendum to my previous post:
Here's a problem I encountered and what I'm doing about it. I'd welcome
suggestions about how best to solve issues like this. Please do _not_ spend
time solving the specific problem -- just help me understand the best
approaches to such issues.

The problem: I'm sniffing the text in some HTML using Perl. The only way
(believe me!) to do the job is to scan from line breaks, so I need to preserve
these. It looks like everything I need to sniff is preformatted (between
<pre> and </pre> tags) with \r (carriage returns) at just the right places.

<snip>
<snip>

D'oh! Double D'oh!

The best way to solve this problem is for me to open my #$%*@%$#$%! eyes
and see that the HTML I'm sniffing is clearly script-generated and very
simply so. Thus it is senseless to convert it to an intermediate textual
form, and quite reasonable to write custom Perl code to directly sniff the
HTML for the content I need.

I did experiment with HTML::Parse. It preserves line breaks but obliterates
spacing that's really helpful in parsing the contents. (These pages contain
massive outlines, by the way, up to maybe 2000 lines going up to six levels
deep.)
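
To illustrate the "custom Perl code" route, here is a minimal sketch that uses core Perl only. It is not Henry's actual script: the sub name outline_lines and the sample input are made up for illustration, and it assumes the pages really are as regular as described (plain <pre> blocks, no nesting). It keeps both the line breaks and the leading spacing, which is treated as a rough indicator of outline depth. Entities such as &amp; are left undecoded, and a tab counts as one character of indent.

```perl
use strict;
use warnings;

# Hypothetical helper: return one [indent, text] pair for every non-blank
# line found inside <pre>...</pre> blocks. Regex-based and deliberately
# naive; only reasonable for simple, script-generated HTML.
sub outline_lines {
    my ($html) = @_;
    my @out;
    while ($html =~ m{<pre\b[^>]*>(.*?)</pre>}gis) {
        my $block = $1;
        # Accept \r\n, bare \r, or bare \n as a line break.
        for my $line (split /\r\n|\r|\n/, $block) {
            $line =~ s/<[^>]+>//g;            # drop inline tags (naive)
            next unless $line =~ /\S/;        # skip blank lines
            my ($indent, $text) = $line =~ /^(\s*)(.*?)\s*$/;
            push @out, [length $indent, $text];
        }
    }
    return @out;
}

# Made-up sample input, just to show the call.
my $sample = "<html><pre>Chapter\r  Section\r    Item</pre></html>";
printf "%d: %s\n", @$_ for outline_lines($sample);
```

For messier HTML a real parser is the safer choice; the regex approach only pays off because the input is known to be regular.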

OK, in this case, there's underlying regularity that allows a simpler
solution. Supposing I did need to convert to plain text, as my original
post described... was my approach reasonable?

(I don't know about this talking to myself. Not a good sign...)

Thanks,

Henry

(e-mail address removed) remove 'zzz'
 
David K. Wall

Henry said:
In short, I need to extract text from HTML, preserving breaks. [snip for space]
As it stands, HTML::Parse seems my best bet. Comments?

HTML::Parse is deprecated; use HTML::Parser. If HTML::Parser seems
weird, you can use the alternative interface provided by
HTML::TokeParser.
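
For reference, a minimal sketch of what the HTML::TokeParser route might look like for this problem. It assumes the module is installed from CPAN (it ships in the HTML-Parser distribution); the sub name pre_lines and the sample document are invented for illustration. The idea is to jump to each <pre> start tag with get_tag, take everything up to the matching </pre> with get_text, and split that text on line breaks.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # CPAN module, part of the HTML-Parser distribution

# Hypothetical helper: return the lines of text found inside <pre> blocks,
# with leading whitespace left intact.
sub pre_lines {
    my ($html) = @_;
    # A reference to a scalar tells TokeParser to parse the string itself
    # rather than treat it as a file name.
    my $p = HTML::TokeParser->new(\$html)
        or die "could not set up parser";
    my @lines;
    while ($p->get_tag('pre')) {
        # All text up to the closing </pre>; entities are decoded.
        my $text = $p->get_text('/pre');
        push @lines, split /\r\n|\r|\n/, $text;
    }
    return @lines;
}

# Made-up sample document.
my $doc = "<p>ignored</p><pre>first\n  second\n    third</pre>";
print "[$_]\n" for pre_lines($doc);
```

Because get_text returns the text untrimmed, the indentation inside the <pre> block survives, which matters if spacing carries meaning.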

Henry said:
Is it fairly typical to load a couple of different modules, trying
them on for size until the best fit is found?

I don't know about other people, but that's what I often do. You
don't actually have to install them, though. You can read the docs
on the CPAN site and see if they do what you want in a way that
you're comfortable with. Running the examples can help. I often
write short programs to test the particular functions that interest
me before I try to use the module in the real program I'm writing.
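
A throwaway test program of that kind could look roughly like the sketch below. The sub name probe is hypothetical, and Text::Tabs (a core module) merely stands in for whichever module is being evaluated. It loads the module at run time, so a missing module yields a message instead of a compile error, and then runs one small check against it.

```perl
use strict;
use warnings;

# Hypothetical helper: load a module at run time and, if that works, run a
# small probe against it. Returns the probe's result, or nothing (undef in
# scalar context) if the module cannot be loaded.
sub probe {
    my ($module, $code) = @_;
    # Only accept plain package names so the string eval stays safe.
    die "odd module name: $module"
        unless $module =~ /^[A-Za-z_]\w*(?:::\w+)*$/;
    return unless eval("require $module; 1");
    return $code->();
}

# Example probe: what does Text::Tabs do with a leading tab?
my $expanded = probe('Text::Tabs',
                     sub { join '', Text::Tabs::expand("\tindented") });
print defined $expanded ? "got: [$expanded]\n" : "Text::Tabs not available\n";
```

Swapping in another module name and probe body is enough to try a different candidate.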

Henry said:
Any particular penalty besides disk space used for leaving unused
modules lying around?

None that I know of. I suppose you could argue that it increases the
size of the directory it's in and makes filesystem lookups slower for
other module files in that directory, but that's a bit too esoteric
for me to worry about.
 
