Convert HTML to plain text


Marcel Kessler

Hi there

Does anyone know a good way of converting HTML to plain text, keeping as
much of the formatting as possible?

The HTML will be produced by an editor like FCKEditor, and
transformation should happen in Java.

So far I've found the following options, none of them really convincing:

# Using w3m or lynx to convert HTML to plain text
(http://www.biglist.com/lists/xsl-list/archives/200406/msg00689.html)
+ neat output
- need to call a native (C) program from Java

# Google gdata routine
(http://www.biglist.com/lists/xsl-list/archives/200406/msg00689.html)
+ Java source available
- only basic stripping, no tables etc.

# Use XML & XSLT
(http://www-128.ibm.com/developerworks/java/library/x-xmlist1/)
+ good result
- complicated approach, cannot use a WYSIWYG editor like FCKEditor

# Use other tools like docfraq, detagger, notetab etc.
- no better results than with w3m

Thanks and regards
Marcel
 

Andy Dingley

Marcel said:
Does anyone know a good way of converting HTML to plain text, keeping as
much of the formatting as possible?

Of course not. "Plain text" doesn't have formatting. If you want to
"keep some formatting", then you first have to know just how much is
preservable. Some people claim "RTF" is "plain text" because it's
editable with a text editor rather than in binary -- how much are you
expecting to preserve?

Converting all HTML block elements to a marker, stripping out
everything except text and markers, normalizing whitespace and markers
and then converting markers to something local is usually a good start.
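The four steps above can be sketched in Java roughly as follows. This is only a minimal regex-based illustration, not a robust HTML parser: the class name `MarkerStripper`, the choice of `\u0001` as the marker, and the list of block tags are all assumptions made for the example, and character entities such as `&amp;` are left undecoded.

```java
import java.util.regex.Pattern;

final class MarkerStripper {
    // Hypothetical marker: a control character assumed not to occur in the input.
    private static final String MARK = "\u0001";

    // Opening or closing block-level tags (an illustrative, incomplete list).
    private static final Pattern BLOCK = Pattern.compile(
            "(?i)</?(p|div|br|li|ul|ol|h[1-6]|tr|table|blockquote|pre)\\b[^>]*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String toText(String html) {
        String s = BLOCK.matcher(html).replaceAll(MARK);  // 1. block tags -> marker
        s = ANY_TAG.matcher(s).replaceAll("");            // 2. strip all remaining tags
        s = s.replaceAll("\\s+", " ");                    // 3a. normalize whitespace
        s = s.replaceAll(" ?(" + MARK + " ?)+", MARK);    // 3b. collapse runs of markers
        s = s.replace(MARK, System.lineSeparator());      // 4. marker -> local line break
        return s.trim();
    }

    public static void main(String[] args) {
        // Prints four lines: Title / Some bold text. / one / two
        System.out.println(toText(
                "<h1>Title</h1>\n<p>Some <b>bold</b> text.</p><ul><li>one</li><li>two</li></ul>"));
    }
}
```

Using a marker instead of inserting line breaks directly keeps step 3 simple: whitespace can be collapsed freely without destroying the block boundaries.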

If you're already in a web context, then a DOM walker that returns the
set of text nodes might be easier.
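For the Java case in this thread, a DOM walker along those lines might look like the sketch below. It assumes the input is already well-formed XHTML (the JDK's XML parser will reject ordinary tag soup), and the class and method names are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class TextNodeWalker {
    /** Returns the non-blank text nodes of a well-formed document, in document order. */
    static List<String> textNodes(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed", e);
        }
    }

    // Depth-first walk: record text nodes, recurse into every child.
    private static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    public static void main(String[] args) {
        // Prints [Hello, world, one, two]
        System.out.println(textNodes(
                "<div><p>Hello <b>world</b></p><ul><li>one</li><li>two</li></ul></div>"));
    }
}
```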

If the HTML is crap to begin with, pre-process it with Tidy.
 

Marcel Kessler

Andy said:
Of course not. "Plain text" doesn't have formatting. If you want to
"keep some formatting", then you first have to know just how much is
preservable. Some people claim "RTF" is "plain text" because it's
editable with a text editor rather than in binary -- how much are you
expecting to preserve?

Thanks, Andy!
Obviously we can't keep e.g. a header in big letters, but one thing we
need for example is if we have a <li> tag, we don't want

* Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque
nec est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut
aliquet risus ac velit eleifend scelerisque.

but rather

* Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque
  nec est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut
  aliquet risus ac velit eleifend scelerisque.

i.e. something that keeps the indentation...
If there is some Java library out there that does this kind of thing,
that would be great... the HTML itself should already be quite nice.
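The hanging indent described here does not need a library once the item text has been extracted. A minimal sketch, assuming a fixed line width and a simple word-based wrap (the class `HangingIndent` and its parameters are hypothetical, not from any existing package):

```java
import java.util.Arrays;

final class HangingIndent {
    /**
     * Wraps text to at most `width` columns. The first line starts with the
     * bullet; continuation lines are padded so they align under the text.
     * A single word longer than the available width is left on its own line.
     */
    static String wrapItem(String bullet, String text, int width) {
        char[] pad = new char[bullet.length()];
        Arrays.fill(pad, ' ');
        String indent = new String(pad);

        StringBuilder out = new StringBuilder(bullet);
        int col = bullet.length();
        boolean lineStart = true;
        for (String word : text.trim().split("\\s+")) {
            // Break before the word if it would overflow the current line.
            if (!lineStart && col + 1 + word.length() > width) {
                out.append('\n').append(indent);
                col = indent.length();
                lineStart = true;
            }
            if (!lineStart) {
                out.append(' ');
                col++;
            }
            out.append(word);
            col += word.length();
            lineStart = false;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints:
        // * Lorem ipsum
        //   dolor sit
        //   amet
        System.out.println(wrapItem("* ", "Lorem ipsum dolor sit amet", 14));
    }
}
```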
 

Karl Uppiano

Marcel Kessler said:
Thanks, Andy!
Obviously we can't keep e.g. a header in big letters, but one thing we
need for example is if we have a <li> tag, we don't want

* Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque nec
est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut aliquet
risus ac velit eleifend scelerisque.

but rather

* Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque nec
  est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut
  aliquet risus ac velit eleifend scelerisque.

i.e. something that keeps the indentation...
If there is some Java library out there that does this kind of thing, that
would be great... the HTML itself should already be quite nice.

It sounds like you want an HTML parser with pluggable handlers that are
customizable. A SAX parser comes pretty close. If you could first convert
the HTML to well-formed HTML (with matching open and close tags, for
example) you might be able to get a non-validating SAX parser to work. Just
a thought. My guess is that it would take a fair bit of work to implement.
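As a rough illustration of the SAX idea, the sketch below uses the JDK's built-in non-validating parser with a handler that starts a new line at block elements and writes a bullet for each `<li>`. It assumes the input has already been turned into well-formed XHTML; HTML-only entities such as `&nbsp;` are not defined in plain XML and would need replacing first. The class name and the set of block tags are assumptions for the example.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxTextExtractor extends DefaultHandler {
    // Illustrative, incomplete set of elements that force a line break.
    private static final Set<String> BLOCKS = new HashSet<>(Arrays.asList(
            "p", "div", "br", "li", "ul", "ol", "h1", "h2", "h3", "tr", "table"));

    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (BLOCKS.contains(qName.toLowerCase())) {
            newLine();
        }
        if ("li".equalsIgnoreCase(qName)) {
            out.append("* ");   // per-element handling lives here
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (BLOCKS.contains(qName.toLowerCase())) {
            newLine();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).replaceAll("\\s+", " ");
        if (atLineStart() && text.startsWith(" ")) {
            text = text.substring(1);   // no leading blank at the start of a line
        }
        out.append(text);
    }

    private boolean atLineStart() {
        return out.length() == 0 || out.charAt(out.length() - 1) == '\n';
    }

    private void newLine() {
        if (!atLineStart()) {
            out.append('\n');
        }
    }

    static String extract(String xhtml) {
        SaxTextExtractor handler = new SaxTextExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed", e);
        }
        return handler.out.toString().trim();
    }

    public static void main(String[] args) {
        // Prints three lines: Hello world / * one / * two
        System.out.println(extract(
                "<div><p>Hello <b>world</b></p><ul><li>one</li><li>two</li></ul></div>"));
    }
}
```

Wrapping each list item with a hanging indent would then be a matter of buffering the item's text in `characters` and formatting it in `endElement`, which is where most of the remaining work would go.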
 
