Converting HTML to plain text

tandon.sourabh

Is there a way to convert HTML-formatted text into plain text?

For example, how to convert

my html text <br>

TO

my plain text

The actual example will be more complicated, with tons of HTML
formatting. I am trying to download an HTML web page in textual format
and convert it into plain text (i.e. what we see in the browser). I
know this will have some side effects, such as loss of white space and
undesirable formatting for the human eye, but I am willing to live
with that.

Thanks for all your help.
 
Neredbojias

> Is there a way to convert HTML-formatted text into plain text?
>
> For example, how to convert
>
> my html text <br>
>
> TO
>
> my plain text
>
> The actual example will be more complicated, with tons of HTML
> formatting. I am trying to download an HTML web page in textual format
> and convert it into plain text (i.e. what we see in the browser). I
> know this will have some side effects, such as loss of white space and
> undesirable formatting for the human eye, but I am willing to live
> with that.
>
> Thanks for all your help.

Googling "html, plain text conversion" got me 3,410,000 answers. After
reading all of them, I know how to do it but am unable to convey it
intelligibly.
 
John Hosking

> Googling "html, plain text conversion" got me 3,410,000 answers. After
> reading all of them, I know how to do it but am unable to convey it
> intelligibly.

I hate it when GoogleGroupers don't use Google. :-(
 
tandon.sourabh

I should have clarified this before -- I need to do this
automatically. I have a Perl script which downloads the web pages
(there are hundreds of them), and before parsing them I want to have
them in text (for easy parsing).

For the above purpose, IE, Firefox, etc. won't be useful, as automatic
conversion is desired -- preferably a utility or a function that can
be called from within a Perl program.

I did find a text-to-HTML converter function, but nothing that can
remove HTML markup.

Thanks.
 
Jeremy J Starcher

> Is there a way to convert HTML-formatted text into plain text?
>
> For example, how to convert
>
> my html text <br>
>
> TO
>
> my plain text


You did not say what platform you are on, but you might want to check out
the 'lynx' browser. It is a text-mode browser that has the ability to
dump out rendered text-only.

lynx -dump -nolist http://example.com/

Executables are available for most forms of *nix and for Windows, and I
have heard of a Mac port.
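
For batch use, the same command can be driven from a script. What follows is only a minimal sketch in Python, not something posted in this thread: the function names are made up for illustration, and it assumes the `lynx` executable is installed and on the PATH. The same idea applies to backticks or `system` in Perl.

```python
import subprocess


def lynx_dump_command(url):
    """Build the argument list for a text-only dump of `url` (hypothetical helper)."""
    # -dump   : write the rendered page to standard output and exit
    # -nolist : omit the numbered list of links that -dump normally appends
    return ["lynx", "-dump", "-nolist", url]


def fetch_as_text(url, timeout=60):
    """Run lynx and return the rendered page as a string.

    Assumes lynx is installed; raises FileNotFoundError if it is not,
    and CalledProcessError if lynx exits with a non-zero status.
    """
    result = subprocess.run(
        lynx_dump_command(url),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return result.stdout


# Example call (needs lynx and network access):
#   text = fetch_as_text("http://example.com/")
```

Passing the arguments as a list rather than one shell string avoids quoting problems when a URL contains characters such as `&` or `?`.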
 
Neredbojias

> I hate it when GoogleGroupers don't use Google. :-(

Yeah...:) I guess I was a little smartassy there but hope the guy
didn't take it wrong and actually _tries_ Googling for the info.
 
Raymond Schmit

> Yeah...:) I guess I was a little smartassy there but hope the guy
> didn't take it wrong and actually _tries_ Googling for the info.

When I need to convert a simple HTML document into plain text: after
selecting the relevant part of the HTML document, I do "Copy" then
"Paste" into Notepad.
 
Nicole

> Is there a way to convert HTML-formatted text into plain text?


PHP can do that for you.
This function will do it:


function remove_html_tags($text)
{
    // strip_tags() is PHP's built-in function for removing HTML and PHP tags
    $text = strip_tags($text);
    return $text;
}


Use the function in a program with this line:

$myPlainText = remove_html_tags($myHTMLtext);


Nicole
 
dorayme

> When I need to convert a simple HTML document into plain text: after
> selecting the relevant part of the HTML document, I do "Copy" then
> "Paste" into Notepad.

Has anyone mentioned selecting the text from the browser display and
pasting or (a Mac thing) dragging a clipping to desktop?
 
Travis Newbury

> Is there a way to convert HTML-formatted text into plain text?
> For example, how to convert
> my html text <br>
> my plain text

Uh, looking at it in a browser, then select all and copy/paste into
Notepad?
 
rf

dorayme said:
> Has anyone mentioned selecting the text from the browser display and
> pasting or (a Mac thing) dragging a clipping to desktop?

I think somebody just did.

I only *think* this, though. It may have been something entirely different.
 
dorayme

rf said:
> I think somebody just did.

What happened to our little bet? By the way, that logo example I gave in
a recent thread, sizing it in ems, looks great on my machine on modern
browsers within a reasonable range of text sizes.

I became *very* disenchanted with this technique after observing it in
IE6 years ago. But our bet is not about IE.

<http://dorayme.netweaver.com.au/ruebner.html>

Please don't go and spend that last $50 of yours on grog this weekend,
think of it as already mine... pal!

(btw, I recall Alan Flavell discussing this technique years ago and that
is when I made various experiments.)
 
Johannes Hafner

Nicole said:
> function remove_html_tags($text)
> {
>     $text = strip_tags($text);
>     return $text;
> }

Is there any reason to wrap your own function around strip_tags() if it
is called with exactly the same argument AND returns exactly the same
value as strip_tags() already does?

| echo remove_html_tags($text);

would be exactly the same as

| echo strip_tags($text);


greets,
Johannes
 
Raymond Schmit

> Has anyone mentioned selecting the text from the browser display and
> pasting or (a Mac thing) dragging a clipping to desktop?

Yes! 30 minutes later than me :)
 
Jukka K. Korpela

> I should have clarified this before

Yes, and now you should have quoted or paraphrased something in order to
give context. Please take a crash course on "how to quote on Usenet". I'm
sure Google will be your friend in finding a few dozen alternatives.

> -- I need to do this automatically.

You also need to define what "this" is.

> I have a Perl script which downloads the web pages
> (there are hundreds of them) and before parsing them I want to have
> them in text (for easy parsing).

You have an odd definition for "parsing", which usually means (in a context
like this) recognizing markup.

Anyhoo, there is a large number of ways to convert marked-up text into plain
text. Just omitting tags is easy and can be performed with a Perl one-liner
(though it is slightly easier and therefore more common to try doing it the
wrong way, ignoring the default greediness of Perl matching). But is that
correct? If you just omit the tag from
foo<br>bar
you change the meaning (since "foo bar" and "foobar" are different at plain
text level). Similarly, omitting an <img src="foo" alt="bar"> tag does an
injustice to the content, as the author has clearly provided a plain text
alternative to be used in place of the image. There are so many relevant
issues here that it is much easier to write a program for the job than to
write down the exact specifications, i.e. to describe what really should be
done.
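
To make the points above concrete, here is a minimal sketch in Python (used purely for illustration rather than Perl; all function and class names are hypothetical and not from this thread). It contrasts greedy and non-greedy tag stripping, then shows a small parser-based converter, built on the standard-library `html.parser` module, that turns `<br>` into a line break and substitutes the `alt` text for images. Which tags count as line-breaking is an assumption of this sketch, not a specification.

```python
import re
from html.parser import HTMLParser


def strip_tags_greedy(html):
    # Wrong: ".*" is greedy, so one match runs from the first "<" to the last ">"
    return re.sub(r"<.*>", "", html)


def strip_tags_simple(html):
    # Better: each match stops at the first ">", but tags vanish without a trace
    return re.sub(r"<[^>]*>", "", html)


class TextExtractor(HTMLParser):
    """Collects text, treating some tags as line breaks (an assumed list)."""

    BREAKING = {"br", "p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.BREAKING:
            self.parts.append("\n")
        elif tag == "img":
            # Use the author's text alternative in place of the image
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BREAKING:
            self.parts.append("\n")

    def handle_data(self, data):
        # Ignore the contents of <script> and <style>
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace within each line and drop empty lines
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

With these definitions, `strip_tags_greedy("<b>foo</b> bar")` returns `" bar"` (the word "foo" is lost), `strip_tags_simple("foo<br>bar")` returns `"foobar"`, and `html_to_text("foo<br>bar")` returns `"foo\nbar"`. Tables, lists, links, and character encodings would all need further decisions, which is exactly the specification problem described above.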
 
Neredbojias

> When I need to convert a simple HTML document into plain text: after
> selecting the relevant part of the HTML document, I do "Copy" then
> "Paste" into Notepad.

Well, with that approach, one could just change the file's extension to
"txt", too. Anyway, I gathered, and a later OP post confirmed, that a
more-or-less automatic process was desired.
 
