How to convert markup text to plain text in Python?

geoffbache

I have some marked up text and would like to convert it to plain text,
by simply removing all the tags. Of course I can do it from first
principles but I felt that among all Python's markup tools there must
be something that would do this simply, without having to create an
XML parser etc.

I've looked around a bit but failed to find anything, any tips?

(e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")

Regards,
Geoff
 
Tim Chase

Well, if all you want to do is remove everything from a "<" to a
">", you can use
>>> s = "<B>Today</B> is <U>Friday</U>"
>>> import re
>>> r = re.compile('<[^>]*>')
>>> print(r.sub('', s))
Today is Friday

It should even work for semi-pathological cases such as

s = """You can find my <a
href='http://example.com'>thesis</a
> online"""

where the tag contents are split across lines. There are more
pathological cases where tags aren't well-formed, e.g.

s = "This <tag>has a > sign in it and <odd<ly>-nested> tags"

in which case you get what you deserve for making such
pathological conditions ;-)

-tkc
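
For reference, a minimal Python 3 sketch of the regular-expression approach above, run against all three strings from the post. The helper name `strip_tags` and the variable names are only illustrative and do not come from the thread. Note that this pattern leaves character entities such as `&amp;` untouched.

```python
import re

# Same pattern as in the post: a '<', any run of non-'>' characters
# (newlines included), then a closing '>'.
TAG_RE = re.compile(r'<[^>]*>')


def strip_tags(markup):
    """Remove every <...> span from markup (illustrative helper name)."""
    return TAG_RE.sub('', markup)


simple = "<B>Today</B> is <U>Friday</U>"
multiline = """You can find my <a
href='http://example.com'>thesis</a
> online"""
malformed = "This <tag>has a > sign in it and <odd<ly>-nested> tags"

print(strip_tags(simple))     # Today is Friday
print(strip_tags(multiline))  # You can find my thesis online
print(strip_tags(malformed))  # This has a > sign in it and -nested> tags
```

The last line shows the pathological case: the stray ">" characters survive because the pattern only removes complete "<...>" spans.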
 
ph

Quick but very dirty way:

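import urllib.request
data = urllib.request.urlopen('http://google.com').read().decode('utf-8', 'replace')
# Split on '<', then keep only what follows the first '>' in each piece
data = ''.join([x.split('>', 1)[-1] for x in data.split('<')])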
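
The same split-and-join trick can be tried on a plain string without fetching a page. This is only a sketch; the function name `strip_tags_split` is made up for illustration. It also shows one way the trick can go wrong: text containing a bare ">" loses everything up to that character.

```python
def strip_tags_split(markup):
    """Drop tags by splitting on '<' and keeping the text after the first '>'."""
    pieces = markup.split('<')
    # Each piece looks like 'tagname>text'; a piece with no '>' is kept whole.
    return ''.join(piece.split('>', 1)[-1] for piece in pieces)


print(strip_tags_split("<B>Today</B> is <U>Friday</U>"))  # Today is Friday

# A bare '>' in the text is mistaken for the end of a tag,
# so the leading "1 " is lost here.
print(repr(strip_tags_split("1 > 0 is <B>true</B>")))  # ' 0 is true'
```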
 
Steve Holden

The real answer to this question is "learn how to use Beautiful Soup" --
see http://www.crummy.com/software/BeautifulSoup/

regards
Steve
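
A minimal sketch of what that might look like with a current Beautiful Soup release. It assumes the third-party `beautifulsoup4` package (imported as `bs4`), which is a later packaging than the one linked above; the helper name `strip_tags_bs` is illustrative. The import is guarded so the snippet still loads where the package is missing.

```python
try:
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
except ImportError:
    BeautifulSoup = None  # assumption: the package may not be installed


def strip_tags_bs(markup):
    """Return the text content of markup with all tags removed."""
    if BeautifulSoup is None:
        raise RuntimeError("beautifulsoup4 is not installed")
    # "html.parser" selects the parser that ships with Python.
    return BeautifulSoup(markup, "html.parser").get_text()


if BeautifulSoup is not None:
    print(strip_tags_bs("<B>Today</B> is <U>Friday</U>"))  # Today is Friday
```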
 
Tim Chase

Steve Holden said:
The real answer to this question is "learn how to use Beautiful Soup" --
see http://www.crummy.com/software/BeautifulSoup/

Yes, for more pathological cases, BS does a great job of parsing
junk :)

However, as BS isn't batteries-included [Aside: BS and pyparsing
are two common solutions to problems that would make great
additions to the standard library], using a RE to make a
best-effort guess is a good first approximation of a solution
without needing to download extra packages--no matter how useful
those extra packages may be.

-tkc
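
On the batteries-included point, the standard library's own HTML parser is another option that needs no download. It is not mentioned in the thread, so treat this as a sketch under that assumption, using the Python 3 module name `html.parser`; the class and function names are illustrative. Unlike the regular expression, it also decodes character entities.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text between tags (illustrative class name)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text outside of tags.
        self.parts.append(data)


def strip_tags_stdlib(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


print(strip_tags_stdlib("<B>Today</B> is <U>Friday</U>"))  # Today is Friday
print(strip_tags_stdlib("Fish &amp; <i>chips</i>"))        # Fish & chips
```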
 
Paul McGuire

pyparsing includes an example script for stripping tags from HTML
source. See it on the wiki at http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py.

-- Paul
 
Zentrader

I have some marked up text and would like to convert it to plain text,

If this is just a quick and dirty problem, you can also use one of the
lynx/elinks/links2 browsers and dump the contents to a file. On Linux
it would be
lynx -dump http://www.etc > text.txt
Lynx is also available for MS Windows, but I am not sure about the other
two.
 
Stefan Behnel

geoffbache said:
I have some marked up text and would like to convert it to plain text,
by simply removing all the tags. Of course I can do it from first
principles but I felt that among all Python's markup tools there must
be something that would do this simply, without having to create an
XML parser etc.

I've looked around a bit but failed to find anything, any tips?

(e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")
u'Today is Friday'


http://codespeak.net/lxml

Stefan
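
The interpreter session that produced the output above is not preserved in this copy of the thread. One way to get a similar result with lxml is sketched below. It assumes the third-party `lxml` package and its `lxml.html` module, and is not necessarily what was originally run; the helper name `strip_tags_lxml` is illustrative. The import is guarded so the snippet still loads where the package is missing.

```python
try:
    from lxml import html as lxml_html  # third-party: pip install lxml
except ImportError:
    lxml_html = None  # assumption: the package may not be installed


def strip_tags_lxml(markup):
    """Parse markup as an HTML fragment and return only its text."""
    if lxml_html is None:
        raise RuntimeError("lxml is not installed")
    # text_content() concatenates the text of the element and its children.
    return lxml_html.fromstring(markup).text_content()


if lxml_html is not None:
    print(strip_tags_lxml("<B>Today</B> is <U>Friday</U>"))  # Today is Friday
```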
 
Stefan Behnel

This might be of interest:

http://pypi.python.org/pypi/haufe.stripml

Stefan
 
