Parsing

Timothy Wu

I'm parsing Firefox bookmarks and writing the same bookmarks to another
file. To make sure I read and write UTF-8 correctly, I open files like
this for both reading and writing:

codecs.open(file, "r", "utf-8")
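A minimal sketch of that read/write round trip, for illustration only. The file names and the helper copy_utf8 are hypothetical and not from the original post; the only call taken from the post is codecs.open with the "utf-8" encoding.

```python
import codecs
import os
import tempfile


def copy_utf8(src_path, dst_path):
    """Copy a text file line by line, decoding from and encoding to UTF-8."""
    with codecs.open(src_path, "r", "utf-8") as src:
        with codecs.open(dst_path, "w", "utf-8") as dst:
            for line in src:
                # Each line is already a decoded Unicode string here,
                # not a sequence of raw UTF-8 bytes.
                dst.write(line)


# Hypothetical demo files in a temporary directory.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "bookmarks_in.html")
dst_path = os.path.join(tmp_dir, "bookmarks_out.html")

with codecs.open(src_path, "w", "utf-8") as f:
    f.write(u"<TITLE>Caf\u00e9 \u65e5\u672c\u8a9e</TITLE>\n")

copy_utf8(src_path, dst_path)
```

The point of the sketch is that decoding happens at the file boundary, so everything between the read and the write works on Unicode text.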

For the regular expression, I parse like this:

m = re.search("<TITLE>(.*?)</TITLE>", line, re.I)

How do I tell the regular expression to parse in UTF-8? From the docs it
seems like I can do re.compile("<TITLE>(.*?)</TITLE>", 'U') for Unicode.
But does it need to be specified as UTF-8 rather than some other
Unicode encoding? Or does that matter at all?

Also, I'm not calling compile() directly at all; I'm simply calling
re.search(). How would I specify Unicode there? Is it simply
re.flags = 'U' before any call to search()?

Timothy
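
For reference, a hedged sketch of how the flags are normally passed. It assumes the lines come from a file opened with codecs.open, so they are already decoded Unicode strings and the regular expression never sees UTF-8 bytes; no encoding needs to be named for re. The flag is the constant re.U (or re.UNICODE), not the string 'U', and it is combined with re.I using the | operator. The sample line below is made up for illustration.

```python
import re

# A made-up line, standing in for text read through codecs.open(..., "utf-8").
line = u"<title>Caf\u00e9 \u65e5\u672c\u8a9e</title>"

# Flags go in the third argument of re.search; combine several with "|".
m = re.search(u"<TITLE>(.*?)</TITLE>", line, re.I | re.U)
title = m.group(1) if m else None

# The same flags can be given once to re.compile and reused.
title_re = re.compile(u"<TITLE>(.*?)</TITLE>", re.I | re.U)
m2 = title_re.search(line)
```

Note that re.U only changes how classes such as \w, \d, \s and \b treat non-ASCII characters; a pattern like (.*?) matches Unicode text with or without it. On Python 3, string patterns are Unicode-aware by default, so the flag is redundant there.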
 
