xhtml encoding question

Tim Arnold

I have to follow a specification for producing xhtml files.
The original files are in cp1252 encoding and I must reencode them to utf-8.
Also, I have to replace certain characters with html entities.

I think I've got this right, but I'd like to hear if there's something
I'm doing that is dangerous or wrong.

Please see the appended code, and thanks for any comments or suggestions.

I have two functions, translate (replaces high characters with entities)
and reencode (um, reencodes):
---------------------------------
import codecs, StringIO
from lxml import etree
high_chars = {
    0x2014: '&mdash;',   # 'EM DASH',
    0x2013: '&ndash;',   # 'EN DASH',
    0x0160: '&Scaron;',  # 'LATIN CAPITAL LETTER S WITH CARON',
    0x201d: '&rdquo;',   # 'RIGHT DOUBLE QUOTATION MARK',
    0x201c: '&ldquo;',   # 'LEFT DOUBLE QUOTATION MARK',
    0x2019: "&rsquo;",   # 'RIGHT SINGLE QUOTATION MARK',
    0x2018: "&lsquo;",   # 'LEFT SINGLE QUOTATION MARK',
    0x2122: '&trade;',   # 'TRADE MARK SIGN',
    0x00A9: '&copy;',    # 'COPYRIGHT SYMBOL',
}
def translate(string):
    s = ''
    for c in string:
        if ord(c) in high_chars:
            c = high_chars.get(ord(c))
        s += c
    return s

def reencode(filename, in_encoding='cp1252',out_encoding='utf-8'):
    with codecs.open(filename,encoding=in_encoding) as f:
        s = f.read()
    sio = StringIO.StringIO(translate(s))
    parser = etree.HTMLParser(encoding=in_encoding)
    tree = etree.parse(sio, parser)
    result = etree.tostring(tree.getroot(), method='html',
                            pretty_print=True,
                            encoding=out_encoding)
    with open(filename,'wb') as f:
        f.write(result)

if __name__ == '__main__':
    fname = 'mytest.htm'
    reencode(fname)
 
Stefan Behnel

Tim Arnold, 31.01.2012 19:09:
I have to follow a specification for producing xhtml files.
The original files are in cp1252 encoding and I must reencode them to utf-8.
Also, I have to replace certain characters with html entities.

I think I've got this right, but I'd like to hear if there's something I'm
doing that is dangerous or wrong.

Please see the appended code, and thanks for any comments or suggestions.

I have two functions, translate (replaces high characters with entities)
and reencode (um, reencodes):
---------------------------------
import codecs, StringIO
from lxml import etree
high_chars = {
    0x2014: '&mdash;',   # 'EM DASH',
    0x2013: '&ndash;',   # 'EN DASH',
    0x0160: '&Scaron;',  # 'LATIN CAPITAL LETTER S WITH CARON',
    0x201d: '&rdquo;',   # 'RIGHT DOUBLE QUOTATION MARK',
    0x201c: '&ldquo;',   # 'LEFT DOUBLE QUOTATION MARK',
    0x2019: "&rsquo;",   # 'RIGHT SINGLE QUOTATION MARK',
    0x2018: "&lsquo;",   # 'LEFT SINGLE QUOTATION MARK',
    0x2122: '&trade;',   # 'TRADE MARK SIGN',
    0x00A9: '&copy;',    # 'COPYRIGHT SYMBOL',
}
def translate(string):
    s = ''
    for c in string:
        if ord(c) in high_chars:
            c = high_chars.get(ord(c))
        s += c
    return s

I hope you are aware that this is about the slowest possible algorithm
(well, the slowest one that doesn't do anything unnecessary). Since none of
this is required when parsing or generating XHTML, I assume your spec tells
you that you should do these replacements?

def reencode(filename, in_encoding='cp1252',out_encoding='utf-8'):
    with codecs.open(filename,encoding=in_encoding) as f:
        s = f.read()
    sio = StringIO.StringIO(translate(s))
    parser = etree.HTMLParser(encoding=in_encoding)
    tree = etree.parse(sio, parser)

Yes, you are doing something dangerous and wrong here. For one, you are
decoding the data twice. Then, didn't you say XHTML? Why do you use the
HTML parser to parse XML?

    result = etree.tostring(tree.getroot(), method='html',
                            pretty_print=True,
                            encoding=out_encoding)
    with open(filename,'wb') as f:
        f.write(result)

Use tree.write(f, ...)

Assuming you really meant XHTML and not HTML, I'd just drop your entire
code and do this instead:

tree = etree.parse(in_path)
tree.write(out_path, encoding='utf8', pretty_print=True)

Note that I didn't provide an input encoding. XML is safe in that regard.
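
A minimal sketch of why no input encoding is needed for XML. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, and it is written for Python 3; both are assumptions for illustration, not what the posts above use. The sample document and its ISO-8859-1 declaration are made up for the demo.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: bytes in ISO-8859-1, declared in the XML prolog.
# 0xA9 is the copyright sign in that encoding.
raw = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>\xa9 2012</p>'

# No encoding argument: the parser reads it from the XML declaration.
root = ET.fromstring(raw)
text = root.text

# Serialise as UTF-8 bytes; the copyright sign becomes the bytes C2 A9.
utf8_bytes = ET.tostring(root, encoding="utf-8")
```

The same idea does not carry over to plain HTML, which is why the HTML parser is given an explicit encoding elsewhere in this thread.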

Stefan
 
Ulrich Eckhardt

Am 31.01.2012 19:09, schrieb Tim Arnold:
high_chars = {
    0x2014: '&mdash;',   # 'EM DASH',
    0x2013: '&ndash;',   # 'EN DASH',
    0x0160: '&Scaron;',  # 'LATIN CAPITAL LETTER S WITH CARON',
    0x201d: '&rdquo;',   # 'RIGHT DOUBLE QUOTATION MARK',
    0x201c: '&ldquo;',   # 'LEFT DOUBLE QUOTATION MARK',
    0x2019: "&rsquo;",   # 'RIGHT SINGLE QUOTATION MARK',
    0x2018: "&lsquo;",   # 'LEFT SINGLE QUOTATION MARK',
    0x2122: '&trade;',   # 'TRADE MARK SIGN',
    0x00A9: '&copy;',    # 'COPYRIGHT SYMBOL',
}

You could use Unicode string literals directly instead of using the
codepoint, making it a bit more self-documenting and saving you the
later call to ord():

high_chars = {
    u'\u2014': '&mdash;',
    u'\u2013': '&ndash;',
    ...
}

    for c in string:
        if ord(c) in high_chars:
            c = high_chars.get(ord(c))
        s += c
    return s

Instead of checking if there is a replacement and then looking up the
replacement again, just use the default:

for c in string:
    s += high_chars.get(c, c)

Alternatively, if you find that clearer, you could also check if the
return value of get() is None to find out if there is a replacement:

for c in string:
    r = high_chars.get(c)
    if r is None:
        s += c
    else:
        s += r


Uli
 
Peter Otten

Ulrich said:
Am 31.01.2012 19:09, schrieb Tim Arnold:

You could use Unicode string literals directly instead of using the
codepoint, making it a bit more self-documenting and saving you the
later call to ord():

high_chars = {
    u'\u2014': '&mdash;',
    u'\u2013': '&ndash;',
    ...
}

Instead of checking if there is a replacement and then looking up the
replacement again, just use the default:

for c in string:
    s += high_chars.get(c, c)

Alternatively, if you find that clearer, you could also check if the
return value of get() is None to find out if there is a replacement:

for c in string:
    r = high_chars.get(c)
    if r is None:
        s += c
    else:
        s += r

It doesn't matter for the OP (see Stefan Behnel's post), but if you want to
replace characters in a unicode string the best way is probably the
translate() method:
u'&copy;&trade;'
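
A small sketch of the translate() approach, written for Python 3, where every str is a unicode string. The table name, the helper name and the sample text are made up for illustration; the table is keyed by integer code points, which is what translate() looks up.

```python
# Map code points (integers) to replacement strings, as str.translate expects.
ENTITY_TABLE = {
    0x2014: "&mdash;",
    0x2013: "&ndash;",
    0x2122: "&trade;",
    0x00A9: "&copy;",
}

def escape_high_chars(text):
    """Replace the mapped characters with entities in a single pass."""
    return text.translate(ENTITY_TABLE)

sample = "Acme\u2122 \u2014 \u00a9 2012"
escaped = escape_high_chars(sample)
```

Characters that are not in the table pass through unchanged, so no explicit default is needed.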
 
Ulrich Eckhardt

Am 01.02.2012 10:32, schrieb Peter Otten:
It doesn't matter for the OP (see Stefan Behnel's post), but if you want to
replace characters in a unicode string the best way is probably the
translate() method:

u'&copy;&trade;'

Yes, this is both more expressive and at the same time probably even
more efficient.


Question though:

>>> u'abc'.translate({u'a': u'A'})
u'abc'

I would call this a chance to improve Python. According to the
documentation, using a string as key is invalid, but it neither raises an
exception nor does it do the obvious and accept single-character strings
as keys.
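
For comparison, a short Python 3 sketch of the same situation (the thread itself is about Python 2, so this is an aside rather than an answer to the question). In Python 3, translate() still ignores string keys silently, but str.maketrans() converts single-character string keys to code points and rejects longer ones.

```python
text = "abc"

# A dict keyed by one-character strings is silently ignored by translate():
# the lookup uses integer code points, so the key 'a' is never found.
unchanged = text.translate({"a": "A"})

# str.maketrans() converts one-character string keys to code points.
table = str.maketrans({"a": "A"})
changed = text.translate(table)

# It also rejects string keys that are longer than one character.
try:
    str.maketrans({"ab": "X"})
    rejected = False
except ValueError:
    rejected = True
```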


Thoughts?


Uli
 
Tim Arnold

Tim Arnold, 31.01.2012 19:09:

I hope you are aware that this is about the slowest possible algorithm
(well, the slowest one that doesn't do anything unnecessary). Since none of
this is required when parsing or generating XHTML, I assume your spec tells
you that you should do these replacements?
I wasn't aware of it, but I am now--code's embarrassing now.
The spec I must follow forces me to do the translation.

I am actually working with HTML, not XHTML, which makes a huge
difference; sorry for that.

Ulrich's line of code for translate is elegant.
for c in string:
    s += high_chars.get(c,c)
Yes, you are doing something dangerous and wrong here. For one, you are
decoding the data twice. Then, didn't you say XHTML? Why do you use the
HTML parser to parse XML?
I see that I'm decoding twice now, thanks.

Also, I now see that when lxml writes the result back out the entities I
got from my translate function are resolved, which defeats the whole
purpose.
Use tree.write(f, ...)

From all the info I've received on this thread, plus some
additional reading, I think I need the following code.

Use the HTMLParser because the source files are actually HTML, and use
output from etree.tostring() as input to translate() as the very last step.

def reencode(filename, in_encoding='cp1252', out_encoding='utf-8'):
    parser = etree.HTMLParser(encoding=in_encoding)
    tree = etree.parse(filename, parser)
    result = etree.tostring(tree.getroot(), method='html',
                            pretty_print=True,
                            encoding=out_encoding)
    with open(filename, 'wb') as f:
        f.write(translate(result))

not simply tree.write(f...) because I have to do the translation at the
end, so I get the entities instead of the resolved entities from lxml.

Again, it would be simpler if this was xhtml, but I misspoke
(mis-wrote?) when I said xhtml; this is for html.
Assuming you really meant XHTML and not HTML, I'd just drop your entire
code and do this instead:

tree = etree.parse(in_path)
tree.write(out_path, encoding='utf8', pretty_print=True)

Note that I didn't provide an input encoding. XML is safe in that regard.

Stefan

thanks everyone for the help.

--Tim Arnold
 
Stefan Behnel

Tim Arnold, 01.02.2012 19:15:
I wasn't aware of it, but I am now--code's embarrassing now.
The spec I must follow forces me to do the translation.

I am actually working with html not xhtml; which makes a huge difference,

We all learn.

Ulrich's line of code for translate is elegant.
for c in string:
    s += high_chars.get(c,c)

Still not efficient because it builds the string one character at a time
and needs to reallocate (and potentially copy) the string buffer quite
frequently in order to do that. You are lucky with CPython, because it has
an internal optimisation that mitigates this overhead on some platforms.
Other Python implementations don't have that, and even the optimisation in
CPython is platform specific (works well on Linux, for example).
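
One common way to avoid the repeated reallocation is to collect the pieces and join them once. This is a minimal Python 3 sketch; the table is a made-up two-entry subset keyed by characters, and the function name is hypothetical.

```python
# Sample subset of the replacement table, keyed by the characters themselves.
HIGH_CHARS = {"\u2014": "&mdash;", "\u2013": "&ndash;"}

def translate_join(text, table=HIGH_CHARS):
    # Build all pieces first and join once, instead of growing a string with +=.
    return "".join(table.get(c, c) for c in text)

result = translate_join("a \u2014 b \u2013 c")
```

This still does one dictionary lookup per character in Python code, so it fixes the string building but not the per-character overhead.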

Peter Otten presented a better way of doing it.

From all the info I've received on this thread, plus some additional
reading, I think I need the following code.

Use the HTMLParser because the source files are actually HTML, and use
output from etree.tostring() as input to translate() as the very last step.

def reencode(filename, in_encoding='cp1252', out_encoding='utf-8'):
    parser = etree.HTMLParser(encoding=in_encoding)
    tree = etree.parse(filename, parser)
    result = etree.tostring(tree.getroot(), method='html',
                            pretty_print=True,
                            encoding=out_encoding)
    with open(filename, 'wb') as f:
        f.write(translate(result))

not simply tree.write(f...) because I have to do the translation at the
end, so I get the entities instead of the resolved entities from lxml.

Yes, that's better.

Still one thing (since you didn't show us your final translate() function):
you do the character escaping on a UTF-8 encoded string and made the
encoding configurable. That means that the characters you are looking for
must also be encoded with the same encoding in order to find matches.
However, if you ever choose a different target encoding that doesn't have
the nice properties of UTF-8's byte sequences, you may end up with
ambiguous byte sequences in the output that your translate() function
accidentally matches on, thus potentially corrupting your data.
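
To make the risk concrete, here is a small Python 3 sketch of a false match at the byte level. UTF-16-LE is chosen only as an example of an encoding without UTF-8's properties, and the sample text is made up: it contains no em dash, yet the em dash's byte pattern shows up across the boundary of two other characters.

```python
em_dash = "\u2014"
# Text that does NOT contain an em dash: U+1400 followed by a space.
text = "\u1400 "

encoded = text.encode("utf-16-le")    # bytes 00 14 20 00
needle = em_dash.encode("utf-16-le")  # bytes 14 20

in_text = em_dash in text    # searching characters: no match
in_bytes = needle in encoded  # searching bytes: matches across two characters

# With UTF-8 the same byte search finds nothing, because an encoded
# character can only ever match at a character boundary.
in_utf8 = em_dash.encode("utf-8") in text.encode("utf-8")
```

A byte-level replace on the UTF-16 data would therefore rewrite bytes that belong to two unrelated characters.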

Assuming that you are using Python 2, you may even be accidentally doing
the replacement using Unicode character strings, which then only happens to
work on systems that use UTF-8 as their default encoding. Python 3 has
fixed this trap, but you have to take care to avoid it in Python 2.

I'd prefer serialising the documents into a unicode string
(encoding='unicode'), then post-processing that and finally encoding it to
the target encoding when writing it out. But you'll have to see how that
works out together with your escaping step, and also how it impacts the
HTML <meta> tag that states the document encoding.
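
A rough Python 3 sketch of that order of operations: replace on text first, encode last. To keep it runnable without lxml, a decoded cp1252 byte string stands in for the serialised document; in the approach described above, that text would come from serialising the tree with encoding='unicode'. The table subset and the function name are hypothetical, and the <meta> tag question is not handled here.

```python
# Sample subset of the replacement table, keyed by code point.
ENTITY_TABLE = {0x2014: "&mdash;", 0x00A9: "&copy;"}

def escape_and_encode(html_text, out_encoding="utf-8"):
    # 1. Work on text (str), so matching is per character, not per byte.
    escaped = html_text.translate(ENTITY_TABLE)
    # 2. Encode only at the very end, when producing the output bytes.
    return escaped.encode(out_encoding)

# Stand-in for the serialised document: in cp1252, 0xA9 is the
# copyright sign and 0x97 is the em dash.
cp1252_bytes = b"<p>\xa9 2012 \x97 Acme</p>"
html_text = cp1252_bytes.decode("cp1252")
out = escape_and_encode(html_text)
```

Because the replacement happens before encoding, changing out_encoding cannot cause the byte-level mismatches described above.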

Stefan
 
Peter Otten

Ulrich said:
Am 01.02.2012 10:32, schrieb Peter Otten:

Yes, this is both more expressive and at the same time probably even
more efficient.


Question though:

>>> u'abc'.translate({u'a': u'A'})
u'abc'

I would call this a chance to improve Python. According to the
documentation, using a string is invalid, but it neither raises an
exception nor does it do the obvious and accept single-character strings
as keys.


Thoughts?

How could this raise an exception? You'd either need a typed dictionary (int
--> unicode) or translate() would have to verify that all keys are indeed
integers. The former would go against the grain of Python, the latter would
make the method less flexible as the set of keys currently need not be
predefined:
...     def __getitem__(self, key):
...         return unichr(key).upper()
...
u'ALPHA'
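
The session above lost its first and last input lines in transit. A self-contained Python 3 sketch of the same idea follows; the class name is hypothetical and chr() replaces Python 2's unichr(). It shows that translate() only needs an object with __getitem__, so the set of keys does not have to exist in advance.

```python
class UpperTable:
    # translate() calls __getitem__ with each character's integer code point.
    def __getitem__(self, key):
        return chr(key).upper()

result = "alpha".translate(UpperTable())
```

A translate() that validated all keys up front could not accept a mapping like this, since there are no keys to inspect.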

Using unicode instead of integer keys would be nice but breaks backwards
compatibility, using both could double the number of dictionary lookups.
 
Ulrich Eckhardt

Am 02.02.2012 12:02, schrieb Peter Otten:
Ulrich said:
>>> u'abc'.translate({u'a': u'A'})
u'abc'

I would call this a chance to improve Python. According to the
documentation, using a string [as key] is invalid, but it neither raises
an exception nor does it do the obvious and accept single-character
strings as keys.


Thoughts?

How could this raise an exception? You'd either need a typed dictionary (int
--> unicode) or translate() would have to verify that all keys are indeed
integers.

The latter is exactly what I would have done, i.e. scan the dictionary
for invalid values, in the spirit of not letting errors pass unnoticed.

The former would go against the grain of Python, the latter would
make the method less flexible as the set of keys currently need not be
predefined:

...     def __getitem__(self, key):
...         return unichr(key).upper()
...
u'ALPHA'

Working with __getitem__ is a point. I'm not sure if it is reasonable to
expect this to work though. I'm -0 on that. I could also imagine a
completely separate path for iterable and non-iterable mappings.

Using unicode instead of integer keys would be nice but breaks backwards
compatibility, using both could double the number of dictionary lookups.

Dictionary lookups are constant time and well-optimized, so I'd actually
go for allowing both and paying that price. I could even imagine
preprocessing the supplied dictionary while checking for invalid values.
The result could be a structure that makes use of the fact that Unicode
codepoints fit in fewer than 22 bits and that makes the way from the elements of the
source sequence to the according map entry as short as possible (I'm not
sure if using codepoints or single-character strings is faster).
However, those are early optimizations of which I'm not sure if they are
worth it.

Anyway, thanks for your thoughts, they are always appreciated!

Uli
 
