textwrap and combining diacritical marks

Berteun Damman

Hello,

When using the textwrap module, wrapping always uses len() to determine
the length of the string being wrapped. This is a sensible thing to do
in many circumstances, but I think there are cases where it does not
lead to the desired result.

I assume this module is mostly used where text is formatted to be
presented to a user, e.g. in a console application. The number of
characters in the string, as determined by len(), might not be the
number of columns occupied. Some of the characters might be combining
diacritical marks, which go on top of the previous character, e.g. the
string de'ge'ne're' (where each ' indicates a combining acute accent)
has a len() of 12 but will only display with a width of 8 columns.
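
To make the mismatch concrete, here is a minimal sketch, assuming
Python 3 (where str is a Unicode string). The helper name display_width
is hypothetical, and it only discounts combining marks; it does not
handle wide characters or control characters.

```python
import unicodedata

def display_width(text):
    """Rough column estimate: count code points that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

# "degenere" with a combining acute accent (U+0301) after each "e"
word = "de\u0301ge\u0301ne\u0301re\u0301"

print(len(word))            # 12 code points
print(display_width(word))  # 8 columns on a terminal that combines marks
```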

The string might also include character sequences that switch the
console to bold or underline mode, which have zero display width. If
this happens a lot, the resulting text might seem very badly formatted
because of all these zero-width character sequences.
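
As an illustration of the escape-sequence case, the sketch below strips
ANSI "Select Graphic Rendition" sequences before measuring. It assumes
the console uses ANSI/VT100-style codes such as ESC[1m (bold) and ESC[0m
(reset); the names SGR_RE and visible_len are made up for this example,
and other kinds of control sequences are not covered.

```python
import re

# Matches CSI ... m sequences (bold, underline, colours, reset).
SGR_RE = re.compile(r"\x1b\[[0-9;]*m")

def visible_len(text):
    """Length of text once the zero-width SGR sequences are removed."""
    return len(SGR_RE.sub("", text))

bold = "\x1b[1mwarning\x1b[0m"

print(len(bold))          # 15: the word plus two 4-character sequences
print(visible_len(bold))  # 7
```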

It is of course impossible to handle all the scenarios in which some
characters might influence the width of the displayed string, but
wouldn't it be convenient to have a 'chunk_width' method or something
similar which can be overridden in a derived class, so that a user can
supply a custom implementation? The default of this chunk_width might
just be 'len()'.
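
The idea of a pluggable width function can be sketched without touching
textwrap itself. The following is a simplified stand-alone greedy
wrapper, not textwrap's actual algorithm: wrap_by_width and columns are
hypothetical names, words are split on whitespace only, a space is
assumed to take one column, and a word wider than the limit is simply
placed on its own line.

```python
import unicodedata

def columns(chunk):
    """Example width function: ignore combining marks."""
    return sum(1 for ch in chunk if unicodedata.combining(ch) == 0)

def wrap_by_width(text, width=70, chunk_width=len):
    """Greedy word wrap; chunk_width(word) gives the columns a word occupies."""
    lines, current, used = [], [], 0
    for word in text.split():
        w = chunk_width(word)
        # The extra 1 is the space separating this word from the previous one.
        if current and used + 1 + w > width:
            lines.append(" ".join(current))
            current, used = [], 0
        if current:
            used += 1
        current.append(word)
        used += w
    if current:
        lines.append(" ".join(current))
    return lines

text = "de\u0301ge\u0301 ne\u0301re\u0301"
print(len(wrap_by_width(text, width=9)))                       # 2 lines with len()
print(len(wrap_by_width(text, width=9, chunk_width=columns)))  # 1 line
```

With the default len() each word counts as 6, so the two words do not
fit in 9 columns; with the custom width function each counts as 4 and
they share a line.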

And that leads to another question: does Python have a function akin to
wcwidth(), which gives the number of column positions a Unicode
character needs?

Berteun
 
Berteun Damman

And that leads to another question, does Python have a function akin to
wcwidth() which gives the number of column positions a unicode character
needs?

After playing around a bit with unicodedata.normalize and seeing that
it fails when there is no precomposed form, I decided to take Markus
Kuhn's implementation [1] and make a Python version [2].
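
The limitation of normalization can be seen in a short sketch (Python 3
assumed). NFC composes "e" plus a combining acute into the single
precomposed character U+00E9, but "q" plus the same accent has no
precomposed form, so len() still reports 2 for a sequence that occupies
one column.

```python
import unicodedata

e_acute = unicodedata.normalize("NFC", "e\u0301")  # precomposed U+00E9 exists
q_acute = unicodedata.normalize("NFC", "q\u0301")  # no precomposed q-acute

print(len(e_acute))  # 1
print(len(q_acute))  # 2, although it displays in a single column
```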

This will try to guess the column width of a character. Non-printable
characters report a width of -1 (this includes '\n' and '\t', for
example), except for '\0', which has width 0. Combining characters
report 0, normal Latin characters 1, and full-width forms, for example,
2.
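
The rules just described can be approximated with the standard
unicodedata module. This is a rough sketch under stated assumptions, not
a port of the C implementation in [1] or of the script in [2]: it relies
on Python's own Unicode tables rather than the hand-built ranges, the
function names char_width and string_width are hypothetical, and edge
cases such as format characters are not treated specially.

```python
import unicodedata

def char_width(ch):
    """Guess the number of columns one character occupies."""
    if ch == "\0":
        return 0
    if unicodedata.category(ch) == "Cc":   # control characters, e.g. \n and \t
        return -1
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me"):
        return 0                           # combining and enclosing marks
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                           # wide and full-width forms
    return 1

def string_width(text):
    """Sum of the character widths, or -1 if any character is non-printable."""
    total = 0
    for ch in text:
        w = char_width(ch)
        if w < 0:
            return -1
        total += w
    return total

print(string_width("de\u0301ge\u0301ne\u0301re\u0301"))  # 8
print(string_width("\uff21\uff22"))                      # 4: two full-width letters
```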

Of course, the real output depends on the capabilities of the display
device. xterm is capable of handling combining characters, whereas OS
X's Terminal.app cannot do it for Greek or Russian characters, for
example.

All in all, I think it is a reasonable start. There is one issue though,
namely involving Plane 1 characters. On 64-bit systems, so it seems,
these are stored as one character, on 32-bit systems as a surrogate
pair. I don't know how this works exactly, but the code should basically
ignore Plane 1 characters on 32-bit systems (i.e. always report display
width 1 even though they're combining or full-width).
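
For reference, a surrogate pair can be folded back into a single code
point before any width lookup. The sketch below uses the standard UTF-16
arithmetic; the function name combine_surrogates is hypothetical, and
whether a given Python build needs this at all is an assumption (in
Python 3.3 and later, str is always indexed by code point, so Plane 1
characters are single characters).

```python
def combine_surrogates(high, low):
    """Fold a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+1D11E (MUSICAL SYMBOL G CLEF) is encoded in UTF-16 as D834 DD1E.
print(hex(combine_surrogates(0xD834, 0xDD1E)))  # 0x1d11e
```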

Berteun

[1] http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
[2] http://berteun.nl/tmp/wcwidth.py
 
