textwrap and combining diacritical marks

Berteun Damman · Jun 28, 2007

Hello,

The textwrap module always uses len() to determine the
width of the string being wrapped. This might be a
sensible thing to do in many circumstances, but I think there are
circumstances where this does not lead to the desired result.

I assume this module is mostly used where text is formatted for
presentation to a user, e.g. in a console application. The number of
characters in the string, as determined by len(), might not be the
number of columns occupied. Some of the characters might be combining
diacritical marks, which go on top of the previous character; e.g. the
string de'ge'ne're' (where each ' indicates a combining acute accent)
will display with a width of only 8 columns.
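
A minimal sketch of the mismatch, assuming Python 3 and the standard
unicodedata module. display_columns is a hypothetical helper, not part
of any library, and it only accounts for combining marks:

```python
import unicodedata

# "degenere" with a combining acute accent (U+0301) after every "e"
s = "de\u0301ge\u0301ne\u0301re\u0301"


def display_columns(text):
    """Count the code points that are not combining marks (rough estimate)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


print(len(s))              # 12 code points
print(display_columns(s))  # 8 columns, if the terminal renders combining marks
```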

The string might also include escape sequences that switch the console
to bold or underline mode, which have zero display width. If this
happens a lot, the resulting text might look very badly formatted
because of all these zero-width sequences.

It is of course impossible to handle all the scenarios in which some
characters might influence the width of the displayed string, but
wouldn't it be convenient to have a 'chunk_width' method or something
similar which can be overridden in a derived class, so that a user can
supply a custom implementation? The default chunk_width might simply be
'len()'.
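
To make the idea concrete, here is a rough sketch of such a hook,
assuming Python 3. textwrap.TextWrapper has no chunk_width method, so
the sketch uses its own simplified greedy wrapper; WidthAwareWrapper and
CombiningAwareWrapper are hypothetical names, and the whitespace
handling is far cruder than what textwrap does:

```python
import unicodedata


class WidthAwareWrapper:
    """Hypothetical greedy wrapper; not part of the textwrap module."""

    def __init__(self, width=70):
        self.width = width

    def chunk_width(self, chunk):
        # Default measure: the number of code points, as textwrap uses.
        return len(chunk)

    def wrap(self, text):
        lines, current, current_width = [], [], 0
        for word in text.split():
            w = self.chunk_width(word)
            # One extra column for the joining space if the line has words.
            needed = w + (1 if current else 0)
            if current and current_width + needed > self.width:
                lines.append(" ".join(current))
                current, current_width = [word], w
            else:
                current.append(word)
                current_width += needed
        if current:
            lines.append(" ".join(current))
        return lines


class CombiningAwareWrapper(WidthAwareWrapper):
    def chunk_width(self, chunk):
        # Combining marks occupy no column of their own.
        return sum(1 for ch in chunk if unicodedata.combining(ch) == 0)
```

With width=17, two copies of the accented word from above (12 code
points, 8 columns each) end up on two lines with the default measure but
fit on one line with the combining-aware one.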

And that leads to another question: does Python have a function akin to
wcwidth(), which gives the number of column positions a Unicode
character needs?

Berteun

Berteun Damman · Jun 28, 2007

And that leads to another question: does Python have a function akin to
wcwidth(), which gives the number of column positions a Unicode
character needs?

I played around a bit with unicodedata.normalize, but seeing how this
fails when there is no precomposed form, I decided to take Markus
Kuhn's implementation [1] and make a Python version [2].
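
The limitation of the normalize approach can be seen in a small example,
assuming Python 3. NFC composition only helps when Unicode defines a
precomposed character; the variable names are just for illustration:

```python
import unicodedata

e_acute = "e\u0301"  # composes to the precomposed U+00E9
q_acute = "q\u0301"  # Unicode has no precomposed "q with acute"

print(len(unicodedata.normalize("NFC", e_acute)))  # 1
print(len(unicodedata.normalize("NFC", q_acute)))  # 2, still base + mark
```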

The Python version tries to guess the column width of a character.
Non-printable characters report a width of -1 (this includes '\n' and
'\t', for example), except for \0, which has width 0. Combining
characters report 0, normal Latin characters 1, and full-width forms,
for example, 2.
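
For readers who cannot reach the linked file, the following is a rough
approximation of that behaviour, assuming Python 3. It is not the code
from [2] and does not use the tables from [1]; it leans on the
unicodedata module instead, so results may differ for some characters.
approx_wcwidth and approx_wcswidth are hypothetical names:

```python
import unicodedata


def approx_wcwidth(ch):
    """Rough column width of a single character (an approximation)."""
    if ch == "\0":
        return 0
    category = unicodedata.category(ch)
    if category == "Cc":
        return -1  # control characters such as '\n' and '\t'
    if category in ("Mn", "Me"):
        return 0   # non-spacing and enclosing marks
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2   # wide and full-width forms
    return 1


def approx_wcswidth(text):
    """Sum of the character widths, or -1 if any character is non-printable."""
    total = 0
    for ch in text:
        w = approx_wcwidth(ch)
        if w < 0:
            return -1
        total += w
    return total
```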

Of course, the real output depends on the capabilities of the display
device. xterm is capable of handling combining characters, whereas OS
X's Terminal.app cannot do so for Greek or Russian characters, for
example.

All in all, I think it is a reasonable start. There is one issue though,
namely involving Plane 1 characters. On 64-bit systems, so it seems,
these are stored as one character, on 32-bit systems as a surrogate
pair. I don't know how this works exactly, but the code should basically
ignore Plane 1 characters on 32-bit systems (i.e. always report display
width 1, even though they're combining or full-width).

Berteun

[1] http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
[2] http://berteun.nl/tmp/wcwidth.py
