Embedding a literal "\u" in a unicode raw string.

Romano Giannetti · Feb 25, 2008

Hi,

while writing some LaTeX preprocessing code, I stumbled into this problem (I
have a -*- coding: utf-8 -*- line, obviously):

s = ur"añado $\uparrow$"

This gives an error, because the \u escape is interpreted in raw unicode
strings, too. The only ways I found to solve this are to write:

s = unicode(r"añado $\uparrow$", "utf-8")

or

s = ur"añado $\u005cuparrow$"

The second one is too ugly to live, while the first is at least acceptable; but
looking around the Python 3.0 docs, I saw that the first one will fail, too.

Am I doing something wrong here, or is there another solution for this?

Romano

Diez B. Roggisch · Feb 25, 2008

Why don't you get rid of the raw string? Then you need to write

s = u"añado $\\uparrow$"

which is considerably easier to read than either of the variants above.

Diez
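
A minimal sketch of the escaped-backslash approach, assuming Python 3.3 or later (where the u prefix is accepted again and every str is unicode). It only checks that the doubled backslash ends up as a single literal backslash in front of "uparrow"; the variable name is taken from the thread.

```python
# Sketch, assuming Python 3.3+: a non-raw unicode literal with the
# backslash doubled, as suggested above.
s = u"añado $\\uparrow$"

# The two backslashes in the source collapse to one in the string.
assert s.count("\\") == 1
assert s.endswith("$\\uparrow$")

# Print only the ASCII tail so the output does not depend on the
# terminal encoding.
print(s[-10:])
```

The cost of this form is that every other LaTeX command in the string needs its backslash doubled as well.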

OKB (not okblacke) · Feb 25, 2008

I too encountered this problem, in the same situation (making
strings that contain LaTeX commands). One possibility is to separate
out just the bit that has the \u, and use string juxtaposition to attach
it to the others:

s = ur"añado " u"$\\uparrow$"

It's not ideal, but I think it's easier to read than your solution #2.

--
--OKB (not okblacke)
Brendan Barnwell
"Do not follow where the path may lead. Go, instead, where there is
no path, and leave a trail."
--author unknown
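
The juxtaposition trick can be sketched as follows, assuming Python 3.3 or later, where a plain r"..." literal stands in for the Python 2 ur"..." form. The \alpha command is only an illustration of a backslash that benefits from the raw piece; it is not from the original post.

```python
# Sketch, assuming Python 3.3+: adjacent string literals are joined at
# compile time, so a raw piece and an escaped piece can sit side by side.
s = r"añado $\alpha$ " u"$\\uparrow$"

# The raw piece keeps \alpha as typed; the second piece supplies \uparrow
# through a doubled backslash.
assert s == u"añado $\\alpha$ $\\uparrow$"
assert s.count("\\") == 2
```

Only the fragment that contains \u has to leave the raw literal; everything else keeps single backslashes.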

romano.giannetti · Feb 25, 2008

Yes, I think I will do something like that, although... I really do not
understand why \x5c is not interpreted in a raw string while \u005c is
interpreted in a unicode raw string. It is, well, not elegant. Raw should
be raw...

Thanks anyway
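
The asymmetry can still be observed on Python 3 through the raw_unicode_escape codec, which, as far as I can tell, follows the same rule as the Python 2 ur"..." literal: only \uXXXX and \UXXXXXXXX are interpreted, and every other backslash sequence stays as typed. A minimal sketch under that assumption; the variable names are illustrative only.

```python
# Python 3 sketch: the raw_unicode_escape codec interprets \uXXXX and
# \UXXXXXXXX but leaves every other backslash sequence untouched.
hex_escape = rb"\x5c".decode("raw_unicode_escape")
unicode_escape = rb"$\u005cuparrow$".decode("raw_unicode_escape")

assert hex_escape == "\\x5c"             # still four characters: \ x 5 c
assert unicode_escape == "$\\uparrow$"   # \u005c became a single backslash

# A bare \u followed by non-hex characters is rejected, which matches the
# kind of error reported at the top of the thread.
try:
    rb"$\uparrow$".decode("raw_unicode_escape")
except UnicodeDecodeError as exc:
    truncated = True
    print("decode failed:", exc.reason)
else:
    truncated = False

assert truncated
```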

Martin v. Löwis · Feb 25, 2008

Right. IMO, this is just a plain design mistake in the Python Unicode
handling. Unfortunately, there was discussion about this specific issue
in the past, and the proponent of the status quo always defended it,
with the rationale (IIUC) that a) without it, you can't put arbitrary
Unicode characters into a string, and b) the semantics of \u in Java and
C are such that \u gets processed before tokenization even starts, and
it should be the same in Python.

Regards,
Martin

rmano · Feb 25, 2008

Well, I do not know Java, but C, AFAIK, has no raw strings, so you have to
use double backslashes anyway. Raw strings are a handy shorthand when you
can generate the characters with your keyboard, and this asymmetry rather
defeats that.

Is this decided, or is it possible to lobby for a change?

Thanks,
Romano

BTW, I think 2to3.py should warn about a raw string (not unicode) with \u
in it. I tried it and it seems to ignore the problem...

NickC · Mar 4, 2008

Python 3.0a3+ (py3k:61229, Mar 4 2008, 21:38:15)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> r"\u"
'\\u'
>>> r"\uparrow"
'\\uparrow'

2to3.py may be ignoring a problem, but existing raw 8-bit string
literals containing a '\u' aren't going to be it. If anything is going
to have a problem with conversion to Py3k at this point, it is raw
Unicode literals that contain a Unicode escape.

rmano · Mar 7, 2008

NickC said:
Python 3.0a3+ (py3k:61229, Mar 4 2008, 21:38:15)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> r"\u"
'\\u'
>>> r"\uparrow"
'\\uparrow'

Nice to know... so it seems that the 3.0 docs were not updated. I think
this is the correct behaviour. Thanks
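
The same check can be written as a short script, assuming Python 3, where a plain raw literal is already a unicode string. It is only a sketch of the behaviour shown in the interpreter session; the variable name is illustrative.

```python
# Python 3 sketch: a raw literal keeps \u exactly as typed.
latex = r"añado $\uparrow$"

assert r"\u" == "\\u"            # two characters: backslash, u
assert len(r"\uparrow") == 8     # backslash plus the seven letters of "uparrow"
assert latex.endswith("$\\uparrow$")

# Print only the ASCII tail so the output does not depend on the
# terminal encoding.
print(latex[-10:])
```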
