overriding character escapes during file input

David J Birnbaum · Sep 3, 2006

Dear Python-list,

I need to read a Unicode (utf-8) file that contains text like:

blah \fR40\fC blah

I get my input and then process it with something like:

inputFile = codecs.open(sys.argv[1],'r', 'utf-8')

for line in inputFile:

When Python encounters the "\f" substring in an input line, it wants to
treat it as an escape sequence representing a form-feed control
character, which means that it gets interpreted as (or, from my
perspective, translated to) "\x0c". Were I entering this string myself
within my program code, I could use a raw string (r"\f") to avoid this
translation, but I don't know how to do this when I am reading a line
from a file. If all I cared about was getting my code to work, I could
simply let the translation take place and then undo it within my
program, but, as Humpty Dumpty said, "it's a question of which is to be
master," and I would prefer to coerce Python into reading the line the
way I want it to be read, rather than let it do as it pleases and then
clean up afterwards.

Can anyone advise?

In case it matters, I'm using ActivePython 2.4 under Windows XP.

Thanks,

David

John Machin · Sep 3, 2006

David said:
Dear Python-list,

I need to read a Unicode (utf-8) file that contains text like:

blah \fR40\fC blah

I get my input and then process it with something like:

inputFile = codecs.open(sys.argv[1],'r', 'utf-8')

for line in inputFile:

When Python encounters the "\f" substring in an input line, it wants to
treat it as an escape sequence representing a form-feed control
character,

Even if it were as sentient as "wanting" to muck about with the input,
it doesn't. Those escape sequences are interpreted by the compiler, and
in other functions (e.g. re.compile) but *not* when reading a text
file.

Example:
|>>> guff = r"blah \fR40\fC blah"
|>>> print repr(guff)
'blah \\fR40\\fC blah'
|>>> # above is ASCII so it is automatically also UTF8

Comment: It contains backslash followed by 'f' ...

|... fname = "guff.utf8"
|>>> f = open(fname, "w")
|>>> f.write(guff)
|>>> f.close()
|>>> import codecs
|>>> f = codecs.open(fname,'r', 'utf-8')
|>>> guff2 = f.read()
|>>> print guff2 == guff
|True
No interpretation of the r"\f" has been done.
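A rough Python 3 version of the same check, offered as a sketch rather than a definitive test. The thread itself uses Python 2.4; the helper name round_trip and the temporary directory are assumptions made for illustration. It contrasts a non-raw literal, which the compiler translates, with text that has passed through a UTF-8 file:

```python
import os
import tempfile

# Escape processing happens when source code is compiled:
compiled = "\f"   # one character, a form feed (U+000C)
raw = r"\f"       # two characters, a backslash followed by 'f'


def round_trip(text, fname="guff.utf8"):
    """Write text to a UTF-8 file in a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, fname)
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(text)
        with open(path, "r", encoding="utf-8", newline="") as f:
            return f.read()


guff = r"blah \fR40\fC blah"
guff2 = round_trip(guff)

print(len(compiled), len(raw))  # 1 2
print(guff2 == guff)            # True
print("\x0c" in guff2)          # False
```

Decoding the file only turns bytes into characters; the backslash and the 'f' come back as two separate characters.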

David said:
which means that it gets interpreted as (or, from my
perspective, translated to) "\x0c". Were I entering this string myself
within my program code, I could use a raw string (r"\f") to avoid this
translation, but I don't know how to do this when I am reading a line
from a file.

What I suggest you do is:
print repr(open('yourfile', 'r').read())
[or at least one of the offending lines]
and inspect it closely. You may find (1) that the file has formfeeds in
it or (2) it has r"\f" in it and you were mistaken about the
interpretation or (3) something else.

If you maintain (3) is the case, then make up a small example file,
show a dump of it using print repr(.....) as above, plus the (short)
code where you decode it and dump the result.
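The inspection suggested above could be sketched along these lines in Python 3. The function names count_markers and inspect_file are hypothetical, and reading in binary mode is an assumption made so that nothing is decoded before you look at it. Counting bytes is safe for UTF-8 input because the form feed, the backslash and 'f' are all ASCII and never occur inside a multi-byte sequence:

```python
def count_markers(data):
    """Count real form feeds and literal backslash-f pairs in raw bytes."""
    return {
        "formfeed": data.count(b"\x0c"),      # case (1): actual form feeds
        "backslash_f": data.count(b"\\f"),    # case (2): backslash followed by 'f'
    }


def inspect_file(path, preview=200):
    """Dump the start of the file with repr() and return the marker counts."""
    with open(path, "rb") as f:
        data = f.read()
    print(repr(data[:preview]))
    return count_markers(data)


print(count_markers(b"blah \\fR40\\fC blah"))
# {'formfeed': 0, 'backslash_f': 2}
```

If both counts come back as zero, that points to case (3), and a small example file plus its dump would be the next step.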

HTH,
John

John Machin · Sep 3, 2006

David said:

Dear John,

Thank you for the quick response. Ultimately I need to remap the "f" in
"\f" to something else, so I worked around the problem by doing the
remapping first, and I'm now getting the desired result.

Please reply on-list.

How could you read the file to remap an "f" if you were getting '\x0c'
when you tried to read it? Are we to assume that it was case (2) i.e.
not a Python problem?
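If it was case (2), the remapping could be sketched roughly as follows in Python 3. The replacement text "\z" and the function names are hypothetical, since the thread does not say what the "f" was remapped to. Note that a regular expression pattern is one of the places where "\f" is interpreted as a form feed, so the backslash has to be doubled in the pattern:

```python
import re


def remap_plain(line, new="\\z"):
    """Replace a literal backslash-f using str.replace (no escape processing)."""
    return line.replace("\\f", new)


def remap_regex(line, new="\\z"):
    """Same remapping with re.sub; r"\\f" matches a backslash followed by 'f'."""
    # A function is used for the replacement so that the backslash in `new`
    # is not itself treated as an escape by re.sub.
    return re.sub(r"\\f", lambda match: new, line)


line = r"blah \fR40\fC blah"   # as it would arrive from the file

print(remap_plain(line))                       # blah \zR40\zC blah
print(remap_regex(line) == remap_plain(line))  # True
print(re.search(r"\f", line) is None)          # True: r"\f" looks for a form feed
```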

Cheers,
John
