overriding character escapes during file input

  • Thread starter David J Birnbaum

David J Birnbaum

Dear Python-list,

I need to read a Unicode (utf-8) file that contains text like:
blah \fR40\fC blah
I get my input and then process it with something like:
inputFile = codecs.open(sys.argv[1],'r', 'utf-8')

for line in inputFile:
When Python encounters the "\f" substring in an input line, it wants to
treat it as an escape sequence representing a form-feed control
character, which means that it gets interpreted as (or, from my
perspective, translated to) "\x0c". Were I entering this string myself
within my program code, I could use a raw string (r"\f") to avoid this
translation, but I don't know how to do this when I am reading a line
from a file. If all I cared about was getting my code to work, I could
simply let the translation take place and then undo it within my
program, but, as Humpty Dumpty said, "it's a question of which is to be
master," and I would prefer to coerce Python into reading the line the
way I want it to be read, rather than let it do as it pleases and then
clean up afterwards.

Can anyone advise?

In case it matters, I'm using ActivePython 2.4 under Windows XP.

Thanks,

David
(e-mail address removed)
 
John Machin

David said:
Dear Python-list,

I need to read a Unicode (utf-8) file that contains text like:
blah \fR40\fC blah
I get my input and then process it with something like:
inputFile = codecs.open(sys.argv[1],'r', 'utf-8')

for line in inputFile:
When Python encounters the "\f" substring in an input line, it wants to
treat it as an escape sequence representing a form-feed control
character,

Even if it were as sentient as "wanting" to muck about with the input,
it doesn't. Those escape sequences are interpreted by the compiler, and
in other functions (e.g. re.compile) but *not* when reading a text
file.

Example:
|>>> guff = r"blah \fR40\fC blah"
|>>> print repr(guff)
|'blah \\fR40\\fC blah'
|>>> # above is ASCII so it is automatically also UTF8

Comment: It contains backslash followed by 'f' ...

|>>> fname = "guff.utf8"
|>>> f = open(fname, "w")
|>>> f.write(guff)
|>>> f.close()
|>>> import codecs
|>>> f = codecs.open(fname,'r', 'utf-8')
|>>> guff2 = f.read()
|>>> print guff2 == guff
|True
No interpretation of the r"\f" has been done.
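For anyone following along on a current interpreter, here is a minimal sketch of the same contrast in Python 3 syntax. It assumes the built-in open() with an encoding argument as a stand-in for codecs.open, and the helper name round_trip is made up for illustration. It shows that a "\f" literal in source code is one character, while backslash plus "f" read from a file stays as two.

```python
import os
import tempfile

literal = "\f"   # escape processed by the compiler: a single form-feed character
raw = r"\f"      # raw literal: a backslash followed by the letter 'f'


def round_trip(text):
    """Write text to a temporary UTF-8 file, read it back, and return it.

    Hypothetical helper for this illustration only.
    """
    fd, path = tempfile.mkstemp(suffix=".utf8")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as out:
            out.write(text)
        with open(path, "r", encoding="utf-8") as inp:
            return inp.read()
    finally:
        os.remove(path)


# Same sample text as in the question, built from raw literals.
from_file = round_trip(r"blah \fR40\fC blah")

print(len(literal), len(raw))   # lengths of the two literals
print(repr(from_file))          # what actually came back from the file
```

The decoder only turns bytes into characters; it never looks for backslash escapes.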
which means that it gets interpreted as (or, from my
perspective, translated to) "\x0c". Were I entering this string myself
within my program code, I could use a raw string (r"\f") to avoid this
translation, but I don't know how to do this when I am reading a line
from a file.

What I suggest you do is:
print repr(open('yourfile', 'r').read())
[or at least one of the offending lines]
and inspect it closely. You may find (1) that the file has formfeeds in
it or (2) it has r"\f" in it and you were mistaken about the
interpretation or (3) something else.

If you maintain (3) is the case, then make up a small example file,
show a dump of it using print repr(.....) as above, plus the (short)
code where you decode it and dump the result.
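A rough way to tell cases (1) and (2) apart without squinting at a repr dump is to count the bytes directly. This is only a sketch in Python 3 syntax; the function name classify_formfeeds and the two sample byte strings are invented for illustration.

```python
def classify_formfeeds(data):
    """Count real form-feed bytes and literal backslash-f pairs in raw bytes."""
    return {
        "formfeed_bytes": data.count(b"\x0c"),    # case (1): actual 0x0C bytes
        "backslash_f_pairs": data.count(b"\\f"),  # case (2): 0x5C followed by 0x66
    }


# Made-up samples: one with backslash + 'f', one with real form feeds.
sample_with_text = b"blah \\fR40\\fC blah\n"
sample_with_ff = b"blah \x0cR40\x0cC blah\n"

print(classify_formfeeds(sample_with_text))
print(classify_formfeeds(sample_with_ff))
```

For a real file, pass it the result of open(path, 'rb').read() so that no decoding or newline translation happens before the count.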

HTH,
John
 
John Machin

David said:

Dear John,

Thank you for the quick response. Ultimately I need to remap the "f" in
"\f" to something else, so I worked around the problem by doing the
remapping first, and I'm now getting the desired result.

Please reply on-list.

How could you read the file to remap an "f" if you were getting '\x0c'
when you tried to read it? Are we to assume that it was case (2) i.e.
not a Python problem?
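If it was case (2), the remapping is ordinary string replacement on a two-character sequence. A sketch in Python 3 syntax, under assumptions: the thread never says what the "f" gets remapped to, so the "\g" target below is purely hypothetical, as are both function names. The second helper covers case (1), where the file really does contain form feeds and you want backslash plus "f" back.

```python
def remap_marker(line, old="\\f", new="\\g"):
    """Replace every backslash + 'f' pair with another marker.

    The default new marker is a placeholder, not taken from the thread.
    """
    return line.replace(old, new)


def restore_marker(line):
    """Turn real form-feed characters back into backslash + 'f'."""
    return line.replace("\x0c", "\\f")


print(repr(remap_marker(r"blah \fR40\fC blah")))
print(repr(restore_marker("blah \x0cR40\x0cC blah")))
```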

Cheers,
John
 
