reading from file

Sydoruk Yaroslav · Jun 11, 2009

Hello all,

In a text file aword.txt, there is a string:
"\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc".

There is a first script:
import chardet

f = open("aword.txt", "r")
for line in f:
    print chardet.detect(line)
    b = line.decode('cp1251')
    print b

_RESULT_
{'confidence': 1.0, 'encoding': 'ascii'}
\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc

There is a second script:
line = "\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc"
print chardet.detect(line)
b = line.decode('cp1251')
print b

_RESULT_
{'confidence': 0.98999999999999999, 'encoding': 'windows-1251'}
как+позвонить

Why is the string detected as ASCII when it is read from the file,
but as cp1251 when it is written directly in the script?
How do I solve this problem?

Jeff McNeil · Jun 11, 2009

Is the string in your text file literally
"\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc" as plain text? My
assumption is that when you read that in, Python sees each character
(the backslash, the 'x', and the two hex digits) as a separate ASCII
byte, and rightfully so, rather than as the single byte that the
corresponding '\x' escape stands for in a string literal.
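The difference can be seen with built-ins alone. This is only a minimal sketch, and the variable names are made up for illustration: the file holds a backslash, an 'x', and two hex digits as four separate ASCII characters, whereas the same escape typed in a script is turned into one byte by the parser.

```python
# Four ASCII characters, as they would sit in the text file: \ x e a
literal_text = b"\\xea"
# One byte with value 0xEA, which is what "\xea" becomes in source code
real_byte = b"\xea"

print(len(literal_text))  # 4
print(len(real_byte))     # 1

# bytearray yields integers on both Python 2 and Python 3
literal_is_ascii = all(b < 0x80 for b in bytearray(literal_text))
real_is_ascii = all(b < 0x80 for b in bytearray(real_byte))

print(literal_is_ascii)  # True: nothing here for a detector to see but ASCII
print(real_is_ascii)     # False: 0xEA is outside the ASCII range
```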

As an experiment:

(t)jeff@marvin:~/t$ cat test.py
import chardet

s = "\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc"
with open('test.txt', 'w') as f:
    print >>f, s

print chardet.detect(open('test.txt').read())
(t)jeff@marvin:~/t$ python test.py
{'confidence': 0.98999999999999999, 'encoding': 'windows-1251'}
(t)jeff@marvin:~/t$

HTH,

Jeff
mcjeff.blogspot.com

Sydoruk Yaroslav · Jun 11, 2009

Thank you for your reply.
You are right: Python reads the data from the file as bytes, and in this
case every byte is ASCII.

I solved the problem by adding line = line.decode('string_escape'):

f = open("aword.txt", "r")
for line in f:
    line = line.decode('string_escape')
    print chardet.detect(line)
    b = line.decode('cp1251')
    print b
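Note that the 'string_escape' codec exists only in Python 2. A rough equivalent that should also run on Python 3 goes through 'unicode_escape' and 'latin-1'. This is a sketch under the assumption that the file contains only plain ASCII and '\xNN'-style escapes; the helper name unescape_cp1251 is hypothetical, not part of any library.

```python
def unescape_cp1251(raw):
    """Turn literal backslash escapes (given as bytes) into cp1251 text."""
    # unicode_escape maps the four characters \xea to the code point U+00EA
    as_code_points = raw.decode('unicode_escape')
    # latin-1 maps code points 0-255 straight back to bytes 0x00-0xFF
    real_bytes = as_code_points.encode('latin-1')
    # the real bytes can now be decoded with the encoding they were written in
    return real_bytes.decode('cp1251')

# The file content from the question, as raw bytes with literal backslashes
raw = b"\\xea\\xe0\\xea+\\xef\\xee\\xe7\\xe2\\xee\\xed\\xe8\\xf2\\xfc"
text = unescape_cp1251(raw)
print(len(text))  # 13 characters instead of the 49 raw bytes
```

Printing text itself should show the Cyrillic phrase, provided the terminal encoding can represent it.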
