reading from file

Sydoruk Yaroslav

Hello all,

In a text file aword.txt, there is a string:
"\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc".

There is a first script:
import chardet

f = open("aword.txt", "r")
for line in f:
    print chardet.detect(line)
    b = line.decode('cp1251')
    print b

_RESULT_
{'confidence': 1.0, 'encoding': 'ascii'}
\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc

There is a second script:
import chardet

line = "\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc"
print chardet.detect(line)
b = line.decode('cp1251')
print b

_RESULT_
{'confidence': 0.98999999999999999, 'encoding': 'windows-1251'}
как+позвонить

Why is the string detected as ASCII when it is read from the file,
but as cp1251 when it is defined directly in the script?
How do I solve this problem?
 
Jeff McNeil

Is the string in your text file literally "\xea\xe0\xea+\xef\xee
\xe7\xe2\xee\xed\xe8\xf2\xfc" as "plain text?" My assumption is that
when you're reading that in, Python is interpreting each byte as an
ASCII value (and rightfully so) rather than the corresponding '\x'
escapes.

As an experiment:

(t)jeff@marvin:~/t$ cat test.py
import chardet

s = "\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc"
with open('test.txt', 'w') as f:
    print >>f, s

print chardet.detect(open('test.txt').read())
(t)jeff@marvin:~/t$ python test.py
{'confidence': 0.98999999999999999, 'encoding': 'windows-1251'}
(t)jeff@marvin:~/t$
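To make the difference concrete, here is a minimal sketch of the two cases. It is written for Python 3 (the scripts in this thread are Python 2) and assumes the file really holds the escapes typed out as plain text; the temporary aword.txt is recreated purely for illustration.

```python
# Python 3 sketch; the file and its contents are recreated here as an assumption.
import os
import tempfile

# In the script: Python turns each \xNN escape in the source code into
# a single byte, so this literal is 13 bytes long.
from_literal = b"\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc"

# In the file (assumed): the same characters typed as plain text, so each
# escape occupies four ASCII bytes: backslash, 'x', and two hex digits.
typed_text = rb"\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc"

path = os.path.join(tempfile.mkdtemp(), "aword.txt")
with open(path, "wb") as f:
    f.write(typed_text)

with open(path, "rb") as f:
    from_file = f.read()

# Every byte read from the file is below 128, which is why a detector
# would reasonably call it pure ASCII.
file_is_ascii = all(byte < 128 for byte in from_file)

print(len(from_literal), len(from_file))  # 13 49
print(file_is_ascii)                      # True
```

The literal decodes with cp1251 as expected, while the file content is just backslashes, letters and digits, so there is nothing non-ASCII in it to detect.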

HTH,

Jeff
mcjeff.blogspot.com
 
Sydoruk Yaroslav

Thank you for your reply.
You are right: Python reads the data from the file as bytes, and in this
case all of the data is ASCII.

I solved the problem by adding line = line.decode('string_escape'):

import chardet

f = open("aword.txt", "r")
for line in f:
    # Turn the literal "\xNN" text into the bytes it describes (Python 2 only)
    line = line.decode('string_escape')
    print chardet.detect(line)
    b = line.decode('cp1251')
    print b
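For anyone reading this on Python 3: the 'string_escape' codec exists only in Python 2. A rough equivalent, offered as a sketch rather than a drop-in replacement, goes through 'unicode_escape' and 'latin-1'. The helper name unescape_bytes is made up for this example, and the approach assumes the input is ASCII text containing only \xNN-style escapes (an escape naming a code point above 0xFF would fail at the latin-1 step).

```python
# Python 3 sketch; `unescape_bytes` is a hypothetical helper, not a library function.
def unescape_bytes(raw):
    """Turn literal backslash escapes in ASCII bytes into the bytes they name."""
    # unicode_escape maps the text \xNN to the code point U+00NN;
    # encoding that as latin-1 turns each code point into the single byte 0xNN.
    return raw.decode("unicode_escape").encode("latin-1")


# The line as it would be read from the file in binary mode (assumed content).
line = rb"\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc"

unescaped = unescape_bytes(line)
text = unescaped.decode("cp1251")

print(len(line), len(unescaped))  # 49 13
```

After the unescape step the data is the same 13 bytes as the literal in the second script, so decoding it with cp1251 gives the Cyrillic text.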
 
