Brian D
In an HTML page that I'm scraping using urllib2, a \xc2\xa0
bytestring appears.
The page's charset = utf-8, and the Chrome browser I'm using displays
the characters as a space.
The page requires authentication:
https://www.nolaready.info/myalertlog.php
When I try to concatenate strings containing the bytestring, Python
chokes because it refuses to coerce the bytestring into ascii.
wfile.write('|'.join(valueList))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
163: ordinal not in range(128)
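For reference, this error most likely arises because the list mixes Unicode strings with byte strings, and Python 2 then tries to decode the byte strings as ASCII during the join. A minimal sketch of the usual way around it, assuming the scraped values arrive as UTF-8 byte strings (the sample values and variable names below are hypothetical, not taken from the page):

```python
# Hypothetical scraped values: UTF-8 byte strings, as urllib2 would return them.
raw_values = [b'alert', b'New\xc2\xa0Orleans']

# Decode once, at the boundary where bytes enter the program.
text_values = [value.decode('utf-8') for value in raw_values]

# Join as text, so no implicit ASCII decoding is ever attempted.
line = u'|'.join(text_values)

# Encode once, at the boundary where text leaves the program.
encoded_line = line.encode('utf-8')
```

The idea is to keep everything as Unicode internally and convert only at the edges, which avoids the implicit coercion altogether.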
In searching for help with this issue, I've learned that the
bytestring *might* represent a non-breaking space.
When I scrape the page using urllib2, however, the characters print
as   in a Windows command prompt (though I wouldn't be surprised if
this is some erroneous attempt by the antiquated command window to
handle something it doesn't understand).
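The console output can be reproduced under one assumption: that the command prompt uses the OEM code page cp437 (a common default on US English Windows, but not confirmed here). In that case each of the two bytes is shown as a separate character, which would explain the odd glyphs. A small sketch:

```python
raw = b'\xc2\xa0'

# What a cp437 console would show: one glyph per byte (assumed code page).
as_console = raw.decode('cp437')

# What the page actually means: a single character encoded in two bytes.
as_utf8 = raw.decode('utf-8')

console_glyph_count = len(as_console)
utf8_char_count = len(as_utf8)
```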
If I use IDLE to attempt to decode the single byte referenced in the
error message, and convert it into UTF-8, another error message is
generated:
Traceback (most recent call last):
  File "<pyshell#72>", line 1, in <module>
    weird = unicode('\xc2', 'utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0: unexpected end of data
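The "unexpected end of data" message is what one would expect here, since 0xc2 is only the first byte of a two-byte UTF-8 sequence and cannot be decoded alone. The following sketch works through the bit layout by hand; the variable names are illustrative only:

```python
lead, continuation = 0xC2, 0xA0

# A two-byte UTF-8 sequence has the bit layout 110xxxxx 10xxxxxx.
lead_is_two_byte_start = (lead >> 5) == 0b110
continuation_is_valid = (continuation >> 6) == 0b10

# Strip the marker bits and combine the payload bits into one code point.
code_point = ((lead & 0x1F) << 6) | (continuation & 0x3F)

# The lead byte on its own is an incomplete sequence and fails to decode.
try:
    b'\xc2'.decode('utf-8')
    lone_byte_decodes = True
except UnicodeDecodeError:
    lone_byte_decodes = False
```

Here `code_point` comes out as 0xA0, the code point of the non-breaking space.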
If I attempt to decode the full bytestring, I don't obtain a human-readable
string (expecting, perhaps, a non-breaking space):
weird = unicode('\xc2\xa0', 'utf-8')
par = ' - '.join(['This is', weird])
par
u'This is - \xa0'
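Note that a result ending in `\xa0` is what a successful decode looks like: the interpreter shows the repr of the string, and U+00A0 is the non-breaking space, which the repr escapes. A quick way to check, sketched with the standard `unicodedata` module (replacing the character with a plain space is an optional step, assumed here to be acceptable for the scraped data):

```python
import unicodedata

decoded = b'\xc2\xa0'.decode('utf-8')

# Two bytes in, one character out.
char_count = len(decoded)

# Ask the Unicode database what that character is.
char_name = unicodedata.name(decoded)

# Optionally normalise non-breaking spaces to ordinary spaces.
cleaned = (u'This is - ' + decoded).replace(u'\xa0', u' ')
```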
I suspect that the bytestring isn't UTF-8, but what is it? Latin1?
u'This just gets \xc2\xa0'
Or is it a Microsoft bytestring?
u'This just gets \xc2\xa0'
None of these codecs seem to work.
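For comparison, single-byte codecs such as Latin-1 and cp1252 map every byte to its own character, so the same two bytes come out as two characters rather than one. That matches the `\xc2\xa0` results above and suggests those codecs are the wrong choice, not that UTF-8 is. A sketch:

```python
raw = b'\xc2\xa0'

# Single-byte codecs: each byte becomes its own character.
as_latin1 = raw.decode('latin-1')
as_cp1252 = raw.decode('cp1252')

# UTF-8: the two bytes form one character, the non-breaking space.
as_utf8 = raw.decode('utf-8')

lengths = (len(as_latin1), len(as_cp1252), len(as_utf8))
```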
Back to the original purpose, as I'm scraping the page, I'm storing
the field/value pair in a dictionary with each iteration through table
elements on the page. This is all fine, until a value is found that
contains the offending bytestring. I have attempted to coerce all
value strings into an encoding, but Python doesn't seem to like that
when the string is already Unicode:
valuesDict[fieldString] = unicode(value, 'UTF-8')
TypeError: decoding Unicode is not supported
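This TypeError appears when the value passed in is already a Unicode string, so there is nothing left to decode. One possible guard is to decode only when the value is still a byte string. The helper name `ensure_text` below is hypothetical:

```python
def ensure_text(value, encoding='utf-8'):
    """Return value as text, decoding only if it is still a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already text: return unchanged instead of decoding a second time.
    return value


from_bytes = ensure_text(b'New\xc2\xa0Orleans')
from_text = ensure_text(u'New\xa0Orleans')
```

Both calls return the same text, so the dictionary can be filled without caring which kind of string the scraper produced.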
The solution I've arrived at is to specify the encoding explicitly both
when reading value strings and when writing them.
for k, v in valuesDict.iteritems():
    valuePair = ':'.join([k, v.encode('UTF-8')])
    [snip] ...
wfile.write('|'.join(valueList))
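An alternative to encoding each value by hand is to open the output file with an explicit encoding and write Unicode strings to it, letting the file object do the encoding. A minimal sketch using `io.open` from the standard library; the dictionary contents and the file path are made up for illustration:

```python
import io
import os
import tempfile

# Hypothetical scraped data, already decoded to text.
values_dict = {u'location': u'New\xa0Orleans'}

# Build the field:value pairs as text; no per-value encode() calls.
value_list = [u':'.join([k, v]) for k, v in sorted(values_dict.items())]

# The file object encodes to UTF-8 on write.
path = os.path.join(tempfile.mkdtemp(), 'alerts.txt')
with io.open(path, 'w', encoding='utf-8') as wfile:
    wfile.write(u'|'.join(value_list))

# Read the raw bytes back to confirm what landed on disk.
with io.open(path, 'rb') as check:
    written = check.read()
```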
I'm not sure I have a question, but does this sound familiar to any
Unicode experts out there?
How should I handle these odd bytestring values? Am I doing it
correctly, or what could I improve?
Thanks!