What the \xc2\xa0 ?!!

B

Brian D

In an HTML page that I'm scraping using urllib2, a \xc2\xa0
bytestring appears.

The page's charset = utf-8, and the Chrome browser I'm using displays
the characters as a space.

The page requires authentication:
https://www.nolaready.info/myalertlog.php

When I try to concatenate strings containing the bytestring, Python
chokes because it refuses to coerce the bytestring into ascii.

wfile.write('|'.join(valueList))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
163: ordinal not in range(128)

In searching for help with this issue, I've learned that the
bytestring *might* represent a non-breaking space.

When I scrape the page using urllib2, however, the characters print
as   in a Windows command prompt (though I wouldn't be surprised if
this is some erroneous attempt by the antiquated command window to
handle something it doesn't understand).

If I use IDLE to attempt to decode the single byte referenced in the
error message, and convert it into UTF-8, another error message is
generated:

Traceback (most recent call last):
File "<pyshell#72>", line 1, in <module>
weird = unicode('\xc2', 'utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0:
unexpected end of data

If I attempt to decode the full bytestring, I don't obtain a human-
readable string (expecting, perhaps, a non-breaking space):
weird = unicode('\xc2\xa0', 'utf-8')
par = ' - '.join(['This is', weird])
par
u'This is - \xa0'

I suspect that the bytestring isn't UTF-8, but what is it? Latin1?
u'This just gets \xc2\xa0'

Or is it a Microsoft bytestring?
u'This just gets \xc2\xa0'

None of these codecs seem to work.

Back to the original purpose, as I'm scraping the page, I'm storing
the field/value pair in a dictionary with each iteration through table
elements on the page. This is all fine, until a value is found that
contains the offending bytestring. I have attempted to coerce all
value strings into an encoding, but Python doesn't seem to like that
when the string is already Unicode:

valuesDict[fieldString] = unicode(value, 'UTF-8')
TypeError: decoding Unicode is not supported

The solution I've arrived at is to specify the encoding for value
strings both when reading and writing value strings.

for k, v in valuesDict.iteritems():
valuePair = ':'.join([k, v.encode('UTF-8')])
[snip] ...
wfile.write('|'.join(valueList))

I'm not sure I have a question, but does this sound familiar to any
Unicode experts out there?

How should I handle these odd bytestring values? Am I doing it
correctly, or what could I improve?

Thanks!
 
J

John Roth

In an HTML page that I'm scraping using urllib2, a  \xc2\xa0
bytestring appears.

The page's charset = utf-8, and the Chrome browser I'm using displays
the characters as a space.

The page requires authentication:https://www.nolaready.info/myalertlog.php

When I try to concatenate strings containing the bytestring, Python
chokes because it refuses to coerce the bytestring into ascii.

wfile.write('|'.join(valueList))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
163: ordinal not in range(128)

In searching for help with this issue, I've learned that the
bytestring *might* represent a non-breaking space.

When I scrape the page using urllib2, however, the characters print
as  ┬á  in a Windows command prompt (though I wouldn't be surprised if
this is some erroneous attempt by the antiquated command window to
handle something it doesn't understand).

If I use IDLE to attempt to decode the single byte referenced in the
error message, and convert it into UTF-8, another error message is
generated:

Traceback (most recent call last):
  File "<pyshell#72>", line 1, in <module>
    weird = unicode('\xc2', 'utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0:
unexpected end of data

If I attempt to decode the full bytestring, I don't obtain a human-
readable string (expecting, perhaps, a non-breaking space):
weird = unicode('\xc2\xa0', 'utf-8')
par = ' - '.join(['This is', weird])
par

u'This is - \xa0'

I suspect that the bytestring isn't UTF-8, but what is it? Latin1?

u'This just gets \xc2\xa0'

Or is it a Microsoft bytestring?

u'This just gets \xc2\xa0'

None of these codecs seem to work.

Back to the original purpose, as I'm scraping the page, I'm storing
the field/value pair in a dictionary with each iteration through table
elements on the page. This is all fine, until a value is found that
contains the offending bytestring. I have attempted to coerce all
value strings into an encoding, but Python doesn't seem to like that
when the string is already Unicode:

valuesDict[fieldString] = unicode(value, 'UTF-8')
TypeError: decoding Unicode is not supported

The solution I've arrived at is to specify the encoding for value
strings both when reading and writing value strings.

for k, v in valuesDict.iteritems():
    valuePair = ':'.join([k, v.encode('UTF-8')])
    [snip] ...
    wfile.write('|'.join(valueList))

I'm not sure I have a question, but does this sound familiar to any
Unicode experts out there?

How should I handle these odd bytestring values? Am I doing it
correctly, or what could I improve?

Thanks!

Since it's UTF-8, one should go to one of the UTF-8 pages that
describes how to decode it. As it turns out, its unicode hex value is
A0, which is indeed a non-breaking space.

This is probably as good as any page: http://en.wikipedia.org/wiki/UTF-8

John Roth
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,968
Messages
2,570,154
Members
46,702
Latest member
LukasConde

Latest Threads

Top