I am parsing a web page that is encoded in gb2312, using Watir to get
the text of the page title, and I want to replace the Chinese characters
in the title with English words.
When I puts the title that Watir returns, the Chinese characters are
displayed as garbage (in the Windows cmd shell with code page cp936; the
display is fine when I change the code page to UTF-8). But I think cmd's
code page is only a display setting and is unrelated to what I need
(replacing the Chinese characters). I do not know whether the string I
get from Watir is also encoded in gb2312 or in something else. What I do
know is that converting the string to UTF-8 fails, with a message
complaining that a character is invalid.
I have no idea what to do next.
I had a sneaking suspicion that your task was Watir related.
First off, the solution to this is hard. I have done some Watir tasks
that involved processing international text (even in UTF-8), and there
are some nasty gotchas.
It is not Watir's fault or anything, but there are two areas that are
really annoying:
Using puts on international text in the Windows cmd shell is almost
useless as a way of debugging the problem. The cmd shell takes perverse
satisfaction in ignoring any encoding you might have set your Ruby code
to work in. The platform code page is what Windows does all its work in.
Watir itself is implemented on top of Win32OLE, which is yet another
area where the platform encoding can interfere against your wishes.
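Since puts to the cmd shell cannot be trusted, one way to see what Watir
actually handed back is to dump the string as hex bytes, which are plain
ASCII and survive any code page. The sketch below is only an illustration
under stated assumptions: it needs Ruby 1.9 or later, raw_title is a
hypothetical stand-in for the Watir title (it is not fetched from a
browser), the helper names and the translation table are made up, and GBK
is used as the source encoding on the assumption that the bytes really
are gb2312 (GBK is a superset of it).

```ruby
# Assumes Ruby 1.9 or later (String#force_encoding and String#encode).

# Hypothetical stand-in for the title Watir returns: the gb2312 bytes for
# two Chinese characters, followed by plain ASCII text.
raw_title = [0xD6, 0xD0, 0xCE, 0xC4, 0x20, 0x6E, 0x65, 0x77, 0x73].pack('C*')

# Render a string as hex bytes. The output is pure ASCII, so the console
# code page cannot mangle it.
def hex_dump(str)
  str.unpack('C*').map { |b| format('%02X', b) }.join(' ')
end

# Re-tag the bytes with the encoding we assume they are in (no bytes
# change), check that they are valid, then transcode to UTF-8.
def gb_to_utf8(str)
  tagged = str.dup.force_encoding('GBK')
  raise ArgumentError, 'bytes are not valid GBK' unless tagged.valid_encoding?
  tagged.encode('UTF-8')
end

# Hypothetical translation table. The key is written with \u escapes so
# this file itself stays ASCII.
translations = { "\u4E2D\u6587" => 'Chinese' }

utf8_title = gb_to_utf8(raw_title)

# Replace each run of Han characters that has a known translation.
english_title = utf8_title.gsub(/\p{Han}+/) do |word|
  translations.fetch(word, word)
end

puts hex_dump(raw_title)
puts hex_dump(utf8_title)
puts english_title
```

If the hex dump of the real title does not look like gb2312 (for example
because Win32OLE has already converted it to the platform code page or to
UTF-8), the force_encoding step is the part to change; the validity check
is there so that a wrong guess fails loudly instead of producing garbage.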
You might have better luck posting these questions on the Watir mailing
list. It's been a while for me, but useful workarounds to common and
uncommon gotchas like this are discussed and answered there.
Also check the Watir wiki at
http://wiki.openqa.org/display/WTR/Project+Home (oddly, it's currently
down for me).
Testing of encoded text came up a lot when I last checked, and they
document their workarounds heavily.
Lastly, check out Celerity, the JRuby equivalent to Watir. It is API
compatible with Watir, so your script will probably still work with only
minor modifications. It runs on top of a Java HTTP library, so you get
good encoding processing if you need it and, more importantly, there are
fewer entry points for the Windows platform encoding to impose itself.
Note: Celerity has no visual component. FireWatir (Firefox) or
ChromeWatir may also be useful to you for similar reasons.