Question about Encode (Windows-1252 to utf-8)

williams.wilkie · Jul 9, 2008

Hello! I have recently been turned on to Encode. We have some folks
who are copying and pasting from Word straight into our CMS and the
need to convert from "Windows-1252" to "utf-8" is now critical.

For a one liner I have been using this....
perl -MEncode=from_to -i -pe 'from_to($_, "windows-1252", "utf-8")' file1.txt file2.txt

This works well for editing in place.

My quandary is that now I need to tackle multiple files in a directory,
and another developer mentioned that if "UTF-8" and "Windows-1252" are
intermixed in a file, Encode may get confused and I should do a
transliteration like:

tr/\x93/\N{LEFT DOUBLE QUOTATION MARK}/;

I wonder if that's really true. Also, when it comes to opening and
closing file handles for this, should I be using something like
"binmode OUTPUTFILEHANDLE, ':bytes';"?

I am impressed with Encode but any advice or words that anyone wants
to throw in would be greatly appreciated.

Wilkie
flames go quietly to /dev/null

Ted Zlatanov · Jul 9, 2008

On Tue, 8 Jul 2008 16:40:53 -0700 (PDT), williams.wilkie wrote:

ww> Hello! I have recently been turned on to Encode. We have some folks
ww> who are copying and pasting from Word straight into our CMS and the
ww> need to convert from "Windows-1252" to "utf-8" is now critical.

ww> For a one liner I have been using this....
ww> perl -MEncode=from_to -i -pe 'from_to($_, "windows-1252", "utf-8")'
ww> file1.txt file2.txt

ww> Works good for editing in place.

ww> My quandary is that now I need to tackle multiple files in a directory
ww> and another developer mentioned that if "UTF-8" and "Windows-1252" are
ww> intermixed in a file that it may get confused

Why don't you try it? If it doesn't work for you, post an example and
what fails.
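
A minimal sketch of such an experiment, assuming the "intermixed" part of
the file is text that is already UTF-8. It runs the same from_to() call as
the one-liner on a two-byte UTF-8 "é" and prints the bytes before and after.
The variable names and the hex_bytes() helper are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Helper (hypothetical name): show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# "é" already encoded as UTF-8: the two bytes C3 A9.
my $already_utf8 = "\xC3\xA9";

# Same conversion as the one-liner, applied to data that is NOT Windows-1252.
my $converted = $already_utf8;
from_to($converted, "windows-1252", "utf-8");

# Each input byte is read as its own Windows-1252 character and re-encoded,
# so the text comes out double-encoded.
print "before: ", hex_bytes($already_utf8), "\n";   # before: C3 A9
print "after:  ", hex_bytes($converted),    "\n";   # after:  C3 83 C2 A9
```

So the conversion does not fail on such input; it silently turns "é" into
"Ã©", which is the kind of confusion the other developer was probably
warning about.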

ww> and I should do a transliteration like..

ww> tr/\x93/\N{LEFT DOUBLE QUOTATION MARK}/;

I would avoid that solution, it's extremely dangerous compared to
Encode. You may destroy valid UTF-8 data.
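
To illustrate the danger, here is a small sketch, assuming the tr is run on
raw bytes. The byte 0x93 also occurs inside valid UTF-8 sequences, for
example as the last byte of EN DASH (E2 80 93). An ASCII double quote stands
in for the replacement character, and is_valid_utf8() is a hypothetical
helper built on Encode's strict decoding.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: true if the byte string decodes as strict UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;    # local copy; decode() with a CHECK flag may modify it
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

# Valid UTF-8: "a", space, EN DASH (bytes E2 80 93), space, "b".
my $valid_utf8 = "a \xE2\x80\x93 b";

# Byte-level replacement of 0x93, as the suggested tr would do.
# The ASCII quote is only a stand-in for the real target character.
(my $damaged = $valid_utf8) =~ tr/\x93/"/;

print "original valid: ", is_valid_utf8($valid_utf8), "\n";   # original valid: 1
print "after tr valid: ", is_valid_utf8($damaged),    "\n";   # after tr valid: 0
```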

ww> I wonder if that's really true and when it comes to open and closing
ww> file handles for this should I be using something like "binmode
ww> OUTPUTFILEHANDLE, ':bytes';"

Maybe, depending on the file contents. Again, try it.
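
One alternative to experiment with, assuming each input file is entirely
Windows-1252: let PerlIO layers do the conversion instead of calling
from_to() on raw bytes. This is only a sketch; convert_file() and the
temporary file names are made up for the demo.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: read $src as Windows-1252, write $dst as UTF-8.
sub convert_file {
    my ($src, $dst) = @_;
    open my $in,  '<:encoding(cp1252)', $src or die "Cannot read $src: $!";
    open my $out, '>:encoding(UTF-8)',  $dst or die "Cannot write $dst: $!";
    print {$out} $_ while <$in>;    # lines are character strings in between
    close $in;
    close $out or die "Cannot close $dst: $!";
}

# Demo: write two Windows-1252 "smart quotes" as raw bytes, then convert.
my $dir = tempdir(CLEANUP => 1);
my ($src, $dst) = ("$dir/in.txt", "$dir/out.txt");

open my $fh, '>:raw', $src or die "Cannot write $src: $!";
print {$fh} "\x93quoted\x94\n";
close $fh or die "Cannot close $src: $!";

convert_file($src, $dst);

# Read the result back as raw bytes to inspect what was written.
open $fh, '<:raw', $dst or die "Cannot read $dst: $!";
my $result = do { local $/; <$fh> };
close $fh;

printf "%d bytes in, %d bytes out\n", -s $src, length $result;
```

If you stay with from_to(), the handles should carry raw bytes on both
sides, since from_to() works on octets and does its own conversion.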

Ted

Jürgen Exner · Jul 9, 2008

> My quandary is that now I need to tackle multiple files in a directory
> and another developer mentioned that if "UTF-8" and "Windows-1252" are
> intermixed in a file that it may get confused and I should do a
> transliteration like..

Unless the file format supports multiple encodings within the same file
(like e.g. a MIME email) a file can have only one encoding.

> tr/\x93/\N{LEFT DOUBLE QUOTATION MARK}/;

Nuts!

> I am impressed with Encode but any advice or words that anyone wants
> to throw in would be greatly appreciated.

The only way to survive the encoding nightmare and stay sane is to
standardize _ALL_ your data on _ONE SINGLE_ encoding. I strongly
recommend UTF-8, but that's up to you.
Any conversion between this standard format and other formats happens
(if at all) _ONLY_ for user interaction, e.g. to support legacy email
clients which don't support UTF-8 or accept input from a web page in ISO
8859-15 or even Greek, Arabic or Chinese or similar tasks. Of course, if
at all possible even this user interaction should use the agreed-upon
standard.

jue
(with a decade of internationalizing and localizing software)

worldcyclist · Jul 11, 2008

> Unless the file format supports multiple encodings within the same file
> (like e.g. a MIME email) a file can have only one encoding.
>
> The only way to survive the encoding nightmare and stay sane is to
> standardize _ALL_ your data on _ONE SINGLE_ encoding. I strongly
> recommend UTF-8, but that's up to you.
> Any conversion between this standard format and other formats happens
> (if at all) _ONLY_ for user interaction, e.g. to support legacy email
> clients which don't support UTF-8 or accept input from a web page in ISO
> 8859-15 or even Greek, Arabic or Chinese or similar tasks. Of course, if
> at all possible even this user interaction should use the agreed-upon
> standard.
>
> jue
> (with a decade of internationalizing and localizing software)

I have seen this before with other CMSs: someone types some text, then
cuts and pastes more from Word, and the data ends up mixed when it is
stored in MySQL. MySQL doesn't care what you have it encoded in, but the
problem comes when automated routines create XML files that are then
stored with mixed encoding (CMS data is stored into MySQL, and another
routine generates static XML files from the faulty data for use
elsewhere).

That certainly makes the point that the data needs to be validated
before going into the db, but I can feel the poster's pain regarding
this issue.
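
For the "multiple files in a directory" part, here is a rough sketch of
that kind of validation, assuming each file is either entirely valid UTF-8
or entirely Windows-1252. A file that already decodes as strict UTF-8 is
left alone; anything else is converted. All sub names and the *.txt filter
are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);
use File::Temp qw(tempdir);

sub read_bytes {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot read $path: $!";
    local $/;                       # slurp the whole file
    my $data = <$fh>;
    close $fh;
    return defined $data ? $data : '';
}

sub write_bytes {
    my ($path, $data) = @_;
    open my $fh, '>:raw', $path or die "Cannot write $path: $!";
    print {$fh} $data;
    close $fh or die "Cannot close $path: $!";
}

# Return the bytes unchanged if they are valid UTF-8, else treat as cp1252.
sub to_utf8_if_needed {
    my ($bytes) = @_;
    my $probe = $bytes;    # decode() with a CHECK flag may modify its argument
    return $bytes if eval { decode('UTF-8', $probe, FB_CROAK); 1 };
    return encode('UTF-8', decode('cp1252', $bytes));
}

# Convert every *.txt file in $dir in place; returns the number checked.
sub convert_dir {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my @paths = grep { -f } map { "$dir/$_" } grep { /\.txt\z/ } readdir $dh;
    closedir $dh;
    for my $path (@paths) {
        my $old = read_bytes($path);
        my $new = to_utf8_if_needed($old);
        write_bytes($path, $new) if $new ne $old;
    }
    return scalar @paths;
}

# Demo on a throwaway directory: one Windows-1252 file, one UTF-8 file.
my $dir = tempdir(CLEANUP => 1);
write_bytes("$dir/word.txt",  "\x93hi\x94\n");
write_bytes("$dir/clean.txt", "\xE2\x80\x9Chi\xE2\x80\x9D\n");
my $count = convert_dir($dir);
print "checked $count files\n";
```

Running it twice is harmless, because after the first pass every file
validates as UTF-8 and is skipped.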

Maybe specifying your IN and OUT filehandles as ':bytes' would help
(to preserve the data and inhibit automatic encoding that may result in
unexpected changes to your already formatted UTF-8).
Once you have read the data in, use the transliteration method you
described before to change things. I'm not a huge fan of that method
either, but that's the way it was done not too many years ago.
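
If the data really is mixed within one file, a somewhat safer variation on
the byte-level idea might look like the sketch below. It keeps every
well-formed UTF-8 sequence as it is and converts only the remaining stray
high bytes from Windows-1252. This is a heuristic, not something Encode
provides: a Windows-1252 byte pair that happens to look like valid UTF-8
will be left untouched. The name fix_mixed() is made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# One well-formed UTF-8 encoded character, matched on raw bytes.
my $utf8_char = qr/
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper: bytes in (mixed UTF-8 and cp1252), UTF-8 bytes out.
sub fix_mixed {
    my ($bytes) = @_;
    $bytes =~ s{($utf8_char)|([\x80-\xFF])}{
        my ($ok, $stray) = ($1, $2);    # copy before calling other code
        defined $stray ? encode('UTF-8', decode('cp1252', $stray)) : $ok;
    }ge;
    return $bytes;
}

# Windows-1252 curly quotes (93, 94) next to a UTF-8 EN DASH (E2 80 93).
my $mixed = "\x93hi\x94 \xE2\x80\x93 ok";
my $fixed = fix_mixed($mixed);

printf "%d bytes in, %d bytes out\n", length $mixed, length $fixed;
```

Here the 0x93 that belongs to the en dash survives, while the standalone
0x93 and 0x94 become proper UTF-8 quotation marks.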

I'd like to see other suggestions on this one too.
JC
