UTF-8 without external modules on Perl 5.0

Yohan N. Leder · May 23, 2006

You can convert between ISO-8859-15 and UTF8, too.

What's the advantage over just using ISO-8859-15 everywhere?

Yohan N. Leder · May 23, 2006

I can't really recommend one or the other. I prefer vendor-independent
standards and I'm a Unix guy, so I would generally prefer iso-8859-15.
OTOH, you probably have more Windows than Unix users, and the Unix users
are probably more able to work around charset issues, so windows-1252
will probably be less trouble to support.

And, knowing that the only difference between ISO-8859-1 and ISO-8859-15 is
the euro sign (from what I've understood), why not continue to use
ISO-8859-1 and translate any euro sign to its HTML entity (&euro;)
if I encounter one in form data or have to display one found in a
configurable constant?

The main concern about input, in this case, is knowing when to convert
this euro sign: before submission (maybe using JavaScript) or at STDIN
parsing time. The second requires that STDIN not be corrupted by
the presence of this outside-charset character, as happens with the
euro/checkbox bug shown at <http://yohannl.tripod.com/cgi-bin/form2dump.pl>;
so, for example, I'll have to remove the checkbox.

What do you think about this approach, Peter and Alan?

Yohan N. Leder · May 23, 2006

Looking at some sites to see what charset they use, I've found something
that seems strange to me: some use ISO-8859-1 (not UTF-8 or
ISO-8859-15) and still accept the euro sign (we can type it in a form
submission and it's accepted and displayed correctly in the resulting page
with <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">).
How is that possible?

For example, I've found this French site:
http://www.courseapied.net/forum/whowho/nouveau.php

Here you can create what they call a profile using an email and a password,
then fill in your details. I've tried entering the € sign in the profile
details and all seems right: the sign is displayed correctly in the final
page using charset="ISO-8859-1".

Maybe they convert the euro sign to an HTML entity (&euro;, as mentioned
elsewhere in the thread)? But when, at what level of the process
(before submission, after submission)? Their form doesn't include any
checkbox, so it doesn't hit the IE bug I described.

What do you think about that?

Ben Morrow · May 23, 2006

Quoth Yohan N. Leder said:
And, knowing the only difference between ISO-8859-1 and ISO-8859-15 is
the euro sign (from what I've understood), why not continue to use
ISO-8859-1 and manage to translate any euro sign to its HTML entity (&euro;)

That's fine for output, but if forms are submitted in the same charset
as the page the form was on, people won't be able to submit an entry
containing a euro. At least, not in any form you will be able to
understand.

The main concern about input, in this case, is to know when to convert
this euro sign : before submission (maybe using javascript)

Yeuch!

or at STDIN
parsing time. The second one requiring that STDIN be not corrupted by
the presence of this outside-charset char

There is no way of identifying a euro sign, however the browser submits
it (and non-broken browsers won't, anyway, as it's not valid). Every
8-bit byte is a valid ISO8859-1 character, so whatever single- or
multi-byte sequence the browser transmits for euro will just look like a
sequence of perfectly valid, but wrong, ISO8859-1 characters.
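
To make that concrete, here is a small sketch (not from the original posts) that takes the three bytes a browser would send for a euro sign when submitting in UTF-8 and reads them as ISO-8859-1. The variable names are illustrative only.

```perl
use strict;

# The UTF-8 encoding of U+20AC EURO SIGN is the three bytes E2 82 AC.
my $submitted = "\xE2\x82\xAC";

# Read as ISO-8859-1, each byte is simply the character with that code
# point, so nothing marks the input as invalid.
my @latin1_codes = map { sprintf("U+%04X", ord($_)) } split(//, $submitted);

print join(" ", @latin1_codes), "\n";
```

This prints U+00E2 U+0082 U+00AC: three legitimate ISO-8859-1 characters, none of which is a euro sign.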

I think I would recommend either using 8859-15, or, if you think that's
dodgy,

1. work internally in iso8859-15,

2. make sure your output data is plain 7-bit ascii (HTML-escape
everything else),

3. mark the data as UTF-8 (this is valid, as UTF-8 is a strict
superset of 7-bit ascii)

4. decode the UTF-8 submissions into iso8859-15 yourself. This
shouldn't be too hard: there will be some 128 two-byte sequences
you want to translate to single bytes, and any other top-bit-set
character is an error. If you're feeling lazy you could fork
iconv(1). You may be able to rip bits from one of the
Unicode::* modules, though I'd expect the actual decoding
routines to be in C (which I guess is no use to you).
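
Step 4 could be sketched roughly as follows, using only core built-ins so that no Encode or Unicode::* module is needed. This is an illustration under stated assumptions, not code from the thread: the function name utf8_to_latin9 and both hash names are made up, and returning undef on any malformed or unrepresentable input is just one possible error policy. Note that the euro sign itself is a three-byte sequence in UTF-8, so the sketch accepts three-byte sequences as well as two-byte ones.

```perl
use strict;

# Unicode code point => ISO-8859-15 byte, for the eight characters whose
# byte value differs from their code point (values from 8859-15.TXT).
my %to_latin9 = (
    0x20AC => 0xA4, 0x0160 => 0xA6, 0x0161 => 0xA8, 0x017D => 0xB4,
    0x017E => 0xB8, 0x0152 => 0xBC, 0x0153 => 0xBD, 0x0178 => 0xBE,
);

# The Latin-1 characters those eight displaced cannot be encoded at all.
my %displaced = map { ($_, 1) } values %to_latin9;

# Returns the ISO-8859-15 byte string, or undef if the input is malformed
# UTF-8 or contains a character that ISO-8859-15 cannot represent.
sub utf8_to_latin9 {
    my ($in) = @_;
    my @bytes = unpack('C*', $in);
    my $out = '';
    while (@bytes) {
        my $lead = shift @bytes;
        if ($lead < 0x80) {                      # plain ASCII passes through
            $out .= pack('C', $lead);
            next;
        }
        my ($cp, $more);
        if    ($lead >= 0xC2 && $lead <= 0xDF) { $cp = $lead & 0x1F; $more = 1; }
        elsif ($lead >= 0xE0 && $lead <= 0xEF) { $cp = $lead & 0x0F; $more = 2; }
        else { return undef; }                   # stray continuation or 4-byte lead
        while ($more-- > 0) {
            my $cont = shift @bytes;
            return undef unless defined $cont && ($cont & 0xC0) == 0x80;
            $cp = ($cp << 6) | ($cont & 0x3F);
        }
        return undef if $lead >= 0xE0 && $cp < 0x800;   # overlong encoding
        if    (exists $to_latin9{$cp})          { $out .= pack('C', $to_latin9{$cp}); }
        elsif ($cp <= 0xFF && !$displaced{$cp}) { $out .= pack('C', $cp); }
        else { return undef; }                   # not in ISO-8859-15
    }
    return $out;
}

my $latin9 = utf8_to_latin9("5 \xE2\x82\xAC");
print((defined $latin9 ? unpack('H*', $latin9) : 'not representable'), "\n");
```

A caller would treat an undef result as a rejected submission, or substitute a placeholder character, depending on how strict the form handling should be.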

Ben

Ben Morrow · May 23, 2006

Quoth Yohan N. Leder said:
What's the advantage over just using ISO-8859-15 everywhere?

UTF-8 is better supported. Even Notepad supports it.

Ben

Bart Lateur · May 23, 2006

Yohan said:
And, knowing the only difference between ISO-8859-1 and ISO-8859-15 is
the euro sign (from what I've understood)

You're wrong there, there's more than one difference in the conversion
table

<http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-15.TXT>

Column 1 is the local, single byte character value, column 2 is Unicode,
which is identical to Latin-1 for characters with code under 256.
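
For reference, the differences in that table come down to eight byte positions. The sketch below lists them as a Perl hash and adds a lookup helper; the names %latin9_diff and latin9_to_codepoint are illustrative, not taken from any module.

```perl
use strict;

# ISO-8859-15 byte => Unicode code point, for the eight positions that
# differ from ISO-8859-1 (values from the 8859-15.TXT mapping table).
my %latin9_diff = (
    0xA4 => 0x20AC,   # EURO SIGN
    0xA6 => 0x0160,   # LATIN CAPITAL LETTER S WITH CARON
    0xA8 => 0x0161,   # LATIN SMALL LETTER S WITH CARON
    0xB4 => 0x017D,   # LATIN CAPITAL LETTER Z WITH CARON
    0xB8 => 0x017E,   # LATIN SMALL LETTER Z WITH CARON
    0xBC => 0x0152,   # LATIN CAPITAL LIGATURE OE
    0xBD => 0x0153,   # LATIN SMALL LIGATURE OE
    0xBE => 0x0178,   # LATIN CAPITAL LETTER Y WITH DIAERESIS
);

# Every other byte maps to the code point with the same number.
sub latin9_to_codepoint {
    my ($byte) = @_;
    return exists $latin9_diff{$byte} ? $latin9_diff{$byte} : $byte;
}

foreach my $byte (sort { $a <=> $b } keys %latin9_diff) {
    printf "0x%02X => U+%04X\n", $byte, latin9_to_codepoint($byte);
}
```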

Bart Lateur · May 23, 2006

Yohan said:
What you say
here is that PHP can *include* a Perl script ?

No: PHP can load and execute other PHP files. It's the PHP equivalent of
modules in Perl.

Alan J. Flavell · May 23, 2006

You're wrong there, there's more than one difference in the
conversion table

Very true...

<http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-15.TXT>

But any conversion should be done by using the available modules etc.,
which already know what these tables contain ;-)

Column 1 is the local, single byte character value, column 2 is
Unicode, which is identical to Latin-1 for characters with code
under 256.

(I'll come back to that in a fit of pedantry later.)

But in practical terms (without having really analyzed the
difficulties of using geriatric Perl versions in detail), I would say
this.

Storing the data in utf-8 is the future-proof thing to do, no matter
that the present requirement seems to be limited to English and
French.

*If*, and I stress *if*, the questioner definitely wants to stick with
an 8-bit character encoding, then I do not particularly recommend
iso-8859-15 in relation to HTML. In fact, I would advise against it.
If one is thinking about older browsers, they were generally
supporting windows-1252 and/or utf-8 encodings *before* support for
iso-8859-15 became widespread. So, putting HTML pages in iso-8859-15
seems to me to bring no benefits (no matter how good an idea it might
be for plain text, say).

One approach is to store the data using HTML's &-notations, in
particular "&#number;" where the number is the *Unicode* code point
for the character in question.
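
As a rough illustration of that approach, the sketch below turns an ISO-8859-15 byte string into pure ASCII by replacing every top-bit-set byte with a numeric character reference. The function names are hypothetical, and the code assumes the input contains no C1 control bytes (0x80 to 0x9F), which have no sensible reference in HTML.

```perl
use strict;

# ISO-8859-15 bytes whose Unicode code point differs from the byte value.
my %latin9_diff = (
    0xA4 => 0x20AC, 0xA6 => 0x0160, 0xA8 => 0x0161, 0xB4 => 0x017D,
    0xB8 => 0x017E, 0xBC => 0x0152, 0xBD => 0x0153, 0xBE => 0x0178,
);

# Build the "&#number;" reference for one ISO-8859-15 byte value.
sub ncr_for_byte {
    my ($byte) = @_;
    my $cp = exists $latin9_diff{$byte} ? $latin9_diff{$byte} : $byte;
    return '&#' . $cp . ';';
}

# Replace every non-ASCII byte with its numeric character reference.
sub latin9_to_ascii_html {
    my ($text) = @_;
    $text =~ s/([\x80-\xFF])/ncr_for_byte(ord($1))/ge;
    return $text;
}

print latin9_to_ascii_html("caf\xE9 : 5 \xA4"), "\n";
```

The example prints caf&#233; : 5 &#8364;. It deliberately leaves &, < and > alone; escaping those is a separate step that the output code would still need.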

Another possibility, which is in fact supported by most browsers - and
I think I'd say by more browsers than support iso-8859-15, although by
now the browsers where it makes a difference are really quite old - is
to use Windows-1252. This contains not only all of the graphic
characters which are in iso-8859-15, but some more in addition,
permitting some additional languages such as Czech. But it's a
dead-end compared with supporting Unicode (whether by utf-8 or by
&#number; notation).

I think that's about the most balanced advice I can offer.

Well, frankly it would be better to write the geriatric Perl versions
out of the way, and get on and do the job properly! That's the best
answer, to be honest.

....

<pedantic>
"Latin-1" is the name for a character repertoire, without reference
to its encoding. When you referred to "Latin-1" above, I reckon you
could only possibly have meant its ISO encoding, i.e. iso-8859-1. But
there exist other encodings which cover the Latin-1 repertoire,
including EBCDIC Latin-1 (CP1047), and the MS-DOS codepage CP850 also
covers the repertoire - in each case of course with the characters
differently arranged. So IMHO it's preferable to be explicit about
which encoding you are referring to, when the topic is an encoding
rather than just a repertoire.

regards

Yohan N. Leder · May 24, 2006

Well, frankly it would be better to write the geriatric Perl versions
out of the way, and get on and do the job properly! That's the best
answer, to be honest.

Hmm; OK about the HTML entities to handle Unicode values... Now I have to
step back and think more deeply about your approach and Ben's.

Still, imagine I release a second version without any support
for Perl before 5.8: how do I support UTF-8 in full (I/O and
internally)? What are the key points to check and/or rewrite in the scripts?
Do all regexes and built-in functions work on UTF-8 strings?
What about literal strings (the configurable ones I mentioned)?

Yohan N. Leder · May 24, 2006

No: PHP can load and execute other PHP files. It's the PHP equivalent of
modules in Perl.

Understood

Yohan N. Leder · May 24, 2006

That's fine for output, but if forms are submitted in the same charset
as the page the form was on, people won't be able to submit an entry
containing a euro. At least, not in any form you will be able to
understand.

There is no way of identifying a euro sign, however the browser submits
it (and non-broken browsers won't, anyway, as it's not valid). Every
8-bit byte is a valid ISO8859-1 character, so whatever single- or
multi-byte sequence the browser transmits for euro will just look like a
sequence of perfectly valid, but wrong, ISO8859-1 characters.

I think I would recommend either using 8859-15, or, if you think that's
dodgy,

1. work internally in iso8859-15,

2. make sure your output data is plain 7-bit ascii (HTML-escape
everything else),

3. mark the data as UTF-8 (this is valid, as UTF-8 is a strict
superset of 7-bit ascii)

4. decode the UTF-8 submissions into iso8859-15 yourself. This
shouldn't be too hard: there will be some 128 two-byte sequences
you want to translate to single bytes, and any other top-bit-set
character is an error. If you're feeling lazy you could fork
iconv(1) . You may be able to rip bits from one of the
Unicode::* modules, though I'd expect the actual decoding
routines to be in C (which I guess is no use to you).

Ben

Hmm, I have to think deeper about your solution above and the one Alan
talks about (UTF-8 support through full "&code;" encoding). I have to
decide but am still a little undecided at this time : what don't the
unicode be not invented from the beginning :-?

Ben Morrow · May 24, 2006

Hmm, I have to think deeper about your solution above and the one Alan
talks about (UTF-8 support through full "&code;" encoding).

I believe Alan and I are suggesting materially the same thing, if that
helps you understand it better (two ways of explaining things are
usually better than one).

I have to
decide but still a little bit undecided at this time : what don't the
unicode be not invented from the beginning :-?

(ITYM 'why' not 'what')

Yes...

That would have made life easier.

Ben Morrow · May 24, 2006

Quoth Yohan N. Leder said:
Hmm; OK about the HTML entities to handle Unicode values... Now I have to
step back and think more deeply about your approach and Ben's.

Still, imagine I release a second version without any support
for Perl before 5.8

It may be easier to write that version first, and get the
algorithms/whatever right before worrying about the character encoding.
If you write clean code it should be fairly straightforward to add the
conversions afterwards.

: how do I support UTF-8 in full (I/O and
internally)? What are the key points to check and/or rewrite in the scripts?
Do all regexes and built-in functions work on UTF-8 strings?

Yup. You need to mark filehandles with their encoding, using binmode or
3-arg open: see perlunicode. If you're getting data from CGI variables,
you may need to decode it into Perl's internal format: see Encode for
that.
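
A minimal sketch of those two steps on Perl 5.8 or later might look like this. The sample value in $raw is made up and simply stands in for the undecoded bytes of a form field.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical raw bytes, as they might arrive from a form submitted in UTF-8.
my $raw = "prix: 5 \xE2\x82\xAC";

# Decode the bytes into Perl's internal character string.
my $text = decode('UTF-8', $raw);

# Mark STDOUT as UTF-8 so characters are encoded again on the way out.
binmode(STDOUT, ':encoding(UTF-8)');

print length($raw), " bytes, ", length($text), " characters\n";
print "$text\n";
```

For files, the 3-arg open form carries the same layer, for example open(my $fh, '<:encoding(UTF-8)', $file), where $file is whatever path the script reads.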

What about literal strings (the configurable ones I mentioned)?

You can use the encoding or utf8 pragmas to specify what charset your
source file is in, which includes the literal strings. See their
documentation.
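
For literal strings, a sketch with the utf8 pragma could look like this. It assumes the script itself is saved as UTF-8, and the variable name and string are only examples.

```perl
use strict;
use warnings;
use utf8;   # tells perl that this source file is encoded as UTF-8

# With the pragma in effect, the euro sign below is one character,
# not three separate bytes.
my $label = "Prix en €";

binmode(STDOUT, ':encoding(UTF-8)');
print length($label), "\n";   # counts characters rather than bytes
print "$label\n";
```

Without use utf8 the same literal would be treated as a string of bytes, and length would count the three bytes of the euro sign separately.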

Ben

Dr.Ruud · May 24, 2006

Bart Lateur schreef:

Yohan N. Leder:

No: PHP can load and execute other PHP files. It's the PHP equivalent
of modules in Perl.

And PHP5 has an Autoloader.

Yohan N. Leder · May 25, 2006

I believe Alan and I are suggesting materially the same thing, if that
helps you understand it better (two ways of explaining things are
usually better than one).

Sure! I've copied and pasted your two replies into a text file on my
desktop and will read them again in a few days.

Yohan N. Leder · May 25, 2006

It may be easier to write that version first, and get the
algorithms/whatever right before worrying about the character encoding.
If you write clean code it should be fairly straightforward to add the
conversions afterwards.

Yup. You need to mark filehandles with their encoding, using binmode or
3-arg open: see perlunicode. If you're getting data from CGI variables,
you may need to decode it into Perl's internal format: see Encode for
that.

You can use the encoding or utf8 pragmas to specify what charset your
source file is in, which includes the literal strings. See their
documentation.

Ben

OK, that is indeed less complex. It's a pity I have to support these old
platforms first... But an idea came to mind while reading you: what do
you think about the tools that turn Perl source into an exe? Don't
they embed a 5.8 Perl interpreter?

Do you think I could write this 5.8 version and just convert it to an exe
using this kind of software (I don't know their names) when I have to
install on a server whose interpreter is too old? Is that realistic?

Ben Morrow · May 25, 2006

Quoth Yohan N. Leder said:
OK, that is indeed less complex. It's a pity I have to support these old
platforms first...

But you can write the new version first, if you like. Install perl on
your local machine.

But an idea came to mind while reading you: what do
you think about the tools that turn Perl source into an exe? Don't
they embed a 5.8 Perl interpreter?

They do. However, if your admins will let you install a custom CGI
binary, but won't upgrade perl, then they need their heads examining.

Ben

Yohan N. Leder · May 26, 2006

But you can write the new version first, if you like. Install perl on
your local machine.

Yes, I will.

They do. However, if your admins will let you install a custom CGI
binary, but won't upgrade perl, then they need their heads examining.

Oh, no, it's not our admin but the admin of the company we develop for;
they are our customers. Also, in this same company with the old
interpreters, there are some developers who work on totally different
subjects, mainly using PHP.

We have already developed some things in C (not me, someone else) for
these same servers... So I know it will not be a problem for them
if I install another exe, as long as I don't ask them to upgrade their
main Perl interpreter. On my side, I'm not alone and, in this case,
several of us will say the same thing: we just install an exe rather than a
Perl module, and that will be fine for them.

So my only question is: does converting Perl source to an .exe have
any drawbacks? Will all regexes and built-in functions be supported? Do
you have some software in mind for this kind of conversion? Which is the
most reliable one?
