UTF-8 without external modules on Perl 5.0

Yohan N. Leder

I can't really recommend one or the other. I prefer vendor-independent
standards and I'm a Unix guy, so I would generally prefer iso-8859-15.
OTOH, you probably have more Windows than Unix users, and the Unix users
are probably more able to work around charset issues, so windows-1252
will probably be less trouble to support.

And, knowing that the only difference between ISO-8859-1 and ISO-8859-15
is the euro sign (from what I've understood), why not continue to use
ISO-8859-1 and translate any euro sign to its HTML entity (&euro;)
whenever I encounter one in form data or have to display one found in a
configurable constant?

The main concern about input, in this case, is knowing when to convert
the euro sign: before submission (maybe using JavaScript) or at STDIN
parsing time. The second option requires that STDIN not be corrupted by
the presence of this outside-charset character, as happens with the
euro/checkbox bug shown at
<http://yohannl.tripod.com/cgi-bin/form2dump.pl>; so, for example, I'll
have to remove the checkbox.

What do you think of this approach, Peter and Alan?
 
Yohan N. Leder

Looking at some sites to see what charset they use, I've found something
that seems strange to me: some use ISO-8859-1 (neither UTF-8 nor
ISO-8859-15) and yet accept the euro sign (you can type it in a form
submission and it's accepted and displayed correctly in the resulting
page with <meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1">). How is that possible?

For example, I've found this French site:
http://www.courseapied.net/forum/whowho/nouveau.php

Here you can create what they call a "profil" using an email address and
a password, then fill in your details. I tried entering the € sign in the
profile details and everything looks right: the sign is displayed
correctly in the final page, which uses charset="ISO-8859-1".

Maybe they convert the euro sign to the HTML entity &euro; (as mentioned
elsewhere in the thread)? But when, and at what stage of the process
(before submission or after submission)? Their form doesn't include any
checkbox, so it doesn't run into the IE bug I described.

What do you think about that?
 
Ben Morrow

Quoth Yohan N. Leder said:
And, knowing the only difference between ISO-8859-1 and ISO-8859-15 is
the euro sign (from what I've understood), why not continue to use
ISO-8859-1 and manage to translate any euro sign to its HTML entity
(&euro;)

That's fine for output, but if forms are submitted in the same charset
as the page the form was on, people won't be able to submit an entry
containing a euro. At least, not in any form you will be able to
understand.

The main concern about input, in this case, is to know when to convert
this euro sign : before submission (maybe using javascript)

Yeuch!

or at STDIN
parsing time. The second one requiring that STDIN be not corrupted by
the presence of this outside-charset char

There is no way of identifying a euro sign, however the browser submits
it (and non-broken browsers won't, anyway, as it's not valid). Every
8-bit byte is a valid ISO8859-1 character, so whatever single- or
multi-byte sequence the browser transmits for euro will just look like a
sequence of perfectly valid, but wrong, ISO8859-1 characters.

I think I would recommend either using 8859-15, or, if you think that's
dodgy,

1. work internally in iso8859-15,

2. make sure your output data is plain 7-bit ascii (HTML-escape
everything else),

3. mark the data as UTF-8 (this is valid, as UTF-8 is a strict
superset of 7-bit ascii)

4. decode the UTF-8 submissions into iso8859-15 yourself. This
shouldn't be too hard: there will be some 128 two-byte sequences
you want to translate to single bytes, and any other top-bit-set
character is an error. If you're feeling lazy you could fork
iconv(1) :). You may be able to rip bits from one of the
Unicode::* modules, though I'd expect the actual decoding
routines to be in C (which I guess is no use to you).
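
Step 4 can be sketched in pure Perl with no modules at all. This is only a sketch under assumptions: the sub name utf8_to_latin9 and the policy of returning undef for anything outside ISO-8859-15 are illustrative choices, not something taken from the thread. Note that seven of the eight characters that differ from ISO-8859-1 arrive as two-byte sequences, while the euro sign (U+20AC) arrives as three bytes (E2 82 AC), so the decoder has to handle that case as well.

```perl
use strict;

# Unicode code point => ISO-8859-15 byte, for the eight positions where
# ISO-8859-15 differs from ISO-8859-1.
my %latin9 = (
    0x20AC => 0xA4,   # euro sign
    0x0160 => 0xA6,   # S with caron
    0x0161 => 0xA8,   # s with caron
    0x017D => 0xB4,   # Z with caron
    0x017E => 0xB8,   # z with caron
    0x0152 => 0xBC,   # OE ligature
    0x0153 => 0xBD,   # oe ligature
    0x0178 => 0xBE,   # Y with diaeresis
);

# The Latin-1 characters that used to live at those eight byte values
# do not exist in ISO-8859-15, so they must be rejected.
my %displaced = map { $_ => 1 } values %latin9;

# Hypothetical helper: takes a string of UTF-8 bytes and returns the
# ISO-8859-15 byte string, or undef if the input is malformed or holds
# a character that ISO-8859-15 cannot represent.
sub utf8_to_latin9 {
    my ($in) = @_;
    my $out = '';
    while (length $in) {
        my $cp;
        if ($in =~ s/^([\x00-\x7F])//) {
            $cp = ord $1;                                  # plain ASCII
        }
        elsif ($in =~ s/^([\xC2-\xDF])([\x80-\xBF])//) {
            $cp = ((ord($1) & 0x1F) << 6) | (ord($2) & 0x3F);
        }
        elsif ($in =~ s/^([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])//) {
            $cp = ((ord($1) & 0x0F) << 12)
                | ((ord($2) & 0x3F) << 6)
                |  (ord($3) & 0x3F);
            return if $cp < 0x800;                         # overlong form
        }
        else {
            return;                                        # malformed UTF-8
        }

        if (exists $latin9{$cp}) {
            $out .= chr $latin9{$cp};
        }
        elsif ($cp < 0x100 && !$displaced{$cp}) {
            $out .= chr $cp;                               # same as Latin-1
        }
        else {
            return;                                        # not in ISO-8859-15
        }
    }
    return $out;
}

my $price = utf8_to_latin9("10 \xE2\x82\xAC");   # UTF-8 bytes for "10 €"
```

Stripping the front of the string on every pass is quadratic on long inputs; for typical form fields that should not matter, but a position-based loop would be the obvious refinement.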

Ben
 
Bart Lateur

Yohan said:
What you say here is that PHP can *include* a Perl script ?

No: PHP can load and execute other PHP files. It's the PHP equivalent of
modules in Perl.
 
Alan J. Flavell

You're wrong there, there's more than one difference in the
conversion table

Very true...

But any conversion should be done by using the available modules etc.,
which already know what these tables contain ;-)

Column 1 is the local, single byte character value, column 2 is
Unicode, which is identical to Latin-1 for characters with code
under 256.

(I'll come back to that in a fit of pedantry later.)

But in practical terms (without having really analyzed the
difficulties of using geriatric Perl versions in detail), I would say
this.

Storing the data in utf-8 is the future-proof thing to do, no matter
that the present requirement seems to be limited to English and
French.

*If*, and I stress *if*, the questioner definitely wants to stick with
an 8-bit character encoding, then I do not particularly recommend
iso-8859-15 in relation to HTML. In fact, I would advise against it.
If one is thinking about older browsers, they were generally
supporting windows-1252 and/or utf-8 encodings *before* support for
iso-8859-15 became widespread. So, putting HTML pages in iso-8859-15
seems to me to bring no benefits (no matter how good an idea it might
be for plain text, say).

One approach is to store the data using HTML's &-notations, in
particular "&#number;" where the number is the *Unicode* code point
for the character in question.
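
As a rough illustration of the &#number; approach on an old Perl with no modules, the sketch below rewrites every top-bit-set byte as a numeric character reference. It assumes the stored data is in ISO-8859-15 (that choice, the sub name latin9_to_ncr and the lookup table name are assumptions made for the example); for ISO-8859-1 data the table would simply be empty, since there the byte value is the Unicode code point.

```perl
use strict;

# ISO-8859-15 bytes whose Unicode code point differs from the byte value.
my %latin9_to_unicode = (
    0xA4 => 0x20AC,   # euro sign
    0xA6 => 0x0160,   # S with caron
    0xA8 => 0x0161,   # s with caron
    0xB4 => 0x017D,   # Z with caron
    0xB8 => 0x017E,   # z with caron
    0xBC => 0x0152,   # OE ligature
    0xBD => 0x0153,   # oe ligature
    0xBE => 0x0178,   # Y with diaeresis
);

# Map one ISO-8859-15 byte value to its Unicode code point.
sub latin9_code_point {
    my ($byte) = @_;
    return exists $latin9_to_unicode{$byte} ? $latin9_to_unicode{$byte} : $byte;
}

# Hypothetical helper: replace every non-ASCII byte with "&#number;",
# leaving a pure 7-bit ASCII string. It does not escape &, < or >;
# that ordinary HTML escaping would still be a separate step.
sub latin9_to_ncr {
    my ($text) = @_;
    $text =~ s/([\x80-\xFF])/'&#' . latin9_code_point(ord $1) . ';'/ge;
    return $text;
}

print latin9_to_ncr("Prix : 10 \xA4"), "\n";   # the \xA4 byte becomes &#8364;
```

The result is valid whatever charset the page declares, which is what makes this notation attractive when the interpreter itself knows nothing about Unicode.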

Another possibility, which is in fact supported by most browsers - and
I think I'd say by more browsers than support iso-8859-15, although by
now the browsers where it makes a difference are really quite old - is
to use Windows-1252. This contains not only all of the graphic
characters which are in iso-8859-15, but some more in addition,
permitting some additional languages such as Czech. But it's a
dead-end compared with supporting Unicode (whether by utf-8 or by
&#number; notation).

I think that's about the most balanced advice I can offer.

Well, frankly it would be better to write the geriatric Perl versions
out of the way, and get on and do the job properly! That's the best
answer, to be honest.

....

<pedantic>
"Latin-1" is the name for a character repertoire, without reference
to its encoding. When you referred to "Latin-1" above, I reckon you
could only possibly have meant its ISO encoding, i.e. iso-8859-1. But
there exist other encodings which cover the Latin-1 repertoire,
including EBCDIC Latin-1 (CP1047), and the MS-DOS codepage CP850 also
covers the repertoire - in each case of course with the characters
differently arranged. So IMHO it's preferable to be explicit about
which encoding you are referring to, when the topic is an encoding
rather than just a repertoire.

regards
 
Yohan N. Leder

Well, frankly it would be better to write the geriatric Perl versions
out of the way, and get on and do the job properly! That's the best
answer, to be honest.

Hmm, OK about using HTML entities to handle Unicode values... Now I have
to step back and think more deeply about your approach and Ben's.

However, imagine I release a second version that drops support for Perl
before 5.8: how do I support UTF-8 in full (I/O and internally)? What are
the key points to check and/or rewrite in the scripts? Do all regexes and
built-in functions work on UTF-8 strings? What about literal strings (the
configurable ones I mentioned)?
 
Yohan N. Leder

That's fine for output, but if forms are submitted in the same charset
as the page the form was on, people won't be able to submit an entry
containing a euro. At least, not in any form you will be able to
understand.


There is no way of identifying a euro sign, however the browser submits
it (and non-broken browsers won't, anyway, as it's not valid). Every
8-bit byte is a valid ISO8859-1 character, so whatever single- or
multi-byte sequence the browser transmits for euro will just look like a
sequence of perfectly valid, but wrong, ISO8859-1 characters.

I think I would recommend either using 8859-15, or, if you think that's
dodgy,

1. work internally in iso8859-15,

2. make sure your output data is plain 7-bit ascii (HTML-escape
everything else),

3. mark the data as UTF-8 (this is valid, as UTF-8 is a strict
superset of 7-bit ascii)

4. decode the UTF-8 submissions into iso8859-15 yourself. This
shouldn't be too hard: there will be some 128 two-byte sequences
you want to translate to single bytes, and any other top-bit-set
character is an error. If you're feeling lazy you could fork
iconv(1) :). You may be able to rip bits from one of the
Unicode::* modules, though I'd expect the actual decoding
routines to be in C (which I guess is no use to you).

Ben

Hmm, I have to think deeper about your solution above and the one Alan
talks about (UTF-8 support through full "&code;" encoding). I have to
decide but still a little bit undecided at this time : what don't the
unicode be not invented from the beginning :-?
 
Ben Morrow

Hmm, I have to think deeper about your solution above and the one Alan
talks about (UTF-8 support through full "&code;" encoding).

I believe Alan and I are suggesting materially the same thing, if that
helps you understand it better (two ways of explaining things are
usually better than one :).

I have to
decide but still a little bit undecided at this time : what don't the
unicode be not invented from the beginning :-?

(ITYM 'why' not 'what')

Yes... :) That would have made life easier.
 
Ben Morrow

Quoth Yohan N. Leder said:
Hmm; OK about the html entities to handle unicode values... Now, I've to
turn around and think deeper about your way and the Ben's one.

Well, however, imagine I'll release a second version without any support
for Perl before to 5.8

It may be easier to write that version first, and get the
algorithms/whatever right before worrying about the character encoding.
If you write clean code it should be fairly straightforward to add the
conversions afterwards.

: how to support UTF-8 in full (i/o and
internally). What the key points to check and/or rewrite in scripts ?
Does all regex and built_in functions support to work from UTF-8 strings
?

Yup. You need to mark filehandles with their encoding, using binmode or
3-arg open: see perlunicode. If you're getting data from CGI variables,
you may need to decode it into Perl's internal format: see Encode for
that.
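
A minimal sketch of those two points on Perl 5.8 or later might look like the following. The sample bytes stand in for a form value as a CGI script would receive it, and the in-memory filehandle is used only so the example needs no real file; both are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Mark STDOUT so that character strings are written out as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical raw form value: the UTF-8 bytes for "10 €".
my $raw = "10 \xE2\x82\xAC";

# Decode the bytes into Perl's internal character format. FB_CROAK makes
# malformed input die instead of passing silently; LEAVE_SRC keeps $raw
# intact, since decode() may otherwise consume its source string.
my $text = decode('UTF-8', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC);

print length($raw), " bytes, ", length($text), " characters\n";
print "$text\n";

# 3-arg open with an encoding layer (written to an in-memory buffer here;
# a file name would go where \my $buffer is).
open(my $fh, '>:encoding(UTF-8)', \my $buffer) or die "open failed: $!";
print {$fh} $text;
close($fh) or die "close failed: $!";
```

Once the data has been decoded, regexes and built-ins such as length, substr and uc operate on characters rather than bytes, which is why the two lengths above differ.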

What about literal strings (the configurable one I told about) ?

You can use the encoding or utf8 pragmas to specify what charset your
source file is in, which includes the literal strings. See their
documentation.
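
For literal strings, a small sketch with the utf8 pragma is shown below. It assumes the script itself is saved as UTF-8, and the variable name and sample text are made up for the example. As far as I know the encoding pragma was deprecated in later Perl releases, so utf8 is probably the safer of the two to rely on.

```perl
use strict;
use warnings;
use utf8;                          # the source file itself is UTF-8

binmode(STDOUT, ':encoding(UTF-8)');

# A configurable literal containing a non-ASCII character.
my $label = "Prix : 10 €";

# With "use utf8" the euro sign counts as one character, not three bytes.
print length($label), " characters\n";
print "$label\n";
```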

Ben
 
Yohan N. Leder

I believe Alan and I are suggesting materially the same thing, if that
helps you understand it better (two ways of explaining things are
usually better than one :).

Sure! I've copied and pasted your two replies into a text file on my
desktop and will read them again in a few days :)
 
Yohan N. Leder

It may be easier to write that version first, and get the
algorithms/whatever right before worrying about the character encoding.
If you write clean code it should be fairly straightforward to add the
conversions afterwards.


Yup. You need to mark filehandles with their encoding, using binmode or
3-arg open: see perlunicode. If you're getting data from CGI variables,
you may need to decode it into Perl's internal format: see Encode for
that.


You can use the encoding or utf8 pragmas to specify what charset your
source file is in, which includes the literal strings. See their
documentation.

Ben

OK, that's indeed less complex. It's a pity I have to support these old
platforms first... But an idea came to mind while reading you: what do
you think about the tools that turn a Perl source into an exe? Don't
they embed a 5.8 Perl interpreter?

Do you think I could write this 5.8 version and just convert it to an exe
using this kind of software (I don't know their names) when I have to
install on a server with too old an interpreter? Is it realistic?
 
Ben Morrow

Quoth Yohan N. Leder said:
OK, that's indeed less complex. It's a pity I have to support these old
platforms first...

But you can write the new version first, if you like. Install perl on
your local machine.

But an idea came to mind while reading you: what do
you think about the tools that turn a Perl source into an exe? Don't
they embed a 5.8 Perl interpreter?

They do. However, if your admins will let you install a custom CGI
binary, but won't upgrade perl, then they need their heads examining. :)

Ben
 
Yohan N. Leder

But you can write the new version first, if you like. Install perl on
your local machine.

Yes, I will.

They do. However, if your admins will let you install a custom CGI
binary, but won't upgrade perl, then they need their heads examining. :)

Oh, no, they're not our admins but the admins of the company we develop
for; they are our customers. Also, in this same company with the old
interpreters, there are some developers who work on totally different
subjects, mainly using PHP.

We have already developed some things in C (not me, but someone else) for
these same servers... So I know there won't be any problem for them if I
install another exe, as long as I don't ask them to upgrade their own
main Perl interpreter. On my side, I'm not alone and, in this case,
several of us will say the same thing: we're just installing an exe,
rather than a Perl module, and that will be fine with them.

So, my only question is: does converting Perl source to an .exe have
any drawbacks? Will all regexes and built-in functions be supported? Do
you have some software in mind for this kind of conversion? What's the
most reliable one?
 
