Helmut Richter
I have the task of describing for authors how to prepare forms with CGI
scripts in Perl, and in particular how to modify existing scripts to conform
to a new CMS. The CGI-generated pages are now all encoded in UTF-8.
If I have understood everything correctly, getting the standard CGI module
and the Encode module to work together is utterly tedious, as explained below.
Perhaps I have missed the obvious.
Dealing with UTF-8 requires that byte strings and text strings are
meticulously kept apart. Now, one of the functions of the CGI module is to
reuse the last input as the default for the next time. But the input is a byte
string, so the default value must be a byte string as well. An example:
We want to ask for a location and offer "München" (Munich's German name) as
the default answer in the form. The obvious, but wrong, way would be
$cgi->textfield(-name =>'ort', -value => 'München', -size => 40)
but that would interpret the string 'München' as a text string. This is always
wrong: either STDOUT is binary, and then the wide character will hurt; or
STDOUT is UTF-8 (that is, binmode (STDOUT, ":utf8"); has been done), and then
the value, if not modified by the user of the form, comes back as something
else, in this case as 'MÃ¼nchen', with the two bytes of the one UTF-8 character
interpreted as two characters. After all, there is no way to do the equivalent
of binmode for the POST method of CGI.
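The garbling described above can be reproduced without any CGI code at all. The following is only a minimal sketch of the effect, not of what the CGI module does internally: it encodes the text string to UTF-8 bytes and then deliberately misreads each byte as one character. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                        # the literal below is a text string
use Encode qw(encode decode);

my $text  = 'München';                       # text string, 7 characters
my $bytes = encode('UTF-8', $text);          # byte string, ü becomes 0xC3 0xBC

# Misreading every byte as one Latin-1 character turns the two bytes
# of the ü into two separate characters.
my $garbled = decode('ISO-8859-1', $bytes);

binmode(STDOUT, ':encoding(UTF-8)');
printf "%d characters, %d bytes, garbled form: %s\n",
    length($text), length($bytes), $garbled;
```

The character count and the byte count differ by one, which is exactly the extra character that shows up in the value that comes back.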
The only workaround I have found is to use byte strings consistently:
$Muenchen = encode ('utf8', 'München');
$cgi->textfield(-name =>'ort', -value => $Muenchen, -size => 40)
This works but has the drawback that an extra step of decoding all input
values to text strings is required when the interaction with the user of
the form is over.
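The extra decoding step might look roughly like the sketch below. It assumes the submitted values have already been collected into a hash of byte strings; the hash %raw and the second field 'plz' are hypothetical stand-ins for whatever the CGI object returns in a real script.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the submitted form fields (byte strings).
my %raw = (
    ort => "M\xC3\xBCnchen",     # UTF-8 bytes as sent by the browser
    plz => '80333',
);

# Decode every value once, at the boundary, into a text string.
my %text = map { $_ => decode('UTF-8', $raw{$_}) } keys %raw;

binmode(STDOUT, ':encoding(UTF-8)');
for my $name (sort keys %text) {
    printf "%s: %s (%d characters)\n", $name, $text{$name}, length $text{$name};
}
```

Doing the decoding in one place keeps the rest of the script working on text strings only, at the price of the additional pass over all fields.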
I have the suspicion that I am thinking too complicated and that there is a
simple -- and simple to explain -- method for dealing with CGI forms when the
encoding used is UTF-8.