CGI and UTF-8


Helmut Richter

I have the task of describing for authors how to prepare forms with CGI scripts
in Perl, in particular, how to modify existing scripts to conform to a new
CMS. Meanwhile the CGI-generated pages are all encoded in UTF-8.

If I have understood everything correctly, the cooperation of the standard CGI
module and the Encode module is utterly tedious, as explained below. Perhaps
I have not seen the obvious.

Dealing with UTF-8 requires that byte strings and text strings are
meticulously kept apart. Now, one of the functions of the CGI module is the
reuse of the last input as the default for the next time. But the input is a byte
string, so the default value must be a byte string as well. An example:

We want to ask for a location and provide the default answer "München"
(Munich's German name) in the form. The obvious, but wrong, way would be

$cgi->textfield(-name =>'ort', -value => 'München', -size => 40)

but that would interpret the string 'München' as a text string. This is always
wrong: either STDOUT is binary, and then the wide character will hurt; or else
STDOUT is UTF-8 (that is, binmode (STDOUT, ":utf8"); has been done), and then the
value, if not modified by the user of the form, comes back as something else,
in this case as 'MÃ¼nchen' with the two bytes of the one UTF-8 character
interpreted as two characters. After all, there is no way to do the equivalent
of binmode for the post method of CGI.

The only work-around I have found is to use byte strings consistently:

$Muenchen = encode ('utf8', 'München');
$cgi->textfield(-name =>'ort', -value => $Muenchen, -size => 40)

This works but has the drawback that an extra step of decoding all input
values to text strings is required when the interaction with the user of
the form is over.
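Spelled out, a minimal sketch of this byte-string workaround might look as
follows (this assumes plain CGI.pm and Encode, a script source saved as UTF-8,
and STDOUT kept binary; the overall structure is only illustrative):

#!/usr/bin/perl
# Sketch only: defaults go in as byte strings, input comes back as byte strings
# and is decoded into text strings afterwards.
use strict;
use warnings;
use utf8;          # literals like 'München' in this source are text strings
use CGI;
use Encode;

my $cgi = CGI->new;

# Encode the default before CGI sees it, so only byte strings get printed.
my $Muenchen = encode('utf8', 'München');

print $cgi->header(-type => 'text/html', -charset => 'UTF-8'),
      $cgi->start_html(-title => 'Ort', -encoding => 'UTF-8'),
      $cgi->start_form,
      $cgi->textfield(-name => 'ort', -value => $Muenchen, -size => 40),
      $cgi->submit,
      $cgi->end_form,
      $cgi->end_html;

# The extra step: every value coming back must be decoded into a text string.
my $ort = $cgi->param('ort');
$ort = decode('utf8', $ort) if defined $ort;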

I have the suspicion that I am thinking too complicated and that there is a
simple -- and simple to explain -- method for dealing with CGI forms when the
encoding used is UTF-8.
 

Jürgen Exner

Helmut Richter said:
We want to ask for a location and provide the default answer "München"[...]

$cgi->textfield(-name =>'ort', -value => 'München', -size => 40)

but that would interpret the string 'München' as a text string. This is always
wrong: either STDOUT is binary, and then the wide character will hurt; or else
STDOUT is UTF-8 (that is, binmode (STDOUT, ":utf8"); has been done), and then the
value, if not modified by the user of the form, comes back as something else,
in this case as 'MÃ¼nchen' with the two bytes of the one UTF-8 character
interpreted as two characters. After all, there is no way to do the equivalent
of binmode for the post method of CGI.

I assume you did set the META charset of the HTML page to UTF-8? Or did
you let the browser guess about the encoding and then it returned the
wrong encoding in the form response?

jue
 

Peter J. Holzer

On 2009-09-28 13:41, Helmut Richter said:
I have the suspicion that I am thinking too complicated and that there is a
simple -- and simple to explain -- method for dealing with CGI forms when the
encoding used is UTF-8.

AFAICT no. Newer versions of CGI have some UTF-8 support, but it isn't
documented at all. In previous threads I've poked around a bit in it
and posted what I found:

* http://groups.google.at/groups/[email protected]&hl=en

* http://groups.google.at/groups/[email protected]&hl=en

Hope that gives you a starting point.

hp
 

Jochen Lehmeier

If I have understood everything correctly, the cooperation of the
standard CGI module and the Encode module is utterly tedious, as
explained below. Perhaps I have not seen the obvious.

Perhaps. I don't exactly know what's going on with your code. I have only had
good results when using existing CGI scripts with utf8. That is, scripts that
used to run with latin1 were deployed "as is" in a utf8 setting.

The biggest issues I ran into were with DBD::Oracle, which has some very ugly
problems in the utf8 world indeed (which, to be honest, are documented as
"features"), but that is a different story, not related to CGI.
Dealing with UTF-8 requires that byte strings and text strings are
meticulously kept apart.

Uhm. What are byte strings, what are text strings? Perl does not use these
words in the context of utf8.
else, STDOUT is UTF-8 (that is, binmode (STDOUT, ":utf8"); has been done),

This should not be done. The correct line would be

binmode STDOUT,":encoding(utf8)";

This activates error checking etc., while your version treats strings as utf8
while not checking them at all, which could lead to bad_things[tm] (some docs
hinted at segmentation faults even, though I do not know if that is true).
in this case as 'MÃ¼nchen' with the two bytes of the one UTF-8 character
interpreted as two characters. After all, there is no way to do the
equivalent of binmode for the post method of CGI.

Sure there is:

binmode STDIN,":encoding(utf8)";
$query=new CGI();

If for some reason you cannot run the binmode before you create the $query
object (this happened to me for some reason I won't go into), then it's no
problem either. You can convert the parameters after "new CGI()" has read
them from STDIN:

# Warning: treat this as PSEUDO-CODE, it is from memory only
use CGI;
use Encode;

my $query = CGI->new();
foreach my $key ($query->param) {
    # decode the raw byte-string value into a perl text string and store it back
    $query->param($key, Encode::decode("utf8", scalar $query->param($key)));

    # Treating file upload parameters and multi-value parameters is left
    # as an exercise for the reader.
}
I have the suspicion that I am thinking too complicated

Aye. ;-)
and that there is a simple -- and simple to explain -- method for
dealing with CGI forms when the encoding used is UTF-8.

binmode ... ":encoding(utf8)" on both STDIN and STDOUT. Plus proper declaration
of the charset for your browser (in the HTTP header and the HTML header, just
to be sure).
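For illustration, a minimal sketch of that setup (assuming CGI.pm, a script
source saved as UTF-8, and the 'ort' field from the earlier example; the
details are only illustrative):

#!/usr/bin/perl
# Sketch only: :encoding(utf8) on both standard handles, charset declared
# in the HTTP header and in the HTML head.
use strict;
use warnings;
use utf8;                              # literals in this source are text strings
use CGI;

binmode STDIN,  ":encoding(utf8)";     # decode incoming form data
binmode STDOUT, ":encoding(utf8)";     # encode everything printed

my $query = CGI->new();

print $query->header(-type => 'text/html', -charset => 'UTF-8'),
      $query->start_html(-title => 'Ort', -encoding => 'UTF-8'),
      $query->start_form,
      $query->textfield(-name => 'ort', -value => 'München', -size => 40),
      $query->submit,
      $query->end_form,
      $query->end_html;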

Good luck!
 

sln

I have the task of describing for authors how to prepare forms with CGI scripts
in Perl, in particular, how to modify existing scripts to conform to a new
CMS. Meanwhile the CGI-generated pages are all encoded in UTF-8.

This works but has the drawback that an extra step of decoding all input
values to text strings is required when the interaction with the user of
the form is over.

I have the suspicion that I am thinking too complicated and that there is a
simple -- and simple to explain -- method for dealing with CGI forms when the
encoding used is UTF-8.

With Perl 5.10, the cgi.pm version is $CGI::VERSION='3.41'. After some poking
around in it, it looks as though it does all its filehandle work in binary
mode (more so for the uploads, I guess).

Without specifying the charset in cgi, my browser will display these
cgi-generated literal strings (the UTF-8-encoded 'MÃ¼nchen' and the plain
'München') as:

MÃ¼nchen München - Western European (guessed)
München M?nchen - UTF-8 (user forced)

You get the same result as the second line if the html form is set to
charset utf-8.

If the form is coming back as 'MÃ¼nchen', which is utf-8, does that mean
you set the html charset to utf-8? I mean, it shouldn't otherwise, should it?

For OUTPUT, it's better to set the charset to utf-8 and then encode those
strings that are unicode (ASCII doesn't matter), or set the binmode of STDOUT
to :utf8 if you want to do everything. The encode step is what you did above:
$Muenchen = encode ('utf8', 'München');
$cgi->textfield(-name =>'ort', -value => $Muenchen, -size => 40)

For form INPUT, cgi.pm will auto-decode utf8 in all form parameters for you
when you query them. It's the same decode you did above.
This can be set with a pragma in the use CGI statement, like
use CGI qw/:standard -utf8/;
Apparently this pragma will only decode input.
From the docs:
" PRAGMAS
-utf8
This makes CGI.pm treat all parameters as UTF-8 strings.
Use this with care, as it will interfere with the processing of binary uploads.
It is better to manually select which fields are expected to return utf-8 strings
and convert them using code like this:
use Encode;
my $arg = decode utf8=>param('foo');
"

No matter how you look at it, if you need utf8 for input/output, there will be some
encode/decode going on somewhere.

You can avoid the encoding hassle by setting the binmode
of STDOUT to utf8 (then this is ok:
$cgi->textfield(-name =>'ort', -value => 'München', -size => 40)
), and if you don't expect any binary upload data (input), you can
avoid the decode hassle by setting the -utf8 pragma for the
form input parameters.
Then set the charset to utf-8.
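Put together, a rough sketch of that combination (assuming no binary uploads;
the script source is saved as UTF-8 and the 'ort' field is the one from the
example):

#!/usr/bin/perl
# Sketch only: -utf8 decodes incoming parameters, the STDOUT layer and the
# declared charset take care of the output side.
use strict;
use warnings;
use utf8;
use CGI qw/:standard -utf8/;

binmode STDOUT, ":encoding(utf8)";

print header(-charset => 'UTF-8'),
      start_html(-encoding => 'UTF-8'),
      start_form,
      textfield(-name => 'ort', -value => 'München', -size => 40),
      submit,
      end_form,
      end_html;

# param('ort') now returns an already decoded text string.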

Good luck!
-sln
 

Helmut Richter

I assume you did set the META charset of the HTML page to UTF-8? Or did
you let the browser guess about the encoding and then it returned the
wrong encoding in the form response?

The problem is that the correct bytes arrive but are interpreted in a wrong
way. Two answers in this thread show two different ways to control that
interpretation. I'll try them out.

Thanks to all who have responded.
 

Jochen Lehmeier

What do you mean, 'validate'?

To raise an error or at least display some warning about it when it encounters
invalid bytes in an utf8 flagged string, like perl does at many other places.
Perl strings are (logically) sequences of
Unicode characters, and any sequence of Unicode characters can be
represented in utf8. If you end up with a perl string with a corrupted
internal representation you've got bigger problems than invalid output
encoding.

Or I might simply be using some code which raises the utf8 flag on strings
that are not.

Example:
cat test.pl
#!/usr/bin/perl -w

use strict;
use Encode;

my $validUTF8 = "\x{1010}";
my $bytes = encode("utf8",$validUTF8); # e1 80 90
my $invalidUTF8 = substr($bytes,0,length($bytes)-1); # e1 80

open(TMP,">invalidutf8.txt") or die;
print TMP $invalidUTF8;
close TMP;

open(TMP,"<invalidutf8.txt") or die;
binmode TMP,":utf8";
my $invalid2 = <TMP>;
close TMP;

print STDERR "is_utf8: '".utf8::is_utf8($invalid2).
"' valid: '".utf8::valid($invalid2)."'\n".
"length: ".length($invalid2)."\n";

print $invalid2;

binmode STDOUT,":utf8";
print $invalid2;

binmode STDOUT,":encoding(utf8)";
print $invalid2;

perl --version

This is perl, v5.8.8 built for i486-linux-gnu-thread-multi
....
perl test.pl | hexdump -C
utf8 "\xE1" does not map to Unicode at test.pl line 16, <TMP> line 1.
Malformed UTF-8 character (unexpected end of string) in length at test.pl
line 19.
is_utf8: '1' valid: ''
length: 0
Wide character in print at test.pl line 23.
00000000 e1 80 e1 80 e1 80 |......|
00000006
hexdump -C invalidutf8.txt
00000000 e1 80 |..|
00000002


The first line ("...does not map...") comes from reading the "binmode :utf8"
handle. Note that $invalid2 contains exactly those two broken bytes from
invalidutf8.txt, anyway.

The next validation message comes from length($invalid2) ("...Malformed
UTF-8..."). Note that the string is indeed utf8 flagged, though perl has
noticed that it is invalid.

This example is just to provide a quick test case for invalid utf8 in an utf8
string in perl. My point from the previous post was that I assumed
:encoding(utf8) on the output handle would at least give another
"...malformed..." message, or, better yet, would not output anything. It does
not, though; it silently and happily outputs the broken utf8, just like
printing to a non-utf8 handle (hence the "Wide character in print") or a
":utf8" handle.

You are right then - "encoding(utf8)" seems only to differ from "utf8" when
used on an input handle. If the 'binmode TMP,":encoding(utf8)";' is used when
reading the broken bytes in, then everything works fine ($invalid2 == undef,
in that case).

BTW, this is not purely hypothetical for me; I have to work with some broken
modules which I cannot easily change, which in some weird cases produce such
invalid utf8 strings.

It would be interesting to see whether newer perls behave the same. Is there
someone who would like to run the test script through 5.10?
 

Jochen Lehmeier

Make sure you check every string that might potentially be corrupted
with utf8::valid before using it for anything else.

Yes, of course. It would just be nice if :encoding(utf8) on an output layer
would catch those, kind of as a last resort, just the same as :encoding(utf8)
catching them on input. Or, to put it the other way round, I find it funny
that perl warns in something like the length() function, but not while
actually printing. Well, maybe in Perl 5.12. ;-)
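(For what it is worth, a minimal sketch of such a manual last-resort check
before printing; the helper name is made up for illustration:)

# Guard a possibly corrupted string with utf8::valid before printing it.
sub print_if_valid {
    my ($s) = @_;
    if (utf8::valid($s)) {
        print $s;
    } else {
        warn "refusing to print a string with a corrupted internal representation\n";
    }
    return;
}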
 

Helmut Richter

From the docs:
" PRAGMAS
-utf8
This makes CGI.pm treat all parameters as UTF-8 strings.
Use this with care, as it will interfere with the processing of binary uploads.

This is the same problem for *both* solutions offered in this thread:
the utf8 pragma, and setting binmode on both STDIN and STDOUT. In fact, I have
the suspicion that the effect of the pragma is not much more than such a
setting.
It is better to manually select which fields are expected to return utf-8 strings
and convert them using code like this:
use Encode;
my $arg = decode utf8=>param('foo');

This is much less than half of the story. Getting a single parameter is a
fairly easy thing to do, with or without the CGI module. Using the CGI
module for producing HTML is only a very cumbersome way of writing
something in a complicated syntax that is much easier written directly in
HTML. For which task does the CGI module offer significant help, compared
with simply outputting HTML and analysing the input?

One of the (relatively few) things that are easier with the CGI module
than without is reusing the input values as defaults for the same form
when it must be output again because of incompletely or wrongly filled-in
values. Now, if I have to touch every single value, decode it, and store
it back into the structure, I could have hand-programmed that reuse with
not more effort.
 

sln

This is the same problem for *both* solutions offered in this thread:
the utf8 pragma, and setting binmode on both STDIN and STDOUT. In fact, I have
the suspicion that the effect of the pragma is not much more than such a
setting.

Actually, this does NOT refer to the use utf8; pragma, which is not a solution.
This is actually a parameter, a private symbol that CGI.pm uses. When set by
the user, CGI.pm does automatic parameter decoding (so you don't have to).
I haven't used CGI, but why does STDIN need binmode set to utf8? For the same
reason STDOUT needs to be set?

Anyway, this is some code from CGI.pm (look for the lines marked with '-->' in
the margin), and you should look at the actual module once in a while to glean
some insight.
==============
# >>>>> Here are some globals that you might want to adjust <<<<<<
sub initialize_globals {
    ...
    # return everything as utf-8
--> $PARAM_UTF8 = 0;
    ...
}

sub _setup_symbols {
    my $self = shift;
    my $compile = 0;

    # to avoid reexporting unwanted variables
    undef %EXPORT;

    foreach (@_) {
        $HEADERS_ONCE++,          next if /^[:-]unique_headers$/;
        $NPH++,                   next if /^[:-]nph$/;
        $NOSTICKY++,              next if /^[:-]nosticky$/;
        $DEBUG=0,                 next if /^[:-]no_?[Dd]ebug$/;
        $DEBUG=2,                 next if /^[:-][Dd]ebug$/;
        $USE_PARAM_SEMICOLONS++,  next if /^[:-]newstyle_urls$/;
-->     $PARAM_UTF8++,            next if /^[:-]utf8$/;
        $XHTML++,                 next if /^[:-]xhtml$/;
        $XHTML=0,                 next if /^[:-]no_?xhtml$/;
        $USE_PARAM_SEMICOLONS=0,  next if /^[:-]oldstyle_urls$/;
        $PRIVATE_TEMPFILES++,     next if /^[:-]private_tempfiles$/;
        $TABINDEX++,              next if /^[:-]tabindex$/;
        $CLOSE_UPLOAD_FILES++,    next if /^[:-]close_upload_files$/;
        $EXPORT{$_}++,            next if /^[:-]any$/;
        $compile++,               next if /^[:-]compile$/;
        $NO_UNDEF_PARAMS++,       next if /^[:-]no_undef_params$/;

        # This is probably extremely evil code -- to be deleted some day.
        if (/^[-]autoload$/) {
            my($pkg) = caller(1);
            *{"${pkg}::AUTOLOAD"} = sub {
                my($routine) = $AUTOLOAD;
                $routine =~ s/^.*::/CGI::/;
                &$routine;
            };
            next;
        }

        foreach (&expand_tags($_)) {
            tr/a-zA-Z0-9_//cd;  # don't allow weird function names
            $EXPORT{$_}++;
        }
    }
    _compile_all(keys %EXPORT) if $compile;
    @SAVED_SYMBOLS = @_;
}

#### Method: param
# Returns the value(s)of a named parameter.
# If invoked in a list context, returns the
# entire list. Otherwise returns the first
# member of the list.
# If name is not provided, return a list of all
# the known parameters names available.
# If more than one argument is provided, the
# second and subsequent arguments are used to
# set the value of the parameter.
####
sub param {
    my($self,@p) = self_or_default(@_);
    return $self->all_parameters unless @p;
    my($name,$value,@other);

    # For compatibility between old calling style and use_named_parameters() style,
    # we have to special case for a single parameter present.
    if (@p > 1) {
        ($name,$value,@other) = rearrange([NAME,[DEFAULT,VALUE,VALUES]],@p);
        my(@values);

        if (substr($p[0],0,1) eq '-') {
            @values = defined($value) ? (ref($value) && ref($value) eq 'ARRAY' ? @{$value} : $value) : ();
        } else {
            foreach ($value,@other) {
                push(@values,$_) if defined($_);
            }
        }
        # If values is provided, then we set it.
        if (@values or defined $value) {
            $self->add_parameter($name);
            $self->{param}{$name}=[@values];
        }
    } else {
        $name = $p[0];
    }

    return unless defined($name) && $self->{param}{$name};

    my @result = @{$self->{param}{$name}};

--> if ($PARAM_UTF8) {
-->     eval "require Encode; 1;" unless Encode->can('decode'); # bring in these functions
-->     @result = map {ref $_ ? $_ : Encode::decode(utf8=>$_) } @result;
    }

    return wantarray ? @result : $result[0];
}
=====================


This is much less than half of the story. Getting a single parameter is a
fairly easy thing to do, with or without the CGI module.

As stated above, you can tell CGI.pm to decode ALL input parameters for you,
which saves you from having to do it individually.
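As a rough illustration of the effect (this relies on CGI.pm's documented
ability to initialize a query object from a query string; the parameter name
is made up):

use CGI qw/-utf8/;
my $q   = CGI->new('ort=M%C3%BCnchen');   # the UTF-8 bytes for "ü", url-escaped
my $ort = $q->param('ort');               # already decoded by param()
print length($ort), "\n";                 # 7 characters, not 8 bytes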

Using the CGI
module for producing HTML is only a very cumbersome way of writing
something in a complicated syntax that is much easier written directly in
HTML. For which task does the CGI module offer significant help, compared
with simply outputting HTML and analysing the input?

One of the (relatively few) things that are easier with the CGI module
than without is reusing the input values as defaults for the same form
when it must be output again because of incompletely or wrongly filled-in
values. Now, if I have to touch every single value, decode it, and store
it back into the structure, I could have hand-programmed that reuse with
not more effort.

Yeah, there is no argument there. The Unicode (flavor utf-8) gatekeeper for
Perl is usually the file i/o layers, which cover both input and output.
For example, the :utf8 layer on handles.

However, binary text can creep into data variables at various times and via
various paths. I guess it's up to you to know where and when this can happen.
Decode in, encode out if that's the case. But don't encode more than once,
and don't encode data that hasn't been decoded.

-sln
 

Peter J. Holzer

There's no need to suspect: read the docs. The only effect of the utf8
pragma is to tell perl that your source code is written in UTF-8.

I think he meant the -utf8 pragma of the CGI module, i.e.

use CGI qw/-utf8/;

hp
 

Ben Bullock

This is the same problem for *both* solutions offered in this thread:
the utf8 pragma, and setting binmode on both STDIN and STDOUT. In fact, I have
the suspicion that the effect of the pragma is not much more than such a
setting.

This web page has a working example of what sln is discussing:

http://www.lemoda.net/perl/strip-diacritics/

The line

use CGI '-utf8';

makes CGI.pm do what you seem to want it to do: convert the input from
the form into Perl's "utf8" or "character semantics" or whatever
they're calling it these days.

(I tried to send this by a free server & that seems to have failed, so
I am resending via Google Groups. Apologies if this message turns up
twice.)
 

cmic

Hello

Of course, perl's definition of 'utf8' is different from the Unicode
Consortium's 'UTF-8': the standard forbids representations of surrogates
and unassigned codepoints (and possibly other things I've forgotten). If
you want perl to enforce these restrictions you need to ask for it with
:encoding(UTF-8) (this appears to only be documented in perldoc Encode).

OK. But I can't find any clear explanation of this in perldoc Encode, only in
the documentation of binmode.

Thank you for letting me go over this point, though.
Rgds
 

Helmut Richter

This web page has a working example of what sln is discussing:

http://www.lemoda.net/perl/strip-diacritics/

This is an entirely different issue, albeit also an interesting one.

It occurs when data are expected in a restricted character set, e.g. in
ISO-8859-1, but are input via a medium that allows UTF-8, e.g. a Web form.
Then an encode function is needed that would not blow up with the first
character that is not in the target character set.

It is not easy to provide a standard solution because the solution is
dependent on the target character set, where US-ASCII is not the only
conceivable one. For instance, real quotes (U+201C/D/E) should be mapped
to ASCII quotes (U+0022) if and only if the target character set does not
contain them (e.g. they are not contained in ISO-8859-1 but contained in
its Windows cousin CP1252).

The solution presented at that URL covers only the trivial (but useful
because many characters are affected) case where stripping diacritics does
the job. Many others (quotes, ordinary mathematical symbols, etc.) must be
given reasonable substitutes by hand.

And, of course, different cultures may differ on which substitutes do the job
best. For a German, an "ä" is in any case to be rendered as "ae"; for Finnish
or Swedish people, this is not the right thing to do.
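To make this concrete, here is a small sketch of such a hand-made,
target-dependent substitution (the table is deliberately tiny and purely
illustrative; it relies on the coderef form of Encode's CHECK argument):

#!/usr/bin/perl
# Sketch only: map characters that ISO-8859-1 cannot represent to hand-picked
# substitutes; characters Latin-1 does contain (like "ü") pass through untouched.
use strict;
use warnings;
use Encode;

my %subst = (
    "\x{201C}" => '"',    # left double quotation mark
    "\x{201D}" => '"',    # right double quotation mark
    "\x{201E}" => '"',    # double low-9 quotation mark
    "\x{2013}" => '-',    # en dash
);

my $text   = "Er sagte: \x{201E}M\x{FC}nchen\x{201C} \x{2013} sch\x{F6}n.";
my $latin1 = encode(
    'iso-8859-1',
    $text,
    sub {                                 # called once per unmappable character
        my $codepoint = shift;
        exists $subst{ chr $codepoint } ? $subst{ chr $codepoint } : '?';
    },
);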
 
