parsing UTF-8 chars out of POST data

Aaron Anodide

Hello,

My question first: What is the correct way to deal with % signs in the
POST data?

Here's my situation: I have a CGI script receiving POST data:

PASS=hello%C2%A3

The %C2%A3 was generated by pressing ALT+156 (British pound sign).

Legacy code I'm using calls CGI::unescape to process the %'s, so in
this case it effectively calls (I don't know why it uses eval):

eval '$password = CGI::unescape($in[$i]);';

However, when this returns, length($password) = 7.

If I set a local variable to the same string:

$password1 = "hello£"; (this time using alt-156 directly in my
editor)

Then length($password1) = 6.
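The difference between 7 and 6 is the difference between counting bytes and counting characters. A minimal sketch, assuming the form data is UTF-8 and using a hypothetical percent_decode helper in place of CGI::unescape (only the core Encode module is used):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for CGI::unescape: turn %XX escapes into raw bytes.
sub percent_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;                              # '+' means space in form data
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # each %XX becomes one byte
    return $s;
}

my $bytes = percent_decode('hello%C2%A3');   # byte string: h e l l o C2 A3
my $chars = decode('UTF-8', $bytes);         # character string ending in U+00A3

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
```

Unescaping only undoes the percent-encoding; interpreting the resulting bytes as UTF-8 is a separate decoding step.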

Then I call an external validation program, a C++ program compiled in
UNICODE:

system( "validate", $password );

It fails, because C2 and A3 appear as two separate characters in argv[1].

BUT, if I call:

system( "validate", $password1 );

Then the program works.
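One possible way to make the first call behave like the second is to convert the unescaped bytes before passing them on. This is only a sketch: it assumes the form data is UTF-8 and that the external program expects ISO-8859-1 bytes on its command line, neither of which is confirmed above. The helper name to_latin1_arg is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical helper: convert UTF-8 bytes to ISO-8859-1 bytes.
# Assumes the external program wants a single-byte encoding.
sub to_latin1_arg {
    my ($octets) = @_;                          # work on a copy
    from_to($octets, 'UTF-8', 'ISO-8859-1');    # converts the copy in place
    return $octets;
}

my $arg = to_latin1_arg("hello\xC2\xA3");       # now 6 bytes, ending in A3
# system('validate', $arg);                     # list form avoids the shell
```

With no CHECK argument, from_to replaces characters that have no ISO-8859-1 equivalent with a substitution character rather than failing, so stricter handling may be wanted for passwords.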

Thanks in advance for anyone who takes the time to think about this
for me.
Aaron Anodide
 
Aaron Anodide

Well, I answered my own question:

The content-type of the page collecting the password was set to utf-8,
so the ALT+156 was put into the post data as two bytes, %C2%A3. When
I set the content-type to charset=ISO-8859-1, then ALT-156 was put
into the post data as one byte, %A3, which in turn was correctly
parsed by CGI::unescape.
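The effect of the page charset on the submitted bytes can be reproduced with the core Encode module. A small sketch, where percent_encode is a hypothetical helper that only approximates what a browser does when it escapes form data:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: escape every byte outside a small safe ASCII set.
sub percent_encode {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9_.~-])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

my $pound = "\x{A3}";   # U+00A3 POUND SIGN as a character

print percent_encode(encode('UTF-8', $pound)), "\n";        # %C2%A3
print percent_encode(encode('ISO-8859-1', $pound)), "\n";   # %A3
```

The same character yields two bytes in UTF-8 and one byte in ISO-8859-1, which matches the two forms of POST data described above.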

Regards,
Aaron Anodide
 
Alan J. Flavell

> Well, I answered my own question:

That's as may be, but your answer is good only for a relatively narrow
range of problems.

The fact is (I'm sorry to say) that processing i18n form submissions
is a nontrivial matter, not made any easier by numerous anomalies and
oddities in browsers. But all of that part of the problem is on-topic
in a WWW authoring group (e.g. comp.infosystems.www.authoring.cgi -
beware the automoderation bot), and the locals around here don't like
to see their group misused for discussing that. (Because they've got
enough to worry about with the i18n implementations in Perl - no
offence meant).

I'm afraid the only genuine advice I can offer at this point is to
hone your expertise in the matter of partitioning a problem into its
components - where necessary, instrumenting the interfaces between
them so that you can see clearly what is going on - and seeking
advice on each of the problem domains in its appropriate place.
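As one way of instrumenting such an interface, a value can be dumped byte by byte at each boundary (after unescaping, before the system call, and so on). A minimal sketch, assuming the value is a byte string; the helper name hex_dump is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical debugging helper: describe a byte string as hex pairs.
sub hex_dump {
    my ($label, $bytes) = @_;
    my @pairs = unpack('(H2)*', $bytes);        # one two-digit hex pair per byte
    return sprintf('%s: %d bytes [%s]', $label, scalar(@pairs), join(' ', @pairs));
}

print STDERR hex_dump('after unescape', "hello\xC2\xA3"), "\n";
# prints: after unescape: 7 bytes [68 65 6c 6c 6f c2 a3]
```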

> The content-type of the page collecting the password was set to utf-8,

Which is actually a good solution, if you need to handle a wide
character repertoire, and don't need to worry about antique browsers
up to and including Netscape 4. But I'm wandering into areas that are
off-topic for this group.

> so the ALT+156 was put into the post data as two bytes, %C2%A3.

That was obvious (which is a way of trying, and failing, to say
politely that if it wasn't evident to you, then you're a bit out of
your depth; but that can be remedied if you're willing to put in a
bit of effort).

> When I set the content-type to charset=ISO-8859-1, then ALT-156 was
> put into the post data as one byte, %A3, which in turn was correctly
> parsed by CGI::unescape.

Indeed. But now what happens if they type in a character that isn't
representable in the iso-8859-1 encoding? There's no way that you can
stop them doing so, and so, I'd say, your server-side process needs to
be able to cope with the consequences, even if only by politely
telling them they provided invalid input and please to try again.
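A rough sketch of such a server-side check, assuming the form is submitted as UTF-8 and the back end can only accept ISO-8859-1. The function name check_field and the message wording are hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Hypothetical validator: returns (latin1_bytes, undef) on success
# or (undef, message) when the input has to be rejected.
sub check_field {
    my ($bytes) = @_;                           # copy; FB_CROAK may modify it

    # Strict decode: dies on malformed UTF-8, so trap that with eval.
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK) };
    return (undef, 'Input was not valid UTF-8; please try again.')
        unless defined $chars;

    # Strict encode: dies if a character has no ISO-8859-1 equivalent.
    my $latin1 = eval { encode('ISO-8859-1', $chars, FB_CROAK) };
    return (undef, 'Input contains unsupported characters; please try again.')
        unless defined $latin1;

    return ($latin1, undef);
}

my ($value, $error) = check_field("\xE2\x82\xAC");   # euro sign, not in ISO-8859-1
print defined $error ? "$error\n" : "accepted\n";
```

Returning a message instead of dying lets the CGI script redisplay the form with an explanation.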

At risk of blowing my own trumpet for a part of the problem domain
that really is the off-topic part of your question here, see:
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html - but
please, I beg you, don't try to discuss the contents here, or the
regulars will bite.

good luck
 
Aaron Anodide

Thanks for the insightful response to my off-topic post. And I
appreciate the pointers to the other newsgroups as well.

Thanks again,
Aaron Anodide
 
