Russian to Windows-1252 (htmlencode)

Bart Van der Donck

Hello,

I am looking for a way to convert Russian characters to their
(numeric) html entities in a Windows-1252 character set. It looks like
an easy job but it isn't.

I did a search on CPAN and tried a few modules but their output is not
OK for what I want to do. HTML::Entities apparently converts to
non-numeric html entities only. In case of Russian, this module would
not be suitable.
Unicode::Lite produces strange output (weird characters that I don't
understand).
I also played around with CGI.pm. It has a short paragraph about this
issue. But I believe this is not the way to go either.

Here are some Russian language examples and what they should become.
I post via Google, I am not sure if these characters will get through.
My IE6 is set at utf8 + I see utf8 in my Google-URL above, so I take
my chances.

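изменения
should become=
&#1080;&#1079;&#1084;&#1077;&#1085;&#1077;&#1085;&#1080;&#1103;

информация
should become=
&#1080;&#1085;&#1092;&#1086;&#1088;&#1084;&#1072;&#1094;&#1080;&#1103;

Освобождение
should become=
&#1054;&#1089;&#1074;&#1086;&#1073;&#1086;&#1078;&#1076;&#1077;&#1085;&#1080;&#1077;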

Background:
My application provider did not compile MySQL with support for Russian
characters. (MySQL4) And they are not going to change that. So I am
going to store my Russian data in a webfriendly Windows-1252 charset.

Is this possible? Maybe one of you Perl sultans has a
one-liner for that? :)

Thanks,
Bart
 
Alan J. Flavell

I am looking for a way to convert Russian characters to their
(numeric) html entities in a Windows-1252 character set.

I'm having a hard time understanding your question in the form that
it's stated.

Windows-1252 is a proprietary encoding for _Western_ content, neither
an open coding for use on the web nor a coding suitable for Russian.

What you call "numeric html entities" (to be pedantic their proper
name is "numeric character references") refer to character positions
in the "HTML document character set", which by definition is always
iso-10646/Unicode, not "in a Windows-1252 character set".
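To illustrate the point about numeric character references: the number is simply the character's Unicode code point, and the result is plain ASCII, so the encoding of the surrounding page does not enter into it. A minimal sketch, assuming the text is already held as Perl Unicode characters (the helper name `to_ncr` is made up for this example):

```perl
use strict;
use warnings;
use utf8;    # the literal below is written as UTF-8 in this source file

# Hypothetical helper: replace each non-ASCII character with a decimal
# numeric character reference built from its Unicode code point.
sub to_ncr {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge;
    return $text;
}

print to_ncr('изменения'), "\n";
```

The same references would be valid whether the page is labelled Windows-1252, ISO-8859-1 or utf-8.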
Here some Russian language examples and what they should become.
I post via Google, I am not sure if these characters will get through.

Your suspicions were well-founded, I'm afraid. It looks as if you can
use Google to perform the conversions for you ;-))

My IE6 is set at utf8 + I see utf8 in my Google-URL above, so I take
my chances.

Are you hinting that your incoming data will be utf-8-encoded?

изменения
should become=
&#1080;&#1079;&#1084;&#1077;&#1085;&#1077;&#1085;&#1080;&#1103;

Background:
My application provider did not compile MySQL with support for Russian
characters. (MySQL4) And they are not going to change that.

I think I understood that bit...

So I am going to store my Russian data in a webfriendly Windows-1252
charset.

Windows-1252 (MS Western Windows encoding) can hardly be termed
"web friendly", IMHO: even if there's a lot of it about, it's still a
proprietary encoding. It might be database-friendly.

Is this possible?

I'm still confused about what direction you want to convert. What
coding is your Russian text provided in, and what do you really want
to put into the dataset? And then when you get it out again, what do
you need to do with it?

Then one can actually start to address the requirement. (Or maybe
someone else has a clearer crystal ball than I have, and will manage
to produce the right answer first, let's see).
 
Gunnar Hjalmarsson

Bart said:
I am looking for a way to convert Russian characters to their
(numeric) html entities in a Windows-1252 character set.

HTML::Entities apparently converts to non-numeric html entities
only.

Not true. Read the docs.

In case of Russian, this module would not be suitable.

See below.
Here some Russian language examples and what they should become. I
post via Google, I am not sure if these characters will get
through.

They did not get through. They were converted. :)

изменения

Suppose you mean:

èçìåíåíèÿ
should become=
&#1080;&#1079;&#1084;&#1077;&#1085;&#1077;&#1085;&#1080;&#1103;

This is one way, using HTML::Entities:

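use HTML::Entities;
my $russian = 'èçìåíåíèÿ';

# Assumes every input character is a Windows-1251 byte in the range
# 0xC0-0xFF; ASCII letters and digits would be altered as well.
sub convert {
    my @chars = split //, shift;
    for (@chars) {
        # Turns e.g. "\xE8" into "&#xE8;"
        $_ = HTML::Entities::encode_entities_numeric($_);
        # 0xC0-0xFF in Windows-1251 map to U+0410-U+044F: an offset of 848
        s/(\w+)/848 + hex $1/e;
    }
    return join '', @chars;
}

print convert($russian);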

Outputs:
&#1080;&#1079;&#1084;&#1077;&#1085;&#1077;&#1085;&#1080;&#1103;
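The fixed offset of 848 only holds for the Windows-1251 bytes 0xC0-0xFF. It does not hold for ё (0xB8, which is U+0451), and the substitution also rewrites ASCII letters and digits. As an alternative sketch, not taken from the thread and assuming the input bytes really are Windows-1251, the core Encode module can do the mapping table lookup itself (the helper name `cp1251_to_ncr` is made up here):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: Windows-1251 bytes in, ASCII with decimal
# numeric character references out.
sub cp1251_to_ncr {
    my ($bytes) = @_;
    # Bytes -> Unicode characters, using the real Windows-1251 table.
    my $chars = decode('cp1251', $bytes);
    # Anything that does not fit in ASCII is emitted as &#NNNN;
    return encode('ascii', $chars, Encode::FB_HTMLCREF);
}

print cp1251_to_ncr("\xE8\xE7\xEC\xE5\xED\xE5\xED\xE8\xFF"), "\n";
```

ASCII input passes through unchanged, so mixed Latin and Cyrillic text should survive.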

HTH
 
Bart Van der Donck

Hello,

Thanks to both responders.
Yes, why not do a GET request to Google to return the characters...
I'll save that for my very last option :)

Here is an image of the original characters:
http://www.dotinternet.be/russi.gif
Apparently, èçìåíåíèÿ is another encoding for that.

Thanks to Gunnar's subroutine, I got it working at server level.

My end-goal is to make it work as a CGI:
(1) capture the form input
(2) URL decode it
(3) pass it to HTML::Entities
(4) query in database and see if the value exists
(5) output the numeric entities (db values) in a Windows-1252 charset

However, different browsers seem to understand the output differently.

Here is the perl code, as short as possible:

---------------------------------------------
START PERL CODE
---------------------------------------------

#!/usr/bin/perl
print "Content-Type: text/html\n\n<html><body>";
use CGI qw/escape unescape/;
use HTML::Entities;
@pairs = split(/&/, $ENV{'QUERY_STRING'});
foreach (@pairs) {
    ($name, $value) = split(/=/, $_);
    $FORM{$name} = $value;
}

$e = unescape($FORM{field}); # remove URLencode
$f=convert($e); # to html entities
print "This is my URLencoded string: $FORM{field} <hr>\n";
print "This is my string where URLencoding was removed by CGI.pm: $e
<hr>\n\n";
print "This is my string where URLencoding was removed by CGI.pm and
then converted to html numeric entities: $f <hr>\n\n";
print "<form action=rus.pl method=get>Type string:<br><input type=text
name=field></form></body></html>";

sub convert {
    my @chars = split //, shift;
    for (@chars) {
        $_ = HTML::Entities::encode_entities_numeric($_);
        s/(\w+)/848 + hex $1/e;
    }
    return join '', @chars;
}
----------------------------------------------
END PERL CODE
----------------------------------------------

When calling the script, I type these characters in the text field:
http://www.dotinternet.be/russi.gif
3 browsers give a different output.

Below is the html output on XP IE6.0. This seems to be OK:

----------------------------------------------
START HTML OUTPUT ON IE6
----------------------------------------------

<html><body>
This is my URLencoded string: %E8%E7%EC%E5%ED%E5%ED%E8%FF <hr>
This is my string where URLencoding was removed by CGI.pm: èçìåíåíèÿ
<hr>

This is my string where URLencoding was removed by CGI.pm and then
converted to html numeric entities:
&#1080;&#1079;&#1084;&#1077;&#1085;&#1077;&#1085;&#1080;&#1103; <hr>

<form action=rus.pl method=get>Type string:<br><input type=text
name=field></form>
</body></html>

----------------------------------------------
END HTML OUTPUT ON IE6
----------------------------------------------

Below is the html output on Win9x IE5.0.
It seems that it is enough here to just URLdecode it so I don't need
HTML::Entities anymore.

----------------------------------------------
START HTML OUTPUT ON IE5
----------------------------------------------

<html><body>
This is my URLencoded string:
%26%231076%3B%26%231072%3B%26%231085%3B%26%231085%3B%26%231099%3B%26%231093%3B
<hr>
This is my string where URLencoding was removed by CGI.pm:
&#1076;&#1072;&#1085;&#1085;&#1099;&#1093; <hr>

This is my string where URLencoding was removed by CGI.pm and then
converted to html numeric entities:
&#886;#849848855854;&#886;#849848855850;&#886;#849848856853;&#886;#849848856853;&#886;#849848857857;&#886;#849848857851;
<hr>

<form action=rus.pl method=get>Type string:<br><input type=text
name=field></form>
</body></html>

----------------------------------------------
END HTML OUTPUT ON IE5
----------------------------------------------

Netscape 4.7, as expected, has difficulty. When pasting the Russian
input in a textfield, it gives only question marks. As a
consequence, the script receives it as question marks.

----------------------------------------------
START HTML OUTPUT ON NS4
----------------------------------------------

<html><body>This is my URLencoded string: %3F%3F%3F%3F%3F%3F%3F%3F%3F
<hr>
This is my string where URLencoding was removed by CGI.pm: ?????????
<hr>

This is my string where URLencoding was removed by CGI.pm and then
converted to html numeric entities: ????????? <hr>

<form action=rus.pl method=get>Type string:<br><input type=text
name=field></form></body></html>

----------------------------------------------
END HTML OUTPUT ON NS4
----------------------------------------------

My application is IE5+. So I don't need NS4, but I put the NS4 output
here anyway as info.

So I am now thinking of a way to make it work in all IE5+. Perhaps
by checking the browser, and then deciding whether or not to invoke
HTML::Entities? But this seems kind of a honky-clonky solution to me,
and the browser check would not be too trustworthy anyway.
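Browser detection may not be needed if the conversion leaves ASCII alone, because a value that already arrives as numeric references (the IE5 case above) then passes through untouched. A rough sketch under two assumptions: that any non-ASCII bytes in the submission are Windows-1251, and that a user never types a literal `&#...;` sequence. The helper names `url_decode` and `normalise_field` are made up for this example:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: undo URL encoding of one query-string value.
sub url_decode {
    my ($value) = @_;
    $value =~ tr/+/ /;
    $value =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    return $value;
}

# Hypothetical helper: normalise a submitted value to ASCII plus decimal
# numeric character references. Assumes non-ASCII bytes are Windows-1251.
sub normalise_field {
    my ($raw) = @_;
    my $chars = decode('cp1251', url_decode($raw));
    return encode('ascii', $chars, Encode::FB_HTMLCREF);
}

print normalise_field('%E8%E7%EC%E5%ED%E5%ED%E8%FF'), "\n";   # as sent by IE6 above
print normalise_field('%26%231076%3B%26%231072%3B'), "\n";    # as sent by IE5 above
```

Both calls should yield only ASCII with `&#NNNN;` references, which is the form intended for the database.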

Thanks
Bart
 
Alan J. Flavell

Here is an image of the original characters:
http://www.dotinternet.be/russi.gif

It's OK, I know what Cyrillic characters -look- like, but I still have
no idea whether you're starting from koi8-r, windows-1251, DOS-866,
utf-8 or what?

Apparently, èçìåíåíèÿ is another encoding for that.

You posted with this header:

Content-Type: text/plain; charset=ISO-8859-1

which declares those to be accented Western characters. e-grave,
c-cedilla, etc...

I still don't know for certain what input encoding you are trying to
use, but it looks to me like Windows-1251.

Thanks to Gunnar's subroutine, I got it working at server level.

My end-goal is to make it work as a CGI:
(1) capture the form input

Oh gosh, now we have to tangle with forms input, and already it's
clear that you're a bit shaky with character coding. I'm going to
have to say that -this- part of the task is way off topic for
c.l.p.misc. I have a web page that -might- be useful as background
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html
but I repeat, this venue is -not- the right place for going into
detail about that.

(2) URL decode it
(3) pass it to HTML::Entities
(4) query in database and see if the value exists
(5) output the numeric entities (db values) in a Windows-1252 charset

Do you have a specific reason for this insistence on representing
Cyrillic characters using a proprietary 8-bit coding that's intended
for Western Roman content? It's not going to work in NN4 anyway.

However, different browsers seem to understand the output differently.

URL of a test case? Circumstances of the difference? If it's only
NN4 that's causing the problem, then I can tell you for free that it
isn't going to work. You'd need to follow e.g. this advice instead:

http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist.html#s6
http://ppewww.ph.gla.ac.uk/~flavell/charset/quick#cons

Here is the perl code, as short as possible:

I'd rather see a static file as a test case first. We can worry about
the details of how you generate that file, afterwards.

But it's looking as if you have an HTML authoring problem or two to
solve, before you start writing Perl code.
 
Gunnar Hjalmarsson

Bart said:
Here is an image of the original characters:
http://www.dotinternet.be/russi.gif
Apparently, èçìåíåíèÿ is another encoding for that.

If I have understood it correctly ( please help me, Alan :) ),
èçìåíåíèÿ is the Windows-1251 encoded equivalent of what's displayed
on the image.

My end-goal is to make it work as a CGI:
(1) capture the form input
(2) URL decode it
(3) pass it to HTML::Entities
(4) query in database and see if the value exists
(5) output the numeric entities (db values) in a Windows-1252
charset

However, different browsers seem to understand the output
differently.

One reason may be that you don't tell the browsers which character
coding should be used when submitting the characters. Another
reason may be that the required character set support is not installed
in every browser.

To continue this experiment, I suggest that you replace the second
line in the script with:

print "Content-Type: text/html; charset=Windows-1251\n\n<html><body>";

and let us know if that makes a difference.

If the actual form is on a static HTML page, you rather need a meta
header at that page:

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1251">

Here is the perl code, as short as possible:

Hmm.. There is room for other comments on the code, but I'll refrain
from that now.

It should be noted that this thread has become off topic here. I would
suggest that you move the discussion to
comp.infosystems.www.authoring.cgi.
 
Alan J. Flavell

If I have understood it correctly ( please help me, Alan :) ),
èçìåíåíèÿ is the Windows-1251 encoded equivalent of what's displayed
on the image.

By now, you should have seen my posting in which I came to the same
conclusion, yes.

To continue this experiment, I suggest that you replace the second
line in the script with:

print "Content-Type: text/html; charset=Windows-1251\n\n<html><body>";

That should help, yes: my investigations showed that browsers
typically performed their form submission using the same character
coding as the HTML page which contained the form, _provided_ the
submitted characters could be represented in that character coding.

However (ObPerl-ish), I refer you to the comment at
http://www.perldoc.com/perl5.8.0/lib/Encode/Supported.html -

"it is beyond the power of words to describe the way HTML browsers
encode non-ASCII form data. To get a general impression, visit
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html"

If the actual form is on a static HTML page, you rather need a meta
header at that page:

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1251">

Objection, your Honour! You preferably need web server configuration
(e.g. AddCharset) to cause the web server to send out a proper HTTP
header.

W3C Hints and Tips: http://www.w3.org/International/O-HTTP-charset.html

meta http-equiv="Content-type" is a second-rate ersatz for HTML (and
an incomplete and problematical ersatz for XHTML).

cheers (you're right, we should move this part of the discussion)
 
