LWP and Unicode


Dale

I have a couple of questions/problems concerning LWP and
Unicode. Here's an ultra-simple program that goes to a web page,
downloads its contents, and prints them out in a semi-readable form:

----------------------------------

#!/.../perl-5.8.8/bin/perl -CSDA

use utf8;
use LWP;
use Encode;
use URI::Escape;

my $browser = LWP::UserAgent->new;
$browser->parse_head(0);

my $url =
'http://bg.wiktionary.org/wiki/Уикиречник:Български/Типове_думи/Глаголи';
my $response = $browser->get(encode("utf8", $url));

my $content = decode("utf8", uri_unescape($response->content));

print "$content\n";

----------------------------------

Question 1: Why do I need the line that says

$browser->parse_head(0);


Question 2: Why do I need to explicitly say:

decode("utf8", ...)

Isn't there a way to tell LWP that the content is utf8? Or more
precisely, that it is utf8 with some URI percent escapes.


Question 3: If you change the pragma "use utf8" to "use encoding
'utf8'" then you don't need the call to "decode("utf8", ...)". Why
should this be? What's the difference between "use utf8" and "use
encoding 'utf8'"? The perldoc:perlunicode is no help here.


Question 4: In the original program, replace the line

my $content = decode("utf8", uri_unescape($response->content));

with

my $content = $response->content;
utf8::upgrade($content);

The perldoc perlunicode page says you should do this when, for some
reason, Unicode does not happen. But this does nothing for me. I still
end up with bytes.
 

Dale

One more question in a similar vein. Using HTML::LinkExtor on a page
using Unicode, I can't seem to process the page without at least one
warning of the form:

Parsing of undecoded UTF-8 will give garbage when decoding entities
at ./verb_extor line 32.

The code I used was pretty straightforwardly modified from the
Cookbook:

--------------------------------

#!.../perl-5.8.8/bin/perl -w -CSDA

use utf8;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;
use Encode;
use URI::Escape;

my $url =
'http://bg.wiktionary.org/wiki/Уикиречник:Български/Типове_думи/Глаголи';
my $encoded_url = encode("utf8", $url);

my $ua = LWP::UserAgent->new;
$ua->parse_head(0); #### without this line, you get the error twice

# Set up a callback that collects the links
my @links = ();
sub callback {
    my ($tag, %attr) = @_;
    my ($link) = values %attr;
    $link = url($link, $encoded_url)->abs;
    $link = decode("utf8", uri_unescape($link));
    push @links, $link;
}

my $p = HTML::LinkExtor->new(\&callback);

# Request document and parse it as it arrives
$ua->request(HTTP::Request->new(GET => encode("utf8", $url)),
             sub { $p->parse($_[0]) });


# Print them out
print join("\n", @links), "\n";
 

Ben Morrow

Quoth "Dale said:
I have a couple of questions/problems concerning LWP and
Unicode. Here's an ultra-simple program that goes to a web page,
downloads its contents, and prints them out in a semi-readable form:

I presume this isn't your real #! line...

Do you know what -CSDA does? In this case it is useless, unless it
interferes with LWP's filehandle encodings. It is probably best avoided
until you understand Perl's (slightly odd) Unicode handling better.
use utf8;
use LWP;
use Encode;
use URI::Escape;

my $browser = LWP::UserAgent->new;
$browser->parse_head(0);

my $url = 'http://bg.wiktionary.org/wiki/LotsaCyrillic';

Please don't post 8-bit data (including UTF8) to Usenet unless the
group's charter explicitly permits it.
my $response = $browser->get(encode("utf8", $url));

my $content = decode("utf8", uri_unescape($response->content));

print "$content\n";

----------------------------------

Question 1: Why do I need the line that says

$browser->parse_head(0);

You don't. The docs for this are (surprisingly) in perldoc
LWP::UserAgent.
Question 2: Why do I need to explicitly say:

decode("utf8", ...)

Isn't there a way to tell LWP that the content is utf8? Or more
precisely, that it is utf8 with some URI percent escapes.

Not AFAIK. You probably ought to decode the data before you uri_unescape
it; one of the virtues of UTF-8 is that this doesn't matter, but it
would for other encodings.
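
One caveat worth spelling out: when the percent-escapes themselves
encode a multibyte UTF-8 sequence, the order is visible even for
UTF-8. A small sketch (mine, not from the thread, using the Cyrillic
'ц' that appears later in the discussion):

#!/usr/bin/perl
use Encode qw(decode);
use URI::Escape qw(uri_unescape);

# Octets as they come off the wire: percent-encoded UTF-8 for 'ц'.
my $bytes = "%D1%86";

# Unescape first, then decode: %D1%86 -> octets 0xD1 0x86 -> U+0446.
my $good = decode("utf8", uri_unescape($bytes));    # 'ц'

# Decode first, then unescape: the %XX pass through decode unchanged
# (they are plain ASCII), and uri_unescape then produces the two
# separate characters U+00D1 and U+0086 instead of 'ц'.
my $bad = uri_unescape(decode("utf8", $bytes));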
Question 3: If you change the pragma "use utf8" to "use encoding
'utf8'" then you don't need the call to "decode("utf8", ...)". Why
should this be? What's the difference between "use utf8" and "use
encoding 'utf8'"? The perldoc:perlunicode is no help here.

The differences are

1. encoding supports many encodings.
2. encoding is probably negligibly slower.
3. encoding gives decent error recovery (as opposed to crashing perl).
4. encoding sets a default PerlIO layer on STDIN and STDOUT, unless
   you've already done so with the -C switch.

I can see no reason why the two should give different results in this
case; but perhaps your -CSDA is interfering.
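
For what it's worth, the difference that matters most here is point 4.
A minimal sketch (mine): "use utf8" only declares that the *source
code* is UTF-8; it does nothing to your filehandles, so you must add
the output layer yourself.

#!/usr/bin/perl
use utf8;                    # the source below contains UTF-8 literals

my $s = "ц";                 # one character, U+0446
print length($s), "\n";      # prints 1: a character string
binmode(STDOUT, ":utf8");    # "use encoding 'utf8'" would set this for you
print "$s\n";                # without the binmode: "Wide character in print"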
Question 4: In the original program, replace the line

my $content = decode("utf8", uri_unescape($response->content));

with

my $content = $response->content;
utf8::upgrade($content);

The perldoc perlunicode page says you should do this when, for some
reason, Unicode does not happen. But this does nothing for me. I still
end up with bytes.

IMHO perlunicode is wrong in this regard :). The utf8::* functions are
part of the internal implementation of utf8-handling; users should never
have cause to use them.

As of 5.8, Perl strings have an internal flag that marks them as being
stored in utf8. What utf8::upgrade does is:

1. If the string already has the UTF8 flag on, quit.
2. For every top-bit-set byte in the string:
   a. look up the appropriate character in ISO8859-1, and
   b. replace the byte with that character's 2-byte encoding in utf8.
3. Set the UTF8 flag on the string, so that Perl now sees those
   2-byte sequences as one character each.

The net result, from the Perl level, is that *absolutely nothing has
changed*. The *only* Perl-visible change is that utf8::is_utf8 now
returns true, even if it returned false before; but you *shouldn't be
concerned with that*.

The correct function for 'this bunch of bytes happens to be a piece of
UTF8-encoded text; decode it and give me a string containing those
characters' is Encode::decode, as you have established.
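
A sketch that makes this concrete (mine; the two octets 0xD1 0x86 are
the UTF-8 encoding of Cyrillic 'ц'):

use Encode qw(decode is_utf8);

my $bytes = "\xD1\x86";          # UTF-8 octets for 'ц'

my $up = $bytes;
utf8::upgrade($up);              # internal representation changes only
print length($up), "\n";         # still 2: the chars U+00D1 and U+0086
print is_utf8($up), "\n";        # the flag is now on, nothing else moved

my $str = decode("utf8", $bytes);
print length($str), "\n";        # 1: the single character U+0446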

Ben
 

Dale

Thanks, Ben, for the thorough answer. But there is still a difference
between "use encoding 'utf8'" and "use utf8" that you are somehow missing.


Ben said:
The differences are

1. encoding supports many encodings.
2. encoding is probably negligibly slower.
3. encoding gives decent error recovery (as opposed to crashing perl).
4. encoding sets a default PerlIO layer on STDIN and STDOUT, unless
   you've already done so with the -C switch.

I can see no reason why the two should give different results in this
case; but perhaps your -CSDA is interfering.

I've eliminated the -CSDA and still get a major difference. Try this:

#!.../perl-5.8.8/bin/perl -w

# uncomment one of the following:
# use encoding 'utf8';
# use utf8;
use LWP;
use Encode;
use URI::Escape;

my $browser = LWP::UserAgent->new;
$browser->parse_head(0);

my $url =
'http://bg.wiktionary.org/wiki/Уикиречник:Български/Типове_думи/Глаголи';

my $response = $browser->get($url);

my $content = uri_unescape($response->content);

print "$content\n";

---------------

The results are (for me) better with "use utf8'. In this case $content
is a character sequence with human-readable characters.

It appears that "use encoding 'utf8'" decodes the UTF-8 before the
call to uri_unescape, and "use utf8" decodes after the
uri_unescape. But this is just my uneducated guess.
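
One way to test that guess is to look at the code points directly,
much like the ord loop in a later post. A hypothetical diagnostic (not
from the original post; $content is the variable from the program
above):

use Encode qw(is_utf8);

# Decoded Cyrillic shows up as code points above 255; undecoded
# UTF-8 shows up as pairs of bytes in the 0x80-0xFF range.
printf "UTF8 flag: %s\n", is_utf8($content) ? "on" : "off";
printf "U+%04X\n", ord($_) for split //, substr($content, 0, 10);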
 

Dale

Parsing of undecoded UTF-8 will give garbage ...
Question 1: Why do I need the line that says

$browser->parse_head(0);

And Ben Morrow answered:
You don't. The docs for this are (surprisingly) in perldoc
LWP::UserAgent.

And the perldoc says:
$ua->parse_head
$ua->parse_head( $boolean )
    Get/set a value indicating whether we should initialize response
    headers from the <head> section of HTML documents. The default is
    TRUE. Do not turn this off, unless you know what you are doing.

Okay, I admit that I don't know what I'm doing. But I do know that
without the line, you get a warning that says:

Parsing of undecoded UTF-8 will give garbage when decoding entities
at /afs/sfs/lehre/dg/myperl/lib/LWP/Protocol.pm line 114.

I'm just trying to make Perl happy.
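
For what it's worth, the same warning from your own HTML::LinkExtor
pass can be avoided by decoding the octets before the parser sees them.
A sketch (mine, not from the thread; it assumes the page really is
UTF-8, and it does not touch LWP's internal head parser, which is what
parse_head(0) silences):

use Encode qw(decode);
use HTML::LinkExtor;

my $p = HTML::LinkExtor->new(\&callback);

# Fetch the whole document first, decode it to characters, then
# parse: HTML::Parser never sees undecoded UTF-8 this way.
my $response = $ua->get($encoded_url);
$p->parse(decode("utf8", $response->content));
$p->eof;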

Dale Gerdemann
 

Dale

Sorry to respond multiple times to my own question, but I keep testing
things and getting results I can't explain.


Ben Morrow wrote (concerning "use utf8" compared to "use encoding
'utf8'"):
1. encoding supports many encodings.
2. encoding is probably negligibly slower.
3. encoding gives decent error recovery (as opposed to crashing perl).
4. encoding sets a default PerlIO layer on STDIN and STDOUT, unless
   you've already done so with the -C switch.
From this, I suppose that the -CIO switch and "use encoding 'utf8'"
should be interchangeable, as long as there is no Unicode in the
program. The only point that is relevant here is number 4 from the
above list.

perldoc encoding says:
The encoding pragma also modifies the filehandle layers of STDIN
and STDOUT to the specified encoding. Therefore, ...

perldoc perlrun says:
The "-C" flag controls some Unicode of the Perl Unicode
features.

As of 5.8.1, the "-C" can be followed either by a number or
a list of option letters. The letters, their numeric
values, and effects are as follows; listing the letters is
equal to summing the numbers.

I 1 STDIN is assumed to be in UTF-8
O 2 STDOUT will be in UTF-8
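
A quick illustration of what the O bit does (my own example, run from
a shell; without -CO the wide character still prints as UTF-8, but
perl warns):

$ perl -le 'print "\x{446}"'
Wide character in print at -e line 1.
ц
$ perl -CO -le 'print "\x{446}"'
ц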


BUT: Surprisingly, the two don't give the same results:

Here's my test program:
#!/afs/sfs/lehre/dg/perl-5.8.8/bin/perl # -CIO

# use encoding 'utf8';
use LWP;
use URI::Escape;

my $browser = LWP::UserAgent->new;
$browser->parse_head(0);

my $url =
'http://bg.wiktionary.org/wiki/Уикиречник:Български/Типове_думи/Глаголи';


my $response = $browser->get($url);

my $content = uri_unescape($response->content);

print "$content\n";

-------------------

The -CIO switch and the encoding pragma are both commented out. That
gives four possibilities: uncomment neither, one, or both.

On the web page to be downloaded there is both:

1. utf8 encoded Unicode, and
2. escaped (percent encoded) utf8 encoded Unicode

So again there are four possibilities, with one, the other, both, or
neither of these being correctly decoded.

And the results:

1. Using just the switch -CIO is a horrible failure. None of the
   Unicode is decoded.
2. Using "use encoding 'utf8'" is better. The non-escaped Unicode is
   decoded.
3. Using both the switch -CIO and "use encoding 'utf8'" is the same as
   just using the encoding pragma.
4. Using nothing at all gives the best result. All the Unicode is
   correctly decoded.

Case 1 (the horrible failure) is in some ways better than cases 2 and
3. If none of the Unicode is decoded, then you can explicitly decode:

my $content = decode("utf8", uri_unescape($response->content));

In cases 2 and 3, this results in a failure:
Cannot decode string with wide characters at
/afs/sfs/lehre/dg/perl-5.8.8/lib/5.8.8/i686-linux/Encode.pm line 166.

Can anyone explain this behavior?
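
One piece of the behavior is certain: Encode's decode() croaks with
exactly this message whenever its argument already contains characters
above 0xFF, i.e. something upstream has already produced a character
string. A defensive sketch (mine, not from the thread):

use Encode qw(decode);
use URI::Escape qw(uri_unescape);

my $raw = uri_unescape($response->content);

# Only decode when the string still looks like octets; decode()
# dies if $raw already holds wide (>255) characters.
my $content = $raw =~ /[^\x00-\xFF]/ ? $raw : decode("utf8", $raw);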

Dale Gerdemann
 

Dale

How to read a web page containing partly utf8 and partly
percent-encoded utf8.

Assumption: We want, for whatever reason, to have "use encoding
'utf8'". It's not clear that this pragma helps, but this is the 21st
century and we just want to use Unicode.

Problem: The problem occurs in an unexpected place. URI::Escape's
uri_unescape does more than just decode the percent encoding. This
doesn't seem to be documented.

Alternatives: The obvious alternative is to parse more carefully and
apply the appropriate encoding/unencoding to the appropriate parts of
the document.

Questions: I still don't think I understand the difference between the
command-line switch -CIO and "use encoding 'utf8'". Okay, I know that
'encoding' allows you to put Unicode into your program. But beyond
that, there seems to be some difference in what happens with IO.


#!/afs/sfs/lehre/dg/perl-5.8.8/bin/perl

use encoding 'utf8';
use LWP;
use Encode qw(encode decode is_utf8);
use URI::Escape qw(uri_unescape);

my $browser = LWP::UserAgent->new;

## Tiny test web-page. It contains just the line "h a h", where the
## h's are upside down (actually Cyrillic).
my $url =
'http://www.sfs.uni-tuebingen.de/iscl/Kursmaterialien/Gerdemann/foo.html';

my $response = $browser->get(encode("utf8", $url));

# raw_content is a byte sequence, containing some utf8 encoded bits
# and some percent encoded utf8 encoded bits
my $raw_content = $response->content;

# $encoded_content adds a double utf8 encoding to the already-utf8
# bits of $raw_content
my $encoded_content = encode("utf8", $raw_content);

# $unescaped_content is different from $encoded_content in two respects:
# 1. The percent-encoded parts of $encoded_content are decoded into utf8.
# 2. The doubly utf8-encoded bits of $encoded_content lose a layer of
#    utf8 encoding. WARNING: This only happens if there are actually
#    some percent-encoded bits that get decoded. Is this documented
#    somewhere???
my $unescaped_content = uri_unescape($encoded_content);

# After the previous step, everything was utf8, so now finally we turn
# it into Unicode.
my $decoded_content = decode("utf8", $unescaped_content);

print "$decoded_content\n";

# Test: substitute whichever intermediate variable you want to inspect.
# while ($decoded_content =~ m/(.)/g) {
#     print ord $1, "\n";
# }
 

Dale

Mumia said:
403 Forbidden

Whoops! I'm sure you managed to recreate the website yourself. But just
in case:

http://www.sfs.uni-tuebingen.de/~dg/fooo.html

The contents are:

%D1%86 a ц

Sorry again for violating the newsgroup charter by using Unicode here.
But sometime there ought to be a discussion of why the newsgroup has
such a charter. Perl allows programs to be written with Unicode, but
such programs cannot be discussed here. Does this make sense?
 

Ben Morrow

Quoth "Dale said:
Whoops! I'm sure you managed to recreate the website yourself. But just
in case:

http://www.sfs.uni-tuebingen.de/~dg/fooo.html

The contents are:

%D1%86 a ц

Sorry again for violating the newsgroup charter by using Unicode here.
But sometime there ought to be a discussion of why the newsgroup has
such a charter. Perl allows programs to be written with Unicode, but
such programs cannot be discussed here. Does this make sense?

It's not a question of this group's charter, it applies generally on
Usenet. There is no header in a Usenet article that specifies a charset,
so no way to use anything other than the default ASCII.

I agree in principle: some form of charset header should be added, or
the charset should simply be specified to be UTF8. But until it is,
please refrain from using it.

Ben
 

Dr.Ruud

Ben Morrow wrote:
It's not a question of this group's charter, it applies generally on
Usenet. There is no header in a Usenet article that specifies a
charset, so no way to use anything other than the default ASCII.

I agree in principle: some form of charset header should be added, or
the charset should simply be specified to be UTF8. But until it is,
please refrain from using it.

In practice there is no problem with headers like
"Content-Type: text/plain; charset=ISO-8859-1"
because most readers deal with them as expected.

Henry Spencer once (1994) created the Son-of-RFC-1036:
http://www.chemie.fu-berlin.de/outerspace/netnews/son-of-1036.html
to document the state of that moment, and stated MIME as relevant for
news articles.

See also USEFOR, the Grandson-of-RFC-1036:
http://www.ietf.org/html.charters/usefor-charter.html
("an urgent need has been identified to formalize and document
many of the current and proposed extensions to the Usenet
Article format")
 

Alan J. Flavell

In practice there is no problem with headers like
"Content-Type: text/plain; charset=ISO-8859-1"
because most readers deal with them as expected.

That's not the whole story: such postings should also have valid
MIME headers, or else the client is required to treat them as pre-MIME
format, which probably isn't what was intended. And IIRC some news
clients do indeed apply that rule.

Of course this is all de facto, but the valid RFC for usenet (1036) is
now hopelessly out of date, so we have to live with some de-facto
rules, which can very well be based on a best common factor between the
discussions for a grandson of RFC-1036 and the observed common
practice. (Common practice alone isn't good enough, since some
widely-used clients by default will violate what the rules are
expected to become.)

Personal view: I wouldn't recommend using charset=utf-8 yet, except
perhaps on groups where its use is already widespread. iso-8859-1 is
very widely supported, and windows-1252, although proprietary and
therefore to be deprecated, is pretty widely supported; iso-8859-15 is
somewhat less supported, I'm disinclined to recommend it, but I note
that some folks use it for their usenet postings.

[ Totally OT: use of iso-8859-15 for HTML is utterly pointless. ]

IMHO and YMMV. But successful communication depends on a certain
conservatism in what one sends - not relying on the generosity of the
recipient to interpret it liberally.
 

Bart Van der Donck

Dr.Ruud said:
[...]
In practice there is no problem with headers like
"Content-Type: text/plain; charset=ISO-8859-1"
because most readers deal with them as expected.

Henry Spencer once (1994) created the Son-of-RFC-1036:
http://www.chemie.fu-berlin.de/outerspace/netnews/son-of-1036.html
to document the state of that moment, and stated MIME as relevant for
news articles.

See also USEFOR, the Grandson-of-RFC-1036:
http://www.ietf.org/html.charters/usefor-charter.html
("an urgent need has been identified to formalize and document
many of the current and proposed extensions to the Usenet
Article format")

I think the current situation is about like this:

One can safely use ASCII on Usenet with or without a charset header.
One can safely use ISO-8859-1, but only when specifying it in the
header.
Other charsets work when supported, but should be used carefully
depending on the circumstances. For example, on a Russian discussion
group it's reasonable to use KOI8-R. But one should obviously always
specify the charset in such cases.

I think one should not rely on any Unicode charset on Usenet (yet).

Google Groups deals with this issue as follows:

(1) Default to ISO-8859-1 when possible, yes even with plain ASCII.
(2) Use custom charset if the offered characters can unambiguously be
represented in that charset and if ISO-8859-1 is too narrow; perhaps
also considering browser settings/preferences.
(3) Use UTF-8 if the above fails; I suppose mostly in charset
combinations, 'tricky' replies or really exotic stuff.

A good policy, IMO.
 

Dale

Your data seems to be UTF8, but you advertise it as iso-8859-1. Don't
you think that will confuse user agents such as LWP::UserAgent?

Yes, I know. It's not configured properly for serving UTF8. That's why
I at first put it at a different URL where UTF8 is handled correctly.
But I forgot that this site is only local.
 

Dale

Bart said:
(1) Default to ISO-8859-1 when possible, yes even with plain ASCII.
(2) Use custom charset if the offered characters can unambiguously be
represented in that charset and if ISO-8859-1 is too narrow; perhaps
also considering browser settings/preferences.
(3) Use UTF-8 if the above fails; I suppose mostly in charset
combinations, 'tricky' replies or really exotic stuff.

And Alan Flavell said:
... successful communication depends on a certain
conservatism in what one sends - not relying on the generosity of the
recipient to interpret it liberally.

Isn't UTF8 the most conservative choice nowadays? Look at Wikipedia or
Wiktionary. Massive international websites all in UTF8. And look at the
Russian Wikipedia, for example. It doesn't use a "custom charset" at
all.

The idea that UTF8 should be reserved for "really exotic stuff" seems
very weird. Look at any Wikipedia page dealing with mathematics, and
you're bound to find UTF8 used for quite normal things. Here, for
example, is the rule for the associativity of function composition:

f ∘ (g ∘ h) = (f ∘ g) ∘ h

Try to say that in ASCII or ISO-8859-1!

Dale
 

Dr.Ruud

Dale wrote:
Isn't UTF8 the most conservative choice nowadays? Look at Wikipedia or
Wiktionary. Massive international websites all in UTF8. And look at
the Russian Wikipedia, for example. It doesn't use a "custom charset"
at all.

See Subject, this is about "Usenet and charsets", not about HTML.

Your newsclient doesn't remove the
/[[:blank:]]+[(]was: Re: .*[)]$/
part from the Subject header field,
so you need to do it by hand.

Your broken newsclient does remove the [anything] prefix
from the Subject header field, which is real bad.

My broken newsclient (OE6) does a lot of real bad things too, but used
together with OE-QuoteFix and Hamster it is almost OK.
 

Peter J. Holzer

Dr.Ruud wrote:

See Subject, this is about "Usenet and charsets", not about HTML.

Yup. Usenet is more conservative than the WWW. UTF-8 is only about 14
years old, so you can't expect all newsreaders to support it. Still, I
think that properly declared UTF-8 should be acceptable in international
newsgroups, and since nobody has complained about my postings yet, I
take it as evidence that my newsreader's inability to use ISO-8859-1
where sufficient is only a minor bug.

Your newsclient doesn't remove the
/[[:blank:]]+[(]was: Re: .*[)]$/
part from the Subject header field,
so you need to do it by hand.

Your broken newsclient does remove the [anything] prefix
from the Subject header field, which is real bad.

Weird. Dale seems to be using Mozilla 1.7.8 from Debian. I just
installed that (although a slightly newer version), and can't reproduce
this: [META] is preserved and (was: ...) is automatically removed.

hp
 

Dr.Ruud

Peter J. Holzer wrote:
Dr.Ruud:
[to Dale]
Your newsclient doesn't remove the
/[[:blank:]]+[(]was: Re: .*[)]$/
part from the Subject header field,
so you need to do it by hand.

Your broken newsclient does remove the [anything] prefix
from the Subject header field, which is real bad.

Weird. Dale seems to be using Mozilla 1.7.8 from Debian. I just
installed that (although a slightly newer version), and can't
reproduce this: [META] is preserved and (was: ...) is automatically
removed.

I assumed that Dale had used the googlegroups-interface for a
newsclient. The [META] was already removed with Bart's reply.
 

Peter J. Holzer

Peter J. Holzer wrote:
[weird things to the subject]
Weird. Dale seems to be using Mozilla 1.7.8 from Debian. I just
installed that (although a slightly newer version), and can't
reproduce this: [META] is preserved and (was: ...) is automatically
removed.

I assumed that Dale had used the googlegroups-interface for a
newsclient.

You are right. I saw Mozilla 1.7.8 in what looked like a useragent header, and
didn't notice that it was really an 'X-HTTP-Useragent' header. Sorry for
the confusion.

hp
 
