Dale
I have a couple of questions/problems concerning LWP and
Unicode. Here's an ultra-simple program that goes to a web page,
downloads its contents and prints them out in a semi-readable form:
----------------------------------
#!/.../perl-5.8.8/bin/perl -CSDA
use utf8;
use LWP;
use Encode;
use URI::Escape;
my $browser = LWP::UserAgent->new;
$browser->parse_head(0);
my $url =
'http://bg.wiktionary.org/wiki/Уи...ки/Типове_думи/Глаголи';
my $response = $browser->get(encode("utf8", $url));
my $content = decode("utf8", uri_unescape($response->content));
print "$content\n";
----------------------------------
Question 1: Why do I need the line that says
$browser->parse_head(0);
Question 2: Why do I need to explicitly say:
decode("utf8", ...)
Isn't there a way to tell LWP that the content is utf8? Or more
precisely, that it is utf8 with some URI percent escapes?
Question 3: If you change the pragma "use utf8" to "use encoding
'utf8'" then you don't need the call to "decode("utf8", ...)". Why
should this be? What's the difference between "use utf8" and "use
encoding 'utf8'"? The perldoc perlunicode page is no help here.
Question 4: In the original program, replace the line
my $content = decode("utf8", uri_unescape($response->content));
with
my $content = $response->content;
utf8::upgrade($content);
The perldoc perlunicode page says you should do this when, for some
reason, Unicode does not happen. But this does nothing for me. I still
end up with bytes.
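
The behavior in Question 4 can be reproduced without the network. This is only a minimal sketch: $bytes is a hypothetical stand-in for $response->content, holding the two UTF-8 bytes of a single Cyrillic letter. As far as I can tell, utf8::upgrade only changes how the string is stored internally and does not interpret the bytes as UTF-8, while decode does.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for $response->content:
# the two UTF-8 bytes of U+0413 (CYRILLIC CAPITAL LETTER GHE).
my $bytes = "\xD0\x93";

# utf8::upgrade changes only the internal representation. The string
# still holds two characters, U+00D0 and U+0093.
my $upgraded = $bytes;
utf8::upgrade($upgraded);
printf "upgrade: length=%d first=U+%04X\n", length($upgraded), ord($upgraded);

# decode interprets the bytes as UTF-8 and yields one character, U+0413.
my $decoded = decode("UTF-8", $bytes);
printf "decode:  length=%d first=U+%04X\n", length($decoded), ord($decoded);
```

If that is right, the upgraded string still has length 2 and the decoded one has length 1, which would match the "still bytes" result described above.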