How to handle an HTTP::Request with gzip, deflate headers

Leif Wessman

Hi!

If I send the following in my request

$req->header('Accept-encoding', 'gzip, deflate');

And then the Content-Encoding header in the response is 'gzip' or
'deflate'. How can I uncompress the content? I've tried the following,
but $data becomes empty:

my $data = $response->content;
my $encoding = $response->header('Content-Encoding');
if ($encoding) {
    if ($encoding =~ /gzip/i) {
        $data = Compress::Zlib::memGunzip($data);
    }
    if ($encoding =~ /deflate/i) {
        my $x = deflateInit() or die "Cannot create a deflation stream\n";
        my ($output, $status) = $x->deflate($data);
        $status == Z_OK or die "deflation failed\n";
        $data = $output;
        ($output, $status) = $x->flush();
        $status == Z_OK or die "deflation failed\n";
        $data .= $output;
    }
}
 
Gisle Aas

Leif Wessman said:
If I send the following in my request

$req->header('Accept-encoding', 'gzip, deflate');

And then the Content-Encoding header in the response is 'gzip' or
'deflate'. How can I uncompress the content?

In libwww-perl-5.802 the response object has a decoded_content()
method that will undo any Content-Encoding for you.
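
A minimal offline sketch of what decoded_content() does, assuming a
libwww-perl version that provides it (5.802 or later, as stated above)
and Compress::Zlib. The response is built by hand instead of being
fetched, and the body text is made up for illustration:

```perl
use strict;
use warnings;
use HTTP::Response;
use Compress::Zlib ();

# Hypothetical payload; stands in for a page body.
my $plain = "Hello, compressed world!\n";

# Build a response by hand that resembles what a server might send
# after seeing "Accept-Encoding: gzip".
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/plain', 'Content-Encoding' => 'gzip' ],
    Compress::Zlib::memGzip($plain),
);

# content() returns the raw (still gzipped) bytes ...
my $raw = $res->content;

# ... while decoded_content() undoes the Content-Encoding.
my $decoded = $res->decoded_content;

print "raw length:     ", length($raw), "\n";
print "decoded length: ", length($decoded), "\n";
print $decoded;
```

With a real request the only change is that $res comes from
$ua->request($req) instead of being constructed locally.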
 
Gisle Aas

Gisle Aas said:
In libwww-perl-5.802 the response object has a decoded_content()
method that will undo any Content-Encoding for you.

Also note that you can set up:

$ua->default_header("Accept-Encoding" => "gzip, deflate");

and then all requests you send out will automatically get this header.
No need to do it yourself for each request.
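
One way to check this without touching the network is to let the user
agent prepare a request and then inspect it. This is only a sketch:
example.com is a placeholder URL, and prepare_request() is used here
just to apply the default headers without sending anything:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
$ua->default_header('Accept-Encoding' => 'gzip, deflate');

# Placeholder URL; the request is never actually sent.
my $req = HTTP::Request->new(GET => 'http://example.com/');

# prepare_request() copies the agent's default headers onto the
# request (unless the request already sets them itself).
$ua->prepare_request($req);

my $accept = $req->header('Accept-Encoding');
print "Accept-Encoding: $accept\n";
```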
 
Leif Wessman

Gisle said:
Also note that you can set up:

$ua->default_header("Accept-Encoding" => "gzip, deflate");

and then all requests you send out will automatically get this header.
No need to do it yourself for each request.


Thanks. But how do I handle the HTTP Response?

Leif
 
Leif Wessman

Gisle said:
Just like normal, but use $res->decoded_content instead of $res->content.


I get the following error:
Can't locate object method "decoded_content" via package
"HTTP::Headers" at (eval 15) line 1.
I'm using LWP::Parallel::UserAgent. Does that matter?

Leif
 
Gisle Aas

Leif Wessman said:
I get the following error:
Can't locate object method "decoded_content" via package
"HTTP::Headers" at (eval 15) line 1.
I'm using LWP::Parallel::UserAgent. Does that matter?

No, but you need a recent version of LWP itself (v5.802).
 
Leif Wessman

Gisle said:
No, but you need a recent version of LWP itself (v5.802).

$res->decoded_content now returns the decoded content for webpages with
Content-Encoding: gzip. However, when Content-Encoding is 'deflate' I
get empty content. The website I'm testing on is Amazon.fr (they use
deflate) so I guess the problem is in my code. Should it work with
deflate automatically?
 
Gisle Aas

Leif Wessman said:
$res->decoded_content now returns the decoded content for webpages with
Content-Encoding: gzip. However, when Content-Encoding is 'deflate' I
get empty content. The website I'm testing on is Amazon.fr (they use
deflate) so I guess the problem is in my code. Should it work with
deflate automatically?

Yes, the code is there, but I have not actually tried it on a site
that uses 'deflate' yet. Can you give me a URL to try?
 
Gisle Aas

Leif Wessman said:
Gisle, did decoding 'deflate' work when you tried it?

I did not manage to get it to pass back deflated content to me. I
just got back plain text/html. Can you provide a complete example?

I tried this:

lwp-request -H "Accept-Encoding: gzip, deflate" -SUed http://www.amazon.fr/exec/obidos/ASIN/0136609112/t/
GET http://www.amazon.fr/exec/obidos/ASIN/0136609112
Accept-Encoding: gzip, deflate
User-Agent: lwp-request/2.06

GET http://www.amazon.fr/exec/obidos/ASIN/0136609112/t/ --> 301 Moved Permanently
GET http://www.amazon.fr/exec/obidos/ASIN/0136609112 --> 200 OK
Date: Wed, 08 Dec 2004 10:01:53 GMT
Server: Stronghold/2.4.2 Apache/1.3.6 C2NetEU/2412 (Unix) amarewrite/0.1 mod_fastcgi/2.2.12
Content-Type: text/html
Client-Date: Wed, 08 Dec 2004 10:01:54 GMT
Client-Peer: 207.171.166.150:80
Client-Response-Num: 1
Client-Transfer-Encoding: chunked
Cneonction: close
Title: Amazon.fr : Livres en anglais: Practical Introduction to Data Structures and Algorithm Analysis: Java Edition
X-Meta-Description: Practical Introduction to Data Structures and Algorithm Analysis: Java Edition, Clifford A. Shaffer
X-Meta-Keywords: Practical Introduction to Data Structures and Algorithm Analysis: Java Edition, Livres en anglais, Clifford A. Shaffer, Computer Bks - Languages / Programming, Computer Books: General, Computer Programming Languages, Computer Science, Computer algorithms, Computers, Computers (Software), Data Structures, Data structures (Computer science), Databases & data structures, Java & variants, Java (Computer program language), Programming - General, Programming Languages - General
 
Leif Wessman

I did not manage to get it to pass back deflated content to me. I
just got back plain text/html. Can you provide a complete example?


This works for me:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");
$ua->default_header("Accept-Encoding" => "gzip, deflate");
my $url = "http://www.amazon.fr/exec/obidos/ASIN/0136609112/t/";
my $req = HTTP::Request->new(GET => $url);

my $res = $ua->request($req);
if ($res->is_success) {
    #print $res->content;
    print $res->decoded_content;
} else {
    print "Error: " . $res->status_line . "\n";
}

decoded_content is uninitialized in the above example, since deflate
decoding doesn't seem to work. When using $res->content you can see the
undecoded text. When checking the headers, Amazon adds the deflate
header:

HTTP/1.1 200 OK
Date: Wed, 08 Dec 2004 10:39:12 GMT
Server: Stronghold/2.4.2 Apache/1.3.6 C2NetEU/2412 (Unix)
amarewrite/0.1 mod_fastcgi/2.2.12
Content-Encoding: deflate
Content-Type: text/html
Client-Date: Wed, 08 Dec 2004 10:35:51 GMT
Client-Peer: 207.171.166.150:80
Client-Response-Num: 1
Client-Transfer-Encoding: chunked
Cneonction: close
Set-Cookie: session-id-time=1103065200; path=/; domain=.amazon.fr;
expires=Tuesday, 14-Dec-2004 23:00:00 GMT
Set-Cookie: session-id=402-8853526-3873764; path=/;
domain=.amazon.fr; expires=Tuesday, 14-Dec-2004
23:00:00 GMT

[binary deflate-compressed body follows; omitted here]
 
Gisle Aas

Leif Wessman said:
This works for me:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");

They apparently need to see a specific User-Agent to send back deflate
content. I can reproduce this when I set this header.

$ua->default_header("Accept-Encoding" => "gzip, deflate");

...and they don't trust this header by itself.

HTTP/1.1 200 OK
Date: Wed, 08 Dec 2004 10:39:12 GMT
Server: Stronghold/2.4.2 Apache/1.3.6 C2NetEU/2412 (Unix)
amarewrite/0.1 mod_fastcgi/2.2.12
Content-Encoding: deflate
Content-Type: text/html
Client-Date: Wed, 08 Dec 2004 10:35:51 GMT
Client-Peer: 207.171.166.150:80
Client-Response-Num: 1
Client-Transfer-Encoding: chunked
Cneonction: close

....and they seem to have a typo in their server :)
 
Gisle Aas

Leif Wessman said:
This works for me:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");
$ua->default_header("Accept-Encoding" => "gzip, deflate");
my $url = "http://www.amazon.fr/exec/obidos/ASIN/0136609112/t/";
my $req = HTTP::Request->new(GET => $url);

my $res = $ua->request($req);
if ($res->is_success) {
    #print $res->content;
    print $res->decoded_content;
} else {
    print "Error: " . $res->status_line . "\n";
}

decoded_content is uninitialized in the above example, since deflate
decoding doesn't seem to work. When using $res->content you can see the
undecoded text. When checking the headers, Amazon adds the deflate
header:

HTTP/1.1 200 OK
Date: Wed, 08 Dec 2004 10:39:12 GMT
Server: Stronghold/2.4.2 Apache/1.3.6 C2NetEU/2412 (Unix)
amarewrite/0.1 mod_fastcgi/2.2.12
Content-Encoding: deflate

What I found out is that they actually output the wrong format here as
"Content-Encoding: deflate" is supposed to imply that the content is
in the "zlib" format (see [1] as well as RFC 2616). What is
unfortunate is that the zlib format contains data in the "deflate"
format, so it is not that hard to see how they managed to get this
wrong. According to [2] (item 36) Microsoft introduced this brain
damage, so perhaps that explains why they only sent data in this
format when _you_ claim to be MSIE. I also noticed that Apache's
mod_deflate will only compress into the "gzip" format as suggested by
the zlib FAQ to avoid this misunderstanding.

I'll see if I can hack libwww-perl to retry decoding with the
"deflate" format if decoding according to "zlib" fails. Seems like a
good idea to be MSIE bug compatible again :-(

[1] http://www.iana.org/assignments/http-parameters
[2] http://www.gzip.org/zlib/zlib_faq.html
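
The difference between the two formats can be reproduced locally with
Compress::Zlib alone. The following is only a sketch with made-up
sample text, not the libwww-perl code: it produces the same data once
in "zlib" format and once as raw "deflate", then shows that
uncompress() handles only the former while an inflate stream with a
negative WindowBits handles the latter:

```perl
use strict;
use warnings;
use Compress::Zlib;

my $plain = "deflate or zlib? " x 20;    # made-up sample text

# 1. "zlib" format: a deflate stream wrapped in a small header and
#    checksum. This is what "Content-Encoding: deflate" should carry.
my $zlib = compress($plain);

# 2. Raw "deflate" data: a negative WindowBits asks zlib to leave
#    out the wrapper.
my ($d, $dstatus) = deflateInit(WindowBits => -MAX_WBITS());
die "Cannot create deflate stream\n" unless $d && $dstatus == Z_OK;
my ($out1, $s1) = $d->deflate($plain);
die "deflate failed\n" unless $s1 == Z_OK;
my ($out2, $s2) = $d->flush();
die "flush failed\n" unless $s2 == Z_OK;
my $raw = $out1 . $out2;

# uncompress() expects the zlib wrapper, so it should succeed on
# $zlib and give up (return undef) on the raw stream.
my $from_zlib = uncompress($zlib);
my $from_raw  = uncompress($raw);

# Fallback: inflate with the same negative WindowBits.
my ($i, $istatus) = inflateInit(WindowBits => -MAX_WBITS());
die "Cannot create inflate stream\n" unless $i && $istatus == Z_OK;
my $buffer = $raw;    # inflate() consumes its input, so use a copy
my ($inflated, $status) = $i->inflate($buffer);

print "zlib via uncompress: ", (defined $from_zlib ? "ok" : "failed"), "\n";
print "raw via uncompress:  ", (defined $from_raw  ? "ok" : "failed"), "\n";
print "raw via inflate:     ",
    ($status == Z_STREAM_END && $inflated eq $plain ? "ok" : "failed"), "\n";
```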
 
Gisle Aas

Gisle Aas said:
I'll see if I can hack libwww-perl to retry decoding with the
"deflate" format if decoding according to "zlib" fails. Seems like a
good idea to be MSIE bug compatible again :-(

This patch fixes the problem for me. It will be in libwww-perl-5.803
when released.

Index: lib/HTTP/Message.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/HTTP/Message.pm,v
retrieving revision 1.55
diff -u -p -r1.55 Message.pm
--- lib/HTTP/Message.pm 6 Dec 2004 13:27:20 -0000 1.55
+++ lib/HTTP/Message.pm 8 Dec 2004 14:01:40 -0000
@@ -201,8 +201,39 @@ sub decoded_content
}
elsif ($ce eq "deflate") {
require Compress::Zlib;
- $content_ref = \Compress::Zlib::uncompress($$content_ref);
- die "Can't inflate content" unless defined $$content_ref;
+ my $out = Compress::Zlib::uncompress($$content_ref);
+ unless (defined $out) {
+ # "Content-Encoding: deflate" is supposed to mean the "zlib"
+ # format of RFC 1950, but Microsoft got that wrong, so some
+ # servers sends the raw compressed "deflate" data. This
+ # tries to inflate this format.
+ unless ($content_ref_iscopy) {
+ # the $i->inflate method is documented to destroy its
+ # buffer argument
+ my $copy = $$content_ref;
+ $content_ref = \$copy;
+ $content_ref_iscopy++;
+ }
+
+ my($i, $status) = Compress::Zlib::inflateInit(
+ WindowBits => -Compress::Zlib::MAX_WBITS(),
+ );
+ my $OK = Compress::Zlib::Z_OK();
+ die "Can't init inflate object" unless $i && $status == $OK;
+ ($out, $status) = $i->inflate($content_ref);
+ if ($status != Compress::Zlib::Z_STREAM_END()) {
+ if ($status == $OK) {
+ $self->push_header("Client-Warning" =>
+ "Content might be truncated; incomplete deflate stream");
+ }
+ else {
+ # something went bad, can't trust $out any more
+ $out = undef;
+ }
+ }
+ }
+ die "Can't inflate content" unless defined $out;
+ $content_ref = \$out;
$content_ref_iscopy++;
}
elsif ($ce eq "compress" || $ce eq "x-compress") {
 
Leif Wessman

Gisle said:
They apparently need to see a specific User-Agent to send back deflate
content. I can reproduce this when I set this header.


...and they don't trust this header by itself.


...and they seem to have a typo in their server :)
But should the module be able to decode deflate anyway?

Leif
 
