Problem reading HTML stream

Dave Saville

I have a Perl script that reads a large HTML stream (TV program
data).

I use IO::Socket, do a "my $socket = new" and then a "while (
<$socket>)" to fetch the data.

Now the problem *might* be their end, but it hangs after *exactly*
180K for about 5 minutes and then completes. Firefox pulls the same
data in 10s of seconds. Which, to my thinking, would eliminate any
funnies in libc.

Any thoughts?

TIA
 
Bjoern Hoehrmann

* Dave Saville wrote in comp.lang.perl.misc:
I have a Perl script that reads a large HTML stream (TV program
data).

I use IO::Socket, do a "my $socket = new" and then a "while (
<$socket>)" to fetch the data.

Now the problem *might* be their end, but it hangs after *exactly*
180K for about 5 minutes and then completes. Firefox pulls the same
data in 10s of seconds. Which, to my thinking, would eliminate any
funnies in libc.

The error is in what you are not describing, like what <> does in your
code. By default it looks for newlines and there might be none in the
stream after a certain point, and the five minutes might simply be the
timeout where your program gives up waiting for more data.
 
Dave Saville

On Sat, 14 Jan 2012 16:56:36 UTC, Bjoern Hoehrmann

The error is in what you are not describing, like what <> does in your
code. By default it looks for newlines and there might be none in the
stream after a certain point, and the five minutes might simply be the
timeout where your program gives up waiting for more data.

It parses the input and writes to a file. It's not a timeout as after
the long wait it carries on to completion. It's processed line by line
and it is not running out of memory or anything like that. The socket
is blocked until some more lines eventually arrive.

One of the URLs I am having problems with is
xmltv.radiotimes.com/xmltv/94.dat

Will try an iptrace but I doubt there is any traffic. It is just so
suspicious that it is *exactly* 180K bytes.
 
Bjoern Hoehrmann

* Dave Saville wrote in comp.lang.perl.misc:
It parses the input and writes to a file. It's not a timeout as after
the long wait it carries on to completion. It's processed line by line
and it is not running out of memory or anything like that. The socket
is blocked until some more lines eventually arrive.

One of the URLs I am having problems with is
xmltv.radiotimes.com/xmltv/94.dat

Will try an iptrace but I doubt there is any traffic. It is just so
suspicious that it is *exactly* 180K bytes.

My guess is that you are trying to read a HTTP response via IO::Socket
and that does not work because you are expecting that while(<$socket>)
knows when it read the "last line" but there is no such thing in HTTP.
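
To illustrate the point, here is a minimal sketch of how a client can find the end of a response without waiting for the connection to close. It assumes the server sends a Content-Length header (a chunked response would need different handling), and read_response is a hypothetical helper name, not something from the original script:

```perl
use strict;
use warnings;

# Read one HTTP response from an already-open handle: the status line,
# the headers up to the blank line, then exactly Content-Length bytes.
sub read_response {
    my ($fh) = @_;

    my $status = <$fh>;
    die "no status line\n" unless defined $status;
    $status =~ s/\r?\n\z//;

    my %header;
    while (defined(my $line = <$fh>)) {
        $line =~ s/\r?\n\z//;
        last if $line eq '';                 # blank line ends the headers
        my ($name, $value) = split /:\s*/, $line, 2;
        $header{lc $name} = $value;
    }

    my $length = $header{'content-length'};
    die "no Content-Length header; this sketch cannot find the end of the body\n"
        unless defined $length;

    # Keep reading until the announced number of bytes has arrived.
    my $body = '';
    while (length($body) < $length) {
        my $chunk;
        my $got = read($fh, $chunk, $length - length($body));
        die "read failed: $!\n" unless defined $got;
        last if $got == 0;                   # connection closed early
        $body .= $chunk;
    }
    return ($status, \%header, $body);
}
```

The function returns as soon as the body is complete, so it does not depend on the server closing the socket.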
 
r.mariotti

I have a Perl script that reads a large HTML stream (TV program
data).

I use IO::Socket, do a "my $socket = new" and then a "while (
<$socket>)" to fetch the data.

Now the problem *might* be their end, but it hangs after *exactly*
180K for about 5 minutes and then completes. Firefox pulls the same
data in 10s of seconds. Which, to my thinking, would eliminate any
funnies in libc.

Any thoughts?

TIA

Perhaps the IO::Socket module is not your best bet.

I do something similar and I use LWP::Simple. Streams come right in
at full bandwidth speed.

Good luck
 
Dave Saville

Perhaps the IO::Socket module is not your best bet.

I do something similar and I use LWP::Simple. Streams come right in
at full bandwidth speed.

It behaves, or misbehaves, the same with Socket and IO::Socket - but I
suppose it would, the latter being a wrapper for the former. Never thought
of LWP::Simple as it is not really an HTML page - just data.

The point several of you seem to have missed is that after the hang at
180K for minutes the stream resumes with no missing data. I ran an
iptrace and that showed damn all during the hang. I really think it
must be the server end - which I have nothing to do with. I have also
run my code against my own server, although not with such a big file,
but "normal" HTML pages and it works just fine.
 
Peter J. Holzer

It behaves, or misbehaves, the same with Socket and IO::Socket - but I
suppose it would, the latter being a wrapper for the former. Never thought
of LWP::Simple as it is not really an HTML page - just data.

Do you use HTTP to get the data or some custom protocol?

hp
 
Dave Saville

On Sun, 15 Jan 2012 12:16:08 UTC, "Peter J. Holzer"

Do you use HTTP to get the data or some custom protocol?

HTTP - But it would appear to be a problem with perl sockets - Someone
suggested LWP::Simple but that was no good as I needed to process the
files which are large and the server does not have much RAM. So I used
LWP::UserAgent to dump straight to a file which I can then post
process and it works fine. Odd as I would have thought that LWP* would
use sockets at the bottom layer. Ho hum.
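
For anyone wanting to do the same, a rough sketch of the "dump straight to a file" approach follows. It uses the ':content_file' option of LWP::UserAgent's get method; the helper name save_url and the 60 second timeout are illustrative assumptions, not taken from the actual script:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Download $url straight into $file. With ':content_file' the body is
# written to disk as it arrives instead of being held in memory.
sub save_url {
    my ($url, $file) = @_;
    my $ua  = LWP::UserAgent->new(timeout => 60);
    my $res = $ua->get($url, ':content_file' => $file);
    die "GET $url failed: " . $res->status_line . "\n"
        unless $res->is_success;
    return $res;
}

# Run only when called as: script.pl URL OUTFILE
save_url(@ARGV) if @ARGV == 2;
```

The saved file can then be post-processed line by line with an ordinary open and while loop.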

Thanks for the help guys.
 
Peter J. Holzer

On Sun, 15 Jan 2012 12:16:08 UTC, "Peter J. Holzer"



HTTP - But it would appear to be a problem with perl sockets - Someone
suggested LWP::Simple but that was no good as I needed to process the
files which are large and the server does not have much RAM. So I used
LWP::UserAgent to dump straight to a file which I can then post
process and it works fine. Odd as I would have thought that LWP* would
use sockets at the bottom layer. Ho hum.

It does. You probably made an error in writing your own HTTP
implementation.

hp
 
Dave Saville

It does. You probably made an error in writing your own HTTP
implementation.

That I am willing to believe. Perhaps you would be so kind as to point
out the error in my code?

#!/usr/local/bin/perl
use warnings;
use strict;
use Socket;
open RAW, ">RAW" or die $!;
my $iaddr = inet_aton('xmltv.radiotimes.com') or die $!;
socket(SOCK, AF_INET, SOCK_STREAM, getprotobyname('tcp')) or die $!;
my $paddr = sockaddr_in(80, $iaddr);
connect(SOCK, $paddr) or die $!;
send SOCK, "GET /xmltv/94.dat HTTP\/1.1\r\n", 0;
send SOCK, "Host: xmltv.radiotimes.com\r\n\r\n", 0;
while ( <SOCK> )
{
print RAW $_;
}
close SOCK;
close RAW;

This hangs for minutes and then completes. I have run the above on two
different operating systems and they both do exactly the same.
 
Jens Thoms Toerring

Dave Saville said:
On Sun, 15 Jan 2012 15:09:39 UTC, "Peter J. Holzer"
That I am willing to believe. Perhaps you would be so kind as to point
out the error in my code?
#!/usr/local/bin/perl
use warnings;
use strict;
use Socket;
open RAW, ">RAW" or die $!;
my $iaddr = inet_aton('xmltv.radiotimes.com') or die $!;
socket(SOCK, AF_INET, SOCK_STREAM, getprotobyname('tcp')) or die $!;
my $paddr = sockaddr_in(80, $iaddr);
connect(SOCK, $paddr) or die $!;
send SOCK, "GET /xmltv/94.dat HTTP\/1.1\r\n", 0;
send SOCK, "Host: xmltv.radiotimes.com\r\n\r\n", 0;
while ( <SOCK> )
{
print RAW $_;
}
close SOCK;
close RAW;

This hangs for minutes and then completes. I have run the above on two
different operating systems and they both do exactly the same.

This 180 kB looks suspiciously like the length of the file the
server sends. And you're using HTTP 1.1, which allows the sender
to keep the connection open after it has sent a file, waiting
for the next request unless told otherwise (a "persistent
connection" is actually the default with HTTP 1.1). So my guess is
that the server sends the complete file just fine and then waits for
the next request. But since your loop only ends when the connection
is closed by the other side, it hangs until the server gets
bored and closes the connection after a few minutes. So either
use HTTP 1.0 or send an additional HTTP header with (IIRC)
"Connection: close\r\n". See also e.g.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html

Regards, Jens
 
Dave Saville

This 180 kB looks suspiciously like the length of the file the
server sends. And you're using HTTP 1.1, which allows the sender
to keep the connection open after it has sent a file, waiting
for the next request unless told otherwise (a "persistent
connection" is actually the default with HTTP 1.1). So my guess is
that the server sends the complete file just fine and then waits for
the next request. But since your loop only ends when the connection
is closed by the other side, it hangs until the server gets
bored and closes the connection after a few minutes. So either
use HTTP 1.0 or send an additional HTTP header with (IIRC)
"Connection: close\r\n". See also e.g.


Thank you so much Jens, reverting to 1.0 or adding the header both
work.
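
For the archives, a sketch of the script with the extra header added. It is rewritten around IO::Socket::INET for brevity; the sub names build_request and fetch_raw and the command-line handling are illustrative choices, not part of the script posted earlier:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;

# Build a minimal HTTP/1.1 GET request that asks the server to close
# the connection once the response is complete.
sub build_request {
    my ($host, $path) = @_;
    return join "\r\n",
        "GET $path HTTP/1.1",
        "Host: $host",
        "Connection: close",
        "", "";                     # blank line terminates the headers
}

# Send the request and copy everything the server returns (status line,
# headers and body) to $outfile.
sub fetch_raw {
    my ($host, $path, $outfile) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
    ) or die "connect to $host failed: $@\n";
    open my $out, '>', $outfile or die "cannot open $outfile: $!\n";
    binmode $out;
    print {$sock} build_request($host, $path);
    # The loop ends at EOF, which now arrives as soon as the server
    # closes the connection after the last byte.
    while (my $line = <$sock>) {
        print {$out} $line;
    }
    close $sock;
    close $out or die "close $outfile failed: $!\n";
}

# Run only when called as: script.pl HOST PATH OUTFILE
fetch_raw(@ARGV) if @ARGV == 3;
```

Called as "script.pl xmltv.radiotimes.com /xmltv/94.dat RAW" it mirrors the original test. Note that an HTTP/1.1 server may use chunked transfer encoding, in which case the raw dump also contains the chunk size lines; requesting HTTP/1.0 instead avoids that.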
 
