The huge amount of response data problem

F

falconzyx

I have an issue:
1. I want to open a file and use the data from the file to construct a URL.
2. After I construct the URL and send it, I get the response HTML data, and some parts of it are what I want to store into files.

It seems like a very easy thing; however, the issue is that the data file I have to open is huge, so I have to construct almost 200,000 URLs to send and parse the response data. And the speed is very, very slow.

I have no idea about threads or DB caching, so I want some help.

Please give me some advice on what I should do to improve the speed.

Thanks very much.
 

Here is my code:

use strict;
use threads;
use threads::shared;
use LWP::UserAgent;
use LWP::Simple;
use Data::Dumper;

my $wordsList = get_request();
#print Dumper($wordsList);

my @words = split("\n", $wordsList);
#print Dumper(@words);

my @url = get_url(@words);
#print Dumper(@url);

my @thr;
foreach my $i (1..100000) {
    push @thr, threads->new(\&get_html, $url[$i]);
}
foreach (@thr) {
    $_->detach;    # it doesn't work!
}

sub get_html {
    my (@url) = @_;
}

sub get_request {
    ..........
    return $wordsList;
}

sub get_url {
    my (@words) = @_;
    ................
    return @url;
}
 
B

Ben Bullock

Your code is hopelessly inefficient. 100,000 strings of even twenty
characters each add up to at least two megabytes of memory. Then you've
doubled that with the creation of the URLs, and then you are creating
arrays of all of these things, so you've used several megabytes of
memory.

Instead of first creating a huge array of names, then a huge array of
URLs, why don't you just read in one line of the file at a time, then
try to get data from each URL? Read in one line of the first file,
create its URL, get the response data, store it, then go back and get
the next line of the file, etc. A 100,000 line file actually isn't
that big.
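For example, here is a minimal line-at-a-time sketch; the file name and URL pattern are placeholders for whatever you actually use:

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(timeout => 10);

open my $fh, '<', 'words.txt' or die "can not open file: $!";
while (my $word = <$fh>) {
    chomp $word;
    my $res = $ua->get("http://example.com/$word");
    if ($res->is_success) {
        # parse and store $res->content here;
        # nothing accumulates in memory between iterations
    }
}
close $fh;

This way only one line and one response are held in memory at any time.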

But if you are getting all these files from the internet, the biggest
bottleneck is probably the time the code spends waiting for responses
from the web servers it queries. You'd have to think about making
parallel requests somehow to solve that.
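One way to sketch that with processes rather than threads, assuming the CPAN module Parallel::ForkManager (the file name and URL pattern are again placeholders):

use strict;
use warnings;
use LWP::UserAgent;
use Parallel::ForkManager;    # CPAN module, not core

my $pm = Parallel::ForkManager->new(10);    # at most 10 requests in flight
my $ua = LWP::UserAgent->new(timeout => 10);

open my $fh, '<', 'words.txt' or die "can not open file: $!";
while (my $word = <$fh>) {
    chomp $word;
    $pm->start and next;    # parent keeps reading; child does the fetch
    my $res = $ua->get("http://example.com/$word");
    # parse and store the response in the child before it exits
    $pm->finish;
}
close $fh;
$pm->wait_all_children;

Each child exits after one fetch, so memory stays bounded no matter how long the file is.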
 
F

falconzyx

Ben Bullock said:
Instead of first creating a huge array of names, then a huge array of
URLs, why don't you just read in one line of the file at a time, then
try to get data from each URL? [...] the biggest bottleneck is probably
the time the code spends waiting for responses from the web servers it
queries.

Thanks Ben,

However, is there a good solution that uses threads? I tried that, and
I still run out of memory from time to time after refactoring the code
as you suggested. I also tried Thread::Pool and some other threading
modules that I found. Is Perl really not well suited to multi-threaded
programming?

Thanks again, everyone.
 

Here is my refactored code:
use strict;
use threads;
use LWP::UserAgent;
use HTTP::Request;
use LWP::Simple;    # for getstore() in save_sound()
use Data::Dumper;

get_request();

sub get_request {
    open (FH, "...") or die "can not open file $!";
    while (my $i = <FH>) {    # read one line at a time
        chomp $i;
        my $url = ".../$i";
        my $t = threads->new(\&get_html, $url);
        $t->join();    # joining right away makes the threads run one at a time
    }
    close (FH);
}

sub get_html {
    my ($url) = @_;
    my $user_agent = LWP::UserAgent->new();
    my $response = $user_agent->request(HTTP::Request->new('GET', $url));
    my $content = $response->content;
    format_html($content);
}

sub format_html {
    my ($content) = @_;
    my $html_data = $content;
    my $word;
    my $data;
    while ( $html_data =~ m{...}igs ) {
        $word = $1;
    }
    while ( $html_data =~ m{...}igs ) {
        $data = $1;
        save_data( $word, $data );
    }
    while ( $data =~ m{...}igs ) {
        my $title = $1;
        my $sound = $1 . $2;
        if ( defined($sound) ) {
            save_sound( $word, $title, $sound );
        }
    }
}

sub save_data {
    my ( $word, $data ) = @_;
    open ( FH, " > ..." ) or die "Can not open $!";
    print FH $data;
    close(FH);
}

sub save_sound {
    my ( $word, $title, $sound ) = @_;
    getstore("...", "...") or warn $!;
}
 
R

RedGrittyBrick

falconzyx said:
Thanks Ben,

However, is there a good solution that uses threads? I tried that, and
I still run out of memory from time to time after refactoring the code
as you suggested.

That's because, if your file contains 100,000 lines, your program tries
to create 100,000 simultaneous threads, doesn't it?

I would create a pool with a fixed number of threads (say 10). I'd read
the file, adding tasks to a queue of the same size; after filling the
queue, I'd pause reading the file until the queue has a spare slot.
Maybe this could be achieved by sleeping a while (say 100 ms) and
re-checking whether the queue is still full. When a thread is created or
has finished a task, it should remove a task from the queue and process
it. If the queue is empty, the thread should sleep for a while (say
200 ms) and try again. You'd need some mechanism to signal the threads
that all tasks have been queued (maybe a flag, a special marker task, a
signal, or a certain number of consecutive failed attempts to find
work). A sketch of this idea appears below.

I've never tried to program something like this in Perl so I'd imagine
someone (probably several people) has already solved this and added
modules to CPAN to assist in this sort of task.

There are probably some OO design patterns that apply, too.
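Indeed, the core Thread::Queue module already covers the queue part. Here is a minimal sketch of the pool idea along those lines; the file name, URL pattern, and sizes are illustrative, and Thread::Queue's end() needs a reasonably recent Perl:

use strict;
use warnings;
use threads;
use Thread::Queue;
use LWP::UserAgent;
use Time::HiRes qw(sleep);    # fractional sleep for the 100 ms pause

my $POOL_SIZE = 10;
my $queue     = Thread::Queue->new();

sub worker {
    my $ua = LWP::UserAgent->new(timeout => 10);
    # dequeue() blocks until a task arrives and returns undef once end() is called
    while (defined(my $url = $queue->dequeue())) {
        my $res = $ua->get($url);
        # parse and store $res->content here
    }
}

my @pool = map { threads->create(\&worker) } 1 .. $POOL_SIZE;

open my $fh, '<', 'words.txt' or die "can not open file: $!";
while (my $word = <$fh>) {
    chomp $word;
    sleep 0.1 while $queue->pending() >= $POOL_SIZE;    # pause until the queue has a spare slot
    $queue->enqueue("http://example.com/$word");
}
close $fh;

$queue->end();    # the "all tasks have been queued" signal
$_->join() for @pool;

Here end() plays the role of the marker task described above, so the workers exit cleanly instead of sleeping and re-checking forever.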
falconzyx said:
I also tried Thread::Pool and some other threading modules that I
found. Is Perl really not well suited to multi-threaded programming?

I find it hard to understand what you are saying but I think the answer
is: Yes, Perl is well suited to programming with multiple threads (or
processes).
 
X

xhoster

falconzyx said:
It seems like a very easy thing; however, the issue is that the data
file I have to open is huge, so I have to construct almost 200,000 URLs
to send and parse the response data. And the speed is very, very slow.

What part is slow, waiting for the response or parsing it?

Do those URLs point to *your* servers? If so, then you should be able
to bypass HTTP and go directly to the source. If not, then do you have
permission from the owners of the servers to launch what could very well
be a denial-of-service attack against them?

Xho

 
X

xhoster

RedGrittyBrick said:
I find it hard to understand what you are saying but I think the answer
is: Yes, Perl is well suited to programming with multiple threads (or
processes).

I agree with the "(or processes)" part, provided you are running on a
Unix-like platform. But in my experience/opinion, Perl threads mostly suck.

 
F

falconzyx

xhoster said:
I agree with the "(or processes)" part, provided you are running on a
Unix-like platform. But in my experience/opinion, Perl threads mostly suck.

Here is my refactored code, which still runs at a very slow speed;
please advise me on how to improve it, thanks very much:

use strict;
use LWP::Parallel::UserAgent;
use HTTP::Request;
use LWP::Simple;    # for getstore() in save_sound()
use threads;

# display tons of debugging messages. See 'perldoc LWP::Debug'
#use LWP::Debug qw(+);

my $reqs = [
    HTTP::Request->new('GET', "http://www...."),
    HTTP::Request->new('GET', "......"),
    ..............    # about nearly 200000 urls here
];

my $pua = LWP::Parallel::UserAgent->new();
$pua->in_order  (1);    # handle requests in order of registration
$pua->duplicates(0);    # ignore duplicates
$pua->timeout   (1);    # in seconds
$pua->redirect  (1);    # follow redirects

foreach my $req (@$reqs) {
    print "Registering '" . $req->url . "'\n";
    if ( my $res = $pua->register($req) ) {
        print STDERR $res->error_as_HTML;
    }
}
my $entries = $pua->wait();

foreach (keys %$entries) {
    my $res = $entries->{$_}->response;
    threads->new(\&format_html, $res->content);
}
foreach my $thr (threads->list()) {
    $thr->join();    # I think it does not work...
}

sub format_html {
    my ($html_data) = @_;
    my $word;
    my $data;
    while ( $html_data =~ m{...}igs ) {
        $word = $1;
    }
    while ( $html_data =~ m{...}igs ) {
        $data = $1;
        save_data( $word, $data );
    }
    while ( $data =~ m{...}igs ) {
        my $title = $1;
        my $sound = $1 . $2;
        if ( defined($sound) ) {
            save_sound( $word, $title, $sound );
        }
    }
}

sub save_data {
    my ( $word, $data ) = @_;
    open ( FH, " > ..." ) or die "Can not open $!";
    print FH $data;
    close(FH);
}

sub save_sound {
    my ( $word, $title, $sound ) = @_;
    getstore("...", "...") or warn $!;
}
 
