Horacio Sanson
I am using Mechanize for several projects that require me to download large
numbers of HTML pages from a web site. Since I am working with about 1000
pages, the limitations of Mechanize have started to show...
Try this code:
################################################
require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
prev = 0
curr = 0
prev_pages = 0
curr_pages = 0

1000.times do
  page = agent.get("http://yourfavoritepage.com")
  curr = 0
  curr_pages = 0

  # Count the total number of live objects and the number of
  # WWW::Mechanize::Page objects among them.
  ObjectSpace.each_object { |o|
    curr += 1
    curr_pages += 1 if o.class == WWW::Mechanize::Page
  }

  puts "There are #{curr} (#{curr - prev}) objects"
  puts "There are #{curr_pages} (#{curr_pages - prev_pages}) Page objects"

  prev = curr
  prev_pages = curr_pages

  GC.enable
  GC.start

  sleep 1.0 # Keeps the script from using 100% CPU.
end
############################################
The output of this script reveals that at each iteration a
WWW::Mechanize::Page object gets created (along with a lot of other objects)
and these objects never get garbage collected. So you can watch your RAM
usage grow at every iteration and never come back down.
Now this can be worked around by moving agent = WWW::Mechanize.new inside
the block, like this:
############################################
1000.times do
  agent = WWW::Mechanize.new   # <-- CHANGE IS HERE
  page = agent.get("http://yourfavoritepage.com")
  curr = 0
  curr_pages = 0
  # Count the total number of objects and the number of WWW::Mechanize::Page
  # ... the rest is the same as above
#############################################
With this change the number of WWW::Mechanize::Page objects never rises
above three, and the count of other objects goes up and down by around 60
per iteration.
Does this mean that the WWW::Mechanize object keeps references to all the
pages it has downloaded, and that those pages will not be garbage collected
as long as the WWW::Mechanize object is alive?
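If the agent really does hang on to every page it visits, I would expect a
quick check like the one below to show the count growing by one on every
request. This is only a sketch: I am assuming the agent exposes the visited
pages through an array-like history collection, which I have not confirmed
in the docs.
############################################
require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new

10.times do
  agent.get("http://yourfavoritepage.com")
  # Assumption: agent.history holds the pages visited so far. If pages
  # are being retained, this number should increase on every iteration.
  puts "history size: #{agent.history.size}"
end
############################################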
In my script I cannot recreate the WWW::Mechanize object on each iteration,
because this particular site uses a form and requires cookie/session state
to reach the pages I need to download. Is there a way to tell the Mechanize
object to discard the pages it has already downloaded? Something like the
sketch below is what I am hoping for.
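To make the question concrete, this is roughly the kind of knob I am looking
for. The max_history accessor here is only a guess at a name, not something
I have found in the documentation; the point is to cap or clear the stored
pages while keeping the cookie jar intact.
############################################
require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
# Hypothetical setting: limit how many pages the agent remembers, so old
# pages can be garbage collected while cookies/session state are kept.
agent.max_history = 1

# ... log in through the form once, so the session cookies are stored ...

1000.times do
  agent.get("http://yourfavoritepage.com")
end
############################################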
regards,
Horacio