Rob Doug
Hi all,
I'm writing a web crawler with thread support, and after working through
some number of links, memory usage keeps climbing. When the program has
just started, it uses about 20 MB. After crawling 150-200 links, usage is
around 100 MB. After 1000 links, the program can use up to 1 GB.
Please help me figure out why.
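For reference, here's one way to watch the growth from inside the script: a minimal sketch, assuming Linux, that reads the resident set size from /proc/self/status (the helper name log_rss is made up):

# Hypothetical helper: print the process's current resident set size.
# Linux only: parses the VmRSS line from /proc/self/status.
def log_rss(tag)
  rss_kb = File.read("/proc/self/status")[/VmRSS:\s+(\d+)/, 1].to_i
  puts "#{tag}: RSS = #{rss_kb / 1024} MB"
end

log_rss("startup")  # call again after each batch of links to see the trend

The script itself: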
require 'rubygems'
require 'mechanize'
require 'hpricot'
require 'yaml'
require 'net/http'
require 'uri'
require 'modules/common'

Thread.abort_on_exception = true

$config = YAML.load_file("config.yml")
links = IO.read("bases/3+4.txt").split("\n")

threads = []
links.each do |link|
  if Thread.list.size < 50
    threads << Thread.new(link) { |myLink| Common.post_it(myLink) }
  else
    # Too many live threads: pause, reap the finished ones, then
    # retry the same link.
    sleep(1)
    threads.each { |t| t.join unless t.status }
    puts "total threads: #{Thread.list.size}"
    redo
  end
end
threads.each { |t| t.join }
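For what it's worth, the same 50-thread throttling could be written as a fixed pool of workers pulling links from a Queue, so only 50 Thread objects are ever created instead of one per link. A minimal sketch, assuming my Common.post_it interface from above:

require 'thread'

queue = Queue.new
IO.read("bases/3+4.txt").split("\n").each { |l| queue << l }

workers = Array.new(50) do
  Thread.new do
    loop do
      begin
        link = queue.pop(true)  # non-blocking pop; raises when drained
      rescue ThreadError
        break                   # queue is empty, worker exits
      end
      Common.post_it(link)
    end
  end
end
workers.each { |t| t.join }

A pool like this also keeps the thread list bounded, since the workers are joined once at the end instead of being collected per link.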
What in "Common" module:
1. Crawler (net/http or mechanize - I tried both, results the same)
2. HTML parser (Hpricot or Nokogir - I tried both again, with same bad
results)
so I extract some data from page and save it to the file. Nothing
special as you see.
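To give an idea, here's a stripped-down sketch of what the per-link routine does; the real Common.post_it is more involved, and the selector and output file here are made-up placeholders:

require 'net/http'
require 'uri'
require 'hpricot'

module Common
  # Hypothetical sketch: fetch a page, extract a bit of data,
  # append it to a file. Nothing is kept around between calls.
  def self.post_it(link)
    body  = Net::HTTP.get(URI.parse(link))
    doc   = Hpricot(body)
    title = doc.at("title") ? doc.at("title").inner_text.strip : ""
    File.open("results.txt", "a") { |f| f.puts "#{link}\t#{title}" }
  end
end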
When I run this program without threads, I get the same results.
Please help: is this my fault, or is something wrong in the libraries?