Rob Doug
Hi all,
I'm writing a web crawler with thread support, and after working through
some number of links, memory usage keeps climbing. When the program has
just started, it uses about 20 MB. After crawling 150-200 links, usage is
around 100 MB. After 1000 links, the program can use up to 1 GB.
Please help me figure out why.
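For reference, here's one way to watch the growth from inside the script: a minimal sketch, assuming Linux, that reads the resident set size from /proc/self/status (the helper name log_rss is made up):

# Hypothetical helper: print the process's current resident set size.
# Linux only: parses the VmRSS line from /proc/self/status.
def log_rss(tag)
  rss_kb = File.read("/proc/self/status")[/VmRSS:\s+(\d+)/, 1].to_i
  puts "#{tag}: RSS = #{rss_kb / 1024} MB"
end

log_rss("startup")  # call again after each batch of links to see the trend

The script itself: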
require 'rubygems'
require 'mechanize'
require 'hpricot'
require 'yaml'
require 'net/http'
require 'uri'
require 'modules/common'

Thread.abort_on_exception = true

$config = YAML.load_file("config.yml")
links = IO.read("bases/3+4.txt").split("\n")

threads = []
links.each do |link|
  if Thread.list.size < 50
    threads << Thread.new(link) { |myLink| Common.post_it(myLink) }
  else
    # Too many live threads: pause, reap the finished ones, then
    # retry the same link.
    sleep(1)
    threads.each { |t| t.join unless t.status }
    puts "total threads: #{Thread.list.size}"
    redo
  end
end
threads.each { |t| t.join }
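For what it's worth, the same 50-thread throttling could be written as a fixed pool of workers pulling links from a Queue, so only 50 Thread objects are ever created instead of one per link. A minimal sketch, assuming my Common.post_it interface from above:

require 'thread'

queue = Queue.new
IO.read("bases/3+4.txt").split("\n").each { |l| queue << l }

workers = Array.new(50) do
  Thread.new do
    loop do
      begin
        link = queue.pop(true)  # non-blocking pop; raises when drained
      rescue ThreadError
        break                   # queue is empty, worker exits
      end
      Common.post_it(link)
    end
  end
end
workers.each { |t| t.join }

A pool like this also keeps the thread list bounded, since the workers are joined once at the end instead of being collected per link.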
What in "Common" module:
1. Crawler (net/http or mechanize - I tried both, results the same)
2. HTML parser (Hpricot or Nokogir - I tried both again, with same bad
results)
so I extract some data from page and save it to the file. Nothing
special as you see.
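To give an idea, here's a stripped-down sketch of what the per-link routine does; the real Common.post_it is more involved, and the selector and output file here are made-up placeholders:

require 'net/http'
require 'uri'
require 'hpricot'

module Common
  # Hypothetical sketch: fetch a page, extract a bit of data,
  # append it to a file. Nothing is kept around between calls.
  def self.post_it(link)
    body  = Net::HTTP.get(URI.parse(link))
    doc   = Hpricot(body)
    title = doc.at("title") ? doc.at("title").inner_text.strip : ""
    File.open("results.txt", "a") { |f| f.puts "#{link}\t#{title}" }
  end
end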
When I run this program without threads, I get the same results.
Please help: is this my fault, or is something wrong in the libraries?