Hash optimization question


Roger Pack

Are there any libraries that overcome this problem:

(slow example)
=> #<... @real=0.113940954208374, @utime=0.0999999999999999,
@cstime=0.0>

(fast example)
=> #<...@real=0.0591599941253662, @utime=0.0600000000000001,
@cstime=0.0>


I.e., libraries that optimize hashes such that their .each_xxx methods
work as fast as Array's?
My gut instinct is that this is a problem that could be overcome with
some hacking in the core, but thought I'd ask.
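
Roughly, the comparison was along these lines (a sketch, since the
exact benchmark code isn't shown above; contents and counts are
illustrative):

require 'benchmark'

h = { :abc => :def }
a = h.to_a

puts Benchmark.measure { 100_000.times { h.each { } } }  # slow (Hash#each)
puts Benchmark.measure { 100_000.times { a.each { } } }  # fast (Array#each)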

-=R
 

Robert Klemme

2008/9/3 Roger Pack said:
Are there any libraries that overcome this problem:

(slow example)

=> #<... @real=0.113940954208374, @utime=0.0999999999999999,
@cstime=0.0>

(fast example)

=> #<...@real=0.0591599941253662, @utime=0.0600000000000001,
@cstime=0.0>


I.e., libraries that optimize hashes such that their .each_xxx methods
work as fast as Array's?
My gut instinct is that this is a problem that could be overcome with
some hacking in the core, but thought I'd ask.

No idea, you would have to look at the sources.

A few thoughts nevertheless: if you have to do the Array conversion
prior to each iteration, then the numbers suffer dramatically:

irb(main):003:0> Benchmark.measure { a = {:abc => :def}.to_a;
1_000_000.times { a.each{}}}
=> #<Benchmark::Tms:0x7fed1744 @label="", @real=0.641000032424927,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.641, @total=0.641>

irb(main):004:0> Benchmark.measure { a = {:abc => :def};
1_000_000.times { a.each{}}}
=> #<Benchmark::Tms:0x10013be0 @label="", @real=2.58999991416931,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=2.578, @total=2.578>

irb(main):005:0> Benchmark.measure { a = {:abc => :def};
1_000_000.times { a.to_a.each{}}}
=> #<Benchmark::Tms:0x7fee2184 @label="", @real=4.41000008583069,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=4.422, @total=4.422>

In other words: if the speedup relies on building an Array-like
structure at iteration time, iterating will be dramatically slower -
unless you want to pay the added space of a permanent Array
representation and add the logic to keep it in sync with changes.

I haven't looked into Ruby's Hash iteration in detail, but it might be
slower because of a general property of hash tables: they are usually
bigger (i.e. have more buckets) than the number of elements in the
hash, to allow for efficient hashing. Now if there are no links between
hash entries (whose maintenance I believe would increase insertion
overhead), iteration needs to visit every hash bucket, even those with
no data in them. That might account for the slower runtimes.
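
A rough way to observe this bucket effect on MRI 1.8 (a sketch; on
implementations that link entries, the difference should largely
disappear): grow a hash so its table allocates many buckets, delete
almost everything, and compare iteration against a hash that was
always small.

require 'benchmark'

big_then_small = {}
100_000.times { |i| big_then_small[i] = i }
1.upto(99_999) { |i| big_then_small.delete(i) }  # only key 0 remains

always_small = { 0 => 0 }

# Both hold a single entry, but the first likely still has a large table:
puts Benchmark.measure { 10_000.times { big_then_small.each { } } }
puts Benchmark.measure { 10_000.times { always_small.each { } } }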

I guess it will be difficult to get both: fast hash lookup and
insertion on one hand and fast iteration on the other hand. Is it
worth the effort? I don't know. This depends on your application.
Where is this a problem for you? Maybe there is a more efficient data
structure for your particular problem.

Kind regards

robert
 

Robert Dober

I guess it will be difficult to get both: fast hash lookup and
insertion on one hand and fast iteration on the other hand. Is it
worth the effort? I don't know. This depends on your application.
Where is this a problem for you? Maybe there is a more efficient data
structure for your particular problem.
Long time not having been on the same thread, Robert :)
Well indeed I would suggest to anyone having problems with performance
of small hashes to have a look at Array#assoc and #rassoc. The only
problem is I do not know up to which number of pairs this behaves
well. I guess n<20 should be quite safe. Maybe I will find some time
to benchmark this later.
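For anyone who hasn't used them, a quick illustration (the pairs here
are made up):

pairs = [[:abc, 'an abc line'], [:def, 'a def line']]

pairs.assoc(:def)            # => [:def, "a def line"]  (matches the key)
pairs.rassoc('an abc line')  # => [:abc, "an abc line"] (matches the value)
pairs.assoc(:missing)        # => nil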
Cheers
Robert
--
"It is truly useful, since it is pretty."

Antoine de Saint Exupéry
 

Robert Klemme

2008/9/3 Robert Dober said:
Long time not having been on the same thread, Robert :)
:)

Well indeed I would suggest to anyone having problems with performance
of small hashes to have a look at Array#assoc and #rassoc. The only
problem is I do not know up to which number of pairs this behaves
well. I guess n<20 should be quite safe. Maybe I will find some time
to benchmark this later.

Binary search might be an additional option for small to medium sized
collections if retrieval speed is not paramount but iteration speed is
important. (Btw, do we have binary search in the std lib by now?
This should probably go into Array...)
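
A minimal sketch of what that could look like over an Array of [key,
value] pairs sorted by key (hand-rolled, since the 1.8 std lib doesn't
ship a binary search; the method name is made up):

def pair_bsearch(pairs, key)
  lo, hi = 0, pairs.length - 1
  while lo <= hi
    mid = (lo + hi) / 2
    case key <=> pairs[mid][0]
    when 0  then return pairs[mid]  # found the pair
    when -1 then hi = mid - 1       # key sorts before the midpoint
    else         lo = mid + 1       # key sorts after the midpoint
    end
  end
  nil
end

sorted = [[1, :a], [2, :b], [5, :c]]
pair_bsearch(sorted, 5)  # => [5, :c]

Lookup is O(log n), and since the backing store is a plain Array,
iteration stays as fast as Array#each.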

Kind regards

robert
 

Roger Pack

I guess it will be difficult to get both: fast hash lookup and
insertion on one hand and fast iteration on the other hand. Is it
worth the effort? I don't know. This depends on your application.
Where is this a problem for you? Maybe there is a more efficient data
structure for your particular problem.

Yeah I guess it's hard to have the best of both worlds. Hmm. I suppose
that for small hashes which call .each a fix would definitely help, for
large hashes that call .each "a lot" it would probably also be quicker,
and for any hashes which don't call .each it wouldn't be helpful. I
wonder if the slowdown is small enough to make it worth it anyway
[though it would use more RAM too--then again, Ruby isn't known for its
low memory usage] :)

In the example I was thinking of, I was using a hash to parse lines in a
file

a la

identifiers = {/abc/ => 'an abc line', /def/ => 'a def line'}

string.each_line do |line|
  identifiers.each { |reg, description| puts description if line =~ reg }
end

So... in my particular example I'm only using a hash because of the
clean syntactic look :) [not for the functionality at all].

Rails uses it quite a bit, too, hence my enquiring about it.

I think that for now I'll just write a Hash monkey patcher a la

a = {}
a.fast_each_ify
a[:abc] = :def  # elements added from here on are wrapped internally
                # to allow for a quicker .each
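
One way such a patch might look (a sketch only: fast_each_ify is the
made-up name from above, and deletion, merge, etc. are left out; it
keeps a parallel Array of pairs in sync on insertion so #each can walk
the Array instead of the buckets):

class Hash
  def fast_each_ify
    @fast_pairs = to_a  # parallel Array of [key, value] pairs
    def self.[]=(key, value)
      @fast_pairs.delete_if { |k, _| k == key }  # replace an existing key
      @fast_pairs << [key, value]
      super(key, value)
    end
    def self.each(&block)
      @fast_pairs.each(&block)  # iterate the Array, not the buckets
      self
    end
    self
  end
end

a = {}.fast_each_ify
a[:abc] = :def
a.each { |k, v| p [k, v] }  # => [:abc, :def]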

Thanks!
-=R
 

Robert Klemme

2008/9/4 Roger Pack said:
I guess it will be difficult to get both: fast hash lookup and
insertion on one hand and fast iteration on the other hand. Is it
worth the effort? I don't know. This depends on your application.
Where is this a problem for you? Maybe there is a more efficient data
structure for your particular problem.

Yeah I guess it's hard to have the best of both worlds. Hmm. I suppose
that for small hashes which call .each a fix would definitely help, for
large hashes that call .each "a lot" it would probably also be quicker,
and for any hashes which don't call .each it wouldn't be helpful. I
wonder if the slowdown is small enough to make it worth it anyway
[though it would use more RAM too--then again, Ruby isn't known for its
low memory usage] :)

The critical issue is not the size of the Hash (at least not only
that) but rather the frequency of change. Personally, since a Hash is
primarily for efficient lookups, I would not want to invest in any
iteration speedups if there was a chance of lookup and insertion
slowdown. If you are iterating through a Hash all the time then you
are probably not using the proper data structure.

In the example I was thinking of, I was using a hash to parse lines in a
file

identifiers = {/abc/ => 'an abc line', /def/ => 'a def line'}

string.each_line do |line|
  identifiers.each { |reg, description| puts description if line =~ reg }
end

So... in my particular example I'm only using a hash because of the
clean syntactic look :) [not for the functionality at all].
Ah!

I think that for now I'll just write a Hash monkey patcher a la

Why do that? Your identifiers Hash is constant anyway, as far as I can
see. Why not just do

identifiers = {/abc/ => 'an abc line', /def/ => 'a def line'}.to_a
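
i.e. build the pair Array once and iterate that; each pair still
destructures into |reg, description|, so the loop body doesn't change
(string here is just illustrative input):

identifiers = { /abc/ => 'an abc line', /def/ => 'a def line' }.to_a

string = "abc one\ndef two\n"
string.each_line do |line|
  identifiers.each { |reg, description| puts description if line =~ reg }
end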

Cheers

robert
 

Robert Klemme

2008/9/4 Roger Pack said:
Yeah--works like a charm.
Great!

This makes me wish that there were a data structure that has
object-based O(1) lookup/insertion/deletion, and O(n) .each calls.
Apparently Hash doesn't :)

Not so fast! Keep in mind that these figures are *asymptotic*: there
are always constants involved, and it might well be that for the N's
we deal with, the results look very different. That's why we do all
the benchmarking when we want to find out which approach is best. :)
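
For example, something along these lines shows how the relative costs
shift with n (a sketch; exact numbers and the crossover point depend
on the Ruby version and machine):

require 'benchmark'

[2, 10, 100].each do |n|
  h = {}
  n.times { |i| h[i] = i }
  a = h.to_a
  key = n - 1  # worst case for the linear scan
  Benchmark.bm(12) do |bm|
    bm.report("hash  n=#{n}") { 100_000.times { h[key] } }
    bm.report("assoc n=#{n}") { 100_000.times { a.assoc(key) } }
  end
end
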
My theory is that some people use hashes just for the convenience of
object-based lookup [ex: scope[:include] in rails] and not for the real
strength of hashes, which, as you mentioned, is "efficient
lookup/insertion/deletion"

I am not sure what you mean by "object-based lookup", especially how
it is not "efficient lookup/insertion/deletion" or differs from that.
I mean, people do use these lookups *because* they are efficient.

You're welcome.

Cheers

robert
 

Roger Pack

Well indeed I would suggest to anyone having problems with performance
of small hashes to have a look at Array#assoc and #rassoc. The only
problem is I do not know up to which number of pairs this behaves
well. I guess n<20 should be quite safe. Maybe I will find some time
to benchmark this later.

So our goal is to see if Array#assoc is "about as fast as hash based
lookup"

a = {1 => 2, 2 => 3 }
b = [[1, 2], [2, 3]]
Benchmark.measure { 1000000.times { a[2] }}
=> #<Benchmark::Tms:0x5d93dc @total=0.26, @cutime=0.0, @label="",
@stime=0.0, @real=0.270273208618164, @utime=0.26, @cstime=0.0>

=> #<Benchmark::Tms:0x5b2e1c @total=0.26, @cutime=0.0, @label="",
@stime=0.0, @real=0.264580965042114, @utime=0.26, @cstime=0.0>

faster than a hash :)

=> #<Benchmark::Tms:0x5e4408 @total=0.35, @cutime=0.0, @label="",
@stime=0.01, @real=0.35266900062561, @utime=0.34, @cstime=0.0>

Quite a bit slower.

So even with only 2 entries, #assoc is slower once the lookup hits the
last pair.

Oh well, we do what we can.
-=R
 

Brian Candler

I haven't looked into Ruby's Hash iteration in detail, but it might be
slower because of a general property of hash tables: they are usually
bigger (i.e. have more buckets) than the number of elements in the
hash, to allow for efficient hashing. Now if there are no links between
hash entries (whose maintenance I believe would increase insertion
overhead), iteration needs to visit every hash bucket, even those with
no data in them. That might account for the slower runtimes.

Aside: ruby-1.9 keeps links between hash entries, exactly for the
purpose of faster iteration. It also has the potentially useful
side-effect that hashes iterate in the same order that the entries were
inserted.

(However, depending on this feature is a good way of writing code which
won't run properly in 1.8)
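
A quick check of the ordering behavior (shown for 1.9; on 1.8 the
order follows the bucket layout and is effectively unspecified):

h = {}
h[:banana] = 2
h[:apple]  = 1
h[:cherry] = 3
p h.keys  # => [:banana, :apple, :cherry] on 1.9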
 

Roger Pack

Aside: ruby-1.9 keeps links between hash entries, exactly for the
purpose of faster iteration. It also has the potentially useful
side-effect that hashes iterate in the same order that the entries were
inserted.

Interesting.
It appears that, whatever the internal linking mechanism is, Hash#each
still differs from Array#each speed-wise:

1.8.6 Hash:
Benchmark.measure { a = {:a => :b, :c => :d}; 100_000.times { a.each {} }}
=> #<Benchmark::Tms:0x26c2f8 @total=0.17, @cutime=0.0, @label="",
@stime=0.0, @real=0.173249959945679, @utime=0.17, @cstime=0.0>

1.8.6 Array:
Benchmark.measure { a = [[:a, :b], [:c, :d]]; 100_000.times { a.each {} }}
=> #<Benchmark::Tms:0x5de7c4 @total=0.07, @cutime=0.0, @label="",
@stime=0.0, @real=0.0752890110015869, @utime=0.07, @cstime=0.0>

so 0.17 s for the Hash vs. 0.07 s for the Array

and on 1.9 Hash:=> #<Benchmark::Tms:0x130efc @label="", @real=0.142009973526001,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.14, @total=0.14>

1.9 Array:
Benchmark.measure { a = [[:a, :b], [:c, :d]]; 100_000.times { a.each {} }}
=> #<Benchmark::Tms:0x13379c @label="", @real=0.0471560955047607,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.05, @total=0.05>

so 0.14 s for the Hash vs. 0.05 s for the Array

The Hash-to-Array ratios look very similar between the two versions.

-=R
 
