Hash optimization question


Roger Pack

Are there any libraries that overcome this problem:

(slow example)
=> #<... @real=0.113940954208374, @utime=0.0999999999999999,
@cstime=0.0>

(fast example)
=> #<...@real=0.0591599941253662, @utime=0.0600000000000001,
@cstime=0.0>


I.e., libraries that optimize hashes such that their .each_xxx methods
work as fast as Array's?
My gut instinct is that this is a problem that could be overcome with
some hacking in the core, but thought I'd ask.
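
Roughly, the comparison was along these lines (a sketch, since the
exact benchmark code isn't shown above; contents and counts are
illustrative):

require 'benchmark'

h = { :abc => :def }
a = h.to_a

puts Benchmark.measure { 100_000.times { h.each { } } }  # slow (Hash#each)
puts Benchmark.measure { 100_000.times { a.each { } } }  # fast (Array#each)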

-=R
 

Robert Klemme

2008/9/3 Roger Pack said:
Are there any libraries that overcome this problem:

(slow example)

=> #<... @real=0.113940954208374, @utime=0.0999999999999999,
@cstime=0.0>

(fast example)

=> #<...@real=0.0591599941253662, @utime=0.0600000000000001,
@cstime=0.0>


I.e., libraries that optimize hashes such that their .each_xxx methods
work as fast as Array's?
My gut instinct is that this is a problem that could be overcome with
some hacking in the core, but thought I'd ask.

No idea, you would have to look at the sources.

A few thoughts nevertheless: if you have to do the Array conversion
prior to each iteration, then the numbers suffer dramatically:

irb(main):003:0> Benchmark.measure { a = {:abc => :def}.to_a;
1_000_000.times { a.each{}}}
=> #<Benchmark::Tms:0x7fed1744 @label="", @real=0.641000032424927,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.641, @total=0.641>

irb(main):004:0> Benchmark.measure { a = {:abc => :def};
1_000_000.times { a.each{}}}
=> #<Benchmark::Tms:0x10013be0 @label="", @real=2.58999991416931,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=2.578, @total=2.578>

irb(main):005:0> Benchmark.measure { a = {:abc => :def};
1_000_000.times { a.to_a.each{}}}
=> #<Benchmark::Tms:0x7fee2184 @label="", @real=4.41000008583069,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=4.422, @total=4.422>

In other words: if the speedup relies on building an Array-like
structure at iteration time, iterating will be dramatically slower -
unless you want to pay the added space of a permanent Array
representation and add the logic to keep it in sync with changes.

I haven't looked into Ruby's Hash iteration in detail, but it might be
slower because of a general property of hash tables: they are usually
bigger (i.e. have more buckets) than the number of elements in the
hash, to allow for efficient hashing. Now if there are no links between
hash entries (whose maintenance I believe would increase insertion
overhead), iteration needs to visit every hash bucket, even those with
no data in them. That might account for the slower runtimes.
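
A rough way to observe this bucket effect on MRI 1.8 (a sketch; on
implementations that link entries, the difference should largely
disappear): grow a hash so its table allocates many buckets, delete
almost everything, and compare iteration against a hash that was
always small.

require 'benchmark'

big_then_small = {}
100_000.times { |i| big_then_small[i] = i }
1.upto(99_999) { |i| big_then_small.delete(i) }  # only key 0 remains

always_small = { 0 => 0 }

# Both hold a single entry, but the first likely still has a large table:
puts Benchmark.measure { 10_000.times { big_then_small.each { } } }
puts Benchmark.measure { 10_000.times { always_small.each { } } }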

I guess it will be difficult to get both: fast hash lookup and
insertion on one hand and fast iteration on the other hand. Is it
worth the effort? I don't know. This depends on your application.
Where is this a problem for you? Maybe there is a more efficient data
structure for your particular problem.

Kind regards

robert
 

Robert Dober

I guess it will be difficult to get both: fast hash lookup and
insertion on one hand and fast iteration on the other hand. Is it
worth the effort? I don't know. This depends on your application.
Where is this a problem for you? Maybe there is a more efficient data
structure for your particular problem.
Long time not having been on the same thread, Robert :)
Well indeed I would suggest to anyone having problems with performance
of small hashes to have a look at Array#assoc and #rassoc. The only
problem is I do not know up to which number of pairs this behaves
well. I guess n<20 should be quite safe. Maybe I will find some time
to benchmark this later.
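For anyone who hasn't used them, a quick illustration (the pairs here
are made up):

pairs = [[:abc, 'an abc line'], [:def, 'a def line']]

pairs.assoc(:def)            # => [:def, "a def line"]  (matches the key)
pairs.rassoc('an abc line')  # => [:abc, "an abc line"] (matches the value)
pairs.assoc(:missing)        # => nil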
Cheers
Robert
--
"It is truly useful, since it is pretty."

Antoine de Saint Exupéry
 

Robert Klemme

2008/9/3 Robert Dober said:
Long time not having been on the same thread, Robert :)
:)

Well indeed I would suggest to anyone having problems with performance
of small hashes to have a look at Array#assoc and #rassoc. The only
problem is I do not know up to which number of pairs this behaves
well. I guess n<20 should be quite safe. Maybe I will find some time
to benchmark this later.

Binary search might be an additional option for small to medium sized
collections if retrieval speed is not paramount but iteration speed is
important. (Btw, do we have binary search in the std lib by now?
This should probably go into Array...)
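
A minimal sketch of what that could look like over an Array of [key,
value] pairs sorted by key (hand-rolled, since the 1.8 std lib doesn't
ship a binary search; the method name is made up):

def pair_bsearch(pairs, key)
  lo, hi = 0, pairs.length - 1
  while lo <= hi
    mid = (lo + hi) / 2
    case key <=> pairs[mid][0]
    when 0  then return pairs[mid]  # found the pair
    when -1 then hi = mid - 1       # key sorts before the midpoint
    else         lo = mid + 1       # key sorts after the midpoint
    end
  end
  nil
end

sorted = [[1, :a], [2, :b], [5, :c]]
pair_bsearch(sorted, 5)  # => [5, :c]

Lookup is O(log n), and since the backing store is a plain Array,
iteration stays as fast as Array#each.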

Kind regards

robert
 

Roger Pack

I guess it will be difficult to get both: fast hash lookup and
insertion on one hand and fast iteration on the other hand. Is it
worth the effort? I don't know. This depends on your application.
Where is this a problem for you? Maybe there is a more efficient data
structure for your particular problem.

Yeah I guess it's hard to have the best of both worlds. Hmm. I suppose
that for small hashes which call .each a fix would definitely help, for
large hashes that call .each "a lot" it would probably also be quicker,
and for any hashes which don't call .each it wouldn't be helpful. I
wonder if the slowdown is small enough to make it worth it anyway
[though it would use more RAM too--then again, Ruby isn't known for its
low memory usage] :)

In the example I was thinking of, I was using a hash to parse lines in a
file

a la

identifiers = {/abc/ => 'an abc line', /def/ => 'a def line'}

string.each_line do |line|
  identifiers.each { |reg, description| puts description if line =~ reg }
end

So... in my particular example I'm only using a hash because of the
clean syntactic look :) [not for the functionality at all].

Rails uses it quite a bit, too, hence my enquiring about it.

I think that for now I'll just write a Hash monkey patcher a la

a = {}
a.fast_each_ify
a[:abc] = :def  # elements added from here on are wrapped internally
                # to allow for a quicker .each
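
One way such a patch might look (a sketch only: fast_each_ify is the
made-up name from above, and deletion, merge, etc. are left out; it
keeps a parallel Array of pairs in sync on insertion so #each can walk
the Array instead of the buckets):

class Hash
  def fast_each_ify
    @fast_pairs = to_a  # parallel Array of [key, value] pairs
    def self.[]=(key, value)
      @fast_pairs.delete_if { |k, _| k == key }  # replace an existing key
      @fast_pairs << [key, value]
      super(key, value)
    end
    def self.each(&block)
      @fast_pairs.each(&block)  # iterate the Array, not the buckets
      self
    end
    self
  end
end

a = {}.fast_each_ify
a[:abc] = :def
a.each { |k, v| p [k, v] }  # => [:abc, :def]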

Thanks!
-=R
 

Robert Klemme

2008/9/4 Roger Pack said:
I guess it will be difficult to get both: fast hash lookup and
insertion on one hand and fast iteration on the other hand. Is it
worth the effort? I don't know. This depends on your application.
Where is this a problem for you? Maybe there is a more efficient data
structure for your particular problem.

Yeah I guess it's hard to have the best of both worlds. Hmm. I suppose
that for small hashes which call .each a fix would definitely help, for
large hashes that call .each "a lot" it would probably also be quicker,
and for any hashes which don't call .each it wouldn't be helpful. I
wonder if the slowdown is small enough to make it worth it anyway
[though it would use more RAM too--then again, Ruby isn't known for its
low memory usage] :)

The critical issue is not the size of the Hash (at least not only
that) but rather the frequency of change. Personally, since a Hash is
primarily for efficient lookups, I would not want to invest in any
iteration speedups if there was a chance of lookup and insertion
slowdown. If you are iterating through a Hash all the time then you
are probably not using the proper data structure.

In the example I was thinking of, I was using a hash to parse lines in a
file

identifiers = {/abc/ => 'an abc line', /def/ => 'a def line'}

string.each_line do |line|
  identifiers.each { |reg, description| puts description if line =~ reg }
end

So... in my particular example I'm only using a hash because of the
clean syntactic look :) [not for the functionality at all].
Ah!

I think that for now I'll just write a Hash monkey patcher a la

Why do that? Your identifiers Hash is constant anyway, as far as I can
see. Why not just do

identifiers = {/abc/ => 'an abc line', /def/ => 'a def line'}.to_a
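
i.e. build the pair Array once and iterate that; each pair still
destructures into |reg, description|, so the loop body doesn't change
(string here is just illustrative input):

identifiers = { /abc/ => 'an abc line', /def/ => 'a def line' }.to_a

string = "abc one\ndef two\n"
string.each_line do |line|
  identifiers.each { |reg, description| puts description if line =~ reg }
end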

Cheers

robert
 

Robert Klemme

2008/9/4 Roger Pack said:
Yeah--works like a charm.
Great!

This makes me wish that there were a data structure that has
object-based O(1) lookup/insertion/deletion, and O(n) .each calls.
Apparently Hash doesn't :)

Not so fast! Keep in mind that these figures are *asymptotic*: there
are always constants involved, and it might well be that for the N's
we deal with, the results look very different. That's why we do all
the benchmarking when we want to find out which approach is best. :)
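
For example, something along these lines shows how the relative costs
shift with n (a sketch; exact numbers and the crossover point depend
on the Ruby version and machine):

require 'benchmark'

[2, 10, 100].each do |n|
  h = {}
  n.times { |i| h[i] = i }
  a = h.to_a
  key = n - 1  # worst case for the linear scan
  Benchmark.bm(12) do |bm|
    bm.report("hash  n=#{n}") { 100_000.times { h[key] } }
    bm.report("assoc n=#{n}") { 100_000.times { a.assoc(key) } }
  end
end
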
My theory is that some people use hashes just for the convenience of
object-based lookup [ex: scope[:include] in rails] and not for the real
strength of hashes, which, as you mentioned, is "efficient
lookup/insertion/deletion"

I am not sure what you mean by "object-based lookup", especially how
it is not "efficient lookup/insertion/deletion" or differs from that.
I mean, people do use these lookups *because* they are efficient.

You're welcome.

Cheers

robert
 

Roger Pack

Well indeed I would suggest to anyone having problems with performance
of small hashes to have a look at Array#assoc and #rassoc. The only
problem is I do not know up to which number of pairs this behaves
well. I guess n<20 should be quite safe. Maybe I will find some time
to benchmark this later.

So our goal is to see if Array#assoc is "about as fast as hash based
lookup"

a = {1 => 2, 2 => 3 }
b = [[1, 2], [2, 3]]
Benchmark.measure { 1000000.times { a[2] }}
=> #<Benchmark::Tms:0x5d93dc @total=0.26, @cutime=0.0, @label="",
@stime=0.0, @real=0.270273208618164, @utime=0.26, @cstime=0.0>

=> #<Benchmark::Tms:0x5b2e1c @total=0.26, @cutime=0.0, @label="",
@stime=0.0, @real=0.264580965042114, @utime=0.26, @cstime=0.0>

faster than a hash :)

=> #<Benchmark::Tms:0x5e4408 @total=0.35, @cutime=0.0, @label="",
@stime=0.01, @real=0.35266900062561, @utime=0.34, @cstime=0.0>

Quite a bit slower.

So even with only 2 entries, #assoc is slower once the lookup hits the
last pair.

Oh well, we do what we can.
-=R
 

Brian Candler

I haven't looked into Ruby's Hash iteration in detail, but it might be
slower because of a general property of hash tables: they are usually
bigger (i.e. have more buckets) than the number of elements in the
hash, to allow for efficient hashing. Now if there are no links between
hash entries (whose maintenance I believe would increase insertion
overhead), iteration needs to visit every hash bucket, even those with
no data in them. That might account for the slower runtimes.

Aside: ruby-1.9 keeps links between hash entries, exactly for the
purpose of faster iteration. It also has the potentially useful
side-effect that hashes iterate in the same order that the entries were
inserted.

(However, depending on this feature is a good way of writing code which
won't run properly in 1.8)
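
A quick check of the ordering behavior (shown for 1.9; on 1.8 the
order follows the bucket layout and is effectively unspecified):

h = {}
h[:banana] = 2
h[:apple]  = 1
h[:cherry] = 3
p h.keys  # => [:banana, :apple, :cherry] on 1.9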
 

Roger Pack

Aside: ruby-1.9 keeps links between hash entries, exactly for the
purpose of faster iteration. It also has the potentially useful
side-effect that hashes iterate in the same order that the entries were
inserted.

Interesting.
It appears that, whatever the internal linking mechanism is, Hash#each
still differs from Array#each speed-wise:

1.8.6 Hash:
Benchmark.measure { a = {:a => :b, :c => :d}; 100_000.times { a.each {} }}
=> #<Benchmark::Tms:0x26c2f8 @total=0.17, @cutime=0.0, @label="",
@stime=0.0, @real=0.173249959945679, @utime=0.17, @cstime=0.0>

1.8.6 Array:
Benchmark.measure { a = [[:a, :b], [:c, :d]]; 100_000.times { a.each {} }}
=> #<Benchmark::Tms:0x5de7c4 @total=0.07, @cutime=0.0, @label="",
@stime=0.0, @real=0.0752890110015869, @utime=0.07, @cstime=0.0>

so 0.17 s for the Hash vs. 0.07 s for the Array

and on 1.9 Hash:=> #<Benchmark::Tms:0x130efc @label="", @real=0.142009973526001,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.14, @total=0.14>

1.9 Array:
Benchmark.measure { a = [[:a, :b], [:c, :d]]; 100_000.times { a.each {} }}
=> #<Benchmark::Tms:0x13379c @label="", @real=0.0471560955047607,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.05, @total=0.05>

so 0.14 s for the Hash vs. 0.05 s for the Array

The Hash-to-Array ratios look very similar between the two versions.

-=R
 
