PStore vs. YAML::Store


Ara.T.Howard

is there anything that can be done to improve this? i'd like to use yaml for
this project but...

~ > cat test.rb
require 'yaml/store'
require 'pstore'

def time label
  a = Time.now
  ret = yield
  b = Time.now
  puts "#{ label } @ #{ b.to_f - a.to_f }"
  ret
end

obj = Array.new(65536).map!{{ 'time' => Time.now, 'rand' => rand }}
pstore = PStore.new 'pstore.db'
yamlstore = YAML::Store.new 'yaml.db'

time("pstore dump time") do
  pstore.transaction{|db| db['obj'] = obj}
end

time("pstore load time") do
  o = pstore.transaction{|db| db['obj']}
end

time("yamlstore dump time") do
  yamlstore.transaction{|db| db['obj'] = obj}
end

time("yamlstore load time") do
  o = yamlstore.transaction{|db| db['obj']}
end


File.unlink 'pstore.db'
File.unlink 'yaml.db'


~ > ruby test.rb
pstore dump time @ 0.988970994949341
pstore load time @ 1.86728405952454
yamlstore dump time @ 42.3903992176056
yamlstore load time @ 47.6173989772797


ouch!


-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| ADDRESS :: E/GC2 325 Broadway, Boulder, CO 80305-3328
| URL :: http://www.ngdc.noaa.gov/stp/
| TRY :: for l in ruby perl;do $l -e "print \"\x3a\x2d\x29\x0a\"";done
===============================================================================
 

why the lucky stiff

Ara.T.Howard said:
> ~ > ruby test.rb
> pstore dump time @ 0.988970994949341
> pstore load time @ 1.86728405952454
> yamlstore dump time @ 42.3903992176056
> yamlstore load time @ 47.6173989772797
First off, you aren't performing just a straight 'load' in the second of
your paired tests. Every time a PStore transaction takes place, the file
is loaded at the opening of the transaction and closed at the end.
Perhaps it would be worthwhile to check the hash of an object to see if
it has changed?
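A hedged sketch of that idea: cache the deserialized table and skip the reload when the file's digest hasn't changed. (The CachedPStore wrapper and its MD5 check are illustrative, not part of PStore.)

```ruby
require 'digest/md5'
require 'pstore'
require 'tempfile'

# Illustrative wrapper: re-deserialize only when the backing file changed.
class CachedPStore
  def initialize(path)
    @store  = PStore.new(path)
    @path   = path
    @digest = nil
    @cached = nil
  end

  def read(key)
    current = File.exist?(@path) ? Digest::MD5.file(@path).hexdigest : nil
    if current != @digest          # file changed (or first read): reload
      @cached = @store.transaction(true) { |db| db[key] }
      @digest = current
    end
    @cached                        # otherwise serve the cached object
  end
end

file = Tempfile.new('pstore')
PStore.new(file.path).transaction { |db| db['obj'] = [1, 2, 3] }
store = CachedPStore.new(file.path)
store.read('obj')   # first call hits the disk
store.read('obj')   # second call is served from the cache
```

This trades a full Marshal.load for an MD5 pass over the file, so it only wins when loads dominate and writes are rare.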

So, it looks like the YAML emitter needs to be optimized. The write
operation is what is consuming time. Right now, the YAML emitter
consists largely of Ruby code. I'll comb through it today and see what
I can find.

_why
 

Ara.T.Howard

> First off, you aren't performing just a straight 'load' in the second of
> your paired tests. Every time a PStore transaction takes place, the file is
> loaded at the opening of the transaction and closed at the end.
> Perhaps it would be worthwhile to check the hash of an object to see if it
> has changed?

yeah - i know that - i just wanted to test under an actual usage scenario,
which would both be in a transaction AND prohibit checking hash codes since i
wouldn't already have a handle on the object... i just wanted to get a feel
for the performance with a list of about 50000 objects...
> So, it looks like the YAML emitter needs to be optimized. The write
> operation is what is consuming time. Right now, the YAML emitter consists
> largely of Ruby code. I'll comb through it today and see what I can find.

i don't know if i'd say that - it's perfectly acceptable (to me at least) that
a text serialization format is simply not suitable for such tasks... it's
still great for all sorts of other things. however, it __would__ be very
cool if i could use it: the project is a distributed job manager (grid) that
makes use of an on-NFS-disk priority job queue (pstore or yamlstore, accessed
safely by means of my Lockfile class) and n consumer processes competing to
run jobs from the queue. the advantage of using yaml is that standard tools
(eg. grep) might be used to check the status of jobs... it might also make
the construction of tool sets that utilize each others' stdout/stdin a bit
easier, eg.

rge_select --job_name=foobar | rge_alter --priority=42
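for instance, one job per YAML document in a single stream keeps the queue
both machine-loadable and grep-able (the field names below are made up for
illustration):

```ruby
require 'yaml'

# illustrative job records - one YAML document per job
jobs = [
  { 'name' => 'foobar', 'priority' => 42, 'status' => 'pending' },
  { 'name' => 'bazcmd', 'priority' => 7,  'status' => 'running' },
]

# each YAML.dump begins with '---', so joining them yields a valid stream
queue = jobs.map { |j| YAML.dump(j) }.join

# shell tools can inspect it:  grep -B2 'status: running' queue.yml
# and ruby can load it back losslessly:
restored = YAML.load_stream(queue)
```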

-a
 

Kirk Haines

On Thu, 13 May 2004 01:53:53 +0900, Ara.T.Howard wrote
> however, it __would__ be very cool if i could use it: the project is
> a distributed job manager (grid) that makes use of an on-NFS-disk
> priority job queue (pstore or yamlstore, accessed safely by means
> of my Lockfile class) and n consumer processes competing to run jobs
> from the queue. the advantage of using yaml is that standard tools
> (eg. grep) might be used to check the status of jobs... it might
> also make the construction of tool sets that utilize each others'
> stdout/stdin a bit easier, eg.

Ara, I'm working on something similar right now, using a different
architecture.

I'm using a combination of DRb, Rinda/RingServer, and a TupleSpace to
receive a queue of jobs that the consumer processes then take off of the
tuplespace. I'm planning on coupling the tuplespace to a persistent store
so that if something happens to the process that holds the tuplespace, the
queue isn't lost and can be restored when the tuplespace server restarts.

In my tests so far it seems to be a stable and flexible way to queue jobs,
since I can make use of objects in the queue to pass information from where
the job is generated to the consumer process. It does, however, have the
same problem that your PStore based queue does, in that simple tools like
grep cannot be used to check the status of the queue. Specialized tools
have to be written to do that.
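Stripped of DRb and the RingServer discovery, the core of the scheme boils
down to something like this (the tuple layout is just an example):

```ruby
require 'rinda/tuplespace'

# In-process sketch; a real deployment would publish the space over DRb
# (DRb.start_service) and let consumers find it via the RingServer.
ts = Rinda::TupleSpace.new

# producer side: enqueue jobs as tuples
ts.write([:job, 'foobar', { 'priority' => 42 }])
ts.write([:job, 'bazcmd', { 'priority' => 7 }])

# consumer side: take() atomically removes a matching tuple and blocks
# when none is available, which is what lets many consumers compete safely
tag, name, payload = ts.take([:job, nil, nil])
```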


Kirk Haines

P.S. Is the weather as cold and wet down in Boulder today as it is up here
in Wyoming? Good day to be inside and writing some Ruby code, I think.
 

why the lucky stiff

Ara.T.Howard said:
> i don't know if i'd say that - it's perfectly acceptable (to me at least)
> that a text serialization format is simply not suitable for such tasks...
> it's still great for all sorts of other things.
Yeah. I want Syck to be running in the vicinity of 1/3 of Marshal's
speed and the bytecode to run neck-and-neck. I stopped working on
YAML's emitter because I ran out of ideas for it. But I have some
improvements in mind now that a few months have passed since it was
originally puzzling me.

For example, all the BaseEmitter code can now be moved into the
extension. And I'll bet a big ugly bug 'll fly right out when I crack
it open.
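A scaled-down rerun of Ara's benchmark with a Marshal baseline makes the gap
visible (the object here is smaller and Time-free purely to keep the example
quick; absolute numbers will vary by machine):

```ruby
require 'yaml'

def time label
  a = Time.now
  ret = yield
  puts "#{ label } @ #{ Time.now - a }"
  ret
end

# plain hashes of floats and strings round-trip under any YAML loader
obj = Array.new(1000).map { { 'rand' => rand, 'label' => 'job' } }

marshalled = time('marshal dump') { Marshal.dump(obj) }
time('marshal load') { Marshal.load(marshalled) }

yamled = time('yaml dump') { YAML.dump(obj) }
time('yaml load') { YAML.load(yamled) }
```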

_why
 

Ara.T.Howard

> On Thu, 13 May 2004 01:53:53 +0900, Ara.T.Howard wrote
>
> Ara, I'm working on something similar right now, using a different
> architecture.
>
> I'm using a combination of DRb, Rinda/RingServer, and a TupleSpace to
> receive a queue of jobs that the consumer processes then take off of the
> tuplespace. I'm planning on coupling the tuplespace to a persistent store
> so that if something happens to the process that holds the tuplespace, the
> queue isn't lost and can be restored when the tuplespace server restarts.

that's too funny - i was just out running and thinking of __exactly__ this
scheme. actually i've been thinking of it for quite a while; the hang up is
that opening ports here is a ***ing nightmare and every time the sysads
upgrade they forget to re-open it. my goal is to create clustering software
that one can have up and running in under five minutes that depends on
nothing except for having an NFS mounted directory... we do a lot of
processing where we pool 10 or 15 nodes together for a few weeks/months and
write custom scripts driving ssh commands to 'cluster' them... this actually
works really well: right now i've got 36 nodes running jobs that take 5-10
days each. i've looked (extensively) into SGE (sun grid engine) for this
purpose for the last few months, but even the new beta6.0 has limitations
that, IMHO, make it useless for our purposes (this revolves around resource
monitoring/requests - see the SGE mailing list for threads i've been having
there).

in any case, your plan sounds really neat. one thing to consider is that it
introduces a single point of failure - if the master node blows a disk your
entire cluster will grind to a halt... SGE uses something called a shadow
master for this purpose (it's basically a hot backup). i think SGE might
have a lot of good ideas that apply to this type of arch if you haven't
already looked at it. one of the things i like about my approach is that if
even one single node remains running the cluster works, there is no central
point of failure (well, the NFS server, but all the data resides there so
we're dead if it's down anyhow). do you know if anyone has done anything to
make a tuplespace highly available - like having two nodes able to serve the
space with some sort of negotiation between the two in the case of failure?
that would be really cool.
> In my tests so far it seems to be a stable and flexible way to queue jobs,
> since I can make use of objects in the queue to pass information from where
> the job is generated to the consumer process. It does, however, have the
> same problem that your PStore based queue does, in that simple tools like
> grep cannot be used to check the status of the queue. Specialized tools
> have to be written to do that.

yeah - i'm there now too.
> P.S. Is the weather as cold and wet down in Boulder today as it is up here
> in Wyoming? Good day to be inside and writing some Ruby code, I think.

yes - my hands are so cold from my run that i can barely type! where are you
in wy?

cheers.

-a
 

Kirk Haines

On Thu, 13 May 2004 04:33:53 +0900, Ara.T.Howard wrote
> that's too funny - i was just out running and thinking of
> __exactly__ this scheme. actually i've been thinking of it for
> quite a while; the hang up is that opening ports here is a ***ing
> nightmare and every time the sysads upgrade they forget to re-open
> it. my goal is to create clustering software that one can have up
> and running in under five minutes that depends on nothing except for
> having an NFS mounted directory... we do a lot of processing where

That's one problem that I, thankfully, do not have. Given that environment,
though, I see a great deal of appeal in the NFS based approach.
> we pool 10 or 15 nodes together for a few weeks/months and write
> custom scripts driving ssh commands to 'cluster' them... this
> actually works really well: right now i've got 36 nodes running jobs
> that take 5-10 days each. i've looked (extensively) into SGE (sun

That sounds pretty neat. In my case, the application will initially be for
the private use of the client. In that capacity it won't generate enough
load to really need any distributed capabilities, but it will generate, via
a web interface, jobs that will take a few minutes to maybe an hour to run,
so being able to queue them up for processing is still valuable.

The kicker is that the longer term plans for the project call for it to be
commercialized. If that happens there will be a need for distributing the
jobs over multiple nodes in order to keep the queue from building up faster
than it can be worked through. That's really where the value of being able
to bring up additional consumer processes easily will come into play.
> at it. one of the things i like about my approach is that if even
> one single node remains running the cluster works, there is no
> central point of failure (well, the NFS server, but all the data
> resides there so we're dead if it's down anyhow). do you know if
> anyone has done anything to make a tuplespace highly available -
> like having two nodes able to serve the space with some sort of
> negotiation between the two in the case of failure? that would be
> really cool.

I've thought about that. I haven't delved into it deeply yet, but it seems
like a very doable task. Hopefully this project will evolve to the point
where I can delve into this more. It'd be great if someone has already done
this. In my searches for information on Rinda and TupleSpaces, though, I
didn't see anything.
> yes - my hands are so cold from my run that i can barely type!
> where are you in wy?

Oh, a couple hours or so north of you. I live outside of the little town of
Chugwater, Wyoming, north of Cheyenne.


Kirk Haines
 
