Beyond YAML? (scaling)


Bil Kleb

Hi,

I've been using YAML files to store hashes of numbers, e.g.,

{ ["O_Kc_01"] => [ 0.01232, 0.01212, 0.03222, ... ], ... }

This has worked wonderfully for portability and visibility
into the system as I've been creating it.

Recently, however, I've increased my problem size by orders
of magnitude in both the number of variables and the number
of associated values. The resulting YAML files are prohibitive:
10s of MBs big and requiring 10s of minutes to dump/load.

Where should I go from here?

Thanks,
 

James Edward Gray II

Bil said:
I've been using YAML files to store hashes of numbers, e.g.,

{ ["O_Kc_01"] => [ 0.01232, 0.01212, 0.03222, ... ], ... }

The resulting YAML files are prohibitive: 10s of MBs big and requiring
10s of minutes to dump/load. Where should I go from here?

Some random thoughts:

* If they are just super straightforward lists of numbers like this, a
trivial flat-file scheme, say with one number per line, might get the
job done (see the sketch after this list).
* XML can be pretty darn easy to output manually, and if you use
REXML's stream parser (not slurping everything into a DOM) you should
be able to read it reasonably quickly.
* If you are willing to sacrifice a little visibility, you can always
take the step up to a real database, even if it's just SQLite. These
have varying degrees of portability as well.
* You might want to look at KirbyBase. (It has a younger brother,
Mongoose, but that uses binary output.)
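
Something like this, untested, assuming plain string keys and an
arbitrary file name:

# Dump: one line per key -- the key first, then its numbers.
File.open("samples.txt", "w") do |f|
  data.each { |key, values| f.puts(([key] + values).join(" ")) }
end

# Load: rebuild the hash of arrays from the same flat layout.
data = {}
File.foreach("samples.txt") do |line|
  key, *values = line.split
  data[key] = values.map { |v| v.to_f }
end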

Hope something in there helps.

James Edward Gray II
 

Brian Candler

Bil said:
I've been using YAML files to store hashes of numbers, e.g.,

{ ["O_Kc_01"] => [ 0.01232, 0.01212, 0.03222, ... ], ... }

The resulting YAML files are prohibitive: 10s of MBs big and requiring
10s of minutes to dump/load. Where should I go from here?

Use a SQL database?

It all depends what sort of processing you're doing. If you're adding to a
dataset (rather than starting with an entirely fresh data set each time),
having a database makes sense. If you're doing searches across the data,
and/or if the data is larger than the available amount of RAM, then a
database makes sense. If you're only touching small subsets of the data at
any one time, then a database makes sense.

Put it another way, does your processing really require you to read the
entire collection of objects into RAM before you can perform any processing?

If it does, and your serialisation needs are as simple as it appears above,
then maybe something like CSV would be better.

O_Kc_01,0.01232,0.01212,0.03222,...
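
Roughly, with the standard csv library (untested; file and variable
names are just placeholders):

require 'csv'

# Dump: one row per key -- the key first, then its samples.
CSV.open("samples.csv", "w") do |csv|
  data.each { |key, values| csv << [key] + values }
end

# Load: first field is the key, the rest are numbers.
data = {}
CSV.foreach("samples.csv") do |row|
  key, *values = row
  data[key] = values.map { |v| v.to_f }
end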

If the source of the data is another Ruby program, then Marshal will be much
faster than YAML (but unfortunately binary).
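
For example (untested; binary mode and the file name are just
illustrative):

# Dump the whole hash in Marshal's binary format...
File.open("samples.dat", "wb") { |f| Marshal.dump(data, f) }

# ...and load it back later.
data = File.open("samples.dat", "rb") { |f| Marshal.load(f) }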

You could consider using something like Madeleine:
http://madeleine.rubyforge.org/
This snapshots your object tree to disk (using Marshal by default I think,
but can also use YAML). You can then make incremental changes and
occasionally rewrite the snapshot.
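
From memory, and untested -- check the Madeleine docs for the exact
API -- the pattern is roughly:

require 'madeleine'

# Commands mutate the system and get logged so they can be replayed.
class AddSample
  def initialize(key, value)
    @key, @value = key, value
  end
  def execute(system)
    (system[@key] ||= []) << @value
  end
end

madeleine = SnapshotMadeleine.new("samples_dir") { Hash.new }
madeleine.execute_command(AddSample.new("O_Kc_01", 0.01232))
madeleine.take_snapshot   # occasionally rewrite the snapshot to disk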

B.
 

Jamey Cribbs

Bil said:
The resulting YAML files are prohibitive: 10s of MBs big and requiring
10s of minutes to dump/load. Where should I go from here?

Hey, Bil. If you don't mind a couple of shameless plugs, you might want
to try KirbyBase or Mongoose.

KirbyBase should be faster than YAML and it still stores the data in
plain text files, if that is important to you.

Mongoose is faster than KirbyBase, at the expense of the data not being
stored as plain text.

I don't know if either will be fast enough for you.

HTH,

Jamey Cribbs
 

khaines

Bil said:
The resulting YAML files are prohibitive: 10s of MBs big and requiring
10s of minutes to dump/load. Where should I go from here?

I guess that depends on whether you need the files to be easily readable
or not. If you don't, Marshal will be faster than YAML.


Kirk Haines
 

Bil Kleb

Kirk said:
I guess that depends on whether you need the files to be easily readable
or not. If you don't, Marshal will be faster than YAML.

At this point, I'm looking for an easy out that will
reduce size and increase speed, and I'm willing to
go binary if necessary.

Of the answers I've seen so far (thanks everyone!),
migrating to Marshal seems to be the Simplest Thing
That Could Possibly Work.

Thanks,
 

Bil Kleb

Bil said:
At this point, I'm looking for an easy out that will
reduce size and increase speed, and I'm willing to
go binary if necessary.

Of the answers I've seen so far (thanks everyone!),
migrating to Marshal seems to be the Simplest Thing
That Could Possibly Work.

Well, maybe not so simple...

`dump': can't dump hash with default proc (TypeError)

which seems to be due to the trick I learned from zenspider
and drbrain to quickly set up a hash of arrays:

Hash.new{ |hash,key| hash[key]=[] }

Later,
 

Bil Kleb

Jamey said:
Hey, Bil.

Hi.

KirbyBase should be faster than YAML and it still stores the data in
plain text files, if that is important to you.

Plain text will be too big -- I've got an n^2 problem.

Mongoose is faster than KirbyBase, at the expense of the data not being
stored as plain text.

Sounds intriguing, but where can I find some docs? So far, I'm
coming up empty...

Regards,
 

Jamey Cribbs

Bil said:
Plain text will be too big -- I've got an n^2 problem.


Sounds intriguing, but where can I find some docs? So far, I'm
coming up empty...

Docs are light compared to KirbyBase. If you download the
distribution, there is the README file, some pretty good examples in the
aptly named "examples" directory, and unit tests in the "tests" directory.

HTH,

Jamey
 

Bil Kleb

Brian said:
Use a SQL database?

I always suspect that I should be doing that more often,
but as my experience with databases is rather limited
and infrequent, I always shy away from those as James
already knows. Regardless, I should probably overcome
my aggressive incompetence one day!

It all depends what sort of processing you're doing. If you're adding to a
dataset (rather than starting with an entirely fresh data set each time),
having a database makes sense.

At this point, I'm generating an entirely fresh data set
each time, but I can foresee a point where that will change
to an incremental model...

Put it another way, does your processing really require you to read the
entire collection of objects into RAM before you can perform any processing?

Yes, AFAIK, but I suppose there are algorithms that could
compute statistical correlations incrementally.

You could consider using something like Madeleine:
http://madeleine.rubyforge.org/
This snapshots your object tree to disk (using Marshal by default I think,
but can also use YAML). You can then make incremental changes and
occasionally rewrite the snapshot.

Probably not a good fit as I won't change existing data,
only add new...

Thanks,
 

Bil Kleb

Jamey said:
Docs are light compared to KirbyBase. If you download the
distribution, there is the README file, some pretty good examples in the
aptly named "examples" directory, and unit tests in the "tests" directory.

Roger, I was afraid you'd say that. :)

Could you please throw those up on your RubyForge page at some point?

Later,
 

Bil Kleb

Bil said:
Hash.new{ |hash,key| hash[key]=[] }

Is there a better way than,

samples[tag] = [] unless samples.has_key? tag
samples[tag] << sample

?

Anyway, apart from Marshal not having a convenient
#load_file method like YAML, the conversion was
very painless; it dropped file sizes considerably
and brought run times down into the minutes category
instead of hours.

Thanks,
 

Brian Candler

Bil said:
Hash.new{ |hash,key| hash[key]=[] }

Is there a better way than,

samples[tag] = [] unless samples.has_key? tag
samples[tag] << sample

Not exactly identical but usually good enough:

samples[tag] ||= []
samples[tag] << sample

And you can probably combine:

(samples[tag] ||= []) << sample
 

John Joyce

Bil said:
I always suspect that I should be doing that more often,
but as my experience with databases is rather limited
and infrequent, I always shy away from those as James
already knows. Regardless, I should probably overcome
my aggressive incompetence one day!

Don't be afraid of the database solution. In the long term, it is
much more scalable and will pay dividends immediately.
MySQL and PostgreSQL are both pretty fast and scalable. If you
have a large data set you certainly do need to plan a schema
carefully, but it should be somewhat similar to your existing data
structures anyway.
The database APIs in Ruby are pretty simple.
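
To make that concrete, here's a rough ruby/sqlite3 sketch (untested;
table and column names are only illustrative):

require 'sqlite3'

db = SQLite3::Database.new("samples.db")
db.execute("CREATE TABLE IF NOT EXISTS samples (tag TEXT, value REAL)")

# Insert inside one transaction so bulk loads stay fast.
db.transaction do
  samples.each do |tag, values|
    values.each { |v| db.execute("INSERT INTO samples VALUES (?, ?)", [tag, v]) }
  end
end

# Pull back only the subset you need instead of loading everything.
rows = db.execute("SELECT value FROM samples WHERE tag = ?", ["O_Kc_01"])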
 

Bil Kleb

Brian said:
samples[tag] ||= []
samples[tag] << sample

And you can probably combine:

(samples[tag] ||= []) << sample

Thanks; I figured the magic || operator was somehow
involved, but I still don't have that thing memorized...
yet more aggressive incompetence. :)

Thanks again,
 

William James

Bil said:

Well, maybe not so simple...

`dump': can't dump hash with default proc (TypeError)

which seems to be due to the trick I learned from zenspider
and drbrain to quickly set up a hash of arrays:

Hash.new{ |hash,key| hash[key]=[] }

Later,

Would making a copy of the hash use too much
memory or time?

h=Hash.new{ |hash,key| hash[key]=[] }

h['foo'] << 44
h['foo'] << 88

h_copy = {}
h.each{|k,v| h_copy[k] = v}
p h_copy
 

Bil Kleb

William said:
Would making a copy of the hash use too much
memory or time?

I don't know, but that's surely another way out of
the Marshal-hash-proc trap...

Thanks,
 

Jeremy Hinegardner

Bil said:
At this point, I'm looking for an easy out that will
reduce size and increase speed, and I'm willing to
go binary if necessary.

If you want to describe your data needs a bit, and what operations you
need to perform on it, I'll be happy to play around with a ruby/sqlite3
program and see what pops out.

Since there's no Ruby Quiz this weekend, we all need something to work
on :).

enjoy,

-jeremy
 

Bil Kleb

Hi,

Jeremy said:
If you want to describe your data needs a bit, and what operations you
need to perform on it, I'll be happy to play around with a ruby/sqlite3
program and see what pops out.

I've created a small tolerance DSL, and coupled with the Monte Carlo
Method[1] and the Pearson Correlation Coefficient[2], I'm performing
sensitivity analysis[3] on some of the simulation codes used for our
Orion vehicle[4]. In other words, jiggle the inputs, and see how
sensitive the outputs are and which inputs are the most influential.
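
For reference, the coefficient itself is only a few lines of plain
Ruby (an unoptimized sketch):

def pearson(xs, ys)
  n      = xs.size.to_f
  mean_x = xs.inject(0.0) { |s, v| s + v } / n
  mean_y = ys.inject(0.0) { |s, v| s + v } / n
  cov = var_x = var_y = 0.0
  xs.each_index do |i|
    dx, dy = xs[i] - mean_x, ys[i] - mean_y
    cov   += dx * dy
    var_x += dx * dx
    var_y += dy * dy
  end
  cov / Math.sqrt(var_x * var_y)
end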

The current system[5] works, and after the YAML->Marshal migration,
it scales well enough for now. The trouble is the entire architecture
is wrong if I want to monitor the Monte Carlo statistics to see
if I can stop sampling, i.e., whether the statistics have converged.

The current system consists of the following steps:

1) Prepare a "sufficiently large" number of cases, each with random
variations of the input parameters per the tolerance DSL markup.
Save all these input variables and all their samples for step 5.
2) Run all the cases.
3) Collect all the samples of all the outputs of interest.
4) Compute a running history of the output statistics to see
if they have converged, i.e., the "sufficiently large"
guess was correct -- typically a wasteful number of around 3,000.
If not, start at step 1 again with a bigger number of cases.
5) Compute normalized Pearson correlation coefficients for the
outputs and see which inputs they are most sensitive to by
using the data collected in steps 1 and 3.
6) Lobby for experiments to nail down these "tall pole" uncertainties.

This system is plagued by the question of what is "sufficiently large".
The next generation system would do steps 1 through 3 in small
batches, and at the end of each batch, check for the statistical
convergence of step 4. If convergence has been reached, shut down
the Monte Carlo process, declare victory, and proceed with steps
5 and 6.
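
The convergence check itself could ride on running (Welford-style)
statistics, updated one sample at a time, so each batch only pushes
its new outputs through something like this (untested sketch):

class RunningStats
  attr_reader :n, :mean
  def initialize
    @n, @mean, @m2 = 0, 0.0, 0.0
  end
  def add(x)
    @n    += 1
    delta  = x - @mean
    @mean += delta / @n
    @m2   += delta * (x - @mean)
  end
  def variance
    @n > 1 ? @m2 / (@n - 1) : 0.0
  end
end

# After each batch, compare the mean and variance with the previous
# batch's values and stop sampling once the change is below tolerance.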

I'm thinking this more incremental approach, combined with my lack
of database experience, would make a perfect match for Mongoose[6]...

Since there's no Ruby Quiz this weekend, we all need something to work
on :).

:)

Regards,
--
Bil Kleb
http://fun3d.larc.nasa.gov

[1] http://en.wikipedia.org/wiki/Monte_Carlo_method
[2] http://en.wikipedia.org/wiki/Pearson_correlation
[3] http://en.wikipedia.org/wiki/Sensitivity_analysis
[4] http://en.wikipedia.org/wiki/Crew_Exploration_Vehicle
[5] The current system consists of 5 Ruby codes at ~40 lines each
plus some equally tiny library routines.
[6] http://mongoose.rubyforge.org/
 
