Beyond YAML? (scaling)

Matt Lawrence

Bil Kleb said:
I've created a small tolerance DSL, and coupled with the Monte Carlo
Method[1] and the Pearson Correlation Coefficient[2], I'm performing
sensitivity analysis[2] on some of the simulation codes used for our
Orion vehicle[3]. In other words, jiggle the inputs, and see how
sensitive the outputs are and which inputs are the most influential.

You're building an Orion? Please tell me it's not true!

-- Matt
It's not what I know that counts.
It's what I can remember in time to use.
 
Jeremy Hinegardner

Okay, so I didn't get to it this weekend, but it is an interesting
project. Can you explain a bit more about the data requirements? I have
some questions inlined.

Hi,

Jeremy said:
If you want to describe your data needs a bit, and what operations you
need to perform on it, I'll be happy to play around with a Ruby/SQLite3
program and see what pops out.

I've created a small tolerance DSL, and coupled with the Monte Carlo
Method[1] and the Pearson Correlation Coefficient[2], I'm performing
sensitivity analysis[2] on some of the simulation codes used for our
Orion vehicle[3]. In other words, jiggle the inputs, and see how
sensitive the outputs are and which inputs are the most influential.

The current system[5] works, and after the YAML->Marshal migration,
it scales well enough for now. The trouble is that the entire
architecture is wrong if I want to monitor the Monte Carlo statistics
to see whether I can stop sampling, i.e., whether the statistics have
converged.

The current system consists of the following steps:

1) Prepare a "sufficiently large" number of cases, each with random
variations of the input parameters per the tolerance DSL markup.
Save all these input variables and all their samples for step 5.

So the DSL generates a 'large' number of cases listing the input
parameters and their associated values. Is the list of input parameters
static across all cases, or per set of cases?

That is, for a given "experiment" you have the same set of parameters,
just with a "large" number of different values to put in those
parameters.

2) Run all the cases.
3) Collect all the samples of all the outputs of interest.

I'm also assuming that the output(s) for a given "experiment" would be
consistent in their parameters?
4) Compute a running history of the output statistics to see
if they have converged, i.e., the "sufficiently large"
guess was correct -- typically a wasteful number of around 3,000.
If not, start at step 1 again with a bigger number of cases.

So right now you are doing, say, for a given experiment f:

inputs  : i1, i2, i3, ..., in
outputs : o1, o2, o3, ..., om

run f(i1, i2, i3, ..., in) -> [o1, ..., om], where the values for
i1, ..., in are "jiggled", and you have around 3,000 different sets
of inputs.
5) Compute normalized Pearson correlation coefficients for the
outputs and see which inputs they are most sensitive to by
using the data collected in steps 1 and 3.
6) Lobby for experiments to nail down these "tall pole" uncertainties.

This system is plagued by the question of what is "sufficiently large".
The next generation system would do steps 1 through 3 in small
batches, and at the end of each batch, check for the statistical
convergence of step 4. If convergence has been reached, shut down
the Monte Carlo process, declare victory, and proceed with steps
5 and 6. [...]
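
In rough Ruby, the batch loop I have in mind might look something like
this sketch -- 'sample_inputs', 'run_case', the batch size, and the 1%
tolerance are all placeholders, not the real system:

# Stand-ins for the real pieces (purely illustrative):
def sample_inputs
  { 'F_x' => 1.05 + (rand - 0.5) * 0.4 }        # e.g., 1.05+/-0.2
end

def run_case(inputs)
  { 'heating' => 70.0 + 5.0 * inputs['F_x'] }   # fake simulation
end

BATCH   = 50
EPSILON = 0.01   # relative change in the mean considered "converged"

outputs   = Hash.new { |h, k| h[k] = [] }
prev_mean = {}

loop do
  # Steps 1-3 in a small batch: sample, run, collect.
  BATCH.times do
    inputs = sample_inputs
    run_case(inputs).each { |name, value| outputs[name] << value }
  end

  # Step 4: stop once every output's running mean has settled.
  converged = outputs.all? do |name, samples|
    mean = samples.inject(0.0) { |s, x| s + x } / samples.size
    ok   = prev_mean[name] &&
           (mean - prev_mean[name]).abs <= EPSILON * mean.abs
    prev_mean[name] = mean
    ok
  end

  break if converged   # declare victory; proceed to steps 5 and 6
end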

[5] The current system consists of 5 Ruby codes at ~40 lines each
plus some equally tiny library routines.

Would you be willing to share these? I'm not sure if what I'm assuming
about your problem is correct or not, but it intrigues me and I'd like
to fiddle with the general problem :). I'd be happy to talk offlist
too.

enjoy,

-jeremy
 
Bil Kleb

Jeremy said:
So the DSL generates a 'large' number of cases listing the input
parameters and their associated values. Is the list of input parameters
static across all cases, or per set of cases?

The input parameter names are static across all cases, but
for each case, the parameter value will vary randomly according
to the tolerance DSL, e.g., 1.05+/-0.2. Currently, I have all
these as a hash of arrays, e.g.,

{ 'F_x' => [ 1.23, 1.12, 0.92, 1.01, ... ],
  'q_r' => [ 1.34e+9, 3.89e+8, 8.98e+8, 5.23e+9, ... ], ... }

where 1.23 is the sample for input parameter 'F_x' for the
first case, 1.12 is the sample for the second case, etc.,
and 1.34e+9 is the sample for input parameter 'q_r' for
the first case, and so forth.
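
The markup itself is tiny; a toy take on the sampler behind it might
look like the following -- the regexp-free parse and the uniform
distribution are simplifications, not the real DSL:

# Toy tolerance markup: "1.05+/-0.2" -> nominal value and tolerance.
def sampler(spec)
  nominal, tol = spec.split("+/-").map { |s| s.to_f }
  lambda { nominal + (2.0 * rand - 1.0) * tol }   # uniform in +/- tol
end

draw = sampler("1.05+/-0.2")
5.times { puts draw.call }   # five random samples of 'F_x', say
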
That is, for a given "experiment" you have the same set of parameters,
just with a "large" number of different values to put in those
parameters.

Yes, if I understand you correctly.
I'm also assuming that the output(s) for a given "experiment" would be
consistent in their parameters?

Yes, the output parameter hash has the same structure as the input
hash, although it typically has fewer parameters. The number
and sequence of values (realizations) for each output parameter,
however, corresponds exactly to the array of samples for each input
parameter. For example, the outputs hash may look like,

{ 'heating' => [ 75.23, 76.54, ... ],
  'stag_pr' => [ 102.13, 108.02, ... ], ... }

Here, the 2nd realization of the output 'stag_pr', 108.02, corresponds
to the 2nd case and is associated with the 2nd entries in the 'F_x'
and 'q_r' arrays, 1.12 and 3.89e+8, respectively.
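
Given that layout, the step-5 Pearson calculation is just array
arithmetic over the two hashes. A toy sketch -- the 'pearson' helper
is mine, and the last two 'heating' values are made up just to fill
out the arrays:

# Pearson correlation coefficient between two equal-length sample arrays.
def pearson(xs, ys)
  n   = xs.size.to_f
  mx  = xs.inject(0.0) { |s, x| s + x } / n
  my  = ys.inject(0.0) { |s, y| s + y } / n
  cov = sx = sy = 0.0
  xs.zip(ys).each do |x, y|
    cov += (x - mx) * (y - my)
    sx  += (x - mx) ** 2
    sy  += (y - my) ** 2
  end
  cov / Math.sqrt(sx * sy)
end

inputs  = { 'F_x' => [ 1.23, 1.12, 0.92, 1.01 ],
            'q_r' => [ 1.34e+9, 3.89e+8, 8.98e+8, 5.23e+9 ] }
outputs = { 'heating' => [ 75.23, 76.54, 74.01, 75.88 ] }

outputs.each do |oname, osamples|
  inputs.each do |iname, isamples|
    printf "%-8s vs %-4s : %+.3f\n", oname, iname, pearson(isamples, osamples)
  end
end
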
So right now you are doing, say, for a given experiment f:

inputs  : i1, i2, i3, ..., in
outputs : o1, o2, o3, ..., om

run f(i1, i2, i3, ..., in) -> [o1, ..., om], where the values for
i1, ..., in are "jiggled", and you have around 3,000 different sets
of inputs.

Yes, where 'inputs' and 'outputs' are vectors m and k long,
respectively; so you have a matrix of values, e.g.,

input1  : i1_1, i1_2, i1_3, ..., i1_n
input2  : i2_1, i2_2, i2_3, ..., i2_n
   .        .     .     .          .
   .      [ m x n matrix ]         .
   .        .     .     .          .
inputm  : im_1, im_2, im_3, ..., im_n

output1 : o1_1, o1_2, o1_3, ..., o1_n
output2 : o2_1, o2_2, o2_3, ..., o2_n
   .        .     .     .          .
   .      [ k x n matrix ]         .
   .        .     .     .          .
outputk : ok_1, ok_2, ok_3, ..., ok_n
Would you be willing to share these?

/I/ am willing, but unfortunately I'm also mired in red tape.
I'm not sure if what I'm assuming
about your problem is correct or not, but it intrigues me and I'd like
to fiddle with the general problem :).

The more I explain it, the more I learn about it; so thanks
for the interest.

Regards,
 
Robert Dober

Bil, maybe you want to have a look at JSON
http://json.rubyforge.org/

I do not have time right now to benchmark the reading, but the writing
gives some spectacular results; look at this:

517/17 > cat test-out.rb && ruby test-out.rb
# vim: sts=2 sw=2 expandtab nu tw=0:

require 'yaml'
require 'rubygems'
require 'json'
require 'benchmark'

# 100-key hash: { "k_001" => 1, ..., "k_100" => 100 }
@hash = Hash[*(1..100).map { |l| "k_%03d" % l }.zip([*1..100]).flatten]

Benchmark.bmbm do |bench|
  bench.report( "yaml" ) { 50.times { @hash.to_yaml } }
  bench.report( "json" ) { 50.times { @hash.to_json } }
end

Rehearsal ----------------------------------------
yaml   0.630000   0.030000   0.660000 (  0.748123)
json   0.020000   0.000000   0.020000 (  0.079732)
------------------------------- total: 0.680000sec

           user     system      total        real
yaml   0.590000   0.000000   0.590000 (  0.754097)
json   0.020000   0.000000   0.020000 (  0.018363)

Looks promising, n'est-ce pas?

Maybe you want to investigate that a little bit more. JSON is of
course very readable; look, e.g., at this:

irb(main):002:0> require 'rubygems'
=> true
irb(main):003:0> require 'json'
=> true
irb(main):004:0> {:a => [*42..84]}.to_json
=> "{\"a\":[42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84]}"

HTH
Robert
 
Bil Kleb

Robert said:
           user     system      total        real
yaml   0.590000   0.000000   0.590000 (  0.754097)
json   0.020000   0.000000   0.020000 (  0.018363)

Looks promising, n'est-ce pas?

Neat. Thanks for the data. For now though, Marshal
is an adequate alternative.
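
For reference, the whole Marshal round-trip is only a couple of lines
anyway (the file name and sample data here are arbitrary):

samples  = { 'F_x' => [ 1.23, 1.12 ], 'q_r' => [ 1.34e+9, 3.89e+8 ] }
File.open("samples.dat", "wb") { |f| Marshal.dump(samples, f) }
restored = File.open("samples.dat", "rb") { |f| Marshal.load(f) }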

Regards,
 
Jeremy Hinegardner

Bil said:
Here, the 2nd realization of the output 'stag_pr', 108.02, corresponds
to the 2nd case and is associated with the 2nd entries in the 'F_x'
and 'q_r' arrays, 1.12 and 3.89e+8, respectively.

Yeah, that's what explains it best for me.

The more I explain it, the more I learn about it; so thanks
for the interest.

I haven't forgotten about this. I have a couple of ways to manage it
with SQLite, but how you want to deal with the data after running all
the cases could influence the direction.

You said you want to test for convergence after cases are run?
Basically, you want to do some calculations using the inputs and
outputs after each case (or N cases) and save those calculations off
to the side until they reach some error limit, etc.? For this, do you
want to just record "after case N, my running calculations (f, g, h)
over the cases run so far are x, y, z"?

That is, something like:

Case   Running calc f   Running calc g   Running calc h
   1              1.0              2.0              3.0
  10              2.0              4.0              9.0
 ....
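
One strawman for how that might land in SQLite via the sqlite3 gem --
the schema is just a guess at your needs, not a proposal:

require 'rubygems'
require 'sqlite3'

db = SQLite3::Database.new("experiment.db")
db.execute_batch <<-SQL
  CREATE TABLE IF NOT EXISTS samples (
    case_id INTEGER,
    name    TEXT,     -- input or output parameter name
    kind    TEXT,     -- 'input' or 'output'
    value   REAL
  );
  CREATE TABLE IF NOT EXISTS running_stats (
    after_case INTEGER,
    calc       TEXT,  -- 'f', 'g', 'h', ...
    value      REAL
  );
SQL

# After case N, record the running calculations off to the side:
db.execute("INSERT INTO running_stats VALUES (?, ?, ?)", 10, "f", 2.0)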

Also, do you do any comparison between experiments? That is, for one
scenario where you have a few thousand cases with jiggled inputs,
would you do anything with those results in relation to some other
scenario? Or are all scenarios/experiments isolated?

enjoy,

-jeremy
 
