John said:
The model would have to be a lot more complicated than that. There is a
base number of required columns. The kind suppliers of the data randomly
add extra columns, randomly permute the order in which the columns
appear, and, for date columns
I'm going to ignore this because these things have absolutely no effect
on the analysis whatsoever. Random order of columns? How could this
influence any statistics, counting, Bayesian, or otherwise?
randomly choose the day-month-year order,
how much punctuation to sprinkle between the digits, and whether to
append some bonus extra bytes like " 00:00:00".
I absolutely do not understand how bonus bytes or any of the above would
selectively adversely affect any single type of statistics--if your
converter doesn't recognize it then your converter doesn't recognize it
and so it will fail under every circumstance and influence any and all
statistical analysis. Under such conditions, I want very robust
analysis--probably more robust than simple counting statistics. And I
definitely want something more efficient.
Past stats on failure to cast are no guide to the future
Not true when using Bayesian statistics (or any type of inference, for
that matter). For example, where did you get the 90% cutoff? From
experience? I thought past stats were no guide to future expectations?
... a sudden
change in the failure rate can be caused by the kind folk introducing a
new null designator i.e. outside the list ['', 'NULL', 'NA', 'N/A',
'#N/A!', 'UNK', 'UNKNOWN', 'NOT GIVEN', etc etc etc]
Using the rough model, and having no idea that they threw in a few weird
designators (so you might now see a 20% failure rate instead of the 2% I
modeled previously), the *low probabilities of false positives* (say 5%
of the non-Int columns evaluate to integer--after you've eliminated
dates because you remembered to test more restrictive types first) would
still *drive the statistics*. Remember, the posteriors become priors
after the first test.
P_1(H) = 0.2 (Just a guess, it'll wash after about 3 tests.)
P(D|H) = 0.8 (Are you sure they have it together enough to pay you?)
P(D|H') = 0.05 (5% of the names, salaries, etc., evaluate to integer?)
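Concretely, each pass/fail test is just one application of Bayes' rule,
applied over and over. A minimal sketch in Python (a throwaway function of
mine, nothing more, using the guesses above):

def update(prior, p_d_given_h, p_d_given_not_h):
    # One Bayes update: P(H|D) = P(D|H)P(H) / [P(D|H)P(H) + P(D|H')P(H')]
    num = p_d_given_h * prior
    return num / (num + p_d_given_not_h * (1.0 - prior))

print(update(0.2, 0.8, 0.05))   # -> 0.8, the first posterior in the chain below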
Let's model failures, since the companies you work with have bad typists.
We have to reverse the probabilities for this:
Pf_1(H) = 0.2 (Only if this is round 1; after that, the running posterior.)
Pf(D|H) = 0.2 (We *guess* 20% of the values in a true Int column fail to
               cast--carpal tunnel, ennui, etc.)
Pf(D|H') = 0.80 (80% of non-Int columns' values also fail the Int cast.)
You might take issue with Pf(D|H) = 0.2. I encourage you to try a range
of values here to see what the posteriors look like (the sketch after the
worked sequence below does exactly that). You'll find that this is not as
important as the *low false positive rate*.
For example, let's not stop until we are 99.9% sure one way or the other.
With this cutoff, let's suppose this deplorable display of integer typing:
pass-fail-fail-pass-pass-pass
which might be expected from the above very pessimistic priors (maybe
you got data from the _Apathy_Coalition_ or the _Bad_Typists_Union_ or
the _Put_a_Quote_Around_Every_5th_Integer_League_):
P_1(H|D) = 0.800 (pass)
P_2(H|D) = 0.500 (fail)
P_3(H|D) = 0.200 (fail--don't stop, not 99.9% sure)
P_4(H|D) = 0.800 (pass)
P_5(H|D) = 0.9846153 (pass--not there yet)
P_6(H|D) = 0.9990243 (pass--got it!)
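In case anyone wants to check the arithmetic or plug in their own guesses,
here is a small throwaway script (mine, purely illustrative) that replays
that sequence with the numbers above and also tries a range of Pf(D|H)
values as suggested earlier:

CUTOFF = 0.999   # stop when we're 99.9% sure one way or the other

def update(prior, like_h, like_not_h):
    # posterior = P(D|H)P(H) / [P(D|H)P(H) + P(D|H')P(H')]
    num = like_h * prior
    return num / (num + like_not_h * (1.0 - prior))

def posteriors(outcomes, prior=0.2, p_pass=(0.80, 0.05), p_fail=(0.20, 0.80)):
    # Feed each pass/fail through the update; the posterior becomes the
    # next prior.  Stop once we're past the cutoff either way.
    chain = []
    for outcome in outcomes:
        like_h, like_not_h = p_pass if outcome == 'pass' else p_fail
        prior = update(prior, like_h, like_not_h)
        chain.append(prior)
        if prior >= CUTOFF or prior <= 1.0 - CUTOFF:
            break
    return chain

seq = ['pass', 'fail', 'fail', 'pass', 'pass', 'pass']
for i, p in enumerate(posteriors(seq), 1):
    print("P_%d(H|D) = %.7f (%s)" % (i, p, seq[i - 1]))

# Vary Pf(D|H): the posterior after the same six values stays above 0.99
# across this whole range, because the 5% false-positive rate is what
# drives the result.
for pf in (0.1, 0.2, 0.3, 0.4):
    final = posteriors(seq, p_fail=(pf, 0.80))[-1]
    print("Pf(D|H) = %.1f -> %.7f" % (pf, final))

The first loop prints the same six posteriors listed above (up to rounding).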
Now this is with 5% of all salaries, names of people, addresses, favorite
colors, etc., evaluating to integers. (Pausing while I remember fondly
Uncle 41572--such a nice guy...funny name, though.)
There is also the problem of first-time-participating organisations --
in police parlance, they have no priors
Yes, because they teleported from Alpha Centauri, where organizations are
fundamentally different from those here on Earth, and we cannot make any
reasonable assumptions about them--like that they will indeed cough up
money when the time comes, or that they speak a dialect of an Earth
language, or that they even generate spreadsheets for us to parse.
James