py_genetic
This is excellent advice, thank you gentlemen.
Paddy:
We can't really make assumptions about the data source in this arena.
I fully agree with your point, but if we had the luxury of really
knowing the source, we wouldn't be having this conversation. The files
we deal with could be consumer data files, log files, financial
files... all from different users, BCP-ed out, CSV from Excel, etc.
However, I agree that we can make one basic assumption: for each column
there is a correct and, furthermore, optimal format. In many cases we
may have a supplied "data dictionary" with the data, in which case you
are right and we can override much of this process, except we still
need to find the optimal format, e.g. int8 vs. int16.
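As a rough sketch of the width check I mean (just an illustration: the
type names are only labels, and it assumes the column has already been
identified as integer):

def smallest_int_type(values):
    """Pick the narrowest signed integer width that holds every sampled
    value.  `values` is any iterable of Python ints (a column sample)."""
    lo, hi = min(values), max(values)
    for name, bits in (('int8', 8), ('int16', 16), ('int32', 32), ('int64', 64)):
        if -(1 << (bits - 1)) <= lo and hi < (1 << (bits - 1)):
            return name
    return 'int64'  # fall back to the widest width considered here

print(smallest_int_type([-1200, 0, 31000]))   # -> 'int16'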
James:
A Bayesian method was my initial thought as well. The key to this
method, I feel, is getting a solid random sample of the entire file
without having to load the whole beast into memory.
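For that sampling step, one option (just a sketch; the file name and
sample size below are placeholders) is classic reservoir sampling,
which takes a uniform random sample of rows in a single pass and keeps
at most k lines in memory:

import random

def sample_lines(path, k=1000):
    """Return k lines chosen uniformly at random from the file at `path`,
    reading it once and holding at most k lines (reservoir sampling)."""
    reservoir = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line)
            else:
                # Replace an existing entry with probability k/(i+1)
                j = random.randint(0, i)
                if j < k:
                    reservoir[j] = line
    return reservoir

# rows = sample_lines('big_export.csv', k=5000)  # hypothetical file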
What are your thoughts on other techniques? For example, training a
neural net and feeding it a sample; this might be nice and very fast,
since after training (we would have to create a good global training
set) we could just do a quick transform on a column sample and average
the probabilities of the output units (one output unit for each type).
The question here would be encoding; any ideas? A binary representation
of the variables? Furthermore, naive Bayes, decision trees, etc.?
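On the encoding question, one possibility I can imagine (purely a
sketch, not something I've tested; the feature choice and the max_len
cap are arbitrary) is a fixed-length vector of character-class
fractions per cell, which a net or naive Bayes could consume directly:

def encode_cell(text, max_len=12):
    """Encode a raw cell as a fixed-length numeric feature vector:
    clipped length plus digit/alpha/separator fractions."""
    text = text.strip()
    n = float(len(text)) or 1.0
    digits = sum(c.isdigit() for c in text) / n
    alphas = sum(c.isalpha() for c in text) / n
    seps   = sum(c in '.,-/: ' for c in text) / n
    return [min(len(text), max_len) / float(max_len), digits, alphas, seps]

print(encode_cell('2007-04-01'))   # date-like: mostly digits and separators
print(encode_cell('1532.75'))      # numeric
print(encode_cell('ACME Corp'))    # text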
John:
The approach that I've adopted is to test the values in a column for all
types, and choose the non-text type that has the highest success rate
(provided the rate is greater than some threshold e.g. 90%, otherwise
it's text).
For large files, taking a 1/N sample can save a lot of time with little
chance of misdiagnosis.
I like your approach; it could be simple. Initially, I was thinking of
a loop that did exactly this: just test the sample columns for "hits"
and take the best. Thanks for the sample code.
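For what it's worth, the loop I had in mind looks roughly like this
(the 90% threshold, the candidate type list, and the two date layouts
are just assumptions to make it concrete):

import time

def guess_type(sample_values, threshold=0.9):
    """Try each non-text converter on a column sample and return the type
    with the best hit rate, falling back to 'text' below the threshold."""

    def is_date(s):
        # Two common layouts only; a real version would try many more.
        for fmt in ('%Y-%m-%d', '%m/%d/%Y'):
            try:
                time.strptime(s, fmt)
                return True
            except ValueError:
                pass
        return False

    def parses_as(convert):
        def check(s):
            convert(s)   # raises ValueError if the value doesn't fit
            return True
        return check

    candidates = [
        ('int',   parses_as(int)),
        ('float', parses_as(float)),
        ('date',  is_date),
    ]
    best_name, best_rate = 'text', 0.0
    for name, convert in candidates:
        hits = 0
        for value in sample_values:
            try:
                if convert(value):
                    hits += 1
            except (ValueError, TypeError):
                pass
        rate = hits / float(len(sample_values) or 1)
        if rate > best_rate:
            best_name, best_rate = name, rate
    return best_name if best_rate >= threshold else 'text'

print(guess_type(['1', '2', 'x', '4']))        # 'text' (only 75% parse as int)
print(guess_type(['1.5', '2', '3.25', '4']))   # 'float'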
George:
Thank you for offering to share your transform function. I'm very
interested.