Converting strings to their most efficient types: '1' --> 1, 'A' --> 'A', '1.2' --> 1.2


py_genetic

Hello,

I'm importing large text files of data using csv. I would like to add
some more auto-sensing abilities. I'm considering sampling the data
file and doing some fuzzy logic scoring on the attributes (columns in a
database/CSV file, e.g. height, weight, income, etc.) to determine the
most efficient 'type' to convert each attribute column into for further
processing and efficient storage...

Example row from sampled file data: [ ['8', '2.33', 'A', 'BB', 'hello
there', '100,000,000,000'], [next row...] ....]

Aside from a missing attribute designator, we can assume that the same
type of data continues through a column. For example, a string, int8,
int16, float etc.

1. What is the most efficient way in Python to test whether a string
can be converted into a given numeric type, or left alone if it's
really a string like 'A' or 'hello'? Speed is key. Any thoughts?

2. Is there anything out there already which deals with this issue?

Thanks,
Conor
 

Dustan

Hello,

I'm importing large text files of data using csv. I would like to add
some more auto-sensing abilities. I'm considering sampling the data
file and doing some fuzzy logic scoring on the attributes (columns in a
database/CSV file, e.g. height, weight, income, etc.) to determine the
most efficient 'type' to convert each attribute column into for further
processing and efficient storage...

Example row from sampled file data: [ ['8', '2.33', 'A', 'BB', 'hello
there', '100,000,000,000'], [next row...] ....]

Aside from a missing attribute designator, we can assume that the same
type of data continues through a column. For example, a string, int8,
int16, float etc.

1. What is the most efficient way in Python to test whether a string
can be converted into a given numeric type, or left alone if it's
really a string like 'A' or 'hello'? Speed is key. Any thoughts?

given the string s:

try:
    integerValue = int(s)
except ValueError:
    try:
        floatValue = float(s)
    except ValueError:
        pass
    else:
        s = floatValue
else:
    s = integerValue

One caveat: int() won't automatically identify base 8 or base 16
integers in a string unless you pass an explicit base, e.g. int(s, 16)
or int(s, 0) (and there is no such handling for base 8/16 floats).
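
If speed is the concern, this try/except style is usually the right
call in Python: the successful cast is cheap, and you only pay for the
exception machinery on values that actually fail. Wrapped up for reuse
it might look like this (untested sketch):

def coerce(s):
    """Return s as an int if possible, else a float, else unchanged."""
    for cast in (int, float):
        try:
            return cast(s)
        except ValueError:
            pass
    return s

Note that, like plain int()/float(), this won't accept comma-grouped
values such as '100,000,000,000' from the sample row; strip the commas
first if you want those treated as numbers.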
 

James Stroud

py_genetic said:
Hello,

I'm importing large text files of data using csv. I would like to add
some more auto-sensing abilities. I'm considering sampling the data
file and doing some fuzzy logic scoring on the attributes (columns in a
database/CSV file, e.g. height, weight, income, etc.) to determine the
most efficient 'type' to convert each attribute column into for further
processing and efficient storage...

Example row from sampled file data: [ ['8', '2.33', 'A', 'BB', 'hello
there', '100,000,000,000'], [next row...] ....]

Aside from a missing attribute designator, we can assume that the same
type of data continues through a column. For example, a string, int8,
int16, float etc.

1. What is the most efficient way in Python to test whether a string
can be converted into a given numeric type, or left alone if it's
really a string like 'A' or 'hello'? Speed is key. Any thoughts?

2. Is there anything out there already which deals with this issue?

Thanks,
Conor

This is untested, but here is an outline to do what you want.

First convert rows to columns:


columns = zip(*rows)


Okay, that was a lot of typing. Now, you should run down the columns,
testing with the most restrictive type and working to less restrictive
types. You will also need to keep in mind the potential for commas in
your numbers--so you will need to write your own converters, determining
for yourself what literals map to what values. Only you can decide what
you really want here. Here is a minimal idea of how I would do it:


def make_int(astr):
    if not astr:
        return 0
    else:
        return int(astr.replace(',', ''))

def make_float(astr):
    if not astr:
        return 0.0
    else:
        return float(astr.replace(',', ''))

make_str = lambda s: s


Now you can put the converters in a list, remembering to order them.


converters = [make_int, make_float, make_str]


Now, go down the columns checking, moving to the next, less restrictive,
converter when a particular converter fails. We assume that the make_str
identity operator will never fail. We could leave it out and have a
flag, etc., for efficiency, but that is left as an exercise.


new_columns = []
for column in columns:
    for converter in converters:
        try:
            new_column = [converter(v) for v in column]
            break
        except:
            continue
    new_columns.append(new_column)


For no reason at all, convert back to rows:


new_rows = zip(*new_columns)


You must decide for yourself how to deal with ambiguities. For example,
will '1.0' be a float or an int? The above assumes you want all values
in a column to have the same type. Reordering the loops can give mixed
types in columns, but would not fulfill your stated requirements. Some
things are not as efficient as they might be (for example, eliminating
the clumsy make_str). But adding tests to improve efficiency would cloud
the logic.
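
As a quick sanity check against the OP's sample row, the outline above
should come out roughly as follows (untested):

rows = [['8', '2.33', 'A', 'BB', 'hello there', '100,000,000,000']]
columns = zip(*rows)
# ... run the converter loop above ...
# new_columns -> [[8], [2.33], ['A'], ['BB'], ['hello there'], [100000000000]]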

James
 

John Machin

py_genetic said:
Hello,

I'm importing large text files of data using csv. I would like to add
some more auto-sensing abilities. I'm considering sampling the data
file and doing some fuzzy logic scoring on the attributes (columns in a
database/CSV file, e.g. height, weight, income, etc.) to determine the
most efficient 'type' to convert each attribute column into for further
processing and efficient storage...

Example row from sampled file data: [ ['8', '2.33', 'A', 'BB', 'hello
there', '100,000,000,000'], [next row...] ....]

Aside from a missing attribute designator, we can assume that the same
type of data continues through a column. For example, a string, int8,
int16, float etc.

1. What is the most efficient way in Python to test whether a string
can be converted into a given numeric type, or left alone if it's
really a string like 'A' or 'hello'? Speed is key. Any thoughts?

2. Is there anything out there already which deals with this issue?

Thanks,
Conor

This is untested, but here is an outline to do what you want.

First convert rows to columns:


columns = zip(*rows)


Okay, that was a lot of typing. Now, you should run down the columns,
testing with the most restrictive type and working to less restrictive
types. You will also need to keep in mind the potential for commas in
your numbers--so you will need to write your own converters, determining
for yourself what literals map to what values. Only you can decide what
you really want here. Here is a minimal idea of how I would do it:


def make_int(astr):
    if not astr:
        return 0
    else:
        return int(astr.replace(',', ''))

def make_float(astr):
    if not astr:
        return 0.0
    else:
        return float(astr.replace(',', ''))

make_str = lambda s: s


Now you can put the converters in a list, remembering to order them.


converters = [make_int, make_float, make_str]


Now, go down the columns checking, moving to the next, less restrictive,
converter when a particular converter fails. We assume that the make_str
identity operator will never fail. We could leave it out and have a
flag, etc., for efficiency, but that is left as an exercise.


new_columns = []
for column in columns:
    for converter in converters:
        try:
            new_column = [converter(v) for v in column]
            break
        except:
            continue
    new_columns.append(new_column)


For no reason at all, convert back to rows:


new_rows = zip(*new_columns)


You must decide for yourself how to deal with ambiguities. For example,
will '1.0' be a float or an int? The above assumes you want all values
in a column to have the same type. Reordering the loops can give mixed
types in columns, but would not fulfill your stated requirements. Some
things are not as efficient as they might be (for example, eliminating
the clumsy make_str). But adding tests to improve efficiency would cloud
the logic.

[apologies in advance if this appears more than once]

This approach is quite reasonable, IF:
(1) the types involved follow a simple "ladder" hierarchy [ints pass the
float test, floats pass the str test]
(2) the supplier of the data has ensured that all values in a column are
actually instances of the intended type.

Constraint (1) falls apart if you need dates. Consider 31/12/99,
31/12/1999, 311299 [int?], 31121999 [int?], 31DEC99, ... and that's
before you allow for dates in three different orders (dmy, mdy, ymd).

Constraint (2) just falls apart -- with user-supplied data, there seem
to be no rules but Rafferty's and no laws but Murphy's.

The approach that I've adopted is to test the values in a column for all
types, and choose the non-text type that has the highest success rate
(provided the rate is greater than some threshold e.g. 90%, otherwise
it's text).

For large files, taking a 1/N sample can save a lot of time with little
chance of misdiagnosis.

Example: file of 1,079,000 records, with 15 columns, ultimately
diagnosed as being 8 x text, 3 x int, 1 x float, 2 x date (dmy order),
and [no kidding] 1 x date (ymd order). Using N==101 took about 15
seconds [Python 2.5.1, Win XP Pro SP2, 3.2GHz dual-core]; N==1 takes
about 900 seconds. The "converter" function for dates is written in C.
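
In outline, the scoring step looks something like the sketch below
(untested as posted; the converters are simplified stand-ins, and as
noted the real date converter is written in C and copes with far more
formats):

import datetime

NULLS = set(['', 'NULL', 'NA', 'N/A'])     # extend to taste

def try_int(s):
    return int(s.replace(',', ''))

def try_float(s):
    return float(s.replace(',', ''))

def try_date_dmy(s):
    # stand-in only
    return datetime.datetime.strptime(s, '%d/%m/%Y')

# most restrictive first, so on a tied success rate the narrower type
# wins (ints also pass the float test)
CANDIDATES = [('int', try_int), ('date_dmy', try_date_dmy), ('float', try_float)]

def diagnose_column(values, step=101, threshold=0.9):
    """Guess a column's type from a 1-in-`step` sample of its values."""
    sample = [v for v in values[::step] if v.strip() not in NULLS]
    if not sample:
        return 'text'
    best_name, best_rate = 'text', 0.0
    for name, convert in CANDIDATES:
        hits = 0
        for v in sample:
            try:
                convert(v)
                hits += 1
            except (ValueError, TypeError):
                pass
        rate = float(hits) / len(sample)
        if rate > best_rate:
            best_name, best_rate = name, rate
    return best_name if best_rate >= threshold else 'text'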

Cheers,
John
 

Paddy

Hello,

I'm importing large text files of data using csv. I would like to add
some more auto-sensing abilities. I'm considering sampling the data
file and doing some fuzzy logic scoring on the attributes (columns in a
database/CSV file, e.g. height, weight, income, etc.) to determine the
most efficient 'type' to convert each attribute column into for further
processing and efficient storage...

Example row from sampled file data: [ ['8', '2.33', 'A', 'BB', 'hello
there', '100,000,000,000'], [next row...] ....]

Aside from a missing attribute designator, we can assume that the same
type of data continues through a column. For example, a string, int8,
int16, float etc.

1. What is the most efficient way in Python to test whether a string
can be converted into a given numeric type, or left alone if it's
really a string like 'A' or 'hello'? Speed is key. Any thoughts?

2. Is there anything out there already which deals with this issue?

Thanks,
Conor

You might try investigating what can generate your data. With luck,
it could turn out that the data generator is methodical and column
data-types are consistent and easily determined by testing the
first or second row. At worst, you will get to know how much you
must check for human errors.

- Paddy.
 

James Stroud

John said:
The approach that I've adopted is to test the values in a column for all
types, and choose the non-text type that has the highest success rate
(provided the rate is greater than some threshold e.g. 90%, otherwise
it's text).

For large files, taking a 1/N sample can save a lot of time with little
chance of misdiagnosis.


Why stop there? You could lower the minimum 1/N by straightforward
application of Bayesian statistics, using results from previous tables
as priors.


James
 

John Machin

Why stop there? You could lower the minimum 1/N by straightforward
application of Bayesian statistics, using results from previous tables
as priors.

The example I gave related to one file out of several files prepared at
the same time by the same organisation from the same application by the
same personnel using the same query tool for a yearly process which has
been going on for several years. All files for a year should be in the
same format, and the format should not change year by year, and the
format should match the agreed specifications ... but this doesn't
happen. Against that background, please explain to me how I can use
"results from previous tables as priors".

Cheers,
John
 

John Machin

Hello,

I'm importing large text files of data using csv. I would like to add
some more auto-sensing abilities. I'm considering sampling the data
file and doing some fuzzy logic scoring on the attributes (columns in a
database/CSV file, e.g. height, weight, income, etc.) to determine the
most efficient 'type' to convert each attribute column into for further
processing and efficient storage...

Example row from sampled file data: [ ['8', '2.33', 'A', 'BB', 'hello
there', '100,000,000,000'], [next row...] ....]

Aside from a missing attribute designator, we can assume that the same
type of data continues through a column. For example, a string, int8,
int16, float etc.

1. What is the most efficient way in Python to test whether a string
can be converted into a given numeric type, or left alone if it's
really a string like 'A' or 'hello'? Speed is key. Any thoughts?

2. Is there anything out there already which deals with this issue?

Thanks,
Conor

You might try investigating what can generate your data. With luck,
it could turn out that the data generator is methodical and column
data-types are consistent and easily determined by testing the
first or second row. At worst, you will get to know how much you
must check for human errors.

Here you go, Paddy, the following has been generated very methodically;
what data type is the first column? What is the value in the first
column of the 6th row likely to be?

"$39,082.00","$123,456.78"
"$39,113.00","$124,218.10"
"$39,141.00","$124,973.76"
"$39,172.00","$125,806.92"
"$39,202.00","$126,593.21"

N.B. I've kindly given you five lines instead of one or two :)

Cheers,
John
 

James Stroud

John said:
Against that background, please explain to me how I can use
"results from previous tables as priors".

Cheers,
John

It depends on how you want to model your probabilities, but, as an
example, you might find the following frequencies of columns in all
tables you have parsed from this organization: 35% Strings, 25% Floats,
20% Ints, 15% Date MMDDYYYY, and 5% Date YYMMDD. Let's say that you have
also used prior counting statistics to find that there is a 2% error
rate in the columns (2% of the values of a typical Float column fail to
cast to Float, 2% of values in Int columns fail to cast to Int, and
so on, though these need not all be equal). Let's also say that for
non-Int columns, 1% of cells randomly selected cast to Int.

These percentages could be converted to probabilities and these
probabilities could be used as priors in a Bayesian scheme to determine a
column type. Let's say you take one cell randomly and it can be cast to
an Int. What is the probability that the column is an Int? (See
<http://tinyurl.com/2bdn38>.)

P_1(H) = 0.20 --> Prior (20% prior columns are Int columns)
P(D|H) = 0.98
P(D|H') = 0.01

P_1(H|D) = 0.9607843 --> Posterior & New Prior "P_2(H)"


Now with one test positive for Int, you are getting pretty certain you
have an Int column. Now we take a second cell randomly from the same
column and find that it too casts to Int.

P_2(H) = 0.9607843 --> Confidence it's an Int column from round 1
P(D|H) = 0.98
P(D|H') = 0.02

P_2(H|D) = 0.9995836


Yikes! But I'm still not convinced it's an Int because I haven't even had
to wait a millisecond to get the answer. Let's burn some more clock cycles.

Let's say we really have an Int column and get "lucky" with our tests (P
= 0.98**4 = 92% chance) and find two more random cells successfully cast
to Int:

P_4(H) = 0.9999957
P(D|H) = 0.98
P(D|H') = 0.02

P(H|D) = 0.9999999


I don't know about you, but after only four positives, my calculator ran
out of significant digits and so I am at least 99.99999% convinced it's
an Int column and I'm going to stop wasting CPU cycles and move on to
test the next column. How do you know it's not a float? Well, given
floats with only one decimal place, you would expect only 1/10th could
be cast to Int (were the tenths-decimal place to vary randomly). You
could generate a similar statistical model to convince yourself with
vanishing uncertainty that the column that tests positive for Int four
times in a (random sample) is not actually a Float (even with only one
decimal place known).
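
Since the update is the same every round, the arithmetic above is easy
to script (quick untested sketch, using P(D|H') = 0.01 in every round):

def update(prior, p_d_given_h, p_d_given_not_h):
    # one application of Bayes' rule: P(H|D) from a prior and the two likelihoods
    numerator = p_d_given_h * prior
    return numerator / (numerator + p_d_given_not_h * (1.0 - prior))

p = 0.20                        # P_1(H): 20% of past columns were Int
for sampled_cell in range(4):   # four random cells, all casting to Int
    p = update(p, 0.98, 0.01)
    print(p)
# prints roughly 0.9608, 0.99958, 0.999996, 0.99999996 -- the figures
# above, up to rounding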


James
 

James Stroud

James said:
Now with one test positive for Int, you are getting pretty certain you
have an Int column. Now we take a second cell randomly from the same
column and find that it too casts to Int.

P_2(H) = 0.9607843 --> Confidence it's an Int column from round 1
P(D|H) = 0.98
P(D|H') = 0.02

P_2(H|D) = 0.9995836


Yikes! But I'm still not convinced it's an Int because I haven't even had
to wait a millisecond to get the answer. Let's burn some more clock cycles.

Let's say we really have an Int column and get "lucky" with our tests (P
= 0.98**4 = 92% chance) and find two more random cells successfully cast
to Int:

P_4(H) = 0.9999957
P(D|H) = 0.98
P(D|H') = 0.02

P(H|D) = 0.9999999


I had typos. P(D|H') should be 0.01 for all rounds.

Also, I should clarify that 4 of 4 are positive with no fails observed.
Integrating fails would use the last posterior as a prior in a similar
scheme.

Also, given a 1% false positive rate, after only 4 rounds you are 1 -
(0.01**4) = 99.999999% sure your observations aren't because you
accidentally pulled 4 of the false positives in succession.

James
 

John Machin

It depends on how you want to model your probabilities, but, as an
example, you might find the following frequencies of columns in all
tables you have parsed from this organization: 35% Strings, 25% Floats,
20% Ints, 15% Date MMDDYYYY, and 5% Date YYMMDD.

The model would have to be a lot more complicated than that. There is a
base number of required columns. The kind suppliers of the data randomly
add extra columns, randomly permute the order in which the columns
appear, and, for date columns, randomly choose the day-month-year order,
how much punctuation to sprinkle between the digits, and whether to
append some bonus extra bytes like " 00:00:00".
Let's say that you have
also used prior counting statistics to find that there is a 2% error
rate in the columns (2% of the values of a typical Float column fail to
cast to Float, 2% of values in Int columns fail to cast to Int, and
so on, though these need not all be equal). Let's also say that for
non-Int columns, 1% of cells randomly selected cast to Int.

Past stats on failure to cast are no guide to the future ... a sudden
change in the failure rate can be caused by the kind folk introducing a
new null designator i.e. outside the list ['', 'NULL', 'NA', 'N/A',
'#N/A!', 'UNK', 'UNKNOWN', 'NOT GIVEN', etc etc etc]


There is also the problem of first-time-participating organisations --
in police parlance, they have no priors :)

So, all in all, Bayesian inference doesn't seem much use in this scenario.
These percentages could be converted to probabilities and these
probabilities could be used as priors in a Bayesian scheme to determine a
column type. Let's say you take one cell randomly and it can be cast to
an Int. What is the probability that the column is an Int? (See
<http://tinyurl.com/2bdn38>.)

That's fancy -- a great improvement on the slide rule and squared paper :)

Cheers,
John
 

James Stroud

John said:
So, all in all, Bayesian inference doesn't seem much use in this scenario.

This is equivalent to saying that any statistical analysis doesn't seem
much use in this scenario--but you go ahead and use statistics anyway?
 

Paddy

Hello,
I'm importing large text files of data using csv. I would like to add
some more auto-sensing abilities. I'm considering sampling the data
file and doing some fuzzy logic scoring on the attributes (columns in a
database/CSV file, e.g. height, weight, income, etc.) to determine the
most efficient 'type' to convert each attribute column into for further
processing and efficient storage...
Example row from sampled file data: [ ['8', '2.33', 'A', 'BB', 'hello
there', '100,000,000,000'], [next row...] ....]
Aside from a missing attribute designator, we can assume that the same
type of data continues through a column. For example, a string, int8,
int16, float etc.
1. What is the most efficient way in Python to test whether a string
can be converted into a given numeric type, or left alone if it's
really a string like 'A' or 'hello'? Speed is key. Any thoughts?
2. Is there anything out there already which deals with this issue?
Thanks,
Conor
You might try investigating what can generate your data. With luck,
it could turn out that the data generator is methodical and column
data-types are consistent and easily determined by testing the
first or second row. At worst, you will get to know how much you
must check for human errors.

Here you go, Paddy, the following has been generated very methodically;
what data type is the first column? What is the value in the first
column of the 6th row likely to be?

"$39,082.00","$123,456.78"
"$39,113.00","$124,218.10"
"$39,141.00","$124,973.76"
"$39,172.00","$125,806.92"
"$39,202.00","$126,593.21"

N.B. I've kindly given you five lines instead of one or two :)

Cheers,
John

John,
I've had cases where some investigation of the source of the data has
completely removed any ambiguity. I've found that data was generated
from one or two sources and been able to know what every field type is
by just examining a field that I have determined will tell me the
source program that generated the data.

I have also found that the flow generating some data is subject to
hand editing so have had to both put in extra checks in my reader, and
on some occasions created specific editors to replace hand edits by
checked assisted hand edits.
I stand by my statement: "Know the source of your data"; it's less
likely to bite!

- Paddy.
 

James Stroud

John said:
The model would have to be a lot more complicated than that. There is a
base number of required columns. The kind suppliers of the data randomly
add extra columns, randomly permute the order in which the columns
appear, and, for date columns

I'm going to ignore this because these things have absolutely no effect
on the analysis whatsoever. Random order of columns? How could this
influence any statistics, counting, Bayesian, or otherwise?

randomly choose the day-month-year order,
how much punctuation to sprinkle between the digits, and whether to
append some bonus extra bytes like " 00:00:00".

I absolutely do not understand how bonus bytes or any of the above would
selectively adversely affect any single type of statistics--if your
converter doesn't recognize it then your converter doesn't recognize it
and so it will fail under every circumstance and influence any and all
statistical analysis. Under such conditions, I want very robust
analysis--probably more robust than simple counting statistics. And I
definitely want something more efficient.
Past stats on failure to cast are no guide to the future

Not true when using Bayesian statistics (and any type of inference for
that matter). For example, where did you get 90% cutoff? From
experience? I thought that past stats are no guide to future expectations?

... a sudden
change in the failure rate can be caused by the kind folk introducing a
new null designator i.e. outside the list ['', 'NULL', 'NA', 'N/A',
'#N/A!', 'UNK', 'UNKNOWN', 'NOT GIVEN', etc etc etc]

Using the rough model and having no idea that they threw in a few weird
designators so that you might suspect a 20% failure (instead of the 2% I
modeled previously), the *low probabilities of false positives* (say 5%
of the non-Int columns evaluate to integer--after you've eliminated
dates because you remembered to test more restrictive types first) would
still *drive the statistics*. Remember, the posteriors become priors
after the first test.

P_1(H) = 0.2 (Just a guess, it'll wash after about 3 tests.)
P(D|H) = 0.8 (Are you sure they have it together enough to pay you?)
P(D|H') = 0.05 (5% of the names, salaries, etc., evaluate to int?)

Let's model failures since the companies you work with have bad typists.
We have to reverse the probabilities for this:

Pf_1(H) = 0.2 (Only if this is round 1.)
Pf(D|H) = 0.2 (We *guess* a 20% chance by random any column is Int.)
Pf(D|H') = 0.80 (80% of Ints fail because of carpal tunnel, ennui, etc.)

You might take issue with Pf(D|H) = 0.2. I encourage you to try a range
of values here to see what the posteriors look like. You'll find that
this is not as important as the *low false positive rate*.

For example, let's not stop until we are 99.9% sure one way or the other.
With this cutoff, let's suppose this deplorable display of typing integers:

pass-fail-fail-pass-pass-pass

which might be expected from the above very pessimistic priors (maybe
you got data from the _Apathy_Coalition_ or the _Bad_Typists_Union_ or
the _Put_a_Quote_Around_Every_5th_Integer_League_):

P_1(H|D) = 0.800 (pass)
P_2(H|D) = 0.500 (fail)
P_3(H|D) = 0.200 (fail--don't stop, not 99.9% sure)
P_4(H|D) = 0.800 (pass)
P_6(H|D) = 0.9846153 (pass--not there yet)
P_7(H|D) = 0.9990243 (pass--got it!)

Now this is with 5% of all salaries, names of people, addresses, favorite
colors, etc., evaluating to integers. (Pausing while I remember fondly
Uncle 41572--such a nice guy...funny name, though.)
There is also the problem of first-time-participating organisations --
in police parlance, they have no priors :)

Yes, because they teleported from Alpha Centauri where organizations are
fundamentally different from here on Earth and we can not make any
reasonable assumptions about them--like that they will indeed cough up
money when the time comes or that they speak a dialect of an earth
language or that they even generate spreadsheets for us to parse.

James
 

John Machin

Hello,
I'm importing large text files of data using csv. I would like to add
some more auto-sensing abilities. I'm considering sampling the data
file and doing some fuzzy logic scoring on the attributes (columns in a
database/CSV file, e.g. height, weight, income, etc.) to determine the
most efficient 'type' to convert each attribute column into for further
processing and efficient storage...
Example row from sampled file data: [ ['8', '2.33', 'A', 'BB', 'hello
there', '100,000,000,000'], [next row...] ....]
Aside from a missing attribute designator, we can assume that the same
type of data continues through a column. For example, a string, int8,
int16, float etc.
1. What is the most efficient way in Python to test whether a string
can be converted into a given numeric type, or left alone if it's
really a string like 'A' or 'hello'? Speed is key. Any thoughts?
2. Is there anything out there already which deals with this issue?
Thanks,
Conor
You might try investigating what can generate your data. With luck,
it could turn out that the data generator is methodical and column
data-types are consistent and easily determined by testing the
first or second row. At worst, you will get to know how much you
must check for human errors.
Here you go, Paddy, the following has been generated very methodically;
what data type is the first column? What is the value in the first
column of the 6th row likely to be?

"$39,082.00","$123,456.78"
"$39,113.00","$124,218.10"
"$39,141.00","$124,973.76"
"$39,172.00","$125,806.92"
"$39,202.00","$126,593.21"

N.B. I've kindly given you five lines instead of one or two :)

Cheers,
John

John,
I've had cases where some investigation of the source of the data has
completely removed any ambiguity. I've found that data was generated
from one or two sources and been able to know what every field type is
by just examining a field that I have determined will tell me the
source program that generated the data.

The source program that produced my sample dataset was Microsoft Excel
(or OOo Calc or Gnumeric); it was induced to perform a "save as CSV"
operation. Does that help you determine the true nature of the first column?

I have also found that the flow generating some data is subject to
hand editing so have had to both put in extra checks in my reader, and
on some occasions created specific editors to replace hand edits by
checked assisted hand edits.
I stand by my statement: "Know the source of your data"; it's less
likely to bite!

My dataset has a known source, and furthermore meets your "lucky"
criteria (methodically generated, column type is consistent) -- I'm
waiting to hear from you about the "easily determined" part :)

Cheers,
John
 

Paddy

On 19/05/2007 3:14 PM, Paddy wrote:
Hello,
I'm importing large text files of data using csv. I would like to add
some more auto-sensing abilities. I'm considering sampling the data
file and doing some fuzzy logic scoring on the attributes (columns in a
database/CSV file, e.g. height, weight, income, etc.) to determine the
most efficient 'type' to convert each attribute column into for further
processing and efficient storage...
Example row from sampled file data: [ ['8', '2.33', 'A', 'BB', 'hello
there', '100,000,000,000'], [next row...] ....]
Aside from a missing attribute designator, we can assume that the same
type of data continues through a column. For example, a string, int8,
int16, float etc.
1. What is the most efficient way in Python to test whether a string
can be converted into a given numeric type, or left alone if it's
really a string like 'A' or 'hello'? Speed is key. Any thoughts?
2. Is there anything out there already which deals with this issue?
Thanks,
Conor
You might try investigating what can generate your data. With luck,
it could turn out that the data generator is methodical and column
data-types are consistent and easily determined by testing the
first or second row. At worst, you will get to know how much you
must check for human errors.
Here you go, Paddy, the following has been generated very methodically;
what data type is the first column? What is the value in the first
column of the 6th row likely to be?
"$39,082.00","$123,456.78"
"$39,113.00","$124,218.10"
"$39,141.00","$124,973.76"
"$39,172.00","$125,806.92"
"$39,202.00","$126,593.21"
N.B. I've kindly given you five lines instead of one or two :)
Cheers,
John
John,
I've had cases where some investigation of the source of the data has
completely removed any ambiguity. I've found that data was generated
from one or two sources and been able to know what every field type is
by just examining a field that I have determined will tell me the
source program that generated the data.

The source program that produced my sample dataset was Microsoft Excel
(or OOo Calc or Gnumeric); it was induced to perform a "save as CSV"
operation. Does that help you determine the true nature of the first column?


I have also found that the flow generating some data is subject to
hand editing so have had to both put in extra checks in my reader, and
on some occasions created specific editors to replace hand edits by
checked assisted hand edits.
I stand by my statement: "Know the source of your data"; it's less
likely to bite!

My dataset has a known source, and furthermore meets your "lucky"
criteria (methodically generated, column type is consistent) -- I'm
waiting to hear from you about the "easily determined" part :)

Cheers,
John

John,
Open up your Excel spreadsheet and check what the format is for the
column. It's not a contest. If you KNOW what generated the data then
USE that knowledge. It would be counter-productive to do otherwise
surely?

(I know, don't call you Shirley :)

- Paddy.
 

John Machin

On 20/05/2007 8:52 PM, Paddy wrote:
On 19/05/2007 3:14 PM, Paddy wrote:
Hello,
I'm importing large text files of data using csv. I would like to add
some more auto-sensing abilities. I'm considering sampling the data
file and doing some fuzzy logic scoring on the attributes (columns in a
database/CSV file, e.g. height, weight, income, etc.) to determine the
most efficient 'type' to convert each attribute column into for further
processing and efficient storage...
Example row from sampled file data: [ ['8', '2.33', 'A', 'BB', 'hello
there', '100,000,000,000'], [next row...] ....]
Aside from a missing attribute designator, we can assume that the same
type of data continues through a column. For example, a string, int8,
int16, float etc.
1. What is the most efficient way in Python to test whether a string
can be converted into a given numeric type, or left alone if it's
really a string like 'A' or 'hello'? Speed is key. Any thoughts?
2. Is there anything out there already which deals with this issue?
Thanks,
Conor
You might try investigating what can generate your data. With luck,
it could turn out that the data generator is methodical and column
data-types are consistent and easily determined by testing the
first or second row. At worst, you will get to know how much you
must check for human errors.
Here you go, Paddy, the following has been generated very methodically;
what data type is the first column? What is the value in the first
column of the 6th row likely to be?
"$39,082.00","$123,456.78"
"$39,113.00","$124,218.10"
"$39,141.00","$124,973.76"
"$39,172.00","$125,806.92"
"$39,202.00","$126,593.21"
N.B. I've kindly given you five lines instead of one or two :)
Cheers,
John
John,
I've had cases where some investigation of the source of the data has
completely removed any ambiguity. I've found that data was generated
from one or two sources and been able to know what every field type is
by just examining a field that I have determined will tell me the
source program that generated the data.
The source program that produced my sample dataset was Microsoft Excel
(or OOo Calc or Gnumeric); it was induced to perform a "save as CSV"
operation. Does that help you determine the true nature of the first column?
My dataset has a known source, and furthermore meets your "lucky"
criteria (methodically generated, column type is consistent) -- I'm
waiting to hear from you about the "easily determined" part :)
Cheers,
John

John,
Open up your Excel spreadsheet and check what the format is for the
column. It's not a contest. If you KNOW what generated the data then
USE that knowledge. It would be counter-productive to do otherwise
surely?

(I know, don't call you Shirley :)

.... and I won't call you Patsy more than this once :)

Patsy, re-read. The scenario is that I don't have the Excel
spreadsheet; I have a CSV file. The format is rather obviously
"currency" but that is not correct. The point is that (1) it was
methodically [mis-]produced by a known source [your criteria] but the
correct type of column 1 can't be determined by inspection of a value
or two.

Yeah, it's not a contest, but I was kinda expecting that you might
have taken first differences of column 1 by now ...

Cheers,
John
 

George Sakkis

Hello,

I'm importing large text files of data using csv. I would like to add
some more auto-sensing abilities. I'm considering sampling the data
file and doing some fuzzy logic scoring on the attributes (columns in a
database/CSV file, e.g. height, weight, income, etc.) to determine the
most efficient 'type' to convert each attribute column into for further
processing and efficient storage...

Example row from sampled file data: [ ['8', '2.33', 'A', 'BB', 'hello
there', '100,000,000,000'], [next row...] ....]

Aside from a missing attribute designator, we can assume that the same
type of data continues through a column. For example, a string, int8,
int16, float etc.

1. What is the most efficient way in Python to test whether a string
can be converted into a given numeric type, or left alone if it's
really a string like 'A' or 'hello'? Speed is key. Any thoughts?

2. Is there anything out there already which deals with this issue?

There are several replies to your immediate column type-guessing
problem, so I'm not going to address that. Once you decide the
converters for each column, you have to pass the dataset through them
(and optionally rearrange or omit some of them). That's easy to
hardcode for a few datasets with the same or similar structure but it
soon gets tiring.

I had a similar task recently so I wrote a general and efficient (at
least as far as pure Python goes) row transformer that does the
repetitive work. Below are some examples from an IPython session; let
me know if this might be useful and I'll post it here or at the
Cookbook.

George


#======= RowTransformer examples ============================

In [1]: from transrow import RowTransformer
In [2]: rows = [row.split(',') for row in "1,3.34,4-3.2j,John",
"4,4,4,4", "0,-1.1,3.4,None"]
In [3]: rows
Out[3]:
[['1', '3.34', '4-3.2j', 'John'],
['4', '4', '4', '4'],
['0', '-1.1', '3.4', 'None']]

# adapt the first three columns; the rest are omitted
In [4]: for row in RowTransformer([int,float,complex])(rows):
   ...:     print row
   ...:
[1, 3.3399999999999999, (4-3.2000000000000002j)]
[4, 4.0, (4+0j)]
[0, -1.1000000000000001, (3.3999999999999999+0j)]

# return the 2nd column as float, followed by the 4th column as is
In [5]: for row in RowTransformer({1:float, 3:None})(rows):
   ....:     print row
   ....:
[3.3399999999999999, 'John']
[4.0, '4']
[-1.1000000000000001, 'None']


# return the 3rd column as complex, followed by the 1st column as int
In [6]: for row in RowTransformer([(2,complex),(0,int)])(rows):
   ....:     print row
   ....:
[(4-3.2000000000000002j), 1]
[(4+0j), 4]
[(3.3999999999999999+0j), 0]

# return the first three columns, adapted by eval()
# XXX: use eval() only for trusted data
In [7]: for row in RowTransformer(include=range(3), default_adaptor=eval)(rows):
   ....:     print row
   ....:
[1, 3.3399999999999999, (4-3.2000000000000002j)]
[4, 4, 4]
[0, -1.1000000000000001, 3.3999999999999999]

# equivalent to the previous
In [8]: for row in RowTransformer(default_adaptor=eval, exclude=[3])(rows):
   ....:     print row
   ....:
[1, 3.3399999999999999, (4-3.2000000000000002j)]
[4, 4, 4]
[0, -1.1000000000000001, 3.3999999999999999]
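
To give a flavour of the interface, a bare-bones class consistent with
the session above might look roughly like this (a sketch only; the real
transrow module is not shown here and handles more cases):

class RowTransformer(object):
    """Sketch of a row transformer matching the examples above.

    `adaptors` may be a sequence of converters (applied positionally),
    a sequence of (column_index, converter) pairs, or a mapping
    {column_index: converter}; a converter of None passes the value
    through unchanged.  Alternatively pass `include` (column indices)
    or `exclude`, together with `default_adaptor`.
    """
    def __init__(self, adaptors=None, include=None, exclude=None,
                 default_adaptor=None):
        specs = []
        if adaptors is not None:
            if hasattr(adaptors, 'items'):              # {index: adaptor}
                specs = sorted(adaptors.items())
            else:
                adaptors = list(adaptors)
                if adaptors and isinstance(adaptors[0], tuple):
                    specs = adaptors                    # [(index, adaptor), ...]
                else:
                    specs = list(enumerate(adaptors))   # positional adaptors
        elif include is not None:
            specs = [(i, default_adaptor) for i in include]
        self._specs = specs
        self._exclude = set(exclude or [])
        self._default = default_adaptor

    def __call__(self, rows):
        for row in rows:
            specs = self._specs or [(i, self._default)
                                    for i in range(len(row))
                                    if i not in self._exclude]
            yield [row[i] if adapt is None else adapt(row[i])
                   for i, adapt in specs]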
 

James Stroud

I need to correct myself here before someone else does. I didn't
actually reverse the probabilities as promised for the failing case. It
was late last night and I was starting to get a little cloudy.
Pf(D|H) = 0.2 (We *guess* a 20% chance by random any column is Int.)

This can be read instead as "probability that it will fail the test
given that it is really from an Int column", which is 20% of the time.
Pf(D|H') = 0.80 (80% of Ints fail because of carpal tunnel, ennui, etc.)

This can be read as "probability it will fail the test if it is not
really from an Int column". That would be Pf(D|H') = 0.95 (e.g. testing
the inability to cast to Int is a pretty bad test for Int because it
gives false positives 95% of the time).

This change doesn't change the conclusions of the example, with the
P_3(H|D) = 0.1505882 (lower than 20%, but nowhere near the 0.001 cutoff
to conclude the column is not Int) and the final probability P_7(H|D) =
0.9986247 (rounding up to our 0.999 criterion for confidence that it is
an Int ;).
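
With the corrected likelihoods, the whole pass-fail-fail-pass-pass-pass
example replays like this (untested sketch, re-defining the same update
step as before for completeness):

def update(prior, p_d_given_h, p_d_given_not_h):
    numerator = p_d_given_h * prior
    return numerator / (numerator + p_d_given_not_h * (1.0 - prior))

PASS = (0.80, 0.05)   # P(casts to Int | Int column), P(casts to Int | not Int)
FAIL = (0.20, 0.95)   # P(fails to cast | Int column), P(fails to cast | not Int)

p = 0.20              # starting prior
for outcome in (PASS, FAIL, FAIL, PASS, PASS, PASS):
    p = update(p, *outcome)
    print(p)
# the third posterior comes out near 0.1506 and the final one near
# 0.9986, matching the figures quoted above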

James
 

Neil Cerutti

Hello,

I'm importing large text files of data using csv. I would like to add
some more auto-sensing abilities. I'm considering sampling the data
file and doing some fuzzy logic scoring on the attributes (columns in a
database/CSV file, e.g. height, weight, income, etc.) to determine the
most efficient 'type' to convert each attribute column into for further
processing and efficient storage...

Example row from sampled file data: [ ['8', '2.33', 'A', 'BB', 'hello
there', '100,000,000,000'], [next row...] ....]

Aside from a missing attribute designator, we can assume that the same
type of data continues through a column. For example, a string, int8,
int16, float etc.

1. What is the most efficient way in Python to test whether a string
can be converted into a given numeric type, or left alone if it's
really a string like 'A' or 'hello'? Speed is key. Any thoughts?

2. Is there anything out there already which deals with this issue?

Thanks,
Conor

You might try investigating what can generate your data. With luck,
it could turn out that the data generator is methodical and column
data-types are consistent and easily determined by testing the
first or second row. At worst, you will get to know how much you
must check for human errors.

Here you go, Paddy, the following has been generated very methodically;
what data type is the first column? What is the value in the first
column of the 6th row likely to be?

"$39,082.00","$123,456.78"
"$39,113.00","$124,218.10"
"$39,141.00","$124,973.76"
"$39,172.00","$125,806.92"
"$39,202.00","$126,593.21"

N.B. I've kindly given you five lines instead of one or two :)

My experience with Excel-related mistakes leads me to think that
column one contains dates that got somehow misformatted on
export.
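
That guess is easy to check. If the first-column values are Excel
1900-date-system serial numbers, they decode to consecutive month-end
dates (quick sketch; the epoch below allows for Excel's 1900 leap-year
quirk):

import datetime

EXCEL_EPOCH = datetime.date(1899, 12, 30)   # Excel 1900 date system

for cell in ['$39,082.00', '$39,113.00', '$39,141.00',
             '$39,172.00', '$39,202.00']:
    serial = int(float(cell.lstrip('$').replace(',', '')))
    print(EXCEL_EPOCH + datetime.timedelta(days=serial))
# 2006-12-31, 2007-01-31, 2007-02-28, 2007-03-31, 2007-04-30 -- month
# ends, so the sixth row would presumably be "$39,233.00", i.e. 2007-05-31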
 
