max_ny wrote On 01/12/06 17:11,:
Hii,
i am trying write some code to transfer data from csv file to a
database.
my problem is to determine the datatype of various column values in the
csv file. I read the csv file column wise and have to determine what
datatype each col. contains, so to map it to correct type for creating
tables in the DB.
Can someone provide me any reference for determining datatype of a col.
value read from csv file, (my google search couldn't yield exact
results); i can't figure out how should i determine the type of each
column data
There is no foolproof way. Fundamentally, all you
get from the file is a bunch of Strings that are the
external appearance of data fields. You're trying to
deduce the original data by gazing at its appearance,
and your problem is that there may be many different
internal data items that could produce the same String.
"4" - It looks like a number, but is it an integer
or a floating-point value? Is it a percentage
that really represents 0.04? Or maybe it's
part of a date: do the nearby columns contain
7 and 1776?
"F" - It might be a short string, as in F Murray
Abraham. Then again, it might be the Boolean
value "false."
"1:10" - Very likely a time, but what kind? If it
denotes an interval, does it mean 70 seconds
or 70 minutes? If it denotes a specific time
of day, does it mean early in the morning or
early in the afternoon? Then again, it might
be the odds your bookie is quoting.
"1/2" - If a date, is it January 2 or February 1,
and in what year? Or is it just 0.5 (does
the next column say "tsp")?
The very best way to resolve such ambiguities is to
get some outside authority (e.g., the user) to tell you
what the columns contain. Failing that, a reasonable
approach might be to take a sample of fifty or a hundred
rows, and try several ways of "understanding" the set of
values you find in each column: If one of them works
more often than the others, guess that it's the right
interpretation. (It may be a good idea to ignore the
first several rows; they might be column headings from
a spreadsheet or something of the kind.)
For example, if the values found in column 3 are
Quantity, "", 42, 29, 3001, 86, 107, 1336
you might well decide that the first two rows are titles
and the real data consists of integers in rows 3 and on.
(They could also be floating-point values or percentages,
of course, but a reasonable call might be to prefer the
"simpler" explanation.) On the other hand, if you found
Exchange, Rate, 1.0035, 0.9926, 1, 99.4217
it would be reasonable to assume you're looking at a
column of floating-point numbers plus two rows of titles.
Yes, one of the values would also work as an integer
(it's telling you how many gold splonders you can get for
one gold splonder), but floating-point successfully
"explains" more of the values and so is more likely to
characterize the whole column.
Similarly, an entire column of T and F or of Y and N
is probably a column of Booleans; if A and R and L and
S are also mixed in, it's probably a column of strings.
A column consisting entirely of integers in the range 1..12
may well be month numbers; 1..31 might mean day numbers.
Maybe you're lucky and there's a leading "$" or trailing
"%" to give you a hint (careful of those dollars, though:
are they American, Canadian, Australian, or Doodledorfian?).
Start with a set of "value interpreters" of various kinds,
apply each of them to all the values in a column, and score
each by how many times it's able to make sense of what it
finds.
It's going to be an imprecise business ...