Determine datatype

max_ny · Jan 12, 2006

Hii,
i am trying write some code to transfer data from csv file to a
database.
my problem is to determine the datatype of various column values in the
csv file. I read the csv file column wise and have to determine what
datatype each col. contains, so to map it to correct type for creating
tables in the DB.
Can someone provide me any reference for determining datatype of a col.
value read from csv file, (my google search couldn't yield exact
results); i can't figure out how should i determine the type of each
column data

thanks

Thomas Kellerer · Jan 12, 2006

max_ny wrote on 12.01.2006 23:11:

Hii,
i am trying write some code to transfer data from csv file to a
database.
my problem is to determine the datatype of various column values in the
csv file. I read the csv file column wise and have to determine what
datatype each col. contains, so to map it to correct type for creating
tables in the DB.
Can someone provide me any reference for determining datatype of a col.
value read from csv file, (my google search couldn't yield exact
results); i can't figure out how should i determine the type of each
column data

There is no way you can do it 100%. Any character sequence you read
could be mapped to several data types.

Take the sequence "20060112". This could be a VARCHAR column that
happens to have no characters (only numbers), or it could be an integer
value, or it could be a float value that happens to have no decimal
digits, or it could be a date.

I don't see a different solution then to let your user provide the
necessary information.

Thomas

h4hrr · Jan 12, 2006

okk, very true

but suppose, i this csv data

21.10.2003;6 ;0.99;Feb 96;1
22.10.2003;5 ;0.99;Feb 66;1
23.10.2003;10;0.99;Feb 34;1
24.10.2003;3 ;1 ;02.Jun;1

def. i can't say anything about "20060112", what type it can be

but if its 0.99 (like in col 3, and fourth value is int in that col.);
and i read it each value from file as String[], how can i determine its
float and fourth value is int?

i mean, any java commands or way to determine it..

thanks.. (just wanna be sure what exactly i shud do and how to do or
let user decide it)

Eric Sosman · Jan 12, 2006

max_ny wrote On 01/12/06 17:11,:

Hii,
i am trying write some code to transfer data from csv file to a
database.
my problem is to determine the datatype of various column values in the
csv file. I read the csv file column wise and have to determine what
datatype each col. contains, so to map it to correct type for creating
tables in the DB.
Can someone provide me any reference for determining datatype of a col.
value read from csv file, (my google search couldn't yield exact
results); i can't figure out how should i determine the type of each
column data

There is no foolproof way. Fundamentally, all you
get from the file is a bunch of Strings that are the
external appearance of data fields. You're trying to
deduce the original data by gazing at its appearance,
and your problem is that there may be many different
internal data items that could produce the same String.

"4" - It looks like a number, but is it an integer
or a floating-point value? Is it a percentage
that really represents 0.04? Or maybe it's
part of a date: do the nearby columns contain
7 and 1776?

"F" - It might be a short string, as in F Murray
Abraham. Then again, it might be the Boolean
value "false."

"1:10" - Very likely a time, but what kind? If it
denotes an interval, does it mean 70 seconds
or 70 minutes? If it denotes a specific time
of day, does it mean early in the morning or
early in the afternoon? Then again, it might
be the odds your bookie is quoting.

"1/2" - If a date, is it January 2 or February 1,
and in what year? Or is it just 0.5 (does
the next column say "tsp")?

The very best way to resolve such ambiguities is to
get some outside authority (e.g., the user) to tell you
what the columns contain. Failing that, a reasonable
approach might be to take a sample of fifty or a hundred
rows, and try several ways of "understanding" the set of
values you find in each column: If one of them works
more often than the others, guess that it's the right
interpretation. (It may be a good idea to ignore the
first several rows; they might be column headings from
a spreadsheet or something of the kind.)

For example, if the values found in column 3 are

Quantity, "", 42, 29, 3001, 86, 107, 1336

you might well decide that the first two rows are titles
and the real data consists of integers in rows 3 and on.
(They could also be floating-point values or percentages,
of course, but a reasonable call might be to prefer the
"simpler" explanation.) On the other hand, if you found

Exchange, Rate, 1.0035, 0.9926, 1, 99.4217

it would be reasonable to assume you're looking at a
column of floating-point numbers plus two rows of titles.
Yes, one of the values would also work as an integer
(it's telling you how many gold splonders you can get for
one gold splonder), but floating-point successfully
"explains" more of the values and so is more likely to
characterize the whole column.

Similarly, an entire column of T and F or of Y and N
is probably a column of Booleans; if A and R and L and
S are also mixed in, it's probably a column of strings.
A column consisting entirely of integers in the range 1..12
may well be month numbers; 1..31 might mean day numbers.
Maybe you're lucky and there's a leading "$" or trailing
"%" to give you a hint (careful of those dollars, though:
are they American, Canadian, Australian, or Doodledorfian?).
Start with a set of "value interpreters" of various kinds,
apply each of them to all the values in a column, and score
each by how many times it's able to make sense of what it
finds.

It's going to be an imprecise business ...

Thomas Kellerer · Jan 12, 2006

h4hrr wrote on 12.01.2006 23:51:

21.10.2003;6 ;0.99;Feb 96;1
22.10.2003;5 ;0.99;Feb 66;1
23.10.2003;10;0.99;Feb 34;1
24.10.2003;3 ;1 ;02.Jun;1

def. i can't say anything about "20060112", what type it can be

but if its 0.99 (like in col 3, and fourth value is int in that col.);
and i read it each value from file as String[], how can i determine its
float and fourth value is int?

Why would you want to have a different data type for the first and
fourth value? They all go into the same database column, right? So
you'll have to use float for all of them.

But this is precisely another situation why you cannot reliable
determine the data type from a plain text file. Suppose you analyze the
first row to find the datatype and the "1" is in there, so you'll go for
an int, but later you'll hit 0.99 and your code throws an exception.

But if you really need to do it, you could try something like this:

String valueFromFile = ....;

Object dbValue = null;
try
{
dbValue = Integer.valueOf((valueFromFile);
}
catch (NumberFormatException e)
{
dbValue = null;

}

if (dbValue == null)
{
try
{
dbValue = Double.valueOf(valueFromFile);
}
catch (NumberFormatException e)
{
dbValue = null;
}
}

if (dbValue == null)
{
.....
}

You get the idea.

Basically try with one type, if that fails try the next until the
conversion does not throw an Exception. Not the fastest way, but if you
have a limited set of types, this might work. Won't help with the
situation I described in my first post though.

The actual dbValue object can be written to the database using
setObject() on a PreparedStatement.

Thomas

Roedy Green · Jan 12, 2006

my problem is to determine the datatype of various column values in the
csv file. I read the csv file column wise and have to determine what
datatype each col. contains, so to map it to correct type for creating
tables in the DB.

This is similar to the logic in my Align utility, and also in the HTML
table sorter, that scan a file first to determine if each column is
alpha or numeric. If all entries are numeric, I right align or sort
numerically.

So you do a pass through the file with CSVReader (see
http://mindprod.com/jgloss/csv.html

There is an option to give you each entire line as an array of
strings.

Examine each string, and adjust your types[col] array.

Now you know the types of each column. You can then read the file
converting each entry to the column type and know it will work.

h4hrr · Jan 12, 2006

Thanks for the nice explanation.

my main aim was to do the same as you mentioned,

to take a sample of fifty or a hundred
rows, and try several ways of "understanding" the set of
values you find in each column: If one of them works
more often than the others, guess that it's the right
interpretation.

but how to determine, whether the string[] value is integer, double or
something else, i mean the java reference for it (as i said in my other
reply).. i am not sure how to do that.. the java reference or command
to determine the value type ..

what to finalize which is datatype is it at the end, may be one has to
do some heuristics for determining it out of the values read..

thanks

h4hrr · Jan 12, 2006

ah, didn't look at this before posting the last reply..

thanks for code sample.. yeah looks like its not a easy way to go and
if i check 1000 rows or so, for all the different types.. numerics,
dates,

hmm..

Francesco Devittori · Jan 13, 2006

max_ny said:
Hii,
i am trying write some code to transfer data from csv file to a
database.
my problem is to determine the datatype of various column values in the
csv file. I read the csv file column wise and have to determine what
datatype each col. contains, so to map it to correct type for creating
tables in the DB.
Can someone provide me any reference for determining datatype of a col.
value read from csv file, (my google search couldn't yield exact
results); i can't figure out how should i determine the type of each
column data

thanks

You could use String.matches(r) with r being a regular expression.
For example

if (col.matches("[0-10]")) { type = integer }
else if (col.matches("[0-10]\.{0-10}+")) { type = float }
else if (col.matches("(true)(false)")) { type = bool }
else if (col.matches("[0-31] [(jan)(feb)...] [0-2034]")) { type = date }
and so on...

(not sure about the regexp syntax, just to show how you could do it)

Francesco

Eric Sosman · Jan 13, 2006

h4hrr wrote On 01/12/06 18:42,:

Thanks for the nice explanation.

my main aim was to do the same as you mentioned,

to take a sample of fifty or a hundred
rows, and try several ways of "understanding" the set of
values you find in each column: If one of them works
more often than the others, guess that it's the right
interpretation.

Click to expand...

but how to determine, whether the string[] value is integer, double or
something else, i mean the java reference for it (as i said in my other
reply).. i am not sure how to do that.. the java reference or command
to determine the value type ..

what to finalize which is datatype is it at the end, may be one has to
do some heuristics for determining it out of the values read..

Exactly: your code needs to do some guesswork. My
suggestion is that you come up with a limited set of
data types that you expect to find in the data: Let's
say they are Date, Integer, Float, and String. Take
all (or a good-sized sample) of the text values in a
column and try to convert them to Dates; if none of the
texts successfully convert to Date then you conclude
that the column doesn't represent Dates. Try again,
converting them all to Integers; if 30% of them convert
successfully and 70% don't, you can be reasonably sure
that the column doesn't represent Integers. Try again,
this time converting every value to a Float; if they
all succeed, you can be pretty sure the column contains
floating-point numbers. If nothing else seems to fill
the bill, declare that the column contains Strings (the
conversion of a String to a String nearly always succeeds).

The "interesting" cases are when none of your candidate
data types successfully converts all the column's values,
but one converts "most" of the column. You could fall
back on String (the ultimate "I don't know"), or you could
decide to reject the few un-convertable rows (especially
if other columns in the same rows prove recalcitrant).
Another "interesting" case is when two or more data types
successfully convert all or most of the data, but neither
is obviously "more specific" than the other. These call
for judgment, a quality notably lacking in computers.

Programming challenge?	4	Jul 23, 2021
Getting Error reading in JSON file	0	Apr 28, 2022
psycopg2 insertion and reading binary data to PostgreSQL database(bytea datatype)	0	Mar 1, 2011
Two Questions: JSpinner and datatype	11	May 17, 2010
Node.js server that streams an infinite silence sound switches to streaming mew audio data receiving an event, and then returns to silence	0	Aug 16, 2024
List of template classes	2	Sep 10, 2013
Datatype matching	6	Feb 20, 2006
Python point location of intersect between two lines	0	Feb 28, 2018

Determine datatype

max_ny

Thomas Kellerer

h4hrr

Eric Sosman

Thomas Kellerer

Roedy Green

h4hrr

h4hrr

Francesco Devittori

Eric Sosman

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads