SQL and CSV

Nick · May 5, 2009

I have a requirement to read a CSV file. Normally, no problem, just
import CSV and slurp the file up.

However, in this case I want to filter out lines that have fields set
to particular values.

It would be neat to be able to do something like this.

select * from test.csv where status <> "Canceled"

Using adodb I can do this, so long as I don't have the where clause. :-
(

Is there a reasonable lightweight way of doing this in Python?

I could write some python code that is used to filter rows, and inport
that from config, but it's not quite as elegant as an SQL route.

Thanks

Nick

Tim Golden · May 5, 2009

Nick said:
I have a requirement to read a CSV file. Normally, no problem, just
import CSV and slurp the file up.

However, in this case I want to filter out lines that have fields set
to particular values.

It would be neat to be able to do something like this.

select * from test.csv where status <> "Canceled"

Using adodb I can do this, so long as I don't have the where clause. :-
(

Is there a reasonable lightweight way of doing this in Python?

I could write some python code that is used to filter rows, and inport
that from config, but it's not quite as elegant as an SQL route.

Not entirely clear what you are and aren't prepared to try here, but...
the most obvious Python-based way to do this is treating the csv reader
as an iterator and filtering there. Your last line suggests that's not
what you want but just in case I've misunderstood:

<test.csv>
id,code,status
1,"ONE","Active"
2,"TWO","Cancelled"
3,"THREE","Active"
</test.csv>

<code>
import csv

for row in csv.DictReader (open ("c:/temp/test.csv", "rb")):
if row['status'] != 'Cancelled':
print row

</code>

Doesn't seem too onerous, and could obviously be wrapped in
some useful class/module.

But if you really want to go the SQL route, I believe there are
ODBC adapters for CSV which, combined with PyODBC or CeODBC,
would probably take you where you want to go.

TJG

Nick · May 5, 2009

Nick said:
Nick said:

I have a requirement to read a CSV file. Normally, no problem, just
import CSV and slurp the file up.

Click to expand...

However, in this case I want to filter out lines that have fields set
to particular values.

Click to expand...

It would be neat to be able to do something like this.

Click to expand...

select * from test.csv where status <> "Canceled"

Click to expand...

Using adodb I can do this, so long as I don't have the where clause. :-
(

Click to expand...

Is there a reasonable lightweight way of doing this in Python?

Click to expand...

I could write some python code that is used to filter rows, and inport
that from config, but it's not quite as elegant as an SQL route.

Click to expand...

Not entirely clear what you are and aren't prepared to try here, but...
the most obvious Python-based way to do this is treating the csv reader
as an iterator and filtering there. Your last line suggests that's not
what you want but just in case I've misunderstood:

<test.csv>
id,code,status
1,"ONE","Active"
2,"TWO","Cancelled"
3,"THREE","Active"
</test.csv>

<code>
import csv

for row in csv.DictReader (open ("c:/temp/test.csv", "rb")):
if row['status'] != 'Cancelled':
print row

</code>

Doesn't seem too onerous, and could obviously be wrapped in
some useful class/module.

But if you really want to go the SQL route, I believe there are
ODBC adapters for CSV which, combined with PyODBC or CeODBC,
would probably take you where you want to go.

TJG

Part of the problem is that the 'selection' needs to be in a config
file. I can put the if row['status'] != 'Cancelled': return True into
a config, read it and eval it, but its not quite as clean as an sql
route.

Nick

Matimus · May 5, 2009

Not entirely clear what you are and aren't prepared to try here, but...
the most obvious Python-based way to do this is treating the csv reader
as an iterator and filtering there. Your last line suggests that's not
what you want but just in case I've misunderstood:

<code>
import csv

Click to expand...

for row in csv.DictReader (open ("c:/temp/test.csv", "rb")):
if row['status'] != 'Cancelled':
print row

</code>

Click to expand...

Doesn't seem too onerous, and could obviously be wrapped in
some useful class/module.

Click to expand...

But if you really want to go the SQL route, I believe there are
ODBC adapters for CSV which, combined with PyODBC or CeODBC,
would probably take you where you want to go.

Click to expand...

TJG

Click to expand...

Part of the problem is that the 'selection' needs to be in a config
file. I can put the if row['status'] != 'Cancelled': return True into
a config, read it and eval it, but its not quite as clean as an sql
route.

Nick

Well, if you are using 2.5.x you could always stuff it into a sqlite
in-memory database, and then execute a SQL query. Heck, you don't even
_need_ 2.5, but in 2.5 sqlite is part of the distribution.

Matt

Tim Golden · May 5, 2009

Nick said:
Part of the problem is that the 'selection' needs to be in a config
file. I can put the if row['status'] != 'Cancelled': return True into
a config, read it and eval it, but its not quite as clean as an sql
route.

Still not clear what the restriction is. If you were writing
SQL you'd have to read *something* from your config file,
unless you're suggesting that the "config file" is in fact
a SQL file. Which is one way of doing it, but then you might
just as well have your config file as a Python file and
import it.

Have I missed the point somewhere here? Can you give an
example -- even a fictional one -- of what you couldn't
do using, say, the example I gave earlier?

TJG

Nick · May 7, 2009

Nick said:
Nick said:

Part of the problem is that the 'selection' needs to be in a config
file. I can put the if row['status'] != 'Cancelled': return True into
a config, read it and eval it, but its not quite as clean as ansql
route.

Click to expand...

Still not clear what the restriction is. If you were writingSQLyou'd have to read *something* from your config file,
unless you're suggesting that the "config file" is in fact
aSQLfile. Which is one way of doing it, but then you might
just as well have your config file as a Python file and
import it.

Have I missed the point somewhere here? Can you give an
example -- even a fictional one -- of what you couldn't
do using, say, the example I gave earlier?

TJG

Solution found. In the end I used SQLite to read from a csv file, and
now I can query the CSV file. The file is read using the csv module

First create a function

def fraction(p, denom):
num, frac = p.split ('-')
return float (num) + float (frac) / denom

for use within queries.

Now build the class.

self.filename = filename
self.dialect = dialect
self.query = query
reader = csv.reader (open (filename, 'r'))
self.connection = sqlite.connect(":memory:")
self.connection.create_function("fraction", 2, fraction) #
Adds in function
self.cursor = self.connection.cursor()
first = True
for row in reader:
if first:
headers = []
for r in row:
n = r.strip().replace (' ', '_').replace ('-','_')
headers.append (n)
command = 'create table csv (%s)' % ','.join (headers)
self.cursor.execute (command)
first = False
else:
command = 'insert into csv values ("%s")' % '","'.join
(row)
self.cursor.execute (command)

and then I can use this

self.cursor.execute (self.query)
rows = self.cursor.fetchall()
headers = []
for r in self.cursor.description:
headers.append (r[0])
results = Results.Results (headers, self.name, {})
i = 0
for row in rows:
results.add (row, i)
i = i + 1
return results

to query the results.

Results.Results is one of my classes that's reused in lots of places.

The query then looks somethign like this

select
Client_Reference_Number as TrdNbr,
Asset_Number as ISIN,
Quantity as Qty,
status
from
csv
where status in ("CANCELLED")

union

select
Client_Reference_Number as TrdNbr,
Asset_Number as ISIN,
Quantity as Qty,
status
from
csv
where status not in ("CANCELLED")

All incredibly neat and the first time I've used SQLite.

nick

John Machin · May 8, 2009

Nick said:
Nick said:

Part of the problem is that the 'selection' needs to be in a config
file. I can put the if row['status'] != 'Cancelled': return True into
a config, read it and eval it, but its not quite as clean as ansql
route.

Click to expand...

Click to expand...

Still not clear what the restriction is. If you were writingSQLyou'd have to read *something* from your config file,
unless you're suggesting that the "config file" is in fact
aSQLfile. Which is one way of doing it, but then you might
just as well have your config file as a Python file and
import it.

Click to expand...

Have I missed the point somewhere here? Can you give an
example -- even a fictional one -- of what you couldn't
do using, say, the example I gave earlier?

Click to expand...

TJG

Click to expand...

Solution found. In the end I used SQLite to read from a csv file, and
now I can query the CSV file. The file is read using the csv module

First create a function

def fraction(p, denom):
num, frac = p.split ('-')
return float (num) + float (frac) / denom

for use within queries.

Now build the class.

self.filename = filename
self.dialect = dialect
self.query = query
reader = csv.reader (open (filename, 'r'))
self.connection = sqlite.connect(":memory:")
self.connection.create_function("fraction", 2, fraction) #
Adds in function
self.cursor = self.connection.cursor()
first = True
for row in reader:
if first:
headers = []
for r in row:
n = r.strip().replace (' ', '_').replace ('-','_')
headers.append (n)
command = 'create table csv (%s)' % ','..join (headers)
self.cursor.execute (command)
first = False
else:
command = 'insert into csv values ("%s")' % '","'.join
(row)
self.cursor.execute (command)

and then I can use this

self.cursor.execute (self.query)
rows = self.cursor.fetchall()
headers = []
for r in self.cursor.description:
headers.append (r[0])
results = Results.Results (headers, self.name, {})
i = 0
for row in rows:
results.add (row, i)
i = i + 1
return results

to query the results.

Results.Results is one of my classes that's reused in lots of places.

The query then looks somethign like this

select
Client_Reference_Number as TrdNbr,
Asset_Number as ISIN,
Quantity as Qty,
status
from
csv

The remaining lines of your SELECT statement are incredibly redundant
AFAICT. It seems you have pushed the contents of your csv file into a
data base and pulled them ALL out again ... not what I'd call a
"query". What's the point?

Dennis Lee Bieber · May 8, 2009

The query then looks somethign like this

select
Client_Reference_Number as TrdNbr,
Asset_Number as ISIN,
Quantity as Qty,
status
from
csv
where status in ("CANCELLED")

union

select
Client_Reference_Number as TrdNbr,
Asset_Number as ISIN,
Quantity as Qty,
status
from
csv
where status not in ("CANCELLED")

Pardon? The union of "in ('CANCELLED')" and "not in ('CANCELLED')"
will be ALL records... -- possibly ordered with the "cancelled" first...

I'd probably just use

select ... from csv
order by status;

to group the values.
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/

Peter Otten · May 8, 2009

Nick said:
self.cursor = self.connection.cursor()
first = True
for row in reader:
if first:
headers = []
for r in row:
n = r.strip().replace (' ', '_').replace ('-','_')
headers.append (n)
command = 'create table csv (%s)' % ','.join (headers)
self.cursor.execute (command)
first = False
else:
command = 'insert into csv values ("%s")' % '","'.join
(row)
self.cursor.execute (command)

You can simplify that a bit:

cursor = self.cursor = self.connection.cursor()

first_row = next(reader)
headers = [column.strip().replace(" ", "_").replace("-", "_") for column in
first_row]
cursor.execute("create table csv (%s)" % ", ".join(headers))

placeholders = ", ".join("?"*len(headers))
command = "insert into csv values (%s)" % placeholders
cursor.executemany(command, reader)

While it may not matter here using placeholders instead of manually escaping
user-provided values is a good habit to get into.

self.cursor.execute (self.query)
rows = self.cursor.fetchall()

rows = self.cursor.execute(self.query)

doesn't build an intermediate list.

i = 0
for row in rows:
results.add (row, i)
i = i + 1

This is written

for i, row in enumerate(rows):
results.add(row, i)

in idiomatic Python.

Peter

Lawrence D'Oliveiro · May 8, 2009

While it may not matter here using placeholders instead of manually
escaping user-provided values is a good habit to get into.

Until you hit things it can't deal with.

Nick · May 8, 2009

The post you are replying to was talking about using the SQL library's "?"
syntax that automatically escapes values. The usual reason this is
recommended (if I have understood correctly) is that the library code is
much more likely to foil injection attacks. I have seen this mentioned
often and assume it is good advice.

Can you expand on your comment? I assume you are thinking of how the
library might handle some strange class. But aren't the number of types
limited by SQL? In which case a "thing that can't be handled" could
presumably be managed by adding an appropriate __str__ or __float__ or
whatever? And you would still use the library to give safety with other
values.

Maybe you could give an example of the kind of problem you're thinking of?

Thanks,
Andrew

Injection attacks aren't an issue, its a local app.

It's part of a reconciliation system, where sometimes data is in csv
files. If you want the whole csv file, you can use csv module without
a problem.

In some cases, I need to manipulate the data.

The choices are hard code the manipulation, or load the data from a
config file.

So what I've got is the query in the config and I can process it.

Nick

Lawrence D'Oliveiro · May 9, 2009

andrew said:
Can you expand on your comment? I assume you are thinking of how the
library might handle some strange class.

Consider something as simple as a "like" clause.

Lawrence D'Oliveiro · May 9, 2009

andrew said:
even if you're not open to injection attacks, you're still less likely to
get escaping correct than a puprose written, widely used library.

Escaping isn't really that hard to do. For example, the rules for MySQL
literals are clearly spelled out here
<http://dev.mysql.com/doc/refman/5.0/en/string-syntax.html>. You can easily
check that against my quoting routines here
<http://codecodex.com/wiki/index.php?title=Useful_MySQL_Routines#Quoting>.

KML to CSV file conversion using Python and Windows Powershell	0	Oct 14, 2022
How to convert CSV to parquet file without RLE_DICTIONARY encoding?	0	Sep 2, 2022
I need help with my python assignment and I'm stuck can't find any solution for it. Convert CSV string format to JSON format	0	Oct 12, 2021
Grouping on and exporting to csv files	1	Mar 20, 2013
csv read clean up and write out to csv	2	Nov 2, 2012
Number of cells, using CSV module	8	May 16, 2013
Errors When Pulling Information from CSV File to Python	0	Dec 10, 2020
How to sort a CSV file with merge sort JAVA	7	May 6, 2021

SQL and CSV

Nick

Tim Golden

Nick

Matimus

Tim Golden

Nick

John Machin

Dennis Lee Bieber

Peter Otten

Lawrence D'Oliveiro

Nick

Lawrence D'Oliveiro

Lawrence D'Oliveiro

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads