csv Parser Question - Handling of Double Quotes

J

jwbrown77

Hello,

I am trying to read a csv file. I have the following functioning
code:

---- BEGIN ----
import csv

reader = csv.reader(open("test.csv", "rb"), delimiter=';')

for row in reader:
print row
---- END ----

This code will successfully parse my csv file formatted as such:

"this";"is";"a";"test"

Resulting in an output of:

['this', 'is', 'a', 'test']

However, if I modify the csv to:

"t"h"is";"is";"a";"test"

The output changes to:

['th"is"', 'is', 'a', 'test']

My question is, can you change the behavior of the parser to only
remove quotes when they are next to the delimiter? I would like both
quotes around the h in the example above to remain, however it is
instead removing only the first two instances of quotes it runs across
and leaves the others.

The closest solution I have found is to add to the reader command
"escapechar='\\'" then manually add a single \ character before the
quotes I'd like to keep. But instead of writing something to add
those slashes before csv parsing I was wondering if the parser can
handle it instead.

Thanks in advance for the help.
 
A

Aaron Watters

"this";"is";"a";"test"

Resulting in an output of:

['this', 'is', 'a', 'test']

However, if I modify the csv to:

"t"h"is";"is";"a";"test"

The output changes to:

['th"is"', 'is', 'a', 'test']

I'd be tempted to say that this is a bug,
except that I think the definition of "csv" is
informal, so the "bug/feature" distinction
cannot be exactly defined, unless I'm mistaken.

What I would do is write roll my own
parser using very simple python and check
that it works for the examples of interest.
If, for example, you can assume that the
delimiter will never occur inside the
payload and the payload contains no
"quoted" characters you could do something like:

==== cut
def trimQuotes(txt):
txt = txt.strip()
if txt:
start = txt[0]
end = txt[-1]
if start==end and start in ('"', "'"):
return txt[1:-1]
return txt

def simpleCsv(lines, delimiter):
for line in lines:
fields = line.split(delimiter)
fields = map(trimQuotes, fields)
yield fields

def test():
lines = ['"t"h"is";"is";"a";"test"']
for fields in simpleCsv(lines, ';'):
print fields

if __name__=="__main__":
test()
=== cut

If you want fame and admiration you could fix
the arguably bug in the csv module and send
the patch to the python bugs mailing list.
However, I just had a perusal of csv.py....
good luck :).
-- Aaron Watters

===
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=too+general
 
G

Gabriel Genellina

En Thu, 27 Mar 2008 17:37:33 -0300, Aaron Watters
"this";"is";"a";"test"

Resulting in an output of:

['this', 'is', 'a', 'test']

However, if I modify the csv to:

"t"h"is";"is";"a";"test"

The output changes to:

['th"is"', 'is', 'a', 'test']

I'd be tempted to say that this is a bug,
except that I think the definition of "csv" is
informal, so the "bug/feature" distinction
cannot be exactly defined, unless I'm mistaken.

AFAIK, the csv module tries to mimic Excel behavior as close as possible.
It has some test cases that look horrible, but that's what Excel does...
I'd try actually using Excel to see what happens.
Perhaps the behavior could be more configurable, like the codecs are.
 
J

jwbrown77

En Thu, 27 Mar 2008 17:37:33 -0300, Aaron Watters  
<[email protected]> escribió:


"this";"is";"a";"test"
Resulting in an output of:
['this', 'is', 'a', 'test']
However, if I modify the csv to:
"t"h"is";"is";"a";"test"
The output changes to:
['th"is"', 'is', 'a', 'test']
I'd be tempted to say that this is a bug,
except that I think the definition of "csv" is
informal, so the "bug/feature" distinction
cannot be exactly defined, unless I'm mistaken.

AFAIK, the csv module tries to mimic Excel behavior as close as possible.  
It has some test cases that look horrible, but that's what Excel does...  
I'd try actually using Excel to see what happens.
Perhaps the behavior could be more configurable, like the codecs are.

Thank you Aaron and Gabriel. I was also hesitant to use the term
"bug" since as you said CSV isn't a standard. Yet in the same right I
couldn't readily think of an instance where the quote should be
removed if it's not sitting right next to the delimiter (or at the
very beginning/end of the line).

I'm not even sure if it should be patched since there could be cases
where this is how people want it to behave and I wouldn't want their
code to break.

I think rolling out a custom class seems like the only solution but if
anyone else has any other advice I'd like to hear it.

Thanks again for the help.
 
J

John Machin

If you want fame and admiration you could fix
the arguably bug in the csv module and send
the patch to the python bugs mailing list.
However, I just had a perusal of csv.py....
good luck :).

It is *NOT* a bug in the Python CSV module. The data is the problem.
The admittedly arcane way that the admittedly informal CSV writing
protocol works for each field is (simplified by ignoring \n and other
quotables):

QUOTE = '"'
DELIM = ','
if QUOTE in field:
emit(QUOTE + field.replace(QUOTE, QUOTE+QUOTE) + QUOTE)
elif DELIM in field:
emit(QUOTE + field + QUOTE)
else:
emit(field)

Example: database query, customer's surname recorded as O"Brien

This should be written as ...,"O""Brien",...
and read back as ['...', 'O"Brien', '...']

Aside: this quote-doubling caper is not restricted to CSV and not
exactly an uncommon occurrence:
SELECT * FROM cust WHERE surname = 'O''Brien';

A common mistake in CSV writing is to omit the quote-doubling step
above. If that is done, it is impossible to recover the original
contents unambiguously in all cases without further knowledge,
assumptions, heuristics, or look-ahead e.g. (1) the original field had
an even number of quotes or (2) the intended number of fields is known
or (3) there is only one quote in the line and there are no embedded
newlines ...

The Python csv module emulates Excel in delivering garbage silently in
cases when the expected serialisation protocol has (detectably) not
been followed. Proffering fame and admiration might be better directed
towards introducing a "strict" option than patching a non-existing bug
(which would introduce new ones).

Cheers,
John
 
J

John Machin

En Thu, 27 Mar 2008 17:37:33 -0300, Aaron Watters
<[email protected]> escribió:
"this";"is";"a";"test"
Resulting in an output of:
['this', 'is', 'a', 'test']
However, if I modify the csv to:
"t"h"is";"is";"a";"test"
The output changes to:
['th"is"', 'is', 'a', 'test']
I'd be tempted to say that this is a bug,
except that I think the definition of "csv" is
informal, so the "bug/feature" distinction
cannot be exactly defined, unless I'm mistaken.
AFAIK, the csv module tries to mimic Excel behavior as close as possible..
It has some test cases that look horrible, but that's what Excel does...
I'd try actually using Excel to see what happens.
Perhaps the behavior could be more configurable, like the codecs are.

Thank you Aaron and Gabriel. I was also hesitant to use the term
"bug" since as you said CSV isn't a standard. Yet in the same right I
couldn't readily think of an instance where the quote should be
removed if it's not sitting right next to the delimiter (or at the
very beginning/end of the line).

I'm not even sure if it should be patched since there could be cases
where this is how people want it to behave and I wouldn't want their
code to break.

I think rolling out a custom class seems like the only solution but if
anyone else has any other advice I'd like to hear it.

I have code in awk, C, and Python for reading bad-CSV data under the
assumptions (1) no embedded newlines (2) embedded quotes are not
doubled as they should be (3) there is an even number of quotes in
each original field (4) the caller prefers an exception or error
return when there is anomalous data.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,994
Messages
2,570,223
Members
46,813
Latest member
lawrwtwinkle111

Latest Threads

Top