csv Parser Question - Handling of Double Quotes

jwbrown77 · Mar 27, 2008

Hello,

I am trying to read a csv file. I have the following functioning
code:

---- BEGIN ----
import csv

reader = csv.reader(open("test.csv", "rb"), delimiter=';')

for row in reader:
print row
---- END ----

This code will successfully parse my csv file formatted as such:

"this";"is";"a";"test"

Resulting in an output of:

['this', 'is', 'a', 'test']

However, if I modify the csv to:

"t"h"is";"is";"a";"test"

The output changes to:

['th"is"', 'is', 'a', 'test']

My question is, can you change the behavior of the parser to only
remove quotes when they are next to the delimiter? I would like both
quotes around the h in the example above to remain, however it is
instead removing only the first two instances of quotes it runs across
and leaves the others.

The closest solution I have found is to add to the reader command
"escapechar='\\'" then manually add a single \ character before the
quotes I'd like to keep. But instead of writing something to add
those slashes before csv parsing I was wondering if the parser can
handle it instead.

Thanks in advance for the help.

Aaron Watters · Mar 27, 2008

"this";"is";"a";"test"

Resulting in an output of:

['this', 'is', 'a', 'test']

However, if I modify the csv to:

"t"h"is";"is";"a";"test"

The output changes to:

['th"is"', 'is', 'a', 'test']

I'd be tempted to say that this is a bug,
except that I think the definition of "csv" is
informal, so the "bug/feature" distinction
cannot be exactly defined, unless I'm mistaken.

What I would do is write roll my own
parser using very simple python and check
that it works for the examples of interest.
If, for example, you can assume that the
delimiter will never occur inside the
payload and the payload contains no
"quoted" characters you could do something like:

==== cut
def trimQuotes(txt):
txt = txt.strip()
if txt:
start = txt[0]
end = txt[-1]
if start==end and start in ('"', "'"):
return txt[1:-1]
return txt

def simpleCsv(lines, delimiter):
for line in lines:
fields = line.split(delimiter)
fields = map(trimQuotes, fields)
yield fields

def test():
lines = ['"t"h"is";"is";"a";"test"']
for fields in simpleCsv(lines, ';'):
print fields

if __name__=="__main__":
test()
=== cut

If you want fame and admiration you could fix
the arguably bug in the csv module and send
the patch to the python bugs mailing list.
However, I just had a perusal of csv.py....
good luck

.
-- Aaron Watters

===
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=too+general

Gabriel Genellina · Mar 27, 2008

En Thu, 27 Mar 2008 17:37:33 -0300, Aaron Watters

"this";"is";"a";"test"

Resulting in an output of:

['this', 'is', 'a', 'test']

However, if I modify the csv to:

"t"h"is";"is";"a";"test"

The output changes to:

['th"is"', 'is', 'a', 'test']

Click to expand...

I'd be tempted to say that this is a bug,
except that I think the definition of "csv" is
informal, so the "bug/feature" distinction
cannot be exactly defined, unless I'm mistaken.

AFAIK, the csv module tries to mimic Excel behavior as close as possible.
It has some test cases that look horrible, but that's what Excel does...
I'd try actually using Excel to see what happens.
Perhaps the behavior could be more configurable, like the codecs are.

jwbrown77 · Mar 27, 2008

En Thu, 27 Mar 2008 17:37:33 -0300, Aaron Watters
<[email protected]> escribió:

"this";"is";"a";"test"
Resulting in an output of:
['this', 'is', 'a', 'test']
However, if I modify the csv to:
"t"h"is";"is";"a";"test"
The output changes to:
['th"is"', 'is', 'a', 'test']

Click to expand...

Click to expand...

I'd be tempted to say that this is a bug,
except that I think the definition of "csv" is
informal, so the "bug/feature" distinction
cannot be exactly defined, unless I'm mistaken.

Click to expand...

AFAIK, the csv module tries to mimic Excel behavior as close as possible.
It has some test cases that look horrible, but that's what Excel does...
I'd try actually using Excel to see what happens.
Perhaps the behavior could be more configurable, like the codecs are.

Thank you Aaron and Gabriel. I was also hesitant to use the term
"bug" since as you said CSV isn't a standard. Yet in the same right I
couldn't readily think of an instance where the quote should be
removed if it's not sitting right next to the delimiter (or at the
very beginning/end of the line).

I'm not even sure if it should be patched since there could be cases
where this is how people want it to behave and I wouldn't want their
code to break.

I think rolling out a custom class seems like the only solution but if
anyone else has any other advice I'd like to hear it.

Thanks again for the help.

John Machin · Mar 27, 2008

If you want fame and admiration you could fix
the arguably bug in the csv module and send
the patch to the python bugs mailing list.
However, I just had a perusal of csv.py....
good luck .

It is *NOT* a bug in the Python CSV module. The data is the problem.
The admittedly arcane way that the admittedly informal CSV writing
protocol works for each field is (simplified by ignoring \n and other
quotables):

QUOTE = '"'
DELIM = ','
if QUOTE in field:
emit(QUOTE + field.replace(QUOTE, QUOTE+QUOTE) + QUOTE)
elif DELIM in field:
emit(QUOTE + field + QUOTE)
else:
emit(field)

Example: database query, customer's surname recorded as O"Brien

This should be written as ...,"O""Brien",...
and read back as ['...', 'O"Brien', '...']

Aside: this quote-doubling caper is not restricted to CSV and not
exactly an uncommon occurrence:
SELECT * FROM cust WHERE surname = 'O''Brien';

A common mistake in CSV writing is to omit the quote-doubling step
above. If that is done, it is impossible to recover the original
contents unambiguously in all cases without further knowledge,
assumptions, heuristics, or look-ahead e.g. (1) the original field had
an even number of quotes or (2) the intended number of fields is known
or (3) there is only one quote in the line and there are no embedded
newlines ...

The Python csv module emulates Excel in delivering garbage silently in
cases when the expected serialisation protocol has (detectably) not
been followed. Proffering fame and admiration might be better directed
towards introducing a "strict" option than patching a non-existing bug
(which would introduce new ones).

Cheers,
John

John Machin · Mar 27, 2008

En Thu, 27 Mar 2008 17:37:33 -0300, Aaron Watters
<[email protected]> escribió:

"this";"is";"a";"test"
Resulting in an output of:
['this', 'is', 'a', 'test']
However, if I modify the csv to:
"t"h"is";"is";"a";"test"
The output changes to:
['th"is"', 'is', 'a', 'test']
I'd be tempted to say that this is a bug,
except that I think the definition of "csv" is
informal, so the "bug/feature" distinction
cannot be exactly defined, unless I'm mistaken.

Click to expand...

Click to expand...

AFAIK, the csv module tries to mimic Excel behavior as close as possible..
It has some test cases that look horrible, but that's what Excel does...
I'd try actually using Excel to see what happens.
Perhaps the behavior could be more configurable, like the codecs are.

Click to expand...

Thank you Aaron and Gabriel. I was also hesitant to use the term
"bug" since as you said CSV isn't a standard. Yet in the same right I
couldn't readily think of an instance where the quote should be
removed if it's not sitting right next to the delimiter (or at the
very beginning/end of the line).

I'm not even sure if it should be patched since there could be cases
where this is how people want it to behave and I wouldn't want their
code to break.

I think rolling out a custom class seems like the only solution but if
anyone else has any other advice I'd like to hear it.

I have code in awk, C, and Python for reading bad-CSV data under the
assumptions (1) no embedded newlines (2) embedded quotes are not
doubled as they should be (3) there is an even number of quotes in
each original field (4) the caller prefers an exception or error
return when there is anomalous data.

Aaron Watters · Mar 28, 2008

...The Python csv module emulates Excel in delivering garbage silently in
cases when the expected serialisation protocol has (detectably) not
been followed....

Fine, but I'd say the heuristic adopted produces
bizarre and surprising results in the illustrated case.
It's a matter of taste of course...
-- Aaron Watters

===
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=mondo+bizarre

CSV out of range	0	Dec 4, 2012
.csv to .txt after adding columns	7	Sep 18, 2013
Number of cells, using CSV module	8	May 16, 2013
csv read clean up and write out to csv	2	Nov 2, 2012
csv reader	5	Dec 15, 2009
Grouping on and exporting to csv files	1	Mar 20, 2013
Newbie question: Tuples and reading csv files	3	Mar 29, 2010
Compare list entry from csv files	0	Nov 26, 2012

csv Parser Question - Handling of Double Quotes

jwbrown77

Aaron Watters

Gabriel Genellina

jwbrown77

John Machin

John Machin

Aaron Watters

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads