Odd csv column-name truncation with only one column

Tim Chase · Jul 19, 2012

tim@laptop:~/tmp$ python
Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
['Emai', '']

I get the same results using Python 3.1.3 (also readily available on
Debian Stable), as well as working directly on a file rather than a
StringIO.

Any reason I'm getting ['Emai', ''] (note the missing ell) instead
of ['Email'] as my resulting fieldnames? Did I miss something in
the docs?

-tkc

Steven D'Aprano · Jul 19, 2012

tim@laptop:~/tmp$ python
Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48) [GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
['Emai', '']

I get the same results for Python 2.6 and 2.7. Curiously, 2.5 returns
fieldnames as None.

I'm not entirely sure that a single column is legitimate for CSV -- if
there's only one column, it is hardly comma-separated, or any other
separated for that matter. But perhaps the csv module should raise an
exception in that case.

I think you've found a weird corner case where the sniffer goes nuts. You
should probably report it as a bug:

py> s = StringIO('Email\n[email protected]\n[email protected]\n')
py> s.seek(0)
py> d = csv.Sniffer().sniff(s.read())
py> d.delimiter
'l'

py> s = StringIO('Spam\n[email protected]\n[email protected]\n')
py> s.seek(0)
py> d = csv.Sniffer().sniff(s.read())
py> d.delimiter
'p'

py> s = StringIO('Spam\nham\ncheese\n')
py> s.seek(0)
py> d = csv.Sniffer().sniff(s.read())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/csv.py", line 184, in sniff
    raise Error, "Could not determine delimiter"
_csv.Error: Could not determine delimiter

Hans Mulder · Jul 19, 2012

tim@laptop:~/tmp$ python
Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
['Emai', '']

I get the same results using Python 3.1.3 (also readily available on
Debian Stable), as well as working directly on a file rather than a
StringIO.

Any reason I'm getting ['Emai', ''] (note the missing ell) instead
of ['Email'] as my resulting fieldnames? Did I miss something in
the docs?

The sniffer tries to guess the column separator. If none of the
usual suspects seems to work, it tries to find a character that
occurs with the same frequency in every row. In your sample,
the letter 'l' occurs exactly once on each line, so it is the
most plausible separator, or so the Sniffer thinks.
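
The frequency idea described above can be sketched roughly as follows. This is not the csv module's actual code, only an illustration of the heuristic; the helper name `consistent_chars` and the two sample addresses are made up for the example, and Python 3 is assumed.

```python
from collections import Counter

def consistent_chars(text):
    """Map each character that occurs equally often on every line to that count.

    Illustrative helper only; the real Sniffer logic is more involved.
    """
    lines = text.splitlines()
    if not lines:
        return {}
    counts = [Counter(line) for line in lines]
    first = counts[0]
    # Keep characters whose count on the first line repeats on every other line.
    return {ch: n for ch, n in first.items()
            if all(c[ch] == n for c in counts)}

# Hypothetical sample data: a header plus two made-up addresses.
sample = 'Email\nfoo@example.com\nbar@example.org\n'
print(consistent_chars(sample))                 # {'l': 1}
print(consistent_chars('Spam\nham\ncheese\n'))  # {}
```

With this sample, 'l' is the only character that appears exactly once on every line, which is why it looks like a delimiter. An empty result corresponds to the "Could not determine delimiter" case shown earlier in the thread.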

Perhaps it should be documented that the Sniffer doesn't work
on single-column data.

If you really need to read a one-column csv file, you'll have
to find some other way to produce a Dialect object. Perhaps the
predefined 'csv.excel' dialect matches your data. If not, the
easiest way might be to manually define a csv.Dialect subclass.
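
Both options might look something like the sketch below (Python 3 assumed). The sample data and the `PipeDialect` name are invented for illustration; `csv.excel`, `csv.Dialect` and `csv.DictReader` are the real module names.

```python
import csv
import io

# Hypothetical one-column data; the addresses are made up.
data = io.StringIO('Email\nfoo@example.com\nbar@example.org\n')

# Option 1: pass the predefined dialect explicitly instead of sniffing one.
reader = csv.DictReader(data, dialect=csv.excel)
fieldnames = reader.fieldnames
rows = list(reader)
print(fieldnames)  # ['Email']

# Option 2: a hand-written dialect (the class name is illustrative).
class PipeDialect(csv.Dialect):
    """Dialect for pipe-delimited files."""
    delimiter = '|'
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = '\r\n'
    quoting = csv.QUOTE_MINIMAL

piped = list(csv.reader(io.StringIO('Email|Name\nfoo@example.com|Foo\n'),
                        dialect=PipeDialect))
```

Since a one-column file contains no delimiter at all, any dialect whose delimiter never occurs in the data should read it correctly.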

Hope this helps,

-- HansM

Tim Chase · Jul 19, 2012

Perhaps it should be documented that the Sniffer doesn't work
on single-column data.

I think this would involve the least change in existing code, and
go a long way towards removing my surprise.

If you really need to read a one-column csv file, you'll have
to find some other way to produce a Dialect object. Perhaps the
predefined 'csv.excel' dialect matches your data. If not, the
easiest way might be to manually define a csv.Dialect subclass.

The problem I'm trying to solve is "here's a filename that might be
comma/pipe/tab delimited, it has an 'email' column at minimum, and
perhaps a couple others of interest if they were included." It's
improbable that it's ONLY an email column, but my tests happened to
snag this edge case. I can likely do my own sniffing by reading the
first line, checking for tabs then pipes then commas (perhaps
biasing the order based on the file-extension of .csv vs. .txt), and
then building my own dialect information to pass to csv.DictReader.
It just seems unfortunate that the sniffer would ever consider
[a-zA-Z0-9] as a valid delimiter.
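
A minimal sketch of that do-it-yourself approach, assuming Python 3 and a seekable text stream. The helper names `pick_delimiter` and `open_reader` are hypothetical, and the fallback to a comma for single-column input is an assumption rather than anything the csv module prescribes.

```python
import csv
import io

def pick_delimiter(first_line, candidates=('\t', '|', ',')):
    """Return the first candidate found in the header line, else ','."""
    for delim in candidates:
        if delim in first_line:
            return delim
    # Single-column file: any delimiter that never occurs in the data works.
    return ','

def open_reader(stream):
    """Build a DictReader, guessing the delimiter from the first line."""
    first_line = stream.readline()
    stream.seek(0)  # rewind so DictReader sees the header too
    return csv.DictReader(stream, delimiter=pick_delimiter(first_line))

reader = open_reader(io.StringIO('Email\nfoo@example.com\n'))
print(reader.fieldnames)  # ['Email']
```

Reordering `candidates` would be one way to bias the check by file extension.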

-tkc

Dennis Lee Bieber · Jul 19, 2012

It just seems unfortunate that the sniffer would ever consider
[a-zA-Z0-9] as a valid delimiter.

I'd suspect the sniffer logic does not do any special casing -- any
/byte value/ is a candidate for the delimiter. This would allow for
usage of some old ASCII control characters -- things like x1F (unit
separator)

{Next is to rig the sniffer to identify x1F for fields, and x1E for
records <G>}

Steven D'Aprano · Jul 20, 2012

Perhaps it should be documented that the Sniffer doesn't work on
single-column data.

If you really need to read a one-column csv file, you'll have to find
some other way to produce a Dialect object. Perhaps the predefined
'csv.excel' dialect matches your data. If not, the easiest way might be
to manually define a csv.Dialect subclass.

Perhaps the csv module could do with a pre-defined "one column" dialect.
If anyone comes up with one, do consider proposing it as a patch on the
bug tracker.

Hans Mulder · Jul 20, 2012

It just seems unfortunate that the sniffer would ever consider
[a-zA-Z0-9] as a valid delimiter.

+1

I'd suspect the sniffer logic does not do any special casing
-- any /byte value/ is a candidate for the delimiter.

The sniffer prefers [',', '\t', ';', ' ', ':'] (in that order).
If none of those is found, it goes to the other extreme and considers
all characters equally likely.

This would allow for usage of some old ASCII control characters --
things like x1F (unit separator)

If the Sniffer excludes [a-zA-Z0-9] (or all alphanumerics) as
potential delimiters, then control characters such as "\x1F" are
still possible.
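
Until something like that exists, the optional `delimiters` argument of `Sniffer.sniff()` can be used to restrict the candidates by hand. A rough sketch, assuming Python 3; the wrapper name `sniff_delimiter`, the allowed set and the comma fallback are choices made for this example only.

```python
import csv

def sniff_delimiter(sample, allowed='\t|,;', default=','):
    """Sniff a delimiter from `allowed` only; fall back to `default`."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=allowed).delimiter
    except csv.Error:
        # Raised when none of the allowed characters fits the sample,
        # which is what should happen for single-column data.
        return default
```

To allow control characters as well, "\x1F" could simply be added to `allowed`.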

{Next is to rig the sniffer to identify x1F for fields, and x1E
for records <G>}

The sniffer will always guess '\r\n' as the line terminator.

That should not stop you from creating a dialect with '\x1E' as
the line terminator. Just don't expect the sniffer to recognize
that dialect.
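
One caveat: the documentation says the reader only recognises '\r' or '\n' as end-of-line and ignores the dialect's lineterminator, so such a dialect is mainly useful for writing. A sketch of a round trip under that constraint, assuming Python 3 and made-up data:

```python
import csv
import io

US, RS = '\x1f', '\x1e'  # ASCII unit and record separators

rows = [['Email', 'Name'], ['foo@example.com', 'Foo']]  # made-up data

# Writing honours both the delimiter and the line terminator.
buf = io.StringIO(newline='')
csv.writer(buf, delimiter=US, lineterminator=RS).writerows(rows)
encoded = buf.getvalue()

# The reader will not split on RS, so split the records by hand and feed
# the pieces to csv.reader, which accepts any iterable of strings.
records = [rec for rec in encoded.split(RS) if rec]
decoded = list(csv.reader(records, delimiter=US))
```

Splitting by hand like this would break on a quoted field that contains RS, so it only suits data where that cannot happen.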

-- HansM

Dennis Lee Bieber · Jul 20, 2012

The sniffer will always guess '\r\n' as the line terminator.

That should not stop you from creating a dialect with '\x1E' as
the line terminator. Just don't expect the sniffer to recognize
that dialect.

{devil's advocate}: Maybe it's time to expand the CSV module... Of
course, if we set it to recognize x1E as a record separator, we should
be fair and also incorporate the other two ASCII "separator" codes.

x1D (group separator) could be used to signal a "new table" -- i.e., a
change in record structure (number of columns, header labels). And then
x1C (file separator) could represent a new "worksheet" (in Excel terms).

We'd need some sort of flag/query method to detect these changes, of
course.

    while not csv.EndOfSheet():
        while not csv.EndOfTable():
            ...

And then there is the potential of using <VT> and <FF> as
equivalents for x1D and x1C (for those files using <TAB> and <CR><LF> as
field/record separators).
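
None of this exists in the csv module today (csv.EndOfSheet and csv.EndOfTable above are hypothetical), but the nesting itself is easy to sketch with plain string splitting. The function name `parse_separated` is made up, Python 3 is assumed, and there is no quoting or escaping at all:

```python
FS, GS, RS, US = '\x1c', '\x1d', '\x1e', '\x1f'  # file, group, record, unit

def parse_separated(text):
    """Split text into sheets > tables > records > fields.

    Plain str.split with no quoting or escaping; purely illustrative.
    """
    return [
        [
            [record.split(US) for record in table.split(RS)]
            for table in sheet.split(GS)
        ]
        for sheet in text.split(FS)
    ]

# Two sheets: the first holds two tables, the second a single field.
text = 'a\x1fb\x1e1\x1f2\x1dx\x1ey\x1cq'
sheets = parse_separated(text)
```

Here `sheets[0][0]` is the first table of the first sheet, i.e. the rows ['a', 'b'] and ['1', '2'].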
