Using csv.DictReader with \r\n in the middle of fields

pstatham

Hello everyone!

Hopefully this will interest some. I have a csv file (it can be
downloaded from http://www.paulstathamphotography.co.uk/45.txt) which
has five fields separated by ~ delimiters. To read this I've been
using a csv.DictReader, which works in 99% of cases. Occasionally,
however, the description field has errant \r\n characters in the middle
of the record. This causes the reader to assume it's a new record and
try to read it.

Here's the code I had

import csv

fields = ["PROGTITLE", "SUBTITLE", "EPISODE", "DESCRIPTION", "DATE"]
delim = '~'

lineReader = csv.DictReader(open('45.txt', 'rbU'),
                            delimiter=delim, fieldnames=fields)

def FormatDate(date):
    return date[6:10] + "-" + date[3:5] + "-" + date[0:2]

channelPrograms = []

for row in lineReader:
    row["DATE"] = FormatDate(row["DATE"])
    channelPrograms.append(row)

When run, this gives me an error because it tries to pass None to the
FormatDate function, which obviously can't handle it.
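
The None comes from DictReader itself: when a row has fewer values than
field names, the missing keys are filled with restval, which defaults to
None. A minimal sketch of that behaviour, written for Python 3 and using a
made-up one-record sample in place of 45.txt:

```python
import csv
import io

fields = ["PROGTITLE", "SUBTITLE", "EPISODE", "DESCRIPTION", "DATE"]

# Hypothetical sample: one record whose description contains a stray \r\n.
sample = "Title~Sub~1~first half\r\nsecond half~01/02/2010\r\n"

reader = csv.DictReader(io.StringIO(sample, newline=""),
                        delimiter="~", fieldnames=fields)
rows = list(reader)

# The single record comes back as two short rows; keys with no value are None.
for row in rows:
    print(row["DESCRIPTION"], row["DATE"])
```

The second row is the tail of the description, shifted into PROGTITLE and
SUBTITLE, which is why slicing row["DATE"] fails.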

I'd like to find a way to read this record correctly despite the \r\n's
in the middle of the description. The problem is that I can't change
how the reader decides what a record is.

For the moment I've had to resort to extending csv.DictReader and
overriding the next() method to record the number of fields and the
number of values. If they're not equal, I don't add that row to my
list of records.

import csv

class ChanDictReader(csv.DictReader):
    def __init__(self, f, fieldnames=None, restkey=None, restval=None,
                 dialect="excel", *args, **kwds):
        csv.DictReader.__init__(self, f, fieldnames, restkey, restval,
                                dialect, *args, **kwds)
        self.lf = 0
        self.lr = 0

    def next(self):
        if self.line_num == 0:
            # Used only for its side effect.
            self.fieldnames
        row = self.reader.next()
        self.line_num = self.reader.line_num

        # unlike the basic reader, we prefer not to return blanks,
        # because we will typically wind up with a dict full of None
        # values
        while row == []:
            row = self.reader.next()
        d = dict(zip(self.fieldnames, row))
        self.lf = len(self.fieldnames)
        self.lr = len(row)
        if self.lf < self.lr:
            d[self.restkey] = row[self.lf:]
        elif self.lf > self.lr:
            for key in self.fieldnames[self.lr:]:
                d[key] = self.restval
        return d

fields = ["PROGTITLE", "SUBTITLE", "EPISODE", "DESCRIPTION", "DATE"]
delim = '~'

lineReader = ChanDictReader(open('45.txt', 'rbU'),
                            delimiter=delim, fieldnames=fields)

def FormatDate(date):
    return date[6:10] + "-" + date[3:5] + "-" + date[0:2]

channelPrograms = []

for row in lineReader:
    print "Number of fields: " + str(lineReader.lf) + " Number of values: " + str(lineReader.lr)
    if lineReader.lf == lineReader.lr:
        row["DATE"] = FormatDate(row["DATE"])
        channelPrograms.append(row)
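
An alternative to discarding the short rows is to stitch them back
together. The sketch below is written for Python 3 and assumes the stray
line breaks only ever occur inside DESCRIPTION, so a record is complete
once five values have been collected. merged_rows is a hypothetical helper,
not part of the csv module, and the sample data is made up.

```python
import csv
import io

FIELDS = ["PROGTITLE", "SUBTITLE", "EPISODE", "DESCRIPTION", "DATE"]

def merged_rows(lines, delimiter="~", nfields=len(FIELDS)):
    """Yield lists of nfields values, rejoining rows split by stray line breaks."""
    pending = []
    for row in csv.reader(lines, delimiter=delimiter):
        if not row:
            continue  # blank physical line
        if pending:
            # Continuation: its first value is the rest of the last pending value.
            pending[-1] += " " + row[0]
            pending.extend(row[1:])
        else:
            pending = row
        if len(pending) >= nfields:
            yield pending
            pending = []
    # Note: an incomplete record left over at end of input is silently dropped.

sample = ("A~B~1~first half\r\nsecond half~01/02/2010\r\n"
          "C~D~2~whole~03/04/2011\r\n")
records = [dict(zip(FIELDS, row))
           for row in merged_rows(io.StringIO(sample, newline=""))]
```

The line break is replaced by a single space here; whether that is the
right substitute depends on how the descriptions are displayed.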

Anyone have any ideas? :eek:)

Paul
 
Neil Cerutti

Here's an alternative idea. Working with csv module for this job
is too difficult for me. ;)

import re

record_re = "(?P<PROGTITLE>.*?)~(?P<SUBTITLE>.*?)~(?P<EPISODE>.*?)~(?P<DESCRIPTION>.*?)~(?P<DATE>.*?)\n(.*)"

def parse_file(fname):
    with open(fname) as f:
        data = f.read()
        m = re.match(record_re, data, flags=re.M | re.S)
        while m:
            yield m.groupdict()
            m = re.match(record_re, m.group(6), flags=re.M | re.S)

for record in parse_file('45.txt'):
    print(record)
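
A possible variant of the same idea uses re.finditer so the remainder of
the file is not re-matched and copied for every record. This is a sketch
for Python 3 under two assumptions that the thread does not confirm: "~"
never appears inside a field, and DATE never contains a line break.
parse_text and the sample string are illustrative only.

```python
import re

# Negated character classes match line breaks too, so DESCRIPTION may span lines.
RECORD_RE = re.compile(
    r"(?P<PROGTITLE>[^~]*)~(?P<SUBTITLE>[^~]*)~(?P<EPISODE>[^~]*)~"
    r"(?P<DESCRIPTION>[^~]*)~(?P<DATE>[^~\r\n]*)(?:\r?\n|$)"
)

def parse_text(text):
    """Yield one dict per record found in text."""
    for m in RECORD_RE.finditer(text):
        yield m.groupdict()

sample = ("A~B~1~first half\r\nsecond half~01/02/2010\r\n"
          "C~D~2~whole~03/04/2011")
records = list(parse_text(sample))
```

When reading from a file in Python 3, opening it with newline="" keeps the
\r\n pairs as they are in the data, so the pattern sees the same text as
in the sample.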
 
Dennis Lee Bieber

How is the data file being generated? Could the generation procedure
be modified?

While I've not tested it, my understanding of the documentation
indicates that the reader /can/ handle multi-line fields IF QUOTED...
(you may still have to strip the terminator out of the description data
after it has been loaded).

That is:

Some Title~Subtitle~Episode~"A description with<cr><lf>
an embedded new line terminator"~Date

should be properly parsed.
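
A small sketch of that case, written for Python 3 and using the sample
line above with the <cr><lf> written out as \r\n. It also shows one way to
strip the terminator afterwards; joining the pieces with a space is an
assumption about what the cleaned description should look like.

```python
import csv
import io

sample = ('Some Title~Subtitle~Episode~"A description with\r\n'
          'an embedded new line terminator"~Date\r\n')

# newline="" leaves the \r\n inside the quoted field untouched.
reader = csv.reader(io.StringIO(sample, newline=""), delimiter="~")
rows = list(reader)

# Collapse any embedded line break in each value into a single space.
cleaned = [[" ".join(value.splitlines()) for value in row] for row in rows]
print(cleaned[0][3])
```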
 
Tim Chase

I believe handling of line terminators inside quoted fields was fixed
in 2.5. The following worked in 2.5 but 2.4 rejected it:

# saved as testr.py
from cStringIO import StringIO
from csv import DictReader

data = StringIO(
    'one,"two two",three\n'
    '"1a\r1b","2a\n2b","3a\r\n3b"\n'
    '"1x\r1y","2x\n2y","3x\r\n3y"\n'
    )

data.reset()
dr = DictReader(data)
for row in dr:
    for k,v in row.iteritems():
        print '%r ==> %r' % (k,v)


tim@rubbish:~/tmp$ python2.5 testr.py
'two two' ==> '2a\n2b'
'three' ==> '3a\r\n3b'
'one' ==> '1a\r1b'
'two two' ==> '2x\n2y'
'three' ==> '3x\r\n3y'
'one' ==> '1x\r1y'
tim@rubbish:~/tmp$ python2.4 testr.py
Traceback (most recent call last):
  File "testr.py", line 12, in ?
    for row in dr:
  File "/usr/lib/python2.4/csv.py", line 109, in next
    row = self.reader.next()
_csv.Error: newline inside string



-tkc
 
pstatham

Thanks guys. I can't alter the source data.

I wouldn't have considered regex, but it's a good idea, as I can then
define my own record structure instead of the reader dictating to me
what a record is.
 
