RE Help splitting CVS data

Garry · Jan 20, 2013

I'm trying to manipulate family tree data using Python.
I'm using linux and Python 2.7.3 and have data files saved as Linux formatted cvs files
The data appears in this format:

Marriage,Husband,Wife,Date,Place,Source,Note0x0a
Note: the Source field or the Note field can contain quoted data (same as the Place field)

Actual data:
[F0244],[I0690],[I0354],1916-06-08,"Neely's Landing, Cape Gir. Co, MO",,0x0a
[F0245],[I0692],[I0355],1919-09-04,"Cape Girardeau Co, MO",,0x0a

code snippet follows:

import os
import re
#I'm using the following regex in an attempt to decode the data:
RegExp2 = "^(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\d{,4}\-\d{,2}\-\d{,2})\,(.*|\".*\")\,(.*|\".*\")\,(.*|\".*\")"
#
line = "[F0244],[I0690],[I0354],1916-06-08,\"Neely's Landing, Cape Gir. Co, MO\",,"
#
(Marriage,Husband,Wife,Date,Place,Source,Note) = re.split(RegExp2,line)
#
#However, this does not decode the 7 fields.
# The following error is displayed:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: too many values to unpack
#
# When I use xx the fields apparently get unpacked.
xx = re.split(RegExp2,line)
#

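print xx[0]    # prints an empty string
print xx[1]    # prints [F0244]
print xx[5]    # prints "Neely's Landing, Cape Gir. Co, MO"
print xx[6]    # prints an empty string
print xx[7]    # prints an empty string
print xx[8]    # prints an empty string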

Why is there an extra NULL field before and after my record contents?
I'm stuck, comments and solutions greatly appreciated.

Garry

Mitya Sirenef · Jan 20, 2013

Gosh, you really don't want to use regex to split csv lines like that....

Use the csv module:

>>> s = '[F0244],[I0690],[I0354],1916-06-08,"Neely\'s Landing, Cape Gir. Co, MO",,0x0a'
>>> import csv
>>> r = csv.reader([s])
>>> for l in r: print(l)
...
['[F0244]', '[I0690]', '[I0354]', '1916-06-08', "Neely's Landing, Cape Gir. Co, MO", '', '0x0a']

The argument to csv.reader can be a file object (or a list of lines).

- mitya
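As a rough sketch of the file-object case mentioned above: the snippet below feeds csv.reader a file-like object and unpacks each row into the seven names from the original post. It is written for Python 3 (the thread used 2.7.3), and io.StringIO stands in for a real open file, which is an assumption made only to keep the example self-contained.

```python
import csv
import io

# Stand-in for an open data file; real code would use something like
# open(path, newline="") instead. The two records are the sample lines
# from the original post, without the 0x0a placeholder.
data = io.StringIO(
    '[F0244],[I0690],[I0354],1916-06-08,"Neely\'s Landing, Cape Gir. Co, MO",,\n'
    '[F0245],[I0692],[I0355],1919-09-04,"Cape Girardeau Co, MO",,\n'
)

records = []
for row in csv.reader(data):
    # For this data every row has seven fields, so tuple unpacking works
    # without the ValueError that re.split produced.
    marriage, husband, wife, date, place, source, note = row
    records.append((marriage, husband, wife, date, place, source, note))

for record in records:
    print(record[0], record[4])
```

The csv module strips the surrounding quotes from the Place field and keeps the embedded commas, which is the part the regex approach struggles with.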

Terry Reedy · Jan 20, 2013

I'm trying to manipulate family tree data using Python.
I'm using linux and Python 2.7.3 and have data files saved as Linux formatted cvs files ....
I'm stuck, comments and solutions greatly appreciated.

Why are you not using the cvs module?

Roy Smith · Jan 21, 2013

Garry said:
Actual data:
[F0244],[I0690],[I0354],1916-06-08,"Neely's Landing, Cape Gir. Co, MO",,0x0a
[F0245],[I0692],[I0355],1919-09-04,"Cape Girardeau Co, MO",,0x0a

code snippet follows:

import os
import re
#I'm using the following regex in an attempt to decode the data:

First suggestion, don't try to parse CSV data with regex. I'm a huge
regex fan, but it's just the wrong tool for this job. Use the built-in
csv module (http://docs.python.org/2/library/csv.html). Or, if you want
something fancier, read_csv() from pandas (http://tinyurl.com/ajxdxjm).

Second, when you use regexes, *always* use raw strings around the
pattern:

RegExp2 = r'....'
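A small illustration of why the raw string matters, using a made-up pattern rather than Garry's. His pattern happens to survive without the r prefix because Python leaves unrecognized escapes such as \d alone (newer Python 3 versions warn about them), but escapes that Python does recognize, like \b, are silently changed:

```python
import re

# In a normal string literal "\b" is a single backspace character (ASCII 8);
# in a raw string it stays as backslash + "b", which re reads as a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain), len(raw))

text = "a cat sat"
plain_match = re.search(plain, text)   # looks for backspace, "cat", backspace
raw_match = re.search(raw, text)       # looks for the whole word "cat"
print(plain_match, raw_match)
```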

Lastly, take a look at the re.VERBOSE flag. It lets you write monster
regexes split up into several lines. Between re.VERBOSE and raw
strings, it can make the difference between line noise like this:

RegExp2 = "^(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\d{,4}\-\d{,2}\-\d{,2})\,(.*|\".*\")\,(.*|\".*\")\,(.*|\".*\")"

and something that mere mortals can understand.
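For illustration only, here is one way the pattern might look with a raw string and re.VERBOSE. This is a sketch, not the original poster's regex: the name RECORD, the tightened date format, and the quoted-or-bare field alternatives are all assumptions. It also shows where the two extra empty fields came from: re.split returns the text before and after each match along with the captured groups, and since the pattern matches the whole line, both of those outer pieces are empty strings.

```python
import re

# Hypothetical rewrite of the thread's pattern (Python 3).
RECORD = re.compile(r'''
    ^
    (\[[A-Z]\d+\]) ,            # marriage ID such as [F0244]
    (\[[A-Z]\d+\]) ,            # husband ID
    (\[[A-Z]\d+\]) ,            # wife ID
    (\d{4}-\d{2}-\d{2}) ,       # date as YYYY-MM-DD (assumed always complete)
    ("[^"]*" | [^,]*) ,         # place: quoted text or a bare value
    ("[^"]*" | [^,]*) ,         # source
    ("[^"]*" | [^,]*)           # note
    $
''', re.VERBOSE)

line = '[F0244],[I0690],[I0354],1916-06-08,"Neely\'s Landing, Cape Gir. Co, MO",,'

# match() plus groups() yields exactly the seven captured fields.
fields = RECORD.match(line).groups()

# split() yields text-before-match, the seven groups, then text-after-match.
pieces = RECORD.split(line)

print(len(fields))
print(len(pieces), repr(pieces[0]), repr(pieces[-1]))
```

Note that the Place field still carries its double quotes here, so the csv module remains the simpler choice for this data.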

Tim Chase · Jan 21, 2013

Why are you not using the cvs module?

That's an easy answer:

>>> import cvs
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named cvs

Now the *csv* module... ;-)

-tkc

Garry · Jan 21, 2013

Thanks everyone for your comments. I'm new to Python, but can get around in Perl and regular expressions. I sure was taking the long way trying to get the cvs data parsed.

Sure hope to teach myself python. Maybe I need to look into courses offered at the local Jr College!

Garry

Chris Angelico · Jan 21, 2013

Thanks everyone for your comments. I'm new to Python, but can get around in Perl and regular expressions. I sure was taking the long way trying to get the cvs data parsed.

As has been hinted by Tim, you're actually talking about csv data -
Comma Separated Values. Not to be confused with cvs, an old vcs. (See?
The v can go anywhere...) Not a big deal, but it's much easier to find
stuff on PyPI or similar when you have the right keyword to search
for!

ChrisA

Neil Cerutti · Jan 21, 2013

Thanks everyone for your comments. I'm new to Python, but can
get around in Perl and regular expressions. I sure was taking
the long way trying to get the cvs data parsed.

Sure hope to teach myself python. Maybe I need to look into
courses offered at the local Jr College!

There are more than enough free resources online for the
resourceful Perl programmer to get going. It sounds like you
might be interested in Text Processing in Python.

http://gnosis.cx/TPiP/

Also good for your purposes is Dive Into Python.

http://www.diveintopython.net/
