text file parsing (awk -> python)

D

Daniel Nogradi

Hi list,

I have an awk program that parses a text file which I would like to
rewrite in python. The text file has multi-line records separated by
empty lines and each single-line field has two subfields:

node 10
x -1
y 1

node 11
x -2
y 1

node 12
x -3
y 1

and this I would like to parse into a list of dictionaries like so:

mydict[0] = { 'node':10, 'x':-1, 'y':1 }
mydict[1] = { 'node':11, 'x':-2, 'y':1 }
mydict[2] = { 'node':12, 'x':-3', 'y':1 }

But the names of the fields (node, x, y) keeps changing from file to
file, even their number is not fixed, sometimes it is (node, x, y, z).

What would be the simples way to do this?
 
P

Peter Otten

Daniel said:
I have an awk program that parses a text file which I would like to
rewrite in python. The text file has multi-line records separated by
empty lines and each single-line field has two subfields:

node 10
x -1
y 1

node 11
x -2
y 1

node 12
x -3
y 1

and this I would like to parse into a list of dictionaries like so:

mydict[0] = { 'node':10, 'x':-1, 'y':1 }
mydict[1] = { 'node':11, 'x':-2, 'y':1 }
mydict[2] = { 'node':12, 'x':-3', 'y':1 }

But the names of the fields (node, x, y) keeps changing from file to
file, even their number is not fixed, sometimes it is (node, x, y, z).

What would be the simples way to do this?

data = """node 10
x -1
y 1

node 11
x -2
y 1

node 12
x -3
y 1
"""

def open(filename):
from cStringIO import StringIO
return StringIO(data)

converters = dict(
x=int,
y=int
)

def name_value(line):
name, value = line.split(None, 1)
return name, converters.get(name, str.rstrip)(value)

if __name__ == "__main__":
from itertools import groupby
records = []

for empty, record in groupby(open("records.txt"), key=str.isspace):
if not empty:
records.append(dict(name_value(line) for line in record))

import pprint
pprint.pprint(records)
 
D

Daniel Nogradi

I have an awk program that parses a text file which I would like to
rewrite in python. The text file has multi-line records separated by
empty lines and each single-line field has two subfields:

node 10
x -1
y 1

node 11
x -2
y 1

node 12
x -3
y 1

and this I would like to parse into a list of dictionaries like so:

mydict[0] = { 'node':10, 'x':-1, 'y':1 }
mydict[1] = { 'node':11, 'x':-2, 'y':1 }
mydict[2] = { 'node':12, 'x':-3', 'y':1 }

But the names of the fields (node, x, y) keeps changing from file to
file, even their number is not fixed, sometimes it is (node, x, y, z).

What would be the simples way to do this?

data = """node 10
x -1
y 1

node 11
x -2
y 1

node 12
x -3
y 1
"""

def open(filename):
from cStringIO import StringIO
return StringIO(data)

converters = dict(
x=int,
y=int
)

def name_value(line):
name, value = line.split(None, 1)
return name, converters.get(name, str.rstrip)(value)

if __name__ == "__main__":
from itertools import groupby
records = []

for empty, record in groupby(open("records.txt"), key=str.isspace):
if not empty:
records.append(dict(name_value(line) for line in record))

import pprint
pprint.pprint(records)


Thanks very much, that's exactly what I had in mind.

Thanks again,
Daniel
 
B

bearophileHUGS

Peter Otten, your solution is very nice, it uses groupby splitting on
empty lines, so it doesn't need to read the whole files into memory.

But Daniel Nogradi says:
But the names of the fields (node, x, y) keeps changing from file to
file, even their number is not fixed, sometimes it is (node, x, y, z).

Your version with the converters dict fails to convert the number of
node, z fields, etc. (generally using such converters dict is an
elegant solution, it allows to define string, float, etc fields):
converters = dict(
x=int,
y=int
)


I have created a version with a RE, but it's probably too much rigid,
it doesn't handle files with the z field, etc:

data = """node 10
y 1
x -1

node 11
x -2
y 1
z 5

node 12
x -3
y 1
z 6"""

import re
unpack = re.compile(r"(\D+) \s+ ([-+]? \d+) \s+" * 3, re.VERBOSE)

result = []
for obj in unpack.finditer(data):
block = obj.groups()
d = dict((block, int(block[i+1])) for i in xrange(0, 6, 2))
result.append(d)

print result


So I have just modified and simplified your quite nice solution (I have
removed the pprint, but it's the same):

def open(filename):
from cStringIO import StringIO
return StringIO(data)

from itertools import groupby

records = []
for empty, record in groupby(open("records.txt"), key=str.isspace):
if not empty:
pairs = ([k, int(v)] for k,v in map(str.split, record))
records.append(dict(pairs))

print records

Bye,
bearophile
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,230
Members
46,818
Latest member
Brigette36

Latest Threads

Top