text file parsing (awk -> python)

Daniel Nogradi · Nov 22, 2006

Hi list,

I have an awk program that parses a text file which I would like to
rewrite in python. The text file has multi-line records separated by
empty lines and each single-line field has two subfields:

node 10
x -1
y 1

node 11
x -2
y 1

node 12
x -3
y 1

and this I would like to parse into a list of dictionaries like so:

mydict[0] = { 'node':10, 'x':-1, 'y':1 }
mydict[1] = { 'node':11, 'x':-2, 'y':1 }
mydict[2] = { 'node':12, 'x':-3, 'y':1 }

But the names of the fields (node, x, y) keep changing from file to
file, and even their number is not fixed; sometimes it is (node, x, y, z).

What would be the simplest way to do this?

Peter Otten · Nov 22, 2006

data = """node 10
x -1
y 1

node 11
x -2
y 1

node 12
x -3
y 1
"""

# Stand-in for the real file, so the example runs without records.txt
def open(filename):
    from cStringIO import StringIO
    return StringIO(data)

# Fields listed here are converted; all others stay strings
converters = dict(
    x=int,
    y=int
)

def name_value(line):
    name, value = line.split(None, 1)
    return name, converters.get(name, str.rstrip)(value)

if __name__ == "__main__":
    from itertools import groupby
    records = []

    # groupby yields alternating runs of blank and non-blank lines
    for empty, record in groupby(open("records.txt"), key=str.isspace):
        if not empty:
            records.append(dict(name_value(line) for line in record))

    import pprint
    pprint.pprint(records)
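
The code above targets Python 2 (cStringIO). For readers on Python 3, the same groupby approach might look roughly like the sketch below. The names DATA, CONVERTERS and parse_records are illustrative assumptions, not part of the original post, and io.StringIO stands in for the real file.

```python
from io import StringIO
from itertools import groupby

DATA = """node 10
x -1
y 1

node 11
x -2
y 1

node 12
x -3
y 1
"""

# Per-field converters (assumed); fields not listed stay stripped strings.
CONVERTERS = {"x": int, "y": int}


def name_value(line, converters=CONVERTERS):
    # Split "name value" once, then convert the value by field name.
    name, value = line.split(None, 1)
    return name, converters.get(name, str.strip)(value)


def parse_records(lines, converters=CONVERTERS):
    # Hypothetical helper: one dict per run of non-blank lines.
    records = []
    for empty, group in groupby(lines, key=str.isspace):
        if not empty:
            records.append(dict(name_value(line, converters) for line in group))
    return records


records = parse_records(StringIO(DATA))
print(records)
```

As in the original, node is left as a string here because it has no entry in the converters mapping.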

Daniel Nogradi · Nov 22, 2006

Thanks very much, that's exactly what I had in mind.

Thanks again,
Daniel

bearophileHUGS · Nov 22, 2006

Peter Otten, your solution is very nice: it uses groupby to split on
empty lines, so it doesn't need to read the whole file into memory.

But Daniel Nogradi says:

But the names of the fields (node, x, y) keeps changing from file to
file, even their number is not fixed, sometimes it is (node, x, y, z).

Your version with the converters dict does not convert the values of
the node and z fields, etc. (In general, such a converters dict is an
elegant solution, since it lets you declare string, float, etc. fields):

converters = dict(
    x=int,
    y=int
)
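
One possible way to keep the converters dict and still cope with field names that are not known in advance is a fallback converter that tries int, then float, and otherwise keeps the text. This is only a sketch written for Python 3; auto_convert and the label entry are hypothetical names introduced for illustration.

```python
def auto_convert(text):
    # Fallback (assumed): try int, then float, else keep the stripped string.
    text = text.strip()
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text


# Explicit entries still win; "label" is a made-up field that must stay a string.
converters = {"label": str.strip}


def name_value(line):
    name, value = line.split(None, 1)
    return name, converters.get(name, auto_convert)(value)


print(name_value("node 10\n"))
print(name_value("z 2.5\n"))
```

With this fallback, only the fields that need special treatment have to be listed explicitly.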

I have created a version with a RE, but it's probably too rigid;
it doesn't handle files with the z field, etc.:

data = """node 10
y 1
x -1

node 11
x -2
y 1
z 5

node 12
x -3
y 1
z 6"""

import re
# Exactly three "name number" pairs per match, hence the rigidity
unpack = re.compile(r"(\D+) \s+ ([-+]? \d+) \s+" * 3, re.VERBOSE)

result = []
for obj in unpack.finditer(data):
    block = obj.groups()
    # Groups alternate name, value, name, value, ...
    d = dict((block[i], int(block[i+1])) for i in xrange(0, 6, 2))
    result.append(d)

print result
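
The fixed "* 3" repetition is what ties the pattern to exactly three fields per record. A less rigid regex variant, sketched here for Python 3, splits the text on blank lines first and then matches one field per line. It reads the whole text into memory, and the FIELD pattern (integer values only) is an assumption made for this illustration.

```python
import re

data = """node 10
y 1
x -1

node 11
x -2
y 1
z 5

node 12
x -3
y 1
z 6"""

# One "name integer" pair per line; MULTILINE anchors ^ and $ at line boundaries.
FIELD = re.compile(r"^(\S+)[ \t]+([-+]?\d+)[ \t]*$", re.MULTILINE)

result = []
# Records are separated by one or more blank lines.
for block in re.split(r"\n\s*\n", data.strip()):
    result.append({name: int(value) for name, value in FIELD.findall(block)})

print(result)
```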

So I have just modified and simplified your quite nice solution (I have
removed the pprint, but it's the same):

def open(filename):
    from cStringIO import StringIO
    return StringIO(data)

from itertools import groupby

records = []
for empty, record in groupby(open("records.txt"), key=str.isspace):
    if not empty:
        pairs = ([k, int(v)] for k,v in map(str.split, record))
        records.append(dict(pairs))

print records
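
The same loop could also be written as a generator, so that records are produced one at a time and the file is never held in memory as a whole. The sketch below assumes Python 3 and integer-only values, as in the loop above; iter_records is a hypothetical name and io.StringIO stands in for the real file.

```python
from io import StringIO
from itertools import groupby


def iter_records(lines):
    """Yield one dict per blank-line-separated record, converting values with int()."""
    for empty, group in groupby(lines, key=str.isspace):
        if not empty:
            # Each line splits into exactly two parts: field name and value.
            yield {name: int(value) for name, value in map(str.split, group)}


# Sample input with a varying number of fields per record.
sample = StringIO("node 10\nx -1\ny 1\n\nnode 11\nx -2\ny 1\nz 5\n")
records = list(iter_records(sample))
print(records)
```

Any iterable of lines can be passed in, so an open file object for records.txt should work the same way as the StringIO stand-in.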

Bye,
bearophile
