Parse each line by character location


Tyler

Hello All:

I hope this is the right place to ask, but I am trying to come up with
a way to parse each line of a file. Unfortunately, the file is neither
comma, nor tab, nor space delimited. Rather, the character locations
imply what field it is.

For example:

The first ten characters would be the record number, the next
character is the client type, the next ten characters are a volume,
and the next three are order type, and the last character would be an
optional type depending on the order type.

The lines are somewhat more complicated, but they work like that, and
not all fields have to be populated, in that they may contain spaces.
For example, the order number may be 2345, space padded at the
beginning of the line, while others might be zero padded in front.
Imagine I have a line:

______2345H0000300000_NC_

where the underscores indicate a space. I then want to map this to:

2345,H,0000300000,NC,

In other words, I want to preserve ALL of the fields, but map to
something that awk could easily cut up afterwards, or open in a CSV
editor. I am unsure how to place the commas based on character
location.

Any ideas?
 

Tim Chase

I hope this is the right place to ask, but I am trying to come up with
a way to parse each line of a file. Unfortunately, the file is neither
comma, nor tab, nor space delimited. Rather, the character locations
imply what field it is.

For example:

The first ten characters would be the record number, the next
character is the client type, the next ten characters are a volume,
and the next three are order type, and the last character would be an
optional type depending on the order type.

Sounds like you could do something like

recno_idx = slice(0,10)
client_idx = slice(10, 11)
volume_idx = slice(11,11+10)
order_type_idx = slice(11+10, 11+10+3)

out = file('out.txt', 'w')
for line in file('in.txt'):
    out.write(','.join([
        line[recno_idx],
        line[client_idx],
        line[volume_idx],
        line[order_type_idx],
        line[11+10+3:],
        ]))
out.close()

-tkc
 

George Sakkis

Hello All:

I hope this is the right place to ask, but I am trying to come up with
a way to parse each line of a file. Unfortunately, the file is neither
comma, nor tab, nor space delimited. Rather, the character locations
imply what field it is.

For example:

The first ten characters would be the record number, the next
character is the client type, the next ten characters are a volume,
and the next three are order type, and the last character would be an
optional type depending on the order type.

The lines are somewhat more complicated, but they work like that, and
not all fields have to be populated, in that they may contain spaces.
For example, the order number may be 2345, space padded at the
beginning of the line, while others might be zero padded in front.
Imagine I have a line:

______2345H0000300000_NC_

where the underscores indicate a space. I then want to map this to:

2345,H,0000300000,NC,

In other words, I want to preserve ALL of the fields, but map to
something that awk could easily cut up afterwards, or open in a CSV
editor. I am unsure how to place the commas based on character
location.

Any ideas?

Here's a general solution for fixed size records:
>>> def slicer(*sizes):
...     slices = len(sizes) * [None]
...     start = 0
...     for i,size in enumerate(sizes):
...         stop = start+size
...         slices[i] = slice(start,stop)
...         start = stop
...     return lambda string: [string[s].strip() for s in slices]
...
>>> order_slicer = slicer(10,1,10,4)
>>> order_slicer('______2345H0000300000_NC_'.replace('_',' '))
['2345', 'H', '0000300000', 'NC']

HTH,
George
 

Cameron Laird

Sounds like you could do something like

recno_idx = slice(0,10)
client_idx = slice(10, 11)
volume_idx = slice(11,11+10)
order_type_idx = slice(11+10, 11+10+3)
.
.
.
!? That seems to me confusingly far from a working solution,
at least in comparison to

recno_idx = the_line[0:10]
client_idx = the_line[10:11]
...

What am I missing?
 

Arnaud Delobelle

Sounds like you could do something like

recno_idx = slice(0,10)
client_idx = slice(10, 11)
volume_idx = slice(11,11+10)
order_type_idx = slice(11+10, 11+10+3)
.
.
.
!? That seems to me confusingly far from a working solution,
at least in comparison to

recno_idx = the_line[0:10]
client_idx = the_line[10:11]
...

What am I missing?

I suppose in your case the slice objects will be re-created for each
line, whereas with Tim's solution they are created once and for all
before the parsing starts. This may result in speedier parsing.
 

Tim Chase

recno_idx = slice(0,10)
client_idx = slice(10, 11)
volume_idx = slice(11,11+10)
order_type_idx = slice(11+10, 11+10+3)
.
!? That seems to me confusingly far from a working solution,
at least in comparison to

recno_idx = the_line[0:10]
client_idx = the_line[10:11]
...

What am I missing?

The "11+10" and "11+10+3" are to help show where the magic
numbers come from...that they're column-offsets from the previous
position...I suppose to have been consistent, I should have used

client_idx = the_line[10:10+1]

Somewhat like a kludgy version of George Sakkis's more elegant
version of slicing, but with the advantage of associating names
with the slice-boundaries.

It would be possible to write it as something like

for line in file('in.txt'):
    out.write(','.join([
        line[0:10],   # recno
        line[10:11],  # client
        line[11:21],  # volume
        line[21:24],  # order
        line[24:],    # remainder
        ]))

but it's harder to verify that the slicing doesn't incur a
fencepost error, and it makes it harder to follow if
manipulations need further checking, like

if line[client_idx] == 'F': continue # skip this client

There are a number of ways to slice & dice the line. I recommend
whichever is easiest to read/understand.
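One such alternative, for what it's worth: the struct module can
express the same fixed-width layout as a format string. A sketch,
assuming the 25-character layout described earlier in the thread:

```python
import struct

# The layout as a struct format string: 10s record number, 1s client
# type, 10s volume, 3s order type, 1s optional type. struct requires
# the input length to match the format exactly (25 bytes here).
LAYOUT = struct.Struct("10s1s10s3s1s")

def split_line(line):
    # struct works on bytes, so encode first, then decode and strip
    # each extracted field back to text.
    return [field.decode().strip() for field in LAYOUT.unpack(line.encode())]

fields = split_line("      2345H0000300000 NC ")
# fields == ['2345', 'H', '0000300000', 'NC', '']
```

The format string doubles as documentation of the record layout, at
the cost of a hard failure on any line of the wrong length.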

-tkc
 

bearophileHUGS

George Sakkis:
Here's a general solution for fixed size records:
>>> def slicer(*sizes):
...     slices = len(sizes) * [None]
...     start = 0
...     for i,size in enumerate(sizes):
...         stop = start+size
...         slices[i] = slice(start,stop)
...         start = stop
...     return lambda string: [string[s].strip() for s in slices]
...
>>> order_slicer = slicer(10,1,10,4)
>>> order_slicer('______2345H0000300000_NC_'.replace('_',' '))
['2345', 'H', '0000300000', 'NC']



Nice. Here's a little modified version:

from collections import namedtuple

def slicer(names, sizes):
    """
    >>> sl = slicer(["code", "p1", "progressive", "label"], (10, 1, 10, 4))
    >>> txt = "______2345H0000300000_NC_"
    >>> print sl(txt.replace('_', ' '))
    Sliced(code='2345', p1='H', progressive='0000300000', label='NC')
    """
    # several input controls can be added here
    slices = []
    start = 0
    for size in sizes:
        stop = start + size
        slices.append(slice(start, stop))
        start = stop
    Sliced = namedtuple("Sliced", names)
    return lambda txt: Sliced(*(txt[s].strip() for s in slices))

if __name__ == "__main__":
    import doctest
    doctest.testmod()
    print "Doctests done.\n"

Bye,
bearophile
 

Tyler

        So you have a classic (especially for COBOL and older FORTRAN) fixed
field record layout, no?

Exactly, I believe COBOL. It is a daily reconciliation with an
exchange and our system's orders. One of the problems of dealing with
these old legacy systems that never seem to go away....
        I presume the entire file is of a single layout? That would mean
only one splitting format is needed...

Again, correct, I only need the one format as it does not change. I
want to move it to a csv file because the application that we are
implementing reads these natively, and does not have a built in
mechanism to parse a file like this effectively. Further, this will
allow me to add some other fields for our own use. As such, the final
file will look somewhat different, but this step kinda threw me. I
started out with AWK, and didn't know whether to laugh or cry.
Thankfully, I was able to convince the "guys" to allow me to install
Python on my dev workstation (although it's already on the production
SUSE servers).

Cheers,

t.
 

Lie

        So you have a classic (especially for COBOL and older FORTRAN) fixed
field record layout, no?

        I presume the entire file is of a single layout? That would mean
only one splitting format is needed...
[snip]

        Note that all fields are still in character format. And has been
noted, I'm sure, if you try to turn the third field into an integer, you
may have problems, depending upon how you do the conversion -- leading 0
implies octal if it were a literal, though it seems int() handles it
correctly (Python 2.5)

from help(int)
| ... If base is zero, the proper base is guessed based on the
| string content. ...

int(x) always converts x in base 10.
int(x, 0) converts based on literal int notation, i.e. the prefix
'0x' means base 16 (hex), the prefix '0' means base 8 (octal), and
everything else is base 10.
int(x, n) converts in base n, where 2 <= n <= 36.

If you're still in doubt, just pass the base explicitly:
int(x, 10) (superfluous though).
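These rules are easy to verify interactively. The examples below
assume a modern Python 3, where base 0 follows literal notation
strictly and octal is spelled with an 0o prefix (the bare leading
zero the quoted text mentions is Python 2 behavior):

```python
# The default base 10 is unfazed by leading zeros, so the zero-padded
# volume field from this thread converts cleanly:
volume = int("0000300000")
# volume == 300000

# Base 0 guesses from the prefix, like a literal:
assert int("0x1a", 0) == 26   # hex prefix
assert int("0o10", 0) == 8    # octal prefix (Python 3 spelling)

# An explicit base works regardless of padding, superfluous or not:
assert int("0000300000", 10) == 300000
```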
 

Shawn Milochik

I work with tab-delimited files for a living. Because of the same need
you have, I created a Python script to do this. It has usage
information that is easy to follow (just run it without any
arguments).

I hope someone else finds this useful. I have, and use it every month.
It can be easily modified to create comma-delimited files, but that's
something I never use, so it does tabs.

http://milochik.com/shawn/fwconvert.zip



Usage:
fwconvert -r rulesFile fileName [-t|-f]
or
cat filename | fwconvert -r rulesFile (-t|-f)

-t (to tab) or -f (to fixed-width) required when piping input to
script. Otherwise, it will be auto-determined.


Rules file format:
fieldStart:fieldLength,fieldStart:fieldLength...
Example:
1:3,4:20,24:5
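If that rules format is 1-based fieldStart:fieldLength pairs (my
reading of the example above, not fwconvert's actual code), it maps
onto Python slices in a few lines:

```python
def parse_rules(rules):
    # "1:3,4:20,24:5" -> [slice(0, 3), slice(3, 23), slice(23, 28)]
    # Each pair is a 1-based start column and a field length.
    slices = []
    for pair in rules.split(","):
        start, length = map(int, pair.split(":"))
        slices.append(slice(start - 1, start - 1 + length))
    return slices

def to_tabs(line, slices):
    # Join the extracted fields with tabs, as fwconvert's output does.
    return "\t".join(line[s] for s in slices)

slices = parse_rules("1:3,4:20,24:5")
# slices == [slice(0, 3), slice(3, 23), slice(23, 28)]
```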
 
