Parse each line by character location


Tyler

Hello All:

I hope this is the right place to ask, but I am trying to come up with
a way to parse each line of a file. Unfortunately, the file is neither
comma, nor tab, nor space delimited. Rather, the character locations
imply what field it is.

For example:

The first ten characters would be the record number, the next
character is the client type, the next ten characters are a volume,
and the next three are order type, and the last character would be an
optional type depending on the order type.

The lines are somewhat more complicated, but they work like that, and
not all fields have to be populated, in that they may contain spaces.
For example, the order number may be 2345, space padded at the
beginning of the line, while others might be zero padded in front.
Imagine I have a line:

______2345H0000300000_NC_

where the underscores indicate a space. I then want to map this to:

2345,H,0000300000,NC,

In other words, I want to preserve ALL of the fields, but map to
something that awk could easily cut up afterwards, or open in a CSV
editor. I am unsure how to place the commas based on character
location.

Any ideas?
 

Tim Chase

I hope this is the right place to ask, but I am trying to come up with
a way to parse each line of a file. Unfortunately, the file is neither
comma, nor tab, nor space delimited. Rather, the character locations
imply what field it is.

For example:

The first ten characters would be the record number, the next
character is the client type, the next ten characters are a volume,
and the next three are order type, and the last character would be an
optional type depending on the order type.

Sounds like you could do something like

recno_idx = slice(0,10)
client_idx = slice(10, 11)
volume_idx = slice(11,11+10)
order_type_idx = slice(11+10, 11+10+3)

out = file('out.txt', 'w')
for line in file('in.txt'):
    out.write(','.join([
        line[recno_idx],
        line[client_idx],
        line[volume_idx],
        line[order_type_idx],
        line[11+10+3:],
        ]))
out.close()

-tkc
 

George Sakkis

Hello All:

I hope this is the right place to ask, but I am trying to come up with
a way to parse each line of a file. Unfortunately, the file is neither
comma, nor tab, nor space delimited. Rather, the character locations
imply what field it is.

For example:

The first ten characters would be the record number, the next
character is the client type, the next ten characters are a volume,
and the next three are order type, and the last character would be an
optional type depending on the order type.

The lines are somewhat more complicated, but they work like that, and
not all fields have to be populated, in that they may contain spaces.
For example, the order number may be 2345, space padded at the
beginning of the line, while others might be zero padded in front.
Imagine I have a line:

______2345H0000300000_NC_

where the underscores indicate a space. I then want to map this to:

2345,H,0000300000,NC,

In other words, I want to preserve ALL of the fields, but map to
something that awk could easily cut up afterwards, or open in a CSV
editor. I am unsure how to place the commas based on character
location.

Any ideas?

Here's a general solution for fixed size records:
>>> def slicer(*sizes):
...     slices = len(sizes) * [None]
...     start = 0
...     for i,size in enumerate(sizes):
...         stop = start+size
...         slices[i] = slice(start,stop)
...         start = stop
...     return lambda string: [string[s].strip() for s in slices]
...
>>> order_slicer = slicer(10,1,10,4)
>>> order_slicer('______2345H0000300000_NC_'.replace('_',' '))
['2345', 'H', '0000300000', 'NC']

HTH,
George
 

Cameron Laird

Sounds like you could do something like

recno_idx = slice(0,10)
client_idx = slice(10, 11)
volume_idx = slice(11,11+10)
order_type_idx = slice(11+10, 11+10+3)
.
.
.
!? That seems to me confusingly far from a working solution,
at least in comparison to

recno_idx = the_line[0:10]
client_idx = the_line[10:11]
...

What am I missing?
 

Arnaud Delobelle

Sounds like you could do something like

recno_idx = slice(0,10)
client_idx = slice(10, 11)
volume_idx = slice(11,11+10)
order_type_idx = slice(11+10, 11+10+3)
.
.
.
!? That seems to me confusingly far from a working solution,
at least in comparison to

recno_idx = the_line[0:10]
client_idx = the_line[10:11]
...

What am I missing?

I suppose in your case the slice objects will be re-created for each
line, whereas with Tim's solution they are created once and for all
before the parsing starts. This may result in speedier parsing.
 

Tim Chase

recno_idx = slice(0,10)
client_idx = slice(10, 11)
volume_idx = slice(11,11+10)
order_type_idx = slice(11+10, 11+10+3)
.
!? That seems to me confusingly far from a working solution,
at least in comparison to

recno_idx = the_line[0:10]
client_idx = the_line[10:11]
...

What am I missing?

The "11+10" and "11+10+3" are to help show where the magic
numbers come from...that they're column-offsets from the previous
position...I suppose to have been consistent, I should have used

client_idx = the_line[10:10+1]

Somewhat like a kludgy version of George Sakkis's more elegant
version of slicing, but with the advantage of associating names
with the slice-boundaries.

It would be possible to write it as something like

for line in file('in.txt'):
    out.write(','.join([
        line[0:10],   # recno
        line[10:11],  # client
        line[11:21],  # volume
        line[21:24],  # order
        line[24:],    # remainder
        ]))

but it's harder to verify that the slicing doesn't incur a
fencepost error, and it makes it harder to follow if
manipulations need further checking, like

if line[client_idx] == 'F': continue # skip this client

There are a number of ways to slice & dice the line. I recommend
whichever is easiest to read/understand.
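One such alternative, for what it's worth: the struct module can
express the same fixed-width layout as a format string. A sketch,
assuming the 25-character layout described earlier in the thread:

```python
import struct

# The layout as a struct format string: 10s record number, 1s client
# type, 10s volume, 3s order type, 1s optional type. struct requires
# the input length to match the format exactly (25 bytes here).
LAYOUT = struct.Struct("10s1s10s3s1s")

def split_line(line):
    # struct works on bytes, so encode first, then decode and strip
    # each extracted field back to text.
    return [field.decode().strip() for field in LAYOUT.unpack(line.encode())]

fields = split_line("      2345H0000300000 NC ")
# fields == ['2345', 'H', '0000300000', 'NC', '']
```

The format string doubles as documentation of the record layout, at
the cost of a hard failure on any line of the wrong length.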

-tkc
 

bearophileHUGS

George Sakkis:
Here's a general solution for fixed size records:
>>> def slicer(*sizes):
...     slices = len(sizes) * [None]
...     start = 0
...     for i,size in enumerate(sizes):
...         stop = start+size
...         slices[i] = slice(start,stop)
...         start = stop
...     return lambda string: [string[s].strip() for s in slices]
...
>>> order_slicer = slicer(10,1,10,4)
>>> order_slicer('______2345H0000300000_NC_'.replace('_',' '))
['2345', 'H', '0000300000', 'NC']



Nice. Here's a little modified version:

from collections import namedtuple

def slicer(names, sizes):
    """
    >>> sl = slicer(["code", "p1", "progressive", "label"], (10, 1, 10, 4))
    >>> txt = "______2345H0000300000_NC_"
    >>> print sl(txt.replace('_', ' '))
    Sliced(code='2345', p1='H', progressive='0000300000', label='NC')
    """
    # several input controls can be added here
    slices = []
    start = 0
    for size in sizes:
        stop = start + size
        slices.append(slice(start, stop))
        start = stop
    Sliced = namedtuple("Sliced", names)
    return lambda txt: Sliced(*(txt[s].strip() for s in slices))

if __name__ == "__main__":
    import doctest
    doctest.testmod()
    print "Doctests done.\n"

Bye,
bearophile
 

Tyler

        So you have a classic (especially for COBOL and older FORTRAN) fixed
field record layout, no?

Exactly, I believe COBOL. It is a daily reconciliation with an
exchange and our system's orders. One of the problems of dealing with
these old legacy systems that never seem to go away....
        I presume the entire file is of a single layout? That would mean
only one splitting format is needed...

Again, correct, I only need the one format as it does not change. I
want to move it to a csv file because the application that we are
implementing reads these natively, and does not have a built in
mechanism to parse a file like this effectively. Further, this will
allow me to add some other fields for our own use. As such, the final
file will look somewhat different, but this step kinda threw me. I
started out with AWK, and didn't know whether to laugh or cry.
Thankfully, I was able to convince the "guys" to allow me to install
Python on my dev workstation (although it's already on the production
SUSE servers).

Cheers,

t.
 

Lie

        So you have a classic (especially for COBOL and older FORTRAN) fixed
field record layout, no?

        I presume the entire file is of a single layout? That would mean
only one splitting format is needed...
[snip]

        Note that all fields are still in character format. And has been
noted, I'm sure, if you try to turn the third field into an integer, you
may have problems, depending upon how you do the conversion -- leading 0
implies octal if it were a literal, though it seems int() handles it
correctly (Python 2.5)

from help(int)
| ... If base is zero, the proper base is guessed based on the
| string content. ...

int(x) always converts x in base 10.
int(x, 0) converts based on literal int notation, i.e. the prefix
'0x' means base 16 (hex), the prefix '0' means base 8 (octal), and
everything else is base 10.
int(x, n) converts in base n, where 2 <= n <= 36.

If you're still in doubt, just pass the base explicitly:
int(x, 10) (superfluous though).
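These rules are easy to verify interactively. The examples below
assume a modern Python 3, where base 0 follows literal notation
strictly and octal is spelled with an 0o prefix (the bare leading
zero the quoted text mentions is Python 2 behavior):

```python
# The default base 10 is unfazed by leading zeros, so the zero-padded
# volume field from this thread converts cleanly:
volume = int("0000300000")
# volume == 300000

# Base 0 guesses from the prefix, like a literal:
assert int("0x1a", 0) == 26   # hex prefix
assert int("0o10", 0) == 8    # octal prefix (Python 3 spelling)

# An explicit base works regardless of padding, superfluous or not:
assert int("0000300000", 10) == 300000
```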
 

Shawn Milochik

I work with tab-delimited files for a living. Because of the same need
you have, I created a Python script to do this. It has usage
information that is easy to follow (just run it without any
arguments).

I hope someone else finds this useful. I have, and use it every month.
It can be easily modified to create comma-delimited files, but that's
something I never use, so it does tabs.

http://milochik.com/shawn/fwconvert.zip



Usage:
fwconvert -r rulesFile fileName [-t|-f]
or
cat filename | fwconvert -r rulesFile (-t|-f)

-t (to tab) or -f (to fixed-width) required when piping input to
script. Otherwise, it will be auto-determined.


Rules file format:
fieldStart:fieldLength,fieldStart:fieldLength...
Example:
1:3,4:20,24:5
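If that rules format is 1-based fieldStart:fieldLength pairs (my
reading of the example above, not fwconvert's actual code), it maps
onto Python slices in a few lines:

```python
def parse_rules(rules):
    # "1:3,4:20,24:5" -> [slice(0, 3), slice(3, 23), slice(23, 28)]
    # Each pair is a 1-based start column and a field length.
    slices = []
    for pair in rules.split(","):
        start, length = map(int, pair.split(":"))
        slices.append(slice(start - 1, start - 1 + length))
    return slices

def to_tabs(line, slices):
    # Join the extracted fields with tabs, as fwconvert's output does.
    return "\t".join(line[s] for s in slices)

slices = parse_rules("1:3,4:20,24:5")
# slices == [slice(0, 3), slice(3, 23), slice(23, 28)]
```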
 
