Browsing text ; Python the right tool?

Paul Kooistra

I need a tool to browse text files with a size of 10-20 Mb. These
files have a fixed record length of 800 bytes (CR/LF), and contain
records used to create printed pages by an external company.

Each line (record) contains a 2-character identifier, like 'A0' or
'C1'. The identifier identifies the record format for the line,
thereby allowing different record formats to be used in a textfile.
For example:

An A0 record may consist of:
recordnumber [1:4]
name [5:25]
filler [26:800]

while a C1 record consists of:
recordnumber [1:4]
phonenumber [5:15]
zipcode [16:20]
filler [21:800]

As you see, all records have a fixed column format. I would like to
build a utility which allows me (in a windows environment) to open a
textfile and browse through the records (ideally with a search
option), where each recordtype is displayed according to its
recordformat ('Attributename: Value' format). This would mean that
browsing from an A0 to a C1 record results in a different list of
attributes + values on the screen, allowing me to analyze the data
generated a lot more easily than I do now, browsing in a text editor with a
stack of printed record formats at hand.

This is of course quite a common way of encoding data in textfiles.
I've tried to find a generic text-based browser which allows me to do
just this, but cannot find anything. Enter Python; I know the language
by name, I know it handles text just fine, but I am not really
interested in learning Python just now, I just need a tool to do what
I want.

What I would REALLY like is a way to define standard record formats in a
separate definition, like:
- defining a common record length;
- defining the different record formats (attributes, position in the
line);
- and defining when a specific record format is to be used, dependent
on 1 or more identifiers in the record.

I CAN probably build something from scratch, but if I can (re)use
something that already exists it would be so much better and faster...
And a utility to do what I just described would be REALLY useful in
LOTS of environments.

This means I have the following questions:

1. Does anybody know of a generic tool (not necessarily Python based)
that does the job I've outlined?
2. If not, is there some framework or widget in Python I can adapt to
do what I want?
3. If not, should I consider building all this just from scratch in
Python - which would probably mean not only learning Python, but some
other GUI related modules?
4. Or should I forget about Python and build something in another
environment?

Any help would be appreciated.
 
beliavsky

Here is an elementary suggestion. It would not be difficult to write a
Python script to make a csv file from your text files, adding commas at
the appropriate places to separate fields. Then the csv file can be
browsed in Excel (or some other spreadsheet). A0 and C1 records could
be written to separate csv files.
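
A minimal sketch of this suggestion, under assumptions the thread does not confirm: the 2-character identifier is taken to sit in the first two columns, and the field positions in the hypothetical `LAYOUTS` table are made up for illustration. Each record type is sent to its own csv writer.

```python
import csv
import io

# Hypothetical layouts (not from the thread): record id -> (field, start, end),
# using 0-based, end-exclusive positions as in Python slices.
LAYOUTS = {
    "A0": [("rectype", 0, 2), ("recordnumber", 2, 6), ("name", 6, 26)],
    "C1": [("rectype", 0, 2), ("recordnumber", 2, 6),
           ("phonenumber", 6, 17), ("zipcode", 17, 22)],
}


def records_to_csv(lines, writers):
    """Send each fixed-width record to the csv writer for its record type."""
    for line in lines:
        rectype = line[0:2]  # assumption: identifier is in the first two columns
        layout = LAYOUTS.get(rectype)
        if layout is None:
            continue  # unknown record type: skipped in this sketch
        writers[rectype].writerow([line[s:e].rstrip() for _, s, e in layout])


# Usage with in-memory buffers; real code would open one .csv file per
# record type with open(path, "w", newline="").
buffers = {rectype: io.StringIO() for rectype in LAYOUTS}
writers = {rectype: csv.writer(buf) for rectype, buf in buffers.items()}
sample = ["A00001John Smith".ljust(800), "C10002555-1234   12345".ljust(800)]
records_to_csv(sample, writers)
```

Trailing spaces are stripped from every field here; whether that is appropriate for numeric or code fields depends on the real layouts.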

There are Python programs to create Excel spreadsheets, and they could
be used to format the data in more sophisticated ways.
 
John Machin

Paul said:
I need a tool to browse text files with a size of 10-20 Mb. These
files have a fixed record length of 800 bytes (CR/LF), and contain
records used to create printed pages by an external company.

Each line (record) contains a 2-character identifier, like 'A0' or
'C1'. The identifier identifies the record format for the line,
thereby allowing different record formats to be used in a textfile.
For example:

An A0 record may consist of:
recordnumber [1:4]
name [5:25]
filler [26:800]

1. Python syntax calls these [0:4], [4:25], etc. One has to get into
the habit of deducting 1 from the start column position given in a
document.

2. So where's the "A0"? Are the records really 804 bytes wide -- "A0"
plus the above plus CR LF? What is "recordnumber" -- can't be a line
number (4 digits -> max 10k; 10k * 800 -> only 8Mb); looks too small to
be a customer identifier; is it the key to a mapping that produces
"A0", "C1", etc?
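
To illustrate point 1, here is a small sketch of the column arithmetic. The helper names `columns_to_slice` and `start_len_to_slice` are made up for this example, and the sample record simply follows the A0 layout as quoted (recordnumber in columns 1-4, name in columns 5-25).

```python
def columns_to_slice(first_col, last_col):
    """Turn 1-based, inclusive document columns into a Python slice."""
    return slice(first_col - 1, last_col)


def start_len_to_slice(start_col, length):
    """Turn a 1-based start column and a field length into a Python slice."""
    return slice(start_col - 1, start_col - 1 + length)


# Made-up record padded to the 800-character record length.
record = "0001John Smith".ljust(800)
recordnumber = record[columns_to_slice(1, 4)]
name = record[columns_to_slice(5, 25)].rstrip()
```

Only the start column shifts by one; the end column stays as documented because Python slices exclude their end index.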
while a C1 record consists of:
recordnumber [1:4]
phonenumber [5:15]
zipcode [16:20]
filler [21:800]

As you see, all records have a fixed column format. I would like to
build a utility which allows me (in a windows environment) to open a
textfile and browse through the records (ideally with a search
option), where each recordtype is displayed according to its
recordformat ('Attributename: Value' format). This would mean that
browsing from an A0 to a C1 record results in a different list of
attributes + values on the screen, allowing me to analyze the data
generated a lot more easily than I do now, browsing in a text editor with a
stack of printed record formats at hand.

This is of course quite a common way of encoding data in textfiles.
I've tried to find a generic text-based browser which allows me to do
just this, but cannot find anything. Enter Python; I know the language
by name, I know it handles text just fine, but I am not really
interested in learning Python just now, I just need a tool to do what
I want.

What I would REALLY like is a way to define standard record formats in a
separate definition, like:
- defining a common record length;
- defining the different record formats (attributes, position of the
line);

Add in the type, number of decimal places, etc as well ..
- and defining when a specific record format is to be used, dependent
on 1 or more identifiers in the record.

I CAN probably build something from scratch, but if I can (re)use
something that already exists it would be so much better and faster...
And a utility to do what I just described would be REALLY useful in
LOTS of environments.

This means I have the following questions:

1. Does anybody know of a generic tool (not necessarily Python based)
that does the job I've outlined?

No, but please post if you hear of one.
2. If not, is there some framework or widget in Python I can adapt to
do what I want?
3. If not, should I consider building all this just from scratch in
Python - which would probably mean not only learning Python, but some
other GUI related modules?

Approach I use is along the lines of what you suggested, but w/o the
GUI.
I have a Python script that takes layout info and an input file and can
produce an output file in one of two formats:

Format 1:
something like:
Rec:A0 recordnumber:0001 phonenumber:(123) 555-1234 zipcode:12345

This is usually much shorter than the fixed length record, because you
leave out the fillers (after checking they are blank!), and strip
trailing spaces from alphanumeric fields. Whether you leave integers,
money, date etc fields as per file or translated into human-readable
form depends on who will be reading it.

You then use a robust text editor (preferably one which supports
regular expressions in its find function) to browse the output file.

Format 2:
Rec:A0
recordnumber:0001
etc. etc., i.e. one field per line. Why, you ask? If you are a consumer of
such files, you can take small chunks of this and drop them into
Excel; testers can take a copy, make lots of juicy test data, and run it
through another script which makes a flat file out of it.
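
The two output formats could be produced along these lines. This is only a sketch, not the script described above: the function names are hypothetical, and the fields are assumed to arrive as (name, value) pairs with fillers already dropped and trailing spaces stripped.

```python
def format_one_line(rectype, fields):
    """Format 1: the whole record on a single line."""
    parts = ["Rec:%s" % rectype]
    parts.extend("%s:%s" % (name, value) for name, value in fields)
    return " ".join(parts)


def format_one_per_line(rectype, fields):
    """Format 2: record type first, then one 'name:value' pair per line."""
    lines = ["Rec:%s" % rectype]
    lines.extend("%s:%s" % (name, value) for name, value in fields)
    return "\n".join(lines)


# Made-up field values for illustration.
fields = [("recordnumber", "0001"), ("phonenumber", "(123) 555-1234"),
          ("zipcode", "12345")]
```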
4. Or should I forget about Python and build something in another
environment?

No way!
 
Jeff Shannon

Paul said:
1. Does anybody know of a generic tool (not necessarily Python based)
that does the job I've outlined?
2. If not, is there some framework or widget in Python I can adapt to
do what I want?

Not that I know of, but...
3. If not, should I consider building all this just from scratch in
Python - which would probably mean not only learning Python, but some
other GUI related modules?

This should be pretty easy. If each record is CRLF terminated, then
you can get one record at a time simply by iterating over the file
("for line in open('myfile.dat'): ..."). You can have a dictionary of
classes or factory functions, one for each record type, keyed off of
the 2-character identifier. Each class/factory would know the layout
of that record type, and return a(n) instance/dictionary with fields
separated out into attributes/items.
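
A rough sketch of that idea, using factory functions that return dictionaries. The field positions are invented for illustration, and the identifier is assumed to occupy the first two columns, which the thread does not confirm.

```python
def parse_a0(line):
    """Hand-written parser for a hypothetical A0 layout."""
    return {"rectype": line[0:2], "recordnumber": line[2:6],
            "name": line[6:26].rstrip()}


def parse_c1(line):
    """Hand-written parser for a hypothetical C1 layout."""
    return {"rectype": line[0:2], "recordnumber": line[2:6],
            "phonenumber": line[6:17].rstrip(), "zipcode": line[17:22]}


# One factory per record type, keyed off the 2-character identifier.
PARSERS = {"A0": parse_a0, "C1": parse_c1}


def parse_records(lines):
    """Yield one dictionary per record, dispatching on the identifier."""
    for line in lines:
        line = line.rstrip("\r\n")
        parser = PARSERS.get(line[0:2])
        if parser is None:
            raise ValueError("unknown record type %r" % line[0:2])
        yield parser(line)
```

Since a file object iterates line by line, something like `for record in parse_records(open('myfile.dat')): ...` would work with this sketch.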

The trickiest part would be in displaying the data; you could
potentially use COM to insert it into a Word or Excel document, or
code your own GUI in Python. The former would be pretty easy if
you're happy with fairly simple formatting; the latter would require a
bit more effort, but if you used one of Python's RAD tools (Boa
Constructor, or maybe PythonCard, as examples) you'd be able to get
very nice results.

Jeff Shannon
Technician/Programmer
Credit International
 
John Machin

Jeff said:
Not that I know of, but...


This should be pretty easy. If each record is CRLF terminated, then
you can get one record at a time simply by iterating over the file
("for line in open('myfile.dat'): ..."). You can have a dictionary of
classes or factory functions, one for each record type, keyed off of
the 2-character identifier. Each class/factory would know the layout
of that record type,

This is plausible only under the condition that Santa Claus is paying
you $X per class/factory or per line of code, or you are so speed-crazy
that you are machine-generating C code for the factories.

I'd suggest "data driven" -- you grab the .doc or .pdf that describes
your layouts, ^A^C, fire up Excel, paste special, massage it, so you
get one row per field, with start & end posns, type, dec places,
optional/mandatory, field name, whatever else you need. Insert a column
with the record name. Save it as a CSV file.

Then you need a function to load this layout file into dictionaries,
and build cross-references field_name -> field_number (0,1,2,...) and
vice versa.

As your record name is not in a fixed position in the record, you will
also need to supply a function (file_type, record_string) ->
record_name.

Then you have *ONE* function that takes a file_type, a record_name, and
a record_string, and gives you a list of the values. That is all you
need for a generic browser application.
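
A minimal sketch of the data-driven approach, with several assumptions: the CSV column names (`record_name`, `field_name`, `start`, `end`) and the sample layout are invented, positions are stored 1-based and inclusive as in a typical spec document, only a single file type is handled (so the file_type argument is dropped), and the record name is assumed to sit in the first two columns.

```python
import csv
import io

# Hypothetical layout file contents; a real one would be saved from Excel.
LAYOUT_CSV = """record_name,field_name,start,end
A0,rectype,1,2
A0,recordnumber,3,6
A0,name,7,26
C1,rectype,1,2
C1,recordnumber,3,6
C1,phonenumber,7,17
C1,zipcode,18,22
"""


def load_layout(csv_file):
    """Load the layout rows into {record_name: [(field_name, start, end), ...]}."""
    layouts = {}
    for row in csv.DictReader(csv_file):
        # Convert 1-based inclusive columns to 0-based, end-exclusive slices.
        field = (row["field_name"], int(row["start"]) - 1, int(row["end"]))
        layouts.setdefault(row["record_name"], []).append(field)
    return layouts


def record_name(record_string):
    """Assumption: the identifier occupies the first two columns."""
    return record_string[0:2]


def extract_values(layouts, name, record_string):
    """The one generic function: return (field_name, value) pairs."""
    return [(field, record_string[start:end].rstrip())
            for field, start, end in layouts[name]]


layouts = load_layout(io.StringIO(LAYOUT_CSV))
line = "C10002555-1234   12345".ljust(800)
values = extract_values(layouts, record_name(line), line)
```

Adding or retiring a field then means editing a row in the layout file, not the code.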

For working on a _specific_ known file_type, you can _then_ augment
that to give you record objects that you use like a0.zipcode or record
dictionaries that you use like a0['zipcode'].

You *don't* have to hand-craft a class for each record type. And you
wouldn't want to, if you were dealing with files whose spec keeps on
having fields added and fields obsoleted.

Notice: in none of the above do you ever have to type in a column
position, except if you manually add updates to your layout file.

Then contemplate how productive you will be when/if you need to
_create_ such files -- you will push everything through one function
which will format each field correctly in the correct column positions
(and chuck an exception if it won't fit). Slightly better than an
approach that uses
something like nbytes = sprintf(buffer, "%04d%-20s%-5s", a0_num,
a0_phone, a0_zip);
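
The writing direction might look like the sketch below. It is not the function described above: `format_record` is a hypothetical name, the layout is a made-up list of (field_name, start, end) tuples with 0-based, end-exclusive positions, and every field is simply left-justified and space-padded, whereas a real version would format by field type.

```python
def format_record(layout, values, record_length=800):
    """Build one fixed-width record; raise ValueError if a value won't fit."""
    buf = [" "] * record_length
    for name, start, end in layout:
        text = str(values[name])
        width = end - start
        if len(text) > width:
            raise ValueError("%s: %r does not fit in %d columns"
                             % (name, text, width))
        # Same-length slice assignment puts each character in its column.
        buf[start:end] = text.ljust(width)
    return "".join(buf)


# Made-up C1 layout and values for illustration.
C1_LAYOUT = [("rectype", 0, 2), ("recordnumber", 2, 6),
             ("phonenumber", 6, 17), ("zipcode", 17, 22)]
c1_line = format_record(C1_LAYOUT, {"rectype": "C1", "recordnumber": "0002",
                                    "phonenumber": "555-1234",
                                    "zipcode": "12345"})
```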

HTH,
John
 
Jeff Shannon

John said:
Jeff said:
[...] If each record is CRLF terminated, then
you can get one record at a time simply by iterating over the file
("for line in open('myfile.dat'): ..."). You can have a dictionary
of classes or factory functions, one for each record type, keyed off
of the 2-character identifier. Each class/factory would know the
layout of that record type,

This is plausible only under the condition that Santa Claus is paying
you $X per class/factory or per line of code, or you are so speed-crazy
that you are machine-generating C code for the factories.

I think that's overly pessimistic. I *was* presuming a case where the
number of record types was fairly small, and the definitions of those
records reasonably constant. For ~10 or fewer types whose spec
doesn't change, hand-coding the conversion would probably be quicker
and/or more straightforward than writing a spec-parser as you suggest.

If, on the other hand, there are many record types, and/or those
record types are subject to changes in specification, then yes, it'd
be better to parse the specs from some sort of data file.

The O.P. didn't mention anything either way about how dynamic the
record specs are, nor the number of record types expected. I suspect
that we're both assuming a case similar to our own personal
experiences, which are different enough to lead to different preferred
solutions. ;)

Jeff Shannon
Technician/Programmer
Credit International
 
John Machin

Jeff said:
John said:
Jeff said:
[...] If each record is CRLF terminated, then
you can get one record at a time simply by iterating over the file
("for line in open('myfile.dat'): ..."). You can have a dictionary
of classes or factory functions, one for each record type, keyed off
of the 2-character identifier. Each class/factory would know the
layout of that record type,

This is plausible only under the condition that Santa Claus is paying
you $X per class/factory or per line of code, or you are so speed-crazy
that you are machine-generating C code for the factories.

I think that's overly pessimistic. I *was* presuming a case where the
number of record types was fairly small, and the definitions of those
records reasonably constant. For ~10 or fewer types whose spec
doesn't change, hand-coding the conversion would probably be quicker
and/or more straightforward than writing a spec-parser as you
suggest.

I didn't suggest writing a "spec-parser". No (mechanical) parsing is
involved. The specs that I'm used to dealing with set out the record
layouts in a tabular fashion. The only hassle is extracting that from a
MSWord document or a PDF.
If, on the other hand, there are many record types, and/or those
record types are subject to changes in specification, then yes, it'd
be better to parse the specs from some sort of data file.

"Parse"? No parsing, and not much code at all: The routine to "load"
(not "parse") the layout from the layout.csv file into dicts of dicts
is only 35 lines of Python code. The routine to take an input line and
serve up an object instance is about the same. It does more than the
OP's browsing requirement already. The routine to take an object and
serve up a correctly formatted output line is only 50 lines of which
1/4 is comment or blank.
The O.P. didn't mention anything either way about how dynamic the
record specs are, nor the number of record types expected.

My reasoning: He did mention A0 and C1 hence one could guess from that
he maybe had 6 at least. Also, files used to "create printed pages by
an external company" (especially by a company that had "leaseplan" in
its e-mail address) would indicate "many" and "complicated" to me.
I suspect
that we're both assuming a case similar to our own personal
experiences, which are different enough to lead to different preferred
solutions. ;)

Indeed. You seem to have led a charmed life; may the wizards and the
rangers ever continue to protect you from the dark riders! :)

My personal experiences and attitudes: (1) extreme aversion to having
to type (correctly) lots of numbers (column positions and lengths), and
to having to mentally translate start = 663, len = 13 to [662:675] or
having ugliness like [663-1:663+13-1] (2) cases like 17 record types
and 112 fields in one file, 8 record types and 86 fields in a second --
this being a new relatively clean simple exercise in exchanging files
with a government department (3) Past history of this govt dept is that
there are at least another 7 file types in regular use and they change
the _major_ version number of each file type about once a year on
average (4) These things tend to start out deceptively small and simple
and turn into monsters.

Cheers,
John
 
Jeff Shannon

John said:
Jeff said:
[...] For ~10 or fewer types whose spec
doesn't change, hand-coding the conversion would probably be quicker
and/or more straightforward than writing a spec-parser as you
suggest.

I didn't suggest writing a "spec-parser". No (mechanical) parsing is
involved. The specs that I'm used to dealing with set out the record
layouts in a tabular fashion. The only hassle is extracting that from a
MSWord document or a PDF.

The "specs" I'm used to dealing with are inconsistent enough that it's
more work to "massage" them into strict tabular format than it is to
retype and verify them. Typically it's one or two file types, with
one or two record types each, from each vendor -- and of course no
vendor uses anything similar to any other, nor is there a standardized
way for them to specify what they *do* use. Everything is almost
completely ad-hoc.
"Parse"? No parsing, and not much code at all: The routine to "load"
(not "parse") the layout from the layout.csv file into dicts of dicts
is only 35 lines of Python code. The routine to take an input line and
serve up an object instance is about the same. It does more than the
OP's browsing requirement already. The routine to take an object and
serve up a correctly formatted output line is only 50 lines of which
1/4 is comment or blank.

There's a tradeoff between the effort involved in writing multiple
custom record-type classes, and the effort necessary to write the
generic loading routines plus the effort to massage or coerce the
specifications into a regular, machine-readable format. I suppose
that "parsing" may not precisely be the correct term here, but I was
using it in parallel to, say, ConfigParser and optparse. Either
you're writing code to translate some sort of received specification
into a usable format, or you're manually pushing bytes around to get
them into a format that your code *can* translate. I'd say that my
creation of custom classes is just a bit further along a continuum
than your massaging of specification data -- I'm just massaging it
into Python code instead of CSV tables.
Indeed. You seem to have led a charmed life; may the wizards and the
rangers ever continue to protect you from the dark riders! :)

Hardly charmed -- more that there's so little regularity in what I'm
given that massaging it to a standard format is almost as much work as
just buckling down and retyping it. My one saving grace is that I'm
usually able to work with delimited files, rather than
column-width-specified files. I'll spare you the rant about my many
job-related frustrations, but trust me, there ain't no picnics here!

Jeff Shannon
Technician/Programmer
Credit International
 
Paul Kooistra

Sorry to reply this late guys - I cannot access news from work, and Google
Groups cannot reply to a message so I had to do it at home. Let me address a
few of the remarks and questions you guys asked:

First of all, the example I gave was just that - an example. Yes, I know
Python starts with 0, and I know that you cannot fit a 4-digit number in 2
positions, this was just to give the idea. To clarify, at THIS moment I need
to browse 1-80 Mb size text files. At this moment, I have 16 different
record definitions, numbered A,B, C1-C8, D-H. Each record definition has
20-60 different attributes.

Not only that, but these formats change regularly; and I want to create or
use something I can use on *other* applications or sites as well. As I said,
I have encountered the type of problem I've described in numerous places
already.
John wrote:
I have a Python script that takes layout info and an input file and can
produce an output file in one of two formats:

Yes John, I was thinking along these lines myself. The problem is that I
have to parse several of these large files each day (debugging) and browsing
converted output seems just too tedious and inefficient. I would REALLY like
a GUI, and preferably something portable I can re-use later on.
This should be pretty easy. If each record is CRLF terminated, then you
can get one record at a time simply by iterating over the file ("for line
in open('myfile.dat'): ...").

Jeff, this was indeed the way I was thinking. But instead of iterating I
need the ability to browse forward and backward.
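
Because every record has the same length, stepping forward and backward can be done by seeking directly to a record's byte offset. The sketch below assumes each record occupies 802 bytes on disk (800 data bytes plus CR LF) and that the data is plain ASCII; neither is confirmed here, and `RECORD_SIZE` should be 800 if the CR LF is already counted in the record length. The function names are hypothetical.

```python
import io

# Assumption: 800 data bytes plus CR LF per record on disk.
RECORD_SIZE = 802


def record_count(f, record_size=RECORD_SIZE):
    """Number of complete records in a file opened in binary mode."""
    f.seek(0, io.SEEK_END)
    return f.tell() // record_size


def read_record(f, index, record_size=RECORD_SIZE):
    """Return record number `index` (0-based) without its CR LF."""
    if index < 0 or index >= record_count(f, record_size):
        raise IndexError("record %d out of range" % index)
    f.seek(index * record_size)
    return f.read(record_size)[:-2].decode("ascii")
```

A browser would keep a current index, open the file with `open(path, "rb")`, and call `read_record` with index + 1 or index - 1 as the user steps through; nothing has to be loaded into memory up front.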
You can have a dictionary of classes or factory functions, one for each
record type, keyed off of the 2-character identifier. Each class/factory
would know the layout of that record type, and return a(n)
instance/dictionary with fields separated out into attributes/items.

This is of course a clean approach, but would mean re-coding every time a
record is changed - frequently! I really would like to edit only a data
definition file.
The trickiest part would be in displaying the data; you could potentially
use COM to insert it into a Word or Excel document, or code your own GUI
in Python. The former would be pretty easy if you're happy with fairly
simple formatting; the latter would require a bit more effort, but if you
used one of Python's RAD tools (Boa Constructor, or maybe PythonCard, as
examples) you'd be able to get very nice results.

I will at least look into Boa and PythonCard. Thanks for the hint.
This is plausible only under the condition that Santa Claus is paying
you $X per class/factory or per line of code, or you are so speed-crazy
that you are machine-generating C code for the factories.

Unfortunately, neither is the case :)
I'd suggest "data driven"
Yeah!

Then you need a function to load this layout file into dictionaries,
and build cross-references field_name -> field_number (0,1,2,...) and
vice versa.
As your record name is not in a fixed position in the record, you will
also need to supply a function (file_type, record_string) ->
record_name.

I thought about supplying a flat ASCII definition such as:

Then you have *ONE* function that takes a file_type, a record_name, and
a record_string, and gives you a list of the values. That is all you
need for a generic browser application.

I like this.
You *don't* have to hand-craft a class for each record type. And you
wouldn't want to, if you were dealing with files whose spec keeps on
having fields added and fields obsoleted.
Exactly.

I think that's overly pessimistic. I *was* presuming a case where the
number of record types was fairly small, and the definitions of those
records reasonably constant. For ~10 or fewer types whose spec doesn't
change, hand-coding the conversion would probably be quicker and/or more
straightforward than writing a spec-parser as you suggest.

Unfortunately, all wrong :)

Lots of records, lots of changes, lots of different record types -
hardcoding doesn't seem the right way.
"Parse"? No parsing, and not much code at all: The routine to "load"
(not "parse") the layout from the layout.csv file into dicts of dicts
is only 35 lines of Python code. The routine to take an input line and
serve up an object instance is about the same. It does more than the
OP's browsing requirement already. The routine to take an object and
serve up a correctly formatted output line is only 50 lines of which
1/4 is comment or blank.

John, do you have suggestions where I can find examples of these functions? I
can program, but not being proficient in Python, any help or examples I can
adapt would be nice.
Also, files used to "create printed pages by
an external company" (especially by a company that had "leaseplan" in
its e-mail address) would indicate "many" and "complicated" to me.

How right you are. Think about production runs of 150,000 invoices, each
invoice consisting of 2-10 records, and you are on the right track.
I suspect
that we're both assuming a case similar to our own personal
experiences, which are different enough to lead to different
preferred solutions. ;)
Seconded.

My personal experiences and attitudes: (1) extreme aversion to having
to type (correctly) lots of numbers (column positions and lengths), and
to having to mentally translate start = 663, len = 13 to [662:675] or
having ugliness like [663-1:663+13-1] (2) cases like 17 record types
and 112 fields in one file, 8 record types and 86 fields in a second --
this being a new relatively clean simple exercise in exchanging files
with a government department (3) Past history of this govt dept is that
there are at least another 7 file types in regular use and they change
the _major_ version number of each file type about once a year on
average (4) These things tend to start out deceptively small and simple
and turn into monsters.

Our experiences are remarkably similar...

Cheers,
Paul
 
Jorgen Grahn

Here is an elementary suggestion. It would not be difficult to write a
Python script to make a csv file from your text files, adding commas at
the appropriate places to separate fields. Then the csv file can be
browsed in Excel (or some other spreadsheet).

I'd create text files like someone else suggested, because I'm more
comfortable with at least three text editors/viewers than with Excel.

But the bottom line is that it's a waste of time to design a new GUI around
a file format, when you can tweak the data enough to reuse something that
exists, and /has/ all the features you will eventually want.
A0 and C1 records could
be written to separate csv files.

(Assuming that's OK, I wonder why they shared a file to begin with. Is the
order between A0 and C1 records important?)

/Jorgen
 
