Advice on dealing with a legacy file format

The Frog · Sep 30, 2010

Hi Everyone,

Before I begin trudging away at this I thought it prudent to ask for
some advice. I have a legacy file format from an application that is
common to both the company I work for and the business partners. The
file is 'sort of' structured, in that it is text based, and there
are paragraphs of information, each following its own structural bent.

I am going to write a parser for this file type so that we can do more
than we are currently limited to with the existing application. My
question is therefore: What would be an 'elegant' approach to reading
this file and its various paragraphs? An example is below....

<This is a 'blank slate' file>
PROSPACE SCHEMATIC FILE
; Version 2006.3.2
Project,Space Planning-Projekt,,
0,,7,1.5,1.5,1,1,0,,1,0,0,0,0,0,3,1,1,0,1,0,0,0,0,2,3,1,0,1,0,0,0,0,3,3,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,2,2,2,2,,0,0,0,0,0,0,0,0,,,0,

<This is with some basic data in it>
PROSPACE SCHEMATIC FILE
; Version 2006.3.2
Project,Space Planning-Projekt,,
0,,7,1.5,1.5,1,1,0,,1,1,0,0,0,0,3,1,1,0,1,0,0,0,0,2,3,1,0,1,0,0,0,0,3,3,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,2,2,2,2,,0,0,0,0,0,0,0,0,,,0,
Planogram,Test,,
100,200,60,16777215,2,1,100,10,60,1,16777215,0,10,2,0,32896,1,0,0,1,0,,,-1,-1,-1,-1,0,0,0,-1,-1,-1,-1,-1,-1,0,0,0,-1,-1,-1,-1,-1,-1,0,0,0,-1,-1,0,1,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,1,,,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,2,,,,0,0,0,0,0,0,0,0,,,,0.5,30,30,0,0,0,0,0,0,1.5,1.5,1,1,0,0,0,0,0,0,0,0,,,,,,F766F39C-5610-4ced-
AB0A-FDFD4D6DC166,,,,,0
Segment,,,
0,100,0,0,0,0,0,0,0,0,0,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,,

What I am hoping to do is to eliminate the need to do a brute force
approach. I was hoping that perhaps someone might have a suggestion
using regular expressions. The goal at this stage is to read the file
and place the data into suitable class objects with some sort of
hierarchy. I understand the elements themselves (or mostly;
unfortunately there is no official documentation for the file type),
so interpreting the values into suitable objects isn't the issue; it's
reading the file in the first place and isolating those entities from
the text in a 'clean' way.

Any advice would be greatly appreciated.

The Frog

Tom Anderson · Sep 30, 2010

The file is 'sort of' structured, in that it is text based, and there
are paragraphs of information, each following its own structural bent.

Could you expand on what you mean by that? The lexical structure looks
rather regular to me.

An example is below....

<This is a 'blank slate' file>
PROSPACE SCHEMATIC FILE
; Version 2006.3.2
Project,Space Planning-Projekt,,
0,,7,1.5,1.5,1,1,0,,1,0,0,0,0,0,3,1,1,0,1,0,0,0,0,2,3,1,0,1,0,0,0,0,3,3,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,2,2,2,2,,0,0,0,0,0,0,0,0,,,0,

<This is with some basic data in it>
PROSPACE SCHEMATIC FILE
; Version 2006.3.2
Project,Space Planning-Projekt,,
0,,7,1.5,1.5,1,1,0,,1,1,0,0,0,0,3,1,1,0,1,0,0,0,0,2,3,1,0,1,0,0,0,0,3,3,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,2,2,2,2,,0,0,0,0,0,0,0,0,,,0,
Planogram,Test,,
100,200,60,16777215,2,1,100,10,60,1,16777215,0,10,2,0,32896,1,0,0,1,0,,,-1,-1,-1,-1,0,0,0,-1,-1,-1,-1,-1,-1,0,0,0,-1,-1,-1,-1,-1,-1,0,0,0,-1,-1,0,1,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,1,,,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,2,,,,0,0,0,0,0,0,0,0,,,,0.5,30,30,0,0,0,0,0,0,1.5,1.5,1,1,0,0,0,0,0,0,0,0,,,,,,F766F39C-5610-4ced-
AB0A-FDFD4D6DC166,,,,,0
Segment,,,
0,100,0,0,0,0,0,0,0,0,0,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,,

Ye gods.

What I am hoping to do is to eliminate the need to do a brute force
approach. I was hoping that perhaps someone might have a suggestion
using regular expressions.

"Now, they have two problems."

Okay, so AFAICT, the format of a file is that it's a file header followed
by a sequence of paragraphs. The file header is the string "PROSPACE
SCHEMATIC FILE" followed by a line break, followed by a version info line,
followed by a line break (perhaps more than one info line, but each
starting with a semicolon and terminated by a line break). A paragraph is
a metadata line followed by a data line. Lines are sequences of values,
separated by commas, terminated by a line break. The first value in each
metadata line is the type of the paragraph, which governs the semantics of
the values on the data line.

Is that about right?

If so, i wouldn't bother with anything except a manual parse, because this
is a very straightforward format. I can't see any helpful way to use
regular expressions.
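
For readers following along, here is a minimal sketch of such a manual parse. It is not the code at the link below; the class and method names are hypothetical. It assumes the structure described above (header line, info lines starting with a semicolon, then pairs of metadata and data lines), that values never contain quoted commas, and that each data line sits on one physical line (the wrapped GUID in the sample is assumed to be a posting artifact).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** One paragraph: a metadata line plus the data line that follows it. */
final class ProspaceParagraph {
    final String type;           // first value of the metadata line, e.g. "Project"
    final List<String> metadata; // every value on the metadata line
    final List<String> data;     // every value on the data line

    ProspaceParagraph(List<String> metadata, List<String> data) {
        this.type = metadata.get(0);
        this.metadata = metadata;
        this.data = data;
    }
}

final class ProspaceReader {
    static final String MAGIC = "PROSPACE SCHEMATIC FILE";

    /** Parses the lines of one file, e.g. as returned by Files.readAllLines. */
    static List<ProspaceParagraph> parse(List<String> lines) {
        Iterator<String> it = lines.iterator();
        if (!it.hasNext() || !it.next().trim().equals(MAGIC)) {
            throw new IllegalArgumentException("Expected header: " + MAGIC);
        }
        List<ProspaceParagraph> paragraphs = new ArrayList<>();
        while (it.hasNext()) {
            String line = it.next();
            // Assumption: info lines start with ';' and blank lines carry no data.
            if (line.startsWith(";") || line.trim().isEmpty()) {
                continue;
            }
            if (!it.hasNext()) {
                throw new IllegalArgumentException("No data line after: " + line);
            }
            paragraphs.add(new ProspaceParagraph(split(line), split(it.next())));
        }
        return paragraphs;
    }

    /** Splits on commas; the -1 limit keeps trailing empty values. */
    static List<String> split(String line) {
        return Arrays.asList(line.split(",", -1));
    }
}
```

The parser only isolates paragraphs and their raw values; turning those values into typed objects is left to a later step keyed on the paragraph type.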

I have the day off, and i really should be packing to go on holiday, so
naturally, as a means of positive procrastination, i have written a little
parser to show how i would do it:

http://urchin.earth.li/~twic/Code/Prospace/

So far, it only deals with the Project paragraph, but you can see how to
extend it. The interesting entry point is BuildProspace.main.

tom

The Frog · Sep 30, 2010

Thank you, Tom, for the guidance. I understand the approach you are
taking in your illustration, and thank you for taking the time to
provide a coded example. I really appreciate the effort you have made.

Let me see if I can clarify my desire to use Regular Expressions,
though now perhaps that is not necessary. Simply put, the file has
various revisions to the structure made over time, and it was my
understanding that I could provide a type of version based reader,
with different 'rules' for each paragraph type for each version. I was
hoping to allow a user to construct such a rule definition with an XML
file acting as the descriptor for the data file's construction. Then,
parse the data file, and the information comes back to the calling
application.

To clarify the file itself, the paragraphs are basically someone's idea
of an XML-like approach to storing data. Each line in a paragraph
starts with a descriptor of sorts that identifies the type of data
that follows. The data that follows is in CSV format (minus any
headers). Various types of paragraphs are then used 'in order' to
construct the final data structure. If anyone is interested this is
for a planogram (shelf design used for retail stores) software tool.
The order of the paragraphs and lines follows the shelf logic build
order for the software tool. In the end you have a file that tells you
what products (and all their descriptive characteristics) belong on
what shelf, in what position, and how they are placed or stacked, per
segment of each gondola. That's a mouthful. When you walk down a
supermarket aisle you are most likely seeing the result of one of
these tools (and its processes) and all that data is stored in a file
like this.

What I am ultimately hoping to achieve is to take a data file and
parse it into a hierarchy of objects that represent the same shelf
design principles, and hand those objects back to the calling
application, while keeping the parsing logic encapsulated and also
extensible. I thought that XML definition files for the object
definitions and regular expressions might be a good way to go for
this.

I thank you once again for your most instructive feedback. I will have
a play with this and see how I do.

Cheers

The Frog

Roedy Green · Oct 1, 2010

An example is below....

That looks like an ordinary CSV file. See
http://mindprod.com/jgloss/csv.html

The Frog · Oct 4, 2010

Hi Roedy,

Thanks for the input. At first I thought the same, but it turns out that
the structure of the file is not as regular as a CSV. Each of these
'sentences' has a different number of fields, and the fields per
sentence have their own unique order. Unfortunately it's not as nicely
regular as a CSV - that would have been wonderful!

I have managed to discern from the file's descriptor document that
there are 14 different types of sentence:
- Project (247 Fields)
- Planogram (259 Fields)
- Fixture (177 Fields)
- Product (320 Fields)
- Position (180 Fields)
- Divider (22 Fields)
- Supplier (31 Fields)
- Segment (54 Fields)
- Performance (189 Fields)
- Drawing (50 Fields)
- EmbeddedObject (15 Fields)
- Configuration (494 Fields)
- Peg (16 Fields)
- Point (4 Fields)
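
One way to carry this list into code is an enum that records each sentence type with its expected field count, so a parser can flag lines that do not match. This is only a sketch: the counts are copied from the list above and presumably apply to a single version of the format, and it is an assumption that the keyword in the file is spelled exactly like the name in the list (the sample only confirms Project, Planogram and Segment).

```java
import java.util.HashMap;
import java.util.Map;

enum SentenceType {
    PROJECT("Project", 247),
    PLANOGRAM("Planogram", 259),
    FIXTURE("Fixture", 177),
    PRODUCT("Product", 320),
    POSITION("Position", 180),
    DIVIDER("Divider", 22),
    SUPPLIER("Supplier", 31),
    SEGMENT("Segment", 54),
    PERFORMANCE("Performance", 189),
    DRAWING("Drawing", 50),
    EMBEDDED_OBJECT("EmbeddedObject", 15),
    CONFIGURATION("Configuration", 494),
    PEG("Peg", 16),
    POINT("Point", 4);

    final String keyword;  // assumed spelling of the type name in the file
    final int fieldCount;  // count taken from the list above

    private static final Map<String, SentenceType> BY_KEYWORD = new HashMap<>();
    static {
        for (SentenceType type : values()) {
            BY_KEYWORD.put(type.keyword, type);
        }
    }

    SentenceType(String keyword, int fieldCount) {
        this.keyword = keyword;
        this.fieldCount = fieldCount;
    }

    /** Looks a type up by its keyword; unknown keywords are rejected. */
    static SentenceType fromKeyword(String keyword) {
        SentenceType type = BY_KEYWORD.get(keyword);
        if (type == null) {
            throw new IllegalArgumentException("Unknown sentence type: " + keyword);
        }
        return type;
    }

    /** True when a parsed sentence has the number of fields this type expects. */
    boolean accepts(int actualFieldCount) {
        return actualFieldCount == fieldCount;
    }
}
```

Because the counts change between revisions of the format, a fixed enum like this would only suit one version; a per-version table would be needed to cover several.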

These represent objects in a hierarchical order. Still working out
which ones belong to others, but everything belongs to a project (top
level). My hope is to be able to develop a package that can parse
these files and return / deliver to an application a set of objects
that represents the 'real world' structure this data file represents.
In researching this file structure it turns out that the structure has
undergone some revisions over the years. Part of what I am trying to
figure out is how I can provide an XML document as part of a Java
Package that would contain the necessary descriptive information to
allow for the class objects to be built on the fly. The parser would
determine the data file's version number, 'load' the appropriate XML
schema document so it could generate the appropriate objects, then
parse the data file and return to the calling code a set of objects
that represent the contents of the data file. If a new version of the
data file is released it is much easier to add a new XML schema
document to the package than to rewrite the code.
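
To make the idea concrete, here is a rough sketch of a per-version descriptor loaded from XML with the JDK's built-in DOM parser. The XML layout, the element and attribute names, and the field names are all hypothetical, since nothing about them is fixed yet. Instead of generating classes on the fly, the sketch takes the simpler route of binding values to field names in a map.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/*
 * Hypothetical descriptor layout, one document per file version:
 *   <format version="2006.3.2">
 *     <sentence type="Segment">
 *       <field index="0" name="first"/>
 *       <field index="1" name="second"/>
 *     </sentence>
 *   </format>
 */
final class FormatDescriptor {
    final String version;
    // sentence type -> (field name -> zero-based position on the data line)
    private final Map<String, Map<String, Integer>> layouts = new HashMap<>();

    private FormatDescriptor(String version) {
        this.version = version;
    }

    static FormatDescriptor fromXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            FormatDescriptor descriptor = new FormatDescriptor(root.getAttribute("version"));
            NodeList sentences = root.getElementsByTagName("sentence");
            for (int i = 0; i < sentences.getLength(); i++) {
                Element sentence = (Element) sentences.item(i);
                Map<String, Integer> layout = new LinkedHashMap<>();
                NodeList fields = sentence.getElementsByTagName("field");
                for (int j = 0; j < fields.getLength(); j++) {
                    Element field = (Element) fields.item(j);
                    layout.put(field.getAttribute("name"),
                            Integer.parseInt(field.getAttribute("index")));
                }
                descriptor.layouts.put(sentence.getAttribute("type"), layout);
            }
            return descriptor;
        } catch (Exception e) {
            throw new IllegalArgumentException("Unreadable descriptor", e);
        }
    }

    /** Pairs each described field name with its value; missing positions become "". */
    Map<String, String> bind(String type, List<String> values) {
        Map<String, Integer> layout = layouts.get(type);
        if (layout == null) {
            throw new IllegalArgumentException("No layout for sentence type: " + type);
        }
        Map<String, String> named = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : layout.entrySet()) {
            int index = entry.getValue();
            named.put(entry.getKey(), index < values.size() ? values.get(index) : "");
        }
        return named;
    }
}
```

With this shape, supporting a new file version would mean adding one more descriptor document and selecting it by the version string read from the file header.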

Tom has given me a great head start in doing the 'physical' side of
the parsing, but I am still at a loss for the XML to Object side. Is
there anything that you might point me to?

Cheers

The Frog

Roedy Green · Oct 4, 2010

Each of these
'sentences' has a different number of fields, and the fields per
sentence have their own unique order. Unfortunately it's not as nicely
regular as a CSV - that would have been wonderful!

You can read that with my CSV package. Just read the first field on
the line, then branch to the code to read that format of line.
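
A small sketch of that read-then-branch idea follows. It uses plain String.split in place of the CSV package, whose API is not shown in this thread, and it assumes the type keyword sits on the metadata line with the values on the line after it. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class SentenceDispatcher {
    // type keyword -> code that turns the data values into an object
    private final Map<String, Function<List<String>, Object>> handlers = new HashMap<>();

    void register(String keyword, Function<List<String>, Object> handler) {
        handlers.put(keyword, handler);
    }

    /** Reads the first field of the metadata line, then branches on it. */
    Object dispatch(String metadataLine, String dataLine) {
        String keyword = metadataLine.split(",", -1)[0];
        Function<List<String>, Object> handler = handlers.get(keyword);
        if (handler == null) {
            throw new IllegalArgumentException("No handler for: " + keyword);
        }
        // The -1 limit keeps trailing empty values on the data line.
        return handler.apply(Arrays.asList(dataLine.split(",", -1)));
    }
}
```

A caller would register one handler per sentence type, for example `dispatcher.register("Segment", values -> ...)`, and feed it each pair of lines in turn.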

The Frog · Oct 5, 2010

Hi Roedy,

Nice tool, thank you for sharing this with the world. Looks like it is
up to the task of reading the file itself, but I still have no idea
how to feed some form of structural information to an app so that it
can build objects from that structural data and populate the members /
fields with the values in the data file. I have sniffed around
serialization but am not sure if this is the way to go, or perhaps
there is a 'better' approach.

As I see it, the package will evolve with the evolution of the data
files themselves. As a new format becomes available / known, some form
of descriptive document is added to the package that allows the
package to correctly interpret the data file and build appropriate
objects. I am just not sure how to approach this last part. Is there
anything you can point me to that might help me solve this?

Cheers (and many thanks)

The Frog

Lew · Oct 5, 2010

The Frog said:
Nice tool, thank you for sharing this with the world. Looks like it is
up to the task of reading the file itself, but I still have no idea
how to feed some form of structural information to an app so that it
can build objects from that structural data and populate the members /
fields with the values in the data file. I have sniffed around
serialization but am not sure if this is the way to go, or perhaps
there is a 'better' approach.

As I see it, the package will evolve with the evolution of the data
files themselves. As a new format becomes available / known, some form
of descriptive document is added to the package that allows the
package to correctly interpret the data file and build appropriate
objects. I am just not sure how to approach this last part. Is there
anything you can point me to that might help me solve this?

Your questions are the logical guideposts for your next iteration. You are
showing good software design sense.

Forgive my taking a side topic here, but often it pays big benefits to
consider a problem in holistic terms and let analysis control your thinking.

Instead of considering implementations - serialization, CSV, a 'better'
approach (how can you even tell?) - without understanding the behavior these
specific implementations must support, document the behavior itself.

For your application I would draw a state diagram, among other things.

Based on that diagram, I'd write a code framework - I wouldn't pick libraries
to support an unwritten code framework for an undocumented algorithm.

To support the framework you likely will benefit most from a lower-level
library that does part or most of what you want, but not all, encapsulated in
custom code that handles the particulars of your situation, matching your
state diagram and other documentation.

Your approach to understand candidate libraries, and to iteratively refine
your solution, is a good one. But there is no one-size-fits-all 'better'
approach independent of the purpose at hand. Your situation is one for which
the 'better' approach is to analyze and document the problem in detail prior
to coding, and to consider custom code in the mix.

That's a pattern for you to solve your own problem rather than a solution to
your problem. Others in this thread have already excelled at suggesting
particulars; my aim is to describe literally what you asked for, an approach.

The Frog · Oct 6, 2010

Lew,

Thank you for the guidance. It is indeed a matter of approaching the
scenario top-down rather than bottom-up, and your words make a lot
of sense. I will do as you have suggested and come back another time
(thread) with more specific issues.

To all who have helped guide me, a very big thank you.

Cheers

The Frog
