logic question for text file updates

ccc31807

We have a csv source file of many thousands of records, with two
columns, the ID and a
status field. It has very recently come to my attention that
occasionally the status of a record will change, with the change being
significant enough that the record must be updated before the process
runs. The update files consist of a small subset, sometimes a very
small subset, of the records in the source file. (The update file has
a number of other fields that can change also, but I'm only concerned
with the status field.)

My first inclination is to open the update file, create a hash with
the ID as the key and the status as value, then open the source file,
read each line, update the line if it exists in the hash, and write
each line to a new output file. However, I can think of several
different ways to do this -- I just don't know which way would be
best. I don't particularly want to read every line and write every
line of a source file when only a few lines (if any) need to be
modified.
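In Perl, that first inclination might look roughly like the sketch below -- the
file names are hypothetical, and it assumes the simple pipe-separated id|status
layout described later in the thread (the real update file carries extra fields,
so only the first two are taken):

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical file names.
my ($update_file, $source_file, $out_file) = ('update.dat', 'source.dat', 'source.new');

# Build a hash of ID => new status from the update file,
# ignoring any extra fields it carries.
my %new_status;
open my $upd, '<', $update_file or die "open $update_file: $!";
while (<$upd>) {
    chomp;
    my ($id, $status) = (split /\|/)[0, 1];
    $new_status{$id} = $status;
}
close $upd;

# Copy the source, swapping in the updated status where one exists.
open my $src, '<', $source_file or die "open $source_file: $!";
open my $out, '>', $out_file    or die "open $out_file: $!";
while (<$src>) {
    chomp;
    my ($id, $status) = split /\|/;
    $status = $new_status{$id} if exists $new_status{$id};
    print {$out} "$id|$status\n";
}
close $src;
close $out or die "close $out_file: $!";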

My second inclination would be to use a database and write an update
query for the records in the update file. But this seems a heavyweight
solution to a lightweight problem -- I would only be using the
database to modify records, not to do any of the things we ordinarily
use databases for.

I've never had to do a small number of updates to a large file before,
and it seems too trivial a task to use a database for. Any suggestions
on a better way to do this?

Thanks, CC.

P.S. - The end product of this process is a data file with
approximately 20 fields, written as comma separated, double quote
delimited text, designed to be imported into Excel and Access by end
users in performance of their duties.
 
sln

We have a csv source file of many thousands of records, with two
columns, the ID and a
status field. It has very recently come to my attention that
occasionally the status of a record will change, with the change being
significant enough that the record must be updated before the process
runs. The update files consist of a small subset, sometimes a very
small subset, of the records in the source file. (The update file has
a number of other fields that can change also, but I'm only concerned
with the status field.)

My first inclination is to open the update file, create a hash with
the ID as the key and the status as value, then open the source file,
read each line, update the line if it exists in the hash, and write
each line to a new output file. However, I can think of several
different ways to do this -- I just don't know which way would be
best. I don't particularly want to read every line and write every
line of a source file when only a few lines (if any) need to be
modified.

My second inclination would be to use a database and write an update
query for the records in the update file. But this seems a heavyweight
solution to a lightweight problem -- I would only be using the
database to modify records, not to do any of the things we ordinarily
use databases for.

I've never had to do a small number of updates to a large file before,
and it seems too trivial a task to use a database for. Any suggestions
on a better way to do this?

Thanks, CC.

P.S. - The end product of this process is a data file with
approximately 20 fields, written as comma separated, double quote
delimited text, designed to be imported into Excel and Access by end
users in performance of their duties.

You don't have to write a new file back out to disk for a small change.
You could design a disk that can grow or shrink its magnetic material
on the fly, and just insert/remove metal sectors as needed.

But, I think they trashed that idea when they invented fragmentation
capabilities.

To circumvent fragments on a data level, you could re-design
the file record so that a particular field of a record is
fixed width relative to surrounding fields, sufficient enough
to hold the largest variable data that field could possibly
encounter.

And if the field is small enough to accommodate all possible
values, there is not that much "air" involved in relation to
the overall file size.

Since the field is fixed, the offset into the record to the
field can be surmised and added to the location of the last
record end position, allowing you to write a new fixed width
value, guaranteeing not to overwrite the next field in that
record.
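
A minimal sketch of that kind of in-place overwrite in Perl, assuming the
record's start position plus the field's offset and fixed width are already
known (the file name and the numbers here are made up):

use strict;
use warnings;
use Fcntl qw(SEEK_SET);

my $record_start = 0;       # byte position where the record begins (assumed known)
my $field_offset = 8;       # bytes from the start of the record to the status field
my $field_width  = 20;      # fixed width reserved for the status field

open my $fh, '+<', 'source.dat' or die "open: $!";
seek $fh, $record_start + $field_offset, SEEK_SET or die "seek: $!";

# Pad the new value to the full field width so the following
# field (and the record length) are left untouched.
printf {$fh} "%-*s", $field_width, 'Inactive';
close $fh;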

-sln
 
ccc31807

You don't have to write a new file back out to disk for a small change.
You could design a disk that can grow or shrink its magnetic material
on the fly, and just insert/remove metal sectors as needed.

But, I think they trashed that idea when they invented fragmentation
capabilities.

To circumvent fragments on a data level, you could re-design
the file record so that a particular field of a record is
fixed width relative to surrounding fields, sufficient enough
to hold the largest variable data that field could possibly
encounter.

And if the field is small enough to accommodate all possible
values, there is not that much "air" involved in relation to
the overall file size.

Since the field is fixed, the offset into the record to the
field can be surmised and added to the location of the last
record end position, allowing you to write a new fixed width
value, guaranteeing not to overwrite the next field in that
record.

-sln

The key will always be a seven character integer. The value will
always be a string with fewer than 20 characters. I COULD use a fixed
width format, but my current format (for the source file) is pipe
separated (e.g. 0059485|Current) and all my logic splits input on the
pipe symbol.

The keys are not consecutive, not ordered, and have large skips, i.e.,
for several million records I might have ten thousand records in the
source file which are randomly ordered (is that an oxymoron?).
Treating the source file as an array would require many more array
elements than records in the file.

CC.
 
sln

[snip]
Since the field is fixed, the offset into the record to the
field can be surmised and added to the location of the last
record end position, allowing you to write a new fixed width
value, guaranteeing not to overwrite the next field in that
record.

The key will always be a seven character integer. The value will
always be a string with fewer than 20 characters. I COULD use a fixed
width format, but my current format (for the source file) is pipe
separated (e.g. 0059485|Current) and all my logic splits input on the
pipe symbol.

The keys are not consecutive, not ordered, and have large skips, i.e.,
for several million records I might have ten thousand records in the
source file which are randomly ordered (is that an oxymoron?).
Treating the source file as an array would require many more array
elements than records in the file.

CC.

So, if you have a source file, delimited by '|', that you eventually turn into
a double-quote, comma-delimited csv file, you could make that status field fixed
width (what, 20 chars tops?) in the source. When you generate the final csv
file, just strip white space from the beginning and end of the field
before you double quote it to the csv file.

Source file:
- fields all dynamic width except status (which is fixed 20 char).
- format

<field1>|<field2>|<field3>|<field4>|<- status, 20 char ->|<field6>|<field7>|<field_last>\n

You know the file position of the previous EOR. Use index() to find the pipe '|' char
of the status field of the current record (4th in the example), add that to the previous
EOR to get the write() position for the new status (if it changed).

To find out if the status changed, do your split /\|/ to get all the fields, check
the ID/status from the update file, write out new "fixed width" status (format with
printf or something) to the source file.

When it comes time to generate the csv from the source, just trim spaces before you
write it out.
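
Putting those pieces together, a rough sketch in Perl might look like the
following. It assumes the two-field id|status layout mentioned earlier, with the
status already stored space-padded to a fixed 20 characters, and a %new_status
hash already built from the update file; names and values are illustrative only.

use strict;
use warnings;
use Fcntl qw(SEEK_SET);

my $STATUS_WIDTH = 20;                          # fixed width reserved for the status field
my %new_status   = ( '0059485' => 'Inactive' ); # ID => corrected status, from the update file

open my $src, '+<', 'source.dat' or die "open source.dat: $!";

my $record_start = 0;                           # position of the previous end-of-record
while (my $line = <$src>) {
    my $next_record = tell $src;                # where the next record begins

    chomp(my $copy = $line);
    my ($id, $status) = split /\|/, $copy;
    $status = '' unless defined $status;
    $status =~ s/\s+$//;                        # stored value is space-padded

    if (defined $id and exists $new_status{$id} and $new_status{$id} ne $status) {
        # The status starts one byte past the pipe; add that offset to the
        # record's start position to get the spot to overwrite.
        my $offset = index($line, '|') + 1;
        seek $src, $record_start + $offset, SEEK_SET or die "seek: $!";
        printf {$src} "%-*s", $STATUS_WIDTH, $new_status{$id};
        seek $src, $next_record, SEEK_SET or die "seek: $!";   # go back to reading
    }
    $record_start = $next_record;
}
close $src;

The same trailing-space trim is applied again later, when the padded field is
double quoted into the final csv.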

-sln
 
sln

[snip]
Is there any way of 'blanking' a record? Normal CSV doesn't support
comments, and if you're importing into Excel you can't extend it to do
so; what does Excel do if you give it a file like

one|two|three
||||||||||||||
four|five|six

In this case, the newline is the record delimiter, '|' is the
field delimiter.
You can have Excel treat consecutive field delimiters as one.
In this case, the ||||||||||| produces a blank record.
This is the way Excel 2002 works, don't know if you can
auto-remove blank records in newer versions though.

-sln
 
ccc31807

So, if you have a source file, delimited by '|', that you eventually turn into
a double-quote, comma-delimited csv file, you could make that status field fixed
width (what, 20 chars tops?) in the source. When you generate the final csv
file, just strip white space from the beginning and end of the field
before you double quote it to the csv file.

Working backwards, my ultimate output file looks like this:
"id","field2","field3","status","field5","field6"\n

The 'status' field should be the current status, which rarely changes,
but it's critical to use the most current status.

I get about ten update files a year with the current status and a
number of other fields that I don't care about. I take these files,
strip out everything except the ID and the STATUS, and write that data
into memory.

Working frontwards, I build a source file with the two fields I
referenced, like this: [id|status]

I went ahead and bit the bullet, since I had to do something. I (1)
save the source file to a backup, (2) read in the source file and save
it to a hash on the ids, (3) read in the update file the same way, and
(4) print out the hash to the source file. It's reasonably quick, less
than a second (although I haven't benchmarked it) and seems to be
reliable.
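
For reference, a bare-bones sketch of those four steps (hypothetical file names,
id|status layout); since a hash has no order, the rewritten source comes out
sorted by ID rather than in the original random order:

use strict;
use warnings;
use File::Copy qw(copy);

my $source = 'source.dat';
my $update = 'update.dat';

copy($source, "$source.bak") or die "backup: $!";   # (1) save a backup

my %status;
for my $file ($source, $update) {                   # (2) source first, (3) then updates
    open my $fh, '<', $file or die "open $file: $!";
    while (<$fh>) {
        chomp;
        my ($id, $st) = (split /\|/)[0, 1];
        $status{$id} = $st;                         # later files overwrite earlier values
    }
    close $fh;
}

# (4) write the merged hash back over the source file
open my $out, '>', $source or die "open $source: $!";
print {$out} "$_|$status{$_}\n" for sort keys %status;
close $out;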

That said, I'd like to learn a more elegant way to do it.

CC.
You know the file position of the previous EOR. Use index() to find the pipe '|' char
of the status field of the current record (4th in the example), add that to the previous
EOR to get the write() position for the new status (if it changed).

To find out if the status changed, do your split /\|/ to get all the fields, check
the ID/status from the update file, write out new "fixed width" status (format with
printf or something) to the source file.

When it comes time to generate the csv from the source, just trim spaces before you
write it out.

That sounds a lot more complicated than the brute force approach I
used. But I appreciate your suggestion of treating the files as fixed
width, and I will explore that later.

CC
 
ccc31807

Is there any way of 'blanking' a record? Normal CSV doesn't support
comments, and if you're importing into Excel you can't extend it to do
so; what does Excel do if you give it a file like

Actually, I comment CSV files all the time, not for use by Excel, but
for use by my scripts. The 'comments' are on interspersed lines
beginning with #, so I can do this:
while (<INPUT>)
{
    next if /^#/;
    ...
}
    - read the update file(s) into a hash,
    - open the source file read/write,
    - go through it looking for the appropriate records,
    - when you find one, wipe it out without changing the length or
      removing the newline,
    - add the changed records onto the end of the file, since the
      records weren't in order anyway.

I don't see any real difference between this and reading the entire
file into memory, at least for the size of files I'm dealing with. IO is
always a bottleneck, and unless space is limited it's better to use
space than time.
It's generally not worth messing around with approaches like this,
though. Rewriting a file of a few MB doesn't exactly take long, and it's
much easier to get right.

Yeah, I'm beginning to think that my investment in my time isn't worth
the results.

Thanks, CC.
 
sln

That sounds a lot more complicated than the brute force approach I
used. But I appreciate your suggestion of treating the files as fixed
width, and I will explore that later.

CC

Actually, the brute force method you cite is far and away the
more complicated approach.

Good luck!

-sln
 
ccc31807

Actually, the brute force method you cite is far and away the
more complicated approach.

I would be very interested in why you think this. It may depend on
your definition of 'complicated.'

In terms of writing the code, it was pretty simple. First, open the
source file and read it into a hash. Second, open the update file and
read it into the SAME(!) hash (thereby overwriting the old values
where the hash keys are duplicated.) Third, write the hash back out to
the source file.

As to 'complicated', I have known people that use Access for data
processing files, spending hours on end creating and building Access
databases, queries, and reports to manipulate data. It takes them a
lot longer to generate a report using Access than it does me, using
Perl to munge the data. They say that my way is more 'complicated'
because I use Perl (which is 'harder') and Access is easier. I say my
way is less 'complicated' because I don't have to mess around with
Access. Frankly, when I read some of the scripts you post to c.l.p.m.,
I have a very hard time understanding them, and (from my POV) I would
say that you have a weird conception of 'complicated.'

CC.
 
Martijn Lievaart

The key will always be a seven character integer. The value will always
be a string with fewer than 20 characters. I COULD use a fixed width
format, but my current format (for the source file) is pipe separated
(e.g. 0059485|Current) and all my logic splits input on the pipe symbol.

Hurray for encapsulation. If you had encapsulated this from the start (i.e.
get a line, call a sub split_to_fields) you would only have had to update
one sub.
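
A small, runnable illustration of that idea, using the pipe-separated format
from earlier in the thread (the sub name follows Martijn's example):

use strict;
use warnings;

# All knowledge of the record layout lives here; if the format ever
# changes (pipe-separated, fixed width, real CSV), only this sub changes.
sub split_to_fields {
    my ($line) = @_;
    chomp $line;
    return split /\|/, $line;
}

while (my $line = <DATA>) {
    my ($id, $status) = split_to_fields($line);
    print "$id => $status\n";
}

__DATA__
0059485|Current
0060112|Inactive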

Not much help now, but something to keep in the back of your mind when
designing your next program.

M4
 
sln

I would be very interested in why you think this. It may depend on
your definition of 'complicated.'

In terms of writing the code, it was pretty simple. First, open the
source file and read it into a hash. Second, open the update file and
read it into the SAME(!) hash (thereby overwriting the old values
where the hash keys are duplicated.) Third, write the hash back out to
the source file.

As to 'complicated', I have known people that use Access for data
processing files, spending hours on end creating and building Access
databases, queries, and reports to manipulate data. It takes them a
lot longer to generate a report using Access than it does me, using
Perl to munge the data. They say that my way is more 'complicated'
because I use Perl (which is 'harder') and Access is easier. I say my
way is less 'complicated' because I don't have to mess around with
Access. Frankly, when I read some of the scripts you post to c.l.p.m.,
I have a very hard time understanding them, and (from my POV) I would
say that you have a weird conception of 'complicated.'

CC.

Unfortunately, I can't find any scripts you have posted here.

However, if you would like to make a technical comment on anything
I write and post here, feel free to do so. And I will be glad to help
if you run into difficulty understanding it.

Cheers.

-sln
 
Xho Jingleheimerschmidt

ccc31807 said:
We have a csv source file of many thousands of records, with two
columns, the ID and a
status field. It has very recently come to my attention that
occasionally the status of a record will change, with the change being
significant enough that the record must be updated before the process
runs. The update files consist of a small subset, sometimes a very
small subset, of the records in the source file. (The update file has
a number of other fields that can change also, but I'm only concerned
with the status field.)

Is the status field of fixed length? If so, it can be changed in place.
My first inclination is to open the update file, create a hash with
the ID as the key and the status as value, then open the source file,
read each line, update the line if it exists in the hash, and write
each line to a new output file. However, I can think of several
different ways to do this -- I just don't know which way would be
best. I don't particularly want to read every line and write every
line of a source file when only a few lines (if any) need to be
modified.

When you say the source file has "many thousand" records, how many
thousand are you talking? Unless you are talking hundreds of thousands
or thousands of thousands, I think that even spending the time to worry
about alternatives to rewriting the file, much less implementing those
alternatives, is a false economy.

But why write it out at all? Read the exception file into a hash,
read in the source file applying the exceptions in memory, and do
whatever you need to do with the now-accurate in-memory records. Leave
the source file as it is, and next time you need to do something with
it, just re-apply the exception file, again in memory.
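
A rough sketch of that approach, again with hypothetical file names and the
two-field pipe format; the corrections live only in memory, and the output is
the double-quoted, comma-separated file the P.S. describes (with just two
placeholder fields):

use strict;
use warnings;

# Exceptions rebuilt from the update file on every run: ID => corrected status.
my %exception;
open my $upd, '<', 'update.dat' or die "open update.dat: $!";
while (<$upd>) {
    chomp;
    my ($id, $status) = (split /\|/)[0, 1];
    $exception{$id} = $status;
}
close $upd;

# The source file itself is never rewritten; corrections are applied
# in memory while the quoted csv for Excel/Access is produced.
open my $src, '<', 'source.dat' or die "open source.dat: $!";
open my $csv, '>', 'report.csv' or die "open report.csv: $!";
while (<$src>) {
    chomp;
    my ($id, $status) = split /\|/;
    $status = $exception{$id} if exists $exception{$id};
    print {$csv} join(',', map { qq{"$_"} } $id, $status), "\n";
}
close $src;
close $csv;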
My second inclination would be to use a database and write an update
query for the records in the update file. But this seems a heavyweight
solution to a lightweight problem -- I would only be using the
database to modify records, not to do any of the things we ordinarily
use databases for.

If you already have a database that is maintained and backed up, etc.,
using it for an additional purpose may not be very heavyweight.

Xho
 
