Process header record and concatenate files

Scott Bass

Hi,

I'm not looking for a full blown solution, just architectural advice
for the following design criteria...

Input File(s): (tilde delimited)
Line 1:
Header Record:
SourceSystem~EffectiveDate~ExtractDateAndTime~NumberRecords~FileFormatVersion

RemainingRecords:
72 columns of delimited data

Output File:
Concatenate the input files into a single output file. A subset of
the header fields are prepended to the data lines as follows:

SourceSystem~EffectiveDate~ExtractDateAndTime~72 columns of delimited
data

Design Criteria:
1) If the number of records in the file does not match the number of
records reported in the header (incomplete FTP), abort the entire
file, print an error message, but continue processing the remaining
files.

(I'll use split and join to process the header and prepend to the
remainder).

2) Specify the list of input files on the command line. Specify the
output file on the command line. For example:

concat.pl -in foo.dat bar.dat blah.dat -out concat.dat

or possibly:

concat.pl -in src_*.dat -out concat.dat

(I'll use GetOptions to process the command line)
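
For what it's worth, the GetOptions part could look roughly like this -- only a sketch, with the option names taken from the examples above and everything else made up (the {1,} repeat spec needs Getopt::Long 2.35 or later):

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;

# -in takes one or more file names, -out takes exactly one.
my @in;
my $out;
GetOptions(
    'in=s{1,}' => \@in,     # {1,} repeat spec: one or more values
    'out=s'    => \$out,
) or die "Usage: $0 -in file [file ...] -out file\n";

@in = map { glob } @in;     # expand src_*.dat if the shell didn't
die "No input files given\n" unless @in;
die "No output file given\n" unless defined $out;

print "inputs: @in\noutput: $out\n";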

My thoughts:

1) Slurp the file into an array (minus first record). Count the
elements in the array. Abort if not equal to the number in the
header, else concat to the output file.

2) Process the file, reading records. At EOF, get record number from
$. . If correct, rewind to beginning of file handle and concat to
output file. (Not sure how to do the rewind bit; see the sketch after this list.)

3) Process the file, writing to a temp file. At EOF, get record
number from $. . If correct, concat the temp file to the output file.
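
Here is roughly what I have in mind for #2, with the rewind done via seek -- an untested sketch, where process_file and the variable names are just placeholders:

# Approach #2 sketch: count the records, then rewind with seek and copy.
sub process_file {
    my ($file, $out_fh) = @_;

    open my $fh, '<', $file or die "Cannot open '$file': $!";

    chomp( my $header = <$fh> );
    my ($src, $eff_date, $extract_dt, $nrec) = ( split /~/, $header )[0 .. 3];

    # First pass: count the data records.
    my $count = 0;
    $count++ while <$fh>;

    if ( $count != $nrec ) {
        warn "$file: header says $nrec records, found $count -- skipping\n";
        close $fh;
        return;
    }

    # Rewind to the start of the file and copy, prepending the header fields.
    seek $fh, 0, 0 or die "Cannot rewind '$file': $!";
    <$fh>;                                  # discard the header again
    while ( my $line = <$fh> ) {
        print {$out_fh} join '~', $src, $eff_date, $extract_dt, $line;
    }
    close $fh;
}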

Questions:

A) If I've globbed the files on the command line and am processing
the file handle <>, how do I know when the file name has changed?

B) When that happens, how do I reset $. to 1?

C) Of the three approaches above, which is the "best"? Performance
is important but not critical. I lean toward #3, since I need to
cater for files too large for #1. Or if you have a better idea please
let me know.

I hope this wasn't too cryptic...I was trying to keep it short.

Thanks,
Scott
 

John W. Krahn

Scott said:
[ snip ]

Questions:

A) If I've globbed the files on the command line and am processing
the file handle <>, how do I know when the file name has changed?

perldoc -f eof

[ snip ]

In a "while (<>)" loop, "eof" or "eof(ARGV)" can be used to
detect the end of each file, "eof()" will only detect the
end of the last file.

B) When that happens, how do I reset $. to 1?

When you reach the end-of-file, as determined by the eof function, close
the ARGV filehandle and $. will be reset.
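
Putting A and B together, a minimal sketch of that pattern (the print is only a stand-in for the real per-line work):

while (my $line = <>) {
    # $ARGV holds the name of the file currently being read,
    # and $. the line number within it.
    print "$ARGV line $.: $line";

    if (eof) {                 # end of the current file (eof, not eof())
        print "-- finished $ARGV after $. lines --\n";
        close ARGV;            # resets $. for the next file
    }
}

($ARGV also answers A directly: it changes whenever <> moves on to the next file.)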

C) Of the three approaches above, which is the "best"? Performance
is important but not critical.

You'd probably have to test them with real data to determine the "best".




John
 

Tad J McClellan

Scott Bass said:
Questions:

A) If I've globbed the files on the command line and am processing
the file handle <>, how do I know when the file name has changed?

B) When that happens, how do I reset $. to 1?


perldoc -f eof
 

Eric Pozharski

On 2009-04-05, Scott Bass said:
A) If I've globbed the files on the command line and am processing
the file handle <>, how do I know when the file name has changed?

B) When that happens, how do I reset $. to 1?

If I got your problem right, you've missed I<@ARGV>; then you could
make your own loop over the command-line files. Or alternatively monitor
I<$ARGV>.

Scott said:
C) Of the three approaches above, which is the "best"? Performance
is important but not critical. I lean toward #3, since I need to
cater for files too large for #1. Or if you have a better idea please
let me know.

use Your::Taste qw| full |;

or

use Your::Intuition qw| reverse |;

or

use Benchmark qw| timethese |;

Scott said:
I hope this wasn't too cryptic...I was trying to keep it short.

You better show your code. Perl is powerfully expressive or
expressively powerful (I doubt I would ever get that right).
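
For illustration, an explicit loop over the command-line files could look something like this -- a sketch only, in the spirit of approach #1 (so it assumes each file fits in memory), with all names made up and 'concat.dat' standing in for the -out value:

open my $out, '>', 'concat.dat' or die "Cannot open concat.dat: $!";

FILE:
for my $file (@ARGV) {
    open my $in, '<', $file or do {
        warn "Cannot open '$file': $!\n";
        next FILE;
    };

    chomp( my $header = <$in> );
    my ($src, $eff, $ext, $expected) = ( split /~/, $header )[0 .. 3];

    my @records;
    push @records, $_ while <$in>;
    close $in;

    if ( @records != $expected ) {
        warn "$file: header says $expected records, got ", scalar @records,
             " -- skipping whole file\n";
        next FILE;
    }

    print {$out} "$src~$eff~$ext~$_" for @records;
}
close $out or die "Cannot close output: $!";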
 

sln

Hi,

I'm not looking for a full blown solution, just architectural advice
for the following design criteria...

Just submit this to my office for a bid. Arch advice is not available
however, since we consider that non-advice and a free service, which
we don't offer.

Any other advice is free however. Like should you buy GM stock.
That advice is always free.

-sln
 

smallpond

Scott Bass said:
[ snip ]
C) Of the three approaches above, which is the "best"? Performance
is important but not critical. I lean toward #3, since I need to
cater for files too large for #1. Or if you have a better idea please
let me know.

I hope this wasn't too cryptic...I was trying to keep it short.

Thanks,
Scott

If you want to process your input in one pass, use tell to save the
position of the output at the start of each input file. If you get
to the end of input and the number of records is wrong, use seek
to discard it.
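
A rough sketch of that one-pass idea, using truncate together with seek so the bad records really are dropped (file and variable names are placeholders):

open my $out, '>', 'concat.dat' or die "Cannot open output: $!";

FILE:
for my $file (@ARGV) {
    my $start = tell $out;          # where this file's records will begin

    open my $in, '<', $file or do {
        warn "Cannot open '$file': $!\n";
        next FILE;
    };

    chomp( my $header = <$in> );
    my ($src, $eff, $ext, $expected) = ( split /~/, $header )[0 .. 3];

    my $count = 0;
    while ( my $line = <$in> ) {
        print {$out} "$src~$eff~$ext~$line";
        $count++;
    }
    close $in;

    if ( $count != $expected ) {
        warn "$file: header says $expected records, found $count -- discarding\n";
        seek $out, $start, 0  or die "seek: $!";      # back to where the file began
        truncate $out, $start or die "truncate: $!";  # chop off its records
    }
}
close $out or die "close: $!";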
 
