Perl script to clean up file -- Don't know if it can be done

LHradowy

I thought I would throw this out there. I think it cannot be done, but I am
not a guru.

This is the problem: I get a file from which I must pull out the pertinent
data. It has a header and footer, as well as page breaks, all in ASCII
format. I need to pull out just the columns.
I do this all manually (deleting the header, the footer, and all the page
breaks). At times there is also a 0 at the beginning of a record that I do
not want there either.

This is what the file looks like...


REPORT NAME: FACICL
0 CLIENT JOB NUMBER: 23405
0 CLIENT NAME: LAURA XXXXXXXX
0 CLIENT MAILING CODE: D509H
0 REPORT DATE: 04/07/08
0 REPORT TIME: 12:46
1REPORT NO: FACRPT14 SOME INFO HERE
RUN DATE: 04JUL10
0 JOB NAME: FACCTICL CUTOVER ACCARE INTERFACE
REJECT REPORT PAGE NO: 2
0 PROGRAM : FACB5500 CUTOVER: LEAF CUTOVER
DATE: 04JUL09
0 TELN CUTTELN CUTOEN REJECT REASON
---- ------- -------------- -------------
-----------------
0 1555200 00 0 12 02 CUSTOMER HAS
2555206 00 0 05 01 CUSTOMER HAS
4555208 00 0 03 06 TELN NOT BILL
1REPORT NO: FACRPT14 SOME DATA HERE
RUN DATE: 04JUL10
0 JOB NAME: FACCTICL CUTOVER ACCARE INTERFACE
REJECT REPORT PAGE NO: 3
0 PROGRAM : FACB5500 CUTOVER: LEAF CUTOVER
DATE: 04JUL09
0 TELN CUTTELN CUTOEN REJECT REASON

---- ------- -------------- -------------
-----------------
0 1555200 00 0 12 02 CUSTOMER HAS
2555206 00 0 05 01 CUSTOMER HAS
4555208 00 0 03 06 TELN NOT BILL
- REJECTED = 000000145 CUTOVER = 000000213
- *** SUCCESSFUL COMPLETION OF
FACCTICL ***


I manually dissect this file to make it look like this...
1555002 00 0 04 27 TELN NOT BILL
3555007 00 0 06 00 CUSTOMER HAS
5555410 00 0 12 10 CUSTOMER HAS
6755012 00 0 12 06 CUSTOMER HAS

I have manually removed the header, footer, and page breaks. There also
always seems to be a 0 at the start of the first record, which I remove as well.
I then run this Perl script:

while (<>) {
    chomp;                  # Remove the trailing newline
    s/^\s+//;               # Remove leading whitespace
    # Split on runs of two or more whitespace characters
    my @cols = split /\s{2,}/, $_, -1;
    # If only two columns were found, insert an empty middle column
    @cols == 2 and splice @cols, 1, 0, "";
    print join(',', @cols), "\n";
}

And I get this: WHAT I NEED!
5555002,00 0 04 27,TELN NOT BILL
1555007,00 0 06 00,CUSTOMER HAS
2555010,00 0 12 10,CUSTOMER HAS

I want to try to eliminate as much manual intervention as I can.
 
Craig Ciquera

Would this work:

# Read in the data file (assumed to be named datafile.txt)
open(DATAFILE, "<", "datafile.txt") or die "Cannot open datafile.txt: $!";

while (defined(my $line = <DATAFILE>))
{
    # Skip everything but the data lines we are interested in
    next unless $line =~ /\d{2} \d \d{2} \d{2}/;

    # Remove any leading 0's and/or whitespace
    $line =~ s/^0*\s*//;

    # Strip trailing whitespace so the newline is not turned into a comma
    $line =~ s/\s+$//;

    # Convert each run of two or more whitespace characters to a comma
    $line =~ s/\s{2,}/,/g;

    print "$line\n";
}

close(DATAFILE);
 
thundergnat

LHradowy said:
I thought I would throw this out there. I think it cannot be done, but I am
not a guru.

This is the problem: I get a file from which I must pull out the pertinent
data. It has a header and footer, as well as page breaks, all in ASCII
format. I need to pull out just the columns.
I do this all manually (deleting the header, the footer, and all the page
breaks). At times there is also a 0 at the beginning of a record that I do
not want there either.

This is what the file looks like...

Assuming the data format ALWAYS looks like that:

Pass the data file name as a command-line argument.

while (<>) {
    if (/CUSTOMER HAS|TELN NOT BILL/) {
        $_ =~ s/^0?\s+(\d+)\D+(.{10})\W+(.+?)/$1,$2,$3/;
        print;
    }
}
 
thundergnat

LHradowy said:
I thought I would throw this out there. I think it cannot be done, but I am
not a guru.

This is the problem: I get a file from which I must pull out the pertinent
data. It has a header and footer, as well as page breaks, all in ASCII
format. I need to pull out just the columns.
I do this all manually (deleting the header, the footer, and all the page
breaks). At times there is also a 0 at the beginning of a record that I do
not want there either.

This is what the file looks like...

[snip]

0 1555200 00 0 12 02 CUSTOMER HAS
2555206 00 0 05 01 CUSTOMER HAS
4555208 00 0 03 06 TELN NOT BILL
[snip]

0 1555200 00 0 12 02 CUSTOMER HAS
2555206 00 0 05 01 CUSTOMER HAS
4555208 00 0 03 06 TELN NOT BILL

[snip]

What transforms are you applying to change the numbers?
Where did the extra CUSTOMER HAS line come from?

I manually dissect this file to make it look like this...
1555002 00 0 04 27 TELN NOT BILL
3555007 00 0 06 00 CUSTOMER HAS
5555410 00 0 12 10 CUSTOMER HAS
6755012 00 0 12 06 CUSTOMER HAS

[snip]

What transforms are you applying to change the numbers?
What happened to the other CUSTOMER HAS line?

And I get this: WHAT I NEED!
5555002,00 0 04 27,TELN NOT BILL
1555007,00 0 06 00,CUSTOMER HAS
2555010,00 0 12 10,CUSTOMER HAS

I want to try to eliminate as much manual intervention as I can.

Are there several different TELN NOT BILL lines in any one file?
If so, how do you tell which CUSTOMER HAS goes with which?

I have no idea how you got "WHAT YOU NEED" from the example data.
None of the lines in the final output can be directly derived without
applying some unknown transform.

If you JUST want the lines containing TELN NOT BILL or CUSTOMER HAS
then sift them out and reformat the lines.
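
The "sift, then reformat" idea could look roughly like the sketch below. It is only an illustration under stated assumptions: the subroutine name report_line_to_csv is made up for this example, the column spacing in the sample strings is assumed (the fields are taken to be separated by two or more spaces, as the original split script implies), and only the two reject reasons visible in the sample are recognised.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a CSV string for a data line, or undef for
# anything else (header, page break, footer).
sub report_line_to_csv {
    my ($line) = @_;

    # Step 1: sift. Keep only lines carrying one of the two reject reasons
    # seen in the sample; any other reason would have to be added here.
    return unless $line =~ /CUSTOMER HAS|TELN NOT BILL/;

    # Step 2: reformat. Drop trailing whitespace (including the newline),
    # then the optional leading 0 and the whitespace after it.
    $line =~ s/\s+\z//;
    $line =~ s/^0?\s+//;

    # Split into at most three columns on runs of two or more spaces.
    my @cols = split /\s{2,}/, $line, 3;
    return join ',', @cols;
}

# Sample lines modelled on the posted report; the spacing is an assumption.
my @sample = (
    "0 JOB NAME: FACCTICL   CUTOVER ACCARE INTERFACE\n",
    "0   1555200    00 0 12 02    CUSTOMER HAS\n",
    "    2555206    00 0 05 01    CUSTOMER HAS\n",
    "    4555208    00 0 03 06    TELN NOT BILL\n",
    "-  REJECTED = 000000145   CUTOVER = 000000213\n",
);

for my $line (@sample) {
    my $csv = report_line_to_csv($line);
    print "$csv\n" if defined $csv;
}
```

On these sample lines the loop prints the three data records as 1555200,00 0 12 02,CUSTOMER HAS and so on, and skips the header and footer lines. Keeping the sift and the reformat as separate steps makes it easy to change one without touching the other.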
 
LHradowy

My apologies, I have managed to work it out and fix my problem. Thanks
to all for your help!
 
Larry Felton Johnson

thundergnat said:
Assuming the data format ALWAYS looks like that:

Pass the data file name as a command-line argument.

while (<>) {
    if (/CUSTOMER HAS|TELN NOT BILL/) {
        $_ =~ s/^0?\s+(\d+)\D+(.{10})\W+(.+?)/$1,$2,$3/;
        print;
    }
}


thundergnat's correct, and in fact if a file has enough consistency to
describe in plain English, you can almost always find a set of regexes
to include, rearrange and exclude the things you'd like. Assuming
that you are working in two passes on this data (the one described in
your previous post to deal with the individual data line and this more
verbose file) you could probably substitute the line from the regex I
posted to the other thread for the one above and do the entire
transformation in one pass. I'll cut and paste and run it later, but
the if statement above might even be unnecessary, since the
substitution probably only matches the relevant lines.

Larry
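
Whether the if statement is really redundant can be checked on a few lines from the posted report. The sketch below is a rough test, not a proof: it applies thundergnat's substitution, copied as posted, to one line of each kind and keeps only the lines where s/// reports a match. The spacing in the sample strings is assumed, and the array name @kept is invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One sample line of each kind; the column spacing is an assumption.
my @sample = (
    "1REPORT NO: FACRPT14   SOME INFO HERE\n",         # page break
    "0 PROGRAM : FACB5500   CUTOVER: LEAF CUTOVER\n",  # header
    "0   1555200    00 0 12 02    CUSTOMER HAS\n",     # data, leading 0
    "    4555208    00 0 03 06    TELN NOT BILL\n",    # data, no leading 0
    "-  REJECTED = 000000145   CUTOVER = 000000213\n", # footer
);

my @kept;
for my $line (@sample) {
    # s/// returns true only when the pattern matched, so the substitution
    # itself can serve as the filter, with no separate keyword test.
    if ($line =~ s/^0?\s+(\d+)\D+(.{10})\W+(.+?)/$1,$2,$3/) {
        push @kept, $line;
    }
}

print @kept;
```

With these sample lines only the two data records survive, printed as 1555200,00 0 12 02,CUSTOMER HAS and 4555208,00 0 03 06,TELN NOT BILL. A header line that happened to start with whitespace followed by digits would also match, so the keyword test is still a cheap safety net if the report layout is not fully known.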
 
Larry Felton Johnson

LHradowy said:
I thought I would throw this out there. I think it cannot be done, but I am
not a guru.

This is the problem: I get a file from which I must pull out the pertinent
data. It has a header and footer, as well as page breaks, all in ASCII
format. I need to pull out just the columns.
I do this all manually (deleting the header, the footer, and all the page
breaks). At times there is also a 0 at the beginning of a record that I do
not want there either.

This is what the file looks like...

The following code did it in one pass (stripping the file of the extraneous
stuff and doing the transformation). Since I only have access to what you've
presented, I don't consider my run a real test. But you get the general
picture. If you are only interested in certain lines you can just focus
your match on the characteristics of those lines and try to do any transformations
at the same time.

Here's the code snippet (I named the file I copied your sample into file2):

#!/usr/bin/perl -w

use strict;

my $infile  = 'file2';
my $outfile = 'outfile';

open my $in,  '<', $infile  or die "Can't open $infile: $!\n";
open my $out, '>', $outfile or die "Can't open $outfile: $!\n";

while (<$in>) {
    # The substitution succeeds only on data lines, so it doubles as the filter
    if (s/^0?\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(.*)/$1,$2 $3 $4 $5,$6/) {
        print $out $_;
    }
}

close $in;
close $out or die "Can't close $outfile: $!\n";
 
