Perl script to clean up file -- Don't know if it can be done

LHradowy · Sep 22, 2004

I thought I would throw this out there. I think it cannot be done, but I am
not a guru.

This is the problem: I get a file that I must pull the pertinent data out of.
It has a header and footer, as well as page breaks, all in ASCII
format. I need to pull out just the columns.
I do this all manually (deleting the header and footer, as well as all the
page breaks). There is also, at times, a 0 at the beginning of a record that I
do not want there.

This is what the file looks like...

REPORT NAME: FACICL
0 CLIENT JOB NUMBER: 23405
0 CLIENT NAME: LAURA XXXXXXXX
0 CLIENT MAILING CODE: D509H
0 REPORT DATE: 04/07/08
0 REPORT TIME: 12:46
1REPORT NO: FACRPT14 SOME INFO HERE
RUN DATE: 04JUL10
0 JOB NAME: FACCTICL CUTOVER ACCARE INTERFACE
REJECT REPORT PAGE NO: 2
0 PROGRAM : FACB5500 CUTOVER: LEAF CUTOVER
DATE: 04JUL09
0 TELN CUTTELN CUTOEN REJECT REASON
---- ------- -------------- -------------
-----------------
0 1555200 00 0 12 02 CUSTOMER HAS
2555206 00 0 05 01 CUSTOMER HAS
4555208 00 0 03 06 TELN NOT BILL
1REPORT NO: FACRPT14 SOME DATA HERE
RUN DATE: 04JUL10
0 JOB NAME: FACCTICL CUTOVER ACCARE INTERFACE
REJECT REPORT PAGE NO: 3
0 PROGRAM : FACB5500 CUTOVER: LEAF CUTOVER
DATE: 04JUL09
0 TELN CUTTELN CUTOEN REJECT REASON

---- ------- -------------- -------------
-----------------
0 1555200 00 0 12 02 CUSTOMER HAS
2555206 00 0 05 01 CUSTOMER HAS
4555208 00 0 03 06 TELN NOT BILL
- REJECTED = 000000145 CUTOVER = 000000213
- *** SUCCESSFUL COMPLETION OF
FACCTICL ***

I manually dissect this file to make it look like this...
1555002 00 0 04 27 TELN NOT BILL
3555007 00 0 06 00 CUSTOMER HAS
5555410 00 0 12 10 CUSTOMER HAS
6755012 00 0 12 06 CUSTOMER HAS

I have manually removed the header, footer and page breaks. There also
always seems to be a 0 at the start of the first record, which I remove as well.
I then run this Perl script:

while (<>) {
    chomp;                                  # Remove the trailing newline
    s/^\s+//;                               # Remove leading spaces
    my @cols = split m/\s{2,}/, $_, -1;     # Split on two (or more) whitespace characters
    @cols == 2 and splice @cols, 1, 0, "";  # Only two columns: insert an empty middle one
    print join(',', @cols) . "\n";
}

And I get this: WHAT I NEED!
5555002,00 0 04 27,TELN NOT BILL
1555007,00 0 06 00,CUSTOMER HAS
2555010,00 0 12 10,CUSTOMER HAS

I want to try to eliminate as much manual intervention as I can.
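
For readers following along, here is a minimal sketch of what the script above does to a single cleaned-up line. It assumes the columns in the real file are separated by two or more spaces (the spacing in the samples posted here looks collapsed), and the helper name to_csv is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the same steps as the loop body, applied to one line.
sub to_csv {
    my ($line) = @_;
    chomp $line;                            # drop the trailing newline
    $line =~ s/^\s+//;                      # drop leading spaces
    my @cols = split /\s{2,}/, $line, -1;   # columns are 2+ spaces apart (assumed)
    splice @cols, 1, 0, '' if @cols == 2;   # pad a missing middle column
    return join ',', @cols;
}

# Three columns: single spaces inside the middle column survive the split.
print to_csv("5555002  00 0 04 27  TELN NOT BILL\n"), "\n";
# prints: 5555002,00 0 04 27,TELN NOT BILL

# Two columns: the splice inserts an empty field so the record keeps three.
print to_csv("   1555007  CUSTOMER HAS\n"), "\n";
# prints: 1555007,,CUSTOMER HAS
```

The split only works while the gaps between columns stay at least two spaces wide; a single-space gap would merge two columns into one field.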

Craig Ciquera · Sep 22, 2004

Would this work:

# Read in the datafile (assume name is datafile.txt)
open(my $datafile, '<', 'datafile.txt') or die "Cannot open datafile.txt: $!";

while (defined(my $line = <$datafile>))
{
    # Skip everything but the info we are interested in
    next unless $line =~ /\d{2} \d \d{2} \d{2}/;

    # Drop the newline and any trailing whitespace so it does not become a comma
    $line =~ s/\s+$//;

    # Remove any leading 0's and/or whitespace
    $line =~ s/^0*\s*//;

    # Turn each run of two or more whitespace characters into a comma
    $line =~ s/\s{2,}/,/g;

    print "$line\n";
}

close($datafile);

thundergnat · Sep 22, 2004

LHradowy said:
I thought I would throw this out there. I think it cannot be done, but I am
not a guru.

This is the problem: I get a file that I must pull the pertinent data out of.
It has a header and footer, as well as page breaks, all in ASCII
format. I need to pull out just the columns.
I do this all manually (deleting the header and footer, as well as all the
page breaks). There is also, at times, a 0 at the beginning of a record that I
do not want there.

This is what the file looks like...

Assuming the data's format ALWAYS looks like that, pass the data file as a
parameter:

while (<>) {
    if (/CUSTOMER HAS|TELN NOT BILL/) {
        $_ =~ s/^0?\s+(\d+)\D+(.{10})\W+(.+?)/$1,$2,$3/;
        print;
    }
}

thundergnat · Sep 22, 2004

LHradowy said:
I thought I would throw this out there. I think it cannot be done, but I am
not a guru.

This is the problem: I get a file that I must pull the pertinent data out of.
It has a header and footer, as well as page breaks, all in ASCII
format. I need to pull out just the columns.
I do this all manually (deleting the header and footer, as well as all the
page breaks). There is also, at times, a 0 at the beginning of a record that I
do not want there.

This is what the file looks like...

[snip]

0 1555200 00 0 12 02 CUSTOMER HAS
2555206 00 0 05 01 CUSTOMER HAS
4555208 00 0 03 06 TELN NOT BILL
[snip]

0 1555200 00 0 12 02 CUSTOMER HAS
2555206 00 0 05 01 CUSTOMER HAS
4555208 00 0 03 06 TELN NOT BILL

[snip]

What transforms are you applying to change the numbers?
Where did the extra CUSTOMER HAS line come from?

I manually dissect this file to make it look like this...
1555002 00 0 04 27 TELN NOT BILL
3555007 00 0 06 00 CUSTOMER HAS
5555410 00 0 12 10 CUSTOMER HAS
6755012 00 0 12 06 CUSTOMER HAS

[snip]

What transforms are you applying to change the numbers?
What happened to the other CUSTOMER HAS line?

And I get this: WHAT I NEED!
5555002,00 0 04 27,TELN NOT BILL
1555007,00 0 06 00,CUSTOMER HAS
2555010,00 0 12 10,CUSTOMER HAS

I want to try to eliminate as much manual intervention as I can.

Are there several different TELN NOT BILL lines in any one file?
If so, how do you tell which CUSTOMER HAS goes with which?

I have no idea how you got "WHAT YOU NEED" from the example data.
None of the lines in the final output can be directly derived without applying
some unknown transform.

If you JUST want the lines containing TELN NOT BILL or CUSTOMER HAS
then sift them out and reformat the lines.
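
A minimal sketch of that sift-and-reformat idea, under stated assumptions: each data line holds an optional leading 0, the TELN, four numeric fields, and then the reason text. The sub name report_line_to_csv and the spacing of the sample lines are made up for illustration; the real report may be laid out differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: convert one report line to a CSV record, or return
# nothing for lines that are not data (headers, footers, page breaks).
sub report_line_to_csv {
    my ($line) = @_;

    # Sift: keep only lines carrying one of the two known reject reasons.
    return unless $line =~ /CUSTOMER HAS|TELN NOT BILL/;

    # Reformat: optional carriage-control 0, TELN, four numeric fields, reason.
    return unless $line =~
        /^(?:0\s+)?\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S.*?)\s*$/;
    return "$1,$2 $3 $4 $5,$6";
}

# Sample lines taken from the post; the column spacing is an assumption.
my @sample = (
    "0 PROGRAM : FACB5500 CUTOVER: LEAF CUTOVER",
    "0 1555200  00 0 12 02  CUSTOMER HAS",
    " 2555206  00 0 05 01  CUSTOMER HAS",
    " 4555208  00 0 03 06  TELN NOT BILL",
    "- REJECTED = 000000145 CUTOVER = 000000213",
);

for my $line (@sample) {
    my $csv = report_line_to_csv($line);
    print "$csv\n" if defined $csv;   # header and footer lines are skipped
}
```

The leading 0 is dropped only when whitespace follows it, so a TELN that itself begins with 0 would be left alone.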

LHradowy · Sep 22, 2004

My apologies, I have managed to waggle it out and fix my problem. Thanks,
all, for all your help!

Larry Felton Johnson · Sep 22, 2004

thundergnat said:
Assuming the data's format ALWAYS looks like that, pass the data file as a
parameter:

while (<>) {
    if (/CUSTOMER HAS|TELN NOT BILL/) {
        $_ =~ s/^0?\s+(\d+)\D+(.{10})\W+(.+?)/$1,$2,$3/;
        print;
    }
}

thundergnat's correct, and in fact if a file has enough consistency to
describe in plain English, you can almost always find a set of regexes
to include, rearrange and exclude the things you'd like. Assuming
that you are working in two passes on this data (the one described in
your previous post to deal with the individual data line and this more
verbose file) you could probably substitute the line from the regex I
posted to the other thread for the one above and do the entire
transformation in one pass. I'll cut and paste and run it later, but
the if statement above might even be unnecessary, since the
substitution probably only matches the relevant lines.

Larry
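
A quick way to check that hunch, sketched under the assumption that the sample lines below are spaced roughly like the real report (the spacing is a guess): s/// returns true only when the pattern matched, so its return value can act as the filter by itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines from the original post; the column spacing is assumed.
my @sample = (
    "1REPORT NO: FACRPT14",
    "0 JOB NAME: FACCTICL CUTOVER ACCARE INTERFACE",
    "0 1555200  00 0 12 02  CUSTOMER HAS",
    " 4555208  00 0 03 06  TELN NOT BILL",
    "- REJECTED = 000000145 CUTOVER = 000000213",
);

my @kept;
for my $line (@sample) {
    my $copy = $line;
    # The substitution quoted above, with no surrounding if on the reason text.
    # It succeeds only on lines shaped like data, so nothing else is kept.
    if ($copy =~ s/^0?\s+(\d+)\D+(.{10})\W+(.+?)/$1,$2,$3/) {
        push @kept, $copy;
    }
}

print "$_\n" for @kept;
# prints:
# 1555200,00 0 12 02,CUSTOMER HAS
# 4555208,00 0 03 06,TELN NOT BILL
```

This only shows the pattern rejecting the header and footer lines posted in the thread; a header line that happened to start with whitespace followed by digits could still slip through.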

Larry Felton Johnson · Sep 22, 2004

LHradowy said:
I thought I would throw this out there. I think it cannot be done, but I am
not a guru.

This is the problem: I get a file that I must pull the pertinent data out of.
It has a header and footer, as well as page breaks, all in ASCII
format. I need to pull out just the columns.
I do this all manually (deleting the header and footer, as well as all the
page breaks). There is also, at times, a 0 at the beginning of a record that I
do not want there.

This is what the file looks like...

The following code did it in one pass (stripping the file of the extraneous
stuff and doing the transformation). Since I only have access to what you've
presented, I don't consider my run a real test. But you get the general
picture. If you are only interested in certain lines you can just focus
your match on the characteristics of those lines and try to do any transformations
at the same time.

Here's the code snippet (I named the file I copied your sample data into file2):

#!/usr/bin/perl -w

use strict;

my $infile  = 'file2';
my $outfile = 'outfile';

open my $in,  '<', $infile  or die "Can't open $infile: $!\n";
open my $out, '>', $outfile or die "Can't open $outfile: $!\n";

while (<$in>) {
    # The substitution succeeds only on data lines, so it doubles as the filter
    if (s/^0?\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(.*)/$1,$2 $3 $4 $5,$6/) {
        print $out $_;
    }
}

close $in;
close $out or die "Can't close $outfile: $!\n";
