File handling and regex

Luca Villa · Nov 5, 2007

Hi all!

I need help with Perl under Windows command-line to solve the
following task:

I have many unorganized txt files and subdirectories under the root
directory "c:\dir", like this:
c:\dir\foobar.txt
c:\dir\popo.txt
c:\dir\sub1\agsds.txt
c:\dir\sub1\popo.txt
c:\dir\sub2\hghghg.txt
c:\dir\sub2\subbb\abc.txt

These txt files are of three types:
type1: those that contain a string definable by the regular expression
"abc[0-9]+def"
type2: those that contain a string definable by the regular expression
"lmn[0-9]+opq"
type3: those that contain a string definable by the regular expression
"rst[0-9]+uvw"

I would like to copy all these txt files, with a Perl script run from
the Windows command line, into a single directory "c:\output". Each
filename should be composed of the number found in the regex match
(the "[0-9]+" part of the regex) and a "-type1.txt", "-type2.txt" or
"-type3.txt" suffix, depending on which of the three regexes above is
found in the file, giving a result like this:
c:\output\15-type2.txt
c:\output\102-type1.txt
c:\output\33-type1.txt
c:\output\49-type3.txt
c:\output\4-type1.txt
c:\output\335-type2.txt
c:\output\32-type3.txt

How can I do it?

John W. Krahn · Nov 5, 2007

*UNTESTED* YMMV

#!/usr/bin/perl
use warnings;
use strict;
use File::Find;
use File::Copy;

my $from = 'c:/dir';
my $to   = 'c:/output';

# Pattern => type label; the capture group holds the number.
my %trans = qw(
    abc(\d+)def type1
    lmn(\d+)opq type2
    rst(\d+)uvw type3
);

find sub {
    return unless open my $fh, '<', $_;
    return unless -f $fh;        # skip anything that is not a plain file
    read $fh, my $data, -s _;    # slurp the file; "_" reuses the stat from -f
    close $fh;
    for my $pat ( keys %trans ) {
        next unless $data =~ $pat;
        copy $File::Find::name, "$to/$1-$trans{$pat}.txt";
        last;
    }
}, $from;

__END__

John
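
As a side note on the pattern table above: keys %trans returns the patterns in no particular order, so a file matching more than one pattern could be classified differently from run to run. A minimal sketch of the same lookup with a fixed order follows. The helper name target_name and the @rules list are made up for illustration and are not part of John's script.

```perl
use strict;
use warnings;

# Ordered list of [compiled pattern, type label]. The order decides
# which type wins when a file matches more than one pattern.
my @rules = (
    [ qr/abc(\d+)def/, 'type1' ],
    [ qr/lmn(\d+)opq/, 'type2' ],
    [ qr/rst(\d+)uvw/, 'type3' ],
);

# target_name() is a hypothetical helper: it returns "NUMBER-typeN.txt"
# for the first rule that matches, or undef if none does.
sub target_name {
    my ($data) = @_;
    for my $rule (@rules) {
        my ( $re, $type ) = @$rule;
        return "$1-$type.txt" if $data =~ $re;
    }
    return;
}

print target_name("xx lmn15opq yy"), "\n";    # prints 15-type2.txt
```

The returned name would then be appended to the output directory before calling copy.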

jordilin · Nov 6, 2007

One question: when you write
read $fh, my $data, -s _;
shouldn't that be
read $fh, my $data, -s $_;

I have searched the web without success. I don't know whether _
is the same as $_ in this particular case.
Best regards,
jordi

Josef Moellers · Nov 6, 2007

jordilin said:
One question: when you write
read $fh, my $data, -s _;
shouldn't that be
read $fh, my $data, -s $_;

I have searched the web without success. I don't know whether _
is the same as $_ in this particular case.

No, it doesn't, at least not "literally" or conceptually.
$_ is the default scalar variable (inside find(), the name of the
current file), while "_" is a special filehandle that refers to the stat
structure saved by the most recent file test or stat operation:

"If any of the file tests (or either the "stat" or "lstat" operators)
are given the special filehandle consisting of a solitary underline,
then the stat structure of the previous file test (or stat operator) is
used, saving a system call."
(perldoc -f -s)
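
A small sketch of the difference, using a scratch file created with File::Temp (the file and its contents are made up for the demonstration):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file with known contents (12 characters).
my ( $fh, $path ) = tempfile( UNLINK => 1 );
print {$fh} "hello, world";
close $fh;

my ( $cached, $fresh );

# -f does a stat() on $path and keeps the result ...
if ( -f $path ) {
    # ... and "-s _" reads the size from that saved stat structure
    # instead of asking the operating system again.
    $cached = -s _;

    # Naming the file explicitly triggers a second, separate stat().
    $fresh = -s $path;
}

print "cached=$cached fresh=$fresh\n";    # prints cached=12 fresh=12
```

Both tests give the same size here; the "_" form simply saves one system call. In John's script the preceding test is -f $fh, so "-s _" reports the size of the file that was just opened.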

Luca Villa · Nov 9, 2007

Thanks to all and to John in particular.

John's solution probably works, but I had difficulty adapting it to
my needs, so I ended up using this alternative solution:

use File::Find;

find(\&found, 'c:/dir');

sub found {
    unless(open(IN,"<$File::Find::name")) {
        warn "Could not open $File::Find::name: $! (SKIPPING)\n";
        return;
    }
    local $/;
    my $data=<IN>;
    close(IN);

    my($type, $number);
    if($data =~ /abc([0-9]+)def/) {
        $number=$1;
        $type=1;
    }
    elsif($data =~ /lmn([0-9]+)opq/) {
        $number=$1;
        $type=2;
    }
    elsif($data =~ /rst([0-9]+)uvw/) {
        $number=$1;
        $type=3;
    }
    else {
        warn "File $File::Find::name is unknown type\n";
        return;
    }

    my $outfn="c:/output/$number-type$type.txt";
    if(-e $outfn) {
        warn "File $outfn already exists.\n";
        return;
    }
    unless(open(OUT,">$outfn")) {
        warn "Could not open $outfn: $!\n";
        return;
    }
    print OUT $data;
    close(OUT);
}

Tad McClellan · Nov 10, 2007

Luca Villa said:
unless(open(IN,"<$File::Find::name")) {
    warn "Could not open $File::Find::name: $! (SKIPPING)\n";
    return;
}
local $/;
my $data=<IN>;
close(IN);

If you are going to mess with the special variables anyway,
then you could replace all of that with:

local @ARGV = $_;
local $/;
my $data = <>;
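
For readers unfamiliar with the idiom, here is a minimal, self-contained sketch of what those three lines do. The wrapper name slurp and the scratch file are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file to read back.
my ( $fh, $path ) = tempfile( UNLINK => 1 );
print {$fh} "line one\nline two\n";
close $fh;

# slurp() is a hypothetical wrapper around the three-line idiom.
sub slurp {
    my ($file) = @_;
    local @ARGV = ($file);    # <> reads the files named in @ARGV
    local $/;                 # undefined record separator: read to end of file
    my $data = <>;            # the whole file as one string
    return $data;
}

my $data = slurp($path);
print length($data), " characters read\n";
```

Because both variables are localized, @ARGV and $/ get their previous values back when the sub returns. Note that <> only warns when it cannot open a name, so the explicit error message from the open() version is lost.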

Luca Villa · Nov 10, 2007

Tad McClellan said:
If you are going to mess with the special variables anyway,
then you could replace all of that with:

local @ARGV = $_;
local $/;
my $data = <>;

I received this error:
"Can't do inplace edit: . is not a regular file at c:\script.src line
12."

inplace edit? What does it want to do?

Tad McClellan · Nov 10, 2007

Luca Villa said:
I received this error:
"Can't do inplace edit: . is not a regular file at c:\script.src line
12."

The error message has nothing to do with the code you quoted above.

inplace edit? What does it want to do?

It wants to edit the file "inplace", that is, with the same name.

You have turned on inplace editing either with the -i command line
switch, or by setting the $^I variable somewhere...

Also, what it is trying to edit is not a file, it is a directory. You
may want to test what find() is operating on with the -d or -f filetest.
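
A minimal sketch of such a test, assuming a throwaway directory tree built with File::Temp (the file names are made up). find() calls the callback for directories too, including ".", so the callback returns early for anything that is not a plain file:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build a small scratch tree: one file at the top, one in a subdirectory.
my $root = tempdir( CLEANUP => 1 );
mkdir "$root/sub" or die "mkdir failed: $!";
for my $name ( "$root/a.txt", "$root/sub/b.txt" ) {
    open my $out, '>', $name or die "Could not open $name: $!";
    print {$out} "x";
    close $out;
}

my @files;
find(
    sub {
        # $_ is the name of the current entry. Directories are passed
        # to the callback as well, so skip anything not a plain file.
        return unless -f $_;
        push @files, $File::Find::name;
    },
    $root
);

print "$_\n" for sort @files;
```

With this guard in place, the code that reads the file is never reached for "." or for any subdirectory.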

Luca Villa · Nov 10, 2007

Hi Tad,

I'm not passing any argument apart from "source.src", which contains
the script.

I started getting the error when I used the replacement block you
suggested.

This is the exact content of source.src, which gives the error mentioned above:

use File::Find;

find(\&found, 'c:/tempebay/1');

sub found {
    local @ARGV = $_;
    local $/;
    my $data = <>;

    my($type, $number);
    if($data =~ /<td align="right" nowrap>\s+Item number:\s+(\d+)<\/td>/) {
        $number=$1;
        $type="item_description_html";
    }
    elsif($data =~ /Item number:\s*<img src="http:\/\/pics\.ebaystatic\.com\/aw\/pics\/s\.gif" width="\d+">(\d+)<\/div>/) {
        $number=$1;
        $type="buyers_history_html";
    }
    else {
        warn "File $File::Find::name is of not interesting type, for example an eBay page of item\n";
        return;
    }

    my $outfn="c:/tempebay/2/$number-$type.htm";
    if(-e $outfn) {
        warn "File $outfn already exists.\n";
        return;
    }
    unless(open(OUT,">$outfn")) {
        warn "Could not open $outfn: $!\n";
        return;
    }
    print OUT $data;
    close(OUT);
}

___

I launch: perl script.src
and despite that initial error message it actually works!

Can you see why it wants to do that inplace edit?
