File handling and regex

Luca Villa

Hi all!

I need help with Perl under Windows command-line to solve the
following task:

I have many disordered txt files and subdirectories under the root
directory "c:\dir", like this:
c:\dir\foobar.txt
c:\dir\popo.txt
c:\dir\sub1\agsds.txt
c:\dir\sub1\popo.txt
c:\dir\sub2\hghghg.txt
c:\dir\sub2\subbb\abc.txt

These txt files are of three types:
type1: those that contain a string matching the regular expression
"abc[0-9]+def"
type2: those that contain a string matching the regular expression
"lmn[0-9]+opq"
type3: those that contain a string matching the regular expression
"rst[0-9]+uvw"

I would like to copy all these txt files, with a Perl script run from
the Windows command line, into a single directory "c:\output". Each
filename should be composed of the number found in the regex match
(the "[0-9]+" part of the regex) and a "-type1.txt", "-type2.txt" or
"-type3.txt" suffix, depending on which of the three regexes above is
found in the file, giving a result like this:
c:\output\15-type2.txt
c:\output\102-type1.txt
c:\output\33-type1.txt
c:\output\49-type3.txt
c:\output\4-type1.txt
c:\output\335-type2.txt
c:\output\32-type3.txt

How can I do it?
 
John W. Krahn

*UNTESTED* YMMV :)

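#!/usr/bin/perl
use warnings;
use strict;
use File::Find;
use File::Copy;

my $from = 'c:/dir';
my $to   = 'c:/output';

# Map each pattern (capturing the number) to its type suffix.
my %trans = qw(
    abc(\d+)def type1
    lmn(\d+)opq type2
    rst(\d+)uvw type3
);

find sub {
    # Skip anything that cannot be opened or is not a plain file.
    return unless open my $fh, '<', $_;
    return unless -f $fh;
    # Slurp the whole file; "_" reuses the stat result of the -f test.
    read $fh, my $data, -s _;
    close $fh;
    for my $pat ( keys %trans ) {
        next unless $data =~ $pat;
        # $1 holds the number captured by the matching pattern.
        copy $File::Find::name, "$to/$1-$trans{$pat}.txt";
        last;
    }
}, $from;

__END__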



John
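
The script above walks "keys %trans", and Perl does not promise any particular order for hash keys. That is harmless while each file matches only one pattern, but if a file could match two of them the result would depend on that order. A minimal sketch of the same lookup with a fixed order, run on in-memory strings rather than files (the names @rules and output_name_for are made up for illustration, they are not from John's script):

```perl
use strict;
use warnings;

# Ordered list of [pattern, label] pairs; unlike hash keys, the order is fixed.
my @rules = (
    [ qr/abc([0-9]+)def/, 'type1' ],
    [ qr/lmn([0-9]+)opq/, 'type2' ],
    [ qr/rst([0-9]+)uvw/, 'type3' ],
);

# Returns the output file name for the given text, or undef if no rule matches.
# (output_name_for is a hypothetical helper name.)
sub output_name_for {
    my ($data) = @_;
    for my $rule (@rules) {
        my ($pattern, $label) = @$rule;
        # $1 is the number captured by the first rule that matches.
        return "$1-$label.txt" if $data =~ $pattern;
    }
    return;
}

print output_name_for("header lmn15opq footer"), "\n";   # prints "15-type2.txt"
```

The first matching rule wins, so the priority between the three types is simply the order in which they are listed.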
 
jordilin

One question:
when you write
read $fh, my $data, -s _;
should it not be
read $fh, my $data, -s $_;

I have searched the web without success. I don't know whether _
equals $_ in this particular case.
Best regards,
jordi
 
Josef Moellers

jordilin said:
One question:
when you write
read $fh, my $data, -s _;
should it not be
read $fh, my $data, -s $_;

I have searched the web without success. I don't know whether _
equals $_ in this particular case.

No, it doesn't, at least not "literally" or conceptually.
"_" is the special filehandle that refers to the stat structure saved by
the most recent file test or stat operation:

"If any of the file tests (or either the "stat" or "lstat" operators)
are given the special filehandle consisting of a solitary underline,
then the stat structure of the previous file test (or stat operator) is
used, saving a system call."
(perldoc -f -s)
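
To make the quoted passage concrete, here is a small sketch of the same "-f" followed by "-s _" pairing that John's script uses. The helper name file_size_if_regular and the scratch file are assumptions for illustration only:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small scratch file so the example is self-contained.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "abc123def\n";
close $tmp_fh;

# Returns the size of a regular file, or undef for anything else.
# (file_size_if_regular is a hypothetical helper name.)
sub file_size_if_regular {
    my ($path) = @_;
    return unless -f $path;   # this file test performs the stat call
    return -s _;              # "_" reuses that stat result, no second call
}

my $size = file_size_if_regular($tmp_name);
print "size via -s _: $size\n";
```

Writing "-s $_" instead would also work inside a File::Find callback, but it would stat the file a second time; "-s _" only looks at what the preceding "-f" already fetched.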
 
Luca Villa

Thanks to all, and to John in particular.

John's solution may well have worked, but I had difficulty adapting it
to my needs, so I ended up using this alternative solution:


use File::Find;

find(\&found, 'c:/dir');

sub found {
    unless(open(IN,"<$File::Find::name")) {
        warn "Could not open $File::Find::name: $! (SKIPPING)\n";
        return;
    }
    local $/;
    my $data=<IN>;
    close(IN);

    my($type, $number);
    if($data =~ /abc([0-9]+)def/) {
        $number=$1;
        $type=1;
    }
    elsif($data =~ /lmn([0-9]+)opq/) {
        $number=$1;
        $type=2;
    }
    elsif($data =~ /rst([0-9]+)uvw/) {
        $number=$1;
        $type=3;
    }
    else {
        warn "File $File::Find::name is unknown type\n";
        return;
    }

    my $outfn="c:/output/$number-type$type.txt";
    if(-e $outfn) {
        warn "File $outfn already exists.\n";
        return;
    }
    unless(open(OUT,">$outfn")) {
        warn "Could not open $outfn: $!\n";
        return;
    }
    print OUT $data;
    close(OUT);
}
 
Tad McClellan

    unless(open(IN,"<$File::Find::name")) {
        warn "Could not open $File::Find::name: $! (SKIPPING)\n";
        return;
    }
    local $/;
    my $data=<IN>;
    close(IN);


If you are going to mess with the special variables anyway,
then you could replace all of that with:

local @ARGV = $_;
local $/;
my $data = <>;
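
For anyone who wants to try those three lines in isolation, here is a minimal sketch that wraps them in a sub and runs them against a scratch file. The name slurp_via_argv and the sample text are assumptions for illustration:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Slurp a whole file using the @ARGV / <> idiom.
# (slurp_via_argv is a hypothetical helper name.)
sub slurp_via_argv {
    my ($path) = @_;
    local @ARGV = ($path);  # <> reads the files named in @ARGV
    local $/;               # undefined record separator: read everything at once
    my $data = <>;
    return $data;
}

# Scratch file standing in for one of the txt files.
my ($fh, $name) = tempfile(UNLINK => 1);
print {$fh} "line one\nlmn42opq\n";
close $fh;

my $text = slurp_via_argv($name);
print "matched number $1\n" if $text =~ /lmn([0-9]+)opq/;
```

Both "local" statements are undone when the sub returns, so the caller's @ARGV and $/ are left as they were.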
 
Luca Villa

If you are going to mess with the special variables anyway,
then you could replace all of that with:

local @ARGV = $_;
local $/;
my $data = <>;

I received this error:
"Can't do inplace edit: . is not a regular file at c:\script.src line
12."

inplace edit? What does it want to do?
 
Tad McClellan

Luca Villa said:
I received this error:
"Can't do inplace edit: . is not a regular file at c:\script.src line
12."


The error message has nothing to do with the code you quoted above.

inplace edit? What does it want to do?


It wants to edit the file "inplace", that is, with the same name.

You have turned on inplace editing either with the -i command line
switch, or by setting the $^I variable somewhere...

Also, what it is trying to edit is not a file, it is a directory. You
may want to test what find() is operating on with the -d or -f filetest.
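
A rough sketch of that last suggestion: find() calls the callback for directories as well as files (the starting directory shows up as "."), so a "-f" test at the top of the callback keeps them away from the code that reads the file. The scratch tree and the @seen list below are assumptions for illustration; this does not settle why the message mentions inplace editing, it only shows the guard.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build a tiny scratch tree: one subdirectory and two files.
my $root = tempdir(CLEANUP => 1);
mkdir "$root/sub1" or die "mkdir failed: $!";
for my $path ("$root/a.txt", "$root/sub1/b.txt") {
    open my $out, '>', $path or die "Could not create $path: $!";
    print {$out} "abc7def\n";
    close $out;
}

my @seen;   # regular files the callback actually processed

find(sub {
    # find() also visits directories (including the starting one, as "."),
    # so skip anything that is not a plain file before trying to read it.
    return unless -f $_;
    push @seen, $File::Find::name;
}, $root);

print "$_\n" for sort @seen;
```

In the found() sub from this thread, the equivalent would be a "return unless -f $_;" as its first statement.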
 
Luca Villa

Hi Tad,

I'm not using any argument apart from the "script.src" file that
contains the script.

I started getting the error when I switched to the replacement block
you suggested.

This is the exact content of script.src, which gives the error
mentioned above:

use File::Find;

find(\&found, 'c:/tempebay/1');

sub found {
    local @ARGV = $_;
    local $/;
    my $data = <>;

    my($type, $number);
    if($data =~ /<td align="right" nowrap>\s+Item number:\s+(\d+)<\/td>/) {
        $number=$1;
        $type="item_description_html";
    }
    elsif($data =~ /Item number:\s*<img src="http:\/\/pics\.ebaystatic\.com\/aw\/pics\/s\.gif" width="\d+">(\d+)<\/div>/) {
        $number=$1;
        $type="buyers_history_html";
    }
    else {
        warn "File $File::Find::name is of not interesting type, for example an eBay page of item\n";
        return;
    }

    my $outfn="c:/tempebay/2/$number-$type.htm";
    if(-e $outfn) {
        warn "File $outfn already exists.\n";
        return;
    }
    unless(open(OUT,">$outfn")) {
        warn "Could not open $outfn: $!\n";
        return;
    }
    print OUT $data;
    close(OUT);
}


___

I launch it with: perl script.src
and despite that initial error message it actually works!

Do you understand why it wants to do that inplace edit?
 
