How good is PERL at searching ASCII files?

  • Thread starter jmartzoo-google
  • Start date
J

jmartzoo-google

Hi,

Please forgive my ignorance, I'm a C language developer but I've got
some text manipulation that needs doing and I've heard many times that
this is PERL's forte. That being said, I'd like to give it a try
instead of writing a C# app to deal with this. Can PERL do th
following relatively easy, or should I just spend 3 or 4 hours writing
a C# algorithm?

The challenge I'm faced with is to search out all occurrences where a
function named CONVERT() is being called with three parameters. It's a
challenge because the function has several signatures and I need to
segregate the three param calls from the two param calls. To make
things even spicier, it's very possible that the parameters may also be
function calls themselves. I think the best strategy becomes finding
the pattern:
CONVERT(<something>,<something>,
The key elements being the string CONVERT, one open paren ( followed by
two commas at the same nesting level.


Of the following examples, I would want the first two in my results,
the second two ignored:

yes:
~~~~
1)
CONVERT(varchar(16), @variable, 1)
2)
convert (char(12),
(
SELECT date
FROM table1
WHERE id = @x
AND user IN ('raquel', 'marylin', 'jennifer')
), 7)

no:
~~~~
1)
convert(varchar(16), @variable)
2)
CONVERT (char(12),
(
SELECT date
FROM table2
WHERE id = @y
AND user IN ('huey', 'dewey', 'louie')
))

If anybody has a script that I can use to get started, I'd be very
grateful to see it. I've got 1,500 flat files (ascii) that need
parsing and a search on the string convert hits 10,000 times... my
guess is that only 5% of those will be the three parameter version I
need to isolate.


Much obliged,
John M
Montreal, Quebec
 
B

boyd

Hi,

Please forgive my ignorance, I'm a C language developer but I've got
some text manipulation that needs doing and I've heard many times that
this is PERL's forte. That being said, I'd like to give it a try
instead of writing a C# app to deal with this. Can PERL do th
following relatively easy, or should I just spend 3 or 4 hours writing
a C# algorithm?

The challenge I'm faced with is to search out all occurrences where a
function named CONVERT() is being called with three parameters. It's a
challenge because the function has several signatures and I need to
segregate the three param calls from the two param calls. To make
things even spicier, it's very possible that the parameters may also be
function calls themselves. I think the best strategy becomes finding
the pattern:
CONVERT(<something>,<something>,
The key elements being the string CONVERT, one open paren ( followed by
two commas at the same nesting level.

...
Much obliged,
John M
Montreal, Quebec

Perl is very good at that. However, I'm not. The Perl module,
Text::Balanced is tailored for this very task. I can help you learn to
use it (while I learn it myself).

Or you can use the regexp of Perl to do this quite easily, I suspect,
but I haven't done that with multiple lines before.

I guess my approach would be to find the CONVERT call, then keep reading
lines until the count of the occurences "(" and ")" balance out, using
the '.' operator to join the lines together into one string. Then use a
regexp to determine whether the number of parameters was 2 or 3. Or,
better, that determination could be done as you do the counting...

Oops - that latter thought gets more complicated if you have extra
parentheses as in:

CONVERT((1),1,(((3)))). Oh well - I would then, at that point, realize
that to make it bullet proof would take more time and testing than just
learning to use Text::Balanced.

Boyd
 
D

Dave

boyd said:
Perl is very good at that. However, I'm not. The Perl module,
Text::Balanced is tailored for this very task. I can help you learn to
use it (while I learn it myself).

Or you can use the regexp of Perl to do this quite easily, I suspect,
but I haven't done that with multiple lines before.

I guess my approach would be to find the CONVERT call, then keep reading
lines until the count of the occurences "(" and ")" balance out, using
the '.' operator to join the lines together into one string. Then use a
regexp to determine whether the number of parameters was 2 or 3. Or,
better, that determination could be done as you do the counting...

Oops - that latter thought gets more complicated if you have extra
parentheses as in:

CONVERT((1),1,(((3)))). Oh well - I would then, at that point, realize
that to make it bullet proof would take more time and testing than just
learning to use Text::Balanced.

Boyd

I agree that text balenced is the right approach. Depending on the kinds of
arguments CONVERT can take this can be very tricky. For example simply
counting parenthesis would not work if the following was an acceptable call:
CONVERT( 1, "(", 4 )
It is much easier if no strings can be passed.
 
B

Ben Morrow

Quoth (e-mail address removed):
The challenge I'm faced with is to search out all occurrences where a
function named CONVERT() is being called with three parameters. It's a
challenge because the function has several signatures and I need to
segregate the three param calls from the two param calls. To make
things even spicier, it's very possible that the parameters may also be
function calls themselves. I think the best strategy becomes finding
the pattern:
CONVERT(<something>,<something>,
The key elements being the string CONVERT, one open paren ( followed by
two commas at the same nesting level.

This should get you started. Note that it doesn't handle quotes: that
is, parens in quotes will *not* be ignored as they should. You could
perhaps add a first pass to strip out all quoted strings, again using
Regexp::Common.

#!/usr/bin/perl -l

use warnings;
use strict;

# get some test data
my $data = <<'DATA';

Convert( foo, bar, baz )

Convert( (foo, bar), baz)

Convert(foo, bar)

convert (foo, (bar (baz, quux)))

convert (foo, (bar, (baz, quux)), flarp)
DATA

use Regexp::Common;

# $p is regex that matches a balanced parenthesized expression
my $p = $RE{balanced}{-parens => '()'};

# each time round the loop, remove one call to convert and place it in
# $1 (that's what the parens are for)
while ($data =~ s/( convert \s* $p )//xi) {

# put the call into $call
my $call = $1;

# get just the arguments, without the outer parens
(my $args = $call) =~ s/convert \s* \( (.*) \)/$1/xsi;

# strip any balanced sets of parens
$args =~ s/$p//g;

# count the number of commas
my $commas = $args =~ tr/,//;

if ($commas == 2) {
print $call;
}
}

Ben
 
B

Bigus

Of the following examples, I would want the first two in my results,
the second two ignored:

yes:
~~~~
1)
CONVERT(varchar(16), @variable, 1)
2)
convert (char(12),
(
SELECT date
FROM table1
WHERE id = @x
AND user IN ('raquel', 'marylin', 'jennifer')
), 7)

no:
~~~~
1)
convert(varchar(16), @variable)
2)
CONVERT (char(12),
(
SELECT date
FROM table2
WHERE id = @y
AND user IN ('huey', 'dewey', 'louie')
))

The following code is a tad sloppy but I'm no Perl expert. Copy it into a
text file and name it something like test.pl. I'll explain it below:

=========CODE===========
use strict;
my $file;

strng();

my @matches = $file =~ /(convert\s*\([^,\n]+,[^,\n]+,[^\)]+\))/gi;
my @matches2 = $file =~
/(convert\s*\([^,\n]+,\s*\n.+?\n.+?\n.+?\n.+?\n.+?\n.*?\)\s*\,[^\)]+\))/gi;

print "@matches\n\n";
print "@matches2";


sub strng {

$file = " yes:
~~~~
1)
CONVERT(varchar(16), \@variable, 1)
2)
convert (char(12),
(
SELECT date
FROM table1
WHERE id = \@x
AND user IN ('raquel', 'marylin', 'jennifer')
), 7)

no:
~~~~
1)
convert(varchar(16), \@variable)
2)
CONVERT (char(12),
(
SELECT date
FROM table2
WHERE id = \@y
AND user IN ('huey', 'dewey', 'louie')
))";

}
========/CODE=======

Basically, your file contents will be in the scalar variable $file (in this
case I've put your examples into $file via the sub strng).

There's 2 regular expressions - the first one matches single line 3 param
convert functions as you illustrate, and places each match into the array
@matches. The second regular expression will match the multi-line convert
functions you illustrate, although the format will have to be quite rigidly
as per your illustration, so if any of your multi-line converts stretch over
more than 7 lines you could get weird results.

You could then merge the 2 arrays, instead of printing then out
indivudually. Anyway, this is just a starting point to get you off the
ground.

Regards
Bigus
 
C

Craig

The Inline::C::parseRegExp and Inline::C::parseRecDescent modules
already parse C very effectively. These already do all the heavy
lifting. Using these modules as a model, you should be able to develop
a solution to your problem very quickly.
 
B

Ben Morrow

Quoth "Craig said:
The Inline::C::parseRegExp and Inline::C::parseRecDescent modules
already parse C very effectively. These already do all the heavy
lifting. Using these modules as a model, you should be able to develop
a solution to your problem very quickly.

....except the OP wasn't parsing C. It looked to me like a dialect of
SQL.

Ben
 
J

jmartzoo-google

Hi Ben,

This looks very promising. I'm confident that the code base I'll be
searching doesn't have any quoted parens in it, so that is no concern.

I tried running the PL file and it came up with this response:

==================================================================
Can't find string terminator "DATA" anywhere before EOF at
C:\test\findConverts.pl line 9.
==================================================================

Any idea what's missing in the definition of the DATA?

Also, how can I make it accept "DATA" from a flat file as a command
line argument? I'd like to run it like this: findConverts.PL
storedProc1.SQL

Alternately, I would *love* if I could have it test every file in a
directory... but I can wrap the PL calls with a BAT and achive the same
result, so this level of sophistication can wait.

Many thanks for the help! It feels like the solution is right around
the corner and it's getting me jazzed up about Perl.

John
 
B

Ben Morrow

[quoting fixed. please don't top-post.]

Quoth (e-mail address removed):
This looks very promising. I'm confident that the code base I'll be
searching doesn't have any quoted parens in it, so that is no concern.

I tried running the PL file and it came up with this response:

==================================================================
Can't find string terminator "DATA" anywhere before EOF at
C:\test\findConverts.pl line 9.
==================================================================

Any idea what's missing in the definition of the DATA?

At a guess: does the line containing only 'DATA' have any extra
whitespace? Perl's heredoc quote operator quotes everything from <<FOO
to a line containing *only* FOO. See 'Regexp Quote-Like Operators' in
perldoc perlop, the item labelled '<<EOF'.
Also, how can I make it accept "DATA" from a flat file as a command
line argument? I'd like to run it like this: findConverts.PL
storedProc1.SQL

To get the command-line arguments use the array @ARGV, documented in
perldoc perlvar.

To get the contents of a file, the easiest way is to use the File::Slurp
module from CPAN; alternatively, see 'open' in perldoc perlfunc, 'I/O
Operators' in perldoc perlop, and '$/' in perldoc perlvar.
Alternately, I would *love* if I could have it test every file in a
directory... but I can wrap the PL calls with a BAT and achive the same
result, so this level of sophistication can wait.

See 'opendir' and 'readdir' in perldoc perlfunc, or use the standard
module File::Find (which will search all files below a given directory,
including subdirs).
Many thanks for the help! It feels like the solution is right around
the corner and it's getting me jazzed up about Perl.

Good! Feel free to ask for more help if/when you get stuck. It would
probably be wise to get hold of a basic Perl book before you go too much
further: 'Learning Perl' by Randal Schwartz, Tom Phoenix and brian d
foy, published by O'Reilly, is the standard beginners' book.

Ben
 
J

James

Hi,

Please forgive my ignorance, I'm a C language developer but I've got
some text manipulation that needs doing and I've heard many times that
this is PERL's forte. That being said, I'd like to give it a try
instead of writing a C# app to deal with this. Can PERL do th
following relatively easy, or should I just spend 3 or 4 hours writing
a C# algorithm?

The challenge I'm faced with is to search out all occurrences where a
function named CONVERT() is being called with three parameters. It's a
challenge because the function has several signatures and I need to
segregate the three param calls from the two param calls. To make
things even spicier, it's very possible that the parameters may also be
function calls themselves. I think the best strategy becomes finding
the pattern:
CONVERT(<something>,<something>,
The key elements being the string CONVERT, one open paren ( followed by
two commas at the same nesting level.


Of the following examples, I would want the first two in my results,
the second two ignored:

yes:
~~~~
1)
CONVERT(varchar(16), @variable, 1)
2)
convert (char(12),
(
SELECT date
FROM table1
WHERE id = @x
AND user IN ('raquel', 'marylin', 'jennifer')
), 7)

no:
~~~~
1)
convert(varchar(16), @variable)
2)
CONVERT (char(12),
(
SELECT date
FROM table2
WHERE id = @y
AND user IN ('huey', 'dewey', 'louie')
))

If anybody has a script that I can use to get started, I'd be very
grateful to see it. I've got 1,500 flat files (ascii) that need
parsing and a search on the string convert hits 10,000 times... my
guess is that only 5% of those will be the three parameter version I
need to isolate.


Much obliged,
John M
Montreal, Quebec


#!/bin/env perl
$/="";
while (<>) {
$txt = $_;
$par = $com = 0;
next unless /^\s*convert\s*\(/i;
for (split //) {
$par++ if /\(/;
$par-- if /\)/;
$com++ if $par==1 && /\,/;
}
print $txt if $com==2;
}

JL
 
J

John W. Krahn

The challenge I'm faced with is to search out all occurrences where a
function named CONVERT() is being called with three parameters. It's a
challenge because the function has several signatures and I need to
segregate the three param calls from the two param calls. To make
things even spicier, it's very possible that the parameters may also be
function calls themselves. I think the best strategy becomes finding
the pattern:
CONVERT(<something>,<something>,
The key elements being the string CONVERT, one open paren ( followed by
two commas at the same nesting level.


Of the following examples, I would want the first two in my results,
the second two ignored:

yes:
~~~~
1)
CONVERT(varchar(16), @variable, 1)
2)
convert (char(12),
(
SELECT date
FROM table1
WHERE id = @x
AND user IN ('raquel', 'marylin', 'jennifer')
), 7)

no:
~~~~
1)
convert(varchar(16), @variable)
2)
CONVERT (char(12),
(
SELECT date
FROM table2
WHERE id = @y
AND user IN ('huey', 'dewey', 'louie')
))

If anybody has a script that I can use to get started

Sure, try this:

#!/usr/bin/perl
use warnings;
use strict;

while ( <> ) {
if ( /convert/i && tr/(// != tr/)// ) {
$_ .= <>;
redo;
}
if ( /convert/i ) {
my $temp = $_;
1 while $temp =~ s/ \( [^()]+ \) (?=.*\)) //sx;
print if $temp =~ tr/,// == 2;
next;
}
print;
}

__END__




John
 
J

jmartzoo-google

Make sure the second DATA string starts in column one and does not have
any whitespace after it.

Thanks Jim, this fixed the DATA problem and I'm moving on with this as
we speak. I'm about to web-search for a good opendir()/readdir()
example... then an @ARGV and I should have this script ready. Much
appreciated... I don't think I could have reached this point without
all of your help.

John
 
U

Uri Guttman

j> Thanks Jim, this fixed the DATA problem and I'm moving on with this as
j> we speak. I'm about to web-search for a good opendir()/readdir()
j> example... then an @ARGV and I should have this script ready. Much
j> appreciated... I don't think I could have reached this point without
j> all of your help.

File::Slurp has a read_dir sub that makes it easy to read a whole
directory. it also filters out . and ..

and i have plans to add grep filtering with either qr or code refs. so
you could do something like this in the future:

my @files = read_dir( $dir, { filter => sub{ -f } } ) ;

another idea is to support prefixing the dir to the returned entries:

my @files = read_dir( $dir, { prefix_dir => 1 } ) ;

anyone who wants to help with those features, feel free to email me. i
will do code review, editing, mentoring of your work in exchange for the
initial coding and such. i have the designs all worked out so you just
need a few hours to hack in these enhancements.

uri
 
J

jmartzoo-google

Thanks all for your your help.

I was able to go far enough with this to realize that it'll take a fair
amount of effort to understand the Perl's syntax and commit the special
symbols to memory. Time that I don't have at present. In the end, I
turned back to my established skill set and solved the problem using
tools that I'm familiar with.

In the end, it was the multiline searches that forced me to abandon
Perl. I just couldn't tweak the examples to do everything I needed.

Does Perl share syntax with any other languages? What about the
keywords...
<>
$_
do they appear in any other languages as well?

Regards and best wishes,
John
 
J

John W. Krahn

Thanks all for your your help.

I was able to go far enough with this to realize that it'll take a fair
amount of effort to understand the Perl's syntax and commit the special
symbols to memory. Time that I don't have at present. In the end, I
turned back to my established skill set and solved the problem using
tools that I'm familiar with.

In the end, it was the multiline searches that forced me to abandon
Perl. I just couldn't tweak the examples to do everything I needed.

Does Perl share syntax with any other languages? What about the
keywords...

Keywords usually have letters in them somewhere. :)

That is an operator.

perldoc perlop

That is a variable.

perldoc perlvar
do they appear in any other languages as well?

I believe that Ruby does?




John
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,997
Messages
2,570,241
Members
46,831
Latest member
RusselWill

Latest Threads

Top