Regex help

king

I have the text shown below and
want to store it in arrays using regular expressions.

The script is as below.
"
open(SYSCMD, "dccmd -listmodes |");

while(<SYSCMD>)
{
if(/^(\d+)[\s+](\d+)/)
{
chomp;
push(@X_Cord, $1);
}
if(/[\s+](\d+)[\s+](.*)$/)

{
chomp;
push(@Y_Cord, $1);
}
if(/(\d+)(\s+)(\d+)$/)

{
chomp;
push(@depth, $1);

}
if(/(\s+)(\d+)$/)
{
chomp;
push(@refresh, $1);

}

}
close(SYSCMD);

foreach(@X_Cord)
{
print "$_\n";
}
foreach(@Y_Cord)
{
print "$_\n";
}
foreach(@depth)
{
print "$_\n";
}
foreach(@refresh)
{
print "$_\n";
}
"

The output of dccmd command is
"Resolution Changer 3.12 from 12noon (12noon.com)

Width Height Depth Rate

320 200 8 60

320 200 16 60

320 200 32 60

512 384 8 60

512 384 16 60

512 384 32 60

640 400 8 60

640 400 16 60

640 400 32 60

640 480 8 60

640 480 16 60

640 480 32 60

800 600 8 60

800 600 16 60

800 600 32 60

1024 768 8 60

1024 768 16 60

1024 768 32 60

1280 800 8 60

1280 800 16 60

1280 800 32 60

"
From the script I am getting only X_cord and Y_cord. But I am not
getting the depth and refresh.
Can anybody help me with the refresh rate.
 
Marc Girod

while(<SYSCMD>)
        {
                if(/^(\d+)[\s+](\d+)/)
                        {
                                chomp;
                                push(@X_Cord, $1);

You match to the end of height, but you use only width...
Be less verbose--less is more:

use strict;
my (@X_Cord, @Y_Cord, @depth, @refresh);
my $v;
while(<SYSCMD>) {
my @f = /\d+/g;
push @X_Cord, $v if $v = shift @f;
push @Y_Cord, $v if $v = shift @f;
push @depth, $v if $v = shift @f;
push @refresh, $v if $v = shift @f;
}

Note that if you have holes in your table, you'll get arrays of
different sizes, and you won't know which value belongs to which index.

Marc
 
J. Gleixner

king said:
I have the below text:
Want to store it in array using regular expression.

The script is as below.
"
open(SYSCMD, "dccmd -listmodes |");

while(<SYSCMD>)
{
if(/^(\d+)[\s+](\d+)/)
{
chomp;
push(@X_Cord, $1);
}
if(/[\s+](\d+)[\s+](.*)$/)

{
chomp;
push(@Y_Cord, $1);
}
if(/(\d+)(\s+)(\d+)$/)

{
chomp;
push(@depth, $1);

}
if(/(\s+)(\d+)$/)
{
chomp;
push(@refresh, $1);

}

}
close(SYSCMD);

foreach(@X_Cord)
{
print "$_\n";
}
foreach(@Y_Cord)
{
print "$_\n";
}
foreach(@depth)
{
print "$_\n";
}
foreach(@refresh)
{
print "$_\n";
}
"

The output of dccmd command is
"Resolution Changer 3.12 from 12noon (12noon.com)

Width Height Depth Rate

320 200 8 60 [...]

1280 800 16 60

1280 800 32 60

"
From the script I am getting only X_cord and Y_cord. But I am not
getting the depth and refresh.
Can anybody help me with the refresh rate.

Excessive indentation there..


First, when asking for help, post your code as something we can
run. As it is, everyone would have to cut and paste your data,
modify your open, and likely your while, etc.. If you provide it
as something everyone can simply cut & paste, you'll make it easier
for everyone. e.g. __DATA__ or something similar.

Second, to help you/us, it might be good to label what is
being printed to STDOUT.

e.g.

print "X Cord:\n";
foreach(@X_Cord)
{
print "$_\n";
}

etc.

You'll see that you are getting a value for everything, but for
refresh, the value is whitespace, at least I do when I go through
the hassle of getting your script to run.

"Why is it whitespace?", you ask.

Because you're pushing $1, which is the first capture group, (\s+), here.

if(/(\s+)(\d+)$/)
{
chomp;
push(@refresh, $1);

}

Change it to $2, or better yet don't capture the whitespace:
if( /\s+(\d+)$/ ) { ... }
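
To see the difference between the two patterns, here is a small sketch run against one row of the sample output. The variable names and the single-space separator are assumptions made for illustration; they are not from the original script.

```perl
use strict;
use warnings;

# One data row from the question (separator assumed to be whitespace).
my $line = "320 200 8 60\n";

# Original pattern: group 1 is the whitespace, group 2 is the number.
my ($first, $second) = $line =~ /(\s+)(\d+)$/;

# Without the extra group, the digits become group 1.
my ($refresh) = $line =~ /\s+(\d+)$/;

print "first=[$first] second=[$second] refresh=[$refresh]\n";
```

With the original pattern the value pushed from $1 is the separator, which is why the refresh column looked empty when printed.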

The other regular expressions could use some work too..

if(/^(\d+)[\s+](\d+)/) {
chomp;
push(@X_Cord, $1);
}

No need to capture the second chunk of digits, since you're not
doing anything with the data, like storing it into an array.

[\s+]

No need for character class [] there..
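
A quick way to convince yourself that [\s+] and \s+ are different things is to try both on a few strings. This is only a sketch; the test strings and variable names are made up for the demonstration.

```perl
use strict;
use warnings;

my $class = qr/^(\d+)[\s+](\d+)/;   # exactly one char: whitespace or a literal '+'
my $quant = qr/^(\d+)\s+(\d+)/;     # one or more whitespace chars

my %matched;
for my $text ("320 200", "320  200", "320+200") {
    # Record whether each pattern matches this string (1 or 0).
    $matched{$text} = [ ($text =~ $class ? 1 : 0), ($text =~ $quant ? 1 : 0) ];
    print "'$text': class=$matched{$text}[0] quantifier=$matched{$text}[1]\n";
}
```

The character class accepts "320+200" and rejects two spaces between the numbers, which is probably the opposite of what was intended.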

Possibly, using split, instead of these regular expressions, might
be a better option.

For more help with regular expressions, read through:

perldoc perlre

For split, see:

perldoc -f split
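
A minimal sketch of the split approach, assuming the columns are separated by whitespace and that a data row is exactly four all-digit fields. The lines are held in an array here (copied from the posted output) so it runs without dccmd; in the real script they would come from the pipe.

```perl
use strict;
use warnings;

# Sample lines copied from the dccmd output in the question.
my @lines = (
    "Resolution Changer 3.12 from 12noon (12noon.com)\n",
    "\n",
    "Width Height Depth Rate\n",
    "\n",
    "320 200 8 60\n",
    "1280 800 32 60\n",
);

my (@X_Cord, @Y_Cord, @depth, @refresh);
for my $line (@lines) {
    # split ' ' splits on runs of whitespace and drops the trailing newline.
    my @f = split ' ', $line;

    # Keep only rows made of exactly four fields with no non-digit characters.
    next unless @f == 4 && !grep { /\D/ } @f;

    push @X_Cord,  $f[0];
    push @Y_Cord,  $f[1];
    push @depth,   $f[2];
    push @refresh, $f[3];
}

print "X Cord:  @X_Cord\n";
print "Y Cord:  @Y_Cord\n";
print "Depth:   @depth\n";
print "Refresh: @refresh\n";
```

Because every row is pushed to all four arrays or to none, the arrays stay the same length and the indexes line up.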
 
msouth

while(<SYSCMD>)
        {
                if(/^(\d+)[\s+](\d+)/)
                        {
                                chomp;
                                push(@X_Cord, $1);

You match to the end of height, but you use only width...
Be less verbose--less is more:

use strict;
my (@X_Cord, @Y_Cord, @depth, @refresh);
my $v;
while(<SYSCMD>) {
  my @f = /\d+/g;
  push @X_Cord,  $v if $v = shift @f;
  push @Y_Cord,  $v if $v = shift @f;
  push @depth,   $v if $v = shift @f;
  push @refresh, $v if $v = shift @f;

As a general rule, anything that puts out numbers is likely to put out
a zero at some point. Maybe that is included in what you meant by
holes in the table? I would personally check length($v) if I were
using this approach, since it returns a true value for a "0" in the
data. The extra code might then suggest a closure so you don't have
to type it four times. All of which recommends the split-on-whitespace
approach from the other suggestions, which would also handle a 512.0
should that get spit out.

But my main point is that using "if" to mean "data present" is a
common source of bugs, and thus a good habit to get out of.
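
Here is a small sketch of the failure mode, using a made-up row whose first field is 0. It uses defined() rather than the length() check mentioned above; that substitution, and the variable names, are assumptions for illustration only.

```perl
use strict;
use warnings;

my (@truthy, @present);

for my $line ("0 200 8 60", "320 200 8 60") {
    my @f = $line =~ /\d+/g;

    # Truth test: "0" is false, so the first row's value is silently dropped.
    my @copy = @f;
    my $v;
    push @truthy, $v if $v = shift @copy;

    # Presence test: defined() accepts "0" and rejects only a missing field.
    my $w = shift @f;
    push @present, $w if defined $w;
}

print "truthy:  @truthy\n";
print "present: @present\n";
```

The first array ends up one element short, which is exactly the misaligned-index problem described earlier in the thread.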

mike
 
John W. Krahn

king said:
I have the below text:
Want to store it in array using regular expression.

The script is as below.
"
open(SYSCMD, "dccmd -listmodes |");

Better written as:

open SYSCMD, '-|', 'dccmd', '-listmodes'
or die "Cannot pipe from command 'dccmd' because: $!";

while(<SYSCMD>)
{
if(/^(\d+)[\s+](\d+)/)

Your character class [\s+] says to match ONE character that is either a
whitespace character or the character '+'. And why capture $2 if you
are not going to use it?

{
chomp;

Why chomp if you are only using $1?

push(@X_Cord, $1);
}
if(/[\s+](\d+)[\s+](.*)$/)
{
chomp;
push(@Y_Cord, $1);
}
if(/(\d+)(\s+)(\d+)$/)
{
chomp;
push(@depth, $1);
}
if(/(\s+)(\d+)$/)
{
chomp;
push(@refresh, $1);
}

}
close(SYSCMD);

Better written as:

close SYSCMD or warn $! ? "Error closing pipe from 'dccmd' because: $!"
: "Exit status $? from 'dccmd'\n";

foreach(@X_Cord)
{
print "$_\n";
}
foreach(@Y_Cord)
{
print "$_\n";
}
foreach(@depth)
{
print "$_\n";
}
foreach(@refresh)
{
print "$_\n";
}

Or simply:

for ( @X_Cord, @Y_Cord, @depth, @refresh ) {
print "$_\n";
}

"

The output of dccmd command is
"Resolution Changer 3.12 from 12noon (12noon.com)

Width Height Depth Rate

320 200 8 60

320 200 16 60

320 200 32 60

512 384 8 60

512 384 16 60

512 384 32 60

640 400 8 60

640 400 16 60

640 400 32 60

640 480 8 60

640 480 16 60

640 480 32 60

800 600 8 60

800 600 16 60

800 600 32 60

1024 768 8 60

1024 768 16 60

1024 768 32 60

1280 800 8 60

1280 800 16 60

1280 800 32 60

"
From the script I am getting only X_cord and Y_cord. But I am not
getting the depth and refresh.
Can anybody help me with the refresh rate.

This may work better (UNTESTED):

#!/usr/bin/perl
use warnings;
use strict;

open SYSCMD, '-|', 'dccmd', '-listmodes'
or die "Cannot pipe from command 'dccmd' because: $!";

my ( %data, @headers );
while ( <SYSCMD> ) {
next unless /\S/;
@headers = split unless /\d/;
my @fields = split if /\d/;

for my $index ( 0 .. $#headers ) {
push @{ $data{ $headers[ $index ] } }, $fields[ $index ];
}
}

close SYSCMD or warn $! ? "Error closing pipe from 'dccmd' because: $!"
: "Exit status $? from 'dccmd'\n";

for my $header ( @headers ) {
print "$header:\n", map "$_\n", @{ $data{ $header } };
}

__END__



John
 
sln

This may work better (UNTESTED):

#!/usr/bin/perl
use warnings;
use strict;

open SYSCMD, '-|', 'dccmd', '-listmodes'
or die "Cannot pipe from command 'dccmd' because: $!";

my ( %data, @headers );
while ( <SYSCMD> ) {
next unless /\S/;
# ------------------------------------------- replace these two lines:
# @headers = split unless /\d/;
# my @fields = split if /\d/;
# ------------------------------------------- with:
unless ( /\d/ ) {
@headers = split();
next;
}
my @fields = split();
# -------------------------------------------
for my $index ( 0 .. $#headers ) {
push @{ $data{ $headers[ $index ] } }, $fields[ $index ];
}
}

close SYSCMD or warn $! ? "Error closing pipe from 'dccmd' because: $!"
: "Exit status $? from 'dccmd'\n";

for my $header ( @headers ) {
print "$header:\n", map "$_\n", @{ $data{ $header } };
}

__END__

-sln

Also, it looks like the predominant (natural) delimiter is a
tab on both the field header and field data lines.

To exclude the other lines, instead of:
next unless /\S/;
it might be better to do this:
next unless /\t/;

Then leave the other split()s intact.
Otherwise /\d/ is too broad a brush, but if the output is
constant, it doesn't matter.

As it is, "Resolution Changer 3.12 from 12noon (12noon.com)"
gets down to 'my @fields = split if /\d/;' but since the
header line comes after it, @headers is still empty and it
never gets to the push statement.
That is a weak defense, but as said above, if the output never
changes, it doesn't really matter.

-sln
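
Putting the pieces together, a sketch of the header-keyed version with the tab filter might look like the following. It assumes the fields really are tab-separated (the tabs in the sample lines below are that assumption, not something visible in the posted output) and loops over an array instead of the dccmd pipe so it can be run anywhere.

```perl
use strict;
use warnings;

# Assumed tab-separated sample, based on the output in the question.
my @lines = (
    "Resolution Changer 3.12 from 12noon (12noon.com)\n",
    "\n",
    "Width\tHeight\tDepth\tRate\n",
    "\n",
    "320\t200\t8\t60\n",
    "320\t200\t16\t60\n",
);

my ( %data, @headers );
for ( @lines ) {
    next unless /\t/;              # skips the banner line and blank lines
    chomp;
    unless ( /\d/ ) {              # the header row contains no digits
        @headers = split /\t/;
        next;
    }
    my @fields = split /\t/;
    for my $index ( 0 .. $#headers ) {
        push @{ $data{ $headers[$index] } }, $fields[$index];
    }
}

for my $header ( @headers ) {
    print "$header: @{ $data{$header} }\n";
}
```

The banner line is now rejected because it has no tab, rather than slipping through only because @headers happens to be empty at that point.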
 
C.DeRykus

"king" <[email protected]> wrote in message

I have the below text:
Want to store it in array using regular expression.
The script is as below.
"
open(SYSCMD, "dccmd -listmodes |");
while(<SYSCMD>)
{
if(/^(\d+)[\s+](\d+)/)
{
chomp;
push(@X_Cord, $1);
}
if(/[\s+](\d+)[\s+](.*)$/)
{
chomp;
push(@Y_Cord, $1);
}
if(/(\d+)(\s+)(\d+)$/)
{
chomp;
push(@depth, $1);
}
if(/(\s+)(\d+)$/)
{
chomp;
push(@refresh, $1);


foreach(@X_Cord)
{
print "$_\n";
}
foreach(@Y_Cord)
{
print "$_\n";
}
foreach(@depth)
{
print "$_\n";
}
foreach(@refresh)
{
print "$_\n";
}
"
The output of dccmd command is
"Resolution Changer 3.12 from 12noon (12noon.com)
Width Height Depth Rate
320 200 8 60
320 200 16 60
320 200 32 60
512 384 8 60
512 384 16 60
512 384 32 60
640 400 8 60
640 400 16 60
640 400 32 60
640 480 8 60
640 480 16 60
640 480 32 60
800 600 8 60
800 600 16 60
800 600 32 60
1024 768 8 60
1024 768 16 60
1024 768 32 60
1280 800 8 60
1280 800 16 60
1280 800 32 60
"
From the script I am getting only X_cord and Y_cord. But I am not
getting the depth and refresh.
Can anybody help me with the refresh rate.

Many good solutions, TIMTOWTDI :)

This is my suggestion:

while(<SYSCMD>)
{
  if (/^(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/){
    push(@X_Cord, $1);
    push(@Y_Cord, $2);
    push(@depth, $3);
    push(@refresh, $4);
  }}
[ ... ]

you could even use /x and named capture buffers for
even more readability IMO:

if ( /^ (?<xcord> \d+ ) \s+ (?<ycord> \d+ ) \s+
(?<depth> \d+ ) \s+ (?<refresh> \d+ ) /x )
{
push @X_Cord, $+{xcord};
push @Y_Cord, $+{ycord};
push @depth, $+{depth};
push @refresh,$+{refresh};
}
 
sln

Many good solutions, TIMTOWTDI :)

This is my suggestion:

while(<SYSCMD>)
{
if (/^(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/){
push(@X_Cord, $1);
push(@Y_Cord, $2);
push(@depth, $3);
push(@refresh, $4);
}
}
close(SYSCMD);

IMHO: Gives a good balance between what my poor head can read and input
control :)

Cheers

The regex you use is interesting.
/^(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/

The separator in the OP's sample data is a single tab \t
but you use \s+.

All indications from your regex show validation
requirements with the + quantifier. As well, there is
an assertion of beginning of string ^ followed by digits,
however, you don't have the end of string assertion with
preceding digits.

This cannot validate input against an apparently fixed output
format. With his data, a validating regex would appear to be this:
/^(\d+)\t(\d+)\t(\d+)\t(\d+)$/

Otherwise, it's just as valid to say this
/\s*(\d+)\s*(\d+)\s*(\d+)\s*(\d+)\s*/
because the assumption is slop data both ways and your
beginning of string assertion ^ is useless when used
without $ in a line of data that should be completely
consumed as 4 groups of digits only.

If the alternate assumption is that there could be partial
data of limited form with only slight validation, then this
might be a better trap:

/^\s*(\d+)\s*(\d*)\s*(\d*)\s*(\d*)/

-sln
 
sln

The regex you use is interesting.
/^(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/

The separator in the OP's sample data is a single tab \t
but you use \s+.

All indications from your regex show validation
requirements with the + quantifier. As well, there is
an assertion of beginning of string ^ followed by digits,
however, you don't have the end of string assertion with
preceding digits.

This cannot validate input against an apparently fixed output
format. With his data, a validating regex would appear to be this:
/^(\d+)\t(\d+)\t(\d+)\t(\d+)$/

Otherwise, it's just as valid to say this
/\s*(\d+)\s*(\d+)\s*(\d+)\s*(\d+)\s*/
because the assumption is slop data both ways and your
beginning of string assertion ^ is useless when used
without $ in a line of data that should be completely
consumed as 4 groups of digits only.

If the alternate assumption is that there could be partial
data of limited form with only slight validation, then this
might be a better trap:

/^\s*(\d+)\s*(\d*)\s*(\d*)\s*(\d*)/

-sln
Thanks for your input. Interesting.
IMHO it's always a bit tricky to know exactly what could be expected on the
input side.
My assumption is based on the regexp in the task:

"if(/^(\d+)[\s+](\d+)/)"

Which gave me a clue that he knew that the line began with a number, but
that he was somehow uncertain about the end of the line. This of course
gives room for speculation, and normally I would have checked the input
requirements more closely if I were given this assignment.
/\s*(\d+)\s*(\d+)\s*(\d+)\s*(\d+)\s*/

Yes, this was a bad example on my part, I was thinking ahead of
myself.
will allow a line like this:
1234
will allow a line like this:
1

But, on this, my point was that matching
(match) ($v1,$v2,$v3,$v4) = " 1 " =~ /^\s*(\d+)\s*(\d*)\s*(\d*)\s*(\d*)/;
is no less valid than matching
(match) ($v1,$v2,$v3,$v4) = "1 2 3 4 5 6 7 " =~ /^(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/;
as opposed to this
(no match) ($v1,$v2,$v3,$v4) = "1 2 3 4 5 6 7 " =~ /^(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$/;
and this
(match) ($v1,$v2,$v3,$v4) = "1 2 3 4" =~ /^(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$/;

All the $v's are defined.
Clearly, only one of these examples purports to be the real valid match.
The other matches are either a subset or a superset of that validation.
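
The four cases above can be checked with a short script. This is just a sketch: the pattern names ($loose, $anchored, $partial) are labels invented here, and the strings are the ones used in the examples.

```perl
use strict;
use warnings;

my $loose    = qr/^(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/;
my $anchored = qr/^(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$/;
my $partial  = qr/^\s*(\d+)\s*(\d*)\s*(\d*)\s*(\d*)/;

my @cases = (
    [ ' 1 ',            $partial  ],
    [ '1 2 3 4 5 6 7 ', $loose    ],
    [ '1 2 3 4 5 6 7 ', $anchored ],
    [ '1 2 3 4',        $anchored ],
);

my @results;
for my $case (@cases) {
    my ($text, $re) = @$case;

    # In list context a match returns the captures, a failure the empty list.
    my @v = $text =~ $re;
    push @results, scalar @v;
    printf "%-18s %s\n", "'$text'", @v ? "match: [@v]" : "no match";
}
```

In the first case the three optional groups match the empty string, so they are defined but empty; only the anchored pattern rejects the line with trailing extra fields.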
Maybe I misunderstand you here... Anyway, I agree that it's a good thing
to try to "anchor" it, but again, that means that I must know a bit more
about the input.

Cheers

I don't know why anchoring means anything special in this case, and that was not
my point. My point was that there is an attempt to validate form and content.
You validated partial form and partial content. There is no reason to do this
partially; it gets you the same result, unreliable data.
And this is perfectly fine if your confidence is high and you are filtering
unformed, invalid and superfluous, injected or malformed lines of data.

But don't think for a minute the data is validated as a structure.
In that sense, /^\s*(\d+)\s*(\d*)\s*(\d*)\s*(\d*)/ becomes just as valid,
because "a minimum of 4 groups of digits" doesn't guarantee integrity,
nor does a minimum of 1. They have the same weight of validity, which is inconclusive.

This is just my opinion. If there is a question of data integrity, I would like
to catch partial data of otherwise invalid lines. I give a little leeway for variability.

-sln
 
