regexp: segmentation fault

S.Marion

Hello,

I have a problem with my regexp.
I'm trying to match the following pattern:

$cmd =~ /static \{\};(.*\n)(.*)Signature:
\(\)V(.*\n){1,3}.*Code:\n(.+\n){1,$limit}\s*$offset:(.*)/g;

The problem is that there can be many lines between "Code:" and the
$offset.
By many lines I mean thousands.
When the offset is more than about 8000 lines away, I get a segfault!
I guess the problem is that it's feeding too much into $3, whereas
I'm only interested in the remainder of the line after $offset (that is $4).
Basically, if I could avoid using $3 I wouldn't mind!


Do you know any way I could fix this?

Thank you for your help,

Sebastien
 
A. Sinan Unur

S.Marion said:
I have a problem with my regexp.
I'm trying to match the following pattern:

$cmd =~ /static \{\};(.*\n)(.*)Signature:
\(\)V(.*\n){1,3}.*Code:\n(.+\n){1,$limit}\s*$offset:(.*)/g;

The problem is that there can be many lines between the "Code" and the
$offset.
By many line I mean thousands.
When the offset is further than about 8000 lines, I have a segfault !
I guess the problem is that it's feeding too much info into $3 whereas
I'm only interested in the remaining of the line after $offset (that
is $4). Basically, if I could avoid using $3 I wouldn't mind !

Without any idea what the input looks like, the regex above does not
mean much to me.

By the way, the (.*) after $offset is the fifth capture group.

If you are not interested in capturing anything before that, why are you
using capturing groups?

I have a feeling (since I have no data, I cannot test this), anchoring
the pattern, using .+ rather than .* might help.

On the other hand, depending on what the input looks like, I might be
tempted to use the .. operator.

See

perldoc perlre for non-capturing groups
perldoc perlop for range operators

Sinan
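
A minimal sketch of the non-capturing suggestion, assuming a made-up input that only loosely resembles javap output. The sample text, the sub name offset_text and the offsets used are invented for illustration. It still repeats a group once per line, so by itself it is not guaranteed to avoid the segfault on very long inputs; it only shows that (?: ... ) leaves the text after the offset as the sole capture.

```perl
use strict;
use warnings;

# Hypothetical sample; real javap output will differ.
my $sample = <<'END';
static {};
  Signature: ()V
  Code:
   0: aload_0
   5: invokespecial #1
   9: return
END

# (?: ... ) groups without capturing, so the text after the offset
# is the only capture and lands in $1.
sub offset_text {
    my ($text, $offset) = @_;
    return $text =~ /Code:\n(?:.+\n)*?\s*\Q$offset\E:(.*)/ ? $1 : undef;
}

print offset_text($sample, 5), "\n";
```

The lazy *? makes the engine try the nearest line first, and \Q...\E keeps the offset from being read as regex syntax.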
 
S.Marion

Hello,

Thank you for your reply.
Let me apologise if I wasn't clear enough.
Basically the inputs are from javap, and I want to match a particular
offset of a given output in the given method with the given signature.

A. Sinan Unur said:
By the way, the (.*) after $offset is the fifth capture group.

That's right, my mistake; I got confused after moving it around.

A. Sinan Unur said:
If you are not interested in capturing anything before that, why are you
using capturing groups?

Well, simply because I have no idea how else I could say "OK, skip as
many lines as you want until you find my offset".

A. Sinan Unur said:
I have a feeling (since I have no data, I cannot test this), anchoring
the pattern, using .+ rather than .* might help.

No, unfortunately that doesn't do the trick.

A. Sinan Unur said:
On the other hand, depending on what the input looks like, I might be
tempted to use the .. operator.

I'm not sure I understand what this does, but in any case it does not
work, unfortunately :(
 
S.Marion

Ok, I'll try to simplify the question.

I found the following, which I think is exactly my problem:
"Items governed by * (and *?) are optional not only once, but repeatedly
forever (well, to be pedantic, Perl currently has an internal limit of
32K repeats for parenthetical items)."

basically the file I parse is something like:

bla
2000 lines
what i want: secret
blablabla
10 000 lines (more than 32k)
what i want: secret

I only want to get the "secret" after the "what I want" stuff while
being sure this is below "blablabla" AND NOT below "bla".

So the regexp looks like:
$cmd =~ /blablabla (.*\n)what i want: (.*)/g

PS: I can't really use the /gs modifiers due to the complexity of the file
to parse. If I did, I would end up with duplicates.

Sebastien
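
One way to sidestep the repeated group entirely is to locate the marker with index() and then start a small regex from that position. This is only a sketch under assumptions: the input is built in memory to mimic the outline above, the sub name secret_after is invented, and the marker is assumed not to be the very first line.

```perl
use strict;
use warnings;

# Build a hypothetical input shaped like the outline above.
my $text = join '',
    "bla\n", ("filler\n") x 2000, "what i want: secret1\n",
    "blablabla\n", ("filler\n") x 40000, "what i want: secret2\n";

# Return the first "what i want" value found after the marker line,
# or undef if either the marker or the value is missing.
sub secret_after {
    my ($text, $marker) = @_;
    my $needle = "\n$marker\n";          # assumes the marker is not the first line
    my $at = index($text, $needle);
    return undef if $at < 0;
    pos($text) = $at + length $needle;   # start matching just past the marker
    # With /g in scalar context the search begins at pos($text).
    return $text =~ /^what i want: (.*)$/mg ? $1 : undef;
}

print secret_after($text, 'blablabla'), "\n";
```

Because the pattern never has to describe the filler lines, there is no quantified group for the engine to repeat thousands of times.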
 
ednotover

S.Marion said:
Ok, I'll try to simplify the question.
basically the file I parse is something like:

bla
2000 lines
what i want: secret
blablabla
10 000 lines (more than 32k)
what i want: secret

I only want to get the "secret" after the "what I want" stuff while
being sure this is below "blablabla" AND NOT below "bla".

So the regexp looks like:
$cmd =~ /blablabla (.*\n)what i want: (.*)/g

Two quick thoughts:

1) Why are you trying to do this with a regexp? Why not loop through
the input file and take actions as needed as you see the significant
input lines?

2) If you're staying with a regexp, you might be better off using
non-greedy matches for the portions that are your "filler":

$cmd =~ /blablabla (.*?\n)what i want: (.*)/g

(The final capture stays greedy; a trailing (.*?) would match the empty
string.)

The segfault may be due to the size of the input and the amount of
backtracking the RE engine has to manage.

Hope that helps,
Ed
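
The first suggestion, reading line by line and reacting to the significant lines, might look roughly like this. It is a sketch, not the poster's code: the sub name secrets_below, the flag variable and the in-memory sample are assumptions made for illustration.

```perl
use strict;
use warnings;

# Collect every "what i want" value that appears after the marker line.
# Reads one line at a time, so no single regex spans the whole input.
sub secrets_below {
    my ($fh, $marker) = @_;
    my ($seen_marker, @found) = (0);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line eq $marker) { $seen_marker = 1; next; }
        push @found, $1 if $seen_marker && $line =~ /^what i want: (.*)/;
    }
    return @found;
}

# Hypothetical input shaped like the outline in the question.
my $input = join '',
    "bla\n", ("filler\n") x 2000, "what i want: secret1\n",
    "blablabla\n", ("filler\n") x 40000, "what i want: secret2\n";

open my $fh, '<', \$input or die "cannot open in-memory input: $!";
print join(', ', secrets_below($fh, 'blablabla')), "\n";
close $fh;
```

Each regex here only ever sees one line, so the amount of backtracking stays small regardless of how long the file is.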
 
robic0

ednotover said:
Two quick thoughts:

1) Why are you trying to do this with a regexp? Why not loop through
the input file and take actions as needed as you see the significant
input lines?

2) If you're staying with a regexp, you might be better off using
non-greedy matches for the portions that are your "filler":

$cmd =~ /blablabla (.*?\n)what i want: (.*)/g

The segfault may be due to the size of the input and the amount of
backtracking the RE engine has to manage.

Naaa, the size and backtracking will not produce a segmentation fault.
Personally, I find the OP's posted problem absurd. In the real world, no one
would program to such an abstraction. His "blablabla"s and 32K represent
random, non-repeatable expressions. Might as well try to run a regex on
the dictionary.

Most likely this guy has got a DFI board with an overclocked CPU at around 4.0 GHz
and RAM in the DDR 700 range, *on air*, with temps in the 70 °C range.
Or he has 128 MB installed for grins. Ever try to run a bunch of programs
in XP with 128 MB of RAM? I would bet Perl would segfault, wouldn't you?

-robic0-
 
jl_post

S.Marion said:
basically the file I parse is something like:

bla
2000 lines
what i want: secret
blablabla
10 000 lines (more than 32k)
what i want: secret

I only want to get the "secret" after the "what I want" stuff while
being sure this is below "blablabla" AND NOT below "bla".


Here's a short Perl script that does what you want:


#!/usr/bin/perl
use strict;
use warnings;

while (<DATA>)
{
    # Skip line unless it's between the "bla" and "blablabla" lines:
    next unless m/^bla$/ .. m/^blablabla$/;

    if (m/^what i want: (.*)/)
    {
        my $wantedObject = $1;
        print "I found what I want! It's $wantedObject!\n";
    }
}

__DATA__
bla
2000 lines
what i want: secret1
blablabla
10 000 lines (more than 32k)
what i want: secret2


Run this program, and you'll see that the output is:

I found what I want! It's secret1!

Notice that it found "secret1" but not "secret2". That's because
the ".." operator only returns true when it is between the "bla" and
the "blablabla" lines. We told it to skip any lines that aren't
between those two lines with the line of code:

next unless m/^bla$/ .. m/^blablabla$/;

I hope this helps, Sebastien.

Have a great weekend!

-- Jean-Luc
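
The script above keeps the lines between "bla" and "blablabla", whereas the original question asked for the value below "blablabla". A possible variant of the same ".." technique is sketched here, assuming the region of interest runs from the "blablabla" line to the end of the input. The sub name secrets_in_range and the in-memory sample are invented for illustration.

```perl
use strict;
use warnings;

# Return the "what i want" values found from the "blablabla" line
# through the end of the input.
sub secrets_in_range {
    my ($fh) = @_;
    my @found;
    local $_;
    while (<$fh>) {
        # True from the "blablabla" line through the last line of input;
        # eof($fh) becomes true once the final line has been read.
        next unless /^blablabla$/ .. eof($fh);
        push @found, $1 if /^what i want: (.*)/;
    }
    return @found;
}

# Hypothetical input shaped like the __DATA__ section above.
my $data = join '',
    "bla\n", "2000 lines\n", "what i want: secret1\n",
    "blablabla\n", "10 000 lines (more than 32k)\n", "what i want: secret2\n";

open my $in, '<', \$data or die "cannot open in-memory input: $!";
print "I found what I want! It's $_!\n" for secrets_in_range($in);
close $in;
```

Since the right-hand operand only turns true on the last line, the range resets at end of input and the sub can be called again on another handle.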
 
