s modifier doesn't seem to work

fmassion · Aug 10, 2013

Hi everybody,

I am currently testing a string search over line breaks.

My file is UTF-8 encoded.

This is my test text (with linebreaks at the end):
----------
Das ist ein Beispiel mit 3 Sätzen
Das ist ein 1122-22-11 Format
Hier ist keine Zahl.
Hier ist kein Punkt
nur Text Hier ist nur Text ist aber nur Text
----------

This is a code extract:

foreach $satz (@satz) {
    chomp $satz;
    if ($satz =~ m/\d(?s)(.*)keine/g) {
        $satz =~ s/$&/xxxx/g;
    }
    print "$satz\n";
}

I would expect the following result for the first three lines:
'Das ist ein Beispiel mit xxxxx Zahl.'

With this search string, however, I get no match. I entered the same expression in UltraEdit (Regex-Perl-Search) and it works correctly there.

What is wrong here?

Peter J. Holzer · Aug 10, 2013

I am currently testing a string search over line breaks. [...]
This is my test text (with linebreaks at the end):
----------
Das ist ein Beispiel mit 3 Sätzen
Das ist ein 1122-22-11 Format
Hier ist keine Zahl.
Hier ist kein Punkt
nur Text Hier ist nur Text ist aber nur Text
---------- [...]
if ($satz =~ m/\d(?s)(.*)keine/g) { [...]
With this search string, I get however no match. [...]
What is wrong here?

Read the section "Modifiers" in perldoc perlre.

hp

George Mpouras · Aug 10, 2013

I would expect the following result for the first three lines:

'Das ist ein Beispiel mit xxxxx Zahl.'

With this search string, I get however no match. I have entered the same expression in UltraEdit (Regex-Perl-Search) and it works correctly.

What is wrong here?

while (<DATA>)
{
    s/(\d|-|keine)+/xxxx/g;
    print "$_"
}

__DATA__
Das ist ein Beispiel mit 3 Sätzen
Das ist ein 1122-22-11 Format
Hier ist keine Zahl.
Hier ist kein Punkt
nur Text Hier ist nur Text ist aber nur Text

Peter J. Holzer · Aug 10, 2013

Quoth Peter J. Holzer:

if ($satz =~ m/\d(?s)(.*)keine/g) { [...]
With this search string, I get however no match. [...]
What is wrong here?

Read the section "Modifiers" in perldoc perlre.

Read the section '(?adlupimsx-imsx)' in perldoc perlre.

I've cancelled that article. Either I wasn't fast enough or your
Newsserver doesn't honor cancels (without cancel-lock).

hp

fmassion · Aug 10, 2013

I think Ben has the right hint. Indeed I read the file into the array (@satz) and then I go
'foreach $satz (@satz)'
George's code doesn't work, though. It returns the following result for the first 3 lines:

Das ist ein Beispiel mit xxxx Sätzen
Das ist ein xxxx Format
Hier ist xxxx Zahl.

The solution is still pending but thanks for the help.

fmassion · Aug 10, 2013

This works as expected, but I don't quite understand what happens

undef $/;
while (<DATA>) {
    chomp;
    print "$_<<\n";
    s/\d(.*)Zahl/xxxx/sg;
    print "\n$_\n"
}
It searches over the first 3 lines and outputs as expected:
'Das ist ein Beispiel mit xxxx'

George Mpouras · Aug 10, 2013

Please explain the requirements again in more detail. I cannot
understand what you expect.

Charles DeRykus · Aug 10, 2013

This works as expected, but I don't quite understand what happens

undef $/;

while (<DATA>) {
chomp;
print "$_<<\n";
s/\d(.*)Zahl/xxxx/sg;
print "\n$_\n"
}
It searches over the first 3 lines and outputs as expected:
'Das ist ein Beispiel mit xxxx'

See: perldoc perlvar --> $/

See: perldoc perlretut --> why '.' matches everything but "\n"
or
See: perldoc perlre -> Modifiers --> s Treat string as single line
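
To make the two pointers above concrete, here is a minimal sketch. The sample string and the variable names are only assumptions for illustration (the umlaut is written as "ae" to keep the snippet plain ASCII). It shows that '.' stops at "\n" unless /s is given, and that reading with $/ undefined returns everything in one string:

```perl
use strict;
use warnings;

# Sample text: three lines held in ONE string (illustrative data).
my $text = "mit 3 Saetzen\n1122-22-11 Format\nHier ist keine Zahl.\n";

# Without /s, '.' does not match "\n", so a digit and "keine"
# would have to sit on the same line for this to match.
my $without_s = ( $text =~ /\d.*keine/ ) ? 1 : 0;

# With /s, '.' also matches "\n", so the match may span lines.
my $with_s = ( $text =~ /\d.*keine/s ) ? 1 : 0;

print "without /s: $without_s\n";
print "with /s:    $with_s\n";

# With $/ undefined, a single read returns the rest of the input.
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my $slurped = do { local $/; <$fh> };
close $fh;

print "whole text came back in one read\n" if $slurped eq $text;
```

The in-memory filehandle only stands in for DATA or a real file, so that the snippet runs on its own.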

fmassion · Aug 11, 2013

On Saturday, 10 August 2013 21:57:07 UTC+2, Ben Morrow wrote:

[Please quote properly: that is, put your reply underneath the bit of
text you are replying to. It's also not helpful to keep replying to
yourself; instead you should reply to the article you are, um, replying
to. You appear to be using Google Groups, which has recently started
inserting extra blank lines whenever it quotes something; if you can't
find any way of turning this off you need to remove them by hand before
posting.]

Quoth (e-mail address removed):

On Saturday, 10 August 2013 11:16:58 UTC+2, (e-mail address removed) wrote:

I am currently testing a string search over line breaks.

[...]

This is a code extract:

foreach $satz (@satz) {
chomp $satz;
if ($satz =~ m/\d(?s)(.*)keine/g) {
$satz =~ s/$&/xxxx/g;
}
print "$satz\n";
}

I would expect the following result for the first three lines:
'Das ist ein Beispiel mit xxxxx Zahl.'

With this search string, I get however no match. I have entered the [...]

This works as expected, but I don't quite understand what happens
undef $/;

This is documented in perldoc perlvar, under $/. Setting $/ to undef
causes <> to read the whole file in one go. This means you now have your
whole file in one string, so the s/// works over multiple lines.

while (<DATA>) {

Since you are reading the whole file, there will only ever be one entry
to loop over, so you don't really need a loop.

chomp;

With $/=undef chomp doesn't do anything.

print "$_<<\n";
s/\d(.*)Zahl/xxxx/sg;
print "\n$_\n"
}
It searches over the first 3 lines and outputs as expected:
'Das ist ein Beispiel mit xxxx'

Since you're only doing one substitution it would be better to use an
ordinary named variable and no loop:

my $text = <DATA>;
print "$text<<\n";
$text =~ s/\d(.*)Zahl/xxxx/sg;
print "\n$text\n";

Ben

[Sorry for not replying properly. I hope this is OK now]

I understand what 'undef $/' does but it seems to be a workaround. Basically my goal is:

1) Read a text into an array
2) Iterate through the elements of the array: 'foreach $satz (@satz)'
3) Test various search-and-replace regexes (as a matter of fact I am working through the Regex Cookbook by Jan Goyvaerts & Steven Levithan). In this context, one of several tests concerns the s modifier. I just wonder why it isn't possible to search for an expression which spreads over more than one line if I add this modifier. It works in UltraEdit. It works in a few other tools as well, but I can't make it work in my Perl script. If I use the undef workaround, other search expressions (e.g. with $ to mark the end of the string) won't work.

In one of the tools I use (Expresso), I see that the EOL is coded as [CR][LF]. Is this a reason for the problem with the s modifier?

Peter J. Holzer · Aug 11, 2013

On Saturday, 10 August 2013 21:57:07 UTC+2, Ben Morrow wrote:

[Please quote properly: that is, put your reply underneath the bit of
text you are replying to. It's also not helpful to keep replying to
yourself; instead you should reply to the article you are, um, replying
to. You appear to be using Google Groups, which has recently started
inserting extra blank lines whenever it quotes something; if you can't
find any way of turning this off you need to remove them by hand before
posting.]

[...]
[Sorry for not replying properly. I hope this is OK now]

Not really. You are still quoting everything (whether it is relevant or
not) and you haven't removed the empty lines inserted by Google. So we
have to scroll/read through 130 lines of quotes which may or may not be
relevant. I dare say that not every one of us has the patience.

Do yourself and us a favour, get a real Newsreader and use one of the
free news servers (e.g. albasani).

I understand what 'undef $/' does but it seems to be a workaround.
Basically my goal is:

1) Read a text in an array

What are the elements of the array? Lines?

2) Iterate through the variables of the array: 'foreach $satz (@satz)'

So in each iteration of the loop you are looking at one line in
isolation.

3) Test various search and replace Regex (as a matter of fact I am
working through the Regex Cookbook of Jan Goyvaerts & Steven
Levithan). In this context, one of several tests concerns the s
modifier. I just wonder why it isn't possible to search for an
expressions which spread over more than one line if I add this
modifier.

That's what the /s modifier does. But there actually have to be several
lines in the variable you are looking at for this to work. If the other
lines are in different variables, how can perl know that you would want
to match those other variables, too, especially if you tell it
explicitly to look only at this variable?

It works in UltraEdit. It works in a few other tools as well

That's because UltraEdit and those other tools treat the whole text as
a unit. But your script (not Perl - *your* script) splits it into many
small units and looks at each of them in isolation. None of these small
units matches.

hp
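
To illustrate the point about small units, here is a minimal sketch. It assumes the first three lines of the test text are already in @satz (with "Sätzen" written as "Saetzen" to keep the snippet plain ASCII) and uses a slightly simplified pattern. Tested line by line, nothing matches even with /s. Joined into one string, the same pattern spans the line breaks:

```perl
use strict;
use warnings;

my @satz = (
    "Das ist ein Beispiel mit 3 Saetzen\n",
    "Das ist ein 1122-22-11 Format\n",
    "Hier ist keine Zahl.\n",
);

# Each element is matched in isolation: no single line holds both
# a digit and "keine", so /s makes no difference here.
my $per_line_hits = grep { /\d.*keine/s } @satz;

# Joined into one string, the pattern can cross the line breaks.
my $alles = join '', @satz;
( my $ersetzt = $alles ) =~ s/\d.*keine/xxxx/s;

print "lines matching on their own: $per_line_hits\n";
print $ersetzt;
```

The names $alles and $ersetzt are just placeholders. The substitution leaves 'Das ist ein Beispiel mit xxxx Zahl.', which is essentially the result asked for in the first post.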

fmassion · Aug 11, 2013

On Saturday, 10 August 2013 11:16:58 UTC+2, (e-mail address removed) wrote:

[Sorry for not replying properly. I hope this is OK now]

Nope, the blank lines are still there.

Sorry to Peter, Ben and all of you. I hope it's fine now
[...]

Why do you want it in an array, rather than a single string?

Because I may want to do things only with the $satz variables which meet the regex, e.g. send them to another array or whatever. This isn't possible when I read only one big string.

Peter J. Holzer · Aug 11, 2013

On Saturday, 10 August 2013 11:16:58 UTC+2, (e-mail address removed) wrote:

[Sorry for not replying properly. I hope this is OK now]

Nope, the blank lines are still there.

Sorry to Peter, Ben and all of you. I hope it's fine now

Not quite, but a lot better, thanks.

Because I may want to do things only with the $satz variables which
meet the regex.

Apparently none of them does.

hp

Charles DeRykus · Aug 11, 2013

On Saturday, 10 August 2013 11:16:58 UTC+2, (e-mail address removed) wrote:

[Sorry for not replying properly. I hope this is OK now]

Nope, the blank lines are still there.

Sorry to Peter, Ben and all of you. I hope it's fine now
[...]

Why do you want it in an array, rather than a single string?

Because I may want to do things only with the $satz variables which meet the regex.
E.g. send them to another array or whatever. This isn't possible when I
read only one big large string.

The problem is that the match may extend over several $satz. If you wanted
to identify those individual $satz which are part of the match, you
could do something like this:

my @satz = <DATA>;
my $alles = join('', @satz);

my $match;
if ( $alles =~ /^.*\d.*Zahl.*?\n/gsmap ) {
    $match = ${^MATCH};
    foreach my $satz (@satz) {
        if ( $match =~ /$satz/ ) {
            #print "sentence is part of match: $satz"
            ...
        }
    }
}

Charles DeRykus · Aug 11, 2013

...

my @satz = <DATA>;
my $alles = join('', @satz);

my $match;
if ( $alles =~ /^.*\d.*Zahl.*?\n/gsmap ) {
$match = ${^MATCH};
foreach my $satz (@satz) {
if ( $match =~ /$satz/ ) {
#print "sentence is part of match: $satz"
...
}
}
}

You could omit /p too:

my $match;
if ( $alles =~ /^(.*\d.*Zahl.*?\n)/gsma ) {
    $match = $1;
    foreach my $satz (@satz) {
        if ( $match =~ /$satz/ ) {
            #print "sentence is part of match: $satz"
            ...
        }
    }
}

Charles DeRykus · Aug 12, 2013

Quoth Charles DeRykus <(e-mail address removed)>:

/^\Q$satz/m

Also, this may pick up lines that were not part of the originally-
matched text. Given that the match is anchored to a whole line fore-and-
aft ($satz will contain a trailing newline) this can only happen with
whole duplicated lines, but it may still be a problem.
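
A small sketch of why the \Q in /^\Q$satz/m matters. The two "Preis" lines are made-up sample data, not taken from the thread. Interpolated unquoted, the '.' and the parentheses in $satz act as regex syntax, so the pattern can hit a different line than the one intended:

```perl
use strict;
use warnings;

# Made-up sample data: the first line contains regex metacharacters.
my $alles = "Preis: 3.50 (netto)\nPreis: 3x50 netto\n";
my $satz  = "Preis: 3.50 (netto)\n";

# Unquoted: '.' is a wildcard and '(netto)' is a group matching "netto",
# so the pattern matches the SECOND line instead of the first.
my ($unquoted_hit) = $alles =~ /^($satz)/m;

# \Q...\E escapes the metacharacters, so only the literal line matches.
my ($quoted_hit) = $alles =~ /^(\Q$satz\E)/m;

print "unquoted matched: $unquoted_hit";
print "quoted matched:   $quoted_hit";
```

As the quoted remark says, quoting does not help with whole duplicated lines. It only makes sure that the text of $satz is matched literally.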

And it's been bothering me since posting... here's a messier solution
to ensure sentences overlap the target begin/end positions:

use 5.012; # so each will work on arrays
....
my @satz = <DATA>;
my $alles = join('', @satz);

my ($b, $e) = (0, 0);
my @pos = map { $b= $e+1 if $e; $e += (length($_)-1); [$b,$e] } @satz;

if ( $alles =~ /^(.*\d.*Zahl.*?\n)/gsma ) {
    my($match, $begin, $end) = ($1, $-[0], $+[0]);

    while( my($i,$satz) = each @satz ) {
        next unless ($pos[$i][0] >= $begin and $pos[$i][0] <= $end)
                 or ($pos[$i][1] >= $begin and $pos[$i][1] <= $end);

        if ( $match =~ /$satz/ ) {
            print "sentence is part of match: $satz\n\n"
            ...
        }
    }
}

Charles DeRykus · Aug 12, 2013

...
use 5.012; # so each will work on arrays
...
my @satz = <DATA>;
my $alles = join('', @satz);

my ($b, $e) = (0, 0);
my @pos = map { $b= $e+1 if $e; $e += (length($_)-1); [$b,$e] } @satz;

if ( $alles =~ /^(.*\d.*Zahl.*?\n)/gsma ) {

That .* will match across newlines, so the ^ (and the /m) does nothing.

my($match, $begin, $end) = ($1, $-[0], $+[0]);

while( my($i,$satz) = each @satz ) {
next unless ($pos[$i][0] >= $begin and $pos[$i][0] <= $end)
or ($pos[$i][1] >= $begin and $pos[$i][1] <= $end);

if ( $match =~ /$satz/ ) {

You're still not quoting $satz. It's really important to quote user data
before interpolating it into a pattern. You're also not anchoring the
match at the beginning, so a line will match if it only ends with $satz.

Thanks, I missed that... very important to have it there.

print "sentence is part of match: $satz\n\n"
...
}
}
}

But this is still a great deal more complicated than

my $alles = slurp \*DATA;

while (my ($match) =
$alles =~ /([^\n]* \d .* Zahl [^\n]*)/gsx
# or perhaps /(.* \d (?s:.)* Zahl .*)/gx
# or /(\N* \d .* Zahl \N*)/gsx if you've got 5.12
) {
for my $satz (split /\n/, $match) {
# make that /(?<=\n)/ if you don't want to chomp
print "sentence is part of match: $satz\n\n";
}
}

Yes, I agree that's more concise and clearer to many.
But it'll loop endlessly: in list context the //g match starts again
from the beginning of $alles on every pass.

I think you probably meant:

while ( $alles =~ /([^\n]* \d .* Zahl [^\n]*)/gsx;
my $match = $1;
...
}

Charles DeRykus · Aug 12, 2013

...
I think you probably meant:

while ( $alles =~ /([^\n]* \d .* Zahl [^\n]*)/gsx;

^^^^

And of course that should be: "/gsx ) {" rather than: ";"
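
Put together, the corrected loop might look like the following minimal sketch. The sample text in $alles and the @treffer array are assumptions added so that the snippet runs on its own. In scalar context each //g test resumes at pos($alles), so the loop ends once no further match is found:

```perl
use strict;
use warnings;

# Illustrative sample text, joined into one string.
my $alles = "Beispiel mit 3 Saetzen\n"
          . "1122-22-11 Format\n"
          . "Hier ist keine Zahl.\n"
          . "Hier ist kein Punkt\n";

my @treffer;    # lines over which a match extends

# Scalar-context //g: every test continues where the last match ended.
while ( $alles =~ /([^\n]* \d .* Zahl [^\n]*)/gsx ) {
    my $match = $1;
    push @treffer, split /\n/, $match;
}

print "sentence is part of match: $_\n" for @treffer;
```

With this sample text the first three lines end up in @treffer and the fourth does not.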

fmassion · Aug 13, 2013

Thanks to all of you for your efforts and ideas. Let me summarize the lessons I've learned in this discussion.
The task was: Import a text, apply a regex which extends over a linebreak and display/modify the lines matching the expression.
The original approach failed because the text was not read in one string, but split into lines in an array.
I then wanted to be able to print each individual line of the array and to use ^ and $ in line-based regular expression.
I have tried all the suggested code. Not everything has worked. This is my current code with which I manage to get the matched lines and the entire text:

use utf8; # allows working with UTF-8 files
binmode STDIN, ":utf8"; # input
binmode STDOUT, ":utf8"; # output

#undef $/; # is not required as <DATA> read into array and then joined
open(DATA,'D:\temp\a.txt') || die("Datei kann nicht geöffnet werden!\n");
seek(DATA, 3, 0);
my @satz = <DATA>;
my $alles = join('', @satz);
my $match;
if ( $alles =~ /^.*\d.*Zahl.*?\n/gsma ) {
$match = ${^MATCH}; # I don't understand what is this ${^MATCH}
# $match = $1; # doesn't work
print "$match<<\n"; # prints only the match
foreach my $satz (@satz) {
# if ( $match =~ /$satz/ ) { # if activated prints nothing
print "sentence is part of match: $satz\n"; # prints the entire text
# }
}
}

Rainer Weikusat · Aug 13, 2013

(e-mail address removed) writes:

[...]

use utf8; # allows working with UTF-8 files

This is only needed if your source code contains UTF-8.

binmode STDIN, ":utf8"; # input
binmode STDOUT, ":utf8"; # output

#undef $/; # is not required as <DATA> read into array and then joined

Except in 'short files' (as here), it is usually better to use

local $/;

instead. This creates a new binding for $/ while preserving the old
one which will be restored after the containing block.
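
A minimal sketch of that idiom. The in-memory string $quelle is only a stand-in for a real file, and the variable names are illustrative:

```perl
use strict;
use warnings;

my $quelle = "erste Zeile\nzweite Zeile\n";    # stand-in for a real file
open my $fh, '<', \$quelle or die "cannot open in-memory file: $!";

my $alles;
{
    local $/;          # $/ is undefined only inside this block
    $alles = <$fh>;    # one read returns everything
}
close $fh;

# Outside the block $/ has its previous value again ("\n" by default).
my $restored = ( defined $/ && $/ eq "\n" ) ? 1 : 0;

print "read ", length($alles), " characters, \$/ restored: $restored\n";
```

Code that runs after the block still reads line by line, which is the advantage over a plain 'undef $/;'.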

open(DATA,'D:\temp\a.txt') || die("Datei kann nicht geöffnet werden!\n");

This "it didn't work" style of error reporting is a bit useless. The
message should also contain the system error code/ message.
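
For example, a sketch along these lines puts the system error from $! into the message. The path is hypothetical and assumed not to exist, and the eval is only there so that the snippet can show the message instead of dying:

```perl
use strict;
use warnings;

my $datei = 'no-such-dir/a.txt';    # hypothetical path, assumed not to exist

my $ok = eval {
    # $! holds the system error text, e.g. "No such file or directory".
    open my $fh, '<', $datei or die "Cannot open '$datei': $!\n";
    close $fh;
    1;
};

my $meldung = $ok ? '' : $@;
print $meldung;
```

In a real script the 'open ... or die "...: $!"' line on its own is usually enough.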

seek(DATA, 3, 0);
my @satz = <DATA>;
my $alles = join('', @satz);
my $match;
if ( $alles =~ /^.*\d.*Zahl.*?\n/gsma ) {
$match = ${^MATCH}; # I don't understand what is this ${^MATCH}

As 'perldoc perlvar' could have told you: The text which matched the
regex. At least for the perl version I'm using (5.10.1), the
documentation also says the /p match modifier is needed in order to
use this builtin variable.

# $match = $1; # doesn't work

Since the regex isn't capturing anything, that is to be expected.
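
Both variants side by side, as a minimal sketch with made-up sample text: with /p the matched text is available in ${^MATCH}, and with capturing parentheses the same text is in $1.

```perl
use strict;
use warnings;

# Illustrative sample text.
my $alles = "mit 3 Saetzen\nHier ist keine Zahl.\nkein Punkt\n";

# Variant 1: /p makes the matched text available in ${^MATCH}.
my $per_p;
if ( $alles =~ /\d.*Zahl/sp ) {
    $per_p = ${^MATCH};
}

# Variant 2: capture with parentheses and read $1.
my $per_capture;
if ( $alles =~ /(\d.*Zahl)/s ) {
    $per_capture = $1;
}

print "same text either way\n" if $per_p eq $per_capture;
```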

Charles DeRykus · Aug 13, 2013

Thanks to all of you for your efforts and ideas. Let me summarize the lessons I've learned in this discussion.
The task was: Import a text, apply a regex which extends over a linebreak and display/modify the lines matching the expression.
The original approach failed because the text was not read in one string, but split into lines in an array.
I then wanted to be able to print each individual line of the array and to use ^ and $ in line-based regular expression.
I have tried all the suggested code. Not everything has worked. This is my current code with which I manage to get the matched lines and the entire text:

use utf8; # allows working with UTF-8 files
binmode STDIN, ":utf8"; # input
binmode STDOUT, ":utf8"; # output

#undef $/; # is not required as <DATA> read into array and then joined
open(DATA,'D:\temp\a.txt') || die("Datei kann nicht geöffnet werden!\n");
seek(DATA, 3, 0);
my @satz = <DATA>;
my $alles = join('', @satz);
my $match;
if ( $alles =~ /^.*\d.*Zahl.*?\n/gsma ) {
$match = ${^MATCH}; # I don't understand what is this ${^MATCH}
# $match = $1; # doesn't work
print "$match<<\n"; # prints only the match
foreach my $satz (@satz) {
# if ( $match =~ /$satz/ ) { # if activated prints nothing
print "sentence is part of match: $satz\n"; # prints the entire text
# }
}
}

The ${^MATCH} is only valid with /p and was not needed. I'm not certain
it's at all relevant to what you're doing now either.

I think Ben's suggestion is the most promising if you want to identify
the sentences over which the match extends:

while (
    $alles =~ /([^\n]* \d .* Zahl [^\n]*)/gsx
    # or perhaps /(.* \d (?s:.)* Zahl .*)/gx
    # or /(\N* \d .* Zahl \N*)/gsx if you've got 5.12
) {
    my $match = $1; # scalar-context //g resumes at pos($alles), so the loop terminates
    for my $satz (split /\n/, $match) {
        # make that /(?<=\n)/ if you don't want to chomp
        print "sentence is part of match: $satz\n\n";
    }
}
