I (almost) wish that the syntax for grouping without backrefs
were at least as terse as the syntax for grouping with backrefs.
Having to add extra punctuation to indicate *not* doing something
just seems counterintuitive.
The (? syntax was a later addition to the language, when () was
already well established, so that wasn't really an option. (? is
also slightly harder to read for people familiar with regexp syntaxes
other than Perl's (and for those of us who first learned Perl before
it had (? ). I'm not saying that's an excuse for creating backrefs
unnecessarily, but there is some pressure to use () because it works.
I think that (?: ) was a logical step in the process, given that
( ) doesn't make much sense when combined with quantifiers.
In that case, it really doesn't work and is basically useless for capture,
as in (\s([\w]+\s*)*\s)+.
And it's very hard to read.
In that sense, (?: ) is moderately easier to discern than a capture
grouping (as a bonus you get extras such as (?imsx-imsx: ) ), but imho all groupings,
especially nested ones, are hard to read.
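
To make the quantifier point concrete, here is a minimal sketch. The sub names quantified_capture and all_words are made up for this illustration and are not part of the script further down. It shows that a capture group under a quantifier only keeps its last iteration, while grouping with (?: ) and collecting the pieces with a separate /g match gets all of them.

```perl
use strict;
use warnings;

# Hypothetical helper: what does $1 hold after a quantified capture group?
sub quantified_capture {
    my ($text) = @_;
    # (\w+\s*)+ walks over every word, but $1 is overwritten on each pass
    my ($last) = $text =~ /^(\w+\s*)+$/;
    return $last;
}

# Hypothetical helper: group with (?: ) for the repetition only,
# then pull the words out with a separate /g match.
sub all_words {
    my ($text) = @_;
    return () unless $text =~ /^(?:\w+\s*)+$/;
    return $text =~ /(\w+)/g;
}

print quantified_capture("one two three"), "\n";       # three
print join(",", all_words("one two three")), "\n";     # one,two,three
```

So the capturing parentheses inside the repetition buy nothing except a group number.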
When modifying or reading a regex's groupings, it's sometimes more important
to me to pick out the capture ones, since altering them shifts the output.
Most unique syntax is taken already.
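
A small sketch of that shifting, under made-up assumptions: the key=value patterns and the captures helper are invented for the illustration. Adding an optional prefix with ( ) renumbers everything after it, while (?: ) leaves the numbering alone.

```perl
use strict;
use warnings;

# Hypothetical helper: return the list of captures for a pattern and a string.
sub captures {
    my ($rx, $text) = @_;
    my @found = $text =~ $rx;
    return @found;
}

my $line = "set x=42";

my @original  = captures(qr/(\w+)=(\d+)/,             $line);
my @shifted   = captures(qr/(set\s+)?(\w+)=(\d+)/,    $line);   # new capture group in front
my @unshifted = captures(qr/(?:set\s+)?(\w+)=(\d+)/,  $line);   # same grouping, no capture

print "original:  [", join("][", @original),  "]\n";   # [x][42]
print "shifted:   [", join("][", @shifted),   "]\n";   # [set ][x][42]
print "unshifted: [", join("][", @unshifted), "]\n";   # [x][42]
```

Any code that reads $2 for the value now gets the key in the second case, which is why I want to see the capture groups at a glance.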
In need of a tool, I tried to cull out the starts of the capture groups,
separate from the non-capture ones. I didn't even attempt the closing parentheses, although
if the starts can be determined, I'd imagine the ends can too, but I'm not sure.
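
For what it's worth, here is a rough sketch of how the ends might be paired up with a stack. It is not what the script below does, and capture_group_spans is a made-up name. It assumes a pattern with no /x comments and no (?# ) comments, and it only skips escapes and character classes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [group_number, open_offset, close_offset]
# for each capture group. Assumes no /x or (?# ) comments in the pattern text.
sub capture_group_spans {
    my ($pattern) = @_;
    my (@stack, @spans);
    my $group = 0;
    pos($pattern) = 0;
    while (pos($pattern) < length $pattern) {
        if ($pattern =~ /\G\\./gcs) {
            # escaped character: skip it
        }
        elsif ($pattern =~ /\G\[\^?\]?(?:\\.|\[:[a-z]*:\]|[^\]\\])*\]/gcs) {
            # whole character class: skip it
        }
        elsif ($pattern =~ /\G\((?:(?!\?)|(?=\?[<'][^\W\d]\w*['>]))/gc) {
            # unnamed or named capture group: remember number and offset
            push @stack, [ ++$group, pos($pattern) - 1 ];
        }
        elsif ($pattern =~ /\G\(/gc) {
            # any other (? ... ) construct: placeholder keeps the nesting right
            push @stack, undef;
        }
        elsif ($pattern =~ /\G\)/gc) {
            my $open = pop @stack;
            push @spans, [ @$open, pos($pattern) - 1 ] if $open;
        }
        else {
            $pattern =~ /\G./gcs;    # ordinary character
        }
    }
    return sort { $a->[0] <=> $b->[0] } @spans;
}

printf "group %d: offsets %d to %d\n", @$_
    for capture_group_spans('a(b(?:c)(d))[(]\)');
```

The non-capturing opens go on the stack as placeholders so that each closing parenthesis is matched with the right opener.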
-sln
-------------------
use strict;
use warnings;
require 5.010_000;
##
my $rxgroup = qr/
([[:cntrl:]] | $) # Formatting control character
| # or, the rest ..
(?:
(?<!\\) # Not an escape behind us
(?:\\.)* # 0 or more "escape + any char"
(?:
# Exclude character class'
\[
\]?
(?: \\.| \[:[a-z]*:\] | [^\]\n] )*
(\n?)
(?: \\.| \[:[a-z]*:\] | [^\]] )*
\]
|
(?# Exclude extended comments )
\(\?(\#) [^)]* \)
|
# Exclude free comments
(\#) (?:[^\n])*
|
# Start of a capture group
\( # (
(?:
(?!\?) # unnamed: not a ? in front of us
| # or (Perl 5.10 and above)
# named: a ?<name> or ?'name' is ok
(?= \?[<'][^\W\d][\w]*['>] )
)
)
)
/x;
my $testrx = qr/
\(\$th(\\(?:.) [(]
(?# Extended lines
of comment
)
\\\\(.\)\\\(.)(i(s))\t(i(s)) ] )
/x;
##
# Sample object
print FindRXCaptureGroups(
qr/ \(\$th(\\(?:.) [(] \\\\(.\)\\\(.)(i(s))\t(i(s)) ] )/x ), "\n";
# Sample reference
print FindRXCaptureGroups( \$testrx ), "\n";
# Show groups for that which finds the groups
print FindRXCaptureGroups( \$rxgroup ),"\n";
exit(0);
##
sub FindRXCaptureGroups
{
    @_ > 0 || die "Expected a parameter";
    my $sample;
    # Accept a string, a Regexp object, or a reference to either
    if    (ref( $_[0]) eq 'SCALAR' ) { $sample = $_[0] }
    elsif (ref(\$_[0]) eq 'SCALAR' ) { $sample = \$_[0] }
    elsif (ref( $_[0]) eq 'Regexp' ) { $sample = \$_[0] }
    elsif (ref( $_[0]) eq 'REF' &&
           ref(${$_[0]}) eq 'Regexp') { $sample = $_[0] }
    else {
        die "Not a string, Regexp object, or reference to one";
    }
    my ($All,
        $grpstring,
        $group,
        $lastpos ) = ('', '', 1, 0);
    while ($$sample =~ /$rxgroup/g )
    {
        # $1: control character or end of string
        if (defined $1) {
            my $cntrlen = length $1;
            my $cntrlcode = $cntrlen ? $1 : "\n";
            $All .= substr( $$sample, $lastpos, ($+[0]-$lastpos-$cntrlen) ) . $cntrlcode;
            $grpstring .= '-' x ($+[0]-$lastpos-$cntrlen) . $cntrlcode;
            $lastpos = $+[0];
            if ($cntrlcode eq "\n") {
                $All .= $grpstring if ($grpstring =~ /\d/);
                $grpstring = '';
            }
            next;
        }
        # $2: character class, possibly spanning a newline
        if (defined $2) {
            my ($cntrlcode, $match0, $match2) = ($2, $+[0], $+[2]);
            if (length( $2 ) && $grpstring =~ /\d/) {
                $All .= substr( $$sample, $lastpos, ($match2-$lastpos) );
                $grpstring .= '-' x ($match2-$lastpos-1) . $cntrlcode;
                $lastpos = $match2;
                $All .= $grpstring;
                $grpstring = '';
            }
            $All .= substr( $$sample, $lastpos, ($match0-$lastpos) );
            $grpstring .= '-' x ($match0-$lastpos);
            $lastpos = $match0;
            next;
        }
        # $3 / $4: extended or free comment
        if (defined $3 || defined $4) {
            $All .= substr( $$sample, $lastpos, ($+[0]-$lastpos) );
            $grpstring .= '-' x ($+[0]-$lastpos);
            $lastpos = $+[0];
            next;
        }
        # Otherwise: start of a capture group, mark it with its number (mod 10)
        $All .= substr( $$sample, $lastpos, ($+[0]-$lastpos) );
        $grpstring .= '-' x ($+[0]-$lastpos-1) . $group++ % 10;
        $lastpos = $+[0];
    }
    return $All;
}
__END__
(?x-ism: \(\$th(\\(?:.) [(] \\\\(.\)\\\(.)(i(s))\t(i(s)) ] ))
---------------1----------------2---------3-4-----5-6--------
(?x-ism:
\(\$th(\\(?:.) [(]
----------1-----------
(?# Extended lines
of comment
)
\\\\(.\)\\\(.)(i(s))\t(i(s)) ] )
--------2---------3-4-----5-6-------
)
(?x-ism:
([[:cntrl:]] | $) # Formatting control character
-----1------------------------------------------------
| # or, the rest ..
(?:
(?<!\\) # Not an escape behind us
(?:\\.)* # 0 or more "escape + any char"
(?:
# Exclude character class'
\[
\]?
(?: \\.| \[:[a-z]*:\] | [^\]\n] )*
(\n?)
-----------------2----
(?: \\.| \[:[a-z]*:\] | [^\]] )*
\]
|
(?# Exclude extended comments )
\(\?(\#) [^)]* \)
-------------------3------------
|
# Exclude free comments
(\#) (?:[^\n])*
--------------4--------------
|
# Start of a capture group
\( # (
(?:
(?!\?) # unnamed: not a ? in front of us
| # or (Perl 5.10 and above)
# named: a ?<name> or ?'name' is ok
(?= \?[<'][^\W\d][\w]*['>] )
)
)
)
)