understanding regexp, Text::ParseWords

ccc31807

This is copied from Text::ParseWords. It appears in the function
parse_line(delimiter, boolean, string). I understand most of it, but
need some help with the rest. This appears in a loop:
while (length($line)) {
and parses a line with this call:
my ($f, $m, $l) = parse_line(',', 0, $line);
where $line will be like this:
"Barack","Hussein","Obama"
I have numbered the lines for reference.

<quote>
# This pattern is optimised to be stack conservative on older perls.
# Do not refactor without being careful and testing it on very long strings.
# See Perl bug #42980 for an example of a stack busting input.
1 $line =~ s/^
2 (?:
# double quoted string
3 (") # $quote
4 ((?>[^\\"]*(?:\\.[^\\"]*)*))" # $quoted
5 | # --OR--
# singe quoted string
6 (') # $quote
7 ((?>[^\\']*(?:\\.[^\\']*)*))' # $quoted
8 | # --OR--
# unquoted string
9 ( # $unquoted
10 (?:\\.|[^\\"'])*?
11 )
# followed by
12 ( # $delim
13 \Z(?!\n) # EOL
14 | # --OR--
15 (?-x:$delimiter) # delimiter
16 | # --OR--
17 (?!^)(?=["']) # a quote
18 )
)//xs or return; # extended layout
my ($quote, $quoted, $unquoted, $delim) = (($1 ? ($1,$2) : ($3,$4)),
$5, $6);
</quote>

Thanks, CC.
 
sln

What is it you want to understand about it?
It's basically three alternatives that peel chunks off the front of the
line: a double-quoted string, a single-quoted string, or an unquoted run
followed by the end of the line, a delimiter, or the start of a quote.

-sln

-------------------------
use strict;
# (with warnings enabled, the undef captures printed below would warn)
#use warnings;

my @lines = (
    q{ "Barack", "Hussein", "Obama" },
    q{ "Bar'a'ck", "test", hello, "Hussein", 'Obama" },
    q{ 'Bar'a'ck", "test", hello, "Hussein", 'Obama" },
);

my $delimiter = ',';
print "\n";

for my $line (@lines) {
    print "** start line = [$line]\n\n";
    while (length($line)) {

        # peel one chunk off the front of $line; stop when nothing matches
        $line =~ s/^
            (?:
                # double quoted string
                (")                             # $quote
                ((?>[^\\"]*(?:\\.[^\\"]*)*))"   # $quoted
            |   # --OR--
                # singe quoted string
                (')                             # $quote
                ((?>[^\\']*(?:\\.[^\\']*)*))'   # $quoted
            |   # --OR--
                # unquoted string
                (                               # $unquoted
                    (?:\\.|[^\\"'])*?
                )
                # followed by
                (                               # $delim
                    \Z(?!\n)                    # EOL
                |   # --OR--
                    (?-x:$delimiter)            # delimiter
                |   # --OR--
                    (?!^)(?=["'])               # a quote
                )
            )//xs or last;                      # extended layout

        # only one of the three alternatives matched; the rest are undef
        my ($quote, $quoted, $unquoted, $delim) = (($1 ? ($1,$2) : ($3,$4)), $5, $6);
        print "quote= <$quote> quoted= <$quoted> unquoted= <$unquoted> delim= <$delim>\n";
        print " <$line>\n";
    }
    # whatever is left in $line here could not be parsed
    print "end line = [$line]\n",'-'x20,"\n\n";
}

__END__
Output:

** start line = [ "Barack", "Hussein", "Obama" ]

quote= <> quoted= <> unquoted= < > delim= <>
<"Barack", "Hussein", "Obama" >
quote= <"> quoted= <Barack> unquoted= <> delim= <>
<, "Hussein", "Obama" >
quote= <> quoted= <> unquoted= <> delim= <,>
< "Hussein", "Obama" >
quote= <> quoted= <> unquoted= < > delim= <>
<"Hussein", "Obama" >
quote= <"> quoted= <Hussein> unquoted= <> delim= <>
<, "Obama" >
quote= <> quoted= <> unquoted= <> delim= <,>
< "Obama" >
quote= <> quoted= <> unquoted= < > delim= <>
<"Obama" >
quote= <"> quoted= <Obama> unquoted= <> delim= <>
< >
quote= <> quoted= <> unquoted= < > delim= <>
<>
end line = []
--------------------

** start line = [ "Bar'a'ck", "test", hello, "Hussein", 'Obama" ]

quote= <> quoted= <> unquoted= < > delim= <>
<"Bar'a'ck", "test", hello, "Hussein", 'Obama" >
quote= <"> quoted= <Bar'a'ck> unquoted= <> delim= <>
<, "test", hello, "Hussein", 'Obama" >
quote= <> quoted= <> unquoted= <> delim= <,>
< "test", hello, "Hussein", 'Obama" >
quote= <> quoted= <> unquoted= < > delim= <>
<"test", hello, "Hussein", 'Obama" >
quote= <"> quoted= <test> unquoted= <> delim= <>
<, hello, "Hussein", 'Obama" >
quote= <> quoted= <> unquoted= <> delim= <,>
< hello, "Hussein", 'Obama" >
quote= <> quoted= <> unquoted= < hello> delim= <,>
< "Hussein", 'Obama" >
quote= <> quoted= <> unquoted= < > delim= <>
<"Hussein", 'Obama" >
quote= <"> quoted= <Hussein> unquoted= <> delim= <>
<, 'Obama" >
quote= <> quoted= <> unquoted= <> delim= <,>
< 'Obama" >
quote= <> quoted= <> unquoted= < > delim= <>
<'Obama" >
end line = ['Obama" ]
--------------------

** start line = [ 'Bar'a'ck", "test", hello, "Hussein", 'Obama" ]

quote= <> quoted= <> unquoted= < > delim= <>
<'Bar'a'ck", "test", hello, "Hussein", 'Obama" >
quote= <'> quoted= <Bar> unquoted= <> delim= <>
<a'ck", "test", hello, "Hussein", 'Obama" >
quote= <> quoted= <> unquoted= <a> delim= <>
<'ck", "test", hello, "Hussein", 'Obama" >
quote= <'> quoted= <ck", "test", hello, "Hussein", > unquoted= <> delim= <>
<Obama" >
quote= <> quoted= <> unquoted= <Obama> delim= <>
<" >
end line = [" ]
--------------------
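
For comparison, here is a minimal sketch of calling the module's own
parse_line on the sample line from the original post. It assumes the
delimiter is passed as a pattern string (',') rather than as a /,/ match,
and the names @stripped, @kept and @bad are only for illustration. The last
call uses a made-up line with mismatched quotes, like the ones in the test
above, where the substitution cannot match and parse_line gives up.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

my $line = q{"Barack","Hussein","Obama"};

# keep = 0: the surrounding quotes are stripped from each field
my @stripped = parse_line(',', 0, $line);

# keep = 1: the quotes are left in place
my @kept = parse_line(',', 1, $line);

# mismatched quotes: the pattern fails part way through the line,
# so parse_line returns an empty list
my @bad = parse_line(',', 0, q{"Hussein", 'Obama"});

print join('|', @stripped), "\n";   # Barack|Hussein|Obama
print join('|', @kept), "\n";       # "Barack"|"Hussein"|"Obama"
print scalar(@bad), "\n";           # 0
```

An empty return list is the only sign of failure here, so a caller that
expects three fields would want to check the count before using them.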
 
ccc31807

> What is it you want to understand about it?

Line 2 -- the (?: construct
Lines 4, 7, 10 -- same thing
Line 13 -- \Z(?!\n)
Line 15 -- (?-x:$delimiter)
$delimiter would be the COMMA character
Line 17 -- (?!^)(?=["'])
the ["'] means either a single quote or a double quote

Thanks, CC.
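
The constructs listed above can each be tried in isolation. What follows is
a minimal sketch, not code from Text::ParseWords: the sample strings and the
variable names (@caps, $z_plain, $z_strict, $x_on, $x_off, $copy) are made
up for illustration, and the two-character delimiter ', ' is an assumption
chosen only to show why /x has to be switched off around $delimiter.

```perl
use strict;
use warnings;

# (?: ... ) groups without capturing, so it does not use up $1, $2, ...
my @caps = ('ab' =~ /(?:a)(b)/);                    # one capture: ('b')

# \Z matches at the very end of the string or just before a final newline.
# Adding (?!\n) rejects the second case, leaving only the true end.
my $z_plain  = ("abc\n" =~ /c\Z/)       ? 1 : 0;    # 1
my $z_strict = ("abc\n" =~ /c\Z(?!\n)/) ? 1 : 0;    # 0

# (?-x: ... ) turns the x modifier off inside the group. Without it, white
# space in an interpolated delimiter would be ignored like layout.
my $delimiter = ', ';
my $x_on  = ('a, b' =~ /a $delimiter b/x)       ? 1 : 0;   # 0
my $x_off = ('a, b' =~ /a (?-x:$delimiter) b/x) ? 1 : 0;   # 1

# (?!^)(?=["']) consumes nothing. It succeeds only when the position is
# not the start of the string and the next character is a quote.
my $copy = q{abc"def"};
my ($unquoted, $delim);
if ($copy =~ s/^([^"']*?)((?!^)(?=["']))//) {
    ($unquoted, $delim) = ($1, $2);                 # 'abc' and ''
}

print "captures: @caps\n";
print "Z alone: $z_plain, Z with (?!\\n): $z_strict\n";
print "x on: $x_on, x off: $x_off\n";
print "unquoted=<$unquoted> delim=<$delim> rest=<$copy>\n";
```

In the last case the quote itself stays in the string, which is why the
test output earlier in the thread shows an empty delim followed by a
quoted chunk on the next pass through the loop.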
 
