Split a multi-line text

bingfeng

Hello,
Assume I have the following string:
my $cmds = <<DOC
__begin {
abc;
def;
{foo;bar}
} __end;
__begin {
cde;
} __end;
abc;
bad;
DOC
;

I want to split it into an array: the first item is "__begin {
abc;
def;
{foo;bar}
} __end", the second item is "__begin {
cde;
} __end", and the third is "abc" and the fourth is "bad".

split obviously cannot be used here, so I tried the following regex:
my @lines = ($cmds =~ /__begin.*?__end|[^;]+/sg);

But it does not work at all. How can I do this?

regards,
bingfeng
 
John W. Krahn

my @lines = $cmds =~ /^\s*__begin(?s:.*?)__end;$|^\s*\S+;$/mg;



John
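
A minimal sketch of how the pattern above behaves on the sample input. The /m flag lets ^ and $ match at every line boundary, and (?s:...) lets "." cross newlines only inside the block alternative. Note that each match keeps its trailing ";" (and any leading whitespace), so the trimming loop and the @items name below are additions for illustration, not part of John's answer.

```perl
use strict;
use warnings;

# Sample input from the question.
my $cmds = <<'DOC';
__begin {
abc;
def;
{foo;bar}
} __end;
__begin {
cde;
} __end;
abc;
bad;
DOC

# The pattern from the post, unchanged.
my @lines = $cmds =~ /^\s*__begin(?s:.*?)__end;$|^\s*\S+;$/mg;

# Extra step (an assumption about the wanted output): strip leading
# whitespace and the trailing semicolon from every match.
my @items = @lines;
for (@items) {
    s/^\s+//;
    s/;$//;
}

print "[$_]\n" for @items;
```

Because the second alternative uses \S+, a standalone line that contains a space (for example "abc kkk;") is not matched by this pattern.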
 
Dr.Ruud

my @blocks;

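for my $re ( qr/__begin.*?__end/s, qr/^[^;]+/m ) {
    # blank out each match with spaces so that all offsets stay valid
    while ($cmds =~ s/\s*($re)\s*;/" "x ($+[0] - $-[0])/es) {
        # $-[1] is where the captured item starts; $-[0] would also count
        # the leading whitespace, including spans blanked by earlier passes
        push @blocks, [ $-[1], $1 ];
    }
}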

for (sort { $a->[0] <=> $b->[0] } @blocks) {
print "<--\n", $_->[1], "\n-->\n";
}


In the end, $cmds will still have the same length, but will contain only
whitespace (and any unmatched content).
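
For readers unfamiliar with the special arrays @- and @+, here is a small sketch on a made-up string (the sample text and all variable names are hypothetical, not taken from the post). $-[0] and $+[0] are the start and end offsets of the whole match, while $-[1] and $+[1] are those of the first capture group. Overwriting the matched span with the same number of blanks keeps the string length unchanged, which is why offsets recorded in different passes can still be compared and sorted.

```perl
use strict;
use warnings;

my $text = "xx  abc;  yy";    # made-up sample

my ($whole_start, $whole_end, $group_start, $group_end);
if ($text =~ /\s*(abc)\s*;/) {
    ($whole_start, $whole_end) = ($-[0], $+[0]);   # span of the entire match
    ($group_start, $group_end) = ($-[1], $+[1]);   # span of capture group 1
}

# Blank out the matched span in a copy; every other offset stays valid.
my $blanked = $text;
my $len     = $whole_end - $whole_start;
substr($blanked, $whole_start, $len, ' ' x $len);

printf "whole match: %d..%d, group 1: %d..%d\n",
    $whole_start, $whole_end, $group_start, $group_end;
print length($text) == length($blanked) ? "same length\n" : "length changed\n";
```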
 
Jürgen Exner

This sounds suspiciously like an X-Y problem to me. Are you reading this
text from a file? If yes, then if a block starts with '__begin{' you can
read that block until the '__end;' token is reached.
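
If the text does come from a file, the block-reading idea could look roughly like this. It is only a sketch under stated assumptions: the in-memory filehandle stands in for a real file, the names @items, @block and $in_block are made up here, and outside of blocks it expects one statement per line.

```perl
use strict;
use warnings;

# Stand-in for a real file: an in-memory filehandle over sample text.
my $text = <<'DOC';
__begin {
abc sss;
def;
{foo;bar}
} __end;
__begin {
cde;
} __end;
abc kkk;
bad fde;
DOC
open my $fh, '<', \$text or die "cannot open in-memory file: $!";

my (@items, @block);
my $in_block = 0;
while (my $line = <$fh>) {
    chomp $line;
    if (!$in_block && $line =~ /^\s*__begin\b/) {
        $in_block = 1;
        $line =~ s/^\s+//;                  # drop indentation before __begin
    }
    if ($in_block) {
        push @block, $line;                 # collect lines until __end;
        if ($line =~ /__end\s*;\s*$/) {
            $block[-1] =~ s/\s*;\s*$//;     # drop the terminating ";"
            push @items, join("\n", @block);
            @block    = ();
            $in_block = 0;
        }
    }
    elsif ($line =~ /^\s*(.*?)\s*;\s*$/) {
        push @items, $1;                    # standalone statement, no ";"
    }
}
close $fh;

print "[$_]\n" for @items;
```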

And yes, you can use split() if you do it in two steps:
First split on '__end;'. In the second step, restore the now-missing
'__end;' on those items that have a leading '__begin{', and split the
others again at ';'.
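
A rough sketch of the two-step split, not a definitive implementation. The helper plain_statements() and the variable names are hypothetical. Two details go beyond the description above and are assumptions: the repaired block gets '__end' without the ';' so that it matches the output asked for in the question, and statements that precede a block inside the same chunk are separated from it first.

```perl
use strict;
use warnings;

my $cmds = <<'DOC';
__begin {
abc sss;
def;
{foo;bar}
} __end;
__begin {
cde;
} __end;
abc kkk;
bad fde;
DOC

# Step 1: split on the block terminator. Every chunk except the last
# one has lost its "__end;".
my @chunks = split /__end;/, $cmds, -1;
my $tail   = pop(@chunks) // '';    # text after the last "__end;"

my @items;
for my $chunk (@chunks) {
    my ($plain, $block) = ($chunk, undef);
    if ($chunk =~ /\A(.*?)(__begin.*)\z/s) {
        ($plain, $block) = ($1, $2);
    }
    push @items, plain_statements($plain);
    push @items, $block . '__end' if defined $block;   # step 2: repair
}
push @items, plain_statements($tail);

# Step 2 for non-block text: split at ";" and trim each statement.
sub plain_statements {
    my ($text) = @_;
    my @out;
    for my $stmt (split /;/, $text) {
        $stmt =~ s/^\s+|\s+$//g;
        push @out, $stmt if length $stmt;
    }
    return @out;
}

print "[$_]\n" for @items;
```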

jue
 
bingfeng

Thank you, John, it works very well. You helped me save some hours!
 
sln

bingfeng said:
my @lines = ($cmds =~ /__begin.*?__end|[^;]+/sg);
                                         ^^
my @lines = ($cmds =~ /__begin.*?__end|[^\s;]+/sg);

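You were on the right track. However, [^;]+ also matches whitespace, so at a
position just before a block it grabs the leading whitespace together with
'__begin { .. abc' up to the first ';', then the next chunk, then the next.
'__begin.*?__end' never gets a chance to match there. By also excluding
whitespace, [^\s;], the match can only start at '__begin' itself, where the
first alternative is tried first.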

sln

--------------------

use strict;
use warnings;

my $cmds = <<DOC
__begin {
abc;
def;
{foo;bar}
} __end;
__begin {
cde;
} __end;
abc;
bad;
DOC
;

my @lines = ($cmds =~ /__begin.*?__end|[^\s;]+/sg);

for (my $i = 0; $i < @lines; $i++) {
print "\n\$lines[$i] = \n\n\"$lines[$i]\"\n";
}

__END__

output:

$lines[0] =

"__begin {
abc;
def;
{foo;bar}
} __end"

$lines[1] =

"__begin {
cde;
} __end"

$lines[2] =

"abc"

$lines[3] =

"bad"
 
bingfeng

You are right. Thanks for your explanation. My sample was somewhat
oversimplified. A standalone statement may contain several words separated
by spaces, as in the following test input:
my $cmds = <<DOC
__begin {
abc sss;
def;
{foo;bar}
} __end;
__begin {
cde;
} __end;
abc kkk;
bad fde;
DOC
;

Your solution gives the following Dumper result:
$VAR1 = '__begin {
abc sss;
def;
{foo;bar}
} __end';
$VAR2 = '__begin {
cde;
} __end';
$VAR3 = 'abc';
$VAR4 = 'kkk';
$VAR5 = 'bad';
$VAR6 = 'fde';

That's not what I want. Apart from John's solution, I have no other
solution. Thank you.
 
sln

That's too bad. You made a good attempt and I gave
you credit by saying you almost had it right the first time.
And the regex was altered only slightly from what you yourself tried.

I didn't write a regex for you. Because if I did that, you could
always come back and say for example:

#>You are right. Thanks for your explanation. My sample is some
#>oversimple. the standalone sentence may contain other word and space,
#>with following test message ...

But you didn't say that in the first place.

bingfeng said:
that's not what I want. Apart from John's solution, I have no other
solution.
^^^^^^^^^^^^^
Think again ... you just invalidated his regex.

my @lines = $str =~ /^\s*__begin(?s:.*?)__end;$|^\s*\S+;$/mg;

$lines[0] =
" __begin {
abc sss;
def;
{foo;bar}
} __end;"

$lines[1] =
" __begin {
cde;
} __end;"

What are you going to do now?
We're still in the extremely simple stage.
In fact, the more you add, the simpler it gets.

sln

-------------------------------

Version 2

#################
# Misc Parse 2
#################

use strict;
use warnings;

# the old
my $cmd1 = <<DOC1
__begin {
abc;
def;
{foo;bar}
} __end;
__begin {
cde;
} __end;
abc;
bad;
DOC1
;

# the new
my $cmds2 = <<DOC2
__begin {
abc sss;
def;
{foo;bar}
} __end;
__begin {
cde;
} __end;
abc kkk;
bad fde;
DOC2
;

my $str = $cmds2;

my @lines = ($str =~ /\s*(__begin.*?__end|.*?);/sg);

for (my $i = 0; $i < @lines; $i++) {
print "\n\$lines[$i] = \n\n\"$lines[$i]\"\n";
}

__END__

output:

$lines[0] =

"__begin {
abc sss;
def;
{foo;bar}
} __end"

$lines[1] =

"__begin {
cde;
} __end"

$lines[2] =

"abc kkk"

$lines[3] =

"bad fde"
 
bingfeng

Wow, I have to admit your regex is simpler, easier to understand, and
elegant. I'll study what you said carefully. Anyway, thank you very
much.
 
