Split a multi-line text

bingfeng · Oct 20, 2008

Hello,
Assume I have the following string:
my $cmds = <<DOC
__begin {
abc;
def;
{foo;bar}
} __end;
__begin {
cde;
} __end;
abc;
bad;
DOC
;

I want to split it into an array whose first item is "__begin {
abc;
def;
{foo;bar}
} __end", the second item is "__begin {
cde;
} __end", and the third is "abc" and the fourth is "bad".

split obviously cannot be used here, so I tried the following regex:
my @lines = ($cmds =~ /__begin.*?__end|[^;]+/sg);

but it does not work at all. How can I do this?

regards,
bingfeng

John W. Krahn · Oct 20, 2008

my @lines = $cmds =~ /^\s*__begin(?s:.*?)__end;$|^\s*\S+;$/mg;

John
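For readers who find the one-liner dense, the same pattern can be spread out with the /x flag so that each part carries a comment. This is only a sketch: it assumes the sample text is stored unindented in $cmds, and the final clean-up step (stripping leading whitespace and the trailing ';') is an addition that is not part of the answer above.

```perl
use strict;
use warnings;

my $cmds = <<'DOC';
__begin {
abc;
def;
{foo;bar}
} __end;
__begin {
cde;
} __end;
abc;
bad;
DOC

# Same pattern as the one-liner, written with /x so each part can be commented.
my @lines = $cmds =~ /
      ^ \s* __begin (?s: .*? ) __end ; $    # a whole block, (?s:...) lets . cross newlines
    |
      ^ \s* \S+ ; $                         # or a one-word statement on a line of its own
/mgx;

# Every item still carries its trailing ';' (and any leading whitespace),
# so strip both if only the bare statement is wanted.
s/^\s+|;\z//g for @lines;

print "[$_]\n" for @lines;
```

Note that the second alternative uses \S+, so it only accepts statements without embedded whitespace.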

Dr.Ruud · Oct 20, 2008

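my @blocks;

# Pass 1 collects the __begin ... __end blocks, pass 2 the remaining statements.
for my $re ( qr/__begin.*?__end/s, qr/^[^;]+/m ) {
    # Blank out each match with spaces so that later offsets stay valid.
    while ( $cmds =~ s/\s*($re)\s*;/" " x ($+[0] - $-[0])/es ) {
        # Use $-[1], the start of the captured item. $-[0] also covers the
        # whitespace eaten by the leading \s*, including text blanked earlier,
        # so it cannot serve as a sort key.
        push @blocks, [ $-[1], $1 ];
    }
}

# Print the items in their original order.
for ( sort { $a->[0] <=> $b->[0] } @blocks ) {
    print "<--\n", $_->[1], "\n-->\n";
}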

In the end, $cmds will still have the same length, but will contain only
whitespace (and any unmatched content).
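That last remark can double as a sanity check: whatever is not whitespace afterwards was never parsed. The sketch below assumes a shortened, made-up input with one stray line that lacks a ';', and it uses the start of the captured item ($-[1]) as the sort key.

```perl
use strict;
use warnings;

# Hypothetical input: one block, one statement, and one line with no ';'.
my $cmds = "__begin {\nabc;\n} __end;\nabc;\nstray text without terminator\n";
my $length_before = length $cmds;

my @blocks;
for my $re ( qr/__begin.*?__end/s, qr/^[^;]+/m ) {
    # Replace each match with the same number of spaces.
    while ( $cmds =~ s/\s*($re)\s*;/" " x ($+[0] - $-[0])/es ) {
        push @blocks, [ $-[1], $1 ];
    }
}

# The length is unchanged, and any non-whitespace left over was never matched.
my ($leftover) = $cmds =~ /(\S(?:.*\S)?)/s;

print "length unchanged\n" if length($cmds) == $length_before;
print defined $leftover ? "unparsed: $leftover\n" : "everything was parsed\n";
```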

Jürgen Exner · Oct 20, 2008

This sounds suspiciously like an X-Y problem to me. Are you reading this
text from a file? If so, you can read it line by line: when a line starts
a block with '__begin {', keep reading until you reach the '__end;' token.

And yes, you can use split() if you do it in two steps: first split on
'__end;'. Then, in the second step, re-append the now-missing '__end' to
the items that have a leading '__begin {', and split the other items
again at ';'.

jue
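The two-step split() idea could be sketched roughly as follows. It assumes the sample text is held in $cmds and that the wanted items carry no trailing ';'. As one small extension beyond the description above, plain statements that precede a block inside the same chunk are peeled off first.

```perl
use strict;
use warnings;

my $cmds = <<'DOC';
__begin {
abc;
def;
{foo;bar}
} __end;
__begin {
cde;
} __end;
abc;
bad;
DOC

my @items;

# Step 1: split on the block terminator, which removes it from every chunk.
for my $chunk ( split /__end;/, $cmds ) {

    # A chunk may hold plain statements, then at most one unterminated block.
    my ( $plain, $block ) = $chunk =~ /\A(.*?)(__begin.*)?\z/s;

    # Step 2a: split the plain part again at ';' and trim each statement.
    for my $stmt ( split /;/, $plain ) {
        $stmt =~ s/^\s+|\s+$//g;
        push @items, $stmt if length $stmt;
    }

    # Step 2b: give the block back the '__end' that the split removed.
    push @items, $block . '__end' if defined $block;
}

print "<$_>\n" for @items;
```

Because only the plain part of a chunk is split at ';', the semicolons inside a block are left alone.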

bingfeng · Oct 20, 2008

Thank you, John, it works very well. You saved me some hours!

sln · Oct 20, 2008

bingfeng said:
my @lines = ($cmds =~ /__begin.*?__end|[^;]+/sg);

Change [^;] to [^\s;]:
my @lines = ($cmds =~ /__begin.*?__end|[^\s;]+/sg);

You were on the right track. However, [^;]+ also matches whitespace, so
it starts matching at the whitespace in front of a '__begin' and grabs
everything up to the next ';' (' __begin { .. abc'), then the next chunk,
and so on. At those positions '__begin.*?__end' never gets a chance to
match. Excluding whitespace as well, with [^\s;], forces a match to start
at '__begin' itself, so the block alternative is tried there first.
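To see the difference, the two patterns can be run side by side. This is a small sketch that assumes the sample text is stored unindented in $cmds. In that case the first block still matches, and the damage starts at the newline in front of the second '__begin'.

```perl
use strict;
use warnings;

my $cmds = <<'DOC';
__begin {
abc;
def;
{foo;bar}
} __end;
__begin {
cde;
} __end;
abc;
bad;
DOC

my @broken = $cmds =~ /__begin.*?__end|[^;]+/sg;
my @fixed  = $cmds =~ /__begin.*?__end|[^\s;]+/sg;

printf "original pattern: %d items\n", scalar @broken;
printf "with [^\\s;]:      %d items\n", scalar @fixed;

# Show where the original pattern goes wrong: the second item starts at the
# newline in front of '__begin' and stops before the first ';'.
( my $second = $broken[1] ) =~ s/\n/\\n/g;
print "second item of the original: \"$second\"\n";
```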

sln

--------------------

use strict;
use warnings;

my $cmds = <<DOC
__begin {
abc;
def;
{foo;bar}
} __end;
__begin {
cde;
} __end;
abc;
bad;
DOC
;

my @lines = ($cmds =~ /__begin.*?__end|[^\s;]+/sg);

for (my $i = 0; $i < @lines; $i++) {
print "\n\$lines[$i] = \n\n\"$lines[$i]\"\n";
}

__END__

output:

$lines[0] =

"__begin {
abc;
def;
{foo;bar}
} __end"

$lines[1] =

"__begin {
cde;
} __end"

$lines[2] =

"abc"

$lines[3] =

"bad"

bingfeng · Oct 21, 2008

You are right. Thanks for your explanation. My sample was oversimplified:
a standalone statement may contain more than one word, separated by
spaces, as in the following test data:
my $cmds = <<DOC
__begin {
abc sss;
def;
{foo;bar}
} __end;
__begin {
cde;
} __end;
abc kkk;
bad fde;
DOC
;

Your solution gives the following Dumper result:
$VAR1 = '__begin {
abc sss;
def;
{foo;bar}
} __end';
$VAR2 = '__begin {
cde;
} __end';
$VAR3 = 'abc';
$VAR4 = 'kkk';
$VAR5 = 'bad';
$VAR6 = 'fde';

That's not what I want. Apart from John's solution, I have no other
solution. Thank you.


sln · Oct 21, 2008

That's too bad. You made a good attempt, and I gave you credit by
saying you almost had it right the first time. The regex was only
altered slightly from what you yourself tried.

I didn't write a regex for you. Because if I did that, you could
always come back and say for example:

#>You are right. Thanks for your explanation. My sample is some
#>oversimple. the standalone sentence may contain other word and space,
#>with following test message ...

But you didn't say that in the first place.

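bingfeng said:
that's not what I want. Apart from John's solution, I have no other
solution.

Think again ... you just invalidated his regex. With your new data, \S+
cannot match a statement such as 'abc kkk', so only the two blocks are
found: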

my @lines = $str =~ /^\s*__begin(?s:.*?)__end;$|^\s*\S+;$/mg;

$lines[0] =
" __begin {
abc sss;
def;
{foo;bar}
} __end;"

$lines[1] =
" __begin {
cde;
} __end;"

What are you going to do now?
We're still in the extremely simple stage.
In fact, the more you add, the simpler it gets.

sln

-------------------------------

Version 2

#################
# Misc Parse 2
#################

use strict;
use warnings;

# the old
my $cmds1 = <<DOC1
__begin {
abc;
def;
{foo;bar}
} __end;
__begin {
cde;
} __end;
abc;
bad;
DOC1
;

# the new
my $cmds2 = <<DOC2
__begin {
abc sss;
def;
{foo;bar}
} __end;
__begin {
cde;
} __end;
abc kkk;
bad fde;
DOC2
;

my $str = $cmds2;

my @lines = ($str =~ /\s*(__begin.*?__end|.*?);/sg);

for (my $i = 0; $i < @lines; $i++) {
print "\n\$lines[$i] = \n\n\"$lines[$i]\"\n";
}

__END__

output:

$lines[0] =

"__begin {
abc sss;
def;
{foo;bar}
} __end"

$lines[1] =

"__begin {
cde;
} __end"

$lines[2] =

"abc kkk"

$lines[3] =

"bad fde"

bingfeng · Oct 21, 2008

Wow, I have to admit your regex is simpler, easier to understand, and
elegant. I'll study what you said carefully. Anyway, thank you very
much.
