Hi,
I'm using Perl 5.8.8 on Mac 10.5.6. I found this script online for
matching a pattern across multiple lines. The problem is, it only
prints out one instance of the expression, and I would like it to
print out all instances. What can I change so that it will print out
all instances?
#!/usr/bin/perl
use strict;
use warnings;
open(my $file, "<", "myfile.txt")
or die "Can't open file: $!";
my $text = do { local $/; <$file> };
if ($text =~ /(<\s*script[^<]*>.*?<\/script>)/gs) {
print $1;
}
Thanks, - Dave
'while()' should work as others have said.
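For reference, here is a minimal sketch of that change, keeping your regex as posted. With /g in scalar context each successful match resumes where the previous one ended, so a 'while' loop visits every instance, whereas 'if' stops after the first. The helper name find_script_blocks and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): collect every
# match of the posted regex instead of only the first one.
sub find_script_blocks {
    my ($text) = @_;
    my @found;
    # /g in scalar context continues from the end of the previous match,
    # so the loop body runs once per instance.
    while ($text =~ /(<\s*script[^<]*>.*?<\/script>)/gs) {
        push @found, $1;
    }
    return @found;
}

# Made-up sample input with two script blocks, one spanning several lines.
my $sample = "<script>one</script>\n<p>x</p>\n<script type='a'>\ntwo\n</script>\n";
print "$_\n" for find_script_blocks($sample);
```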
The above regex should take into account these forms:
<tag>
<tag/>
<tag attr> content </tag>
<tag attr/>
Try this. It takes into account all of the above forms and
handles attributes fairly well without relying on [^<]*, which
fails when a literal '<' appears in an attribute value.
Handling attribute/value pairs correctly and accounting for all
valid forms both matter; it all goes toward partitioning the
data.
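To make the attribute point concrete, here is a small sketch. The $attr pattern and the helper first_script_block are my own illustration, not the regexes used in html_rx.pl: quoted values are matched as a unit, so a '<' or '>' inside the quotes does not end the tag. On the made-up sample, the [^<]* version finds nothing because it stops at the '<' inside the attribute value.

```perl
use strict;
use warnings;

# Assumed attribute pattern (illustration only): a name, optionally
# followed by = and a double-quoted, single-quoted, or bare value.
# Quoted values may contain '<' or '>'.
my $attr = qr{ \s+ [\w:-]+ (?: \s* = \s* (?: "[^"]*" | '[^']*' | [^\s"'>]+ ) )? }x;

# Hypothetical helper: return the first <script ...>...</script> block,
# or undef if there is none. It does not handle nesting.
sub first_script_block {
    my ($text) = @_;
    return $text =~ m{ ( <script (?: $attr )* \s* > .*? </script \s* > ) }xs
        ? $1
        : undef;
}

# Made-up sample: the attribute value contains a literal '<'.
my $sample = qq{<script title="a<b">var x = 1;</script>};
my $naive  = $sample =~ /(<\s*script[^<]*>.*?<\/script>)/s ? "match" : "no match";
my $aware  = defined first_script_block($sample)           ? "match" : "no match";
print "[^<]* version:      $naive\n";
print "attr-aware version: $aware\n";
```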
Also, this is a complex parse. It involves multiple atomic
markup units, which are, debatably, <tag>-style markup and content,
content being anything in the current state that is not markup.
Ideally, the units are parsed one at a time: recording is turned on
at the start element 'script', then off at the end element 'script'.
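The on/off recording idea can be sketched without any regex recursion by walking the tags and counting depth, which also works on Perl 5.8. This is only a rough illustration under two assumptions: the tag is a literal name (not a regex), and no attribute value contains '>'. The helper name extract_blocks is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the outermost <$tag>...</$tag> blocks in $text.
# Assumes $tag is a literal name and attribute values contain no '>'.
sub extract_blocks {
    my ($text, $tag) = @_;
    my (@blocks, $start);
    my $depth = 0;
    while ($text =~ m{<(/?)\Q$tag\E\b[^>]*>}g) {
        my ($closing, $from, $to) = ($1, $-[0], $+[0]);
        # Self-terminated tags such as <script /> neither open nor close a block.
        next if substr($text, $to - 2, 2) eq '/>';
        if (!$closing) {
            $start = $from if $depth == 0;    # recording on
            $depth++;
        }
        elsif ($depth > 0) {
            $depth--;
            # recording off once the outermost block is closed
            push @blocks, substr($text, $start, $to - $start) if $depth == 0;
        }
        # A stray end tag at depth 0 is ignored.
    }
    return @blocks;
}

# Made-up sample: a stray end tag, one nested block, one self-terminated tag.
my $sample = "</script>\n<script>a<script>b</script>c</script>\n<script />\n";
print "$_\n" for extract_blocks($sample, 'script');
```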
As it is now, the regex you are using won't correctly parse the
$text string below.
Late addition:
But alas, no simple regex is going to handle nesting correctly
unless it uses recursion. The script below handles nested tags, but
requires Perl 5.10 or later.
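Before the full script, here is the recursion mechanism in isolation, as a stripped-down sketch with a fixed <div> tag and no attribute handling. The pattern and the helper name outer_divs are made up for illustration. Inside group 1, the regex either consumes one character that does not begin a <div> or </div> tag, or re-enters group 1 via (?1) to swallow a complete nested block.

```perl
use strict;
use warnings;
use 5.010;    # (?1) recursion needs Perl 5.10 or later

# Group 1 matches <div> ... </div>. Inside, it either consumes one
# character that does not start a <div> or </div> tag, or recurses
# into group 1 to match a nested block as a whole.
my $nested = qr{
    (
        <div>
        (?: (?! </?div> ) . | (?1) )*
        </div>
    )
}xs;

# Hypothetical helper: return each outermost <div>...</div> block.
sub outer_divs {
    my ($text) = @_;
    my @blocks;
    while ($text =~ /$nested/g) {
        push @blocks, $1;
    }
    return @blocks;
}

# Made-up sample: one nested block followed by one flat block.
print "$_\n" for outer_divs("<div>a<div>b</div>c</div> <div>d</div>");
```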
usage: html_rx.pl [<tag name> [file name]] - defaults are used if no params are given (tag 'script', input from __DATA__)
Cmd line examples:
*> html_rx.pl form junk.html - Finds 'form' blocks in html file
*> html_rx.pl script junk.html - Finds 'script' blocks in html file
*> html_rx.pl "(?i)script|object" junk.html
- Finds either 'script' or 'object' blocks, case insensitive
(good little markups will be properly nested, i.e. those that have terminators)
-sln
-------------
*> perl html_rx.pl "(?i)script|object"
File name: __DATA__
Tag name: (?i)script|object
-------------------- ** type 1
<script attr = "asdf" attr = 'wafsd' />
-------------------- ** type 2
<script>
use strict;
use warnings;
print "hello world, I'm a <tag>\n";
<script a = "it's" b= 'terminated'/>
<object></object>
<script>
// comment me out c++ style
/* now c style
*/
</script>
</script>
-------------------- ** type 1
<script />
====================
Summary
File name: __DATA__
Tag name: (?i)script|object
type1 <tag, tag-attr /> = 2
type2 <tag, tag-attr>..</tag> = 1
*>
-------------
## html_rx.pl
## -sln
use strict;
use warnings;
require 5.010_000;
# usage: html_rx.pl [<tag name> [file name]]
# ---------------------------------------------
my ($tag,$fname) = @ARGV;
my $text;
$tag = 'script' unless defined $tag;
if (defined $fname) {
open my $fh, '<', $fname or die "Can't open file '$fname' : $!";
$text = join '',<$fh>;
close $fh;
} else {
$fname = '__DATA__';
$text = join '',<DATA>;
}
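# Building blocks. $tag is interpolated as a regex, so an alternation
# such as "(?i)script|object" works as a tag name.
my ($terminated, $open, $close) =
(
qr {< (?:$tag) (?:\s+[^>]*)? />}x,                  # <tag [attr] />
qr {< (?:$tag) (?:\s+[^>]*? \s*[^/]> | \s*>) }x,    # <tag [attr] >
qr {</ (?:$tag) \s*> }x                             # </tag>
);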
my $rx = qr {
(
(?: $terminated ) # <tag [attr] />
)
| # OR ...
(
(?: $open ) # <tag [attr] >
(?:
(?: (?!$open|$close) . )++ # possessive
|
(?2) # recurse group 2
)*
(?: $close ) # </tag>
)
}xs;
print "\n",<<INFO;
File name: $fname
Tag name: $tag\n
INFO
my ($cnt1,$cnt2) = (0,0);
while ( $text =~ /$rx/g) {
print '-'x20;
if (defined $1) {
print " ** type 1\n",$1,"\n" ;
$cnt1++;
} else {
print " ** type 2\n",$2,"\n" ;
$cnt2++;
}
}
print "\n",'='x20,"\nSummary\n",<<SUMMARY;
File name: $fname
Tag name: $tag
type1 <tag, tag-attr /> = $cnt1
type2 <tag, tag-attr>..</tag> = $cnt2\n
SUMMARY
__DATA__
<script attr = "asdf" attr = 'wafsd' />
</script>
<script>
<script>
<script>
use strict;
use warnings;
print "hello world, I'm a <tag>\n";
<script a = "it's" b= 'terminated'/>
<object></object>
<script>
// comment me out c++ style
/* now c style
*/
</script>
</script>
<script>
<script />
<notme>