How to match more than 1 line?


Geoff Cox

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print "Name: $1\nAddress: $2\n";
}

Gunnar,

the above is not working for me at the moment - if you have the time
(and patience!) it would really help me if you could "talk" me through
it ...

Cheers

Geoff
 

Geoff Cox

Geoff said:
[snip]
Ideas please?!

You've discovered that regexes aren't very robust/easy/flexible when it
comes to parsing HTML. Use one of the HTML parsers on CPAN.

There seem to be a large number of them! Any recommendation?!

HTML::Parser. If you're only interested in extracting text, here's an
example to get you started:

http://search.cpan.org/src/GAAS/HTML-Parser-3.34/eg/htext

There are other example scripts in the parent directory.
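
For the kind of table rows posted later in this thread, a minimal HTML::Parser sketch might look like the one below. It is only an illustration, not the htext script linked above: the helper name td_cells is made up, and pairing the cells as label/value assumes the table always alternates a label cell with a value cell.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, not part of core Perl

# Hypothetical helper: return the whitespace-normalised text of every <TD> cell.
sub td_cells {
    my ($html) = @_;
    my @cells;
    my $in_td = 0;    # true while the parser is inside a <TD> element

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Tag names are reported in lower case by default.
        start_h => [ sub { if ( $_[0] eq 'td' ) { $in_td = 1; push @cells, '' } }, 'tagname' ],
        end_h   => [ sub { $in_td = 0 if $_[0] eq 'td' }, 'tagname' ],
        # 'dtext' is the text with HTML entities decoded.
        text_h  => [ sub { $cells[-1] .= $_[0] if $in_td }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;

    for (@cells) {
        s/\s+/ /g;    # fold line breaks inside a cell into single spaces
        s/^ //;       # trim a leading space
        s/ $//;       # trim a trailing space
    }
    return @cells;
}

# Sample rows copied from the HTML posted in this thread.
my $html = <<'HTML';
<TR>
<TD align=left width="20%" colSpan=2><B>Head Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
London N88 5XX</TD></TR>
HTML

# Assumption: cells alternate label, value, so the list can be read as a hash.
my %field = td_cells($html);
print "Name: $field{'Head Teacher'}\nAddress: $field{'Address'}\n";
```

Because the parser tracks the open <TD> element itself, a line break inside a cell needs no special handling.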

HTH - keith

Keith - thanks for the link...

Cheers

Geoff
 

Gunnar Hjalmarsson

Geoff said:
Gunnar said:
if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print "Name: $1\nAddress: $2\n";
}

the above is not working for me at the moment - if you have the
time (and patience!) it would really help me if you could "talk" me
through it ...

I'd prefer not to. Besides the character classes, which we now have
explained, and a couple of modifiers, whose meaning you can read about
in 'perldoc perlre', it doesn't include anything that was not included
in the regex you posted yourself.

I suggest that you post a minimal but complete program that others can
run and that illustrates that the above regex fails in extracting the
name and address.
 

Geoff Cox

Geoff said:
Gunnar said:
if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print "Name: $1\nAddress: $2\n";
}

the above is not working for me at the moment - if you have the
time (and patience!) it would really help me if you could "talk" me
through it ...

I'd prefer not to. Besides the character classes, which we now have
explained, and a couple of modifiers, whose meaning you can read about
in 'perldoc perlre', it doesn't include anything that was not included
in the regex you posted yourself.

OK - will do - I follow above except I would have expected that
if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)

would need
if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)<
and
Address.+?<TD[^>]+>([^<]+)

would need
Address.+?<TD[^>]+>([^<]+)<

ie the "<" to signify where the ([^<]+) ends - as you do have a "<" in
the .+?<TD[^>]+> section?! I must be missing something?

My code is as follows but it does not work!

---------------------------
use strict;

print ("name of html file?\n");
my $namehtml = <STDIN>;

print ("name of email list file?\n");
my $newhtml = <STDIN>;


open(IN, "$namehtml");
open(OUT, ">>$newhtml");

my $line = <IN>;

while (defined($line=<IN>)) {
    # if ($line =~ /&nbsp;&nbsp;(.*?)<\/H6>/i) {
    #     print OUT ("$1 \n");
    # }

    if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
                   .+?
                   Address.+?<TD[^>]+>([^<]+)
                  /isx ) {
        print OUT ("Name: $1\nAddress: $2\n");
    }
}

close (IN);
close (OUT);

-----------------------------

which I am running on, for example:


<TD align=left width="20%" colSpan=2><B>Head Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
London N88 5XX</TD></TR>
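
One likely reason the loop above prints nothing: it hands the regex one line at a time, while the pattern needs "Head Teacher" and "Address" to be in the same string, and in this HTML they sit on different lines. The sketch below runs the same regex both ways. It assumes the file is small enough to read into memory in one go. The helper name extract_fields and the in-memory filehandle (used here in place of the real file) are illustrative only.

```perl
use strict;
use warnings;

# Sample rows copied from the HTML above; stands in for the real file.
my $html = <<'HTML';
<TD align=left width="20%" colSpan=2><B>Head Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
London N88 5XX</TD></TR>
HTML

# Hypothetical helper wrapping the regex from this thread, unchanged.
sub extract_fields {
    my ($text) = @_;
    if ( $text =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
                   .+?
                   Address.+?<TD[^>]+>([^<]+)
                  /isx ) {
        return ( $1, $2 );
    }
    return;
}

# 1. Line by line, as in the loop above: no single line holds both labels.
open my $in, '<', \$html or die "Cannot open in-memory file: $!";
my $line_hits = 0;
while ( defined( my $line = <$in> ) ) {
    my @found = extract_fields($line);
    $line_hits++ if @found;
}
close $in;

# 2. Whole input as one string: undefining $/ makes <$in> return everything.
open $in, '<', \$html or die "Cannot open in-memory file: $!";
my $text = do { local $/; <$in> };
close $in;

my ( $name, $address ) = extract_fields($text);
$address =~ s/\s+/ /g;    # the address cell spans two lines in the source

print "Line-by-line matches: $line_hits\n";
print "Name: $name\nAddress: $address\n";
```

Note also that the stray `my $line = <IN>;` before the while loop reads and discards the first line of the file.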


Cheers

Geoff
 

Gunnar Hjalmarsson

Geoff said:
Gunnar said:
Geoff said:
Gunnar Hjalmarsson wrote:

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print "Name: $1\nAddress: $2\n";
}

I follow above except I would have expected that
if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)

would need
if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)<
and
Address.+?<TD[^>]+>([^<]+)

would need
Address.+?<TD[^>]+>([^<]+)<

ie the "<" to signify where the ([^<]+) ends - as you do have a "<"
in the .+?<TD[^>]+> section?! I must be missing something?

Since [^<]+ matches any character besides <, it stops matching as soon
as a < is reached. Consequently, adding those '<' characters as you
suggest does not make a difference.

If I had used .+? instead, it would have been necessary to do

(.+?)<

HTH
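
The difference between the two forms can be checked with a short script. This is only a sketch using one cell from the HTML posted earlier; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $cell = '<TD vAlign=top width="80%" colSpan=2>Fred Green</TD>';

# Negated class: stops by itself at the first '<'.
my ($negated) = $cell =~ /<TD[^>]+>([^<]+)/i;

# Same capture with an explicit '<' after it: the result does not change.
my ($negated_lt) = $cell =~ /<TD[^>]+>([^<]+)</i;

# Lazy dot: needs the '<' after it to know where to stop.
my ($lazy_lt) = $cell =~ /<TD[^>]+>(.+?)</is;

# Lazy dot with nothing after it: matches as little as it can, one character.
my ($lazy_bare) = $cell =~ /<TD[^>]+>(.+?)/is;

print "[^<]+    : $negated\n";
print "[^<]+ <  : $negated_lt\n";
print "(.+?)<   : $lazy_lt\n";
print "(.+?)    : $lazy_bare\n";
```

The first three captures all give "Fred Green"; the last gives just "F".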
 

Geoff Cox

Since [^<]+ matches any character besides <, it stops matching as soon
as a < is reached. Consequently, adding those '<' characters as you
suggest does not make a difference.

If I had used .+? instead, it would have been necessary to do

(.+?)<

Gunnar - yes I follow that now!

Geoff
 
