slurp not working? ideas please!

Geoff Cox

Keith,

I am trying to implement your ideas but part is OK and part not ...

I am able to use

: main::choice( $attr->{value});

to pass the option data to the sub choice, but not clear how to get
the $text printed to OUT etc (ie into the results.htm file).

Also, not clear how to stop the data being printed to the screen? This
will tell you that I am not at all clear re how your code works!! eg
have I structured the code correctly? Would really appreciate it if
you could give brief explanation.

Cheers

Geoff


use strict;
use warnings;
use HTML::Parser;

my $data;
open (IN, "test.htm") || die "test.htm: $!";
{ local $/; $data = <IN> }

my @to_print = qw[h2 p];
my @get_attr = qw[option];
my $current_tag = '';

sub start_tag {
my ($tag, $attr, $text) = @_;
$current_tag = $tag;
( grep { $current_tag eq $_ } @to_print )
? print $text
# ? main::printtext($text)
# : print $attr->{value};
: main::choice( $attr->{value});
}

sub default {
print shift if grep { $current_tag eq $_ } @to_print;
}


package main;

open( OUT, ">>results.htm" )
|| die "results.htm: $!"; #
print OUT ("<table width='100%' border='1'><tr> \n");

my $parser = HTML::Parser->new
(
report_tags => [ @to_print, @get_attr ],
default_h => [ \&default, 'text' ],
start_h => [ \&start_tag, 'tagname, attr, text' ],
)->parse( $data ) or die $!;


sub printtext {
# list assignment takes the first argument; "my $text = @_" stores the count
my ($text) = @_;
print OUT ("<p> $text </p> \n");
}

sub choice {
my ($path) = @_;

if ( $path =~ /docs\/gcse\/student-activities\/finance/ ) {
intro($path);
gcsestudentactivitiesfinance($path);
} elsif ( $path =~ /docs\/gcse\/student-activities\/marketing/ ) {
intro($path);
gcsestudentactivitiesmarketing($path);
}

}
 
Geoff Cox

That is somewhat unfortunate as the problem in question does not occur
when I run your code. It's a bit tricky to address a bug that is not
reproducible.

Tassilo,

Have you seen the post from Keith? He also got the same wrong order as
me! I have got part of his code working. You can see from my reply to
him that I cannot see yet how the $text is sent to the results.htm
file.

Cheers

Geoff
 
Tassilo v. Parseval

Also sprach Geoff Cox:
Have you seen the post from Keith? He also got the same wrong order as
me! I have got part of his code working.

No said:
I say that because I ran the code you supplied and the output file
*was* in the same order as your test.htm. Maybe the problem is
elsewhere?

So I didn't make that up. The code you supplied was ok and should work.
The fact that it doesn't means that the problem is indeed somewhere
else. I suspected it could be buffering, but apparently it's not.

Tassilo
 
Tassilo v. Parseval

Also sprach Geoff Cox:
Keith,

I am trying to implement your ideas but part is OK and part not ...

I am able to use

: main::choice( $attr->{value});

to pass the option data to the sub choice, but not clear how to get
the $text printed to OUT etc (ie into the results.htm file).

Also, not clear how to stop the data being printed to the screen? This
will tell you that I am not at all clear re how your code works!! eg
have I structured the code correctly? Would really appreciate it if
you could give brief explanation.

The main difference to your previous subclassing approach is that you no
longer need two namespaces (main and MyParser). It all runs in main.
Furthermore, no file-handle needs to be passed around any longer. OUT is
implicitly global.
use strict;
use warnings;
use HTML::Parser;

my $data;
open (IN, "test.htm") || die "test.htm: $!";
{ local $/; $data = <IN> }

my @to_print = qw[h2 p];
my @get_attr = qw[option];
my $current_tag = '';

sub start_tag {
my ($tag, $attr, $text) = @_;
$current_tag = $tag;
( grep { $current_tag eq $_ } @to_print )
? print $text
# ? main::printtext($text)
# : print $attr->{value};
: main::choice( $attr->{value});

Bah. That's a weird use of the ?: operator. Use a proper if:

if (grep { $current_tag eq $_ } @to_print) {
print OUT $text;
} else {
choice($attr->{value});
}

Note that you no longer need to package qualify choice(). It's all
defined in package main already.
}

sub default {
print shift if grep { $current_tag eq $_ } @to_print;

print OUT shift if grep { $current_tag eq $_ } @to_print;
}


package main;

This is no longer needed either.
open( OUT, ">>results.htm" )
|| die "results.htm: $!"; #
print OUT ("<table width='100%' border='1'><tr> \n");

my $parser = HTML::Parser->new
(
report_tags => [ @to_print, @get_attr ],
default_h => [ \&default, 'text' ],
start_h => [ \&start_tag, 'tagname, attr, text' ],
)->parse( $data ) or die $!;


sub printtext {
# list assignment takes the first argument; "my $text = @_" stores the count
my ($text) = @_;
print OUT ("<p> $text </p> \n");
}

sub choice {
my ($path) = @_;

if ( $path =~ /docs\/gcse\/student-activities\/finance/ ) {
intro($path);
gcsestudentactivitiesfinance($path);

All those functions (intro() etc.) may also use OUT as output
file-handle. I can't say it often enough: It's all in the main package
now.
} elsif ( $path =~ /docs\/gcse\/student-activities\/marketing/ ) {
intro($path);
gcsestudentactivitiesmarketing($path);
}

}

As you see, just printing to OUT is fine. All you have to ensure is that

open OUT, ...

happens before you use the file-handle. As far as I see, this is the
case in your program.
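
A minimal sketch of that point, using only core Perl: a sub prints to the global handle OUT without having it passed in, as long as the open has run first. The in-memory file and the body of intro() are stand-ins invented for this example; the real script would append to results.htm instead.

```perl
use strict;
use warnings;

# Hypothetical stand-in for intro() etc.: it prints to the global
# bareword handle OUT without receiving the handle as an argument.
sub intro {
    my ($path) = @_;
    print OUT "<p>intro for $path</p>\n";
}

my $buffer = '';
# An in-memory file keeps the sketch self-contained; the real script
# would use something like open(OUT, '>>', 'results.htm').
open(OUT, '>', \$buffer) or die "cannot open in-memory file: $!";

intro('docs/gcse/student-activities/finance');   # open() has already run
print OUT "<p>done</p>\n";

close(OUT) or die "close failed: $!";
print $buffer;
```

Everything written through OUT ends up in one place, in call order, regardless of which sub did the printing.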

Tassilo
 
ko

Geoff said:
Tassilo,

Have you seen the post from Keith? He also got the same wrong order as
me! I have got part of his code working. You can see from my reply to
him that I cannot see yet how the $text is sent to the results.htm
file.

Cheers

Geoff

I think you misread my reply - or maybe it wasn't clear. The results I
get from your previous code are the *same* for both the 'test.htm' and
the 'results.htm' file, in other words like Tassilo I *can't* reproduce
your problem.

keith
 
Geoff Cox

I think you misread my reply - or maybe it wasn't clear. The results I
get from your previous code are the *same* for both the 'test.htm' and
the 'results.htm' file, in other words like Tassilo I *can't* reproduce
your problem.

Keith,

sorry my mistake! hope you can have a look at my efforts with your own
code.

Cheers

Geoff
 
Geoff Cox

So I didn't make that up. The code you supplied was ok and should work.
The fact that it doesn't means that the problem is indeed somewhere
else. I suspected it could be buffering, but apparently it's not.

Tassilo,

sorry - my mistake. not sure where to go on that!

Cheers

Geoff
 
Geoff Cox

On Mon, 26 Apr 2004 08:40:13 +0200, "Tassilo v. Parseval"

Tassilo,

thanks for the comments - below - will take note.

Cheers

Geoff
[snip: quoted copy of Tassilo's reply]
 
Geoff Cox

On Mon, 26 Apr 2004 08:40:13 +0200, "Tassilo v. Parseval"

Tassilo and Keith,

When I use the line below- It produces the same *wrong* order, ie

<h2>first
<related text>
<h2>second
<related text>

then

first option data
second option data

rather than

<h2>first
<related text>
<related option data>

<h2>second
<related text>
<related option data>

So as you say Tassilo, it looks as if the problem is elsewhere...

Geoff
 
ko

Geoff said:
Keith,

I am trying to implement your ideas but part is OK and part not ...

I am able to use

: main::choice( $attr->{value});

to pass the option data to the sub choice, but not clear how to get
the $text printed to OUT etc (ie into the results.htm file).

Tassilo explained this part quite well already, so I won't go over it again.
Also, not clear how to stop the data being printed to the screen? This
will tell you that I am not at all clear re how your code works!! eg
have I structured the code correctly? Would really appreciate it if
you could give brief explanation.

Before the explanation: if you're *really* going to understand either
this approach or Tassilo's, you need to go over the documentation again
- and again and again if necessary. And believe me, I understand that it
may be hard for you, and I can relate. My first attempt at using
HTML::Parser was very frustrating (I'm pretty much a novice: I've only
been serious about *learning* Perl for about nine months, with no other
programming experience), and I pretty much gave up in favor of
HTML::TreeBuilder, which has an interface that made more sense to me at
the time. Until you get a basic understanding of the documentation, you
will have problems with *either* approach.
Cheers

Geoff
[snip/rearranged]

my $parser = HTML::Parser->new
(
report_tags => [ @to_print, @get_attr ],
default_h => [ \&default, 'text' ],
start_h => [ \&start_tag, 'tagname, attr, text' ],
)->parse( $data ) or die $!;

To start, you pass the parser explicit options in the new() constructor.
The report_tags option/method restricts start and end events to the tags
listed, so no other tags reach your handlers. Text between tags is still
reported, which is why the handlers also check $current_tag.

The next two options, 'default_h' and 'start_h', are event handlers. Put
simply, the parser goes through the HTML and is able to differentiate
between tags and text. You need to explicitly tell the parser what to do
for each event you are interested in by assigning a subroutine reference
(\&default and \&start_tag). The string after the code ref is an
'Argspec'. It lists the pieces of information you are interested in,
which will be passed to the coderef in that order - you can request as
many as you like, as long as each is a valid argspec name.

[snip/rearranged]
sub start_tag {
my ($tag, $attr, $text) = @_;
$current_tag = $tag;
( grep { $current_tag eq $_ } @to_print )
? print $text
# ? main::printtext($text)
# : print $attr->{value};
: main::choice( $attr->{value});
}

This handler will be invoked when any *start* tag is recognized. Again,
the only start tags which will be processed are 'h2', 'p', and 'option',
since you told the parser so with 'report_tags' in the constructor. The
Argspec is 'tagname, attr, text', which is what's passed to the sub.

So for instance, with the following markup:

<h2 id='id' align='center'>Level Two Heading</h2>

the variables are assigned:

$tag = 'h2';
$attr = { id => 'id', align => 'center' };
$text = "<h2 id='id' align='center'>";
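
One way to check those assignments for yourself is a sketch like the one below, which assumes HTML::Parser is installed. It registers only a start handler with the same Argspec and stores whatever arrives in @_; the @seen array is a name made up for the example.

```perl
use strict;
use warnings;
use HTML::Parser;

my @seen;   # one [ $tag, $attr, $text ] entry per start tag

my $parser = HTML::Parser->new(
    api_version => 3,
    # The Argspec 'tagname, attr, text' decides what arrives in @_.
    start_h => [ sub { push @seen, [@_] }, 'tagname, attr, text' ],
);

$parser->parse("<h2 id='id' align='center'>Level Two Heading</h2>");
$parser->eof;   # signal end of input

my ($tag, $attr, $text) = @{ $seen[0] };
print "tag:  $tag\n";
print "attr: $_ => $attr->{$_}\n" for sort keys %$attr;
print "text: $text\n";
```

Swapping other argspec names into the string is a quick way to see what else the parser can hand to a handler.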

Then you have to figure out what to do with the variables. You only want
to test for two conditions: (a) if $tag is 'h2' or 'p', print; else (b)
do something with the 'option' tag *if* it has a 'value' attribute. I
saw Tassilo's remark on my use of the ternary operator, which I like -
maybe I should have been more explicit. Anyway, note that since you
already have the complete start tag in $text, you don't have to bother
hard-coding the tag as you do in your printtext sub.
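
The two conditions can also be spelled out with if/elsif. This is only a sketch in core Perl: the handler is called by hand rather than by the parser, a hash lookup replaces the grep, and the @printed and @choices arrays are made-up stand-ins for printing to OUT and calling choice().

```perl
use strict;
use warnings;

my %to_print = map { $_ => 1 } qw(h2 p);   # tags whose markup is copied through
my (@printed, @choices);                   # stand-ins for print OUT / choice()

# Same decision as the start handler, written as if/elsif.
sub start_tag {
    my ($tag, $attr, $text) = @_;
    if ($to_print{$tag}) {
        push @printed, $text;              # condition (a): h2 or p
    }
    elsif ($tag eq 'option' && defined $attr->{value}) {
        push @choices, $attr->{value};     # condition (b): option with a value
    }
}

start_tag('h2', { id => 'id' }, "<h2 id='id'>");
start_tag('option', { value => 'docs/gcse/student-activities/finance' },
          "<option value='docs/gcse/student-activities/finance'>");
start_tag('option', {}, '<option>');       # no value attribute: ignored

print "printed: @printed\n";
print "choices: @choices\n";
```

The explicit check for a defined 'value' avoids passing undef on to choice() when an option tag has no value attribute.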
sub default {
print shift if grep { $current_tag eq $_ } @to_print;
}

This is the default handler; it will process every event *except* start
tags, which have an explicit handler. Again, since we specified a list
of tags to report, the only tags we're dealing with are 'h2', 'p', and
'option'. Specifically, the sub processes (a) text *inside* of the
listed tags (text event), and (b) end tags (end event). Argspec was
specified as 'text', so with the same markup as above:

<h2 id='id' align='center'>Level Two Heading</h2>

The sub will print:

1. 'Level Two Heading' - text event
2. '</h2>' - end event
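
To watch those events arrive, a sketch along these lines can help, again assuming HTML::Parser is installed. It adds the 'event' argspec name so the default handler can label what it receives; the @log array is invented for the example, and the log may also contain document-level events with no text attached.

```perl
use strict;
use warnings;
use HTML::Parser;

my @log;   # "event: text" strings in the order the parser reports them

my $parser = HTML::Parser->new(
    api_version => 3,
    report_tags => [qw(h2 p option)],
    start_h     => [ sub { push @log, "start: $_[0]" }, 'text' ],
    # 'event' passes the event type as a string; the text may be undefined
    # for events that have no source text, hence the guard.
    default_h   => [
        sub { push @log, "$_[0]: " . (defined $_[1] ? $_[1] : '') },
        'event, text',
    ],
);

$parser->parse("<h2 id='id' align='center'>Level Two Heading</h2>");
$parser->eof;   # flush any text the parser is still holding

print "$_\n" for @log;
```

Calling eof after parse tells the parser the input is complete, so text it was holding back is reported before the script moves on.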

[snip]

Hope it makes sense, but I cannot stress enough how important it is for
you to go over the documentation - even if it doesn't make sense now, it
will eventually. One suggestion is to go over the 'Argspec' section and,
if nothing else, play around with some code that prints out the values
to see what kind of information is available to you.

keith
 
Geoff Cox

Keith

many thanks for the explanations - am carefully going through them
now. I'm sure you are right about the need to read, read and read
again!

Cheers

Geoff
 
