Problem parsing HTML

Ninja Li

Hi,

I am trying to parse HTML from the website http://biz.yahoo.com/c/e.html
using the HTML::TreeBuilder module and generate a comma-delimited file.
However, I am getting an extra "," on the first line, and I would also
like to get rid of the "," at the end of each line.

Please advise why that happens and the fix. The code is at the end
of the post.

Thanks in advance.

Nick
-----------------------------------------------
Source code:

use strict;
use LWP::Simple;
use HTML::Tree;
use warnings;

my $url = 'http://biz.yahoo.com/c/e.html';
my $content = get($url);

my $tree = HTML::TreeBuilder->new_from_content($content);

my @tr = $tree->look_down('_tag' => 'tr',
    sub { $_[0]->as_text !~ m/My Yahoo!/ &&
          $_[0]->as_text !~ m/Welcome/i &&
          $_[0]->as_text !~ m/Economic Calendar/i &&
          $_[0]->as_text !~ m/Last Week/i });

foreach my $tr (@tr)
{
    if ($tr)
    {
        my @detail = $tr->look_down('_tag' => 'td');

        foreach my $detail (@detail)
        {
            print $detail->as_text . ",";
        }
        print "\n";
    }
    else
    {
        warn "No detail data";
    }
}

$tree->delete;
 
Dan Rumney

> my @detail = $tr->look_down('_tag' => 'td');
>
> foreach my $detail (@detail)
> {
>     print $detail->as_text . ",";
> }
> print "\n";

There's your problem.

For every element in @detail, you print that element, plus a comma.
You've programmed it to put a comma after every $detail, so you're going
to get a comma after the last $detail on each line.

Try
my @detail = $tr->look_down('_tag' => 'td');
print join(',',@detail)."\n";
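
A minimal sketch of the difference between the two approaches, using made-up cell values in place of the parsed table (the names @cells, $looped and $joined are only for illustration):

```perl
use strict;
use warnings;

# Hypothetical cell values standing in for the text of each <td>.
my @cells = ('Jan 5', '8:30 am', 'Payrolls');

# Appending the separator inside the loop leaves one after the last cell.
my $looped = '';
$looped .= "$_," for @cells;

# join places the separator between elements only.
my $joined = join ',', @cells;

print "$looped\n";
print "$joined\n";
```

Note that join works on strings, so the list passed to it should already hold the cell text rather than the element objects.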


Dan
 
Ninja Li

> There's your problem.
>
> For every element in @detail, you print that element, plus a comma.
> You've programmed it to put a comma after every $detail, so you're going
> to get a comma after the last $detail on each line.
>
> Try
>
> my @detail = $tr->look_down('_tag' => 'td');
> print join(',',@detail)."\n";
>
> Dan

Dan,

I tried your solution but I got error messages.

Thanks.

Nick
 
Jürgen Exner

Ninja Li said:
> I tried your solution but I got error messages.

And? Are you keeping those error messages a secret? Kind of difficult to
correct code without knowing the code _and_ the error messages.

jue
 
Ninja Li

> And? Are you keeping those error messages a secret? Kind of difficult to
> correct code without knowing the code _and_ the error messages.
>
> jue

Jurgen,

Thanks for pointing this out. Here is the error message I get after the
code change. The new code is at the end of the post:

Thanks.

------------------------------------
HTML::Element=HASH(0xdd3228)
HTML::Element=HASH(0xdee864),HTML::Element=HASH
(0xdee924),HTML::Element=HASH(0xdee9a8),HTML::Element=HASH
(0xdeea44),HTML::Element=HASH(0xdeeb04),HTML::Element=HASH
(0xdeebac),HTML::Element=HASH(0xdeec54),HTML::Element=HASH
(0xdf2bd8),HTML::Element=HASH(0xdf2c80)
HTML::Element=HASH(0xdf2d58),HTML::Element=HASH
(0xdf2dd0),HTML::Element=HASH(0xdf2e0c),HTML::Element=HASH
(0xdf2eb4),HTML::Element=HASH(0xdf2f2c),HTML::Element=HASH
(0xdf2f8c),HTML::Element=HASH(0xdf2fec),HTML::Element=HASH
(0xdf304c),HTML::Element=HASH(0xdf30ac)
HTML::Element=HASH(0xdf313c),HTML::Element=HASH
(0xdf31b4),HTML::Element=HASH(0xdf31f0),HTML::Element=HASH
(0xdf32a4),HTML::Element=HASH(0xdf331c),HTML::Element=HASH
(0xdf337c),HTML::Element=HASH(0xdf33dc),HTML::Element=HASH
(0xdf343c),HTML::Element=HASH(0xdf349c)
HTML::Element=HASH(0xdf352c),HTML::Element=HASH
(0xdf35a4),HTML::Element=HASH(0xdf35e0),HTML::Element=HASH
(0xdf3694),HTML::Element=HASH(0xdf370c),HTML::Element=HASH
(0xdf376c),HTML::Element=HASH(0xdf37cc),HTML::Element=HASH
(0xdf382c),HTML::Element=HASH(0xdf388c)
HTML::Element=HASH(0xdf391c),HTML::Element=HASH
(0xdf3994),HTML::Element=HASH(0xdf39d0),HTML::Element=HASH
(0xdf3a84),HTML::Element=HASH(0xdf3afc),HTML::Element=HASH
(0xdf6480),HTML::Element=HASH(0xdf64e0),HTML::Element=HASH
(0xdf6540),HTML::Element=HASH(0xdf65a0)
HTML::Element=HASH(0xdf6630),HTML::Element=HASH
(0xdf66a8),HTML::Element=HASH(0xdf66e4),HTML::Element=HASH
(0xdf6798),HTML::Element=HASH(0xdf6810),HTML::Element=HASH
(0xdf6870),HTML::Element=HASH(0xdf68d0),HTML::Element=HASH
(0xdf6930),HTML::Element=HASH(0xdf6990)
HTML::Element=HASH(0xdf6a20),HTML::Element=HASH
(0xdf6a98),HTML::Element=HASH(0xdf6ad4),HTML::Element=HASH
(0xdf6b28),HTML::Element=HASH(0xdf6ba0),HTML::Element=HASH
(0xdf6c00),HTML::Element=HASH(0xdf6c60),HTML::Element=HASH
(0xdf6cc0),HTML::Element=HASH(0xdf6d20)
HTML::Element=HASH(0xdf6db0),HTML::Element=HASH
(0xdf6e28),HTML::Element=HASH(0xdf6e64),HTML::Element=HASH
(0xdf6f0c),HTML::Element=HASH(0xdf6f84),HTML::Element=HASH
(0xdf6fe4),HTML::Element=HASH(0xdf7044),HTML::Element=HASH
(0xdf70a4),HTML::Element=HASH(0xdf7104)
HTML::Element=HASH(0xdf7194),HTML::Element=HASH
(0xdf720c),HTML::Element=HASH(0xdf7248),HTML::Element=HASH
(0xdf72f0),HTML::Element=HASH(0xdf7368),HTML::Element=HASH
(0xdf73c8),HTML::Element=HASH(0xdf7428),HTML::Element=HASH
(0xdfae6c),HTML::Element=HASH(0xdfaecc)
HTML::Element=HASH(0xdfaf5c),HTML::Element=HASH
(0xdfafd4),HTML::Element=HASH(0xdfb010),HTML::Element=HASH
(0xdfb064),HTML::Element=HASH(0xdfb0dc),HTML::Element=HASH
(0xdfb13c),HTML::Element=HASH(0xdfb19c),HTML::Element=HASH
(0xdfb1fc),HTML::Element=HASH(0xdfb25c)
HTML::Element=HASH(0xdfb2ec),HTML::Element=HASH
(0xdfb364),HTML::Element=HASH(0xdfb3a0),HTML::Element=HASH
(0xdfb3f4),HTML::Element=HASH(0xdfb46c),HTML::Element=HASH
(0xdfb4cc),HTML::Element=HASH(0xdfb52c),HTML::Element=HASH
(0xdfb58c),HTML::Element=HASH(0xdfb5ec)
HTML::Element=HASH(0xdfb67c),HTML::Element=HASH
(0xdfb6f4),HTML::Element=HASH(0xdfb730),HTML::Element=HASH
(0xdfb7d8),HTML::Element=HASH(0xdfb850),HTML::Element=HASH
(0xdfb8b0),HTML::Element=HASH(0xdfb910),HTML::Element=HASH
(0xdfb970),HTML::Element=HASH(0xdfb9d0)
HTML::Element=HASH(0xdfba60),HTML::Element=HASH
(0xdfbad8),HTML::Element=HASH(0xdfbb14),HTML::Element=HASH
(0xdfbb68),HTML::Element=HASH(0xdfbbe0),HTML::Element=HASH
(0xdfbc40),HTML::Element=HASH(0xdfbca0),HTML::Element=HASH
(0xdfbd00),HTML::Element=HASH(0xdfbd60)
HTML::Element=HASH(0xdfbdf0),HTML::Element=HASH
(0xdfe75c),HTML::Element=HASH(0xdfe798),HTML::Element=HASH
(0xdfe84c),HTML::Element=HASH(0xdfe8c4),HTML::Element=HASH
(0xdfe924),HTML::Element=HASH(0xdfe984),HTML::Element=HASH
(0xdfe9e4),HTML::Element=HASH(0xdfea44)
HTML::Element=HASH(0xdfead4),HTML::Element=HASH
(0xdfeb4c),HTML::Element=HASH(0xdfeb88),HTML::Element=HASH
(0xdfec3c),HTML::Element=HASH(0xdfecb4),HTML::Element=HASH
(0xdfed14),HTML::Element=HASH(0xdfed74),HTML::Element=HASH
(0xdfedd4),HTML::Element=HASH(0xdfee34)
HTML::Element=HASH(0xdfeec4),HTML::Element=HASH
(0xdfef3c),HTML::Element=HASH(0xdfef78),HTML::Element=HASH
(0xdff020),HTML::Element=HASH(0xdff098),HTML::Element=HASH
(0xdff0f8),HTML::Element=HASH(0xdff158),HTML::Element=HASH
(0xdff1b8),HTML::Element=HASH(0xdff218)
HTML::Element=HASH(0xdff2a8),HTML::Element=HASH
(0xdff320),HTML::Element=HASH(0xdff35c),HTML::Element=HASH
(0xdff3b0),HTML::Element=HASH(0xdff428),HTML::Element=HASH
(0xdff488),HTML::Element=HASH(0xdff4e8),HTML::Element=HASH
(0xdff548),HTML::Element=HASH(0xdff5a8)
HTML::Element=HASH(0xdff638),HTML::Element=HASH
(0xdff6b0),HTML::Element=HASH(0xdff6ec),HTML::Element=HASH
(0xe040e0),HTML::Element=HASH(0xe04158),HTML::Element=HASH
(0xe041b8),HTML::Element=HASH(0xe04218),HTML::Element=HASH
(0xe04278),HTML::Element=HASH(0xe042d8)
HTML::Element=HASH(0xe04368),HTML::Element=HASH
(0xe043e0),HTML::Element=HASH(0xe0441c),HTML::Element=HASH
(0xe044d0),HTML::Element=HASH(0xe04548),HTML::Element=HASH
(0xe045a8),HTML::Element=HASH(0xe04608),HTML::Element=HASH
(0xe04668),HTML::Element=HASH(0xe046c8)
HTML::Element=HASH(0xe04758),HTML::Element=HASH
(0xe047d0),HTML::Element=HASH(0xe0480c),HTML::Element=HASH
(0xe04860),HTML::Element=HASH(0xe048d8),HTML::Element=HASH
(0xe04938),HTML::Element=HASH(0xe04998),HTML::Element=HASH
(0xe049f8),HTML::Element=HASH(0xe04a58)
HTML::Element=HASH(0xe04ae8),HTML::Element=HASH
(0xe04b60),HTML::Element=HASH(0xe04b9c),HTML::Element=HASH
(0xe04c44),HTML::Element=HASH(0xe04cbc),HTML::Element=HASH
(0xe04d1c),HTML::Element=HASH(0xe04d7c),HTML::Element=HASH
(0xe04ddc),HTML::Element=HASH(0xe04e3c)
HTML::Element=HASH(0xe04ecc),HTML::Element=HASH
(0xe04f44),HTML::Element=HASH(0xe04f80),HTML::Element=HASH
(0xe04fd4),HTML::Element=HASH(0xe089d8),HTML::Element=HASH
(0xe08a38),HTML::Element=HASH(0xe08a98),HTML::Element=HASH
(0xe08af8),HTML::Element=HASH(0xe08b58)

-------------------------
Source Code:

use strict;
use LWP::Simple;
use HTML::Tree;
use warnings;

my $file = "economic_calendar.dat";
unlink($file);

open FILE, ">$file" or die $!;

my $url = 'http://biz.yahoo.com/c/e.html';
my $content = get($url);

my $tree = HTML::TreeBuilder->new_from_content($content);

my @tr = $tree->look_down('_tag' => 'tr',
    sub { $_[0]->as_text !~ m/My Yahoo!/ &&
          $_[0]->as_text !~ m/Welcome/i &&
          $_[0]->as_text !~ m/Economic Calendar/i &&
          $_[0]->as_text !~ m/Last Week/i });

foreach my $tr (@tr)
{
    if ($tr)
    {
        my @detail = $tr->look_down('_tag' => 'td');
        print join(',',@detail)."\n";
    }
    else
    {
        warn "No detail data";
    }
}

close FILE;

$tree->delete;
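
The output above is not an error message in the usual sense: HTML::Element=HASH(0x...) is how Perl stringifies an object reference, and that is what join receives when @detail holds the elements returned by look_down. A minimal sketch of one way around it, calling as_text on each cell before joining. The sample HTML and the helper name row_to_csv are made up for illustration and are not part of the original script:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample table standing in for the fetched page.
my $html = <<'HTML';
<table>
<tr><td>Jan 5</td><td>8:30 am</td><td>Payrolls</td></tr>
<tr><td>Jan 6</td><td>10:00 am</td><td>Factory Orders</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Hypothetical helper: turn one <tr> element into a comma-separated line.
sub row_to_csv {
    my ($tr) = @_;
    # Convert each <td> element to its text before joining.
    my @cells = map { $_->as_text } $tr->look_down('_tag' => 'td');
    return join ',', @cells;
}

my @lines = map { row_to_csv($_) } $tree->look_down('_tag' => 'tr');
print "$_\n" for @lines;

$tree->delete;
```

Cells whose text itself contains a comma would still need quoting. The posted script also opens FILE but prints to STDOUT, so nothing is written to economic_calendar.dat unless print is given the filehandle.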
 
Ninja Li

Tad said:
> If the data that you want to scrape is in an HTML table,
> then TableExtract is likely to make for prettier and more
> robust code:
>
> ----------------------------------
> #!/usr/bin/perl
> use warnings;
> use strict;
> use LWP::Simple;
> use HTML::TableExtract;
>
> my $html = get 'http://biz.yahoo.com/c/e.html';
>
> my @headers = (
>     'Date',
>     "Time",
>     'Statistic',
>     'For',
>     'Actual',
>     'Briefing Forecast',
>     'Market Expects',
>     'Prior',
>     "Revised\nFrom",
> );
>
> my $te = HTML::TableExtract->new( headers => \@headers );
> $te->parse($html);
>
> foreach my $ts ( $te->tables ) {
>     foreach my $row ($ts->rows) {
>         my $csv = join ',', @$row;
>         print "$csv\n";
>     }
> }

Tad,

Thanks for your response. HTML::TableExtract looks to be a better
option for dealing with HTML. I tried to apply your code to another
web link, however I didn't get any output. Please advise what might be
wrong. The code is at the end of the post.

Thanks.

Nick

------------------------

use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;

my $html = 'http://www.earnings.com/conferencecall.asp?client=cb';

my @headers =
(
    'SYMBOL',
    'COMPANY',
    'EVENT TITLE',
    'WEBCAST',
    'TRANSCRIPT',
    'TIME'
);


my $te = HTML::TableExtract->new( headers => \@headers );
$te->parse($html);

foreach my $ts ( $te->tables )
{
    foreach my $row ($ts->rows)
    {
        my $csv = join ',', @$row;
        print "$csv\n";
    }
}
 
Ninja Li

Tad,

Maybe I am too slow on this, but print $html only shows the web link,
right? I can confirm it does contain valid data. Could you give more
details?

Thanks.

Nick
 
Ninja Li

> And what _should_ it show?
>
> It should contain HTML, not a URL.
>
> Why does it contain the URL instead of HTML that was fetched
> from that URL?
>
> Did you see the part of my code that I underlined in my followup?
>
> Does your code have the part that was underlined?
>
> You will smack your forehead when you see the mistake you've made...
>
> :)

Thanks a million. A smack on the head is required and deserved for me.

Nick
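
For anyone finding this thread later: the mistake being hinted at is that $html was assigned the URL string itself rather than the result of get, so the parser was handed a URL instead of a page. A small sketch that shows the effect without touching the network. The sample table and the helper extract_rows are made up for illustration:

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up page content; in the real script this would come from get($url).
my $page = <<'HTML';
<table>
<tr><th>SYMBOL</th><th>COMPANY</th><th>TIME</th></tr>
<tr><td>ABC</td><td>Example Corp</td><td>9:00 AM</td></tr>
</table>
HTML

# Hypothetical helper: parse a string and return the rows of every table
# whose header row matches the given headers.
sub extract_rows {
    my ($html, @headers) = @_;
    my $te = HTML::TableExtract->new( headers => \@headers );
    $te->parse($html);
    return map { $_->rows } $te->tables;
}

my @headers = ('SYMBOL', 'COMPANY', 'TIME');

# A URL string is not HTML, so no table is found in it.
my @from_url = extract_rows('http://www.earnings.com/conferencecall.asp?client=cb', @headers);

# The page content does contain a matching table.
my @from_html = extract_rows($page, @headers);

printf "rows from URL string: %d\n", scalar @from_url;
printf "rows from HTML:       %d\n", scalar @from_html;
print join(',', @$_), "\n" for @from_html;
```

In the script posted earlier, the corresponding change would be to write my $html = get 'http://www.earnings.com/conferencecall.asp?client=cb'; and, ideally, to check that get did not return undef before parsing.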
 
