Stupid Q: How to preserve numeric characters

goodarm

Gurus,

I am relatively new to Perl, so please bear with me. I am trying to
write a simple scraper for non-English Web pages. For that purpose I
am using HTML::TokeParser. I want to extract some of the content and
generate another HTML page (which potentially will have notes
in multiple languages). The pages I am scraping are written using
numeric character references, e.g. &#1086;&#1076;&#1091;. When I extract
them and then inject them into my HTML page, they get converted into
characters (оду). All I want is to preserve the original numeric
character references, as that seems to be the easiest way to build my
result page. How do I do that?

Sample code:

sub parseResponce($$) {
    my $data = shift;
    my $stream = HTML::TokeParser->new($data);

    while (my $tag = $stream->get_tag("p")) {
        if (...) {    # condition elided in the original post
            $buff = $stream->get_trimmed_text("/p");
        }
    }
}

Thanks in advance, Victor
 
himanshu.garg

You could try this method, which HTML::TokeParser inherits from
HTML::Parser:

$stream->attr_encoded( 1 );

Call it before calling get_tag.

See also:

http://search.cpan.org/~gaas/HTML-Parser-3.55/Parser.pm
Thank You,
++imanshu
 
goodarm

Thanks a lot for your reply,

...unfortunately, it doesn't seem to work. The documentation you
referred to says: "By default, the attr and @attr argspecs will have general
entities for attribute values decoded. Enabling this attribute leaves
entities alone." So I guess this applies only to attribute values,
while I am trying to scrape the text of the node.

In any case, I did as you suggested and got the same results.

It's driving me crazy; there has to be a very simple way to do that...

Victor
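
The attribute-versus-text distinction described above can be checked with a small sketch. The sample document and the variable names are made up for illustration; the only calls used are HTML::TokeParser's get_tag and get_trimmed_text plus the inherited attr_encoded. With attr_encoded enabled, the attribute value should come back as written, while the element text is still decoded.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample document: the same numeric reference appears in an
# attribute value and in the element text.
my $html = '<p title="&#1086;">&#1086;</p>';

# Passing a reference makes the parser read the string itself
# instead of treating it as a file name.
my $stream = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";
$stream->attr_encoded(1);    # leave entities in attribute values alone

my $tag  = $stream->get_tag('p');    # [ tagname, \%attr, \@attrseq, text ]
my $attr = $tag->[1]{title};         # stays as written in the source
my $text = $stream->get_trimmed_text('/p');    # still decoded to a character

binmode STDOUT, ':encoding(UTF-8)';
print "attribute: $attr\n";
print "text:      $text\n";
```

So the setting does what the documentation says, but it never touches what get_trimmed_text returns.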
 
goodarm

...in the meantime I found a workaround:

instead of
    $buff = $stream->get_trimmed_text("/p");
do
    $buff = HTML::Entities::encode_entities_numeric(
        $stream->get_trimmed_text("/p") );

I am still not convinced it's the right way of dealing with this,
but at least it works...

V
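
For reference, here is the workaround above as a self-contained sketch. The helper name paragraphs_as_numeric and the sample input are hypothetical, and the "if (...)" filter from the original code is left out because its condition was never posted. Note that encode_entities_numeric writes hexadecimal references, so the decimal &#1086; in the source comes back as &#x43E;, which browsers render the same way.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(encode_entities_numeric);

# Hypothetical helper: return the text of every <p> element, re-encoded
# as numeric character references.
sub paragraphs_as_numeric {
    my ($html) = @_;

    # A reference tells HTML::TokeParser to parse the string itself
    # rather than open a file with that name.
    my $stream = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";

    my @out;
    while ( $stream->get_tag('p') ) {
        # get_trimmed_text decodes entities, so encode them again.
        my $text = $stream->get_trimmed_text('/p');
        push @out, encode_entities_numeric($text);
    }
    return @out;
}

print "$_\n" for paragraphs_as_numeric('<p>&#1086;&#1076;&#1091;</p>');
```

One side effect of this round trip: characters such as & and < in the text are also re-encoded as numeric references, which is usually what you want when writing the text back into HTML.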
 
himanshu.garg

Sorry about the wrong update.

HTML::PullParser doesn't seem to have a method for this. However,
HTML::Parser has ways of doing it: apparently, if you ask a handler for
the 'text' argspec rather than 'dtext', it will not decode the entities
for you.

Thank You,
Himanshu.
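
A minimal sketch of the 'text' versus 'dtext' suggestion, using HTML::Parser directly. The helper name raw_paragraph_text and the $in_p flag are made up for illustration, and this is only one way to wire up the handlers, not necessarily what either poster ended up using. The text handler asks for the 'text' argspec, so it receives the source text with entities left as written.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: collect the text inside <p> elements without
# decoding entities.
sub raw_paragraph_text {
    my ($html) = @_;
    my @paras;
    my $in_p = 0;    # true while the parser is inside a <p> element

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            if ( $tag eq 'p' ) { $in_p = 1; push @paras, ''; }
        }, 'tagname' ],
        end_h => [ sub {
            my ($tag) = @_;
            $in_p = 0 if $tag eq 'p';
        }, 'tagname' ],
        # 'text' passes the source text as written; 'dtext' would decode it.
        text_h => [ sub {
            my ($text) = @_;
            # Text may arrive in several chunks, so append rather than assign.
            $paras[-1] .= $text if $in_p;
        }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;
    return @paras;
}

print "$_\n" for raw_paragraph_text('<p>&#1086;&#1076;&#1091;</p>');
```

Unlike the encode_entities_numeric workaround, this keeps the references exactly as they appear in the source page, decimal form included, but it does no whitespace trimming.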
 
