Stupid Q: How to preserve numeric characters

goodarm · Aug 29, 2006

Gurus,

I am relatively new to Perl, so please bear with me. I am trying to
write a simple scraper for non-English web pages, and I am using
HTML::TokeParser for that. I want to extract some of the content and
generate another HTML page (which may end up with notes in multiple
languages). The pages I am scraping are written using numeric character
references, e.g. &#1086;&#1076;&#1091; (which renders as оду). When I
extract them and inject them into my HTML page, they get converted into
the characters themselves. All I want is to preserve the original
numeric references, as that seems to be the easiest way to build my
result page. How do I do that?

Sample code:

sub parseResponce($$) {
    my $data = shift;
    my $stream = HTML::TokeParser->new($data);

    while (my $tag = $stream->get_tag("p")) {
        if (...) {
            $buff = $stream->get_trimmed_text("/p");
        }
    }
}

Thanks in advance, Victor

himanshu.garg · Aug 29, 2006

You could try this method, which HTML::TokeParser inherits from HTML::Parser:

$stream->attr_encoded( 1 );

Call it before get_tag.

See also:

http://search.cpan.org/~gaas/HTML-Parser-3.55/Parser.pm

Thank You,
++imanshu

goodarm · Aug 30, 2006

Thanks a lot for your reply,

...unfortunately, that doesn't seem to work. The documentation you
referred to says: "By default, the attr and @attr argspecs will have general
entities for attribute values decoded. Enabling this attribute leaves
entities alone." So I guess this applies only to attribute values,
while I am trying to scrape the text of the node.

In any case, I did as you suggested and got the same results.

It's driving me crazy; there has to be a very simple way to do this...

Victor

goodarm · Aug 30, 2006

...in the meantime I found a workaround. Instead of

$buff = $stream->get_trimmed_text("/p");

do

$buff = HTML::Entities::encode_entities_numeric(
    $stream->get_trimmed_text("/p") );

I am still not convinced it's the right way to deal with this,
but at least it works...

V
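
For anyone who wants to try the workaround above in isolation, here is a minimal sketch. The sample HTML string and the @paragraphs array are made up for illustration, and the parser is given a scalar reference on the assumption that the page is already held in a string rather than a file.

```perl
use strict;
use warnings;
use HTML::TokeParser ();
use HTML::Entities ();

# Hypothetical sample input; the real pages come from the scraped site.
my $html = '<p>&#1086;&#1076;&#1091;</p>';

# A scalar reference makes HTML::TokeParser parse the string itself
# instead of treating the argument as a file name.
my $stream = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my @paragraphs;
while (my $tag = $stream->get_tag('p')) {
    # get_trimmed_text decodes entities into characters ...
    my $decoded = $stream->get_trimmed_text('/p');
    # ... so re-encode them as numeric references before output.
    push @paragraphs, HTML::Entities::encode_entities_numeric($decoded);
}

print "$_\n" for @paragraphs;
```

Note that encode_entities_numeric writes the references in hexadecimal form (&#x...;) rather than the decimal form used in the source page, and it also encodes characters such as & and <. Browsers treat both notations the same, but the output is not byte-for-byte identical to the input.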

himanshu.garg · Aug 30, 2006

Sorry about the wrong suggestion.

HTML::PullParser doesn't seem to have a method for this. However,
HTML::Parser has ways of doing it: apparently, if you ask it to
send 'text' rather than 'dtext', it will not decode the entities for you.

Thank You,
Himanshu.
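
A rough sketch of the 'text' approach suggested above, using HTML::Parser directly instead of HTML::TokeParser. The sample HTML, the $in_p flag, and the @paragraphs array are illustrative assumptions, not part of the original scraper; the point is only that the 'text' argspec hands the handler the source text with entities left untouched.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical sample input.
my $html = '<p>&#1086;&#1076;&#1091;</p><div>skip</div><p>a &amp; b</p>';

my $in_p = 0;       # true while the parser is inside a <p> element
my @paragraphs;     # raw (undecoded) text of each <p>

my $parser = HTML::Parser->new(
    api_version => 3,
    # Start a new buffer whenever a <p> opens.
    start_h => [ sub {
        my ($tagname) = @_;
        if ($tagname eq 'p') { $in_p = 1; push @paragraphs, ''; }
    }, 'tagname' ],
    # Stop collecting when the <p> closes.
    end_h => [ sub { $in_p = 0 if $_[0] eq 'p' }, 'tagname' ],
    # 'text' passes the source text as-is; 'dtext' would decode entities.
    text_h => [ sub { $paragraphs[-1] .= $_[0] if $in_p }, 'text' ],
);

$parser->parse($html);
$parser->eof;

print "$_\n" for @paragraphs;
```

Text may arrive in more than one chunk, which is why the handler appends to the current buffer instead of assigning to it. Unlike get_trimmed_text, this sketch does not trim or collapse whitespace.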
