HTML::Parser: how to remove "//<![CDATA[ ... //]]>"?

Gerwin

Hi,

I'm using HTML::Parser to strip HTML tags from my files. I noticed
that //<![cdata[ ... //]]> and the JavaScript between them are not
stripped. Any idea how to do this?

-Gerwin
 
Gerwin

> Hi,
>
> I'm using HTML::Parser to strip HTML tags from my files. I noticed
> that //<![cdata[ ... //]]> and the JavaScript between them are not
> stripped. Any idea how to do this?
>
> -Gerwin

Well, I made a regex to do it:

$content =~ s/(\/\/<!\[.*\/\/]]>)//;

Is this efficient? If not, what is?
 
anno4000

Gerwin said:
> Hi,
>
> I'm using HTML::Parser to strip HTML tags from my files. I noticed
> that //<![cdata[ ... //]]> and the JavaScript between them are not
> stripped. Any idea how to do this?
>
> -Gerwin
>
> Well, I made a regex to do it:
>
> $content =~ s/(\/\/<!\[.*\/\/]]>)//;
>
> Is this efficient? If not, what is?

Why do you think efficiency matters?

At this point you should be concerned with effectiveness: Does it
match what it is supposed to match, no more and no less? Since I
don't know the variability of the pattern I can't tell. The fact
you are matching only one opening "[" but two closing "]" is a bit
dubious. Shouldn't the string "cdata" be checked somewhere?

Worry about efficiency when your program turns out to be slow.
If that happens, I dare say it won't be this regex that is
responsible.

Anno
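
For readers weighing the points above, here is a minimal sketch of a stricter substitution. It assumes the blocks always have the form //<![CDATA[ ... //]]>, may span several lines, and may be written in lowercase as in the question. The name strip_cdata_blocks is made up for this example and is not part of HTML::Parser.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every //<![CDATA[ ... //]]> block, markers included.
sub strip_cdata_blocks {
    my ($content) = @_;
    $content =~ s{
        //\s*<!\[CDATA\[    # opening marker, spelled out in full
        .*?                 # body, non-greedy so separate blocks stay separate
        //\s*\]\]>          # closing marker
    }{}gsxi;
    return $content;
}
```

The /s flag lets the dot cross newlines, /g handles more than one block, and /i is only there because the question spells the marker in lowercase; drop it if the files always use uppercase CDATA.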
 
Uri Guttman

s> Use ROBIC0's RxParse, it handles CDATA correctly

you must either be kidding or an alter moron ego of his. his module is
horrible in too many ways to count.

uri
 
Dr.Ruud

(e-mail address removed)-berlin.de wrote:
> Gerwin:
> I'm using HTML::Parser to strip HTML tags from my files. I noticed
> that //<![cdata[ ... //]]> and the JavaScript between them are not
> stripped. Any idea how to do this?
>
> Well, I made a regex to do it:
> $content =~ s/(\/\/<!\[.*\/\/]]>)//;

A different separator makes it more legible:
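
The example that followed this remark does not appear in this copy of the thread. As a hedged reconstruction, and not necessarily what was originally posted, the same substitution can be written with braces as delimiters so that the slashes need no escaping. The name strip_with_braces is made up for this sketch.

```perl
use strict;
use warnings;

# Same pattern as in the original post; only the delimiters differ.
sub strip_with_braces {
    my ($content) = @_;
    $content =~ s{//<!\[.*//]]>}{};
    return $content;
}
```

The behaviour is unchanged, including the greedy match and the single-line limitation discussed elsewhere in the thread.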

> Why do you think efficiency matters?

Maybe HTML::Parser already offers a better way.

Or maybe one doesn't always want to remove it, there can be plain text
inside. Or, but that is more about effectiveness, maybe some CDATA
sections are spread over more than one line.
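
A small sketch of both caveats, assuming the same marker format as in the question: the /s flag lets the match run across line breaks, and capturing the body makes it possible to drop only the markers while keeping the text. The name unwrap_cdata is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the //<![CDATA[ and //]]> markers
# but keep whatever text sits between them.
sub unwrap_cdata {
    my ($content) = @_;
    # (.*?) captures the body; /s lets it span several lines.
    $content =~ s{//\s*<!\[CDATA\[(.*?)//\s*\]\]>}{$1}gs;
    return $content;
}
```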
 
Uri Guttman

s> If you are stream processing and as you say spread over more than 1 line,
s> use ROBIC0's approach, buffer until you have complete comment or cdata:

ok, the first time you backed his code i asked you if it was a
joke. obviously you think it is real code. so i will shred this crap and
hopefully you will see why it is bad code.

s> $RxParseXP1 = qr/(?:<(?:\[CDATA\[(.*?)\]\])|(?:--(.*?[^-])--)>)/s;
s> # ( <( 0 0 )|( 1 1 ) )

why no /x modifier? long and complex regexes always should use /x.
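
To illustrate the point about /x, here is a sketch of a comparable pattern laid out with whitespace and comments. It is an illustrative pattern written for this example, not a corrected version of the quoted module, and the names section_rx and has_complete_section are made up.

```perl
use strict;
use warnings;

# Illustrative pattern: one complete CDATA section or one complete comment.
my $section_rx = qr{
    <!\[CDATA\[  (.*?)  \]\]>    # capture 1: CDATA body
  |
    <!--         (.*?)  -->      # capture 2: comment body
}sx;

# Returns 1 if the text contains a complete CDATA section or comment, else 0.
sub has_complete_section {
    my ($text) = @_;
    return $text =~ $section_rx ? 1 : 0;
}
```

One thing to watch with /x: variables are still interpolated inside the pattern's comments, so it is safer to keep sigils out of them.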

and what if that was inside a comment? or some area that isn't parsed
for html?

s> while (!$done)

loop flags like that are very silly and kiddie code.

s> {
s> $ln_cnt++;

s> # stream processing (if not buffered)
s> if (!$BUFFERED) {

upper case variable names are for constants and such. notice he likes !
all over the place. better to use unless and until.

s> if (!($_ = <$markup_file>)) {

that is just bad. use a named lexical instead of $_. what if the last
line of the file was just '0' with no newline? that fails.
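
A short sketch of the difference, using made-up function names. As far as I know Perl adds the defined test implicitly only for a plain read in a while condition, not in an if, so testing the line for truth stops early on a final "0" while testing for definedness does not.

```perl
use strict;
use warnings;

# Counts lines by testing definedness: a final "0" is still counted.
sub count_lines {
    my ($fh) = @_;
    my $count = 0;
    while (1) {
        my $line = <$fh>;
        last unless defined $line;   # undef means end of input
        $count++;
    }
    return $count;
}

# Counts lines by testing truth: stops early when a line is "0".
sub count_lines_by_truth {
    my ($fh) = @_;
    my $count = 0;
    while (1) {
        my $line = <$fh>;
        last if !$line;              # "0" is false, so this ends the loop
        $count++;
    }
    return $count;
}
```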

s> # just parse what we have
s> $done = 1;

just bad. really bad to use flags like that. you don't even know what
more code might execute here because of the ridiculously long if/else blocks.

s> # boundry check for runnaway
s> if (($complete_comment+$complete_cdata) > 0) {

huh??? is he testing 2 flags for either one being set? has the idiot
ever heard of a boolean or??

s> $ln_cnt--;

we lose lines too??

s> }
s> } else {
s> $$ref_parse_ln .= $_;

s> ## buffer if needing comment/cdata closure
s> next if ($complete_comment && !/-->/);
s> next if ($complete_cdata && !/\]\]>/);

but what about the done flag?? don't mix loop flags with flow control
ops. just dumb. pick one style.

s> ## reset comment/cdata flags
s> $complete_comment = 0;
s> $complete_cdata = 0;

s> ## flag serialized comments/cdata buffering
s> if (/(<!--)|(<!\[CDATA\[)/)

more complex regexes that need /x

s> {
s> if (defined $1) { # complete comment
s> if ($$ref_parse_ln !~ /<!--.*?-->/s) {
s> $complete_comment = 1;
s> next;

the nesting here is getting very deep. a sign of someone who loves
if/else too much. a cleaner design would be simpler, easier to read and
understand.

s> }
s> }
s> elsif (defined $2) { # complete cdata

why check those two separately but match in one regex? just test one
pattern and handle it or test the other. this is bullshit code. way
longer and more complex than it needs to be.

s> if ($$ref_parse_ln !~
s> /<!\[CDATA\[.*?\]\]>/s)
s> {

he uses ! all over and here he switches to !~ which is rarely used.

s> $complete_cdata = 1;
s> next;
s> }
s> }
s> }
s> ## buffer until '>' or eof
s> next if (!/>/);
s> }

can you even tell what this is the else for? it scrolled off my screen.

s> } else {
s> $ln_cnt = 1;
s> $done = 1;

now we're done? do we exit the loop here? who knows? i have to scroll up
and find the if and see if that falls through or what??

s> }

s> When you have it buffered from above (or already buffered), parse it:

s> ## REGEX Parsing loop
s> while ($$ref_parse_ln =~ /$RxParseXP1/g)
s> {
s> ## CDATA
s> if (defined $0) {

oh that is wonderful. a true bug by the way. $0 is the name of the
program and not a grabbed match. not that he does any work in this if
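
For reference, a minimal sketch of testing captures in a /g loop. Capture groups are numbered from 1, so the first alternative is checked with $1 and the second with $2. The pattern and the name classify_sections are made up for this example.

```perl
use strict;
use warnings;

# Returns a list of 'cdata' / 'comment' labels in the order the sections occur.
sub classify_sections {
    my ($text) = @_;
    my @kinds;
    while ($text =~ /<!\[CDATA\[(.*?)\]\]>|<!--(.*?)-->/gs) {
        # Captures start at $1; $0 holds the script name, not a match.
        push @kinds, defined $1 ? 'cdata' : 'comment';
    }
    return @kinds;
}
```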

s> }
s> ## COMMENT
s> elsif (defined $1) {

again, an empty else clause. so what was the purpose of this if/else?
showing off more bad code is my guess.

s> }
s> }

s> }

s> Either way, whatever comes first, comment or CDATA, that takes
s> precedence till its closure.

either way it is horrible code by someone who doesn't know perl or
coding and has proven himself to be very psychotic. you trust this and
you will trust anything. so please stop defending this crap. i didn't
even analyze the actual logic as it is not clear from the poor design.

my guess is that you are his sycophantic alter ego.

uri
 
Uri Guttman

s> If you are stream processing and as you say spread over more than 1 line,
s> use ROBIC0's approach, buffer until you have complete comment or cdata:

s> Not sure it's relevant here

then you don't know what /x is for.

s> perhaps a goto is something more to your liking?

no. better code is to my liking. i can't recall needing a loop flag in
perl. that is more like basic. with all the long/deep nested if/elses
you can't track the logic at all.
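
As a rough sketch of buffering without a loop flag, the function below uses next and last as the only flow control. It is deliberately simplified: it only checks whether an opener has been seen without any closer in the buffer, so it is not a replacement for a real parser. The name first_complete_chunk is made up.

```perl
use strict;
use warnings;

# Appends lines to a buffer until no comment or CDATA section is left open,
# then returns the buffer. No flag variable is needed.
sub first_complete_chunk {
    my @lines = @_;
    my $buffer = '';
    for my $line (@lines) {
        $buffer .= $line;
        next if $buffer =~ /<!--/        && $buffer !~ /-->/;     # comment still open
        next if $buffer =~ /<!\[CDATA\[/ && $buffer !~ /\]\]>/;   # CDATA still open
        last;
    }
    return $buffer;
}
```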

s> # stream processing (if not buffered)
s> if (!$BUFFERED) {
s> From looking at code BUFFERED is a very important

important? one boolean test does not make something important.

s> I don't think this is true

then you don't know perl. sorry.

s> # just parse what we have
s> $done = 1;
s> huh?

forget it. if you think loop flags are good (especially when done like
this) then please don't code in perl.

s> # boundry check for runnaway
s> if (($complete_comment+$complete_cdata) > 0) {
s> Checking this, the addition and comparison here is half the
s> assembly instruction cycles than to do a double comparison with two
s> jmp's. Could be a performance thing

assembly? what does that have to do with perl? perl is compiled to an
internal form which is interpreted. actually that code is slower in perl
than a boolean test for several reasons. but you won't understand them
so i won't cover it.

s> what is this variable? have you checked?

looks like line count abbreviated. if it isn't, it is named poorly. if
it is, decrementing a line count makes no sense. but you may understand
it. i won't delve into the logic as i said the code is too bad to
bother.

s> It looks like he uses the fall-through method, avoiding the extra
s> machine cycles. I think the $done flag is set elsewhere as well

you again don't understand perl and machine cycles. perl HAS NO MACHINE
CYCLES. it has operations in an op loop. you don't optimize perl that
way. and this code could be optimized, simplified and made much better
with a decent design without the stupid loop flag and all those
if/elses.

would you believe i have a 10k line perl system with about 25 else
clauses in total? and it is very clean and readable code throughout. not
hard to do at all if you know perl and coding. eschew else is my new motto.

s> ## flag serialized comments/cdata buffering
s> if (/(<!--)|(<!\[CDATA\[)/)
s> not sure it's relevant for this, could be right, though with no effect

effect? what are you babbling about? /x HAS NO EFFECT on regexes. it is
not meant to have any effect (actually it does on the syntax but not
worth covering here). it is meant for CLARITY. but the author knows not
of that.

s> The if/then/else construct can't be avoided. The machining can be
s> trimmed. He did a good job thinning with single if's comparisons
s> to make a path back. I don't see any faster constructs than what
s> he's got

it can easily be simplified. just a poor design requires all that
if/else stuff. read what i said above with eschew else. this code is
done in basic style.

s> }
s> }
s> elsif (defined $2) { # complete cdata
s> Yea its either one or the other. Did he expect more?

huh?? i was suggesting how to clean up that mess. match and test one,
then match and test the other. the way he did it is longer, slower,
clunkier.

s> if ($$ref_parse_ln !~

s> Looks like he positively wanted to know if that was case. Looks alright

ok, you have blinders on. i give up.

s> what "else" are you looking at. Apparently, />/ is like a period at
s> the end of a sentence. Otherwise better to not do anything at the
s> termination of this block or will be trouble, you are in the middle
s> of a sentence

you don't get it.

s> } else {
s> $ln_cnt = 1;
s> $done = 1;
s> To me it looks like if its streamed (not BUFFERED) he is waiting
s> for a complete sentence with the />/ to pass through to the formal
s> parser below this. If it is BUFFERED, means the complete file is
s> passed to the formal parser. Streaming, it looks like he waits for
s> a complete sentence, parses, goes back up top, gets another, etc...

insane. you can merge both into one flow with no troubles at all. done
all the time in parsers. you make the input an iterator or sub that
works on the stream or the text. reduces his code by half as there is no
need for such a long if/else block.
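
One way to read this suggestion, sketched under the assumption that the input is either an open filehandle or a string already in memory: wrap both in a closure that yields one line per call, so the parsing loop never needs to know which it has. The names make_line_reader and collect_lines are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that yields the next line,
# or undef at end of input, for a filehandle or an in-memory string.
sub make_line_reader {
    my ($source) = @_;
    my $fh = ref $source ? $source : do {
        open my $mem, '<', \$source or die "cannot read from string: $!";
        $mem;
    };
    return sub { return scalar <$fh> };
}

# A consumer that works the same way for either kind of input.
sub collect_lines {
    my ($next_line) = @_;
    my @lines;
    while (defined(my $line = $next_line->())) {
        push @lines, $line;
    }
    return @lines;
}
```

With this shape the streamed and buffered cases share one code path, and the consumer can be swapped for any loop that calls the closure.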

s> When you have it buffered from above (or already buffered), parse it:

s> ## REGEX Parsing loop
s> while ($$ref_parse_ln =~ /$RxParseXP1/g)
s> {
s> ## CDATA
s> if (defined $0) {
s> My mistake, I transposed his variables, should have been $1, $2

huh? what transpose? this is your code or his?

s> }
s> ## COMMENT
s> elsif (defined $1) {
s> I just meant to condense his code for an example. The unfilled block
s> is left as an exercise

i prefer to read real code, not empty clauses.


s> He's probably psychotic or genius, I don't know of him. The code
s> looks good though. It's possible that the "order" of the parse
s> regex is important. I think it is. Here:

i will bet the house on psychotic. look at his posting history
here. rants and drools and flames all over the place. no one here
respects his code at all.

$done = 1 ;

uri
 
Randal L. Schwartz

Sherm> If Robic0's code looks good to you, then you need to learn more
Sherm> Perl. Heck, you need to learn more about programming in *any*
Sherm> language. I don't mean that as a flame, just a simple statement of
Sherm> fact; the fact that you don't understand the many flaws in his code
Sherm> says more about your own lack of experience than it does about his
Sherm> code.

Or that we're dealing with a sockpuppet.
 
Uri Guttman

s> First off, I'm no Perl expert so I didn't judge his code in that.
s> I have used it with success at work and it's not bad for me.
s> I had to add to his handler setup and a couple of other things.
s> Overall though the core parse doesn't show conceptual errors,
s> but isn't compatible sometimes for old html.

s> You mention dislike for "else" clause, especially when nesting, and
s> only have 25 of them in 10k code lines. If one can avoid an else with a
s> return in the conditional block (or continue) it should be done, no
s> question. But you may want to re-think if it is believed that
s> nested code is being touched every time. No matter how constructs
s> tabularized through indirection (compiled) the end result is
s> machine jump code being executed. Whether it's relative (conditional)
s> or absolute jump. Your code paths are absent of jumps? Conditional
s> jumps are the fastest instruction per effectiveness a cpu can do.


you are very lost here. machine code and perl execution of conditionals
have ABSOLUTELY nothing in common. the layers separating perl from
machine code are deep and nasty. you obviously don't know about how
interpreters are written. please stop this illogical line you seem to
think matters. the key bottleneck in perl is the op code dispatch loop
and not any particular machine instruction. you don't seem to realize how
loops/branches are done in perl and that they are about the same speed
as most builtin simple ops. the machine jumps are not even on the radar
at that level. please study some interpreter designs and learn about
them. this machine language babble of yours is so off the mark it is not
funny.


s> I don't know of Perl "op" loops you talk about. I would have to
s> imagine them being the translation table into an execution
s> processor (interpreter) creating code pages for the processor. That
s> again produces a binary based on the "op" codes of the underlying
s> processor, when executed. To me, an "op code", from back in my
s> dissasembly days, is the character translation of the assembly
s> major "operation" into its binary code from the table.

no no no no. since you don't know about perl's guts why do you insist on
talking about them at a machine level? the machine is several levels
down from perl's source and you could never tell what machine code is
being executed for any perl op. and as i keep telling you perl ops are
way larger in cpu usage than any single machine instruction. your
assembler background is useless in understanding perl optimization.

s> Perl provides a C like language construct. I hope its a close
s> proximity in its constructs in performance, otherwise it should not
s> be.

it has no closeness to the metal at all. you don't understand
interpreter design at all. except for a few special cases which do some
JIT code generation, none generate any machine code directly. and even
those that do are not at the same level as hand written c.

s> Finally, Perl can't do pointers that I know of. Pointers offer more
s> granularity in code design when it comes to parsers, for instance, xml
s> is highly controlled by escape codes, albeit ascii ones. If one could
s> write code on that level, jump if >,!=,<,== based on the result in the
s> accumulator, the its the fastest possible. But a string comparison is
s> really out of the question.

again, you don't know what you are talking about. perl has references
and can do most anything with them that c could except for pointer math,
and perl's references are safer and can't cause core dumps. i will
take refs over pointers any day.

s> Apparently in C, you can parse xml/html, using a more granular approach
s> with pointers and bit-mapped state variables, as it pertains to control
s> characters. This however is not possible in Perl. As well I don't see
s> pointer arithmetic being available.

huh??? you are making no sense. there are many ways to parse
anything. the issue is that robic's parser is BAD and BUGGY perl. it may
work for your special cases but in the general sense it is broken in
many ways. it could be optimized in many ways, cleaned up, whatever but
it is crappy code. you are probably the only user of it besides its
psychotic author. just wait until you try to communicate with him for a
bug fix or improvement.

s> To me this guy looks like he did ok. You could probably make his code
s> more efficient. Sounds like a lot of folks here don't like him much

your approval means little. you have not shown any understanding of good
perl code, perl optimization, interpreter design, parser design,
etc. this means you are not experienced enough to properly give a
professional opinion on that module. as for people not liking him,
please google for his past postings and you will see why. as for his
code, i am not the only one who thinks it is crap. you are the only one
who likes it. think about that.

uri
 
