htmltokenizer bug?

Horacio Sanson · Nov 28, 2005

I am using htmltokenizer to extract the links from some web pages. My script
worked perfectly until I started to parse pages with "<" and ">" chars in the
text.

An HTML string like this

<a href="an_uri" > this is a <link> </a>

causes the htmlparser to raise an exception: Error, tag is nil....

Is there a patch or any way to make htmlparser parse this text?

regards,
Horacio

Dick Davies · Nov 28, 2005

Horacio said:
I am using htmltokenizer to extract the links from some web pages. My script
worked perfectly until I started to parse pages with "<" and ">" chars in the
text.

An HTML string like this

<a href="an_uri" > this is a <link> </a>

causes the htmlparser to raise an exception: Error, tag is nil....

Is there a patch or any way to make htmlparser parse this text?

I think most *browsers* would choke on that

Have you tried using entities instead ?

(&lt; instead of < and &gt; instead of >)

Daniel Schierbeck · Nov 28, 2005

Horacio said:
I am using htmltokenizer to extract the links from some web pages. My script
worked perfectly until I started to parse pages with "<" and ">" chars in the
text.

An HTML string like this

<a href="an_uri" > this is a <link> </a>

causes the htmlparser to raise an exception: Error, tag is nil....

Is there a patch or any way to make htmlparser parse this text?

regards,
Horacio

Your HTML isn't valid. Either use the proper entities (&lt; for < and &gt;
for >) or make a CDATA section, though the latter isn't really that
well-supported in most browsers.

<a href="an_uri"><![CDATA[this is a <link>]]></a>
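
For anyone generating such markup themselves, here is a minimal sketch of the entity approach in Ruby. It assumes the standard library's CGI.escapeHTML is acceptable for the job; the variable names are only illustrative.

```ruby
require 'cgi'

# Text that should appear literally inside the link.
link_text = 'this is a <link>'

# CGI.escapeHTML replaces <, >, & and quote characters with entities.
escaped = CGI.escapeHTML(link_text)

html = %(<a href="an_uri">#{escaped}</a>)
puts html
# prints: <a href="an_uri">this is a &lt;link&gt;</a>
```

This only helps when you control the HTML being produced; it does not repair pages fetched from elsewhere.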

Cheers,
Daniel

Horacio Sanson · Nov 28, 2005

Well, the problem is that this HTML is not mine; I am retrieving the pages
from the Internet.

Guess I will skip this page from my script.
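
Skipping a bad page could look roughly like the sketch below. The helper name links_from_pages and the stand-in extractor are hypothetical; the real htmltokenizer call would go where the lambda is, and this assumes it raises a StandardError on markup it cannot handle.

```ruby
# Hypothetical helper: run a link extractor over many pages and skip
# any page whose markup makes the extractor raise.
# `pages` maps a URL to its HTML; `extractor` is any callable that
# takes HTML and returns an array of links.
def links_from_pages(pages, extractor)
  links = []
  skipped = []
  pages.each do |url, html|
    begin
      links.concat(extractor.call(html))
    rescue StandardError => e
      skipped << url
      warn "skipping #{url}: #{e.message}"
    end
  end
  [links, skipped]
end

# Stand-in for the real tokenizer call. It raises on a bare "<link>"
# only to imitate the failure described in this thread.
fake_extractor = lambda do |html|
  raise 'tag is nil' if html.include?('<link>')
  html.scan(/href="([^"]*)"/).flatten
end
```

Calling links_from_pages(pages, fake_extractor) returns the collected links together with the list of URLs that were skipped, so the bad pages can be looked at later.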

thanks,
Horacio

Daniel Amelang · Dec 2, 2005

Sorry for the late reply.

I'm surprised no one mentioned RubyfulSoup:

http://www.crummy.com/software/RubyfulSoup/

If I understand your problem correctly, it's exactly what you need: a
forgiving html parser.

Dan

James Britt · Dec 2, 2005

Daniel said:
Sorry for the late reply.

I'm surprised no one mentioned RubyfulSoup:

http://www.crummy.com/software/RubyfulSoup/

If I understand your problem correctly, it's exactly what you need: a
forgiving html parser.

I recently tried using RubyfulSoup to parse a Web page, and it had some
peculiar behavior, such as stripping all attributes. Either I was not
using it correctly, or it was a bit too casual in making sense of the input.

I ended up using some crude string parsing to extract just the subset of
the page I wanted, which gave me well-formed XML suitable for REXML
manipulation. I got a phenomenal speed increase from that as well;
RubyfulSoup seems quite slow.
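
As an illustration of the crude string parsing idea applied to the link-extraction problem from the start of this thread (not the actual code used here), a regular expression can pull href values out of start tags without tokenizing the rest of the page. HREF_PATTERN and crude_links are made-up names, and the pattern only handles quoted attribute values.

```ruby
# Match an <a ...> start tag and capture its quoted href value.
# Group 1 is the quote character, group 2 is the value.
HREF_PATTERN = /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/im

# Return every href found in the given HTML string.
def crude_links(html)
  html.scan(HREF_PATTERN).map { |_quote, href| href }
end

puts crude_links('<a href="an_uri" > this is a <link> </a>')
# prints: an_uri
```

Stray "<" and ">" characters in the text are simply ignored, which is the point, but the pattern will also miss unquoted attribute values and can be fooled by markup inside comments or scripts.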

James
--

http://www.ruby-doc.org - Ruby Help & Documentation
http://www.artima.com/rubycs/ - Ruby Code & Style: Writers wanted
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys
http://www.30secondrule.com - Building Better Tools
