The definitive statement on parsing HTML with regular expressions

Tim McDaniel

I'd have to say that at
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
the first answer is definitive. I know that The Pony is real, for I
have fed carrots to His Effulgent Face. And I don't even know what
"Effulgent" means, except that it means His Face.

Actually, I just saw it on the Cheezburger Network and thought it was
funny.

And yes, if you *know* that your HTML is simple and limited (for
example, generated by a known program), you may be able to parse those
particular files with regexps.
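
A rough sketch of that narrow case, in Python rather than Perl. The markup and the names `GENERATED`, `LINK_RE` and `extract_links` are made up for illustration. It assumes the generating program always emits anchors in exactly one shape: a double-quoted `href` as the only attribute and plain text inside the link.

```python
import re

# Hypothetical output of a known generator: every anchor has the form
# <a href="...">plain text</a>, with nothing else inside the tag.
GENERATED = """<ul>
<li><a href="/one.html">One</a></li>
<li><a href="/two.html">Two</a></li>
</ul>"""

# Valid only under the assumptions above; this is not a general HTML parser.
LINK_RE = re.compile(r'<a href="([^"]*)">([^<]*)</a>')


def extract_links(html):
    """Return (href, text) pairs from the restricted markup."""
    return LINK_RE.findall(html)


print(extract_links(GENERATED))
```

The moment the input strays from that shape (single quotes, an extra attribute, a tag inside the link text), the pattern silently finds nothing, which is the usual way such code goes wrong.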
 
Tim McDaniel

Quoth (e-mail address removed):

> That was posted a *long* time ago...

March 2012 now counts as "a *long* time ago" in Interweb Time.
In any event, I wrote,
(A BtVS reference?)

If so, only by accident. Looking up "effulgent", I should have
written "Darkly Effulgent" for better effect.
> It is, in fact, possible to parse HTML correctly with Perl regexen ....
> Below is a pattern which matches valid XML

Um, your goalposts seem to be moving.
> However, it's currently rather difficult to modify it to do
> anything *useful* with the result, most importantly because of the
> limitations on both (?(DEFINE)) and (?{}).

Bit of a drawback, eh wot? as few people want to merely recognize XML.

In any event, I think it's difficult to parse HTML or XML *correctly*
with *any* technology, due to corner cases and features. In general,
a better answer is usually to use an existing module.
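
For a concrete idea of what "use an existing module" buys you, here is a minimal sketch using Python's standard `html.parser` (a Perl program would reach for a CPAN parser instead). The `LinkCollector` class, the `collect_links` helper and the `SAMPLE` markup are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, however they are quoted or cased."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag and attribute names already lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def collect_links(html):
    """Feed the markup to the parser and return the collected hrefs."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Corner cases that a hand-written pattern tends to miss: upper-case names,
# single quotes, an unquoted attribute, a commented-out link, and a tag
# split across lines.
SAMPLE = """<A class=x HREF='/one.html'>One</A>
<!-- <a href="/commented-out.html">no</a> -->
<a
   href="/two.html" >Two</a>"""

print(collect_links(SAMPLE))
```

The quoting, case and comment handling all live in the library, so the calling code only says what it wants from each tag.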
 
Rainer Weikusat

(e-mail address removed) (Tim McDaniel) writes:

[...]
> Bit of a drawback, eh wot? as few people want to merely recognize XML.
>
> In any event, I think it's difficult to parse HTML or XML *correctly*
> with *any* technology, due to corner cases and features. In general,
> a better answer is usually to use an existing module.

The conclusion "it is difficult" => "everybody else must have solved
it correctly already" seems a little flimsy to me ...
 
Charlton Wilbur

RW> (e-mail address removed) (Tim McDaniel) writes: [...]

RW> The conclusion "it is difficult" => "everybody else must have
RW> solved it correctly already" seems a little flimsy to me ...

How the hell do you make that leap?

It is difficult, so it is better to use a mature code package that many
people have used (and thus tested) than it is to roll your own.

Charlton
 
Charlton Wilbur

bdf> It's certainly funny, and was dogma until tchrist actually
bdf> solved it with a recursive regex in a different Stackoverflow
bdf> answer:

bdf> http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491

To be honest, before tchrist's answer it was dogma that was known to be
false by those of us who either understand the theory of computation
(since Perl's regular expressions stopped being strictly regular some
time ago) or who had to update or maintain a dog's breakfast of HTML
"parsing" using regular expressions.
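
The theory point comes down to nesting: matching a tag with its own closing tag needs some form of counting, which a strictly regular pattern cannot do. Perl can express that inside the pattern with recursion constructs such as (?R). Python's standard `re` module has no recursion, so the sketch below swaps in an explicit depth counter instead. The `NESTED` sample and the `outer_div_content` helper are hypothetical, and the code assumes bare `<div>` tags with no attributes.

```python
import re

NESTED = "<div>a<div>b</div>c</div>"

# A non-recursive pattern stops at the first closing tag, so it pairs the
# outer <div> with the inner </div>.
naive = re.search(r"<div>(.*?)</div>", NESTED)
print(naive.group(1))  # a<div>b

# Matches an opening or closing div tag; group 1 is "/" for a closing tag.
TAG_RE = re.compile(r"<(/?)div>")


def outer_div_content(text):
    """Return the content of the first outermost <div>, or None if unbalanced."""
    depth = 0
    begin = None
    for m in TAG_RE.finditer(text):
        if m.group(1) == "":      # opening tag
            if depth == 0:
                begin = m.end()
            depth += 1
        elif depth > 0:           # closing tag with a matching open tag
            depth -= 1
            if depth == 0:
                return text[begin:m.start()]
    return None


print(outer_div_content(NESTED))  # a<div>b</div>c
```

The counter is the extra machinery that takes this beyond a regular language. Recursive patterns fold the same bookkeeping into the regex engine.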

tchrist does continue to say that even though you CAN parse HTML with
Perl regular expressions, you probably SHOULDN'T, because the larger and
more sophisticated the problem, the better it is to use a real parser.
Which is wisdom, and I am not just saying that because I have been
saying it for 10+ years at this point.

Charlton
 
