The definitive statement on parsing HTML with regular expressions

Tim McDaniel · Jan 29, 2013

I'd have to say that at
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
the first answer is definitive. I know that The Pony is real, for I
have fed carrots to His Effulgent Face. And I don't even know what
"Effulgent" means, except that it means His Face.

Actually, I just saw it on the Cheezburger Network and thought it was
funny.

And yes, if you *know* that your HTML is simple and limited (for
example, generated by a known program), you may be able to parse those
particular files with regexps.
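
To make the "simple and limited" case concrete, here is a minimal sketch in Python (chosen only for illustration; the thread itself is about Perl). The sample markup, the `LINK` pattern and the `extract_links` helper are all hypothetical: they assume a generator that always emits one list item per line with the same attributes, in the same order, with double quotes.

```python
import re

# Hypothetical output of a known generator: fixed layout, no nesting,
# attributes always double-quoted and always in this order.
GENERATED = """<ul>
<li class="item"><a href="/a.html">Alpha</a></li>
<li class="item"><a href="/b.html">Beta</a></li>
</ul>"""

# The pattern encodes the generator's exact format. It is not a general
# HTML matcher and breaks as soon as the format changes.
LINK = re.compile(r'<li class="item"><a href="([^"]*)">([^<]*)</a></li>')


def extract_links(text):
    """Return (href, label) pairs from the fixed-format list above."""
    return LINK.findall(text)


if __name__ == "__main__":
    for href, label in extract_links(GENERATED):
        print(href, label)
```

This only holds up because the input format is under your control. Single quotes, a reordered attribute or a line break inside a tag would all slip past the pattern silently.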

Tim McDaniel · Jan 30, 2013

Quoth (e-mail address removed):

That was posted a *long* time ago...

March 2012 now counts as "a *long* time ago" in Interweb Time.
In any event, I wrote,

(A BtVS reference?)

If so, only by accident. Looking up "effulgent", I should have
written "Darkly Effulgent" for better effect.

It is, in fact, possible to parse HTML correctly with Perl regexen ....
Below is a pattern which matches valid XML

Um, your goalposts seem to be moving.

However, it's currently rather difficult to modify it to do
anything *useful* with the result, most importantly because of the
limitations on both (?(DEFINE)) and (?{}).

Bit of a drawback, eh wot? as few people want to merely recognize XML.

In any event, I think it's difficult to parse HTML or XML *correctly*
with *any* technology, due to corner cases and features. In general,
a better answer is usually to use an existing module.
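
As a rough illustration of the "use an existing module" advice, the sketch below uses `html.parser` from Python's standard library as a stand-in for whatever parser module your own language offers. The `LinkCollector` class, the `collect_links` helper and the `MESSY` sample are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical input with several corner cases at once: upper-case tags,
# single-quoted and unquoted attribute values, and a link inside a comment.
MESSY = """<p>See <A HREF='one.html'>one</A>,
<a class=x href=two.html>two</a> and <!-- <a href="no.html"> --> done.</p>"""

if __name__ == "__main__":
    print(collect_links(MESSY))
```

The corner cases are handled by code that many other people have already exercised, which is the whole point of reaching for a module.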

Rainer Weikusat · Jan 30, 2013

(e-mail address removed) (Tim McDaniel) writes:

[...]

Bit of a drawback, eh wot? as few people want to merely recognize XML.

In any event, I think it's difficult to parse HTML or XML *correctly*
with *any* technology, due to corner cases and features. In general,
a better answer is usually to use an existing module.

The conclusion "it is difficult" => "everybody else must have solved
it correctly already" seems a little flimsy to me ...

Charlton Wilbur · Jan 30, 2013

RW> (e-mail address removed) (Tim McDaniel) writes: [...]

RW> The conclusion "it is difficult" => "everybody else must have
RW> solved it correctly already" seems a little flimsy to me ...

How the hell do you make that leap?

It is difficult, so it is better to use a mature code package that many
people have used (and thus tested) than it is to roll your own.

Charlton

brian d foy · Jan 31, 2013

Tim McDaniel said:
I'd have to say that at

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
the first answer is definitive.

It's certainly funny, and was dogma until tchrist actually solved it
with a recursive regex in a different Stackoverflow answer:

http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491

Charlton Wilbur · Jan 31, 2013

bdf> It's certainly funny, and was dogma until tchrist actually
bdf> solved it with a recursive regex in a different Stackoverflow
bdf> answer:

bdf> http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491

To be honest, before tchrist's answer it was dogma that was known to be
false by those of us who either understand the theory of computation
(since Perl's regular expressions stopped being strictly regular some
time ago) or who had to update or maintain a dog's breakfast of HTML
"parsing" using regular expressions.
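
The theory point can be sketched in a few lines. The example is in Python, whose standard `re` module has no recursive patterns, so an explicit depth counter plays the role that recursion plays in a Perl pattern. The `NESTED` sample and both helper functions are hypothetical and handle only bare `<div>` tags without attributes.

```python
import re

NESTED = "<div>a<div>b</div>c</div>"


def naive_div(text):
    """A non-recursive pattern: stops at the first closing tag it sees."""
    m = re.search(r"<div>.*?</div>", text, re.S)
    return m.group(0) if m else None


def balanced_div(text):
    """Return the first balanced <div>...</div> by counting nesting depth."""
    depth = 0
    start = None
    for m in re.finditer(r"<(/?)div>", text):
        if m.group(1) == "":      # opening tag
            if depth == 0:
                start = m.start()
            depth += 1
        elif depth > 0:           # closing tag (stray closers are ignored)
            depth -= 1
            if depth == 0:
                return text[start:m.end()]
    return None                   # no balanced element found


if __name__ == "__main__":
    print(naive_div(NESTED))
    print(balanced_div(NESTED))
```

The counter is what a strictly regular expression cannot keep, and it is what recursive constructs add to Perl's patterns.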

tchrist does continue to say that even though you CAN parse HTML with
Perl regular expressions, you probably SHOULDN'T, because the larger and
more sophisticated the problem, the better it is to use a real parser.
Which is wisdom, and I am not just saying that because I have been
saying it for 10+ years at this point.

Charlton
