FAQ 6.4 How do I match XML, HTML, or other nasty, ugly things with a regex?

PerlFAQ Server

This is an excerpt from the latest version of perlfaq6.pod, which
comes with the standard Perl distribution. These postings aim to
reduce the number of repeated questions as well as allow the community
to review and update the answers. The latest version of the complete
perlfaq is at http://faq.perl.org .

--------------------------------------------------------------------

6.4: How do I match XML, HTML, or other nasty, ugly things with a regex?

(contributed by brian d foy)

If you just want to get work done, use a module and forget about the
regular expressions. The "XML::Parser" and "HTML::Parser" modules are
good starts, although each namespace has other parsing modules
specialized for certain tasks and different ways of doing it. Start at
CPAN Search ( http://search.cpan.org ) and wonder at all the work people
have done for you already! :)
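As a rough illustration of the module route, here is a minimal sketch that
pulls link targets out of a fragment with HTML::Parser (a CPAN module, not
part of core Perl, so it has to be installed first). The sample markup and
the @links variable are made up for this example.

```perl
use strict;
use warnings;
use HTML::Parser;    # from CPAN; assumed to be installed

my @links;

# Version 3 API: register a handler for start tags and ask for the
# tag name and the attribute hash as its arguments.
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tagname, $attr) = @_;
            push @links, $attr->{href}
                if $tagname eq 'a' && defined $attr->{href};
        },
        'tagname, attr',
    ],
);

# Hypothetical input, only for demonstration.
$parser->parse('<p>See <a href="http://faq.perl.org">the FAQ</a> and '
             . '<a href="http://search.cpan.org">CPAN</a>.</p>');
$parser->eof;

print "$_\n" for @links;
```

The parser deals with quoting, attribute order, and odd whitespace, so your
code only has to say what it wants to do with each tag.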

The problem with things such as XML is that they have balanced text
containing multiple levels of balanced text, but sometimes it isn't
balanced text, as in an empty tag ("<br/>", for instance). Even then,
things can occur out-of-order. Just when you think you've got a pattern
that matches your input, someone throws you a curveball.
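To make those curveballs concrete, here is a small sketch of three simple
patterns going wrong. The sample strings are invented for illustration.

```perl
use strict;
use warnings;

# Nested elements: a non-greedy match stops at the FIRST closing tag,
# so it pairs the outer <div> with the inner </div>.
my $nested = '<div>outer <div>inner</div> tail</div>';
my ($non_greedy) = $nested =~ m{<div>(.*?)</div>};
print "non-greedy: $non_greedy\n";    # non-greedy: outer <div>inner

# Sibling elements: a greedy match runs to the LAST closing tag,
# so it swallows both elements and the text between them.
my $siblings = '<b>one</b> and <b>two</b>';
my ($greedy) = $siblings =~ m{<b>(.*)</b>};
print "greedy: $greedy\n";            # greedy: one</b> and <b>two

# Empty tags: a pattern that expects an open/close pair never matches.
my $empty    = 'line one<br/>line two';
my $has_pair = $empty =~ m{<br>.*?</br>} ? 1 : 0;
print "pair found: $has_pair\n";      # pair found: 0
```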

If you'd like to do it the hard way, scratching and clawing your way
toward a right answer but constantly being disappointed, besieged by bug
reports, and weary from the inordinate amount of time you have to spend
reinventing a triangular wheel, then there are several things you can
try before you give up in frustration:

* Solve the balanced text problem from another question in perlfaq6

* Try the recursive regex features in Perl 5.10 and later. See perlre.

* Try defining a grammar using Perl 5.10's "(?(DEFINE)...)" feature.

* Break the problem down into sub-problems instead of trying to use a
single regex

* Convince everyone not to use XML or HTML in the first place
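
For the recursive regex and grammar suggestions above, here is a minimal
sketch that needs Perl 5.10 or later. It assumes a drastically simplified
input: "<div>" tags with no attributes, and no comments or CDATA sections.
The names "element", "open", "close", and "content" are invented for this
example. Real markup breaks these assumptions quickly, which is the point
of this answer.

```perl
use strict;
use warnings;
use 5.010;

# Recursive pattern: an element is <div>, then any mix of plain text,
# nested elements, and other tags, then </div>.
my $balanced_div = qr{
    (?<element>
        <div>
        (?:
            [^<]++            # plain text with no tags in it
          | (?&element)       # a nested div, matched recursively
          | < (?!/?div>)      # the start of some other tag
        )*+
        </div>
    )
}x;

# The same idea written as a small grammar inside a DEFINE block.
my $grammar = qr{
    (?&element)
    (?(DEFINE)
        (?<element> (?&open) (?&content) (?&close) )
        (?<open>    <div> )
        (?<close>   </div> )
        (?<content> (?: [^<]++ | (?&element) | < (?!/?div>) )*+ )
    )
}x;

# Hypothetical input, only for demonstration.
my $html = 'x <div>outer <div>inner</div> <br/> tail</div> y';

my ($match)         = $html =~ /($balanced_div)/;
my ($grammar_match) = $html =~ /($grammar)/;

print "$match\n";            # <div>outer <div>inner</div> <br/> tail</div>
print "$grammar_match\n";    # same string as above

# An element that is never closed does not match at all.
my $unbalanced = '<div>never closed' =~ $balanced_div ? 1 : 0;
print "unbalanced matched: $unbalanced\n";    # unbalanced matched: 0
```

Both patterns find the outermost balanced element, but neither knows about
attributes, comments, or mismatched tags, so treat this as a starting point
and not as a parser.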

Good luck!



--------------------------------------------------------------------

The perlfaq-workers, a group of volunteers, maintain the perlfaq. They
are not necessarily experts in every domain where Perl might show up,
so please include as much information as possible and relevant in any
corrections. The perlfaq-workers also don't have access to every
operating system or platform, so please include relevant details for
corrections to examples that do not work on particular platforms.
Working code is greatly appreciated.

If you'd like to help maintain the perlfaq, see the details in
perlfaq.pod.
 
