how do you convert an html page into an xml page ?
How long is a piece of string ?
How many pages are you dealing with ? Is this a one-off "I want to
convert my site" or a regular "I want to scrape stock prices from
another site and make them into an XML feed" ?
What's "HTML" ? Is this well-coded valid HTML 3.2 / 4.0, XHTML or
some tag-soup written by a M$oft tool ? What happens if it's not
valid ? Can your code crash, abandon the page, scream for human help,
or must it make a best-attempt ?
Can you avoid this altogether ? Can you obtain the content by some
friendlier means, such as RSS, direct access to the database, or some
other source ?
Why do you want to do it ? There are no "XML pages", there are only
XML documents. If you want to end up with "a web page" at the end of
it, then raw XML isn't enough of a finishing point; you need to take
it further (for instance by transforming it with XSLT).
What is "XML" ? What DTD or Schema are you aiming at ?
For one-offs, use Dave Raggett's Tidy (easily obtained via HTML-Kit).
Even if you're not looking for an XHTML output, Tidy can be an
excellent pre-processor for sorting out ugly Tag Soup.
For screen-scrapes, use your favourite scripting language (Perl is
always a good start, but you could use Python or even JavaScript) and
use someone else's HTML parser.
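
As a rough illustration of the "someone else's parser" approach, here
is a minimal sketch in Python using only the standard library. It
assumes you want the links out of a page; the names LinkCollector and
html_links_to_xml, and the <links>/<link> output vocabulary, are made
up for this example and are not any standard. html.parser is fairly
forgiving of tag soup, but a truly mangled page may still need Tidy
first.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs from <a> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Tag soup often omits </a>; a new <a> closes the previous one.
            self._flush()
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self._flush()

    def close(self):
        # Flush buffered text, then any link left open at end of input.
        super().close()
        self._flush()

    def _flush(self):
        if self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
        self._href = None
        self._text = []


def html_links_to_xml(html):
    """Return a small XML document listing the links found in `html`."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    root = ET.Element("links")  # made-up vocabulary, for illustration only
    for href, text in parser.links:
        ET.SubElement(root, "link", href=href).text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    soup = '<p><a href="one.html">First<b>bold</a><a href=two.html>Second'
    print(html_links_to_xml(soup))
```

The point is that the parser deals with the unclosed tags and unquoted
attributes, and the XML is built by a serialiser rather than by string
concatenation, so the output is always well-formed.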
RSS 1.0 is a good XML vocabulary to target for generic screen scraping,
even if you don't think your content is "relevant" to a newsfeed (but
RSS 0.92 isn't).
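
To give an idea of what targeting RSS 1.0 looks like, here is a hedged
sketch in Python that wraps scraped (url, title) pairs in an RSS 1.0
document. The function name build_rss10 and the example.org URLs are
assumptions for illustration; only the two namespace URIs and the
channel / items / item layout come from the RSS 1.0 format itself.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"

# Ask ElementTree to serialise with the conventional prefixes.
ET.register_namespace("rdf", RDF)
ET.register_namespace("", RSS)


def build_rss10(channel_url, title, description, entries):
    """Build an RSS 1.0 document from (url, title) pairs (hypothetical helper)."""
    root = ET.Element(f"{{{RDF}}}RDF")

    channel = ET.SubElement(
        root, f"{{{RSS}}}channel", {f"{{{RDF}}}about": channel_url}
    )
    ET.SubElement(channel, f"{{{RSS}}}title").text = title
    ET.SubElement(channel, f"{{{RSS}}}link").text = channel_url
    ET.SubElement(channel, f"{{{RSS}}}description").text = description

    # The channel carries a table of contents: an rdf:Seq of item URLs.
    items = ET.SubElement(channel, f"{{{RSS}}}items")
    seq = ET.SubElement(items, f"{{{RDF}}}Seq")

    for url, item_title in entries:
        ET.SubElement(seq, f"{{{RDF}}}li", {f"{{{RDF}}}resource": url})
        # Each item is a sibling of the channel, not a child of it.
        item = ET.SubElement(root, f"{{{RSS}}}item", {f"{{{RDF}}}about": url})
        ET.SubElement(item, f"{{{RSS}}}title").text = item_title
        ET.SubElement(item, f"{{{RSS}}}link").text = url

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_rss10(
        "http://example.org/prices",          # placeholder URL
        "Scraped prices",
        "Prices scraped from another site",
        [("http://example.org/prices/acme", "ACME: 12.50")],
    ))
```

Anything that is a list of things with a title and a URL fits this
shape, which is why it works for content that isn't "news".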