Parsing HTML, extracting text and changing attributes.

sebzzz · Jun 18, 2007

Hi,

I work at this company and we are re-building our website: http://caslt.org/.
The new website will be built by an external firm (I could do it
myself, but since I'm just the summer student worker...). Anyways, to
help them, they first asked me to copy all the text from all the pages
of the site (and there is a lot!) to word documents. I found the idea
pretty stupid since style would have to be applied from scratch anyway
since we don't want to get neither old html code behind nor Microsoft
Word BS code.

I proposed to take each page and making a copy with only the text, and
with class names for the textual elements (h1, h1, p, strong, em ...)
and then define a css file giving them some style.

Now, we have around 1 600 documents do work on, and I thought I could
challenge myself a bit and automate all the dull work. I thought about
the possibility of parsing all those pages with python, ripping of the
navigations bars and just keeping the text and layout tags, and then
applying class names to specific tags. The program would also have to
remove the table where text is located in. And other difficulty is
that I want to be able to keep tables that are actually used for
tabular data and not positioning.

So, I'm writing this to have your opinion on what tools I should use
to do this and what technique I should use.

Rob Wolfe · Jun 18, 2007

[email protected] said:
So, I'm writing this to have your opinion on what tools I should use
to do this and what technique I should use.

Take a look at parsing example on this page:
http://wiki.python.org/moin/SimplePrograms

Stefan Behnel · Jun 18, 2007

I work at this company and we are re-building our website: http://caslt.org/.
The new website will be built by an external firm (I could do it
myself, but since I'm just the summer student worker...). Anyways, to
help them, they first asked me to copy all the text from all the pages
of the site (and there is a lot!) to word documents. I found the idea
pretty stupid since style would have to be applied from scratch anyway
since we don't want to get neither old html code behind nor Microsoft
Word BS code.

I proposed to take each page and making a copy with only the text, and
with class names for the textual elements (h1, h1, p, strong, em ...)
and then define a css file giving them some style.

Now, we have around 1 600 documents do work on, and I thought I could
challenge myself a bit and automate all the dull work. I thought about
the possibility of parsing all those pages with python, ripping of the
navigations bars and just keeping the text and layout tags, and then
applying class names to specific tags. The program would also have to
remove the table where text is located in. And other difficulty is
that I want to be able to keep tables that are actually used for
tabular data and not positioning.

So, I'm writing this to have your opinion on what tools I should use
to do this and what technique I should use.

lxml is what you're looking for, especially if you're familiar with XPath.

http://codespeak.net/lxml/dev

Stefan

Neil Cerutti · Jun 18, 2007

I work at this company and we are re-building our website: http://caslt.org/.
The new website will be built by an external firm (I could do it
myself, but since I'm just the summer student worker...). Anyways, to
help them, they first asked me to copy all the text from all the pages
of the site (and there is a lot!) to word documents. I found the idea
pretty stupid since style would have to be applied from scratch anyway
since we don't want to get neither old html code behind nor Microsoft
Word BS code.

I proposed to take each page and making a copy with only the text, and
with class names for the textual elements (h1, h1, p, strong, em ...)
and then define a css file giving them some style.

Now, we have around 1 600 documents do work on, and I thought I could
challenge myself a bit and automate all the dull work. I thought about
the possibility of parsing all those pages with python, ripping of the
navigations bars and just keeping the text and layout tags, and then
applying class names to specific tags. The program would also have to
remove the table where text is located in. And other difficulty is
that I want to be able to keep tables that are actually used for
tabular data and not positioning.

So, I'm writing this to have your opinion on what tools I
should use to do this and what technique I should use.

You could get good results, and save yourself some effort, using
links or lynx with the command line options to dump page text to
a file. Python would still be needed to automate calling links or
lynx on all your documents.

Jay Loden · Jun 18, 2007

Neil said:
You could get good results, and save yourself some effort, using
links or lynx with the command line options to dump page text to
a file. Python would still be needed to automate calling links or
lynx on all your documents.

OP was looking for a way to parse out part of the file and apply classes to certain types of tags. Using lynx/links wouldn't help, since the output of links or lynx is going to end up as plain text and the desire isn't to strip all the formatting.

Someone else mentioned lxml but as I understand it lxml will only work if it's valid XHTML that they're working with. Assuming it's not (since real-world HTML almost never is), perhaps BeautifulSoup will fare better.

http://www.crummy.com/software/BeautifulSoup/documentation.html

-Jay

Stefan Behnel · Jun 18, 2007

Jay said:
Someone else mentioned lxml but as I understand it lxml will only work if
it's valid XHTML that they're working with.

No, it was meant as the OP requested. It even has a very good parser from
broken HTML.

http://codespeak.net/lxml/dev/parsing.html#parsing-html

Stefan

Jay Loden · Jun 18, 2007

Stefan said:
No, it was meant as the OP requested. It even has a very good parser from
broken HTML.

http://codespeak.net/lxml/dev/parsing.html#parsing-html

I stand corrected, I missed that whole part of the LXML documentation

Jay Loden · Jun 18, 2007

Stefan said:
No, it was meant as the OP requested. It even has a very good parser from
broken HTML.

http://codespeak.net/lxml/dev/parsing.html#parsing-html

I stand corrected, I missed that whole part of the LXML documentation

sebzzz · Jun 18, 2007

I see there is a couple of tools I could use, and I also heard of
sgmllib and htmllib. So now there is lxml, Beautiful soup, sgmllib,
htmllib ...

Is there any of those tools that does the job I need to do more easily
and what should I use? Maybe a combination of those tools, which one
is better for what part of the work?

Stefan Behnel · Jun 18, 2007

I see there is a couple of tools I could use, and I also heard of
sgmllib and htmllib. So now there is lxml, Beautiful soup, sgmllib,
htmllib ...

Is there any of those tools that does the job I need to do more easily
and what should I use? Maybe a combination of those tools, which one
is better for what part of the work?

Well, as I said, use lxml. It's fast, pythonically easy to use, extremely
powerful and extensible. Apart from being the main author

, I actually use
it for lots of tiny things more or less like what you're off to. It's just
plain great for a quick script that gets you from A to B for a bag of documents.

Parse it in with HTML parser (even from URLs), then use XPath to extract
(exactly) what you want and then work on it as you wish. That's short and
simple in lxml.

http://codespeak.net/lxml/dev/tutorial.html
http://codespeak.net/lxml/dev/parsing.html#parsing-html
http://codespeak.net/lxml/dev/xpathxslt.html#xpath

Stefan

Changing .html in URL	3	Jul 11, 2022
"input-group-text" help	7	Aug 10, 2023
Stuck with html and css	25	Dec 14, 2022
Data saving in condition of changing reality	0	Apr 29, 2022
HTML Site Problems	11	Nov 25, 2019
Measuring a string of text	1	Sep 15, 2022
HTML Anchor tag not working	2	Dec 15, 2020
HTML Parsing and Indexing	5	Nov 13, 2006

Parsing HTML, extracting text and changing attributes.

sebzzz

Rob Wolfe

Stefan Behnel

Neil Cerutti

Jay Loden

Stefan Behnel

Jay Loden

Jay Loden

sebzzz

Stefan Behnel

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads