changing URLs in webpages, python solutions?

Simon Forman · Jan 18, 2009

Hey folks,

I'm having trouble finding this through google so I figured I'd ask
here.

I want to take a webpage, find all URLs (links, img src, etc.) and
rewrite them in-place, and I'd like to do it in python (pure python
preferred.)

I know I could probably roll my own halfway decent solution in a day
or so, like with Beautiful Soup or something, but I'd much rather re-
use someone else's work if I can. I'm pretty sure this falls into the
category of "someone else has had, and solved, this problem before."

Any recommendations?

Thanks in advance,
~Simon

Abandoned · Jan 18, 2009

Hey folks,

I'm having trouble finding this through google so I figured I'd ask
here.

I want to take a webpage, find all URLs (links, img src, etc.) and
rewrite them in-place, and I'd like to do it in python (pure python
preferred.)

I know I could probably roll my own halfway decent solution in a day
or so, like with Beautiful Soup or something, but I'd much rather re-
use someone else's work if I can. I'm pretty sure this falls into the
category of "someone else has had, and solved, this problem before."

Any recommendations?

Thanks in advance,
~Simon

You can use code search engines.
I founded there but now i don't have.

Stefan Behnel · Jan 18, 2009

Simon said:
I want to take a webpage, find all URLs (links, img src, etc.) and
rewrite them in-place, and I'd like to do it in python (pure python
preferred.)

lxml.html has functions specifically for this problem.

http://codespeak.net/lxml/lxmlhtml.html#working-with-links

Code would be something like

html_doc = lxml.html.parse(b"http://.../xyz.html")
html_doc.rewrite_links( ... )
print( lxml.html.tostring(html_doc) )

It also handles links in CSS or JavaScript, as well as broken HTML documents.

Stefan

Simon Forman · Jan 19, 2009

lxml.html has functions specifically for this problem.

http://codespeak.net/lxml/lxmlhtml.html#working-with-links

Code would be something like

html_doc = lxml.html.parse(b"http://.../xyz.html")
html_doc.rewrite_links( ... )
print( lxml.html.tostring(html_doc) )

It also handles links in CSS or JavaScript, as well as broken HTML documents.

Stefan

Thank you so much! This is exactly what I needed.

(for what it's worth, parse() seems to return a "plain" ElementTree
object but document_fromstring() gave me a special html object with
the rewrite_links() method. I think, but I didn't try it, that

html_doc = lxml.html.parse(b"http://.../xyz.html")
lxml.html.rewrite_links(html_doc, ... )

would work too.)

Thanks again!
Regards,
~Simon

(For gurus) Embed functions in webpages like in PHP?	11	Apr 16, 2004
Parse cisco's "show ip route" output in Python 2.7	0	Mar 5, 2012
Long rant about Python in Education	2	Aug 12, 2010
help: SIGABRT intermittent crash for threaded website crawleron python 2.4.4c1	0	Oct 3, 2008
Dealing with application names in a JEE web app	18	May 23, 2011
Embedding Python in Windows tutorial?	2	Oct 19, 2004
Old Paranoia Game in Python	15	Jan 9, 2005
PyWart: PEP8: a seething cauldron of inconsistencies.	1	Jul 28, 2011

changing URLs in webpages, python solutions?

Simon Forman

Abandoned

Stefan Behnel

Simon Forman

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads