changing URLs in webpages, python solutions?

S

Simon Forman

Hey folks,

I'm having trouble finding this through google so I figured I'd ask
here.

I want to take a webpage, find all URLs (links, img src, etc.) and
rewrite them in-place, and I'd like to do it in python (pure python
preferred.)

I know I could probably roll my own halfway decent solution in a day
or so, like with Beautiful Soup or something, but I'd much rather re-
use someone else's work if I can. I'm pretty sure this falls into the
category of "someone else has had, and solved, this problem before."

Any recommendations?

Thanks in advance,
~Simon
 
A

Abandoned

Hey folks,

I'm having trouble finding this through google so I figured I'd ask
here.

I want to take a webpage, find all URLs (links, img src, etc.) and
rewrite them in-place, and I'd like to do it in python (pure python
preferred.)

I know I could probably roll my own halfway decent solution in a day
or so, like with Beautiful Soup or something, but I'd much rather re-
use someone else's work if I can.  I'm pretty sure this falls into the
category of "someone else has had, and solved, this problem before."

Any recommendations?

Thanks in advance,
~Simon

You can use code search engines.
I founded there but now i don't have.
 
S

Stefan Behnel

Simon said:
I want to take a webpage, find all URLs (links, img src, etc.) and
rewrite them in-place, and I'd like to do it in python (pure python
preferred.)

lxml.html has functions specifically for this problem.

http://codespeak.net/lxml/lxmlhtml.html#working-with-links

Code would be something like

html_doc = lxml.html.parse(b"http://.../xyz.html")
html_doc.rewrite_links( ... )
print( lxml.html.tostring(html_doc) )

It also handles links in CSS or JavaScript, as well as broken HTML documents.

Stefan
 
S

Simon Forman

lxml.html has functions specifically for this problem.

http://codespeak.net/lxml/lxmlhtml.html#working-with-links

Code would be something like

        html_doc = lxml.html.parse(b"http://.../xyz.html")
        html_doc.rewrite_links( ... )
        print( lxml.html.tostring(html_doc) )

It also handles links in CSS or JavaScript, as well as broken HTML documents.

Stefan

Thank you so much! This is exactly what I needed.

(for what it's worth, parse() seems to return a "plain" ElementTree
object but document_fromstring() gave me a special html object with
the rewrite_links() method. I think, but I didn't try it, that

html_doc = lxml.html.parse(b"http://.../xyz.html")
lxml.html.rewrite_links(html_doc, ... )

would work too.)

Thanks again!
Regards,
~Simon
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,299
Messages
2,571,545
Members
48,299
Latest member
Ruby87897

Latest Threads

Top