cut strings and parse for images

Andreas Volz · Dec 6, 2004

Hi,

I used SGMLParser to extract all the hrefs from an HTML file. Now I need to cut
some strings. For example:

http://www.example.com/dir/example.html

I would like to cut the string so that only the domain and directory are
left. Expected result:

http://www.example.com/dir/

I know how to do this in bash, but not in Python. How could
this be done?

The next problem is to extract not only hrefs, but also images. An href
is easy:

<a href="install.php">Install</a>

But an image is a little harder:

<img class="bild" src="images/marine.jpg">

This is my current example code:

from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen(leach_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        print url

Perhaps you have some tips on how to solve these problems?

regards
Andreas

Paul McGuire · Dec 6, 2004

Check out the urlparse module (in std distribution). For images, you can
provide a default addressing scheme, so you can expand "images/marine.jpg"
relative to the current location.
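
As a minimal sketch of the "cut the string" part of the question: the thread's urlparse module lives on in Python 3 as urllib.parse, which is what the example below assumes. The helper name base_directory is hypothetical and not part of the standard library.

```python
from urllib.parse import urlsplit, urlunsplit


def base_directory(url):
    """Return the URL with its last path component (the file name) removed."""
    parts = urlsplit(url)
    # Keep everything up to and including the last "/" of the path.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    # Drop query and fragment; they belong to the file, not the directory.
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))


print(base_directory("http://www.example.com/dir/example.html"))
```

Splitting the URL first keeps the scheme and host out of the string surgery, so only the path component is trimmed.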

-- Paul

Steve Holden · Dec 6, 2004

from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):

    def reset(self):
        SGMLParser.reset(self)
        self.urls = []
        self.images = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

    def do_img(self, attrs):
        "We assume each image *has* a src attribute."
        for k, v in attrs:
            if k == 'src':
                self.images.append(v)
                break

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen(leach_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    print "URLs:"
    for url in parser.urls:
        print url
    print "IMGs:"
    for img in parser.images:
        print img

$ python sgml1.py
URLs:
about.php
install.php
features.php
http://www.stratagus.org/
http://www.stratagus.org/
http://www.blizzard.com/
http://sourceforge.net/projects/stargus/
IMGs:
images/stargus_banner.jpg
images/marine.jpg
http://sourceforge.net/sflogo.php?group_id=119561&type=1
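
For anyone reading this on Python 3, where sgmllib no longer exists, roughly the same idea can be sketched with html.parser.HTMLParser from the standard library. The class name LinkAndImageLister and the inline sample HTML are assumptions for illustration; the sample reuses the two tags from the original question instead of fetching the live page.

```python
from html.parser import HTMLParser


class LinkAndImageLister(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.urls.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


sample_html = (
    '<a href="install.php">Install</a>'
    '<img class="bild" src="images/marine.jpg">'
)

parser = LinkAndImageLister()
parser.feed(sample_html)
parser.close()

print("URLs:", parser.urls)
print("IMGs:", parser.images)
```

Unlike the sgmllib version, there is one handler for all start tags, so the tag name is checked explicitly.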

regards
Steve

Andreas Volz · Dec 6, 2004

On Mon, 06 Dec 2004 20:36:36 GMT, Paul McGuire wrote:

Check out the urlparse module (in std distribution). For images, you
can provide a default addressing scheme, so you can expand
"images/marine.jpg" relative to the current location.

Ok, this looks good. But I'm a real newbie to Python and not able to
create a minimal example. Could you please give me a very small example
of how to use urlparse? Or point me to an example on the web?

regards
Andreas

Paul McGuire · Dec 7, 2004

No problem. Googling for 'python urlparse' gets us immediately to:
http://www.python.org/doc/current/lib/module-urlparse.html. This online doc
has some examples built into it.

But as a newbie, it would also be good to get comfortable with dir() and
help() and trying simple commands at the >>> Python prompt. If I type the
following at the Python prompt:

>>> import urlparse
>>> help(urlparse)

I get almost the same output straight from the Python source.

dir(urlparse) gives me just a list of the global symbol names from the
module, but sometimes that's enough of a clue without reading the whole doc.
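
A small sketch of that kind of exploration, assuming Python 3, where the module is named urllib.parse instead of urlparse. The variable name public_names is just for illustration.

```python
import urllib.parse

# dir() lists every name in the module; drop the private ones for readability.
public_names = [name for name in dir(urllib.parse) if not name.startswith("_")]
print(public_names)

# At the interactive prompt, help(urllib.parse.urljoin) shows this docstring
# together with the function signature.
print(urllib.parse.urljoin.__doc__)
```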

Now is where the intrepid Pythonista-to-be uses the Python interactive
prompt and the tried-and-true Python methodology known as "Just Trying Stuff
Out".
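>>> import urlparse
>>> urlparse("images/marine.jpeg")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: 'module' object is not callable
(Damn! forgot to prefix with urlparse.)
>>> urlparse.urlparse("images/marine.jpeg")
('', '', 'images/marine.jpeg', '', '', '')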
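
The six-element tuple above can be reproduced on Python 3 as well, assuming urllib.parse as the modern home of urlparse. This sketch also parses the absolute URL from the original question for comparison; the variable names are illustrative only.

```python
from urllib.parse import urlparse

absolute = urlparse("http://www.example.com/dir/example.html")
relative = urlparse("images/marine.jpeg")

# The result behaves like a 6-tuple but also exposes named fields.
print(absolute.scheme, absolute.netloc, absolute.path)

# A relative reference has no scheme or host, only a path.
print(tuple(relative))
```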

Now you can start to predict what kind of tuples you'll get back from
urlparse, and you can visualize how you might merge the data from the img
fragment and the url fragment. Wait, I didn't read all of the doc - let's
try urljoin!

>>> urljoin("http://www.example.com/dir/example.html", "images/marine.jpeg")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'urljoin' is not defined
(Damn! forgot to prefix with urlparse AGAIN!)
>>> urlparse.urljoin("http://www.example.com/dir/example.html", "images/marine.jpeg")
'http://www.example.com/dir/images/marine.jpeg'
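
Putting urljoin to work on the image list from Steve's output might look like the sketch below. It assumes Python 3's urllib.parse and hard-codes the three src values from that output instead of fetching the page; page_url and image_srcs are illustrative names.

```python
from urllib.parse import urljoin

page_url = "http://stargus.sourceforge.net/"
image_srcs = [
    "images/stargus_banner.jpg",
    "images/marine.jpg",
    "http://sourceforge.net/sflogo.php?group_id=119561&type=1",
]

# Relative references are resolved against the page URL;
# references that are already absolute pass through unchanged.
absolute_srcs = [urljoin(page_url, src) for src in image_srcs]

for src in absolute_srcs:
    print(src)
```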

Is this in the ballpark of where you are trying to go?

-- Paul
Give a man a fish and you feed him for a day; give a man a fish every day
and you feed him for the rest of his life.

Andreas Volz · Dec 7, 2004

On Tue, 07 Dec 2004 00:40:02 GMT, Paul McGuire wrote:

Is this in the ballpark of where you are trying to go?

Yes, thanks. You helped me a lot.

Andreas
