cut strings and parse for images

A

Andreas Volz

Hi,

I used SGMLParser to parse all href's in a html file. Now I need to cut
some strings. For example:

http://www.example.com/dir/example.html

Now I like to cut the string, so that only domain and directory is
left over. Expected result:

http://www.example.com/dir/

I know how to do this in bash programming, but not in python. How could
this be done?

The next problem is not only to extract href's, but also images. A href
is easy:

<a href="install.php">Install</a>

But a image is a little harder:

<img class="bild" src="images/marine.jpg">

This is my current example code:

from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):
def reset(self):
SGMLParser.reset(self)
self.urls = []

def start_a(self, attrs):
href = [v for k, v in attrs if k=='href']
if href:
self.urls.extend(href)

if __name__ == "__main__":
import urllib
usock = urllib.urlopen(leach_url)
parser = URLLister()
parser.feed(usock.read())
parser.close()
usock.close()
for url in parser.urls:
print url


Perhaps you've some tips how to solve this problems?

regards
Andreas
 
P

Paul McGuire

Andreas Volz said:
Hi,

I used SGMLParser to parse all href's in a html file. Now I need to cut
some strings. For example:

http://www.example.com/dir/example.html

Now I like to cut the string, so that only domain and directory is
left over. Expected result:

http://www.example.com/dir/

I know how to do this in bash programming, but not in python. How could
this be done?

The next problem is not only to extract href's, but also images. A href
is easy:

<a href="install.php">Install</a>

But a image is a little harder:

<img class="bild" src="images/marine.jpg">

Check out the urlparse module (in std distribution). For images, you can
provide a default addressing scheme, so you can expand "images/marine.jpg"
relative to the current location.

-- Paul
 
S

Steve Holden

Andreas said:
Hi,

I used SGMLParser to parse all href's in a html file. Now I need to cut
some strings. For example:

http://www.example.com/dir/example.html

Now I like to cut the string, so that only domain and directory is
left over. Expected result:

http://www.example.com/dir/

I know how to do this in bash programming, but not in python. How could
this be done?

The next problem is not only to extract href's, but also images. A href
is easy:

<a href="install.php">Install</a>

But a image is a little harder:

<img class="bild" src="images/marine.jpg">

This is my current example code:

from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):
def reset(self):
SGMLParser.reset(self)
self.urls = []

def start_a(self, attrs):
href = [v for k, v in attrs if k=='href']
if href:
self.urls.extend(href)

if __name__ == "__main__":
import urllib
usock = urllib.urlopen(leach_url)
parser = URLLister()
parser.feed(usock.read())
parser.close()
usock.close()
for url in parser.urls:
print url


Perhaps you've some tips how to solve this problems?
from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):

def reset(self):
SGMLParser.reset(self)
self.urls = []
self.images = []

def start_a(self, attrs):
href = [v for k, v in attrs if k=='href']
if href:
self.urls.extend(href)

def do_img(self, attrs):
"We assume each image *has* a src attribute."
for k, v in attrs:
if k == 'src':
self.images.append(v)
break


if __name__ == "__main__":
import urllib
usock = urllib.urlopen(leach_url)
parser = URLLister()
parser.feed(usock.read())
parser.close()
usock.close()
print "URLs:"
for url in parser.urls:
print url
print "IMGs:"
for img in parser.images:
print img

$ python sgml1.py
URLs:
about.php
install.php
features.php
http://www.stratagus.org/
http://www.stratagus.org/
http://www.blizzard.com/
http://sourceforge.net/projects/stargus/
IMGs:
images/stargus_banner.jpg
images/marine.jpg
http://sourceforge.net/sflogo.php?group_id=119561&amp;type=1

regards
Steve
 
A

Andreas Volz

Am Mon, 06 Dec 2004 20:36:36 GMT schrieb Paul McGuire:
Check out the urlparse module (in std distribution). For images, you
can provide a default addressing scheme, so you can expand
"images/marine.jpg" relative to the current location.

Ok, this looks good. But I'm a really newbie to python and not able to
create a minimum example. Could you please give me a very small example
how to use urlparse? Or point me to an example in the web?

regards
Andreas
 
P

Paul McGuire

Andreas Volz said:
Am Mon, 06 Dec 2004 20:36:36 GMT schrieb Paul McGuire:


Ok, this looks good. But I'm a really newbie to python and not able to
create a minimum example. Could you please give me a very small example
how to use urlparse? Or point me to an example in the web?

regards
Andreas

No problem. Googling for 'python urlparse' gets us immediately to:
http://www.python.org/doc/current/lib/module-urlparse.html. This online doc
has some examples built into it.

But as a newbie, it would also be good to get comfortable with dir() and
help() and trying simple commands at the >>> Python prompt. If I type the
following at the Python prompt:I get almost the same output straight from the Python source.

dir(urlparse) gives me just a list of the global symbol names from the
module, but sometimes that's enough of a clue without reading the whole doc.

Now is where the intrepid Pythonista-to-be uses the Python interactive
prompt and the tried-and-true Python methodology known as "Just Trying Stuff
Out".
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: 'module' object is not callable
(Damn! forgot to prefix with urlparse.)('', '', 'images/marine.jpeg', '', '', '')

Now you can start to predict what kind of tuples you'll get back from
urlparse, you can visualize how you might merge the data from the img
fragment and the url fragment. Wait, I didn't read all of the doc - let's
try urljoin!Traceback (most recent call last):
File "<stdin>", line 1, in ?
NameError: name 'urljoin' is not defined
(Damn! forgot to prefix with urlparse AGAIN!)'http://www.example.com/dir/images/marine.jpeg'

Is this in the ballpark of where you are trying to go?

-- Paul
Give a man a fish and you feed him for a day; give a man a fish every day
and you feed him for the rest of his life.
 
A

Andreas Volz

Am Tue, 07 Dec 2004 00:40:02 GMT schrieb Paul McGuire:
Is this in the ballpark of where you are trying to go?

Yes, thanks. You helped me a lot.

Andreas
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,213
Messages
2,571,107
Members
47,699
Latest member
lovelybaghel

Latest Threads

Top