Andreas Volz
Hi,
I used SGMLParser to parse all hrefs in an HTML file. Now I need to cut
some strings. For example:
http://www.example.com/dir/example.html
Now I would like to cut the string so that only the domain and directory
are left. Expected result:
http://www.example.com/dir/
I know how to do this in bash, but not in Python. How could this be
done?
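A minimal sketch of one way to do the cut, assuming the URL always contains a path with at least one "/" after the host. The helper name strip_filename is hypothetical; it only uses plain string slicing, so it does not depend on any particular parser.

```python
def strip_filename(url):
    """Return url cut off right after its last '/', dropping the file name."""
    # rfind gives the index of the last '/'; +1 keeps that slash in the result.
    return url[:url.rfind("/") + 1]


print(strip_filename("http://www.example.com/dir/example.html"))
# prints: http://www.example.com/dir/
```

A URL with no path at all, such as http://www.example.com, would be cut back to "http://", so that case would need separate handling.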
The next problem is to extract not only hrefs, but also images. An href
is easy:
<a href="install.php">Install</a>
But an image is a little harder:
<img class="bild" src="images/marine.jpg">
This is my current example code:
from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen(leach_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        print url
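For the images, the same idea as start_a could be applied to the src attribute of img tags. The sketch below is an assumption-laden variant, not the code above: it swaps in html.parser.HTMLParser from Python 3 (where sgmllib is no longer available), and the class name LinkAndImageLister and the images list are hypothetical. With sgmllib itself, a start_img or do_img method modeled on start_a would presumably play the same role.

```python
from html.parser import HTMLParser  # Python 3 stand-in for sgmllib's SGMLParser


class LinkAndImageLister(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.urls = []    # href values, as in the original URLLister
        self.images = []  # src values (hypothetical addition)

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, like in start_a above.
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v)
        elif tag == "img":
            self.images.extend(v for k, v in attrs if k == "src" and v)


parser = LinkAndImageLister()
parser.feed('<a href="install.php">Install</a>'
            '<img class="bild" src="images/marine.jpg">')
parser.close()

print(parser.urls)    # prints: ['install.php']
print(parser.images)  # prints: ['images/marine.jpg']
```

The values collected this way are still relative, so they would have to be combined with the page URL before they could be fetched.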
Perhaps you have some tips on how to solve these problems?
regards
Andreas