Crummy BS Script

flebber · Oct 1, 2010

I have a simple question regarding the Beautiful Soup (crummy.com) script.
The last line is f.write('%s, %s, %s, %s, %s \n' % (i, t[0], t[1],
t[2], t[3])). But where is this saving the imported file, and under
what name?

#!/usr/bin/env python
# ogm-sampples.py
# Author: Matt Mayes
# March 11, 2008
"""
-- This requires the Beautiful Soup mod: http://www.crummy.com/software/BeautifulSoup/
--
Steps:
1. Identify all <ul>'s that are preceded with '<font color="#3C378C"
size="2">' (which denotes a header here)
2. Pull that font text, and store as dictionary key
3. Extract all links and link text from the list, generate a link
title and type (pdf/html/404) store as tuples in
appropriate dict key (note that some list items contain more than 1
link, this handles it) If it's a 404, it will
not be added to the list.
4. Identify if it's linking to an HTML page or PDF
5. If it's a local pdf referenced by a root value ("/file.pdf"), it
strips the slash. Modify to suit your needs.
6. Generate a CSV file of results
"""

import urllib2, re
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.givegoodweb.com/examples/ogm-samples.html")
soup = BeautifulSoup(page)
fontStart = re.compile(r'<font[a-zA-Z-",0-9= ]*>?')
fontEnd = re.compile(r'</font>')
titleSearch = re.compile(r'title=')
getTitle = re.compile(r'<title>(.*)</title>',re.DOTALL|re.MULTILINE)
emailSearch = re.compile(r'mailto')

def removeNL(x):
    """cleans a string of new lines and spaces"""
    s = x.split('\n')
    s = [x.strip() for x in s]
    x = " ".join(s)
    return x.lstrip()

ul_tags = {}

for ul in soup.html.body.findAll('ul'):
    links = []
    x = ul.findPrevious('font', color="#3C378C").renderContents()
    if '\n' in x:
        x = removeNL(x)
    for li in ul.findAll('li'):
        line = []
        for a in li.findAll('a'):
            c = removeNL(str(a.contents[0]))
            c = fontStart.sub('', c)
            c = fontEnd.sub('', c)
            href = str(a.get('href'))
            if href[-3:].lower() == 'pdf':
                type = 'pdf'
                title = "PDF sample"
            elif emailSearch.search(href):
                title = 'email'
            else:
                type = 'html'
                try:
                    f = urllib2.urlopen(href)
                    # reading in 2000 characters should do it
                    t = getTitle.search(f.read(2000))
                    if t:
                        title = t.group(1)
                        title = removeNL(title)
                    else:
                        title = "open link"
                except urllib2.HTTPError, e:
                    title = 404
                f.close()
            if title != 404:
                line.append((c, href.lstrip('/'), type, title))
        links.append(line)
    ul_tags[x] = links

page.close()

f = open('samples.csv', 'w')

for i in ul_tags.iterkeys():
    for x in ul_tags[i]:
        for t in x:
            f.write('%s, %s, %s, %s, %s \n' % (i, t[0], t[1], t[2], t[3]))

f.close()

I got it from http://pastie.textmate.org/164503
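
Steps 4 and 5 of the script's docstring (telling PDF links from HTML links, and stripping a leading slash) can be tried in isolation. The following is only a sketch written for Python 3, not part of the original script: the function name classify_link is made up for illustration, and the 'email' type label is an assumption, since the script above never sets a type for mailto links.

```python
def classify_link(href):
    """Return (cleaned_href, link_type), loosely following the script's checks."""
    if href.lower().endswith('pdf'):
        link_type = 'pdf'
    elif 'mailto' in href:
        # Assumption: the original script assigns no type in this branch.
        link_type = 'email'
    else:
        link_type = 'html'
    # lstrip('/') removes every leading slash, as in the original script.
    return href.lstrip('/'), link_type


samples = [
    '/docs/file.PDF',
    'mailto:someone@example.com',
    'http://example.com/page.html',
]
for sample in samples:
    print(classify_link(sample))
```

Unlike the script, this does not fetch the page to read its title, so it can be run without network access.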

Burton Samograd · Oct 1, 2010

flebber said:
But where is this saving the imported file and under what name?

Looks like samples.csv:

MRAB · Oct 2, 2010

Burton Samograd said:
Looks like samples.csv:

It'll be in the current working directory, which is given by:

os.getcwd()
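
A minimal sketch of that point, assuming a Python 3 interpreter: a bare filename passed to open() is resolved against the current working directory, so the full path of the output file can be printed before anything is written.

```python
import os

filename = 'samples.csv'

# A relative name is resolved against the current working directory.
full_path = os.path.abspath(filename)

print(os.getcwd())   # the directory the script was started from
print(full_path)     # where open('samples.csv', 'w') would create the file
```

When the script is started from an IDE, the working directory may not be the folder that holds the script, which is why printing it is a quick check.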

flebber · Oct 2, 2010

MRAB said:
It'll be in the current working directory, which is given by:

os.getcwd()

So how do I call the output to direct it to file? I can't see which
part to get.

flebber · Oct 2, 2010

I don't understand your question. What do you mean "call the output" --
you normally don't call the output, you call a function or program to get
output. The output is already directed to a file, as you were shown -- it
is written to the file samples.csv in the current directory.

Perhaps if you explain your question more carefully, we might be able to
help a little more.

How do I change where the output goes and what it's called?

John Bokma · Oct 2, 2010

flebber said:
How do I change where the output goes and what it's called?

f = open('samples.csv', 'w')

Where else? Maybe read a beginner's book on Python before you start on a
path of Cargo Cult Coding?
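
In other words, the destination and the name are both decided by the string handed to open(). A rough Python 3 sketch of changing them follows; the helper write_rows, the temporary directory, the file name my-links.csv and the sample row are all invented for illustration, and the csv module is used here instead of the script's hand-built format string.

```python
import csv
import os
import tempfile


def write_rows(rows, directory, filename):
    """Write rows as CSV to directory/filename and return the full path."""
    path = os.path.join(directory, filename)
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows(rows)
    return path


# Hypothetical destination: a fresh temporary directory.
out_dir = tempfile.mkdtemp()
rows = [('Header', 'link text', 'file.pdf', 'pdf', 'PDF sample')]
saved = write_rows(rows, out_dir, 'my-links.csv')
print(saved)
```

Replacing out_dir with any existing folder and 'my-links.csv' with another name is all it takes to redirect the output.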

flebber · Oct 3, 2010

John Bokma said:
f = open('samples.csv', 'w')

Where else? Maybe read a beginner's book on Python before you start on a
path of Cargo Cult Coding?

--
John Bokma j3b

Blog:http://johnbokma.com/ Facebook:http://www.facebook.com/j.j.j.bokma
Freelance Perl & Python Development:http://castleamber.com/

Cargo Cult Coding?

Not sure what it is but it sounds good.

flebber · Oct 3, 2010

flebber said:
Cargo Cult Coding?

Not sure what it is but it sounds good.

When I get an error from this using Alan's site as a test, is that a
result of the script being unable to parse the page elements?

http://www.freenetpages.co.uk/hp/alan.gauld/

Traceback (most recent call last):
  File "C:\Sayth\Scripts\BSScriptCrummy.py", line 38, in <module>
    for ul in soup.html.body.findAll('ul'):
AttributeError: 'NoneType' object has no attribute 'findAll'
Script terminated.
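
Reading the traceback: soup.html.body evaluated to None, meaning an <html> element was found but no <body> beneath it, so there was nothing to call findAll on. One rough way to see which tags a page actually contains is sketched below for Python 3, using the standard library's html.parser rather than Beautiful Soup. The TagCounter class and the frameset markup are made-up samples for illustration, not the real content of the site above.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags so we can see whether <body> or <ul> exist at all."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html_text):
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Hypothetical page built from frames: it has <html> but no <body>.
frameset_page = (
    "<html><frameset cols='20%,80%'>"
    "<frame src='menu.html'><frame src='main.html'>"
    "</frameset></html>"
)
counts = count_tags(frameset_page)
print(counts.get('body', 0), counts.get('ul', 0))
```

A page with zero body or ul tags gives the script nothing to iterate over, so a check for None before calling findAll would avoid the crash.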

Nobody · Oct 3, 2010

flebber said:
Cargo Cult Coding?

Not sure what it is but it sounds good.

Imitation without understanding, aka monkey-see-monkey-do.

http://en.wikipedia.org/wiki/Cargo_cult
