Help with using findAll() in BeautifulSoup

Alexnb

Okay, I am not sure if there is a better way of doing this than findAll(), but
that is how I am doing it right now. I am making an app that screen-scrapes
dictionary.com for definitions. However, I would like to have the type of
word (noun, verb, and so on) for each definition. For example, if def1 and
def2 are noun definitions but def3 isn't:


noun
def1
def2
verb
def3

Something like that. I can get the definitions just fine. The problem comes
when I want to get the type. I can get the types, but I don't know which
definitions they go with. So I can get noun and verb, but for all I know
noun is def1, and verb is def2 and def3. I am wondering if there is a way to
use findAll() but have it stop once it hits a certain thing, or some other
way to do just that. For example, if I have

noun
<table blah>
<table blah>
verb
<table blah>

I want to be able to do something like findAll('span', {'class': 'pg'}), but
have it tell me how many <table> things come after each span and before the
next one, so I know how many definitions each type has.

Here is the code I am using (I used "cheese" because that is kind of my test
word for everything in the app):

import urllib
from BeautifulSoup import BeautifulSoup

class defWord:
    def __init__(self, word):
        self.word = word

        def get_types(term):
            soup = BeautifulSoup(urllib.urlopen(
                'http://dictionary.reference.com/search?q=%s' % term))

            for tabs in soup.findAll('span', {'class': 'pg'}):
                yield tabs.contents[0].string

        self.mainList = list(get_types(self.word))
        print self.mainList

type = defWord("cheese")

I don't know if this is really something anyone can help me fix or if I have
to do it on my own. But I would love some help.
 
Stefan Behnel

Alexnb said:
Okay, I am not sure if there is a better way of doing this than findAll() but
that is how I am doing it right now.

Consider using lxml.html and lxml.cssselect.

http://codespeak.net/lxml/

I am making an app that screen-scrapes
dictionary.com for definitions.

Do they have a policy for doing that?

noun
<table blah>
<table blah>
verb
<table blah>

I want to be able to do like findAll('span', {'class': 'pg'}), but tell me
how many <table> things are after it, or before the next so I know how many
definitions it has.

You didn't say where the "span" is in the HTML code, but lxml.cssselect should
get you pretty close to what you want. If your tables are descendants of the
"span"s, a selector like:

"span.pg table"

might work. There's also a CSS syntax for siblings.

Stefan
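
For what it's worth, the "stop at the next type" idea can be sketched as a single pass over the page in document order, crediting each <table> to the most recent type heading. This is only a rough sketch under assumptions: it uses the standard library's html.parser rather than BeautifulSoup or lxml so that it is self-contained, it assumes the tables are siblings that follow each <span class="pg"> (the thread never confirms the real page layout), and the names TypeGrouper, count_definitions and SAMPLE are made up for illustration.

```python
from html.parser import HTMLParser


class TypeGrouper(HTMLParser):
    """Credit each top-level <table> to the most recent <span class="pg">."""

    def __init__(self):
        super().__init__()
        self.groups = []        # one [type_name, table_count] pair per span.pg
        self._in_pg = False     # True while inside a span.pg
        self._table_depth = 0   # greater than 0 while inside a table

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "pg" in classes:
            # A new type heading starts a new group.
            self._in_pg = True
            self.groups.append(["", 0])
        elif tag == "table":
            # Count only outermost tables that come after a type heading.
            if self._table_depth == 0 and self.groups:
                self.groups[-1][1] += 1
            self._table_depth += 1

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_pg = False
        elif tag == "table" and self._table_depth:
            self._table_depth -= 1

    def handle_data(self, data):
        # Text inside the span.pg is the type name ("noun", "verb", ...).
        if self._in_pg:
            self.groups[-1][0] += data.strip()


def count_definitions(html):
    """Return [(type_name, number_of_tables), ...] in page order."""
    parser = TypeGrouper()
    parser.feed(html)
    parser.close()
    return [(name, count) for name, count in parser.groups]


# Hypothetical markup shaped like the example in the question.
SAMPLE = """
<span class="pg">noun</span>
<table><tr><td>def1</td></tr></table>
<table><tr><td>def2</td></tr></table>
<span class="pg">verb</span>
<table><tr><td>def3</td></tr></table>
"""

print(count_definitions(SAMPLE))  # [('noun', 2), ('verb', 1)]
```

The same one-pass grouping should carry over to other parsers. If the real page nests the tables inside the spans, or nests other spans inside span.pg, the bookkeeping would need adjusting.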
 
Paul McGuire

Do they have a policy for doing that?

From the Dictionary.com Terms of Use
(http://dictionary.reference.com/help/terms.html):

3.2 You will not modify, publish, transmit, participate in the
transfer or sale, create derivative works, or in any way exploit, any
of the content, in whole or in part, found on the Site. You will
download copyrighted content solely for your personal use, but will
make no other use of the content without the express written
permission of Lexico and the copyright owner. You will not make any
changes to any content that you are permitted to download under this
Agreement, and in particular you will not delete or alter any
proprietary rights or attribution notices in any content. You agree
that you do not acquire any ownership rights in any downloaded
content.

IANAL, but it seems pretty clear that, unless this content scraper is
"solely for your personal use," you'll need to get written permission
to include content that you have scraped from Dictionary.com into your
app.

-- Paul
 
