Question concerning this list

Thomas Ploch

Hello fellow pythonists,

I have a question concerning posting code on this list.

I want to post source code of a module, which is a homework for
university (yes yes, I know, please read on...).

It is a web crawler (which I will *never* let out into the wide world)
which uses regular expressions (and yes, I know, that's not good either). I
have finished it (as far as I can), but since I need a good mark to
actually finish the course, I am wondering if I can post the code, and I
am wondering if anyone of you can review it and give me possible hints
on how to improve things.

So is this O.K.? Or is this a blatantly idiotic idea?

I hope I am not the idiot of the month right now...

Thanks in advance,
Thomas

P.S.:

I might give some of my Christmas chocolate away as a donation to this
list... :)
 
Steven D'Aprano

Hello fellow pythonists,

I have a question concerning posting code on this list.

I want to post source code of a module, which is a homework for
university (yes yes, I know, please read on...).

So long as you understand your university's policy on collaborations.


It is a web crawler (which I will *never* let out into the wide world)

If you post it on Usenet, you will have let it out into the wide world.
People will see it. Some of those people will download it. Some of them
will run it. And some of them will run it, uncontrolled, on the WWW.

Out of curiosity, if your web crawler isn't going to be used on the web,
what were you intending to use it on?


which uses regular expressions (and yes, I know, that's not good either).

Regexes are just a tool. Sometimes they are the right tool for the job.
Sometimes they aren't.


I have finished it (as far as I can), but since I need a good mark to
actually finish the course, I am wondering if I can post the code, and I
am wondering if anyone of you can review it and give me possible hints
on how to improve things.

That would be collaborating. What's your university's policy on
collaborating? Are you allowed to do so, if you give credit? Is it
forbidden?

It probably isn't a good idea to post a great big chunk of code and expect
people to read it all. If you have more specific questions than "how can
I make this better?", that would be good. Unless the code is fairly
short, it might be better to just post a few extracted functions and see
what people say about them, and then you can extend that to the rest of
your code.
 
Thomas Ploch

Steven said:
So long as you understand your university's policy on collaborations.

Well, collaboration is encouraged by my prof, but I think he actually
meant it as a way of getting students to bond with each other and
establish social contacts. He just said that he will reject copy &
paste submissions and work that actually has nothing to do with the
topic (when we laughed, he said we couldn't imagine what sometimes
gets handed in).
If you post it on Usenet, you will have let it out into the wide world.
People will see it. Some of those people will download it. Some of them
will run it. And some of them will run it, uncontrolled, on the WWW.

Out of curiosity, if your web crawler isn't going to be used on the web,
what were you intending to use it on?

It's a final homework assignment, as I mentioned above, and it
shouldn't be used anywhere but our university server for testing
(unless request timing (i.e. only two fetches per second) and handling
of 'robots.txt' are implemented). But you are right about the Usenet
thing; I haven't actually thought about that, so I won't post the
whole chunk of code.
Regexes are just a tool. Sometimes they are the right tool for the job.
Sometimes they aren't.

Alright, my prof said '... to process documents written in structural
markup languages using regular expressions is a no-no.' (Because of
nested elements? I can't remember.) So I think he wants us to use
regexes just to learn them. He is pointing to HTMLParser, though.
It probably isn't a good idea to post a great big chunk of code and expect
people to read it all. If you have more specific questions than "how can
I make this better?", that would be good. Unless the code is fairly
short, it might be better to just post a few extracted functions and see
what people say about them, and then you can extend that to the rest of
your code.

You are probably right. For me it boils down to these problems:
- Implementing a stack or queue for large numbers of documents that is
faster than list.pop(index) (is there a library for this?)
- Getting handlers for different MIME/content types and specifying
callbacks only for specific content types (a lot of work and complex
checks; a rough sketch of what I mean follows this list)
- Handling different encodings correctly.
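
For the second point, this is roughly the kind of dispatch I have in
mind (just a sketch, handler names made up, untested):

import urllib2

def handle_html(data):
    print "got %d bytes of HTML" % len(data)

def handle_text(data):
    print "got %d bytes of plain text" % len(data)

# Map Content-Types to callbacks; anything not listed is skipped.
HANDLERS = {
    'text/html': handle_html,
    'text/plain': handle_text,
}

def fetch(url):
    response = urllib2.urlopen(url)
    # info().gettype() strips parameters like "; charset=utf-8"
    content_type = response.info().gettype()
    handler = HANDLERS.get(content_type)
    if handler is not None:
        handler(response.read())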

I will follow your suggestions and post my code concerning specifically
these problems, and not the whole chunk.

Thanks,
Thomas
 
Marc 'BlackJack' Rintsch

Thomas Ploch said:
Alright, my prof said '... to process documents written in structural
markup languages using regular expressions is a no-no.' (Because of
nested Elements? Can't remember) So I think he wants us to use regexes
to learn them. He is pointing to HTMLParser though.

The problem is that much of the HTML in the wild is written in a
structured markup language, but in many cases it's broken. If you just
search for some words or patterns that appear somewhere in the
documents, then regular expressions are good enough. If you want to
actually *parse* HTML "from the wild", you'd better use the
BeautifulSoup_ parser.

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
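
A minimal example of what I mean (BeautifulSoup 3 style API; see the
URL above for details):

from BeautifulSoup import BeautifulSoup

html = "<p>Hello <a href='http://example.com/'>world</a></p>"
soup = BeautifulSoup(html)
for anchor in soup.findAll('a'):
    print anchor['href']            # -> http://example.com/
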
You are probably right. For me it boils down to these problems:
- Implementing a stack or queue for large numbers of documents that is
faster than list.pop(index) (is there a library for this?)

If you need a queue then use one: take a look at `collections.deque` or
the `Queue` module in the standard library.
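
For example (a minimal sketch):

from collections import deque

queue = deque()
queue.append('http://example.com/a')    # enqueue on the right
queue.append('http://example.com/b')
url = queue.popleft()                   # dequeue from the left in O(1),
                                        # unlike list.pop(0) which is O(n)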

Ciao,
Marc 'BlackJack' Rintsch
 
Thomas Ploch

Marc said:
The problem is that much of the HTML in the wild is written in a
structured markup language, but in many cases it's broken. If you just
search for some words or patterns that appear somewhere in the
documents, then regular expressions are good enough. If you want to
actually *parse* HTML "from the wild", you'd better use the
BeautifulSoup_ parser.

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/

Yes, I know about BeautifulSoup. But as I said, it should be done with
regexes. I want to extract tags and their attributes as a dictionary
of name/value pairs. I know that most of the HTML out there is *not*
validated and is bollocks.

This is what my regexes look like:

import re

class Tags:
    def __init__(self, sourceText):
        self.source = sourceText
        self.curPos = 0
        self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*"
        self.tagPattern = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>"
                                     % self.namePattern)
        self.attrPattern = re.compile(
            r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')"
            % self.namePattern)

If you need a queue then use one: take a look at `collections.deque` or
the `Queue` module in the standard library.

Which of the two would you recommend for handling large queues with fast
response times?

Thomas
 
Marc 'BlackJack' Rintsch

Thomas Ploch said:
This is what my regexes look like:

import re

class Tags:
    def __init__(self, sourceText):
        self.source = sourceText
        self.curPos = 0
        self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*"
        self.tagPattern = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>"
                                     % self.namePattern)
        self.attrPattern = re.compile(
            r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')"
            % self.namePattern)

Have you tested this with tags inside comments?
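
For instance, with your tag pattern (a small demonstration, not tested
against your full class):

import re

name = "[A-Za-z_][A-Za-z0-9_.:-]*"
tag = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>" % name)

html = '<p>real</p> <!-- <a href="dead.html">commented out</a> -->'
print [m.group('name') for m in tag.finditer(html)]
# -> ['p', 'a'] -- the <a> inside the comment is matched as well
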
Which of the two would you recommend for handling large queues with fast
response times?

`Queue.Queue` builds on `collections.deque` and is thread safe.
Speed-wise I don't think it makes a difference, as most of the time is
spent on IO and parsing. So if you make your spider multi-threaded to
gain some speed, go with `Queue.Queue`.
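
Roughly like this, if you do go multi-threaded (just a sketch):

import threading
import Queue

url_queue = Queue.Queue()

def worker():
    while True:
        url = url_queue.get()       # blocks until an item is available
        try:
            pass                    # fetch and parse `url` here
        finally:
            url_queue.task_done()

for _ in range(4):
    thread = threading.Thread(target=worker)
    thread.setDaemon(True)
    thread.start()

url_queue.put('http://example.com/')
url_queue.join()                    # wait until all queued URLs are done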

Ciao,
Marc 'BlackJack' Rintsch
 
Thomas Ploch

Marc said:
Thomas Ploch said:
This is what my regexes look like:

import re

class Tags:
    def __init__(self, sourceText):
        self.source = sourceText
        self.curPos = 0
        self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*"
        self.tagPattern = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>"
                                     % self.namePattern)
        self.attrPattern = re.compile(
            r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')"
            % self.namePattern)

Have you tested this with tags inside comments?

No, but I already see your point that it will parse _all_ tags, even
if they are commented out. I am thinking about how to solve this.
Probably I will just take the chunks between comments and feed them to
the regular expressions.
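
Maybe something like this (untested sketch):

import re

comment = re.compile(r'<!--.*?-->', re.DOTALL)

def strip_comments(html):
    # Drop comments before feeding the source to the tag regexes.
    return comment.sub('', html)

print strip_comments('<p>kept</p> <!-- <a href="x">dropped</a> -->')
# -> '<p>kept</p> '
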
`Queue.Queue` builds on `collections.deque` and is thread safe.
Speed-wise I don't think it makes a difference, as most of the time is
spent on IO and parsing. So if you make your spider multi-threaded to
gain some speed, go with `Queue.Queue`.

I think I will go for collections.deque (since I have no intention of
making it multi-threaded) and have several queues, one for each
server, so that one server is finished completely before moving
straight on to the next (is this a good approach?).
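
I am thinking of something like this (a dict of deques keyed by
server; untested sketch):

from collections import deque
from urlparse import urlsplit

queues = {}                          # one deque per server

def enqueue(url):
    server = urlsplit(url)[1]        # the host part of the URL
    queues.setdefault(server, deque()).append(url)

enqueue('http://example.com/a')
enqueue('http://example.com/b')
enqueue('http://other.example.org/c')

# Work through one server's queue completely before moving on:
for server, urls in queues.items():
    while urls:
        url = urls.popleft()         # fetch and parse `url` here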

Thanks a lot,
Thomas
 
John Nagle

Very true. HTML is LALR(0), that is, you can parse it without
looking ahead. Parsers for LALR(0) languages are easy, and
work by repeatedly getting the next character and using that to
drive a single state machine. The first character-level parser
yields tokens, which are then processed by a grammar-level parser.
Any compiler book will cover this.
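
A toy illustration of the character-level stage (not production code):

def tokenize(html):
    # Character-driven state machine: yields TAG and TEXT tokens.
    state, buf = 'TEXT', []
    for ch in html:
        if state == 'TEXT':
            if ch == '<':
                if buf:
                    yield ('TEXT', ''.join(buf))
                buf, state = ['<'], 'TAG'
            else:
                buf.append(ch)
        else:                        # state == 'TAG'
            buf.append(ch)
            if ch == '>':
                yield ('TAG', ''.join(buf))
                buf, state = [], 'TEXT'
    if buf:
        yield (state, ''.join(buf))

print list(tokenize('<p>Hello <b>world</b></p>'))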

Using regular expressions for LALR(0) parsing is a vice inherited
from Perl, in which regular expressions are easy and "get next
character from string" is unreasonably expensive. In Python, at least
you can index through a string.

John Nagle
 
Thomas Ploch

John said:
Very true. HTML is LALR(0), that is, you can parse it without
looking ahead. Parsers for LALR(0) languages are easy, and
work by repeatedly getting the next character and using that to
drive a single state machine. The first character-level parser
yields tokens, which are then processed by a grammar-level parser.
Any compiler book will cover this.

Using regular expressions for LALR(0) parsing is a vice inherited
from Perl, in which regular expressions are easy and "get next
character from string" is unreasonably expensive. In Python, at least
you can index through a string.

John Nagle

I take it with LALR(0) you mean that HTML is a language created by a
Chomsky-0 (regular language) Grammar?

Thomas
 
Diez B. Roggisch

Thomas said:
I take it with LALR(0) you mean that HTML is a language created by a
Chomsky-0 (regular language) Grammar?

Nope.

LALR is a context free grammar parsing technique.

Regular expressions can't express languages like

a^n b^n

but something like

<div><div></div></div>

is <div>^2</div>^2
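
A quick demonstration of the difference (rough sketch):

import re

html = '<div>outer <div>inner</div> tail</div>'

# The regex pairs the first <div> with the *first* </div>:
print re.search(r'<div>(.*?)</div>', html).group(1)
# -> 'outer <div>inner'

# Matching the nesting needs a counter, i.e. memory a plain regex lacks:
depth = 0
for tag in re.findall(r'</?div>', html):
    if tag == '<div>':
        depth += 1
    else:
        depth -= 1
print depth == 0                     # True: the tags are balanced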

Diez
 
