html5lib not thread safe. Is the Python SAX library thread-safe?

John Nagle

"html5lib" is apparently not thread safe.
(see "http://code.google.com/p/html5lib/issues/detail?id=189")
Looking at the code, I've only found about three problems.
They're all the usual "cached in a global without locking" bug.
A few locks would fix that.
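
For readers unfamiliar with this kind of bug, here is a minimal sketch of the "cached in a global" pattern with a lock added. It is not html5lib's actual code; `_cache`, `_cache_lock`, `get_cached` and the `build` callback are hypothetical names used only for illustration.

```python
import threading

# Module-level cache shared by every thread (hypothetical example).
_cache = {}
_cache_lock = threading.Lock()

def get_cached(key, build):
    """Return the cached value for key, building it at most once."""
    # Holding the lock across the check and the store means two threads
    # cannot both see "missing" and both run build() for the same key.
    with _cache_lock:
        if key not in _cache:
            _cache[key] = build(key)
        return _cache[key]
```

The lock is taken on every call, which is the simplest correct form; whether that cost matters depends on how hot the lookup is.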

But html5lib calls the XML SAX parser. Is that thread-safe?
Or is there more trouble down at the bottom?
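
Whatever the answer turns out to be, one conservative pattern is to never share a parser or handler object between threads. The sketch below only illustrates that pattern with the standard `xml.sax` module; it does not show that the underlying parser is thread safe, and `TagCounter`, `count_tags` and `count_in_threads` are hypothetical names.

```python
import threading
import xml.sax

class TagCounter(xml.sax.ContentHandler):
    """Counts start tags; one instance per parse, never shared."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1

def count_tags(doc, results, index):
    # Each call builds its own handler, and parseString creates its own
    # parser, so no parser state is shared across threads here.
    handler = TagCounter()
    xml.sax.parseString(doc, handler)
    results[index] = handler.count

def count_in_threads(docs):
    results = [None] * len(docs)
    threads = [threading.Thread(target=count_tags, args=(d, results, i))
               for i, d in enumerate(docs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```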

(I run a multi-threaded web crawler, and currently use BeautifulSoup,
which is thread safe, although dated. I'm looking at converting to
html5lib.)

John Nagle
 
Cameron Simpson

| "html5lib" is apparently not thread safe.
| (see "http://code.google.com/p/html5lib/issues/detail?id=189")
| Looking at the code, I've only found about three problems.
| They're all the usual "cached in a global without locking" bug.
| A few locks would fix that.
|
| But html5lib calls the XML SAX parser. Is that thread-safe?
| Or is there more trouble down at the bottom?
|
| (I run a multi-threaded web crawler, and currently use BeautifulSoup,
| which is thread safe, although dated. I'm looking at converting to
| html5lib.)

IIRC, BeautifulSoup4 may do that for you:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser
"Beautiful Soup 4 uses html.parser by default, but you can plug in
lxml or html5lib and use that instead."

Just for interest, re locking, I wrote a little decorator the other day,
thus:

@locked_property
def foo(self):
    foo_value = ...  # compute foo here
    return foo_value

and am rolling its use out amongst my classes. Code:

def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
    ''' A property whose access is controlled by a lock if unset.
    '''
    if prop_name is None:
        # func.func_name exists only in Python 2; __name__ works in both.
        prop_name = '_' + func.__name__
    def getprop(self):
        ''' Attempt lockless fetch of property first.
            Use lock if property is unset.
        '''
        p = getattr(self, prop_name)
        if p is unset_object:
            with getattr(self, lock_name):
                # Re-check under the lock: another thread may have set it.
                p = getattr(self, prop_name)
                if p is unset_object:
                    p = func(self)
                    setattr(self, prop_name, p)
        return p
    return property(getprop)

It tries to be lockless in the common case. I suspect it is only safe in
CPython, where there is a GIL. If raw Python assignments and fetches can
overlap (e.g. in Jython, I think?), I probably need a shared "read" lock
around the first "p = getattr(self, prop_name)". Any remarks?
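
One way to exercise the decorator above is to have several threads read the property at once and check that the compute function ran only once. This is a sketch under assumptions: the decorator is restated so the block stands alone (using `func.__name__`), and the `Expensive` class, its `calls` counter and `read_concurrently` are hypothetical. The class has to create `_lock` and set `_foo` to None itself, since the getter reads both.

```python
import threading

def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
    if prop_name is None:
        prop_name = '_' + func.__name__
    def getprop(self):
        p = getattr(self, prop_name)          # lockless fast path
        if p is unset_object:
            with getattr(self, lock_name):
                p = getattr(self, prop_name)  # re-check under the lock
                if p is unset_object:
                    p = func(self)
                    setattr(self, prop_name, p)
        return p
    return property(getprop)

class Expensive:
    def __init__(self):
        self._lock = threading.Lock()
        self._foo = None   # the "unset" marker the getter looks for
        self.calls = 0     # how many times foo was actually computed

    @locked_property
    def foo(self):
        self.calls += 1
        return 42

def read_concurrently(obj, n_threads=8):
    results = []
    def worker():
        results.append(obj.foo)
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A passing run on CPython does not prove the lockless first read is safe on other implementations; it only checks that the compute step is not repeated.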

Cheers,
--
Cameron Simpson <[email protected]> DoD#743
http://www.cskk.ezoshosting.com/cs/

Ed Campbell's <[email protected]> pointers for long trips:
1. lay out the bare minimum of stuff that you need to take with you, then
put at least half of it back.
 
John Nagle

| "html5lib" is apparently not thread safe.
| (see "http://code.google.com/p/html5lib/issues/detail?id=189")
| Looking at the code, I've only found about three problems.
| They're all the usual "cached in a global without locking" bug.
| A few locks would fix that.
|
| But html5lib calls the XML SAX parser. Is that thread-safe?
| Or is there more trouble down at the bottom?
|
| (I run a multi-threaded web crawler, and currently use BeautifulSoup,
| which is thread safe, although dated. I'm looking at converting to
| html5lib.)

| IIRC, BeautifulSoup4 may do that for you:
|
| http://www.crummy.com/software/BeautifulSoup/bs4/doc/
|
| http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser
| "Beautiful Soup 4 uses html.parser by default, but you can plug in
| lxml or html5lib and use that instead."

I want to use HTML5 standard parsing of bad HTML. (HTML5 formally
defines how to parse bad comments, for example.) I currently have
a modified version of BeautifulSoup that's more robust than the
standard one, but it doesn't handle errors the same way browsers do.

John Nagle
 
Stefan Behnel

John Nagle, 11.03.2012 21:30:
| "html5lib" is apparently not thread safe.
| (see "http://code.google.com/p/html5lib/issues/detail?id=189")
| Looking at the code, I've only found about three problems.
| They're all the usual "cached in a global without locking" bug.
| A few locks would fix that.
|
| But html5lib calls the XML SAX parser. Is that thread-safe?
| Or is there more trouble down at the bottom?
|
| (I run a multi-threaded web crawler, and currently use BeautifulSoup,
| which is thread safe, although dated. I'm looking at converting to
| html5lib.)

You may also consider moving to lxml. BeautifulSoup supports it as a parser
backend these days, so you wouldn't even have to rewrite your code to use
it. And performance-wise, well ...

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Stefan
 
John Nagle

| You may also consider moving to lxml. BeautifulSoup supports it as a parser
| backend these days, so you wouldn't even have to rewrite your code to use
| it. And performance-wise, well ...
|
| http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
|
| Stefan

I want to move to html5lib because it handles HTML errors as
specified by the HTML5 spec, which is what all newer browsers do.
The HTML5 spec actually specifies, in great detail, how to parse
common errors in HTML. It's amusing seeing that formalized.
Malformed comments ( <!- instead of <!-- ) are now handled in
a standard way, for example. So I'm trying to get html5parser
fixed for thread safety.

John Nagle
 
