html5lib not thread safe. Is the Python SAX library thread-safe?

John Nagle · Mar 11, 2012

"html5lib" is apparently not thread safe.
(see "http://code.google.com/p/html5lib/issues/detail?id=189")
Looking at the code, I've only found about three problems.
They're all the usual "cached in a global without locking" bug.
A few locks would fix that.
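
For illustration, the "cached in a global without locking" pattern and its lock-based fix look roughly like the sketch below. This is generic Python 3, not html5lib's actual code; `_cache`, `_cache_lock`, `build_table`, `get_table` and `demo` are hypothetical names chosen for the example.

```python
import threading

_cache = {}                      # module-level cache shared by all threads
_cache_lock = threading.Lock()   # guards every check-then-store on _cache
build_calls = []                 # records each expensive build, for demonstration


def build_table(name):
    """Stand-in for an expensive computation whose result gets cached."""
    build_calls.append(name)
    return {"name": name}


def get_table(name):
    """Return the cached table for name, building it at most once."""
    with _cache_lock:
        if name not in _cache:
            _cache[name] = build_table(name)
        return _cache[name]


def demo(num_threads=8):
    """Request the same table from several threads; return the build count."""
    threads = [threading.Thread(target=get_table, args=("entities",))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(build_calls)
```

Without the lock, two threads could both see the key as missing and both run the build; with it, the check and the store happen as one step.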

But html5lib calls the XML SAX parser. Is that thread-safe?
Or is there more trouble down at the bottom?

(I run a multi-threaded web crawler, and currently use BeautifulSoup,
which is thread safe, although dated. I'm looking at converting to
html5lib.)

John Nagle

Cameron Simpson · Mar 11, 2012

| "html5lib" is apparently not thread safe.
| (see "http://code.google.com/p/html5lib/issues/detail?id=189")
| Looking at the code, I've only found about three problems.
| They're all the usual "cached in a global without locking" bug.
| A few locks would fix that.
|
| But html5lib calls the XML SAX parser. Is that thread-safe?
| Or is there more trouble down at the bottom?
|
| (I run a multi-threaded web crawler, and currently use BeautifulSoup,
| which is thread safe, although dated. I'm looking at converting to
| html5lib.)

IIRC, BeautifulSoup4 may do that for you:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser
"Beautiful Soup 4 uses html.parser by default, but you can plug in
lxml or html5lib and use that instead."

Just for interest, re locking, I wrote a little decorator the other day,
thus:

    @locked_property
    def foo(self):
        compute foo here ...
        return foo value

and am rolling its use out amongst my classes. Code:

    def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
        ''' A property whose access is controlled by a lock if unset.
        '''
        if prop_name is None:
            # __name__ works on both Python 2 and 3 (func_name is Python 2 only)
            prop_name = '_' + func.__name__
        def getprop(self):
            ''' Attempt lockless fetch of property first.
                Use lock if property is unset.
            '''
            p = getattr(self, prop_name)
            if p is unset_object:
                with getattr(self, lock_name):
                    # re-check under the lock: another thread may have set it
                    p = getattr(self, prop_name)
                    if p is unset_object:
                        p = func(self)
                        setattr(self, prop_name, p)
            return p
        return property(getprop)

It tries to be lockless in the common case. I suspect it is only safe in
CPython where there is a GIL. If raw python assignments and fetches can
overlap (eg Jython I think?) I probably need a shared "read" lock around
the first "p = getattr(self, prop_name)". Any remarks?
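
A usage sketch may help here. It restates the decorator for Python 3 so the block runs on its own; the `Page` class, its `links` property and the `demo` helper are hypothetical, and the single-computation behaviour shown is what one would expect under CPython, in line with the GIL caveat above.

```python
import threading


def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
    """Same logic as the decorator above, written for Python 3."""
    if prop_name is None:
        prop_name = '_' + func.__name__

    def getprop(self):
        p = getattr(self, prop_name)           # lockless fast path
        if p is unset_object:
            with getattr(self, lock_name):
                p = getattr(self, prop_name)   # re-check under the lock
                if p is unset_object:
                    p = func(self)
                    setattr(self, prop_name, p)
        return p
    return property(getprop)


class Page(object):
    """Hypothetical class: it must supply _lock and start _links as None."""

    def __init__(self, text):
        self.text = text
        self._lock = threading.Lock()
        self._links = None       # the "unset" marker the decorator looks for
        self.computations = 0    # counts how often the property body runs

    @locked_property
    def links(self):
        self.computations += 1
        return [w for w in self.text.split() if w.startswith('http://')]


def demo(num_threads=8):
    """Read page.links from several threads; return the page and the results."""
    page = Page('see http://example.com/a and http://example.com/b')
    results = []

    def worker():
        results.append(page.links)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return page, results
```

Note that the class itself has to create the lock and initialise the backing attribute to the unset marker, and that a property whose computed value equals the marker (None by default) would be recomputed on every access.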

Cheers,
--
Cameron Simpson DoD#743
http://www.cskk.ezoshosting.com/cs/

Ed Campbell's pointers for long trips:
1. lay out the bare minimum of stuff that you need to take with you, then
put at least half of it back.

John Nagle · Mar 12, 2012

| "html5lib" is apparently not thread safe.
| (see "http://code.google.com/p/html5lib/issues/detail?id=189")
| Looking at the code, I've only found about three problems.
| They're all the usual "cached in a global without locking" bug.
| A few locks would fix that.
|
| But html5lib calls the XML SAX parser. Is that thread-safe?
| Or is there more trouble down at the bottom?
|
| (I run a multi-threaded web crawler, and currently use BeautifulSoup,
| which is thread safe, although dated. I'm looking at converting to
| html5lib.)

Cameron Simpson wrote:
| IIRC, BeautifulSoup4 may do that for you:
|
| http://www.crummy.com/software/BeautifulSoup/bs4/doc/
|
| http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser
| "Beautiful Soup 4 uses html.parser by default, but you can plug in
| lxml or html5lib and use that instead."

I want to use HTML5 standard parsing of bad HTML. (HTML5 formally
defines how to parse bad comments, for example.) I currently have
a modified version of BeautifulSoup that's more robust than the
standard one, but it doesn't handle errors the same way browsers do.

John Nagle

Paul Rubin · Mar 12, 2012

John Nagle said:
| But html5lib calls the XML SAX parser. Is that thread-safe?
| Or is there more trouble down at the bottom?

According to

http://xmlbench.sourceforge.net/results/features200303/index.html

libxml and expat both purport to be thread-safe. I've used the python
expat library (not from multiple threads) and it works fine, though the
python calls slow it down by worse than an order of magnitude.
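
As a conservative sketch of using the Python expat binding from several threads, the code below gives every parse its own parser object and never shares one between threads. It does not by itself demonstrate that the underlying library is thread-safe; `count_elements` and `parse_in_threads` are hypothetical helper names, and the code targets Python 3.

```python
import threading
import xml.parsers.expat


def count_elements(xml_text):
    """Parse xml_text with a fresh expat parser and count start tags by name."""
    counts = {}

    def start_element(name, attrs):
        counts[name] = counts.get(name, 0) + 1

    parser = xml.parsers.expat.ParserCreate()   # one parser per call, never shared
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)                # True marks this as the final chunk
    return counts


def parse_in_threads(documents):
    """Parse each document in its own thread; results keep the input order."""
    results = [None] * len(documents)

    def worker(index, text):
        results[index] = count_elements(text)   # each thread writes its own slot

    threads = [threading.Thread(target=worker, args=(i, doc))
               for i, doc in enumerate(documents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Keeping all parser state local to one call sidesteps the shared-state question for the parser object itself, leaving only the library's own global state as a possible concern.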

Stefan Behnel · Mar 12, 2012

John Nagle, 11.03.2012 21:30:
| "html5lib" is apparently not thread safe.
| (see "http://code.google.com/p/html5lib/issues/detail?id=189")
| Looking at the code, I've only found about three problems.
| They're all the usual "cached in a global without locking" bug.
| A few locks would fix that.
|
| But html5lib calls the XML SAX parser. Is that thread-safe?
| Or is there more trouble down at the bottom?
|
| (I run a multi-threaded web crawler, and currently use BeautifulSoup,
| which is thread safe, although dated. I'm looking at converting to
| html5lib.)

You may also consider moving to lxml. BeautifulSoup supports it as a parser
backend these days, so you wouldn't even have to rewrite your code to use
it. And performance-wise, well ...

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Stefan

John Nagle · Mar 12, 2012

Stefan Behnel wrote:
| You may also consider moving to lxml. BeautifulSoup supports it as a parser
| backend these days, so you wouldn't even have to rewrite your code to use
| it. And performance-wise, well ...
|
| http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
|
| Stefan

I want to move to html5lib because it handles HTML errors as
specified by the HTML5 spec, which is what all newer browsers do.
The HTML5 spec actually specifies, in great detail, how to parse
common errors in HTML. It's amusing seeing that formalized.
Malformed comments (<!- instead of <!--) are now handled in
a standard way, for example. So I'm trying to get html5parser
fixed for thread safety.

John Nagle