design choice: multi-threaded / asynchronous wxpython client?

  • Thread starter bullockbefriending bard

bullockbefriending bard

I am a complete ignoramus and newbie when it comes to designing and
coding networked clients (or servers for that matter). I have a copy
of Goerzen (Foundations of Python Network Programming) and once
pointed in the best direction should be able to follow my nose and get
things sorted... but I am not quite sure which is the best path to
take and would be grateful for advice from networking gurus.

I am writing a program to display horse racing tote odds in a desktop
client program. I have access to an HTTP (open one of several URLs,
and I get back an XML doc with some data... not XML-RPC.) source of
XML data which I am able to parse and munge with no difficulty at all.
I have written and successfully tested a simple command line program
which allows me to repeatedly poll the server and parse the XML. Easy
enough, but the real world production complications are:

1) The data for the race about to start updates every (say) 15
seconds, and the data for earlier and later races updates only every
(say) 5 minutes. There is no point for me to be hammering the server
with requests every 15 seconds for data for races after the upcoming
race... I should query for this perhaps every 150s to be safe. But for
the upcoming race, I must not miss any updates and should query every
~7s to be safe. So... in the middle of a race meeting the situation
might be:
race 1 (race done with, no-longer querying), race 2 (race done with,
no longer querying) race 3 (about to start, data on server for this
race updating every 15s, my client querying every 7s), races 4-8 (data
on server for these races updating every 5 mins, my client querying
every 2.5 mins)

2) After a race has started and betting is cut off and there are
consequently no more tote updates for that race (it is possible to
determine when this occurs precisely because of an attribute in the
XML data), I need to stop querying (say) race 3 every 7s and remove
race 4 from the 150s query group and begin querying its data every 7s.

3) I need to dump this data (for all races, not just current about to
start race) to text files, store it as BLOBs in a DB *and* update real
time display in a wxpython windowed client.
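Points 1 and 2 above boil down to a small rule: drop races whose betting has closed, poll the earliest open race fast, and poll the rest slowly. A minimal Python 3 sketch of that rule follows. The function name, the constants, and the `betting_closed` mapping are illustrative assumptions; how the "closed" flag is read from the XML is not shown.

```python
FAST_INTERVAL = 7.0    # seconds, for the race about to start
SLOW_INTERVAL = 150.0  # seconds, for races further down the card


def polling_plan(betting_closed):
    """Map race number -> polling interval in seconds.

    `betting_closed` maps race number -> True once the XML attribute says
    betting is cut off for that race (hypothetical input shape).
    Closed races are left out of the plan entirely.
    """
    open_races = sorted(n for n, closed in betting_closed.items() if not closed)
    plan = {}
    for position, race in enumerate(open_races):
        # The earliest race still open is the one about to start.
        plan[race] = FAST_INTERVAL if position == 0 else SLOW_INTERVAL
    return plan


if __name__ == "__main__":
    state = {1: True, 2: True, 3: False, 4: False, 5: False}
    print(polling_plan(state))   # race 3 fast, races 4 and 5 slow
    state[3] = True              # betting on race 3 is cut off
    print(polling_plan(state))   # race 4 is promoted to the fast interval
```

Recomputing the plan whenever a race closes avoids any explicit "move race 4 from the slow group to the fast group" bookkeeping.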

My initial thought was to have two threads for the different update
polling cycles. In addition I would probably need another thread to
handle UI stuff, and perhaps another for dealing with file/DB data
write out. But, I wonder if using Twisted is a better idea? I will
still need to handle some threading myself, but (I think) only for
keeping wxpython happy by doing all this other stuff off the main
thread + perhaps also persisting received data in yet another thread.

I have zero experience with these kinds of design choices and would be
very happy if those with experience could point out the pros and cons
of each (synchronous/multithreaded, or Twisted) for dealing with the
two differing sample rates problem outlined above.

Many TIA!
 

Eric Wertman

Hi, that does look like a lot of fun... You might consider breaking
that into 2 separate programs. Write one that's threaded to keep a db
updated properly, and write a completely separate one to handle
displaying data from your db. This would allow you to later change or
add a web interface without having to muck with the code that handles
data.
 

David

1) The data for the race about to start updates every (say) 15
seconds, and the data for earlier and later races updates only every
(say) 5 minutes. There is no point for me to be hammering the server
with requests every 15 seconds for data for races after the upcoming

Try using an HTTP HEAD instruction instead to check if the data has
changed since last time.
 

bullockbefriending bard

Hi, that does look like a lot of fun... You might consider breaking
that into 2 separate programs.  Write one that's threaded to keep a db
updated properly, and write a completely separate one to handle
displaying data from your db.  This would allow you to later change or
add a web interface without having to muck with the code that handles
data.

Thanks for the good point. It certainly is a lot of 'fun'. One of
those jobs which at first looks easy (XML, very simple to parse data),
but a few gotchas in the real-time nature of the beast.

After thinking about your idea more, I am sure this decoupling of
functions and making everything DB-centric can simplify a lot of
issues. I quite like the idea of persisting pickled or YAML data along
with the raw XML (for archival purposes + occurs to me I might be able
to do something with XSLT to get it directly into screen viewable form
without too much work) to a DB and then having a client program which
queries most recent time-stamped data for display.

A further complication is that at a later point, I will want to do
real-time time series prediction on all this data (viz. predicting
actual starting prices at post time x minutes in the future). Assuming
I can quickly (enough) retrieve the relevant last n tote data samples
from the database in order to do this, then it will indeed be much
simpler to make things much more DB-centric... as opposed to
maintaining all this state/history in program data structures and
updating it in real time.
 

Jorge Godoy

bullockbefriending said:
A further complication is that at a later point, I will want to do
real-time time series prediction on all this data (viz. predicting
actual starting prices at post time x minutes in the future). Assuming
I can quickly (enough) retrieve the relevant last n tote data samples
from the database in order to do this, then it will indeed be much
simpler to make things much more DB-centric... as opposed to
maintaining all this state/history in program data structures and
updating it in real time.

If instead of storing XML and YAML you store the data points, you can do
everything from inside the database.

PostgreSQL supports Python stored procedures / functions and also supports
using R in the same way, for manipulating data. Then you can work with
everything and just retrieve the resulting information.

You might try storing the raw data and the XML / YAML, but I believe that
keeping those sync'ed might cause you some extra work.
 

bullockbefriending bard

Try using an HTTP HEAD instruction instead to check if the data has
changed since last time.

Thanks for the suggestion... am I going about this the right way here?

import urllib2
request = urllib2.Request("http://get-rich.quick.com")
request.get_method = lambda: "HEAD"
http_file = urllib2.urlopen(request)

print http_file.headers

->>>
Age: 0
Date: Sun, 27 Apr 2008 16:07:11 GMT
Content-Length: 521
Content-Type: text/xml; charset=utf-8
Expires: Sun, 27 Apr 2008 16:07:41 GMT
Cache-Control: public, max-age=30, must-revalidate
Connection: close
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 1.1.4322
Via: 1.1 jcbw-nc3 (NetCache NetApp/5.5R4D6)

Date is the time of the server response and not of the last data update.
It is definitely the time of the server response to my request and bears no
relation to when the live XML data was updated. I know this for a fact
because right now there is no active race meeting and any data still
available is static and many hours old. I would not feel confident
rejecting incoming data as duplicate based only on same content length
criterion. Am I missing something here?
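For reference, a Python 3 sketch of the same HEAD check (the snippet above is Python 2). It also shows why the headers listed above are of little use: a HEAD comparison only works if the response carries a validator such as `ETag` or `Last-Modified`, and neither appears here. The helper names are made up for illustration.

```python
import urllib.request


def head_request(url):
    """Build (but do not send) a HEAD request for `url`."""
    return urllib.request.Request(url, method="HEAD")


def cache_validators(headers):
    """Return whichever of ETag / Last-Modified the response carries.

    `headers` can be a plain dict or the `headers` attribute of a
    response returned by urllib.request.urlopen().
    """
    wanted = ("etag", "last-modified")
    return {k.lower(): v for k, v in headers.items() if k.lower() in wanted}


if __name__ == "__main__":
    # The headers shown in the post above, abbreviated.
    shown = {
        "Age": "0",
        "Date": "Sun, 27 Apr 2008 16:07:11 GMT",
        "Content-Length": "521",
        "Expires": "Sun, 27 Apr 2008 16:07:41 GMT",
        "Cache-Control": "public, max-age=30, must-revalidate",
    }
    print(cache_validators(shown))   # empty: nothing to compare between polls
```

To send the request for real, pass `head_request(url)` to `urllib.request.urlopen()` and inspect the `headers` attribute of the result.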

Actually there doesn't seem to be too much difficulty performance-wise
in fetching and parsing (minidom) the XML data and checking the
internal (it's an attribute) update time stamp in the parsed doc. If
timings got really tight, presumably I could more quickly check each
doc's time stamp with SAX (time stamp comes early in data as one might
reasonably expect) before deciding whether to go the whole hog with
minidom if the time stamp has in fact changed since I last polled the
server.
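A rough Python 3 sketch of the SAX idea: read only as far as the element carrying the time stamp, then abandon the parse. The element name `odds` and attribute name `updated` are placeholders, since the real XML layout is not shown in this thread.

```python
import xml.sax


class _StampFound(Exception):
    """Raised to abandon parsing once the time stamp has been read."""


class StampHandler(xml.sax.ContentHandler):
    def __init__(self, element, attribute):
        super().__init__()
        self.element = element
        self.attribute = attribute
        self.stamp = None

    def startElement(self, name, attrs):
        if name == self.element and self.attribute in attrs:
            self.stamp = attrs[self.attribute]
            raise _StampFound()   # no need to look at the rest of the doc


def read_stamp(xml_bytes, element="odds", attribute="updated"):
    """Return the time stamp attribute, or None if it is not present."""
    handler = StampHandler(element, attribute)
    try:
        xml.sax.parseString(xml_bytes, handler)
    except _StampFound:
        pass
    return handler.stamp


if __name__ == "__main__":
    doc = b'<odds updated="2008-04-27T16:07:11"><race no="3"/></odds>'
    print(read_stamp(doc))
```

The full minidom parse then only needs to run when `read_stamp()` returns a value different from the one seen on the previous poll.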

But if there is something I don't get about HTTP HEAD approach, please
let me know as a simple check like this would obviously be a good
thing for me.
 

Jorge Godoy

bullockbefriending said:
3) I need to dump this data (for all races, not just current about to
start race) to text files, store it as BLOBs in a DB *and* update real
time display in a wxpython windowed client.

Why in a BLOB? Why not into specific data types and normalized tables? You
can also save the BLOB for backup or auditing, but this won't allow you to
use your DB to the best of its capabilities... It will just act as a data
container, the same as a network share (which would not penalize you too
much to have connections open/closed).
 

BJörn Lindqvist

I think twisted is overkill for this problem. Threading, elementtree
and urllib should more than suffice. One thread polling the server for
each race with the desired polling interval. Each time some data is
treated, that thread sends a signal containing information about what
changed. The gui listens to the signal and will, if needed, update
itself with the new information. The database handler also listens to
the signal and updates the db.
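One way the thread-per-race idea could look in Python 3, as a sketch only. `fetch` stands in for the download-and-parse step and the "signal" is reduced to calling a list of listener functions; both are assumptions rather than anything prescribed above.

```python
import threading


class RacePoller(threading.Thread):
    """Poll one race at a fixed interval and notify listeners of changes."""

    def __init__(self, race, interval, fetch, listeners):
        super().__init__(daemon=True)
        self.race = race
        self.interval = interval
        self.fetch = fetch            # hypothetical: fetch(race) -> parsed data
        self.listeners = listeners    # callables taking (race, data)
        self.stopped = threading.Event()
        self.last = None

    def run(self):
        while not self.stopped.is_set():
            data = self.fetch(self.race)
            if data != self.last:     # only signal when something changed
                self.last = data
                for listener in self.listeners:
                    listener(self.race, data)
            # Sleep for the interval, but wake immediately if stop() is called.
            self.stopped.wait(self.interval)

    def stop(self):
        self.stopped.set()


if __name__ == "__main__":
    import time
    poller = RacePoller(3, 0.1, lambda race: "odds", [print])
    poller.start()
    time.sleep(0.3)
    poller.stop()
    poller.join()
```

Listeners run in the polling thread, so a GUI listener should hand the data over to the main thread (in wxPython, `wx.CallAfter` is the usual route) and a database listener might simply put it on a `queue.Queue`. Changing a race's rate amounts to assigning a new `interval`, which takes effect from the next wait.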
 

bullockbefriending bard

If instead of storing XML and YAML you store the data points, you can do
everything from inside the database.

PostgreSQL supports Python stored procedures / functions and also supports
using R in the same way, for manipulating data.  Then you can work with
everything and just retrieve the resulting information.

You might try storing the raw data and the XML / YAML, but I believe that
keeping those sync'ed might cause you some extra work.

Tempting thought, but one of the problems with this kind of horse
racing tote data is that a lot of it is for combinations of runners
rather than single runners. Whilst there might be (say) 14 horses in a
race, there are 91 quinella price combinations (1-2 through 13-14,
i.e. the 2-subsets of range(1, 15)) and 364 trio price combinations.
It is not really practical (I suspect) to have database tables with
columns for that many combinations?
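The combination counts quoted above can be checked with `itertools.combinations` (Python 3). The `"1-2"` style keys are just one possible way to label a combination and are an assumption, not the tote's own format.

```python
from itertools import combinations

runners = range(1, 15)                       # 14 horses, numbered 1..14
quinellas = list(combinations(runners, 2))   # unordered pairs
trios = list(combinations(runners, 3))       # unordered triples

print(len(quinellas), len(trios))            # 91 364


def combo_key(combo):
    """Label a combination as e.g. '1-2' or '3-7-12' (illustrative format)."""
    return "-".join(str(n) for n in combo)


print(combo_key(quinellas[0]), combo_key(quinellas[-1]))   # 1-2 13-14
```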

I certainly DO have a horror of having my XML / whatever else formats
getting out of sync. I also have to worry about the tote company later
changing their XML format. From that viewpoint, there is indeed a lot
to be said for storing the tote data as numbers in tables.
 

bullockbefriending bard

I think twisted is overkill for this problem. Threading, elementtree
and urllib should more than suffice. One thread polling the server for
each race with the desired polling interval. Each time some data is
treated, that thread sends a signal containing information about what
changed. The gui listens to the signal and will, if needed, update
itself with the new information. The database handler also listens to
the signal and updates the db.

So, if I understand you correctly:

Assuming 8 races and we are just about to start race 1, we would
have 8 polling threads with the race 1 thread polling at faster rate
than the other ones. After betting on race 1 closed, I could dispense with
that thread, change race 2 thread to poll faster, and so on...? I had
been rather stupidly thinking of just two polling threads, one for the
current race and one for races not yet run... but starting out with a
thread for each extant race seems simpler given there then is no need
to handle the mechanics of shifting the polling of races from the
omnibus slow thread to the current race fast thread.

Having got my minidom parser working nicely, I'm inclined to stick
with it for now while I get other parts of the problem licked into
shape. However, I do take your point that it's probably overkill for
this simple kind of structured, mostly numerical data and will try to
find time to experiment with the elementtree approach later. No harm
at all in shaving the odd second off document parse times.
 

Jorge Godoy

bullockbefriending said:
Tempting thought, but one of the problems with this kind of horse
racing tote data is that a lot of it is for combinations of runners
rather than single runners. Whilst there might be (say) 14 horses in a
race, there are 91 quinella price combinations (1-2 through 13-14,
i.e. the 2-subsets of range(1, 15)) and 364 trio price combinations.
It is not really practical (I suspect) to have database tables with
columns for that many combinations?

I certainly DO have a horror of having my XML / whatever else formats
getting out of sync. I also have to worry about the tote company later
changing their XML format. From that viewpoint, there is indeed a lot
to be said for storing the tote data as numbers in tables.

I don't understand anything about horse races... But it should be possible
to normalize such information into some tables (not necessarily one). But
then, there is nothing that prevents you from having dozens of columns on
one table if it is needed (it might not be the most efficient solution
performance and disk space-wise depending on what you have, but it works).

Using things like that you can even enhance your system and provide more
information about each horse, its race history, price history, etc.

I love working with data and statistics, so even though I don't know the
rules and workings of horse racing, I can think of several things I'd like
to track or extract from the information you seem to have :)

How does that price thing work? Are these the payout ratios for bets?
What is a quinella or a trio? Two or three horses in a defined order
winning the race?
 

David

Date is the time of the server response and not of the last data update.
It is definitely the time of the server response to my request and bears no
relation to when the live XML data was updated. I know this for a fact
because right now there is no active race meeting and any data still
available is static and many hours old. I would not feel confident
rejecting incoming data as duplicate based only on same content length
criterion. Am I missing something here?

It looks like the data is dynamically generated on the server, so the
web server doesn't know if/when the data changed. A last-modified time
is usually only reported for static content (images, HTML files, etc).
You could go by the
Cache-Control line and only fetch data every 30 seconds, but it's
possible for you to miss some updates this way.

Another thing you could try (if necessary, this is a bit of an
overkill) - download the first part of the XML (GET request with a
range header), and check the timestamp you mentioned. If that changed
then re-request the doc (a download resume is risky, the XML might
change between your 2 requests).
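A Python 3 sketch of the range-header request. The 512-byte default is an arbitrary guess at how far into the document the time stamp sits, and the function name is made up.

```python
import urllib.request


def first_bytes_request(url, nbytes=512):
    """Build (but do not send) a GET request for the first `nbytes` bytes."""
    byte_range = "bytes=0-%d" % (nbytes - 1)
    return urllib.request.Request(url, headers={"Range": byte_range})


if __name__ == "__main__":
    req = first_bytes_request("http://get-rich.quick.com")
    print(req.get_method(), req.get_header("Range"))   # GET bytes=0-511
```

A server that honours the range replies with status 206 (Partial Content). One that ignores it replies 200 with the whole document, in which case nothing is saved but nothing breaks either.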

David.
 

David

3) I need to dump this data (for all races, not just current about to
start race) to text files, store it as BLOBs in a DB *and* update real
time display in a wxpython windowed client.

A few important questions:

1) How real-time must the display be? (should update immediately after
you get new XML data, or is it ok to update a few seconds later?).

2) How much data is being processed at peak? (100 records a second, 1000?)

3) Does your app need to share fetched data with other apps? If so,
how? (read from db, download HTML, RPC, etc).

4) Does your app need to use data from previous executions? (eg: if
you restart it, does it need to have a fully populated UI, or can it
start from an empty UI and start updating as it downloads new XML
updates).

How you answer the above questions determines what kind of algorithm
will work best.

David.

PS: I suggest that you contact the people you're downloading the XML
from if you haven't already. eg: it might be against their TOS to
constantly scrape data (I assume not, since they provide XML). You
don't want them to black-list your IP address ;-). Also, maybe they
have ideas for efficient data retrieval (eg: RSS feeds).
 

David

Tempting thought, but one of the problems with this kind of horse
racing tote data is that a lot of it is for combinations of runners
rather than single runners. Whilst there might be (say) 14 horses in a
race, there are 91 quinella price combinations (1-2 through 13-14,
i.e. the 2-subsets of range(1, 15)) and 364 trio price combinations.
It is not really practical (I suspect) to have database tables with
columns for that many combinations?

If you normalise your tables correctly, these will be represented as
one-to-many or many-to-many relationships in your database. Like the
other poster I don't know the first thing about horses, and I may be
misunderstanding something, but here is one (basic) normalised db
schema:

tables & descriptions:

- horse - holds info about each horse
- race - one record per race. Has times, etc
- race_horse - holds records linking horses and races together.

You can derive all possible horse combinations from the above info.
You don't need to store it in the db unless you need to link something
else to it (eg: betting data). In which case:

- combination - represents one combination of horses.
- combination_horse - links a combination to 1 horse. 1 of these per
horse per combination.
- bet - Represents a bet. Has foreign relationship with combination
(and other tables, eg: better, race)

With a structure like the above you don't need hundreds of database columns :)
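The tables listed above could be sketched with Python 3's `sqlite3` roughly as follows. Every column name here is a guess for illustration, the bet table is left out for brevity, and a real schema would depend on what the tote XML actually contains.

```python
import sqlite3

SCHEMA = """
CREATE TABLE horse (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE race (id INTEGER PRIMARY KEY, start_time TEXT);
CREATE TABLE race_horse (
    race_id INTEGER REFERENCES race(id),
    horse_id INTEGER REFERENCES horse(id),
    number INTEGER,                 -- runner number in this race
    PRIMARY KEY (race_id, horse_id)
);
CREATE TABLE combination (
    id INTEGER PRIMARY KEY,
    race_id INTEGER REFERENCES race(id),
    kind TEXT                       -- e.g. 'quinella' or 'trio'
);
CREATE TABLE combination_horse (
    combination_id INTEGER REFERENCES combination(id),
    horse_id INTEGER REFERENCES horse(id),
    PRIMARY KEY (combination_id, horse_id)
);
"""


def build_demo_db():
    """Create an in-memory database holding one race and one quinella."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO horse VALUES (?, ?)",
                   [(1, "A"), (2, "B"), (3, "C")])
    db.execute("INSERT INTO race VALUES (1, '2008-04-27 16:00')")
    db.executemany("INSERT INTO race_horse VALUES (1, ?, ?)",
                   [(1, 1), (2, 2), (3, 3)])
    db.execute("INSERT INTO combination VALUES (1, 1, 'quinella')")
    db.executemany("INSERT INTO combination_horse VALUES (1, ?)",
                   [(1,), (2,)])
    return db


def horses_in(db, combination_id):
    """Return the names of the horses that make up one combination."""
    rows = db.execute(
        "SELECT h.name FROM combination_horse ch "
        "JOIN horse h ON h.id = ch.horse_id "
        "WHERE ch.combination_id = ? ORDER BY h.id", (combination_id,))
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(horses_in(build_demo_db(), 1))   # ['A', 'B']
```

Each quinella or trio is then a handful of rows rather than a column, so the number of combinations has no effect on the table layout.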

David.
 

Bjoern Schliessmann

bullockbefriending said:
1) The data for the race about to start updates every (say) 15
seconds, and the data for earlier and later races updates only every
(say) 5 minutes. There is no point for me to be hammering the server
with requests every 15 seconds for data for races after the upcoming
race... I should query for this perhaps every 150s to be safe. But for
the upcoming race, I must not miss any updates and should query every
~7s to be safe. So... in the middle of a race meeting the situation
might be:

I don't fully understand this, but can't you design the server in a
way that you can connect to it and it notifies you about important
things? IMHO, polling isn't ideal.

bullockbefriending said:
My initial thought was to have two threads for the different
update polling cycles. In addition I would probably need another
thread to handle UI stuff, and perhaps another for dealing with
file/DB data write out.

No need for any additional threads. UI, networking and file I/O can
operate asynchronously. Using wxPython's timers with callback
functions, you should need only standard Python modules (except
wx).
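The timer-callback shape can be illustrated without a GUI. The Python 3 sketch below swaps in the standard `sched` module with a simulated clock in place of wxPython timers, purely so it runs anywhere; in a real wx application each race would have its own timer firing the callback from the event loop, and a slow download inside a callback would still stall the UI.

```python
import sched


def run_schedule(intervals, duration):
    """Simulate single-threaded polling; return a list of (time, race).

    `intervals` maps race number -> seconds between polls. The clock is
    simulated, so this returns immediately instead of really waiting.
    """
    clock = [0.0]

    def now():
        return clock[0]

    def sleep(seconds):
        clock[0] += seconds      # advance the fake clock instead of blocking

    scheduler = sched.scheduler(now, sleep)
    log = []

    def poll(race, interval):
        log.append((now(), race))             # the fetch would happen here
        if now() + interval <= duration:
            scheduler.enter(interval, 1, poll, (race, interval))  # re-arm

    for race, interval in intervals.items():
        scheduler.enter(0, 1, poll, (race, interval))
    scheduler.run()
    return log


if __name__ == "__main__":
    for when, race in run_schedule({3: 7, 4: 150}, 30):
        print(when, race)
```

Every callback re-arms itself, which is the same pattern a one-shot GUI timer would use.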

bullockbefriending said:
But, I wonder if using Twisted is a better idea?

IMHO that's only advisable if you like to create your own protocols and
reuse them in different apps, or need full-featured customisable
implementations of advanced protocols.

Additionally, you'd *have to* use multiple threads: One for the
Twisted event loop and one for the wxPython one.

There is a wxreactor in Twisted which integrates the wxPython event
loop, but I stopped using it due to strange deadlock problems which
began with some wxPython version. Also, it seems it's no longer in
development. But my alternative works perfectly (main thread with
Twisted, and a GUI thread for wxPython, communicating over Python
standard queues).

You'd only need additional threads if you would do heavy number
crunching inside the wxPython or Twisted thread. For the respective
event loop not to hang, it's advisable to use a separate thread for
long-running calculations.

bullockbefriending said:
I have zero experience with these kinds of design choices and would be
very happy if those with experience could point out the pros and cons
of each (synchronous/multithreaded, or Twisted) for dealing with the
two differing sample rates problem outlined above.

I'd favor an "as few threads as necessary" approach. In my experience
this saves pain (i. e. deadlocks and boilerplate queueing code).

Regards,


Björn
 
