A faster way of finding historical highs/lows


Eamonn Sullivan

This is a basic algorithm question (from a history major who never
took an algorithms course), but I'm also looking for the fastest way
to do this in Python -- for example, by making the most of the
existing C library code. I have two searches in an application that
I'm trying to speed up:

1. Find the most recent date when there was a value equal to or
higher (or lower) than X. The data (stock, currency and commodity
prices) looks like:

Col A    Col B    (a few other columns of prices)
Date     Price

Right now, the working code just goes through chronologically and
returns the value when the condition is met:

for item in price_list:
    # return the first entry whose price meets the threshold
    if val <= item['price']:
        return item['date'], item['price']

I'm wondering if anyone has any ideas about a faster way that a) won't
unnecessarily slow down the simple cases (highest since yesterday, or
last week), while b) speeding up the edge cases (highest/lowest since
1947...).

2. Find the last time the price fell/rose X or more times in a row.

Same data format.

Right now, this is implemented inefficiently, something like this:

streak_now = get_streak(price_list[0])
i = 1
while i < len(price_list):
    # get_streak() measures the length of the consecutive run at that point
    streak_then = get_streak(price_list[i])
    if streak_then >= streak_now:
        return price_list[i]['date'], streak_then
    i += 1

That works OK most of the time because long streaks are rare, so a
match usually turns up within a few months of data. But edge cases
can be extreme -- many years -- and take a while (a minute, say) to
find.

In either of these, the data can be quite long (time wise), but not
large, so it would be possible to copy, reorder and find the answer
that way. Any ideas?
 

Peter Hansen

Eamonn said:
1. Find the most recent date when there was an equal or higher (or
lower) value than X.

The fastest algorithm might depend on how you use the data, as well.
For example, do you update the data often, and search it rarely,
or update it rarely and do the search very often? If searching
many times between updates, some preprocessing will likely make things
go much faster.

Both of your examples sound to me like they would benefit by
using sort(), then a binary search. Sort is very fast relative
to things like the Python loops you are doing, so using it to
prepare the data before the search can be a good step.
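(A minimal sketch of that sort-plus-binary-search idea in Python,
assuming the data is a chronological list of dicts with 'date' and
'price' keys, as in the snippet from the original post -- the
function name here is made up for illustration:)

import bisect

# Preprocess once: (price, date) pairs sorted by price.
by_price = sorted((item['price'], item['date']) for item in price_list)
prices = [p for p, _ in by_price]

def most_recent_at_or_above(val):
    # bisect_left finds the first entry priced at or above val;
    # everything from that index onward meets the condition.
    i = bisect.bisect_left(prices, val)
    if i == len(prices):
        return None  # no price ever reached val
    # most recent date among the qualifying entries
    price, date = max(by_price[i:], key=lambda pair: pair[1])
    return date, price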

-Peter
 

Humpdydum

Peter Hansen said:
The fastest algorithm might depend on how you use the data, as well.
For example, do you update the data often, and search it rarely,
or update it rarely and do the search very often? If searching
many times between updates, some preprocessing will likely make things
go much faster.

Both of your examples sound to me like they would benefit by
using sort(), then a binary search. Sort is very fast relative
to things like the Python loops you are doing, so using it to
prepare the data before the search can be a good step.

One thing to keep in mind there is that it depends on the "average"
number of iterations you expect to make before finding a value. If
you only have to go back about 50 data points on average to find a
peak, sorting 100000 items might be overkill. Yet, if you sort once
and do a large number of searches, it may well be worth it. If new
values can be added into a sorted container, and get put in the right
spot, you only need to sort once too. Etc.
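(A rough sketch of that last point, using the standard bisect module
to keep a price-sorted container up to date as new values arrive --
the names here are made up for illustration:)

import bisect

sorted_prices = []  # kept in ascending (price, date) order

def add_observation(price, date):
    # Insert the new value at its sorted position, so the
    # container never needs a full re-sort.
    bisect.insort(sorted_prices, (price, date))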

Oliver
 

Eamonn Sullivan

Peter Hansen said:
The fastest algorithm might depend on how you use the data, as well.
For example, do you update the data often, and search it rarely,
or update it rarely and do the search very often? If searching
many times between updates, some preprocessing will likely make things
go much faster.

Both of your examples sound to me like they would benefit by
using sort(), then a binary search. Sort is very fast relative
to things like the Python loops you are doing, so using it to
prepare the data before the search can be a good step.

-Peter

Thanks for this. At the moment, the software answers a few questions
(highest/lowest, longest streak, etc.) once per retrieval of data. The
database retrieval, though, is *by far* the biggest time sapper, so a
little preprocessing would be almost unnoticeable in comparison,
probably (depends on how much, of course).

So, I'm guessing I could sort on the particular price field I'm using
(decorate-sort-undecorate), and then just find the most recent date
among the subset of data that meets the criteria (higher or lower). Is
that what you mean? By binary search, do you mean further reorganizing
the data into a binary tree using date?
 

David Fraser

Eamonn said:
Thanks for this. At the moment, the software answers a few questions
(highest/lowest, longest streak, etc.) once per retrieval of data. The
database retrieval, though, is *by far* the biggest time sapper, so a
little preprocessing would be almost unnoticeable in comparison,
probably (depends on how much, of course).

So, I'm guessing I could sort on the particular price field I'm using
(decorate-sort-undecorate), and then just find the most recent date
among the subset of data that meets the criteria (higher or lower). Is
that what you mean? By binary search, do you mean further reorganizing
the data into a binary tree using date?

If you have this in a relational database, you might find that the
database can answer the question quicker for you, using indexes if
necessary ...
select max(xxx) from yyy where zzz > 40
with an index on xxx and zzz, this will usually be done quickly
internally by the database, and then you just get the result returned
rather than having to process it yourself.
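(For what it's worth, a sketch of how such a query could be issued
from Python with the standard sqlite3 module -- the file, table and
column names are hypothetical, since the data in this thread doesn't
actually live in a local database:)

import sqlite3

conn = sqlite3.connect('prices.db')  # hypothetical database file
cur = conn.cursor()

# Let the database find the most recent date with a price at or above
# the threshold; an index on the price column keeps this fast.
cur.execute("SELECT MAX(date) FROM prices WHERE price >= ?", (40,))
print(cur.fetchone()[0])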

David
 

Eamonn Sullivan

David Fraser said:
If you have this in a relational database, you might find that the
database can answer the question quicker for you, using indexes if
necessary ...
select max(xxx) from yyy where zzz > 40
with an index on xxx and zzz, this will usually be done quickly
internally by the database, and then you just get the result returned
rather than having to process it yourself.

David


Unfortunately, I don't have that level of control. It's market data
through an ActiveX control (on Windows) or C API (elsewhere) with only
a handful of calls -- e.g. GetHistoricalData or Subscribe (for
realtime). So, what I do is pull down the history (up to the present
moment) on a security for a number of price fields and then answer a
bunch of questions on a local copy: high/low value since, biggest
gain/loss since, longest streak since, percentage of average volume,
etc. The user may then click a Refresh button (to fill today's field
with fresher data) and then typically moves on to another security. In
most
cases, the hourglass stays on for 3-4 seconds or less to answer *all*
of the questions, or as long as 15 seconds if, for example, it's the
longest streak in eons.

The reason I'm doing this, though, is that I'd like to eventually make
the Refresh button unnecessary -- automatically pull in near realtime
data as it happens. The several seconds then become a bit too much
overhead. I know I can do other tricks (if the value hasn't exceeded a
certain window, then the high/low since doesn't need recalculating),
but I'm taking the problem in bits and seeing if I can improve the
searching algorithms first.
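(One such trick, sketched with made-up names -- just an illustration
of the "hasn't exceeded the window" idea, not code from the actual
application:)

# Track the high/low window seen so far; a new tick only forces a
# full recalculation when it breaks out of that window.
session_high = float('-inf')
session_low = float('inf')

def needs_recalc(price):
    global session_high, session_low
    if session_low <= price <= session_high:
        return False  # inside the known window: cached answers hold
    session_high = max(session_high, price)
    session_low = min(session_low, price)
    return True  # new extreme: re-run the "highest/lowest since" search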
 

Peter Hansen

Eamonn said:
Thanks for this. At the moment, the software answers a few questions
(highest/lowest, longest streak, etc.) once per retrieval of data. The
database retrieval, though, is *by far* the biggest time sapper, so a
little preprocessing would be almost unnoticeable in comparison,
probably (depends on how much, of course).

So, I'm guessing I could sort on the particular price field I'm using
(decorate-sort-undecorate), and then just find the most recent date
among the subset of data that meets the criteria (higher or lower). Is
that what you mean?

That's exactly the type of thing I meant, and even the specific
improvement I was considering.
By binary search, do you mean further reorganizing
the data into a binary tree using date?

The binary search is just a lot faster way of homing in on a region
of the data compared to most other approaches. If it's sorted by
price, then binary search to get to the chunk with prices equal
to what you want. The last one (assuming ascending dates) is the
"most recent with an equal or lower price", or if there is none
equal, pick the one to the left of the final binary search point...
that kind of thing.

For a binary search, you just examine the data element halfway between
two end points each time, comparing it with the search target.
Decide if you need to continue search to the left or the right of
the current position, then pick a point halfway into _that_ region
and do it again. Eventually you are down to a region with a single
point, and then you're done. It takes only log2(n) comparisons to
search in n points, and often less...
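(Spelled out, that loop might look like the following -- a generic
sketch over a plain sorted list of prices, not tied to the actual
data structure in use:)

def leftmost_at_or_above(prices, target):
    """Index of the first element >= target in a sorted list,
    or len(prices) if every element is smaller."""
    lo, hi = 0, len(prices)
    while lo < hi:
        mid = (lo + hi) // 2      # examine the halfway point
        if prices[mid] < target:
            lo = mid + 1          # the answer lies to the right
        else:
            hi = mid              # the answer is here or to the left
    return lo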

-Peter
 
