historical data from many sources: design questions

Hicham Mouline

Hello,

I have measurements taken daily (work days only) for the past 20 years or
so, on the order of 5,000 entries.
They currently live in a text file (I've hand-written the parser, but I'll
move to boost::spirit eventually).

The application is growing:
. I may move to 20 years' worth of measurements taken every few seconds,
arriving at roughly 50,000,000 entries. Each entry is probably 64 bytes.
. I may use a database
. I may receive the data over a network socket

I have a class 'historical_data' that currently holds the 5,000 entries in
memory. A standalone function (in the same namespace as that class) parses
the text file into that class.
My application typically iterates over this data from earliest to latest.

I am wondering what incremental changes to introduce to the code I have, to:
1. make a factory function that creates a full 'historical_data' from a text
file, database, or network socket
2. allow for 50,000,000 entries instead of 5,000, possibly keeping just
part of the data in memory and the rest on text file/database/network, and
accessing all of it transparently

rds,
 
Goran

Database management systems exist precisely for crunching such quantities
of data. A home-grown solution, even if very constrained and simplified
through assumptions about the data's structure, is likely to:
* cost a lot to build and still be suboptimal
* be much harder to grow in functionality and scale
* make you learn about on-disk indexing and caching of data
(knowing is a good thing, but this is a big subject, and not your
actual goal ;-) )

You should consider a database. BerkeleyDB comes to mind, but
here's one search for alternatives: http://stackoverflow.com/questions/260804/alternative-to-berkeleydb.
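
For a flavour of what that buys you, here is a rough sketch against the
Berkeley DB C++ API (my guess at a sensible layout, not a tested program):
a B-tree keyed on a big-endian timestamp keeps the entries in time order,
so a cursor walks them earliest-to-latest without ever holding all
50,000,000 in memory.

#include <db_cxx.h>  // Berkeley DB C++ API

int main()
{
    Db db(nullptr, 0);
    // A B-tree keyed on a big-endian timestamp stores entries sorted by time.
    db.open(nullptr, "measures.db", nullptr, DB_BTREE, DB_CREATE, 0);

    Dbc* cursor;
    db.cursor(nullptr, &cursor, 0);

    Dbt key, value;
    while (cursor->get(&key, &value, DB_NEXT) == 0) {
        // value.get_data()/value.get_size() hold one ~64-byte entry;
        // DB_NEXT delivers them earliest-to-latest thanks to the key order.
    }
    cursor->close();
    db.close(0);
}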

Goran.
 
Pavel

Not enough information, so I'm just assuming everything:

If your standalone function has a string parameter for the file name,
come up with a prefix to denote other data sources (e.g.
"@tcp<ip-address>", "@database<data-source-description>", or just
"<filename>" to support the current usage, given your filenames don't
start with '@').

Otherwise, introduce such a parameter with an empty-string default and
load your old file when the argument is an empty string.
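
A minimal sketch of that dispatch (the load_from_* helpers are hypothetical
stand-ins for the real loaders; only the file one exists today, and
"measures.txt" is an assumed default):

#include <string>

struct historical_data { /* the existing class holding the entries */ };

// Hypothetical backend loaders; only the file-based one exists today.
historical_data load_from_file(const std::string& filename);
historical_data load_from_tcp(const std::string& address);
historical_data load_from_db(const std::string& description);

historical_data make_historical_data(const std::string& source = "")
{
    if (source.empty())                          // old behaviour preserved
        return load_from_file("measures.txt");   // assumed current file
    if (source.compare(0, 4, "@tcp") == 0)       // "@tcp<ip-address>"
        return load_from_tcp(source.substr(4));
    if (source.compare(0, 9, "@database") == 0)  // "@database<description>"
        return load_from_db(source.substr(9));
    return load_from_file(source);               // plain "<filename>"
}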

From your description, the API of the historical_data class itself
does not need to change unless you provide access to the data "in
place"; in that case you would have to change it so that client
applications provide a buffer for the portion of data they want to
process as they iterate, and then recompile/retest them all.
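
For illustration, such an "in place" API could look something like this
(hypothetical names, not the poster's actual class):

#include <cstddef>

struct Sample { /* one ~64-byte measurement */ };

class historical_data {
public:
    std::size_t size() const;  // total number of entries available

    // Copy up to 'count' entries, starting at index 'first', into the
    // caller-provided buffer 'out'; returns how many were actually copied.
    // The implementation is then free to fetch them from memory, file,
    // database or network behind the scenes.
    std::size_t read(std::size_t first, std::size_t count, Sample* out) const;
};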

Then, incrementally implement the new functionality under the hood
(keeping only part of the data in memory, database and network access, etc.).

Hope this helps,
Pavel
 
Jorgen Grahn

> I have measurements taken daily (work days only) for the past 20 years or
> so, on the order of 5,000 entries.
> They currently live in a text file (I've hand-written the parser, but I'll
> move to boost::spirit eventually).

If your parser is broken or you want to learn boost::spirit, that's a
good idea. Otherwise not.

> The application is growing:
> . I may move to 20 years' worth of measurements taken every few seconds,
> arriving at roughly 50,000,000 entries. Each entry is probably 64 bytes.

That's an odd change -- things that happen once a day usually don't
suddenly increase in frequency by a factor of 10,000. Especially not if
the frequency has been fixed for 20 years.

> . I may use a database
> . I may receive the data over a network socket

> I have a class 'historical_data' that currently holds the 5,000 entries in
> memory. A standalone function (in the same namespace as that class) parses
> the text file into that class.
> My application typically iterates over this data from earliest to latest.

So it's typically a waste to keep all the data in class historical_data.

> I am wondering what incremental changes to introduce to the code I have, to:
> 1. make a factory function that creates a full 'historical_data' from a text
> file, database, or network socket

Don't you have one already? Just add two more. I think it's overkill
to do fancy Design Pattern stuff here. However:

> 2. allow for 50,000,000 entries instead of 5,000, possibly keeping just
> part of the data in memory and the rest on text file/database/network, and
> accessing all of it transparently

Here it becomes obvious that class historical_data is an inefficient
design: you cannot do any processing until the I/O is done, and you
must fit everything into memory.

I think you should switch focus to the individual samples (let's say
class Sample) and ways to operate on sequences of Samples. For some
uses you may need to feed your samples into a std::vector<Sample> or
similar and then process it; for other uses you can just let them stream
by. It's the Unix pipe/stream idea.

Reading from text file or TCP socket ... one design which I find
fast and flexible is this one:

- read a chunk of data from somewhere into a buffer
- try to parse (and use) as many Samples from it as possible.
This may yield 0 or many Samples, and it may or may not
consume the whole buffer
- remove the consumed part of the buffer
- read another chunk of data, appending to the buffer
- try to parse again, and so on

The part which needs to know about your class Sample can look like:

std::pair<std::vector<Sample>, const char*>
parse(const char* begin, const char* end);

or if it's responsible for /using/ the Samples too:

const char* parse(const char* begin, const char* end);
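
For what it's worth, a minimal sketch of the whole chunk-and-parse loop,
assuming the first signature above and a hypothetical read_chunk() that
stands in for whatever the source is (file, socket, ...):

#include <cstddef>
#include <utility>
#include <vector>

struct Sample { /* one ~64-byte measurement */ };

// Assumed implemented elsewhere: consumes as many complete Samples as
// possible from [begin, end) and returns them together with a pointer
// just past the last byte consumed.
std::pair<std::vector<Sample>, const char*>
parse(const char* begin, const char* end);

// Hypothetical source: reads up to 'max' bytes into 'dest' and returns
// the number of bytes read, 0 at end of input.
std::size_t read_chunk(char* dest, std::size_t max);

void process_all()
{
    std::vector<char> buffer;
    char chunk[4096];
    while (std::size_t n = read_chunk(chunk, sizeof chunk)) {
        buffer.insert(buffer.end(), chunk, chunk + n);  // append new data
        auto parsed = parse(buffer.data(), buffer.data() + buffer.size());
        for (const Sample& s : parsed.first) {
            (void)s;  // use the sample; earliest-to-latest order holds
        }
        // drop the consumed prefix; an incomplete tail stays for next round
        buffer.erase(buffer.begin(),
                     buffer.begin() + (parsed.second - buffer.data()));
    }
}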

/Jorgen
 
