P
Prateek
I have a bit of a specialized request.
I'm reading a table of strings (specifically fixed length 36 char
uuids generated via uuid.uuid4() in the standard library) from a file
and creating a set out of it.
Then my program is free to make whatever modifications to this set.
When I go back to save this set, I'd like to be able to only save the
new items. Currently I am creating a copy of the set as soon as I load
it and then when I go back to save it, i'm calculating the difference
and saving just the difference.
There are many problems with this approach so far:
1) Calculating the difference via the standard set implementation is
not very scalable -> O(n) I presume
2) Maintaining a copy wastes memory
3) I don't have a good solution if I delete items from the set
(calculating the difference will return an empty set but I need to
actually delete stuff).
I was thinking of writing a specialized set implementation (I have no
idea how to start on this) which tracks new items (added after
initialization) in a separate space and exposes a new API call which
would enable me to directly get those values. This is kind of ugly and
doesn't solve problem 3.
I also thought of using a hastable but I'm storing many ( > 1 million)
of these sets in the same file (most of them will be empty or contain
just a few values but a few of them will be very large - excess of
10,000 items). The file is already separated into tablespaces.
Does anyone have any ideas?
Thanks in advance.
Prateek
PS: Yes - I need blazing fast performance - simply pickling/unpickling
won't do. Memory constraints are important but definitely secondary.
Disk space constraints are not very important.
I'm reading a table of strings (specifically fixed length 36 char
uuids generated via uuid.uuid4() in the standard library) from a file
and creating a set out of it.
Then my program is free to make whatever modifications to this set.
When I go back to save this set, I'd like to be able to only save the
new items. Currently I am creating a copy of the set as soon as I load
it and then when I go back to save it, i'm calculating the difference
and saving just the difference.
There are many problems with this approach so far:
1) Calculating the difference via the standard set implementation is
not very scalable -> O(n) I presume
2) Maintaining a copy wastes memory
3) I don't have a good solution if I delete items from the set
(calculating the difference will return an empty set but I need to
actually delete stuff).
I was thinking of writing a specialized set implementation (I have no
idea how to start on this) which tracks new items (added after
initialization) in a separate space and exposes a new API call which
would enable me to directly get those values. This is kind of ugly and
doesn't solve problem 3.
I also thought of using a hastable but I'm storing many ( > 1 million)
of these sets in the same file (most of them will be empty or contain
just a few values but a few of them will be very large - excess of
10,000 items). The file is already separated into tablespaces.
Does anyone have any ideas?
Thanks in advance.
Prateek
PS: Yes - I need blazing fast performance - simply pickling/unpickling
won't do. Memory constraints are important but definitely secondary.
Disk space constraints are not very important.