Support for new items in set type

Prateek

I have a bit of a specialized request.

I'm reading a table of strings (specifically fixed length 36 char
uuids generated via uuid.uuid4() in the standard library) from a file
and creating a set out of it.
Then my program is free to make whatever modifications to this set.

When I go back to save this set, I'd like to be able to only save the
new items. Currently I am creating a copy of the set as soon as I load
it and then when I go back to save it, I'm calculating the difference
and saving just the difference.
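
Concretely, the current approach looks roughly like this (heavily
simplified - the file layout and names here are just for illustration,
the real format is binary):

import uuid

def load_set(path):
    # illustrative only: one 36-char uuid string per line
    return set(line.strip() for line in open(path) if line.strip())

current = load_set("table.txt")
original = set(current)          # wasteful copy, kept only for the diff

# ... the program mutates `current` freely ...
current.add(str(uuid.uuid4()))

# at save time, write only the additions
new_items = current - original   # O(n) difference over the whole set
out = open("table.txt", "a")
for item in new_items:
    out.write(item + "\n")
out.close()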

There are many problems with this approach so far:
1) Calculating the difference via the standard set implementation is
not very scalable -> O(n) I presume
2) Maintaining a copy wastes memory
3) I don't have a good solution if I delete items from the set
(calculating the difference will return an empty set but I need to
actually delete stuff).

I was thinking of writing a specialized set implementation (I have no
idea how to start on this) which tracks new items (added after
initialization) in a separate space and exposes a new API call which
would enable me to directly get those values. This is kind of ugly and
doesn't solve problem 3.

I also thought of using a hashtable but I'm storing many (> 1 million)
of these sets in the same file (most of them will be empty or contain
just a few values, but a few of them will be very large - in excess of
10,000 items). The file is already separated into tablespaces.

Does anyone have any ideas?

Thanks in advance.
Prateek
PS: Yes - I need blazing fast performance - simply pickling/unpickling
won't do. Memory constraints are important but definitely secondary.
Disk space constraints are not very important.
 
Alex Martelli

Prateek said:
I have a bit of a specialized request.

I'm reading a table of strings (specifically fixed length 36 char
uuids generated via uuid.uuid4() in the standard library) from a file
and creating a set out of it.
Then my program is free to make whatever modifications to this set.

When I go back to save this set, I'd like to be able to only save the
new items. Currently I am creating a copy of the set as soon as I load
it and then when I go back to save it, I'm calculating the difference
and saving just the difference.

There are many problems with this approach so far:
1) Calculating the difference via the standard set implementation is
not very scalable -> O(n) I presume

Yep, you do have to look at every item, so O(N) is an obvious lower
bound.
2) Maintaining a copy wastes memory
3) I don't have a good solution if I delete items from the set
(calculating the difference will return an empty set but I need to
actually delete stuff).

(3) is easy -- the difference originalset-finalset is the set of things
you have to delete.
Does anyone have any ideas?

Your problem is really a thinly disguised version of one of the most
ancient and venerable ones in data processing -- the "master file /
transaction file" problem. With sets (which are built as hash tables)
you have very powerful weapons for this fray (it used to be, many years
ago, that sorting and "in-step processing of sorted files" was the only
feasible approach to such problems), but you still have to wield them
properly.

In your shoes, I would write a class whose instances hold three sets:
-- the "master set" is what you originally read from the file
-- the "added set" is the set of things you've added since then
-- the "deleted set" is the set of things you've deleted since them

You can implement a set-like interface; for example, your .add method
will need to remove the item from the deleted set (if it was there),
then add it to the added set (if it wasn't already in the master set);
and so on, for each and every method you actually desire (including no
doubt __contains__ for membership testing). E.g., with suitably obvious
names for the three sets, we could have...:

class cleverset(object):
    ...
    def __contains__(self, item):
        if item in self.deleted: return False
        elif item in self.added: return True
        else: return item in self.master

When the time comes to save back to disk, you'll perform deletions and
additions based on self.deleted and self.added, of course.
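
Continuing that sketch, the mutating methods and the eventual save
step might look roughly like this (untested; the two save-side
callables are just placeholders for whatever your storage layer
provides):

    def add(self, item):
        self.deleted.discard(item)      # un-delete it if it had been removed
        if item not in self.master:
            self.added.add(item)        # only track it if it's genuinely new

    def discard(self, item):
        if item in self.master:
            self.deleted.add(item)      # logically delete a pre-existing item
        else:
            self.added.discard(item)    # forget an item added this session

    def save(self, delete_on_disk, append_on_disk):
        for item in self.deleted:
            delete_on_disk(item)
        for item in self.added:
            append_on_disk(item)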

I'm not addressing the issue of how to best represent such sets on disk
because it's not obvious to me what operations you need to perform on
the on-disk representation -- for example, deletions can be very costly
unless you can afford to simply set a "logically deleted" flag on
deleted items; but, depending on how often you delete stuff, that will
eventually degrade performance due to fragmentation, and maintaining the
indexing (if you need one) can be a bear.

Normally, I would use a good relational database (which has no doubt
already solved on my behalf all of these issues) for purpose of
persistent storage of such data -- usually sqlite for reasonably small
amounts of data, PostgreSQL if I need enormous scalability and other
"enterprise" features, but I hear other people are happy with many other
engines for relational DBs, such as Oracle (used to be superb if you
could afford full-time specialized db admins), IBM's DB2, SAP's database
(now opensourced), Firebird, even MySQL (I've never, but NEVER, had any
good experience with the latter, but I guess maybe it's only me, as
otherwise it could hardly be so popular). The main point here is that a
relational DB's tables _should_ be already very well optimized for
storing "sets", since those table ARE exactly nothing but sets of tuples
(and a 1-item tuple is a tuple for all that:).
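
For instance, with sqlite the whole master/transaction dance boils
down to a handful of statements (sketch only -- the table and column
names are invented):

import sqlite3, uuid

conn = sqlite3.connect("sets.db")
conn.execute("CREATE TABLE IF NOT EXISTS items (uid TEXT PRIMARY KEY)")

# load the "set"
master = set(row[0] for row in conn.execute("SELECT uid FROM items"))

# additions and deletions go straight to the table; the engine worries
# about indexing, free space, fragmentation, and the like
new_uid = str(uuid.uuid4())
conn.execute("INSERT OR IGNORE INTO items (uid) VALUES (?)", (new_uid,))
conn.execute("DELETE FROM items WHERE uid = ?", (new_uid,))
conn.commit()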


Alex
 
Steven D'Aprano

I have a bit of a specialized request.

I'm reading a table of strings (specifically fixed length 36 char
uuids generated via uuid.uuid4() in the standard library) from a file
and creating a set out of it.
Then my program is free to make whatever modifications to this set.

When I go back to save this set, I'd like to be able to only save the
new items.

This may be a silly question, but why? Why not just save the modified set,
new items and old, and not mess about with complicated transactions?

After all, you say:
PS: Yes - I need blazing fast performance - simply pickling/unpickling
won't do. Memory constraints are important but definitely secondary.
Disk space constraints are not very important.

Since disk space is not important, I think that you shouldn't care that
you're duplicating the original items. (Although maybe I'm missing
something.)

Perhaps what you should be thinking about is writing a custom pickle-like
module optimized for reading/writing sets quickly.
 
Prateek

This may be a silly question, but why? Why not just save the modified set,
new items and old, and not mess about with complicated transactions?

I tried just that. Basically I ignored all the difficulties of
difference calculation and just overwrote the entire tablespace with
the new set. At about 3000 entries per file (and 3 files), along with
all the indexing etc., just the extra I/O cost me 28% in performance.
I got 3000 entries committed in 53s with difference calculation but in
68s when writing the whole thing.
After all, you say:


Since disk space is not important, I think that you shouldn't care that
you're duplicating the original items. (Although maybe I'm missing
something.)

Perhaps what you should be thinking about is writing a custom pickle-like
module optimized for reading/writing sets quickly.

I already did this. I'm not using the pickle module at all - since I'm
guaranteed that my sets contain a variable number of fixed length
strings, I write a header at the start of each tablespace (using
struct.pack) marking the number of rows and then simply save each
string one after the other without delimiters. I can do this simply by
issuing "".join(list(set_in_question)) and then saving the string
after the header. There are a few more things that I handle (such as
automatic tablespace overflow).
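
In outline, the read/write path is something like this (simplified
sketch - the real header has more fields, and the real code also
handles the index and tablespace overflow):

import struct

RECORD_LEN = 36                  # uuid4 as a 36-char string
HEADER_FMT = "<I"                # simplified: just the row count

def save_table(f, s):
    f.write(struct.pack(HEADER_FMT, len(s)))
    f.write("".join(s))          # fixed-length rows, no delimiters

def load_table(f):
    count, = struct.unpack(HEADER_FMT, f.read(struct.calcsize(HEADER_FMT)))
    data = f.read(count * RECORD_LEN)
    return set(data[i * RECORD_LEN:(i + 1) * RECORD_LEN]
               for i in range(count))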

Prateek
 
Prateek

(3) is easy -- the difference originalset-finalset is the set of things
you have to delete.

Now why didn't I think of that?
Your problem is really a thinly disguised version of one of the most
ancient and venerable ones in data processing -- the "master file /
transaction file" problem. With sets (which are built as hash tables)
you have very powerful weapons for this fray (it used to be, many years
ago, that sorting and "in-step processing of sorted files" was the only
feasible approach to such problems), but you still have to wield them
properly.

In your shoes, I would write a class whose instances hold three sets:
-- the "master set" is what you originally read from the file
-- the "added set" is the set of things you've added since then
-- the "deleted set" is the set of things you've deleted since them

You can implement a set-like interface; for example, your .add method
will need to remove the item from the deleted set (if it was there),
then add it to the added set (if it wasn't already in the master set);
and so on, for each and every method you actually desire (including no
doubt __contains__ for membership testing). E.g., with suitably obvious
names for the three sets, we could have...:

class cleverset(object):
    ...
    def __contains__(self, item):
        if item in self.deleted: return False
        elif item in self.added: return True
        else: return item in self.master

When the time comes to save back to disk, you'll perform deletions and
additions based on self.deleted and self.added, of course.

Ok. This class sounds like a good idea. I'll try it.
I'm not addressing the issue of how to best represent such sets on disk
because it's not obvious to me what operations you need to perform on
the on-disk representation -- for example, deletions can be very costly
unless you can afford to simply set a "logically deleted" flag on
deleted items; but, depending on how often you delete stuff, that will
eventually degrade performance due to fragmentation, and maintaining the
indexing (if you need one) can be a bear.

I actually don't have very many transactions I need to perform on
disk. Once I've loaded these sets, I normally use the contains
operation and union/intersection/difference operations.
I can definitely use a tombstone on the record to mark it as logically
deleted. The problem right now is that I don't need indexing within
sets (I either load the whole set or don't bother at all) - I'm only
indexing the entire file (one entry in the index per tablespace). So
even if I use tombstones, I'll still have to go through the whole
tablespace to find the tombstoned entries. Alternatively, I could
write the deleted data at the end of the tablespace and mark it with
the tombstone (since I don't care about space so much), and when I
load the data, I can eliminate the tombstoned entries. I'll give that
a try and report my performance results on this thread - I absolutely
love that I can make a change this important in one or two days using
Python!
Normally, I would use a good relational database (which has no doubt
already solved on my behalf all of these issues) for purpose of
persistent storage of such data -- usually sqlite for reasonably small
amounts of data, PostgreSQL if I need enormous scalability and other
"enterprise" features, but I hear other people are happy with many other
engines for relational DBs, such as Oracle (used to be superb if you
could afford full-time specialized db admins), IBM's DB2, SAP's database
(now opensourced), Firebird, even MySQL (I've never, but NEVER, had any
good experience with the latter, but I guess maybe it's only me, as
otherwise it could hardly be so popular). The main point here is that a
relational DB's tables _should_ be already very well optimized for
storing "sets", since those table ARE exactly nothing but sets of tuples
(and a 1-item tuple is a tuple for all that:).

Thanks Alex, but we're actually implementing a (non-relational)
database engine. This particular file is one of many files (which are
all kept in sync by the engine) I need to persist (some files use
other methods - including pickling - for persistence). I've already
solved most of our other problems and according to hotshot, this is
one of the next hurdles (this commit is responsible for 8% of the
total CPU time @ 3000 items and 3 files).
If you're interested, you can check it out at http://www.brainwavelive.com
or email me.

Prateek
 
Aahz

Thanks Alex, but we're actually implementing a (non-relational)
database engine.

Why are you reinventing the wheel? Why not just implement your
functionality on top of an existing database as your backing store?
 
Dennis Lee Bieber

I already did this. I'm not using the pickle module at all - Since I'm
guaranteed that my sets contain a variable number of fixed length
strings, I write a header at the start of each tablespace (using
struct.pack) marking the number of rows and then simply save each
string one after the other without delimiters. I can do this simply by
issuing "".join(list(set_in_question)) and then saving the string
after the header. There are a few more things that I handle (such as
automatic tablespace overflow)

Seems that you could add a bitmap to that header, flagging "deleted"
records... 0 => data, 1 => deleted... New data would just be appended
to the bottom, deleted data just has the bitmap toggled, and every so
often you run a compress operation. "Number of records" would have to
indicate the total so you'd know where to start appends. This would
require you to maintain the record order, of course, so the proper bit
gets toggled when you determine a record is to be deleted.

pseudo-XML:
<header>
    <next_available_record>cnt</next_available_record>
    <bmsize>(cnt + overflow_space_allotment) / 8</bmsize>
    <bitmap>"\x00" * bmsize</bitmap>  <!-- initial: no deleted, spares -->
</header>
<data>
    <record>fixed length string</record>
    ...
</data>

File updates would only be to the header (where you were already keeping
a counter) and to the end of the file. On input, since you are doing all
processing in core, you could tag the data with the delete flag from the
bitmap. I could see that tag holding UNCHANGED, UPDATE, DELETED, NEW
states. On update, the bitmap gets 0 for UNCHANGED/UPDATE/NEW, 1 for
DELETE; NEW records get appended to file, UPDATE records get
overwritten, others untouched; last header count and bitmap get written.

When the record cnt approaches the bitmap size, run a
defrag/compress, which computes, if needed, a new bitmap size, moves
nondeleted data upwards, clears the bitmap, and writes the new record
count value.

As mentioned, the only drawback is that you need to maintain record
order during the in-core processing so you know where it was in the file
(maybe the tag includes the computed seek() offset, use 0 for new
records -- that would allow you to strip deleted records from the
internal set).
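
In rough Python (untested; the sizes and field names are just to
illustrate the layout):

import struct

def write_header(f, record_count, overflow=1024):
    # one bit per record slot, rounded up to whole bytes
    bmsize = (record_count + overflow + 7) // 8
    f.write(struct.pack("<II", record_count, bmsize))
    f.write("\x00" * bmsize)              # nothing deleted yet

def read_header(f):
    record_count, bmsize = struct.unpack("<II", f.read(8))
    bitmap = f.read(bmsize)
    deleted = set()
    for i in range(record_count):
        if ord(bitmap[i // 8]) & (1 << (i % 8)):
            deleted.add(i)                # record i is tombstoned
    return record_count, deleted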
--
Wulfraed Dennis Lee Bieber KD6MOG
HTTP://wlfraed.home.netcom.com/
HTTP://www.bestiaria.com/
 
Prateek

Oh dear god, I implemented this and it killed overall performance by
about 50% - 100%. The same script (entering 3000 items) takes between
88 and 109s (it was running in 55s earlier).

Here is the new Set implementation:
class SeaSet(set):
    __slots__ = ['master', 'added', 'deleted']

    def __init__(self, s=None):
        if s is None: s = []
        self.master = set(s)
        self.added = set()
        self.deleted = set()

    def add(self, l):
        if l not in self.master:
            self.added.add(l)
        try:
            self.deleted.remove(l)
        except KeyError:
            pass

    def remove(self, l):
        try:
            self.master.remove(l)
            self.deleted.add(l)
        except KeyError:
            try:
                self.added.remove(l)
            except KeyError:
                pass

    def __contains__(self, l):
        if l in self.deleted:
            return False
        elif l in self.added:
            return True
        else:
            return l in self.master

    def __len__(self):
        return len(self.master) + len(self.added)

    def __iter__(self):
        for x in self.master:
            yield x
        for x in self.added:
            yield x

    def __or__(self, s):
        return self.union(s)

    def __and__(self, s):
        return self.intersection(s)

    def __sub__(self, s):
        return self.difference(s)

    def union(self, s):
        """docstring for union"""
        if isinstance(s, (set, frozenset)):
            return s | self.master | self.added
        elif isinstance(s, SeaSet):
            return self.master | self.added | s.master | s.added
        else:
            raise TypeError

    def intersection(self, s):
        """docstring for intersection"""
        if isinstance(s, (set, frozenset)):
            return s & self.master & self.added
        elif isinstance(s, SeaSet):
            return self.master & self.added & s.master & s.added
        else:
            raise TypeError

    def difference(self, s):
        """docstring for difference"""
        if isinstance(s, (set, frozenset)):
            self.deleted |= (self.master - s)
            self.master -= s
            self.added -= s
        elif isinstance(s, SeaSet):
            t = (s.master | s.deleted)
            self.deleted |= self.master - t
            self.master -= t
            self.added -= t
        else:
            raise TypeError


The surprising thing is that commits *ARE* running about 50% faster
(according to the time column in the hotshot profiler). But now the
longest-running operations seem to be the I/O operations, which are
taking 10 times longer (even if they're only reading or writing a few
bytes). Could this have something to do with the set implementation
being in Python as opposed to C?

For instance, this method:
def __readTableHeader(self, f):
    hdr = f.read(sz__TABLE_HEADER_FORMAT__)
    if len(hdr) < sz__TABLE_HEADER_FORMAT__:
        raise EOFError
    t = THF_U(hdr)
    #t = unpack(__TABLE_HEADER_FORMAT__, hdr)
    return t

is now taking > 13s when it was taking less than 0.8s before! (same
number of calls, nothing changed except the set implementation)

sz__TABLE_HEADER_FORMAT__ is a constant = struct.calcsize("<II37s") = 45.
THF_U is a reference to the struct.unpack function for the relevant
Struct object.
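
For reference, those are set up roughly like so:

import struct

__TABLE_HEADER_FORMAT__ = "<II37s"
sz__TABLE_HEADER_FORMAT__ = struct.calcsize(__TABLE_HEADER_FORMAT__)  # == 45
THF_U = struct.Struct(__TABLE_HEADER_FORMAT__).unpack                 # pre-bound unpack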

What the heck happened?!

Prateek
 
Gabriel Genellina

Oh dear god, I implemented this and it overall killed performance by
about 50% - 100%. The same script (entering 3000 items) takes between
88 - 109s (it was running in 55s earlier).

Here is the new Set implementation:
class SeaSet(set): [...]

The surprising thing is that commits *ARE* running about 50% faster
(according to the time column in the hotshot profiler). But, now, the
longest running operations seem to be the I/O operations which are
taking 10 times longer! (even if they're only reading or writing a few
bytes. Could this have something to do with the set implementation
being in Python as opposed to C?

Hard to tell - you have posted only your SeaSet implementation, and no I/O
is involved in that code.
For instance, this method:
def __readTableHeader(self, f):
    hdr = f.read(sz__TABLE_HEADER_FORMAT__)
    if len(hdr) < sz__TABLE_HEADER_FORMAT__:
        raise EOFError
    t = THF_U(hdr)
    #t = unpack(__TABLE_HEADER_FORMAT__, hdr)
    return t

is now taking > 13s when it was taking less than 0.8s before! (same
number of calls, nothing changed except the set implementation)

I don't see where your SeaSet class is used.
 
Prateek

I don't see where your SeaSet class is used.

Actually, that is the point. According to the hotshot profile, the
problem code doesn't use the SeaSet implementation. Yet that same code
was running much faster earlier. I tried multiple times (2-3 runs).
From what I can fathom, nothing else changed - just the set
implementation.

It seems obvious that the read() call in the __readTableHeader method
is blocking for longer periods, although it isn't exactly obvious why
that might be.

I was hoping someone with an idea of Python-C interaction could shed
some light on this. I'm on a different computer right now, I'll log
back in later and post more code if that helps.

Again, thanks to anyone who can help.

Prateek
 
