S
Stefan Behnel
Hallvard said:It's purity that they don't return the modified tree/dict/whatever.
They can still return the deleted element and remain pure.
Fair enough.
Stefan
Hallvard said:It's purity that they don't return the modified tree/dict/whatever.
They can still return the deleted element and remain pure.
Stefan Behnel said:I doubt that there are many AVL implementations that do that. Plus, if
deletion doesn't delete, I'd happily consider that a bug.
Hash tables (dicts) are useful for many of the same things that trees are
useful for, but they are different data structures with different
strengths and weaknesses, and different uses. To argue that we don't need
trees because we have dicts makes as much sense as arguing that we don't
need dicts because we have lists.
(In particular, WRT the bisect module, although insertion and deletion
are O(N), the constant factor for doing a simple memory move at C speed
swamps bytecode until N gets very large -- and we already have
collections.deque() for some other common use cases.)
Wish I had asked this before this year's GSoC started.Aahz said:The problem is that trees are like standards: there are so many to choose
from. Each has its own tradeoffs, and because Python dicts and lists can
substitute for many of the traditionals uses of trees, the tradeoffs are
less clear. I think nobody would object to adding trees to the standard
library, but it will certainly require a clear PEP, preferably with a
pointer to an existing PyPI library that has acquired real-world use.
João Valverde said:What's lacking is an associative array that preserves ordering, doesn't
require a hash function and has fast insertions and deletions in
O(log(n)).
[...]
I'm genuinely surprised to know
there are no data structures that efficiently support such a common need
in Python.
What's lacking is an associative array that preserves ordering, doesn't
require a hash function and has fast insertions and deletions in
O(log(n)). The particular algorithm to achieve this is a secondary
issue. It's a BST for sure, AVL vs RBT vs something else. It's my fault
for having opened the topic with simply "trees" instead, it would have
avoided this vagueness problem, but I'm genuinely surprised to know
there are no data structures that efficiently support such a common need
in Python. And I really enjoy the using this language.
It's purity that they don't return the modified tree/dict/whatever.
They can still return the deleted element and remain pure.
Why no trees in the standard library, if not as a built in?
What's lacking is an associative array that preserves ordering, doesn't
require a hash function and has fast insertions and deletions in
O(log(n)).
in Python. And I really enjoy the using this language.
João Valverde said:What's lacking is an associative array that preserves ordering, doesn't
require a hash function and has fast insertions and deletions in
O(log(n)).
> I'm genuinely surprised to know
there are no data structures that efficiently support such a common need
in Python.
greg said:Careful here -- you can't get away from the need for
hashability just by using a tree. Even if you don't
need to actually hash the values, it's still important
that the criterion for ordering between objects doesn't
change while they're in the tree, otherwise they'll
be in the wrong place and won't be found by subsequent
lookups.
Obviously my experience differs, but those were my expectations.Is it really all that common? If it truly were common,
there probably *would* be something for it in the
stdlib by now.
To answer the question of what I need the BSTs for, without getting intoWhat sort of things are you doing that you want such
a structure for? Maybe we can suggest a way of using
the existing data structures to achieve the same
goal.
Whynotrees in the standard library, if not as a built in? I searched
the archive but couldn't find a relevant discussion. Seems like a
glaring omission considering the batteries included philosophy,
particularly balanced binary search trees.Nointerest,nogood
implementations, something other reason? Seems like a good fit for the
collections module. Can anyone shed some light?
Thanks.
Crap, this sentence is totally confusing. I meant kept the merge code asJoão Valverde said:I'm awareTechnically it's necessary to define a total ordering on
the set of keys.
Obviously my experience differs, but those were my expectations.
To answer the question of what I need the BSTs for, without getting
into too many boring details it is to merge and sort IP blocklists,
that is, large datasets of ranges in the form of (IP address, IP
address, string). Originally I was also serializing them in a binary
format (but no more after a redesign). I kept the "merge and sort"
part as a helper script, but that is considerably simpler to implement.
Aahz said:Why AVL/RBT instead of B*? It's not that simple.... Another problem is
that unless the tree is coded in C, the constant factors are going to
swamp algorithmic complexity for many use cases -- that in turn makes it
more difficult to deploy a PyPI library for real-world testing.
Anyway, I'm *not* trying to discourage you, just explain some of the
roadblocks to acceptance that likely are why it hasn't already happened.
If you're serious about pushing this through, you have two options:
* Write the code and corresponding PEP yourself (which leads to the
second option, anyway)
* Lobby on the python-ideas mailing list
João Valverde said:I wouldn't consider anything other than C for such a module on
efficiency alone, unless it was a prototype of course. But I have little
knowledge about the Python C API.
There's also another issue raise by Paul Rubin I wasn't even aware of,João Valverde said:Currently I don't have a strong need for this. I just believe it would
be a benefit to a language I like a lot. Lobbying isn't my thing. I'd
rather write code, but neither am I the most qualified person for the
job. It would certainly be interesting and fun and challenging in a
good way and a great way to learn some new stuff. But I would
definitely need mentoring or asking some silly questions on the
mailing list. Maybe I'll seriously consider it some other time.
João Valverde said:Currently I don't have a strong need for this.
João Valverde said:To answer the question of what I need the BSTs for, without getting
into too many boring details it is to merge and sort IP blocklists,
that is, large datasets of ranges in the form of (IP address, IP
address, string). Originally I was also serializing them in a binary
format (but no more after a redesign). I kept the "merge and sort"
part as a helper script, but that is considerably simpler to
implement.
...
As an anecdotal data point (honestly not trying to raise the "Python
is slow" strawman), I implemented the same algorithm in C and
Python, using pyavl. Round numbers were 4 mins vs 4 seconds, against
Python (plus pyavl). Even considering I'm a worse Python programmer
than C programmer, it's a lot. I know many will probably think I
tried to do "C in Python" but that's not the case, at least I don' t
think so. Anyway like I said, not really relevant to this discussion.
I was using a bytes subclass. I'm not free to choose CIDR notation,Miles said:What format were you using to represent the IP addresses? (Is it a
Python class?) And why wouldn't you use a network address/subnet mask
pair to represent block ranges? (It seems like being able to
represent ranges that don't fit into a subnet's 2^n block wouldn't be
that common of an occurrence, and that it might be more useful to make
those ranges easier to manipulate.)
I would flip your statement and say one of the advantages of using treesOne of the major disadvantages of using a tree container is that
usually multiple comparisons must be done for every tree operation.
When that comparison involves a call into Python bytecode (for custom
cmp/lt methods) the cost can be substantial. Compare that to Python's
hash-based containers, which only need to call comparison methods in
the event of hash collisions (and that's hash collisions, not hash
table bucket collisions, since the containers cache each object's hash
value). I would imagine that tree-based containers would only be
worth using with objects with comparison methods implemented in C.
Thanks, and I'm not trying to start a religion either.Not that I'm trying to be an apologist, or reject your arguments; I
can definitely see the use case for a well-implemented, fast
tree-based container for Python. And so much the better if, when you
need one, there was a clear consensus about what package to use (like
PIL for image manipulation--it won't meet every need, and there are
others out there, but it's usually the first recommended), rather than
having to search out and evaluate a dozen different ones.
Want to reply to this thread or ask your own question?
You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.